Atomicity, Interlocked & Memory Operation Optimizations in C#

14 Feb 2015 C#

This article discusses atomicity, Interlocked, and the memory operation optimizations made by compilers and CPUs, and how they affect multi-threaded programs.

1. Atomicity

I guess most of you might have run across a lock block which consists of a single operation, like this:

lock (someLockObject)
{
  counter++; // counter is an int
}

If counter might be accessed concurrently from another thread, the lock is necessary. This is because even a simple operation like increment is not atomic. It includes several steps: load the value into a register, increment it, save the value back to memory. The lock is there to ensure no other thread accesses counter in the meantime.
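To see this in action, here is a minimal demo; it is a sketch (the class and method names are made up for illustration), and the exact output varies from run to run:

using System;
using System.Threading;

class LostUpdateDemo
{
  static int counter;

  static void Main()
  {
    ThreadStart work = () =>
    {
      for (int i = 0; i < 100000; i++)
        counter++; // not atomic: load, increment, store
    };
    var t1 = new Thread(work);
    var t2 = new Thread(work);
    t1.Start();
    t2.Start();
    t1.Join();
    t2.Join();
    Console.WriteLine(counter); // typically prints less than 200000
  }
}

Because the two threads' load-increment-store sequences interleave, some increments overwrite each other and are lost.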

There is an alternative to make simple operations like increment atomic: using the Interlocked class. This class provides methods to perform various operations atomically. These operations include incrementing, decrementing, adding/subtracting a value, exchanging values, conditionally exchanging values, etc. Hence, the snippet could be rewritten as:

Interlocked.Increment(ref counter);

Using Interlocked is simpler, and, in most cases, yields better performance. However, as the lock is often used to guard a block of several statements, you should replace lock with Interlocked only if every access to counter is done within a lock block which contains just a single replaceable operation. If that complicates your source code, stay with lock.

Check out Interlocked's documentation for details about what operations it supports.
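For a quick taste, here is a sketch of a few commonly used methods (in real code, counter would be a field shared between threads):

int counter = 0;
Interlocked.Increment(ref counter);              // atomically adds 1
Interlocked.Decrement(ref counter);              // atomically subtracts 1
Interlocked.Add(ref counter, 5);                 // atomically adds 5
int old = Interlocked.Exchange(ref counter, 10); // sets counter to 10, returns the old value
// sets counter to 42 only if it currently equals 10, returns the value seen
Interlocked.CompareExchange(ref counter, 42, 10);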

2. Optimizations

According to section I.12.6.6 of the CLI Standard, a read or write of a properly aligned value no larger than the native word size is atomic. That means a read or write of a byte, bool, char, int, float, or reference-type value is always atomic. If so, are the two following snippets equivalent?

// Snippet 1
ready = 1; // atomic write (ready is an int here: Interlocked has no bool overload)
// Snippet 2
Interlocked.Exchange(ref ready, 1);

The answer is no.

It turns out that atomicity alone is often not enough. This is because, for performance reasons, the CLI Standard allows memory operation (read/write) optimizations, provided that they do not break the program when it is single-threaded. However, such optimizations can break a multi-threaded program when a second thread accesses the same memory locations concurrently.

Note: These optimizations might happen at the software level (the C# compiler, the CLR's just-in-time compiler, the CLR's native image generator) and/or the hardware level (the CPU and memory system). In practice, not all permitted optimizations are utilized. Still, you should always stick to the CLI Standard rather than relying on implementation details; otherwise your code might not run correctly on future software and hardware.

Here is an example of optimization and how it breaks a multi-threaded program.

int data;
bool ready;
void OptimizeMe()
{
  // read data for some purpose
  var tmp = data; // read 1

  var thread2 = new Thread(() =>
  {
    data = 1; // write 1
    ready = true; // write 2
  });
  thread2.Start();

  while (!ready); // read 2
  Console.Write(data); // read 3
}

If no optimization is applied, one would say:

  1. The while loop (read 2) never makes the program hang, since thread2 will execute write 2 sooner or later.
  2. The program never prints 0.

However, when taking possible optimizations into account, anything can happen.

  1. The JIT compiler might attempt to hoist the read of ready out of the while loop (read 2). In other words, it decides to read ready just once and reuse the value on subsequent iterations instead of repeatedly reading it. This optimization causes the update from thread2 to be ignored, and the program hangs; see the sketch after this list.
  2. On a multiprocessor machine, each thread might run on a different processor. For performance reasons, each processor has its own private cache. To keep caches consistent between processors, caches have to talk not only to main memory but also to the caches of other processors. That takes time and causes the CPUs to wait. To reduce wait time and make better use of bus bandwidth, modern multiprocessor systems might buffer or queue memory operations and process them later, when more resources are available. As a consequence, operations executed in a specific order by one processor may become visible to other processors in a different order. Back to our example, let's consider 2 scenarios:
    • On thread2's CPU, write 1 is buffered while write 2 completes. The main thread, running on a different CPU, then observes that ready is true while data is 0 (its default value), and it prints 0.
    • Both write 1 and write 2 complete. The main thread then fetches the fresh value of ready but uses the stale value of data from its cache (this cached copy was stored by read 1). Thus, it prints the stale value, 0.
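Here is roughly what the hoisted loop from the first point might become; this is a conceptual sketch, not actual JIT output:

var tmp = ready; // ready is read just once
while (!tmp);    // if tmp was false, this spins forever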

3. Back to Safety

Fortunately, memory operation optimizations can be suppressed when needed.

Memory barriers (also called memory fences) are the lowest-level primitives designed to control memory operation reordering. A full memory barrier guarantees that memory operations issued before the barrier are executed before the operations issued after it. On a multiprocessor system, a full memory barrier causes the processor executing it to complete all buffered writes and discard or refresh all stale cached values before going on to execute the next operation. C# supports 2 types of barrier: the full barrier (via Interlocked.MemoryBarrier) and the half barrier (via the volatile keyword and the Volatile class). However, as memory barriers are very tricky and hard to use correctly (especially the "half" ones), we mere mortals should generally stay away from them. If you would like to dig deeper into memory barriers, there are some detailed articles about them on the Internet.
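Just to illustrate the technique (not to recommend it), here is one way explicit full barriers could repair the broken example above:

// thread2
data = 1;                    // write 1
Interlocked.MemoryBarrier(); // write 1 completes before write 2
ready = true;                // write 2

// main thread
while (!ready)               // read 2
{
  Interlocked.MemoryBarrier(); // forces ready to be re-read each iteration
}
Interlocked.MemoryBarrier();   // read 2 completes before read 3
Console.Write(data);           // read 3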

Note: Traditionally, some common patterns were often implemented with volatile fields, such as lazy initialization and polling a cancellation flag. Since .NET Framework 4.0, you can use Lazy<T>, LazyInitializer, and CancellationToken instead.
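For example, both patterns have straightforward replacements (a minimal sketch):

// Lazy<T>: thread-safe lazy initialization, no volatile field needed
var settings = new Lazy<Dictionary<string, string>>(
  () => new Dictionary<string, string>());
var value = settings.Value; // the factory runs exactly once, on first access

// CancellationToken: polling for cancellation, no volatile flag needed
void Work(CancellationToken token)
{
  while (!token.IsCancellationRequested)
  {
    // do the next chunk of work
  }
}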

Every Interlocked operation, in addition to being atomic, also has the effect of a full memory barrier. The broken example above could be fixed using Interlocked as follows (ready is declared as an int here, because Interlocked has no bool overload):

var thread2 = new Thread(() =>
{
  data = 1; // write 1
  Interlocked.Exchange(ref ready, 1); // write 2
});
thread2.Start();

int b;
do
{
  Interlocked.Exchange(ref b, ready); // read 2: the full barrier keeps ready fresh
} while (b == 0);
Console.Write(data); // read 3

The Interlocked operations guarantee that write 1 completes before write 2, and that read 2 completes before read 3; thus, the program never prints 0. They also guarantee that read 2 obtains the fresh value of ready as soon as thread2 sets it. As a result, the problem is fixed.

Note: if you declare ready as volatile, the program will never print 0. However, it is still possible that thread2 has already set ready to true while the main thread keeps using the stale cached value of ready (in other words, write 2 gets reordered with some executions of read 2). This does not break our specific program, because the main thread repeatedly polls ready, which should become true eventually (hardware buffers and queues get processed sooner or later). However, it might be a problem in a more complicated application. That is why you should be very careful, and are better off avoiding memory barriers and volatile fields altogether unless you are a real expert in the field.
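For reference, the volatile version only changes the field declaration of the original example (volatile applies to fields, not locals):

int data;
volatile bool ready; // every access to ready now carries a half barrier

// inside OptimizeMe, the rest stays the same:
var thread2 = new Thread(() =>
{
  data = 1;     // write 1
  ready = true; // write 2: volatile write (release semantics)
});
thread2.Start();

while (!ready);      // read 2: volatile read (acquire), never hoisted
Console.Write(data); // read 3: guaranteed to print 1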

The lock statement and other higher-level concurrency primitives provided by the .NET Framework use memory barriers and/or Interlocked operations internally when necessary. Therefore, you don't need to worry about memory operation optimizations when using them correctly.
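For instance, the broken example could just as well be fixed with a plain lock on the original bool field; this is a sketch using a dedicated lock object:

var gate = new object();

var thread2 = new Thread(() =>
{
  lock (gate)
  {
    data = 1;     // write 1
    ready = true; // write 2
  }
});
thread2.Start();

bool b;
do
{
  lock (gate) b = ready; // read 2
} while (!b);
lock (gate) Console.Write(data); // read 3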

4. Recap

The article is getting quite long, so to summarize the main points:

  • When it comes to memory access, atomicity alone is not enough because of the memory operation optimizations made by software and hardware.
  • The Interlocked class's operations are both atomic and immune to memory operation optimizations (each acts as a full barrier). You can use them instead of a lock block that contains just a single simple operation. Avoid Interlocked in more complex circumstances.
  • Don't use memory barriers (including volatile fields) at all unless you are a real expert in that field. Instead, use lock and other high-level concurrency primitives.
  • Finally, as a reminder: you don't need to worry about the issues discussed in this article unless your program is multi-threaded and two threads access a shared memory location concurrently.

Happy coding!

 
