Þ   briarpig  » thorn  » demos  » atomic


demos are explained here; a menu at top column right indexes actual topic demos. Here we demo atomic integers.

problem

     In threadsafe apps Wil sometimes wants to change integers concurrently accessed by more than one thread, but without lock overhead using mechanisms like a pthread mutex.

     Basically, using a pthread mutex is heavyweight compared to just incrementing or decrementing an integer atomically. Since code for a typical pthread implementation needs to atomically change integer values, you can look in your local pthread code for inline assembler for your processor.

     The code shown below works for Intel processors. If you have another processor, you need to get someone else's inline assembler to handle atomic increment and decrement. (Or if you aren't using multiple threads and/or shared memory, you can also undefine MU_INTEL to get the non-atomic inline code.)

platform

     Here's how Wil indicates the platform is Intel:

#define MU_INTEL 1 /* Intel assembler okay to use « */

     Pretty simple, eh? Feel free to do otherwise.

#ifdef MU_INTEL #define MU_ATOMIC_ASM /* use Intel inline asm « */ #endif

     Actual use of inline assembler for atomic increment and decrement is controlled by the switch above.

assembler

     This section shows the code for atomic increment and decrement using inline Intel assembler. Almost no description of the assembler is given. Wil doesn't really grok Intel assembler. (Years of debugging C++ in 68K assembler are not the right background for reading Intel assembler.) But test code is shown, which demonstrates the effect matches the description.

     Empirically, you can treat the effect of the assembler as whatever pattern of result is shown by the test code. Wil added alternative C versions of the inline methods when MU_ATOMIC_ASM is not defined. The test code listed prints exactly the same result whether or not MU_ATOMIC_ASM is defined, so the C version matches the assembler — except it's not thread safe.

increment «

#ifdef MU_ATOMIC_ASM static inline i32 yinc_vi32(volatile i32* p) { « int v = 1; __asm__ __volatile__ ( "lock; xaddl %0, %1; " : "+r" (v), /* arg %0 */ "=m" (*p) /* arg %1 */ : "m" (*p) ); return v; }

#else /*MU_ATOMIC_ASM*/ static inline i32 yinc_vi32(volatile i32* p) { « return (*p)++; // return value BEFORE increment } #endif /*MU_ATOMIC_ASM*/

     The meaning of yinc_vi32() is what appears in the C version shown second above, which returns (*p)++.

     Note the previous value of the atomic integer is returned, instead of the current value after increasing by one. Why?

     Because for shared circular array applications (see top column right), the last value of the counter is the index allocated, and the value afterward is the next index to allocate. That's the only application of atomic integers where the value matters. (In refcounting, only zero and nonzero matter.)

     Here is sample code showing the effect:

volatile i32 ai = 0; // atomic integer « yout << "# ascending:" << yendl; unsigned j = 0; for ( ; j < 5; ++j) { int ret = yinc_vi32(&ai); yout.of("(i=%d r=%d) ", ai, ret); }

     The loop above writes the following on stdout. (The same thing appears whether or not MU_ATOMIC_ASM is defined.)

# ascending: (i=1 r=0) (i=2 r=1) (i=3 r=2) (i=4 r=3) (i=5 r=4)

     Note the return value each time is one less than the current value. (If other threads were also changing ai, we could not safely read its current value.)

decrement «

     Atomic decrement is only needed by refcounting, which only wants to know when refs hits zero. So ydec_vi32() only returns zero or nonzero — zero means the value hit zero:

#ifdef MU_ATOMIC_ASM static inline int ydec_vi32(volatile i32* p) { « int32_t ret; __asm__ __volatile__( "lock ; decl (%%edx)\n\t" "setne %%al\n\t" "movzbl %%al,%%eax\n\t" :"=a" (ret) :"d" (p) :"memory" ); return (int) ret; // return (--*p != 0); }

     The following C version returns the same answer when MU_ATOMIC_ASM is undefined, but not atomically:

#else /*MU_ATOMIC_ASM*/ static inline i32 ydec_vi32(volatile i32* p) { « return (--*p != 0); // zero ONLY if *p becomes zero } #endif /*MU_ATOMIC_ASM*/

     Sample code for ydec_vi32() assumes the sample code shown above ran immediately before.

yout << yendl << yendl << "# descending:" << yendl; « for (j = 0 ; j < 6; ++j) { int ret = ydec_vi32(&ai); yout.of("(i=%d r=%d) ", ai, ret); } yout << yendl << ynow; // flush yout to stdout

     Which prints the following on stdout whether or not MU_ATOMIC_ASM is defined:

# descending: (i=4 r=1) (i=3 r=1) (i=2 r=1) (i=1 r=1) (i=0 r=0) (i=-1 r=1)

     Clearly ydec_vi32() returns 0 when the count goes to zero and one otherwise. In refcounting code, a zero return means the object has no more refs of the kind being counted.

     Since a negative value should not occur in refcounting, you might assert the current value is never negative. (The sample code here forces the value to go negative to show the corresponding return value; but this isn't expected in normal refcounting.)

A submenu for demos appears below, letting you go to the page on a topic written as a demo (as the demos page defines it).

menu

     thorn: todo, names, fd, iovec, assert, log, run, hex, crc, buf, in, out, quote, escape, compare, file, deck, cow, arc, blob, tree, slice, rand, time, stat, hash, heap, node, primes, page, book, pile, stack, atomic « Þ, lock, mutex, thread, map, meter, list, iter, ctype

     (mu: toy, peg, imm, tag, box, symbol, token, number, bigint, class, method, reader, writer, eval, env, vm, gc, world, pcode, compiler, asm, lathe, lisp, smalltalk, design, weight, jar, card, harp, debug, profile)

     Some demos are stubs: todo is a demo guide. See toy for mu updates on language pages; names introduces naming schemes.

applications

     Wil has two main applications for atomic increment and decrement: refcounting and circular buffer allocation. Only refcounting appears in demos: see the node demo for þ refcounting apis. But both are explained very briefly below. Let's discuss allocation first.

shared circular arrays

     Suppose you have a fix-sized capacity storage space composed of N elements at indexes zero to N-1, arranged in a circular manner so element zero is allocated after the element at index N-1. Allocation uses clock arithmetic so N is understood to mean index zero. That is, the index is always C%N for atomically incremented counter C.

     If N is a power of two (definitely a good idea in this case) then you need never change counter C in any way besides atomic increment, because you can simply mask off the low bits to get an index. When the counter wraps on integer overflow, this should not disturb the desired index sequence.

     (If N is a not a power of two — and why would you make this choice? — something more interesting might be needed: a thread allocating index N can understand this as zero, then reset counter C to one for the next thread. Any thread allocating an index over N will see the index is invalid, but can wait for the thread who received N to reset the counter. This implies a spin loop to wait for the counter to reset — that's why N should be a power of two instead.)

     Now add one more assumption: suppose your circular buffer array is in shared memory, and more then one process wants to allocate entries. Does atomic integer increment still work? Yes it does. At least on Intel it does. The cache line locking involved operates on physical memory, not logical, so two processes incrementing an atomic int in shared memory will contend for the same cache line even if the shared memory segment is loaded in different parts of address space. (Note I haven't verified this empirically, but it's the word so far.)

     So if the atomic integer describing the next circular array slot to allocate is located in shared memory, along with the array itself, then you can use atomic integer increment to track allocation in shared memory without any other signalling between processes. For example, you can implement a message queue in shared memory that needs no more elaborate mechanism to maintain than atomic integers.

     But this is exotic. Refcounting is much more common.

atomic refcounts

     Refcounting is much more common in applications. When a refcounted object is used by more than one thread, great care must be taken to ensure increments and decrements in ref counts are threadsafe. A mutex works, but this is heavyweight. It's much cheaper to make the refcount an atomic integer, so changes by more than one thread remain consistent without data loss.

     Note an atomic integer increment or decrement is much more expensive than plain vanilla integer addition and subtraction — so you shouldn't do it any more than necessary. But it's far less expensive than a context switch, and atomic integer update will never cause a context switch. It should even be substantially faster than a mutex lock and unlock without any contention. Atomic integers just win compared to using heavier weight synchronization.

     Note when using atomic integers for refcounts, it's tempting to think you can safely see the current value of a counter. But race conditions are impossible to control without real mutex. Any time you try to look at the current count of refs, you will only see a snapshot that might already be up or down from a value you saw.

     Except for one value: zero. When a count of references is zero, you can assume your thread is the only thread that still sees the counted object involved. No one else can be about to increment the count of refs because everyone else has already given up all references — if refcounting is done correctly, an increment should never occur unless the count is nonzero. So zero is the only stable refcount value, and means a counted object is now single threaded, used only by the thread that just decremented the count from one to zero.

     This is why it's safe to destroy a refcounted object when refs go to zero: a destructor can be run safely without thread safety because safety won't be needed given only one accessing thread.

     But a refcount of one? There's no easy way to tell one ref isn't already on the way to becoming two. It would be nice to tell when a refcount is exactly one, such as in copy-on-write applications (see the future cow demo), because that would tell you an object can safely be modified because it's not shared — otherwise you would need to copy before modification.

     For this reason, when Wil wants to implement cow (copy-on-write) optimizations, he sometimes adds another bit of state to a refcounted object: a single-byte flag acting as a one-way-boolean (it doesn't change once it reaches the true state) to say shared. This byte is set when a refcounted object is shared with another thread, or published somewhere another thread can find it. The name of this byte is usually atomic, and when true it implies thread races are present, and atomic increment is required and cow-motivated predicates like has-one-owner are forbidden because they can't be safe.

aternate increment

     An alternate version of atomic increment returning no value is fine for use in refcounting:

#ifdef MU_ATOMIC_ASM static inline void yinc_vi32_alt(volatile i32* p) { « __asm__ __volatile__( "lock ; incl (%%eax)" : :"a"(p) :"memory" ); }

#else /*MU_ATOMIC_ASM*/ static inline void yinc_vi32_alt(volatile i32* p) { « ++*p; // increment but don't return value } #endif /*MU_ATOMIC_ASM*/

     Sample code below resembles code shown earlier:

yout << yendl << yendl << "# alt no return:" << yendl; « for (j = 0, ai = 0 ; j < 5; ++j) { yinc_vi32_alt(&ai); yout.of("(i=%d r=.) ", ai); }

     And the output on stdout resembles what appeared in yinc_vi32() sample output (except here there's no return):

# alt no return: (i=1 r=.) (i=2 r=.) (i=3 r=.) (i=4 r=.) (i=5 r=.)

license

     All this code is available only under the BriarPig mu-babel license described fully on the rights page. You do not have permission to reprint this page in any way. Neither feeds nor repackaging is allowed. You can link this page if you want folks to read it.