
07-08-11 - Who ordered Event Count -

Say you have something like a lock-free queue, and you wish to go into a proper OS-level thread sleep when it's empty (e.g. not just busy-wait).

Obviously you try something like :


popper :

item = queue.pop();
if ( item == NULL )
{
  Wait(handle);
}


pusher :

was_empty = queue.push();
if ( was_empty )
{
  Signal(handle);
}

where we have extended queue.push to atomically check if the queue was empty and return a bool (this is easy to do for most lock-free queue implementations).
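
As an aside, here's a minimal sketch (mine, in C++11 terms, not from the original post) of one easy way to get that was_empty bit : wrap whatever lock-free queue you're using with an approximate atomic item count, and have push report whether the count was zero just before the push :

// sketch : wrap any lock-free queue with an approximate atomic item count ;
// push reports whether the count was zero just before this push.
// The count is only a hint for the wakeup protocol, so it being momentarily
// out of sync with the actual queue contents is fine.
#include <atomic>

template <typename Queue, typename T>
struct counted_queue
{
    Queue            q;          // some lock-free queue with push/try_pop
    std::atomic<int> count{0};

    // returns true if the queue was (approximately) empty before this push
    bool push(T item)
    {
        q.push(item);
        return count.fetch_add(1, std::memory_order_acq_rel) == 0;
    }

    // returns true and fills *out if an item was available
    bool try_pop(T * out)
    {
        if ( ! q.try_pop(out) ) return false;
        count.fetch_sub(1, std::memory_order_acq_rel);
        return true;
    }
};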

This doesn't work. The problem is that between the pop() and the Wait(), somebody could push something on the queue. That means the queue is non-empty at that point, but you go to sleep anyway, and nobody will ever wake you up.

Now, one obvious solution is to put a Mutex around the whole operation and use a "Condition Variable" (which is just a way of associating a sleep/wakeup with a mutex). That way the mutex prevents the state of the queue from changing while you decide to go to sleep. But we don't want to do that today, because the whole point of our lock-free queue is that it's lock-free, and a mutex would spoil that. (Actually I suppose the classic solution to this problem would be to use a counting semaphore, inc it on each push and dec it on each pop, but that has even more overhead if it's a kernel semaphore.) Basically we want these specific fast paths and slow paths :


fast path :
  when popper can get an item without sleeping
  when pusher adds an item and there are no waiters

slow path :
  when popper has no work to do and goes to sleep
  when pusher needs to wake a popper

We're okay with the sleep/wakeup part being slow, because that involves thread transitions anyway, so it's always slow and complicated. But in the other case, where one core is just sending a message to another running core, it should be mutex-free.
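
For contrast, here's roughly what that mutex + condition variable version looks like (a sketch of mine in C++11 std::mutex / std::condition_variable terms, not from the post). It's perfectly correct, it just takes the lock on every push and pop, which is exactly the overhead we don't want on the fast paths :

// sketch : the classic blocking version ; correct, but it takes the mutex
// on every operation
#include <mutex>
#include <condition_variable>
#include <deque>

template <typename T>
struct blocking_queue
{
    std::mutex              m;
    std::condition_variable cv;
    std::deque<T>           q;

    void push(T item)
    {
        {
            std::lock_guard<std::mutex> lock(m);
            q.push_back(std::move(item));
        }
        cv.notify_one();
    }

    T pop()     // blocks until an item is available
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&]{ return ! q.empty(); });
        T item = std::move(q.front());
        q.pop_front();
        return item;
    }
};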

So, the obvious answer is to do a kind of double-check, something like :


popper :

item = queue.pop(); 
if ( item ) return item;  // *1
atomic_waiters ++;   // *2
item = queue.pop();  // *3
if ( item ) { atomic_waiters--; return item; }
if ( item == NULL )
{
  Wait(handle);
}
atomic_waiters--;


pusher :

queue.push();
if ( atomic_waiters > 0 )
{
  Signal(handle);
}

This gets the basics right. First, the popper has a fast path at *1 - if it gets an item, it's done. It then registers the fact that it's going to wait in a shared variable at *2. Then it double-checks at *3. The double check is not an optimization to avoid sleeping your thread - it's crucial for correctness. The issue is that the popper could get swapped out between *1 and *2, and the pusher could then run completely ; it would see waiters == 0 and not do a signal. The double-check at *3 catches this.
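
In C++ the same double-checked wait looks something like this (a sketch of mine, not the post's code ; single popper and single pusher only, with a C++20 counting semaphore standing in for the auto-reset Event handle - redundant signals just pile up as permits and become spurious wakeups, which the caller's retry loop absorbs) :

// sketch : the double-checked wait in C++ (single popper / single pusher).
// Memory ordering is left at the default (seq_cst) to keep the sketch simple.
#include <atomic>
#include <semaphore>

std::atomic<int>          atomic_waiters{0};
std::counting_semaphore<> wake{0};

// popper : returns true and fills *out if it got an item ;
// false means "woke with nothing, call me again"
template <typename Queue, typename T>
bool pop_or_wait(Queue & queue, T * out)
{
    if ( queue.try_pop(out) ) return true;   // *1 fast path
    atomic_waiters.fetch_add(1);             // *2 advertise that I'm waiting
    if ( queue.try_pop(out) )                // *3 the crucial double check
    {
        atomic_waiters.fetch_sub(1);
        return true;
    }
    wake.acquire();                          // really go to sleep
    atomic_waiters.fetch_sub(1);
    return queue.try_pop(out);
}

// pusher :
template <typename Queue, typename T>
void push_and_wake(Queue & queue, T item)
{
    queue.push(item);
    if ( atomic_waiters.load() > 0 )
        wake.release();
}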

There's a performance bug with this code as written - if the queue goes empty and you then do lots of pushes while a popper is registered as a waiter, every one of those pushes sends the signal. You might be tempted to fix that by moving the "atomic_waiters--" line to the pusher side (before the signal), but that creates a race. You could fix that too, but then you spot a bigger problem :

This code doesn't work at all if you have multiple pushers or poppers. The problem is "lost wakeups". Basically, if there are multiple poppers going into the wait at the same time, the pusher may think it has done all the wakeups it needs to, but it hasn't, and a popper goes into a wait forever. For example, with an auto-reset event : two poppers both fail their double-check and are about to Wait, two pushers each push an item and Signal, but the event only remembers one signal, so only one popper ever wakes up - the other sleeps forever with an item sitting in the queue.

To fix this you need a real proper "event count". What a proper event_count does is register the waiter at a certain point in time. The usage is like this :


popper :

item = queue.pop(); 
if ( item ) return item;
count = event_count.get_event_count();
item = queue.pop();
if ( item ) { event_count.cancel_wait(count); return item; }
if ( item == NULL )
{
  event_count.wait_on_count(count);
}


pusher :

queue.push();
event_count.signal();

Now, as before, get_event_count() registers in an atomic, shared variable that I want to wait on something (most people call this function prepare_wait()), but it also records the current "event count" to identify the wait (this is just the number of pushes, or the number of signals if you like). Then wait_on_count() only actually does the wait if the event_count is still on the same count as when I called get_event_count() - if the internal counter has advanced, the wait is not done. signal() is a fast nop if there are no waiters ; when there are, it increments the internal count and does the wake.

This eliminates the lost wakeup problem, because if the "event_count" has advanced (and signaled some other popper, and won't signal again) then you will simply not go into the Wait.

Basically it's exactly like a Windows Event in that you can wait and signal, but with the added feature that you can record a place in time on the event, and then only do the Wait if that time has not advanced between your recording and the call to Wait.

It turns out that event_count and condition variables are closely related; in particular, one can very easily be implemented in terms of the other. (I should note that the exact semantics of pthread cond_var are *not* easy to implement in this way, but a "condition variable" in the broader sense need not comply with their exact specs).
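
To make that concrete, here's a minimal sketch (mine - not the implementation post promised below) of an eventcount built from a mutex and a condition variable ; the atomic waiter count plus the count check inside the mutex are what make signal() a cheap nop on the fast path :

// sketch of an eventcount built from a mutex + condition variable ;
// method names match the usage code above. Memory ordering is left at
// the default (seq_cst) to keep the sketch simple.
#include <atomic>
#include <mutex>
#include <condition_variable>

class event_count
{
public:
    // register myself as a (potential) waiter and grab the current count
    unsigned get_event_count()          // aka prepare_wait()
    {
        m_waiters.fetch_add(1);
        return m_count.load();
    }

    // I found work after all ; just unregister
    void cancel_wait(unsigned /*count*/)
    {
        m_waiters.fetch_sub(1);
    }

    // sleep only if no signal has happened since get_event_count()
    void wait_on_count(unsigned count)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        while ( m_count.load() == count )
            m_cond.wait(lock);
        lock.unlock();
        m_waiters.fetch_sub(1);
    }

    // fast nop if nobody is waiting ; otherwise advance the count and wake
    void signal()
    {
        if ( m_waiters.load() == 0 )
            return;
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_count.fetch_add(1);
        }
        m_cond.notify_all();
    }

private:
    std::atomic<unsigned>   m_count{0};
    std::atomic<int>        m_waiters{0};
    std::mutex              m_mutex;
    std::condition_variable m_cond;
};

Only the slow paths (an actual wait, or a signal when someone is waiting) touch the mutex ; the pusher's fast path is a single atomic load of the waiter count, which matches the fast path / slow path split we wanted above.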

Maybe in the future I'll get into how to implement event_count and condition_var.

BTW eventcount is the elegant solution to the problem of Waiting on Thread Events discussed previously.


ADDENDUM : if you like, eventcount is a way of doing Windows' PulseEvent correctly.

"Event" on Win32 is basically a "Gate" ; it's either open or closed. SetEvent makes it open. When you Wait() on the Event, if the gate is open you walk through, if it's closed you stop. The normal way to use it is with an auto-reset event, which means when you walk through the gate you close it behind yourself (atomically).

The idea of the Win32 PulseEvent API is to briefly open the gate and let through someone who was previously waiting on the gate, and then close it again. Unfortunately, PulseEvent is horribly broken by design and almost always causes a race, which leads most people to recommend against ever using PulseEvent (or manual reset events). (I'm sure it is possible to write correct code using PulseEvent, for example the race may be benign and just be a performance bug, but it is wise to follow this advice and not use it).

For example the standard Queue code using PulseEvent :

popper :
  node = queue.pop();
  if ( ! node ) Wait(event);

pusher :
  queue.push(node);
  PulseEvent(event);

is a totally broken race (if the popper is between getting a null node and entering the wait, it doesn't see the event), and most PulseEvent code is similarly broken.

eventcount's Signal is just like PulseEvent - it only signals threads that were already waiting ; it doesn't leave the eventcount in a signalled state. But it doesn't suffer from PulseEvent's race, because it has a consistent way of defining the moment in "time" when the event fires.

2 comments:

Unknown said...

Hi Charles. What's the earliest mention of the term "event count" that you know of (for this technique)? Do you know who came up with the name?

cbloom said...

Good question. I don't know the original source. I learned it from Thomasson, Vjukov, and Joe Seigh.

Lots of posts on eventcount here :

http://cbloomrants.blogspot.com/2011/08/08-09-11-threading-links.html

but I don't see any that do early source attribution.
