9/21/2010

09-21-10 - Waiting on Thread Events

A small note on a common threading pattern.

You have some worker thread which takes messages and does something or other and then puts out results. You implement the message passing in some way or other, either with a mutex or a lock-free FIFO or whatever, it doesn't matter for our purposes here.

So your main thread makes little work messages, sends them to the worker, he does them, sends them back. The tricky bit comes when the main thread wants to wait on a certain message being done. Let's draw the structure :


   main      ---- send --->     worker
  thread    <--- receive ---    thread

First assume our messages have some kind of GUID. If the pointers are never recycled it could be the pointer, but generally that's a bad idea, but a pointer + a counter would be fine. So the main thread says Wait on this GUID being done. The first simple implementation would be a spin loop :


WaitOnMessage_1( GUID )
{

    for(;;)
    {
        pop everything on the receive FIFO
        mark everything I received as done
    
        is the GUID I wanted in the done set?
           if so -> return

        yield processor for a while
    }

}

and that works just fine. But now you want to be a little nicer and not just spin, you want to make your thread actually go to sleep.

Well you can do it in Win32 using Events. (on most other OS'es you would use Semaphores, but it's identical here). What you do is you have the worker thread set an event when it pushes a message, and now we can wait on it. We'll call this ReceiveEvent, and our new WaitOnMessage is :


WaitOnMessage_2( GUID )
{

    for(;;)
    {
        Clear( ReceiveEvent );

        pop everything on the receive FIFO
        mark everything I received as done
    
        is the GUID I wanted in the done set?
           if so -> return

        Wait( ReceiveEvent );
    }

}

that is, we clear the receive event so it's unsignalled, we see if our GUID is done, if not we wait until the worker has done something, then we check again. We aren't sleeping on our specific work item, but we are sleeping on the worker doing something.

In the Worker thread it does :


    pop from "send FIFO"

    do work

    push to "receive FIFO"
    
    Signal( ReceiveEvent );

note the order of the push and the Signal (Signal after push) is important or you could have a deadlock due to a race because of the Clear. While the code as-is works fine, the Clear() is considered dangerous and is actually unnecessary - if you remove it, the worst that will happen is you will run through the Wait() one time that you don't need to, which is not a big deal. Also Clear can be a little messy to implement for Semaphores in a cross-platform way. In Semaphore speak, Wait() is Down() , Signal() is Up(), and Clear() is trylock().

Okay, so this is all fine and we're happy with ourselves, until one day we try to WaitOnMessage() from two different threads. I've been talking as if we have one main thread and one worker thread, but the basic FIFO queues work fine for N "main" threads, and we may well might want to have multiple threads generating and waiting on work. So what's the problem ?

First of all, there's a race in the status check, so we put a Mutex around it :


WaitOnMessage_3A( GUID )
{

    for(;;)
    {
        Mutex 
        {
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        Wait( ReceiveEvent );
    }

}

you need that because one thread could come in and pop the queue but not process it, and then another thread comes in and sees its GUID is not done and goes to sleep incorrectly. We need the pop and the flagging done to act like they are atomic.

Now you might be inclined to make this safe by just putting the Mutex around the whole function. In fact that works, but it means you are holding the mutex while you go into your Wait sleep. Then if another thread comes in to check status, it will be forced to sleep too - even if its GUID is already done. So that's very bad, we don't want to block status checking while we are asleep.

Why is the _3A version still broken ? Consider this case : we have two threads, thread 1 and 2, they make their own work and send it off then each call WaitOnMessage which is :


WaitOnMessage_3( GUID )
{

    for(;;)
    {
        Mutex 
        {
        pop everything on the receive FIFO
        mark everything I received as done
        if ( the GUID I wanted is done )
            break;
        }

        [A]

        Wait( ReceiveEvent );

        [B]
    }

}

Thread 1 runs through to [A] and then swaps out. Thread 2 runs through to [B], which waits on the event, the worker thread then does all the work, sets the event, then thread 2 gets the signal and clears the event. Now thread 1 runs again at [A] and immediately goes into an infinite Wait.

D'oh !

I think this is long enough just describing the problem, so I'll get at some solutions in the next post.

No comments:

old rants