4/23/2009

04-23-09 - Telling Time

.. telling time is a huge disaster on Windows.

To start, see Jon Watte's old summary, which is still good.

Basically you have timeGetTime(), QPC, or TSC.

TSC is fast (~ 100 clocks) and high precision. The problems I know of with TSC :

TSC either tracks CPU clocks, or time passing. On older CPUs it actually increments with each cpu cycle, but on newer CPUs it just tracks time (!). The newer "constant rate" TSC on Intel chips runs at some frequency which so far as I can tell you can't query.

If TSC tracks CPU cycles, it will slow down when the CPU speedsteps. If the CPU goes into a full sleep state, the TSC may stop running entirely. These issues are bad on single core, but they're even worse on multi-proc systems where the cores can independently sleep or speedstep. See for example these Linux notes or tsc.txt.

Unfortunately, if TSC is constant rate and tracking real time, then it no longer tracks cpu cycles, which is actually what you want for measuring performance (you should always report speeds of micro things in # of clocks, not in time).

Furthermore on some multicore systems, the TSC gets out of sync between cores (even without speedsteps or power downs). If you're trying to use it as a global time, that will hose you. On some systems, it is kept in sync by the hardware, and on some you can get a software patch that makes rdtsc do a kernel interrupt kind of thing which forces the TSC's of the cores to sync.

See this email I wrote about this issue :

Apparently AMD is trying to keep it hush hush that they fucked up and had to release a hotfix. I can't find any admission of it on their web site any more ;

this is the direct download of their old utility that forces the cores to TSC sync : TscSync

they now secretly put this in the "Dual Core Optimizer" : Dual Core Optimizer. Oh, really AMD? It's not a bug fix, it's an "optimizer". Okay.

There's also a separate issue with AMD C&Q (Cool & Quiet) if you have multiple cores/processors that decide to clock up & down. I believe the main fix for that now is just that they are forbidden from selecting different clocks. There's an MS hot fix related to that : MS hotfix 896256

I also believe that the newest version of the "AMD Processor Driver" has the same fixes related to C&Q on multi-core systems : AMD Driver. I'm not sure if you need both the AMD "optimizer" and processor driver, or if one is a subset of the other.

Okay, okay, so you decide TSC is too much trouble, you're just going to use QPC, which is what MS tells you to do anyway. You're fine, right?

Nope. First of all, on many systems QPC actually is TSC. Apparently Windows evaluates your system at boot and decides how to implement QPC, and sometimes it picks TSC. If it does that, then QPC is fucked in all the ways that TSC is fucked.

So to fix that you can apply this : MS hotfix 895980 . Basically this just puts /USEPMTIMER in boot.ini which forces QPC to use the PCI clock instead of TSC.

But that's not all. Some old systems had a bug in the PCI clock that would cause it to jump by a big amount once in a while.

Because of that, it's best to advance the clock by taking the delta from previous and clamping that delta to be in valid range. Something like this :


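// GetQPC() is assumed to return the raw QueryPerformanceCounter value as a U64.
// HUGE_NUMBER is whatever threshold you choose for "this delta can't be real".
U64 GetAbsoluteQPC()
{
    static U64 s_lastQPC = GetQPC();
    static U64 s_lastAbsolute = 0;

    U64 curQPC = GetQPC();

    // unsigned subtract : a backwards jump becomes a huge positive delta
    U64 delta = curQPC - s_lastQPC;

    s_lastQPC = curQPC;

    // only advance by deltas that look sane ; bogus jumps are dropped
    if ( delta < HUGE_NUMBER )
        s_lastAbsolute += delta;

    return s_lastAbsolute;
}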

(note that "delta" is unsigned, so when QPC jumps backwards, it will show up as a very large positive delta, which is why we compare vs HUGE_NUMBER ; if you're using QPC just to get frame times in a game, then a reasonable thing is to just get the raw delta from the last frame, and if it's way out of reasonable bounds, just force it to be 1/60 or something).
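For the frame-time case, a minimal sketch might look like the following. The name ClampFrameSeconds, the 0.25 second "out of bounds" limit and the 1/60 fallback are all made up for illustration; the raw ticks and the frequency are assumed to come from QueryPerformanceCounter and QueryPerformanceFrequency.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper : turn two raw counter readings into a sane frame time.
// prevTicks / curTicks are raw counter values, ticksPerSecond is the counter frequency.
double ClampFrameSeconds(uint64_t prevTicks, uint64_t curTicks, uint64_t ticksPerSecond)
{
    assert(ticksPerSecond > 0);

    const double kFallback = 1.0 / 60.0; // assumed nominal frame time
    const double kMaxFrame = 0.25;       // assumed limit for a believable frame

    // Unsigned subtract : a backwards jump shows up as a huge positive delta.
    uint64_t delta = curTicks - prevTicks;
    double seconds = (double)delta / (double)ticksPerSecond;

    // Anything past the limit (backwards or big forward jump) gets the fallback.
    if (seconds > kMaxFrame)
        return kFallback;

    return seconds;
}
```

With a 1024 Hz counter, readings of 1000 and 1016 give 0.015625 seconds, while a reading that goes backwards or jumps ahead by ten seconds gives 1/60.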

Urg.

BTW while I'm at it I think I'll evangelize a "best practice" I have recently adopted. Both QPC and TSC have problems with wrapping. They're in unsigned integers and as your game runs you can hit the end and wrap around. Now, 64 bits is a lot. Even if your TSC frequency is 1000 GigaHz (1 THz), you won't overflow 64 bits for about 213 days. The problem is they don't start at 0.
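The overflow arithmetic is easy to check. This little sketch (the function name is made up) computes how long a 64-bit counter takes to wrap if it starts at zero. At 10^12 ticks per second it comes out a bit over 213 days; at 2^40 ticks per second, the binary version of a "tera", it is about 194 days.

```cpp
#include <cassert>

// Days until a 64-bit counter ticking at ticksPerSecond wraps, starting from 0.
double DaysUntilWrap64(double ticksPerSecond)
{
    assert(ticksPerSecond > 0.0);

    const double kTwoTo64 = 18446744073709551616.0; // 2^64, exactly representable as a double
    const double kSecondsPerDay = 86400.0;

    return kTwoTo64 / ticksPerSecond / kSecondsPerDay;
}
```

The point stands either way: the wrap is far off only if the counter starts near zero, which the raw counters don't promise.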

Unsigned int wrapping works perfectly when you do subtracts and keep them in unsigned ints. That is :


in 8 bits :

U8 start = 250;
U8 end = 3;

U8 delta = end - start;
// delta == 9  (250 -> 256 is 6 steps, then 3 more)
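The same thing holds at any width. Here is a small sketch with a 32-bit counter that wraps between the two readings; the helper names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Elapsed ticks between two readings of a wrapping 32-bit counter.
// Valid as long as the counter wrapped at most once between start and end.
uint32_t ElapsedTicks32(uint32_t start, uint32_t end)
{
    // Unsigned arithmetic is modulo 2^32, so this is right even when end < start.
    return end - start;
}

void ElapsedTicks32Example()
{
    // start is 0x10 ticks before the wrap, end is 0x10 ticks after it.
    assert(ElapsedTicks32(0xFFFFFFF0u, 0x00000010u) == 0x20u);

    // No wrap : plain subtraction.
    assert(ElapsedTicks32(100u, 250u) == 150u);
}
```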

That's cool, but lots of other things don't work with wrapping :


U64 tsc1 = rdtsc();

... some stuff ...

U64 tsc2 = rdtsc();

U64 avg = ( tsc1 + tsc2 ) /2;

This is broken because tsc may have wrapped between the two reads (and even without a wrap, tsc1 + tsc2 can overflow 64 bits).
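A wrap-tolerant midpoint goes through the unsigned delta instead of the sum. This is only a sketch (MidpointTicks is a made-up name), and it assumes the second reading was taken after the first and less than one full wrap later.

```cpp
#include <cassert>
#include <cstdint>

// Midpoint of two readings of a wrapping 64-bit counter.
// Assumes 'later' was read after 'earlier', less than one full wrap apart.
uint64_t MidpointTicks(uint64_t earlier, uint64_t later)
{
    uint64_t delta = later - earlier; // correct modulo 2^64 even across a wrap
    return earlier + delta / 2;       // may itself wrap, which is still the right answer
}

void MidpointTicksExample()
{
    // Counter wrapped between the reads : 6 ticks before the wrap, then 2 after.
    uint64_t earlier = UINT64_MAX - 5;
    uint64_t later = 2;

    // 8 ticks apart, so the midpoint is 4 ticks after 'earlier'.
    assert(MidpointTicks(earlier, later) == UINT64_MAX - 1);

    // The naive average lands nowhere near either reading.
    assert((earlier + later) / 2 != UINT64_MAX - 1);
}
```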

The one that usually gets me is simple compares :


if ( time1 < time2 )
{
    // ... event1 was earlier
}

are broken when time can wrap. In fact with unsigned times that wrap there is no way to tell which one came first (though you could if you put a limit on the maximum time delta that you consider valid - eg. any place that you compare times, you assume they are within 100 days of each other).
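If you do accept a limit on the delta, the compare can be written against the wrapped difference. This sketch uses half the counter range (2^63 ticks) as the limit rather than 100 days, purely because it makes the test trivial; IsEarlier is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// True if time a is earlier than time b on a wrapping 64-bit counter.
// Assumes the two times are less than 2^63 ticks apart.
bool IsEarlier(uint64_t a, uint64_t b)
{
    // a - b is computed modulo 2^64. If a is behind b, the difference
    // lands in the upper half of the range.
    return (a - b) > (UINT64_MAX >> 1);
}

void IsEarlierExample()
{
    // b was read after the counter wrapped, so b < a numerically,
    // but a still happened first.
    uint64_t a = UINT64_MAX - 5;
    uint64_t b = 3;

    assert(IsEarlier(a, b));
    assert(!IsEarlier(b, a));
}
```

Every call site still has to remember to use this instead of a plain less-than, which is the real problem.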

But this is easily fixed. Instead of letting people call rdtsc raw, you bias it :


uint64  Timer::GetAbsoluteTSC()
{
    static uint64 s_first = rdtsc();
    uint64 cur = rdtsc();
    return (cur - s_first);
}

this gives you a TSC that starts at 0 and won't wrap for a few years. This lets you just do normal compares everywhere to know what came before what. (I used the TSC as an example here, but you mainly want QPC to be the time you're passing around).
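To see the rebasing survive a wrap of the underlying counter, here is a testable variant of the same idea. Everything in it is hypothetical scaffolding: RebasedCounter takes a function pointer in place of the real QPC or rdtsc call, and FakeRawCounter is a fake source that starts just below the 64-bit limit.

```cpp
#include <cassert>
#include <cstdint>

typedef uint64_t (*RawCounterFn)(); // stand-in for a QPC or rdtsc wrapper

class RebasedCounter
{
public:
    // Takes one reading at construction and uses it as the zero point.
    explicit RebasedCounter(RawCounterFn raw) : m_raw(raw), m_first(raw()) {}

    // Ticks since construction, no matter where the raw counter started.
    uint64_t Now() const { return m_raw() - m_first; }

private:
    RawCounterFn m_raw;
    uint64_t m_first;
};

// Fake raw counter : advances 10 ticks per call and wraps almost immediately.
static uint64_t s_fakeRaw = UINT64_MAX - 15;

uint64_t FakeRawCounter()
{
    uint64_t v = s_fakeRaw;
    s_fakeRaw += 10;
    return v;
}

void RebasedCounterExample()
{
    s_fakeRaw = UINT64_MAX - 15;
    RebasedCounter timer(FakeRawCounter); // first reading becomes the base

    uint64_t t1 = timer.Now(); // raw is base + 10, still before the wrap
    uint64_t t2 = timer.Now(); // raw is base + 20, which has wrapped past zero

    assert(t1 == 10 && t2 == 20);
    assert(t1 < t2); // plain compare works on the rebased values
}
```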

7 comments:

won3d said...

Depending on overflow can be funky on some compilers, which can assume that overflow never happens.

Also, to compute the average of two numbers without overflow, you build the adder yourself using bitwise ops. Something like:

(a & b) + ((a^b) >> 1)

cbloom said...

Hmm either I'm not understanding you or you're not understanding me.

The issue is with numbers like :

tsc1 = 250
tsc2 = 2

(in 8 bits)

The correct answer for "average of tsc1 and tsc2" is 254.

If you do (tsc1+tsc2)/2 you get the wrong thing (126).

One valid way to get that is :

U8 delta = tsc2 - tsc1;
U8 avg = tsc1 + (delta/2);

Are you saying that subtracting a larger unsigned int from a smaller one is not always defined as steps around the ring [0,max] ?

BTW I am explicitly using the knowledge here that tsc2 is after tsc1, and it is not more than one ring loop after.

cbloom said...

But BTW of course this is exactly the whole point of rebasing to zero. You don't want to have code scattered around all over your app that deals with this stuff.

This is exactly one of those pitfalls where expert programmers will think "I'll just return the TSC as a U64 it will be fine people can deal with the wrapping at the client site it's easy" - and they are wrong. Maybe it is easy, but there are subtle bugs that are hard to spot (and rarely happen and hard to repro because they only happen when the counter wraps), and even if you do deal with everything right, it means every time you look at any code that does math on timer values, you have to waste brain energy thinking "hmm is that right?"

cbloom said...

.. and I should also mention there's a huge disadvantage to ever using the TSC as a timer :

you have to calibrate it. To my knowledge there's no reliable way to get the TSC frequency.

jwatte_food said...

[quote]Depending on overflow can be funky on some compilers, which can assume that overflow never happens.
[/quote]

I think that's a red herring. Any system and compiler you're likely to write game code for will do the same thing when wrapping unsigned integers using addition and subtraction arithmetic.

Anonymous said...

Raymond Chen actually had a post on his blog many years ago about the need to compute the average that way.

(As I recall it was written in that annoying Raymond Chen way that seemed to poo-poo the issue that there was all this voodoo knowledge and no actual way for developers to actually have the giant list of all the things they needed to know.)

won3d said...

Yeah, I didn't understand you, because I was just thinking of the problem where if you want to find the average of 128 and 128 the naive way you get 0 if you only have 8-bit internal precision.
