2/24/2017

Oodle Perf with Chunking and Dictionary Size

I get a lot of customers that want to cut their data into small blocks for paging, who ask "what's the benefit of using larger blocks" ?

The larger the block = more compression, and can help throughput (decode speed).

Obviously larger block = longer latency (to load & decode one whole block).

(though you can get data out incrementally, you don't have to wait for the whole decode to get the first byte out; but if you only needed the last byte of the block, it's strictly longer latency).

If you need fine grain paging, you have to trade off the desire to get precise control of your loading with small blocks & the benefits of larger blocks.

(obviously always follow general good paging practice, like amortize disk seeks, combine small resources into paging units, don't load a 256k chunk and just keep 1k of it and throw the rest away, etc.)

As a reference point, here's Kraken on Silesia with various chunk sizes :


Silesia : (Kraken Normal -z4)

 16k : ooKraken    : 211,938,580 ->75,624,641 =  2.855 bpb =  2.803 to 1 
 16k : decode           : 264.190 millis, 4.24 c/b, rate= 802.22 mb/s

 32k : ooKraken    : 211,938,580 ->70,906,686 =  2.676 bpb =  2.989 to 1 
 32k : decode           : 217.339 millis, 3.49 c/b, rate= 975.15 mb/s

 64k : ooKraken    : 211,938,580 ->67,562,203 =  2.550 bpb =  3.137 to 1 
 64k : decode           : 195.793 millis, 3.14 c/b, rate= 1082.46 mb/s

128k : ooKraken    : 211,938,580 ->65,274,250 =  2.464 bpb =  3.247 to 1 
128k : decode           : 183.232 millis, 2.94 c/b, rate= 1156.67 mb/s

256k : ooKraken    : 211,938,580 ->63,548,390 =  2.399 bpb =  3.335 to 1 
256k : decode           : 182.080 millis, 2.92 c/b, rate= 1163.99 mb/s

512k : ooKraken    : 211,938,580 ->61,875,640 =  2.336 bpb =  3.425 to 1 
512k : decode           : 182.018 millis, 2.92 c/b, rate= 1164.38 mb/s

1024k: ooKraken    : 211,938,580 ->60,602,177 =  2.288 bpb =  3.497 to 1 
1024k: decode           : 181.486 millis, 2.91 c/b, rate= 1167.80 mb/s

files: ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files: decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s


16k   :  2.80:1 ,   15.7 enc mbps ,  802.2 dec mbps
32k   :  2.99:1 ,   19.7 enc mbps ,  975.2 dec mbps
64k   :  3.14:1 ,   22.8 enc mbps , 1082.5 dec mbps
128k  :  3.25:1 ,   24.6 enc mbps , 1156.7 dec mbps
256k  :  3.34:1 ,   25.5 enc mbps , 1164.0 dec mbps
512k  :  3.43:1 ,   25.4 enc mbps , 1164.4 dec mbps
1024k :  3.50:1 ,   24.6 enc mbps , 1167.8 dec mbps
files :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

(note these are *chunks* not a window size; no carry-over of compressor state or dictionary is allowed across chunks. "files" means compress the individual files of silesia as whole units, but reset compressor between files.)

You may have noticed that the chunked files (once you get past the very small 16k,32k) are somewhat faster to decode. This is due to keeping match references in the CPU cache in the decoder.

Limitting the match window (OodleLZ_CompressOptions::dictionarySize) gives the same speed benefit for staying in cache, but with a smaller compression win.


window 128k : ooKraken    : 211,938,580 ->61,939,885 =  2.338 bpb =  3.422 to 1 
window 128k : decode           : 181.967 millis, 2.92 c/b, rate= 1164.71 mb/s

window 256k : ooKraken    : 211,938,580 ->60,688,467 =  2.291 bpb =  3.492 to 1 
window 256k : decode           : 182.316 millis, 2.93 c/b, rate= 1162.48 mb/s

window 512k : ooKraken    : 211,938,580 ->59,658,759 =  2.252 bpb =  3.553 to 1 
window 512k : decode           : 184.702 millis, 2.97 c/b, rate= 1147.46 mb/s

window 1M : ooKraken    : 211,938,580 ->58,878,065 =  2.222 bpb =  3.600 to 1 
window 1M : decode           : 184.912 millis, 2.97 c/b, rate= 1146.16 mb/s

window 2M :  ooKraken    : 211,938,580 ->58,396,432 =  2.204 bpb =  3.629 to 1 
window 2M :  decode           : 182.231 millis, 2.93 c/b, rate= 1163.02 mb/s

window 4M :  ooKraken    : 211,938,580 ->58,018,936 =  2.190 bpb =  3.653 to 1 
window 4M : decode           : 182.950 millis, 2.94 c/b, rate= 1158.45 mb/s

window 8M : ooKraken    : 211,938,580 ->57,657,484 =  2.176 bpb =  3.676 to 1 
window 8M : decode           : 189.241 millis, 3.04 c/b, rate= 1119.94 mb/s

window 16M: ooKraken    : 211,938,580 ->57,525,174 =  2.171 bpb =  3.684 to 1 
window 16M: decode           : 202.384 millis, 3.25 c/b, rate= 1047.21 mb/s

files     : ooKraken    : 211,938,580 ->57,451,361 =  2.169 bpb =  3.689 to 1 
files     : decode           : 206.305 millis, 3.31 c/b, rate= 1027.31 mb/s

window 128k:  3.42:1 ,   20.1 enc mbps , 1164.7 dec mbps
window 256k:  3.49:1 ,   20.1 enc mbps , 1162.5 dec mbps
window 512k:  3.55:1 ,   20.1 enc mbps , 1147.5 dec mbps
window 1M  :  3.60:1 ,   20.0 enc mbps , 1146.2 dec mbps
window 2M  :  3.63:1 ,   19.7 enc mbps , 1163.0 dec mbps
window 4M  :  3.65:1 ,   19.3 enc mbps , 1158.5 dec mbps
window 8M  :  3.68:1 ,   18.9 enc mbps , 1119.9 dec mbps
window 16M :  3.68:1 ,   18.8 enc mbps , 1047.2 dec mbps
files      :  3.69:1 ,   18.9 enc mbps , 1027.3 dec mbps

WARNING : tuning perf to cache size is obviously very machine dependent; I don't really recommend fiddling with it unless you know the exact hardware you will be decoding on. The test machine here has a 4 MB L3, so speed falls off slightly as window size approaches 4 MB.


If you do need to use tiny chunks with Oodle ("tiny" being 32k or smaller; 128k or above is in the normal intended operating range) here are a few tips to consider :

1. Consider pre-allocating the Decoder object and passing in the memory to the OodleLZ_Decompress calls. This avoids doing a malloc per call, which may or may not be significant overhead.

2. Consider changing OodleConfigValues::m_OodleLZ_Small_Buffer_LZ_Fallback_Size . The default is 2k bytes. Buffers smaller than that will use LZB16 instead of the requested compressor, because many of the new ones don't do well on tiny buffers. If you want to have control of this yourself, you can set this to 0.

3. Consider changing OodleLZ_CompressOptions::spaceSpeedTradeoffBytes . This is the number of bytes that must be saved from the compressed output size before the encoder will choose a slower decode mode. eg. it controls decisions like whether literals are sent raw or with entropy coding. This number is scaled for full size buffers (128k bytes or more). When using tiny buffers, it will choose to avoid entropy coding more often. You may wish to dial down this value to scale to your buffers. The default is 256 ; I recommend trying 128 to see what the effect is.

No comments:

old rants