09-25-11 - More on LZ String Matching

This might be a series until I get angry at myself and move on to more important todos.

Some notes :

1. All LZ string matchers have to deal with this annoying problem of small files vs. large ones (and small windows vs. large windows). You really want very different solutions, or at least different tweaks. For example, the size of the accelerating hash needs to be tuned to the size of the data, or you can spend all your time initializing a 24 bit hash to find matches in a 10 byte file.
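For instance, something like this (a minimal sketch; the function name and the clamp bounds are made up, not from any particular matcher) :

#include <cstdint>

// pick the accelerating hash size from the input size, so a tiny
// file doesn't pay for clearing a huge table :
static int ChooseHashBits(int64_t dataSize)
{
    int bits = 10; // floor : 1024 slots
    while ( (int64_t(1) << bits) < dataSize && bits < 24 ) // cap : 24 bit hash
        bits++;
    return bits;
}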

2. A common trivial-case degeneracy is runs of the same character. You can of course add special-case handling of this to any string matcher. It does help a lot on benchmarks, because this case is common, but it doesn't help your worst case in theory, because there are still bad degenerate cases. It's just very rare to have long degenerate matches that aren't simple runs.

One easy way to do this is to special-case just the matches that start with a degenerate char. Have a special index of [256] slots, one per byte value, which corresponds to starting with >= 4 of that char.
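A minimal sketch of that run index (assuming a 4 byte minimum run; the names are mine) :

#include <cstdint>
#include <cstring>

struct RunIndex
{
    int32_t m_pos[256]; // last position starting a run of >= 4 of that byte, -1 = none

    void Init() { memset(m_pos, -1, sizeof(m_pos)); }

    // call this as each position is inserted into the normal matcher :
    void Add(const uint8_t * buf, int32_t pos, int32_t size)
    {
        if ( pos+4 <= size &&
             buf[pos] == buf[pos+1] &&
             buf[pos] == buf[pos+2] &&
             buf[pos] == buf[pos+3] )
        {
            m_pos[ buf[pos] ] = pos;
        }
    }

    // candidate start for a degenerate match beginning at pos, or -1 :
    int32_t Lookup(const uint8_t * buf, int32_t pos) const
    {
        return m_pos[ buf[pos] ];
    }
};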

3. A general topic that I've never seen explored well is the idea of approximate string matching.

Almost every LZ string matcher is approximate; they consider less than the full set of matches. Long ago someone referred to this as "amortized hashing", which refers to the specific implementation of a hash chain (hash -> linked list) in which you simply stop searching after visiting some # of links. (amortize = minimize the damage from the worst case).
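That is, something like this (a minimal sketch, assuming a 4 byte minimum match; names are made up, this is not any particular implementation) :

#include <cstdint>
#include <cstring>
#include <vector>

struct HashChainMatcher
{
    std::vector<int32_t> m_head; // hash value -> most recent position, -1 = none
    std::vector<int32_t> m_next; // position -> previous position with the same hash
    int m_hashBits;

    HashChainMatcher(int hashBits, int32_t dataSize)
        : m_head( size_t(1) << hashBits, -1 ), m_next( dataSize, -1 ), m_hashBits(hashBits) { }

    uint32_t Hash(const uint8_t * p) const
    {
        uint32_t x; memcpy(&x, p, 4); // multiplicative hash of the first 4 bytes
        return (x * 2654435761u) >> (32 - m_hashBits);
    }

    // find a match for buf[pos..], visiting at most maxVisits links;
    // requires pos+4 <= size for the hash read :
    int32_t FindMatch(const uint8_t * buf, int32_t pos, int32_t size,
                      int maxVisits, int32_t * pMatchPos) const
    {
        int32_t bestLen = 0;
        int32_t cur = m_head[ Hash(buf + pos) ];
        for(int visits = 0; cur >= 0 && visits < maxVisits; visits++)
        {
            int32_t len = 0;
            while ( pos + len < size && buf[cur + len] == buf[pos + len] )
                len++;
            if ( len > bestLen ) { bestLen = len; *pMatchPos = cur; }
            cur = m_next[cur];
        }
        return bestLen; // the "amortize" : may miss a longer match past the cutoff
    }

    void Insert(const uint8_t * buf, int32_t pos)
    {
        uint32_t h = Hash(buf + pos);
        m_next[pos] = m_head[h];
        m_head[h] = pos;
    }
};

maxVisits is the knob : lower is faster but finds worse matches.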

Another common form of approximate string searching is to use "cache tables" (that is, hash tables with overwrites). Many people use cache tables with a few "ways".
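Something like this sketch (invented names; assumes 4 ways with oldest-out replacement) :

#include <cstdint>

enum { kCacheWays = 4 };

struct CacheTableBucket
{
    int32_t m_pos[kCacheWays]; // candidate match positions, -1 = empty

    void Init() { for(int w = 0; w < kCacheWays; w++) m_pos[w] = -1; }

    void Insert(int32_t pos)
    {
        // shift the ways down; the oldest candidate falls off the end
        for(int w = kCacheWays-1; w > 0; w--)
            m_pos[w] = m_pos[w-1];
        m_pos[0] = pos;
    }
};

// to search : hash the current bytes to pick a bucket, then check all
// kCacheWays candidates; anything that was overwritten is gone forever.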

The problem with both these approaches is that the penalty is *unbounded*. The approximate match can be arbitrarily worse than the best match. That sucks.

What would be ideal is some kind of tuneable and boundable approximate string match. You want to set some amount of loss you can tolerate, and get more speedup for more loss.

(there are such data structures for spatial search, for example; there are nice approximate-nearest-neighbor searches and high-dimensional kd-trees and things like that which let you set the amount of slop you tolerate, and you get more speedup for more slop. So far as I know there is nothing comparable for strings).

Anyhoo, the result is that algorithms with approximations can look very good in some tests, because they find 99% of the match length but do so much faster. But then on another test they suddenly fail to find even 50% of the match length.

5 comments:

Shelwien said...

Do you have any comments about applying BWT suffix arrays to LZ matchfinding? For example, it's used in bsdiff.

cbloom said...

That's actually what I use for my optimal parsing LZ at the moment. I wrote a note about it here :

http://cbloomrants.blogspot.com/2010/06/06-17-10-suffix-array-neighboring-pair.html

The big advantage of it is that "divsufsort" by Yuta Mori is really superb.

It will be part of my test set.

doynax said...

I know this is a little out of left field, but do you happen to know of any algorithms for static dictionary coding? I usually work on embedded systems with rather limited RAM, where such algorithms would be useful.

Articles on dictionary coding usually mention such methods in passing, and there are plenty of papers on searching existing dictionaries of course, but I'm having trouble finding much of anything on generating them in the first place.

Sorry for bugging you but I'd appreciate a nudge in the right direction. I never did get the hang of this literature search thing and I usually manage to get stuck behind some paywall besides ;)

cbloom said...

@doynax :

"Off-Line Dictionary-Based Compression"

Larsson & Moffat

doynax said...

Thank you!

I had considered stripping ineffective codewords off an LZ-78 dictionary in an ad-hoc fashion, but this approach should actually work ;)

It'll be interesting to see what sort of compression I'll get out of it on our data, especially without using much in the way of entropy coding.
