
09-26-11 - Tiny Suffix Note

Obviously there are lots of analogies between suffix tries and suffix arrays.

This old note about suffix arrays, which shows how to get O(N) neighbor-pair match lengths, is exactly analogous to using "follow pointers" for O(N) string matching in suffix tries.

(their paper also contains a proof of O(N)'ness, though it is obvious if you think about it a bit; see the comments on the previous post about this).
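For concreteness, here's a minimal sketch of that O(N) neighbor-pair computation - it's the standard linear-time LCP construction (Kasai et al.), which I believe is what's being referenced. It assumes the suffix array "sa" is already built; names are mine :

// Sketch of the O(N) neighbor-pair match lengths (Kasai et al.'s
// linear-time LCP construction). sa[] is the finished suffix array of
// buf (sa[sort position] = file position); on return,
// lcp[r] = match length of the suffixes at sort positions r-1 and r.
#include <vector>

std::vector<int> neighbor_match_lengths(const unsigned char * buf, int n,
                                        const std::vector<int> & sa)
{
    std::vector<int> rank(n), lcp(n, 0);
    for (int r = 0; r < n; r++)
        rank[sa[r]] = r;

    int h = 0; // current match length
    for (int pos = 0; pos < n; pos++)
    {
        if (rank[pos] > 0)
        {
            int prev = sa[rank[pos] - 1]; // neighbor in sort order
            while (pos + h < n && prev + h < n && buf[pos + h] == buf[prev + h])
                h++;
            lcp[rank[pos]] = h;
            if (h > 0) h--; // h drops by at most 1 per step -> O(N) total
        }
        else
        {
            h = 0; // sort position 0 has no predecessor
        }
    }
    return lcp;
}

The "if you think about it a bit" proof is the h-- line : h can only decrease by one per position, so the total work in the while loop is O(N).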

Doing Judy-ish stuff for a suffix tree is exactly analogous to the "introspective" stuff that's done in good suffix array sorters like divsufsort.

By Judy-ish I mean using a variety of tree structures and selecting one for the local area based on its properties. (eg. nodes with > 100 children switch to just using a radix array of 256 direct links to kids).
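To illustrate, a hedged little sketch of a node that switches representation when it gets dense; the names and the exact promotion rule here are invented :

// Illustrative sketch only : pick the trie node's representation from
// its local properties. Sparse nodes use a flat (byte, child) list with
// a linear scan; dense nodes promote to a 256-entry radix array.
#include <cstring>
#include <utility>
#include <vector>

struct Node
{
    std::vector< std::pair<unsigned char, Node *> > small; // sparse form
    Node ** radix = nullptr; // dense form : 256 direct links to kids

    Node * find_child(unsigned char c) const
    {
        if (radix)
            return radix[c]; // O(1) follow
        for (const auto & kv : small)
            if (kv.first == c) return kv.second;
        return nullptr;
    }

    void add_child(unsigned char c, Node * kid)
    {
        if (radix) { radix[c] = kid; return; }
        small.push_back({ c, kid });
        if (small.size() > 100) // "> 100 children" : switch representation
        {
            radix = new Node * [256];
            std::memset(radix, 0, 256 * sizeof(Node *));
            for (const auto & kv : small)
                radix[kv.first] = kv.second;
            small.clear();
            small.shrink_to_fit();
        }
    }
};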

Suffix tries are annoying because it's easy to slide the head (adding nodes) but hard to slide the tail (removing nodes). Suffix arrays are even worse in that they don't slide at all.

The normal way to adapt suffix arrays to LZ string matching is just to use chunks of arrays (possibly a power-of-2 cascade). There are two problems I haven't found a good solution to. One is how to look up a string in a chunk that it is not a member of (eg. a chunk that's behind you). The other is how to deal with offsets that are in front of you.
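For the first problem, the obvious (bad) fallback is just a binary search on the foreign chunk's suffix array with raw string compares. A sketch, purely illustrative, with invented names :

// Sketch : looking up a string in a chunk it is not a member of, by
// binary searching that chunk's suffix array with direct compares.
// This costs O(query_len * log chunk_len) per chunk, instead of the
// O(1) rank[] lookup you get for strings inside your own chunk.
#include <algorithm>
#include <cstring>
#include <vector>

// returns the sort position where query would insert into chunk_sa;
// the entries at (result-1) and (result) are the match candidates
int lookup_in_foreign_chunk(const unsigned char * chunk, int chunk_len,
                            const std::vector<int> & chunk_sa,
                            const unsigned char * query, int query_len)
{
    int lo = 0, hi = chunk_len;
    while (lo < hi)
    {
        int mid = (lo + hi) / 2;
        const unsigned char * suff = chunk + chunk_sa[mid];
        int suff_len = chunk_len - chunk_sa[mid];
        int n = std::min(suff_len, query_len);
        int cmp = std::memcmp(suff, query, n);
        if (cmp == 0) // one is a prefix of the other; shorter sorts first
            cmp = (suff_len < query_len) ? -1 : (suff_len > query_len) ? 1 : 0;
        if (cmp < 0)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}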

If you just put your whole file in one suffix array, I believe that is unboundedly bad. If you were allowed to match forwards, then finding the best match would be O(1) - you'd only have to look at the slot just before you and the slot just after you in the sort order. But since we can't match forward, you have to scan. The pseudocode is like this :


do both forward and backward :
  start at the sort position of the string I want to match
  walk to the next closest suffix in sort order (an O(1) table lookup)
  if it's a legal match (eg. behind me) - I'm done, it's the best in this direction
  if not, keep walking

The problem is that the walk is unbounded. When you are somewhere early in the array, there can be an arbitrary number (by which I mean O(N)) of invalid matches between you and your best match in the sort order.
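Fleshed out into C++ (a sketch only : sa and rank assumed prebuilt, and direct compares used instead of an lcp array for clarity), the degeneracy is the two unbounded loops :

// Sketch of the scan above over one whole-file suffix array.
#include <vector>

static int match_len(const unsigned char * buf, int n, int a, int b)
{
    int len = 0;
    while (a + len < n && b + len < n && buf[a + len] == buf[b + len])
        len++;
    return len;
}

// find the best match for the suffix at pos among positions < pos ;
// returns the match length and fills *match_pos (-1 if none)
int suffix_array_match(const unsigned char * buf, int n, int pos,
                       const std::vector<int> & sa,
                       const std::vector<int> & rank,
                       int * match_pos)
{
    int r = rank[pos];
    int best_len = 0;
    *match_pos = -1;

    // the first legal suffix found on each side is the best legal one
    // on that side (LCP can only shrink as you walk away), but nothing
    // bounds how far either walk goes - that's the O(N) degeneracy
    for (int i = r - 1; i >= 0; i--)
        if (sa[i] < pos)
        {
            best_len = match_len(buf, n, pos, sa[i]);
            *match_pos = sa[i];
            break;
        }

    for (int i = r + 1; i < n; i++)
        if (sa[i] < pos)
        {
            int len = match_len(buf, n, pos, sa[i]);
            if (len > best_len) { best_len = len; *match_pos = sa[i]; }
            break;
        }

    return best_len;
}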

Other than these difficulties, suffix arrays provide a much simpler way of getting the advantages of suffix tries.

Suffix arrays also have implementation advantages. Because the suffix-string work is separated from the rest of your coder, it's easier to optimize each part in isolation, and you get better cache use and better register allocation. Also, the suffix array can use more memory during the sort, or use scratch space, while a trie has to hold its structure around all the time. For example, some suffix sorts will do things like use a 2-byte radix in the parts of the sort where that makes sense (and then get rid of it and reuse the memory on another part of the sort); that's usually impossible for a tree that you're holding in memory as you scan.

8 comments:

Shelwien said...

> Suffix arrays are even worse in that they don't slide at all.

They kinda do. It's pretty simple to add or remove a symbol, actually. But these simple operations have O(N) complexity because parts of the table have to be moved. And when we try to optimize it for speed, trees appear, and it becomes hard to recognize any relation to BWT :)

> One is how to look up a string in the chunk that it is not a member of

I'd do BWT on adjacent pairs of chunks (overlapped), then we'd automatically know where to look next. But this requires 8N memory, and 8N structures like MMC suddenly become more attractive.

Btw, there's also this - http://encode.ru/threads/1278-Nibble-MMC-matchfinder

Shelwien said...

Ah, so to use suffix arrays for matchfinding, I'd just compute the BWT of overlapped 2*winsize blocks. It would take the same memory as MMC or a binary tree, but with up to 2x the window size.

cbloom said...

Yeah, so the problem with 2*winsize blocks is the second problem - you can find lots of matches that are ahead of you instead of behind you.

Shelwien said...

No, half of them would be certainly behind. Also the best match would be nearest in SA, so we can apply iteration limits like in hash chains to avoid O(N).

It might also be possible to add some links to skip the "future" pointers, but that would again be at least 8N, which doesn't make sense.

cbloom said...

No, read my post again.

ryg said...

"No, half of them would be certainly behind."
Nope. Half of the *data* you want to compress is in the second half of the block, but it can contain way more than half of all potential matches.

Simple example: say your data consists of two blocks, each having winsize bytes. The first block consists of groups of 4 characters each - 50% of them are 4 random bytes, the other 50% are always "abab". The second block is just "ababababab"...

With high likelihood, almost all of your matches are going to be "abab", and most of the positions for "abab" in your suffix array are going to be in the second block. (In the first block, there are going to be about winsize/8 locations for "abab"; in the second block, there are winsize/2-1 of them.)

This may sound constructed, but you actually get patterns very much like that on highly repetitive structured data.

cbloom said...

Yup. Unfortunately, on a file of size N with window size W, the suffix array matcher is O(N*W). (Or you can limit the walk, but then you suffer an unknown number of missed matches.)

This is rare in practice, but I can't say that suffix arrays are the awesome solution when such a bad degeneracy is possible.

cbloom said...

Booya. I got it. Post later today.
