9/29/2011

09-29-11 - Suffix Tries 3 - On Follows with Path Compression

Some subtle things that it took me a few days to track down. Writing for my reference.

1. Follows updates should be a bit "lazy". With path compression you aren't making all the nodes on a suffix. So when you match at length 5, the follow at length 4 might not exist. (I made a small note on the consequences of this previously . Even if the correct follow node doesn't exist, you should still link in to the next longest follow node possible (eg. length 3 if a 4 doesn't exist). Later on the correct follow might get made, and then if possible you want to update it. So you should consider the follows links to be constantly under lazy update; just because a follow link exists it might not be the right one, so you may want to update it.

eg. say you match 4 bytes of suffix [abcd](ef..) at the current spot. You want to follow to [bcd] but there is no length 3 node of that suffix currently. Instead you follow to [bc] (the next best follow available) , one of whose children is [dxy], you now split the [dxy] to [d][xy] and add [ef] under [d]. You can then update the follow from the previous node ([abcd]) to point at the new [bc][d] node.

2. It appears that you only need to update one follow per byte to get O(N). I don't see that this is obvious from a theoretical standpoint, but all my tests pass. Say you trace down a long suffix. You may encounter several nodes that don't have fully up to date follow pointers. You do not have to track them all and update them all at the next byte. It seems you can just update the deepest one (not the deepest node, but the deepest node that needs an update). (*)

3. Even if your follow is not up to date, you can still use the gauranteed (lastml-1) match len to good advantage. This was a big one that I missed. Say you match 4096 bytes and you take the follow pointer, and it takes you to a node of depth 10. You've lost a lot of depth - you know you must match at least 4095 bytes and you only have 10 of them. But you still have an advantage. You can descend the tree and skip all string compares up to 4095 bytes. In particular, when you get to a leaf you can immediately jump to matching 4095 of the leaf pointer.

4. Handling of EOF in suffix algorithms is annoying; it needs to act like a value outside the [0,255] range. The most annoying case is when you have a degenerate suffix like aaaa...aaaEOF , because the "follow" for that suffix might be itself (eg. what follows aaa... is aa..) depending on how you handle EOF. This can only happen with the degenerate RLE case so just special casing the RLE-to-EOF case avoids some pain.

ADDED : There's a trivial solution to handling EOF easily : 09-10-14 - Suffix Trie EOF handling

(* = #2 is the thing I have the least confidence in; I wonder if there could be a case where the single node update doesn't work, or if maybe you could get non-O(N) behavior unless you have a more clever/careful update node selection algorithm)

No comments:

old rants