cbloom rants: 09-30-11 - String Match Results Part 1

9/30/2011

09-30-11 - String Match Results Part 1

I was hoping to make some charts and graphs, but it's just not that interesting. Anyhoo, let's get into it.

What am I testing? String matching for an LZ-type compressor. Matches must start before current pos but can run past current pos. I'm string matching only, not compressing. I'm counting the total time and total length of matches found.

I'm testing match length >= 4. Matches of length 2 & 3 can be found trivially by table lookup (though on small files this is not a good way to do it). Most of the matchers can handle arbitrary min lengths, but this is just easier/fairer for comparison.

I'm testing both "greedy" (when you find a match step ahead its length) and "optimal" (find matches at every position). Some matchers like the suffix tree ones don't really support greedy parsing, since they have to do all the work at every position even if you don't want the match there.

I'm testing windowed and non-windowed matchers.

I'm testing approximate and non-approximate (exact) matchers. Exact matchers find all matches possible, approximate matchers find some amount less. I'm not sure the best way to show the approximation vs. speed trade off. I guess you want a "pareto frontier" type of graph, but what should the axes be?

Also, while I'm at it, god damn it!

MAKE YOUR CODE FREE PEOPLE !!

(and GPL is not free). And some complicated personal license is a pain in the ass. I used to do this myself, I know it's tempting. Don't fucking do it. If you post code just make it 100% free for all uses. BSD license is an okay choice.

Matchers I'm having trouble with :


Tornado matchers from FreeArc - seem to be GPL (?)

MMC - GPL

LzFind from 7zip appears to be public domain. divsufsort is free. Larsson's slide is free.

cbloom rants

9/30/2011

09-30-11 - String Match Results Part 1

No comments:

old rants