mikeash.com pyblog/algorithmic-optimizations-a-case-study.html comments

mikeash - 2007-12-28 01:02:23

Fri, 28 Dec 2007 01:02:23 GMT

I can see where that would be confusing. The big-O analysis is for a single search of the bucket, and in that instance you have a large string, and then you're searching for a smaller string inside it. In other words, the large string is the haystack, and the small string is the needle. In this case O(n*m) has n as the length of the haystack and m as the length of the needle. The worst-case running time for the full algorithm, building the string table from scratch, is going to be around O(n^2*m^2), where n is the number of strings and m is the average length. But the running time on good data is going to be more like O(n), since a roughly constant number of locations needs to be checked for each string, and on good data the string matching will fail early.

Rincewind - 2007-12-28 00:35:29

Fri, 28 Dec 2007 00:35:29 GMT

Then I misunderstood what you described as M initially, as I understood it to be the length of strings that fell out of the hash table (too short for it), but your expansion implies that it is the length of the shortest string ever presented to the algorithm.

mikeash - 2007-12-27 15:18:26

Thu, 27 Dec 2007 15:18:26 GMT

Nope, it's O(n*m).

Consider what happens when running this algorithm on a collection of large strings of identical lengths which contain the letter "x" for all but the last few characters. They will all hash to the same value. That one value will contain offsets pointing to nearly all of the large string. At each offset a string compare must be performed, which will fail on average halfway through the length of the string. (When comparing at the beginning of an existing string it must compare nearly the entire thing, and at the end it compares almost none; the average is one half.) Ignoring fenceposting, the total runtime is therefore (1 - epsilon) * n * m / 2, where epsilon is the proportion of the strings which is unique. And of course the constants fall out, so the big-O is O(n*m). Recall that big-O is worst case, not average case, so how it performs on non-pathological data is irrelevant.

To show this another way: consider that a naive string search (i.e. not Boyer-Moore or something else that's clever) is O(n*m). This just adds a hash table on top of that naive search algorithm, but of course hash tables fail on pathological data, so the worst case stays O(n*m).

Fancier algorithms could certainly improve this for pathological data, and probably for the data I actually deal with. However since this is code which is meant for my own personal use (it compiles dictionaries and users get the compiled data directly), the cost of implementing something fancier outweights any benefits I'd get.

Rincewind - 2007-12-27 04:37:29

Thu, 27 Dec 2007 04:37:29 GMT

Actually, your final algorithm is O(N). Since you define M as the size of your short string (aka the size of the hash) that effectively makes M a constant and such if falls out of the running time equation.

Although I wonder if a string-tree implementation might have given you similar speed up, although all the pointer chasing might have made it worthwhile.

Ahruman - 2007-11-26 19:53:16

Mon, 26 Nov 2007 19:53:16 GMT

[This adds five new three-character prefixes to the string table, three in "bonus itself" and two for the last two characters in thes tring table, and these new prefixes and their offsets are added to the offsets table.] -> [This adds five new three-character prefixes to the string table, three in "bonus" itself and two for the last two characters in the string table, and these new prefixes and their offsets are added to the offsets table.]

Sorry, been copy editing at work. ;-)