Wednesday, October 21, 2009

Probabilistic Counting Algorithms for Data Base Applications

Probabilistic Counting Algorithms for Data Base Applications, Philippe Flajolet and G. Nigel Martin, Journal of Computer and System Sciences, Vol. 31, No. 2, October 1985.

This seems to be a classic paper. It was pointed out as a major reference in a review request, so I went to read it. I'm glad I did, it is very cute. It describes a counting scheme for estimating the number of distinct items in a set: a probabilistic mechanism with a single pass and low storage requirements.

It works as follows: for every item in the set, it computes a hash. From the hash one derives a value with a geometric distribution: it equals 0 with probability 1/2, 1 with probability 1/4, 2 with probability 1/8, etc., up to L-1 (in the paper this is \rho(y), the position of the least significant 1-bit of the hashed value y). Consider a bit vector of length L, and flip the i-th bit to 1 if the geometric value equals i for some item (keep it 0 otherwise). Since two items in the data base that are identical will hash to the same value, they set the same bit, so the final bit vector depends only on the set of distinct items, not on repetitions.
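To make this concrete, here is a minimal Python sketch of the update step. The paper only assumes a hash that spreads values uniformly; md5 below, and the helper names rho and h, are my own stand-ins.

    import hashlib

    L = 32  # bitmap length

    def rho(y, max_bits=L):
        # index of the least significant 1-bit of y; this is geometric:
        # 0 with probability 1/2, 1 with probability 1/4, and so on
        for i in range(max_bits):
            if (y >> i) & 1:
                return i
        return max_bits - 1  # y == 0: clamp to the last position

    def h(item):
        # any well-mixing hash works; md5 truncated to 32 bits is arbitrary
        return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")

    bitmap = 0
    for item in ["a", "b", "a", "c", "b", "c", "d"]:
        bitmap |= 1 << rho(h(item))  # duplicates just set the same bit again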

For a large number of distinct items, positions 0, 1, 2, etc. of the bit vector will all be 1. The first 0 bit is used to estimate the number of distinct items. Denote by R the position of the left-most 0 in the bit vector (equivalently, the length of the initial run of 1s). The paper shows that R is about log_2(\phi n), with \phi = 0.77351, with a relatively small variance, so that n can be estimated from R as roughly 2^R / \phi. The counter itself is just the bit vector of length L, i.e., O(log(n)) bits.
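A sketch of the estimation step, under the same assumptions as above (a single bitmap and uniform 32-bit hashes, simulated here with random bits):

    import random

    PHI = 0.77351  # bias-correction constant from the paper

    def estimate(bitmap, L=32):
        # R = index of the lowest 0 bit; the estimate is 2^R / PHI
        R = 0
        while R < L and (bitmap >> R) & 1:
            R += 1
        return 2 ** R / PHI

    # quick check: n distinct items simulated as n uniform 32-bit hashes
    random.seed(1)
    n, bitmap = 100_000, 0
    for _ in range(n):
        y = random.getrandbits(32)
        r = (y & -y).bit_length() - 1 if y else 31  # lowest set bit index
        bitmap |= 1 << r
    print(estimate(bitmap))  # expect the right order of magnitude

A single bitmap only gets the order of magnitude right (R is an integer, so the estimate moves in powers of 2); the paper sharpens this by keeping m bitmaps, dispatching each item to one of them with a few hash bits, and averaging the R values, a technique it calls stochastic averaging (the PCSA algorithm).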

The paper makes the proofs formal, but the analysis is relatively complex. Still, I like the idea of looking at some property of a sequence of random variables to estimate the number of elements in the sequence. I believe I have seen somewhere a reference where the longest run of a particular value is used to estimate the number of items (for instance, in head/tail coin tosses, the longest run of heads grows as O(log(n)), if I recall correctly, which would lead to a log(log(n))-bit counter for such a scheme).
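A quick simulation of that recollection (the longest run of heads in n fair tosses does concentrate around log_2(n)):

    import math
    import random

    random.seed(0)

    # longest run of heads in n fair coin tosses, versus log2(n)
    for n in (10**3, 10**4, 10**5, 10**6):
        run = best = 0
        for _ in range(n):
            run = run + 1 if random.random() < 0.5 else 0
            best = max(best, run)
        print(n, best, round(math.log2(n), 1))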

The interesting question is: what other proxies can be used to compress the count, and can be inverted relatively easily to yield a good estimate of the original count?
