Reservoir Sampling - Sampling from a stream of elements

Sampling from a stream of elements

Oct 8, 2007

If you don't find programming algorithms interesting, this post is not for you.

Problem Statement

Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements, for example:

Queries on google searches in May
Products bought at Walmart during the Christmas season
Names in a phone book

Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?

The right answer is generating random integers between 0 and N - 1, then retrieving the elements at those indices and you have your answer. If you need to be generate unique elements, then just throw away indices you've already generated.

So, let me make the problem harder. You don't know N (the size of the stream) in advance and you can't index directly into it. You can count it, but that requires making 2 passes of the data. You can do better. There are some heuristics you might try: for example to guess the length and hope to undershoot. It will either not work in one pass or will not be evenly distributed.

Simple Solution

A relatively easy and correct solution is to assign a random number to every element as you see it in the stream, and then always keep the top 1,000 numbered elements at all times. This is similar to how a SQL Query with ORDER BY RAND() works. This strategy works well, and only requires additionally storing the randomly generated number for each element.

Reservoir Sampling

Another, more complex option is reservoir sampling. First, you want to make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. That way if you have exactly 1,000 elements, the algorithm works. This is the base case.

Next, you want to process the 'th element (starting with ) such that at the end of processing that step, the 1,000 elements in your reservoir are randomly sampled amongst the elements you've seen so far.

How can you do this? Start with . With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: Generate a random number between 0 and 1, and if it is less than you should take element 1,001. In other words, choose to add element 1,001 to your reservoir with probability . If you choose to add it, then replace any element in the reservoir chosen randomly.

I've shown that this produces a chance of selecting the 1,001'th element, but what about the 2nd element in the list? The 2nd element is in the reservoir at step 1,000 with 100% probability. The probability of removing it is the probability of element 1,001 getting selected multiplied by the probability of Element #2 getting randomly chosen as the replacement candidate from the 1,000 elements in the reservoir. That probability is:

So, the probability that the 2nd element survives this round is:

This is the probability we want.

This can be extended for the 'th round: keep the 'th element with probability and if it is kept, replace a random element from the reservoir.

It is pretty easy to prove that this works for all values of using induction: It obviously works for the 'th element based on the way the algorithm selects the 'th element with the correct probability outright. The probability any element before this step being in the reservoir is . The probability that they are removed is the probability that the 'th element is selected multiplied by the probability that the prior element was selected to drop from the reservoir:

This is the probability that a given element will be removed in that round given that it has made it to the reservoir so far. The probabilty that it isn't removed is the inverse. The final probability that it survives to the 'th round is the probability that it survived to the previous round multiplied by the probability that it isn't removed:

This is the probability we want.

Weighted Reservoir Sampling Variation

Now take the same problem above but add an extra challenge: How would you sample from a weighted distribution where each element has a given weight associated with it in the stream? This is sorta tricky. Pavlos S. Efraimidis figured out the solution in 2005 in a paper titled Weighted Random Sampling with a Reservoir. It works similarly to the assigning a random number solution above.

As you process the stream, assign each item a "key". For each item in the stream , let be the item's "key", let be the weight of that item and let be a random number between 0 and 1. The "key" () is a random number to the 'th root where is weight of that item in the stream:

Now, simply keep the top elements ordered by their keys (), where is the size of your sample. Easy.

To see how this works, lets start with non-weighted elements (ie: weight = 1). is always 1, so the key is simply a random number and this algorithm degrades into the simple algorithm mentioned above.

Now, how does it work with weights? The probability of choosing over is the probability that > . can have any value from 0 - 1. However, it's more likely to be closer to 1 the higher is. We can see what the distribution of this looks like when comparing to a weight 1 element by integrating over all values of random numbers from 0 - 1. You get something like this:

If = 1, = 1 / 2. If = 9, = 9 / 10. When replacing against a weight 1 element, an item of weight 5 would have a 50% chance and an element of weight 9 would have a 90% chance. Similar math works for two elements of non-zero weight.

Distributed Reservoir Sampling Variation

This is the problem that got me researching the weighted sample above. In both of the above algorithms, I can process the stream in O(N) time where N is length of the stream, in other words: in a single pass. If I want to break break up the problem on say 10 machines and solve it close to 10 times faster, how can I do that?

The answer is to have each of the 10 machines take roughly 1/10th of the input to process and generate their own reservoir sample from their subset of the data using the weighted variation above. Then, a final process must take the 10 output reservoirs and merge them.

The trick is that the final process must use the original "key" weights computed in the first pass. For example, If one of your 10 machines processed only 10 items in a size-10 sample, and the other 10 machines each processed 1 million items, you would expect that the one machine with 10 items would likely have smaller keys and hence be less likely to be selected in the final output. If you recompute keys in the final process, then all of the input items would be treated equally when they shouldn't.