If you don't find programming algorithms interesting, stop reading. This post is not for you.
Problem Statement
Now that that is out of the way, Reservoir Sampling is a fun algorithm. Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?
The right answer is generating random integers between 0 and N-1, then retrieving the elements at those indices and you have your answer. (Update: reader Martin astutely points out that this is sampling with replacement. To make this sampling without replacement, one simply needs to note whether or not your sample already pulled that random number and if so, choose a new random number. This can make the algorithm pretty expensive if the sample size is very close to N though).
So, let me make the problem harder. You don't know N (the size of the stream) in advance and you can't index directly into it. You can count it, but that requires making 2 passes of the data. You can do better. There are some heuristics you might try: guess the length and hope to undershoot. It will either not work in one pass or not be evenly distributed.
Simple Solution
A relatively easy and correct solution is to assign a random number to every element as you see them in the stream, and then always keep the top 1,000 numbered elements at all times. This is similar to how mysql does "ORDER BY RAND()" calls.
Classic Reservoir Sampling
Another option is reservoir sampling. In the example I've given you, the algorithm above is a little simpler, but reservoir sampling can be extended in other ways which I will get to, but first the basic algorithm:
First, you want to make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. That way if you have exactly 1,000 elements, the algorithm works. This is the base case.
Next, you want to process the i'th element (starting with i = 1,001) such that at the end of processing that step, the 1,000 elements in your reservoir are randomly sampled amongst the i elements you've seen so far. How can you do this? Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001 you should take element 1,001. In other words, choose to add element 1,001 to your reservoir with probability 1,000/1,001. If you choose to add it (which you likely will), then replace any element in the reservoir chosen randomly. I've shown that this produces a 1,000/1,001 chance of selecting the 1,001'th element, but what about the 2nd element in the list? The 2nd element is definitely in the reservoir at step 1,000 and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - that or 1,000/1,001.
This can be extended for the i'th round - keep the i'th element with probability 1,000/i and if you choose to keep it, replace a random element from the reservoir. It is pretty easy to prove that this works for all values of i using induction. It obviously works for the i'th element based on the way the algorithm selects the i'th element with the correct probability outright. The probability any element before this step being in the reservoir is 1,000/(i-1). The probability that they are removed is 1,000/i * 1/1,000 = 1/i. The probability that each element sticks around given that they are already in the reservoir is (i-1)/i and thus the elements' overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
Weighted Reservoir Sampling Variation
Now take the same problem above but add an extra challenge: How would you sample from a weighted distribution where each element is given a weight? This is sorta tricky. The above algorithm can be found on the web easily, but I don't know where I can find a weighted reservoir sampling version, so I made one up. I'm 100% sure someone has figured this out before me though, I don't take credit.
Start the same way by filling in the first 1,000 elements of the reservoir except keep a sum of all of the weights seen so far. Call this seen_weight. Also, make sure to store the weights of the individual elements stored in the reservoir. Call these weights a different array named weight[]. Lastly, to keep things efficient, store a variable for the total weight of the elements in the reservoir called reservoir_weight. On step i (where i > 1,000), generate a random number |R| between 0 and 1 and use this pseudocode:
R = randon_number_between(0, 1);
prob_sum = 0;
for(x = 0; x < reservoir_size; ++x) {
prob_sum += (1-(weight[x] *
seen_weight /
reservoir_weight /
(seen_weight + input_weight)
)))/reservoir_size;
if (prob_sum > R) {
// replace
reservoir[x] = input_element;
reservoir_weight += input_weight - weight[x];
weight[x] = input_weight;
break;
}
}
seen_weight += input_weight;
The only magic here is: (1-(weight[x] * seen_weight / reservoir_weight / (seen_weight + input_weight))))/reservoir
Distributed Reservoir Sampling Variation
This is the problem that got me looking at the weighted sample above. In both of the above algorithms, I can process the stream in O(N) time where N is length of the stream, in other words: in a single pass. If I want to break break up the problem on say 10 machines and solve it close to 10 times faster, how can I do that?
The answer is to have each of the 10 machines take roughly 1/10th of the input to process and generate their own reservoir sample from their subset of the data. Then, a final process must take the 10 output reservoirs and merge them with another sample. The trick is that the final process must take into account the reservoir weights of each of the input reservoirs. Basically, the final process should reservoir sample every element in the 10 input reservoirs but assign each element a weight of weight[x] * seen_weight / reservoir_weight where weight[x] is the element's original weight, seen_weight is the total weight counted in generating that particular reservoir and reservoir_weight is the sum of the weights of the elements in this particular reservoir. Proving this works is a little harder, so I will frustrate you by leaving it as an exercise for the dear reader.
2007-10-08
Reservoir Sampling
Posted by
Greg
at
10:27 PM
Labels: algorithms
Subscribe to:
Post Comments (Atom)

13 comments:
Great post!
I used a similar technique in a paper of mine years ago. We needed a way of converting a stream of examples to a manageable batch and hit upon the same random replacement idea.
I didn't realise it was called reservoir sampling and didn't spend any time doing a proper analysis. Good to see someone has though!
Interesting article. It's hard to find good blogs on practical algorithms. Thanks!
i don't really see how pseudocode of weighted case works. you compare x, the index in reservoir, to some ratio that is always less than 1. so, x will always be 0.
additionally, variable prob_sum is used, but never defined or assigned.
sorry, it doesn't look like this code was checked thoroughly
To reduce the number of steps in the uniform case (1000 out of i), it could be easier to
- draw a random number R from the interval [1,i] (i > 1000, and assuming we index from 1),
- if R <= 1000, replace element R from your current sample with the new element
Great post, I didn't know this had a name, or about the weighted version.
For the "batch" version, you say:
The right answer is generating random integers between 0 and N-1, then retrieving the elements at those indices and you have your answer.
That's sampling with replacement; the reservoir sampling algorithm is without replacement. So they're not quite equivalent, but in the case you talked about where the dataset is much bigger than the sample size, they're pretty close.
Another tiny (yet important) extension is given by Knuth that if you can't put all 1000 samples in memory, you can store the sample in external file as reservior and store the index of them in the memory. Finally sort the index and retrive the samples from reservior. This result is given at TAoCP, Vol 2, page 144, algorithm R. This result can be extended to the distributed sampling you mentioned above. I met this problem as an interview problem at Google.
In practice, one problem with reservoir sampling is that the cost of generating a random number for each element in the stream may greatly outweigh the cost of counting the length of the stream.
Interesting post. Your weighted selection method reminds me of "remainder stochastic sampling" for weighted selection when using genetic algorithms.
Regarding the distributed solution: "reservoir_weight is the sum of the weights of the elements in the reservoir"
Is that the sum of the weights for that particular reservoir or for all of them in toto? The phrasing is a bit imprecise since you used "that particular reservoir" previous to this. Good article, but you might want to tighten that little bit of the text if you aren't going to offer pseudocode :)
You don't actually need 1000 elements (for the basic version of the algorithm).
You can always use just one current element, and in each step, keep it with probability N/N+1, or replace it with the next element from your scan, with the probability 1/N+1 (where N == number of already scanned elements). Using induction it's easy to show it gives correct probabilities for each step of the way (assuming the random number generator you use is good).
I don't know if this variation is also called Reservoir Sampling or can be considered another algorithm, but looks conceptially the same as the one you described (I figured it out when trying to efficiently get a random quote from IRC logs, so I don't know if it has a name).
A friend of mine just pointed me at http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V0F-4HPK8WS-2&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=49898584b0eb78d9fabc012e11e1963f as another good reference on the topic.
It's not necessary to flip a coin to decide whether to keep each element. Rather, you can generate a random number of elements to skip.
Example: say you're sampling 1000 elements out of a billion. Towards the end of the stream, you're sampling with probability 1/1,000,000. Rather than generating a million random numbers to decide whether to keep each element, you can decide which of the next million elements to keep, saving yourself a million random number generations. Getting the distribution right is a bit tricky, but it can be done.
See http://www.cs.umd.edu/~samir/498/vitter.pdf for details.
Dan, nice observation. You can definitely do that if you know that you have at least X elements remaining in the stream.
If you know how many elements are in the entire set in advance, you can just select random indexes into the set and go from there though.
I'm not convinced you can use your trick if you don't know how many elements are remaining. I could be wrong.
Post a Comment