tag:blogger.com,1999:blog-35584545.post876548025762569322..comments2014-09-16T13:10:23.203-07:00Comments on Gregable: Reservoir Sampling - Sampling from a stream of elementsGreg Grothaushttp://www.blogger.com/profile/06692328337754346540noreply@blogger.comBlogger23125tag:blogger.com,1999:blog-35584545.post-76993289867910838142014-03-16T18:16:37.170-07:002014-03-16T18:16:37.170-07:00http://www.cs.umd.edu/~samir/498/vitter.pdfhttp://www.cs.umd.edu/~samir/498/vitter.pdfMaverick(Pramit)http://www.blogger.com/profile/10052321846039747456noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-20517228322777082912014-03-16T18:15:03.793-07:002014-03-16T18:15:03.793-07:00Nice explanation. I enjoyed it.Nice explanation. I enjoyed it.Maverick(Pramit)http://www.blogger.com/profile/10052321846039747456noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-11667555753110729862013-05-03T08:29:17.806-07:002013-05-03T08:29:17.806-07:00We have a sub-linear algorithm for reservoir sampl...We have a sub-linear algorithm for reservoir sampling with replacement. See Byung-Hoon Park, George Ostrouchov, Nagiza F. Samatova, Sampling streaming data with replacement, Computational Statistics & Data Analysis, Volume 52, Issue 2, 15 October 2007, Pages 750-762, ISSN 0167-9473, 10.1016/j.csda.2007.03.010.<br />(http://www.sciencedirect.com/science/article/pii/S0167947307001089)<br /><br />We know the population size up to any point so to maintain a reservoir for any point we can compute skipping probabilities.George Ohttp://www.blogger.com/profile/05388390070877859430noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-78115773188567638892013-02-15T05:55:56.428-08:002013-02-15T05:55:56.428-08:00Thanks. This post gave me the thought to implement...Thanks. This post gave me the thought to implement my distributed sampling.<br /><br />Just wanted to say if one wants to do a weighted random sampling with replacement, he can concurrently run k instances of the reservoir sampling algorithm without replacement (but with reservoir size=1) over the data stream.<br /><br />However, it could be slow, since for each item it needs to decide for k times. The time complexity is thus O(kN), where k is the size of the sample and N is the length of the stream.<br /><br />So I would like to ask if it is possible to reduce this time complexity? <br />Though the paper "Weighted random sampling with a reservoir" proposes a good way called 'AExpJ', it still needs to decide k times for each item.Zheyi RONGhttp://www.blogger.com/profile/15784067882343518638noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-55926434962040512092013-02-07T12:52:55.091-08:002013-02-07T12:52:55.091-08:00Matthias, that's a good point. I didn't t...Matthias, that's a good point. I didn't think of using a heap.Greghttp://www.blogger.com/profile/06692328337754346540noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-31083410191683571962013-02-07T04:25:26.322-08:002013-02-07T04:25:26.322-08:00Greg, why do you think it is insertion sort? Just...Greg, why do you think it is insertion sort? Just keep your reservoir as a heap, and make it equivalent to heap sort.Matthias GĂ¶rgenshttp://www.blogger.com/profile/13665641102508209349noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-16278214632564715432013-01-22T12:51:10.018-08:002013-01-22T12:51:10.018-08:00Just wanted to say thanks. This blog post pointed...Just wanted to say thanks. This blog post pointed me in the right direction for integrating weighted reservoir sampling into my Clojure sampling library. For any other Clojure users out there:<br /><a href="https://github.com/bigmlcom/sampling" rel="nofollow">https://github.com/bigmlcom/sampling</a>Adam Ashenfelterhttp://www.blogger.com/profile/17499355093104727120noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-92087459197688519732012-06-15T20:06:29.347-07:002012-06-15T20:06:29.347-07:00Miguel, I'm not sure what a "weighted shu...Miguel, I'm not sure what a "weighted shuffle" would be, but yeah it would probably not be the best way to do a random ordering. It's insertion sort of N items which is O(N^2). At the very least just assign random numbers to your entire set and run mergesort or such to get O(NlgN), though there is probably a cheaper way to randomly reorder elements if I thought about it.Greghttp://www.blogger.com/profile/06692328337754346540noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-30385164723885882162012-05-30T04:03:17.323-07:002012-05-30T04:03:17.323-07:00I was helping a friend solve a shuffling problem, ...I was helping a friend solve a shuffling problem, and immediately remembered this post. Am I correct in thinking that:<br /><br />1. Assuming it all fits in memory, we can make some variant of this that allows us to shuffle the items (rather than just sample them) as they come in (possibly by starting with a reservoir of size 1, and increasing the reservoir size as elements come in?)<br /><br />2. This would actually be a terrible way to do a weighted random <b>shuffle</b>?Miguelhttp://www.blogger.com/profile/15936727617025003142noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-62776137375069826212012-03-19T22:12:00.952-07:002012-03-19T22:12:00.952-07:00Greg, as DanVK pointed out, the paper describes an...Greg, as DanVK pointed out, the paper describes an algorithm (well, 3 in fact) that DOES work when you don't know the length of the stream. You "simply" calculate how many elements to skip, then skip them. If you reach the end, you're done... I encourage you to read it!<br /><br />But of course, I haven't verified the math behind the paper...Tommyhttp://www.blogger.com/profile/04665101046726236164noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-60772810632684928502012-03-10T00:31:55.587-08:002012-03-10T00:31:55.587-08:00Came here from a Stackexchange theoretical compute...Came here from a Stackexchange theoretical computer science Q&A. Great explanation, Greg, far better than Wikipedia entry on the topic. I am also curious whether your offer of helping on Google application still stand.edwardwhttp://www.blogger.com/profile/05063629293336351610noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-26193955286385155412011-05-04T14:26:08.868-07:002011-05-04T14:26:08.868-07:00Hi Mark,
I was wondering, how the approach be if y...Hi Mark,<br />I was wondering, how the approach be if your weight has floating point precisions?liminescencehttp://www.blogger.com/profile/03882714826080310847noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-28096656810945839242009-01-09T11:40:00.000-08:002009-01-09T11:40:00.000-08:00Dan, nice observation. You can definitely do that...Dan, nice observation. You can definitely do that if you know that you have at least X elements remaining in the stream. <BR/><BR/>If you know how many elements are in the entire set in advance, you can just select random indexes into the set and go from there though.<BR/><BR/>I'm not convinced you can use your trick if you don't know how many elements are remaining. I could be wrong.Greghttp://www.blogger.com/profile/06692328337754346540noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-69810039851804371492009-01-08T09:00:00.000-08:002009-01-08T09:00:00.000-08:00It's not necessary to flip a coin to decide whethe...It's not necessary to flip a coin to decide whether to keep each element. Rather, you can generate a random number of elements to skip.<BR/><BR/>Example: say you're sampling 1000 elements out of a billion. Towards the end of the stream, you're sampling with probability 1/1,000,000. Rather than generating a million random numbers to decide whether to keep each element, you can decide which of the next million elements to keep, saving yourself a million random number generations. Getting the distribution right is a bit tricky, but it can be done.<BR/><BR/>See http://www.cs.umd.edu/~samir/498/vitter.pdf for details.danvkhttp://www.blogger.com/profile/00276432906221497571noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-16391604747028852882008-11-11T13:27:00.000-08:002008-11-11T13:27:00.000-08:00A friend of mine just pointed me at http://www.sci...A friend of mine just pointed me at http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V0F-4HPK8WS-2&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=49898584b0eb78d9fabc012e11e1963f as another good reference on the topic.Greghttp://www.blogger.com/profile/06692328337754346540noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-81095491030720607032008-06-25T23:34:00.000-07:002008-06-25T23:34:00.000-07:00You don't actually need 1000 elements (for the bas...You don't actually need 1000 elements (for the basic version of the algorithm).<BR/>You can always use just one current element, and in each step, keep it with probability N/N+1, or replace it with the next element from your scan, with the probability 1/N+1 (where N == number of already scanned elements). Using induction it's easy to show it gives correct probabilities for each step of the way (assuming the random number generator you use is good).<BR/><BR/>I don't know if this variation is also called Reservoir Sampling or can be considered another algorithm, but looks conceptially the same as the one you described (I figured it out when trying to efficiently get a random quote from IRC logs, so I don't know if it has a name).senkohttp://www.blogger.com/profile/02871690938716392627noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-44098341387300809352008-06-25T08:39:00.000-07:002008-06-25T08:39:00.000-07:00Interesting post. Your weighted selection method r...Interesting post. Your weighted selection method reminds me of "remainder stochastic sampling" for weighted selection when using genetic algorithms.Orthopteroidhttp://www.blogger.com/profile/16830974448481473911noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-68840538060843380752008-06-25T07:20:00.000-07:002008-06-25T07:20:00.000-07:00In practice, one problem with reservoir sampling i...In practice, one problem with reservoir sampling is that the cost of generating a random number for each element in the stream may greatly outweigh the cost of counting the length of the stream.Scott Turnerhttp://www.blogger.com/profile/03393071448515738228noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-43408197093246924092008-06-25T06:37:00.000-07:002008-06-25T06:37:00.000-07:00Another tiny (yet important) extension is given by...Another tiny (yet important) extension is given by Knuth that if you can't put all 1000 samples in memory, you can store the sample in external file as reservior and store the index of them in the memory. Finally sort the index and retrive the samples from reservior. This result is given at TAoCP, Vol 2, page 144, algorithm R. This result can be extended to the distributed sampling you mentioned above. I met this problem as an interview problem at Google.Erichttp://www.blogger.com/profile/03739040742799095022noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-53750375249053076652008-06-25T06:17:00.000-07:002008-06-25T06:17:00.000-07:00Great post, I didn't know this had a name, or abou...Great post, I didn't know this had a name, or about the weighted version.<BR/><BR/>For the "batch" version, you say:<BR/><BR/>The right answer is generating random integers between 0 and N-1, then retrieving the elements at those indices and you have your answer.<BR/><BR/>That's sampling with replacement; the reservoir sampling algorithm is without replacement. So they're not quite equivalent, but in the case you talked about where the dataset is much bigger than the sample size, they're pretty close.martinhttp://www.blogger.com/profile/12071484944219339765noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-30306709791017083352008-06-25T04:53:00.000-07:002008-06-25T04:53:00.000-07:00To reduce the number of steps in the uniform case ...To reduce the number of steps in the uniform case (1000 out of i), it could be easier to<BR/><BR/>- draw a random number R from the interval [1,i] (i > 1000, and assuming we index from 1),<BR/><BR/>- if R <= 1000, replace element R from your current sample with the new elementMichaelhttp://www.blogger.com/profile/18206004840589797613noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-72449844103992753142008-06-25T03:55:00.000-07:002008-06-25T03:55:00.000-07:00Interesting article. It's hard to find good blogs ...Interesting article. It's hard to find good blogs on practical algorithms. Thanks!Georgehttp://www.blogger.com/profile/09639830462050573931noreply@blogger.comtag:blogger.com,1999:blog-35584545.post-28439243049984035552008-06-25T03:48:00.000-07:002008-06-25T03:48:00.000-07:00Great post!I used a similar technique in a paper o...Great post!<BR/><BR/>I used a similar technique in a <A HREF="http://users.rsise.anu.edu.au/~mreid/files/pubs/ilp00.pdf" REL="nofollow">paper of mine</A> years ago. We needed a way of converting a stream of examples to a manageable batch and hit upon the same random replacement idea.<BR/><BR/>I didn't realise it was called reservoir sampling and didn't spend any time doing a proper analysis. Good to see someone has though!Markhttp://www.blogger.com/profile/08108637082632675639noreply@blogger.com