<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/'><id>tag:blogger.com,1999:blog-35584545.post876548025762569322..comments</id><updated>2011-05-05T00:19:21.538-07:00</updated><category term='environment'/><category term='bay area'/><category term='algorithms'/><category term='software'/><category term='outdoors'/><category term='google'/><title type='text'>Comments on Gregable.: Reservoir Sampling</title><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://gregable.com/feeds/876548025762569322/comments/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html'/><author><name>Greg</name><uri>http://www.blogger.com/profile/06692328337754346540</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>12</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-35584545.post-2619395528638515541</id><published>2011-05-04T14:26:08.868-07:00</published><updated>2011-05-04T14:26:08.868-07:00</updated><title type='text'>Hi Mark,
I was wondering, how the approach be if y...</title><content type='html'>Hi Mark,&lt;br /&gt;I was wondering, how the approach be if your weight has floating point precisions?</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/2619395528638515541'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/2619395528638515541'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1304544368868#c2619395528638515541' title=''/><author><name>liminescence</name><uri>http://www.blogger.com/profile/03882714826080310847</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1300425396'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-2809665681094583924</id><published>2009-01-09T11:40:00.000-08:00</published><updated>2009-01-09T11:40:00.000-08:00</updated><title type='text'>Dan, nice observation.  You can definitely do that...</title><content type='html'>Dan, nice observation.  You can definitely do that if you know that you have at least X elements remaining in the stream.  &lt;BR/&gt;&lt;BR/&gt;If you know how many elements are in the entire set in advance, you can just select random indexes into the set and go from there though.&lt;BR/&gt;&lt;BR/&gt;I'm not convinced you can use your trick if you don't know how many elements are remaining.  I could be wrong.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/2809665681094583924'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/2809665681094583924'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1231530000000#c2809665681094583924' title=''/><author><name>Greg</name><uri>http://www.blogger.com/profile/06692328337754346540</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1752602699'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-6981003985180437149</id><published>2009-01-08T09:00:00.000-08:00</published><updated>2009-01-08T09:00:00.000-08:00</updated><title type='text'>It's not necessary to flip a coin to decide whethe...</title><content type='html'>It's not necessary to flip a coin to decide whether to keep each element. Rather, you can generate a random number of elements to skip.&lt;BR/&gt;&lt;BR/&gt;Example: say you're sampling 1000 elements out of a billion. Towards the end of the stream, you're sampling with probability 1/1,000,000. Rather than generating a million random numbers to decide whether to keep each element, you can decide which of the next million elements to keep, saving yourself a million random number generations. Getting the distribution right is a bit tricky, but it can be done.&lt;BR/&gt;&lt;BR/&gt;See http://www.cs.umd.edu/~samir/498/vitter.pdf for details.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/6981003985180437149'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/6981003985180437149'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1231434000000#c6981003985180437149' title=''/><author><name>danvk</name><uri>http://www.blogger.com/profile/00276432906221497571</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1893197412'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-1639160474702885288</id><published>2008-11-11T13:27:00.000-08:00</published><updated>2008-11-11T13:27:00.000-08:00</updated><title type='text'>A friend of mine just pointed me at http://www.sci...</title><content type='html'>A friend of mine just pointed me at http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6V0F-4HPK8WS-2&amp;amp;_user=10&amp;amp;_rdoc=1&amp;amp;_fmt=&amp;amp;_orig=search&amp;amp;_sort=d&amp;amp;view=c&amp;amp;_acct=C000050221&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=10&amp;amp;md5=49898584b0eb78d9fabc012e11e1963f as another good reference on the topic.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/1639160474702885288'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/1639160474702885288'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1226438820000#c1639160474702885288' title=''/><author><name>Greg</name><uri>http://www.blogger.com/profile/06692328337754346540</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1752602699'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-8109549103072060703</id><published>2008-06-25T23:34:00.000-07:00</published><updated>2008-06-25T23:34:00.000-07:00</updated><title type='text'>You don't actually need 1000 elements (for the bas...</title><content type='html'>You don't actually need 1000 elements (for the basic version of the algorithm).&lt;BR/&gt;You can always use just one current element, and in each step, keep it with probability N/N+1, or replace it with the next element from your scan, with the probability 1/N+1 (where N == number of already scanned elements). Using induction it's easy to show it gives correct probabilities for each step of the way (assuming the random number generator you use is good).&lt;BR/&gt;&lt;BR/&gt;I don't know if this variation is also called Reservoir Sampling or can be considered another algorithm, but looks conceptially the same as the one you described (I figured it out when trying to efficiently get a random quote from IRC logs, so I don't know if it has a name).</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/8109549103072060703'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/8109549103072060703'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214462040000#c8109549103072060703' title=''/><author><name>senko</name><uri>http://www.blogger.com/profile/02871690938716392627</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1707135330'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-4409834138730080935</id><published>2008-06-25T08:39:00.000-07:00</published><updated>2008-06-25T08:39:00.000-07:00</updated><title type='text'>Interesting post. Your weighted selection method r...</title><content type='html'>Interesting post. Your weighted selection method reminds me of "remainder stochastic sampling" for weighted selection when using genetic algorithms.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/4409834138730080935'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/4409834138730080935'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214408340000#c4409834138730080935' title=''/><author><name>Orthopteroid</name><uri>http://www.blogger.com/profile/16830974448481473911</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1262267556'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-6884053806084338075</id><published>2008-06-25T07:20:00.000-07:00</published><updated>2008-06-25T07:20:00.000-07:00</updated><title type='text'>In practice, one problem with reservoir sampling i...</title><content type='html'>In practice, one problem with reservoir sampling is that the cost of generating a random number for each element in the stream may greatly outweigh the cost of counting the length of the stream.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/6884053806084338075'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/6884053806084338075'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214403600000#c6884053806084338075' title=''/><author><name>Scott Turner</name><uri>http://www.blogger.com/profile/03393071448515738228</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1712799472'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-4340819709324692409</id><published>2008-06-25T06:37:00.000-07:00</published><updated>2008-06-25T06:37:00.000-07:00</updated><title type='text'>Another tiny (yet important) extension is given by...</title><content type='html'>Another tiny (yet important) extension is given by Knuth that if you can't put all 1000 samples in memory, you can store the sample in external file as reservior and store the index of them in the memory. Finally sort the index and retrive the samples from reservior. This result is given at TAoCP, Vol 2, page 144, algorithm R. This result can be extended to the distributed sampling you mentioned above. I met this problem as an interview problem at Google.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/4340819709324692409'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/4340819709324692409'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214401020000#c4340819709324692409' title=''/><author><name>Eric</name><uri>http://www.blogger.com/profile/03739040742799095022</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='14' src='http://lh5.google.com/image/xu.mathena/RaFSYvNVheI/AAAAAAAAAc0/iVmxi7W8s2Q/web23.png'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1556904808'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-5375037524905307665</id><published>2008-06-25T06:17:00.000-07:00</published><updated>2008-06-25T06:17:00.000-07:00</updated><title type='text'>Great post, I didn't know this had a name, or abou...</title><content type='html'>Great post, I didn't know this had a name, or about the weighted version.&lt;BR/&gt;&lt;BR/&gt;For the "batch" version, you say:&lt;BR/&gt;&lt;BR/&gt;The right answer is generating random integers between 0 and N-1, then retrieving the elements at those indices and you have your answer.&lt;BR/&gt;&lt;BR/&gt;That's sampling with replacement; the reservoir sampling algorithm is without replacement.  So they're not quite equivalent, but in the case you talked about where the dataset is much bigger than the sample size, they're pretty close.</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/5375037524905307665'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/5375037524905307665'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214399820000#c5375037524905307665' title=''/><author><name>martin</name><uri>http://www.blogger.com/profile/12071484944219339765</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1056147649'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-3030670979101708335</id><published>2008-06-25T04:53:00.000-07:00</published><updated>2008-06-25T04:53:00.000-07:00</updated><title type='text'>To reduce the number of steps in the uniform case ...</title><content type='html'>To reduce the number of steps in the uniform case (1000 out of i), it could be easier to&lt;BR/&gt;&lt;BR/&gt;- draw a random number R from the interval [1,i] (i &gt; 1000, and assuming we index from 1),&lt;BR/&gt;&lt;BR/&gt;- if R &lt;= 1000, replace element R from your current sample with the new element</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/3030670979101708335'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/3030670979101708335'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214394780000#c3030670979101708335' title=''/><author><name>Michael</name><uri>http://www.blogger.com/profile/18206004840589797613</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-177016054'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-7244984410399275314</id><published>2008-06-25T03:55:00.000-07:00</published><updated>2008-06-25T03:55:00.000-07:00</updated><title type='text'>Interesting article. It's hard to find good blogs ...</title><content type='html'>Interesting article. It's hard to find good blogs on practical algorithms. Thanks!</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/7244984410399275314'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/7244984410399275314'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214391300000#c7244984410399275314' title=''/><author><name>George</name><uri>http://www.blogger.com/profile/09639830462050573931</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-1113569690'/></entry><entry><id>tag:blogger.com,1999:blog-35584545.post-2843924304998403555</id><published>2008-06-25T03:48:00.000-07:00</published><updated>2008-06-25T03:48:00.000-07:00</updated><title type='text'>Great post!&lt;br&gt;&lt;br&gt;I used a similar technique in a...</title><content type='html'>Great post!&lt;BR/&gt;&lt;BR/&gt;I used a similar technique in a &lt;A HREF="http://users.rsise.anu.edu.au/~mreid/files/pubs/ilp00.pdf" REL="nofollow"&gt;paper of mine&lt;/A&gt; years ago. We needed a way of converting a stream of examples to a manageable batch and hit upon the same random replacement idea.&lt;BR/&gt;&lt;BR/&gt;I didn't realise it was called reservoir sampling and didn't spend any time doing a proper analysis. Good to see someone has though!</content><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/2843924304998403555'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/35584545/876548025762569322/comments/default/2843924304998403555'/><link rel='alternate' type='text/html' href='http://gregable.com/2007/10/reservoir-sampling.html?showComment=1214390880000#c2843924304998403555' title=''/><author><name>Mark</name><uri>http://www.blogger.com/profile/08108637082632675639</uri><email>noreply@blogger.com</email><gd:image xmlns:gd='http://schemas.google.com/g/2005' rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:in-reply-to xmlns:thr='http://purl.org/syndication/thread/1.0' href='http://gregable.com/2007/10/reservoir-sampling.html' ref='tag:blogger.com,1999:blog-35584545.post-876548025762569322' source='http://www.blogger.com/feeds/35584545/posts/default/876548025762569322' type='text/html'/><gd:extendedProperty xmlns:gd='http://schemas.google.com/g/2005' name='blogger.itemClass' value='pid-319330662'/></entry></feed>
