Oct 29, 2007

Sentinel Dome, Jeffrey Pine in Yosemite

Yosemite Valley in the Fall

Last night I returned from my first visit to Yosemite. I went up with Cristin and a group of Googlers, ate dinner in the dark atop Glacier Point, spent the night in a tent on the valley floor, and then hiked Sentinel Dome. Cristin took this amazing photo (and several others) from the valley of the mountain formation that I think was called "Three Brothers". The entire trip was memorable, but I'll write a little more detail about Sentinel Dome, as it is a little less of a "been there, done that" story.
Jeffrey Pine on Sentinel Dome in Yosemite

Yosemite's Sentinel Dome is the second highest point in the main Yosemite Valley (Update: I had said the park, but as an astute reader pointed out, this is untrue), after Half Dome. While Half Dome gets hundreds of visitors daily, we hiked the far easier trail to Sentinel Dome and saw maybe a half dozen people on the trail total. It is definitely not a popular spot, despite having a 360-degree view of the valley and being a short 2-mile (one-way) hike from the parking lot. It is considered by some to be the best place in the continental US to see the night stars. It is also from here that Ansel Adams took one of his most famous photographs, of the Sentinel Dome Jeffrey Pine. The tree has since become one of the most photographed in the country.

Jeffrey Pine on Sentinel Dome in Yosemite

The left image is Ansel Adams' photo; the right image was taken in 2002. In the intervening years, there was a severe drought. Visitors to Sentinel Dome hauled buckets of water up to the dome to water the tree, to no avail. The right picture is the tree after the drought. Some time in August 2003, the tree fell over in the wind. The image above of Pedro and me shows what it looks like today: really just a log. Sad; in fact, there is some pocketknife carving on the side you can't see in the image. Still, it is quite a sight atop the bare granite that is Sentinel Dome.

I have to wonder why this trail isn't more popular. It seems like one of Yosemite's rare secrets.

Oct 8, 2007

Standing an Egg on Equinox

Continuing a theme from the last post of playing MythBusters using science. When I was a little kid, my schoolteachers would tell me about the equinox: one of the two days of the year when day and night are the same length, and also one of the two days when the Earth's axis is perpendicular to the line between the Earth and the sun. To demonstrate how special this is, they would wave their hands about the sun's effect on gravity and tell me that on this day, and this day alone, I could stand a raw egg on its end!

Even better, as a good science experiment should be, the teacher would either demonstrate with their own egg or hand me one and have me try. After a lot of patience, it works. The egg stands on its end. Science is proven: the egg experiment shows that the sun is doing something hand-wavy with gravity.

Teacher, I'm sorry, but you missed a step. If you happen to try to stand on one foot on the equinox and succeed, that doesn't necessarily mean the equinox is the only day of the year you can stand on one foot. Turns out that you can stand an egg on its end any friggin time. Really. Try it NOW. Don't trust me; do your own experiment. The trick is that you need a little patience. On non-equinox days, you don't believe it can be done, so you don't try very hard. On equinox days, you will fidget with that egg for god knows how long until it stands, dammit! There are things called blind studies that would fix this confounding variable in your experiment, if you are still not convinced.

Reservoir Sampling - Sampling from a stream of elements

If you don't find programming algorithms interesting, stop reading. This post is not for you.

On the other hand, if you do find algorithms interesting, in addition to this post, you might also want to read my other posts with the algorithms tag.

Problem Statement

Reservoir Sampling is an algorithm for sampling elements from a stream of data. Imagine you are given a really large stream of data elements (queries on google searches in May, products bought at Walmart during the Christmas season, names in a phone book, whatever). Your goal is to efficiently return a random sample of 1,000 elements evenly distributed from the original stream. How would you do it?

If you knew N (the size of the stream) and could index directly into it, the right answer would be to generate 1,000 random integers between 0 and N-1, then retrieve the elements at those indices, and you'd have your answer. (Update: reader Martin astutely points out that this is sampling with replacement. To make this sampling without replacement, one simply needs to note whether your sample has already pulled that random number and, if so, choose a new random number. This can make the algorithm pretty expensive if the sample size is very close to N, though.)
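As a sketch of this known-N case (assuming the data fits in an indexable list; the function names here are my own, for illustration):

```python
import random

def sample_with_replacement(data, k):
    # Draw k random indices independently; the same element can appear twice.
    return [data[random.randrange(len(data))] for _ in range(k)]

def sample_without_replacement(data, k):
    # Re-draw whenever we hit an index we've already used.
    seen = set()
    sample = []
    while len(sample) < k:
        i = random.randrange(len(data))
        if i not in seen:
            seen.add(i)
            sample.append(data[i])
    return sample
```

As the update notes, the re-draw loop gets expensive when k approaches N, because most new draws collide with already-chosen indices.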

So, let me make the problem harder: you don't know N (the size of the stream) in advance and you can't index directly into it. You could count it, but that requires making two passes over the data. You can do better. There are heuristics you might try, such as guessing the length and hoping to undershoot, but any such approach will either fail to work in one pass or will not produce an evenly distributed sample.

Simple Solution

A relatively easy and correct solution is to assign a random number to every element as you see it in the stream, and then always keep the 1,000 elements with the highest assigned numbers. This is similar to how MySQL handles "ORDER BY RAND()" queries. This strategy works well, and the only additional storage required is the randomly generated number for each element in the sample.
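A minimal sketch of this approach in Python (a min-heap of size k keeps the top random keys without sorting the whole stream; the function name is my own):

```python
import heapq
import random

def simple_stream_sample(stream, k):
    # Keep the k elements with the largest random keys.
    # heap[0] always holds the smallest key currently kept.
    heap = []
    for element in stream:
        key = random.random()
        if len(heap) < k:
            heapq.heappush(heap, (key, element))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, element))
    return [element for _, element in heap]
```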

Reservoir Sampling

Another, more complex option is reservoir sampling. First, make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements of your stream. That way, if the stream has exactly 1,000 elements, the algorithm works. This is the base case.

Next, you want to process the i'th element (starting with i = 1,001) such that at the end of that step, the 1,000 elements in your reservoir are randomly sampled from the i elements you've seen so far. How can you do this? Start with i = 1,001. With what probability after the 1,001'th step should element 1,001 (or any element, for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001. So, generate a random number between 0 and 1, and if it is less than 1,000/1,001, take element 1,001. In other words, add element 1,001 to your reservoir with probability 1,000/1,001. If you choose to add it (which you likely will), replace an element in the reservoir chosen uniformly at random. I've shown that this produces a 1,000/1,001 chance of selecting the 1,001'th element, but what about the 2nd element in the list? The 2nd element is definitely in the reservoir at step 1,000, and the probability of it getting removed is the probability of element 1,001 getting selected multiplied by the probability of #2 getting randomly chosen as the replacement candidate. That probability is 1,000/1,001 * 1/1,000 = 1/1,001. So, the probability that #2 survives this round is 1 - 1/1,001 = 1,000/1,001.

This extends to the i'th round: keep the i'th element with probability 1,000/i and, if you choose to keep it, replace a random element from the reservoir. It is pretty easy to prove that this works for all values of i using induction. It obviously works for the i'th element itself, since the algorithm selects it with probability 1,000/i outright. By the inductive hypothesis, the probability that any earlier element is in the reservoir before this step is 1,000/(i-1). The probability that such an element is removed at this step is 1,000/i * 1/1,000 = 1/i, so the probability that an element already in the reservoir survives the step is (i-1)/i. Thus each earlier element's overall probability of being in the reservoir after i rounds is 1,000/(i-1) * (i-1)/i = 1,000/i.
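Putting the base case and the inductive step together gives a short single-pass implementation (a sketch, with k playing the role of 1,000 above; the function name is my own):

```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for i, element in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(element)  # base case: take the first k outright
        elif random.random() < k / i:
            # Keep element i with probability k/i, evicting a random resident.
            reservoir[random.randrange(k)] = element
    return reservoir
```

Note that if the stream has fewer than k elements, the function simply returns all of them.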

This ends up a little more complex, but it works just the same way as the randomly assigned numbers approach above.

Weighted Reservoir Sampling Variation
Now take the same problem above but add an extra challenge: how would you sample from a weighted distribution, where each element in the stream has a weight and should be selected with probability proportional to that weight? This is sorta tricky. Pavlos S. Efraimidis and Paul G. Spirakis figured out the solution in 2005 in a paper titled Weighted Random Sampling with a Reservoir. It works similarly to the assign-a-random-number solution above.

As you process the stream, assign each item a "key". For item i in the stream, let k_i be the item's key, let w_i be the weight of that item, and let u_i be a random number between 0 and 1. The key is the random number taken to the w_i'th root, where w_i is the weight of that item in the stream: k_i = u_i^(1/w_i). Now, simply keep the top n elements ordered by their keys, where n is the size of your sample. Easy.

To see how this works, let's start with non-weighted elements (i.e., weight = 1). The exponent 1/w_i is always 1, so the key is simply the random number u_i itself, and this algorithm degrades into the simple algorithm mentioned above.

Now, how does it work with weights? The probability of choosing item i over item j is the probability that k_i > k_j. A key k_i can take any value from 0 to 1, but it is more likely to be close to 1 the higher w_i is. We can see what this looks like by integrating the key over all values of the random number from 0 to 1, giving its expected value:

E[k] = integral from 0 to 1 of u^(1/w) du = w / (w + 1)

If w = 1, E[k] = 1/2. If w = 9, E[k] = 9/10. And indeed, when competing against a weight-1 element, an item of weight 1 has a 50% chance of winning and an item of weight 9 has a 90% chance: the probability that a weight-w key beats a weight-1 key works out to the same w/(w + 1). Similar math works for two elements of arbitrary non-zero weight.
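A sketch of the weighted variation, reusing the same top-k-by-key structure as the simple solution (this assumes the stream yields (element, weight) pairs with positive weights; the function name is my own):

```python
import heapq
import random

def weighted_reservoir_sample(stream, k):
    # Key each item with u**(1/w) and keep the k largest keys.
    heap = []
    for element, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, element))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, element))
    return [element for _, element in heap]
```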

Distributed Reservoir Sampling Variation
This is the problem that got me researching the weighted variation above. In both of the above algorithms, I can process the stream in O(N) time, where N is the length of the stream; in other words, in a single pass. If I want to break the problem up across, say, 10 machines and solve it close to 10 times faster, how can I do that?

The answer is to have each of the 10 machines take roughly 1/10th of the input to process and generate their own reservoir sample from their subset of the data using the weighted variation above. Then, a final process must take the 10 output reservoirs and merge them.

The trick is that the final process must reuse the original "keys" computed in the first pass. For example, if one of your 10 machines processed only 10 items for a size-10 sample, while the other 9 machines each processed 1 million items, you would expect the machine with 10 items to have smaller keys and hence be less likely to be selected in the final output. If you recompute keys in the final process, then all of the input items would be treated equally when they shouldn't be.
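A sketch of the two phases (assuming each machine can return its (key, element) pairs rather than bare elements; the function names are my own):

```python
import heapq
import random

def local_sample(stream, k):
    # Per-machine pass: keep the k largest (key, element) pairs.
    heap = []
    for element, weight in stream:
        key = random.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, element))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, element))
    return heap

def merge_samples(partial_samples, k):
    # Final pass: rank by the ORIGINAL keys; never regenerate them here.
    combined = [pair for partial in partial_samples for pair in partial]
    return [element for _, element in heapq.nlargest(k, combined)]
```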

Birds of a Feather
If you are one of the handful of people interested in Reservoir Sampling and advanced software algorithms like this, you are the type of person I'd like to see working with me at Google. If you send me your resume (
ggrothau@gmail.com), I can make sure it gets in front of the right recruiters and watch to make sure that it doesn't get lost in the pile that we get every day.  
Update: Despite the fact that this post was published in 2007, this offer still stands today.

Oct 6, 2007

Path of Technology

I was thinking a bit today about technology. A lot of people tend to have the notion that in various tech industries, technology follows something like Moore's law (the observation that the speed of computers doubles roughly every 18 months), getting exponentially better over time. All that is needed is some quantity of "research", whatever that is.

There are historically other paths that science and technology have taken in the past in different industries, and to ignore these alternate possibilities seems rash:
  1. Chaos: The study of weather in the 60s(?) promised to allow long-term weather forecasting so that we could all schedule our picnic outdoors 3 months in advance and not worry about rain. Then chaos theory showed us why this is more or less an impossible thing to predict in the long term.
  2. Large Capital Cost: The Apollo missions ended with the Cold War. Things may eventually change, but space exploration, especially manned, has reverted. We haven't been to the moon in decades.
  3. Physical/Social Danger: Nuclear energy promised to give us unlimited cheap, clean energy forever. Instead, no new nuclear power plants have been built in the US in about 15 years. Chernobyl, concerns about Yucca Mountain, and public dissent killed off the nuclear industry.
Any one of these possible scenarios could still play out in today's technologies. Bioinformatics and drug design could end up being a chaotic space in the limit. Creating an alternative-fuel vehicle requires large capital outlays to ensure both a good vehicle and sufficient places to refuel it. A few high-profile privacy violations could ultimately have chilling effects on social networking and the internet.

Who knows what the future will bring.

Oct 2, 2007

Santa Cruz Monarch Butterflies

Starting now, the Santa Cruz monarch butterflies have started to land in Natural Bridges State Park. Every year, monarch butterflies swarm a really, really small area of land during their migration; the area they land in is probably the size of the infield of a baseball field. The totally amazing part is that these butterflies have never been here before. Their parents' parents have, but since the lifespan of a butterfly is so short, their annual migration spans several generations. Nobody knows how they find their way to the exact same spot every year, but they do, and it is amazing to see.

The butterflies start to arrive in October, but really start clustering in late October (a few weeks away) or November. They leave in January. It's worth a trip if you are in the Bay Area. I was just down there on Sunday.