May 16, 2008

Google AppEngine Critical Flaw

Punchline: I feel like there is a critical flaw in Google App Engine. But first, the requisite fine print: I work for Google, the opinions here are my own, not Google's, and I don't work on App Engine.

I feel like there is a critical flaw in Google App Engine. I work with the Google technologies App Engine is based upon, namely massive datacenters and Bigtable. The third peg of this at Google is MapReduce. App Engine has no MapReduce equivalent. Let me explain by starting with the limitations of App Engine:

From http://code.google.com/appengine/docs/python/sandbox.html


An App Engine application cannot:
  • spawn a sub-process or thread. A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.



So, the application code must be short-lived and stateless. This is actually a great way to develop a front end. Now, what does the back end support?

From http://code.google.com/appengine/docs/datastore/entitiesandmodels.html:


A data object in the App Engine datastore is known as an entity. An entity has one or more properties, named values of one of several supported data types.

Each entity also has a key that uniquely identifies the entity. The simplest key has a kind and a unique numeric ID provided by the datastore. The ID can also be a string provided by the application. For more information on keys, see Keys and Entity Groups.

An application can fetch an entity from the datastore by using its key, or by performing a query that matches the entity's properties.


I want to create a simple version of Google Analytics using App Engine. I would set up my front end code to get a request from a web page that a user has viewed. I would store each request as an entity associated with the website/account.

The most obvious feature is reporting how many hits a site got during some time span. This is easy: I just perform a query using filters to select the entities from account X in date range Y-Z. I then use the query to count the results. There is a problem though.

From http://code.google.com/appengine/docs/datastore/queryclass.html:


count() is somewhat faster than retrieving all of the data by a constant factor, but the running time still grows with the size of the result set. It's best to only use count() in cases where the count is expected to be small, or specify a limit.


Reading between the lines, I'm now running an O(N) query on what could be a large N. This doesn't scale. HighScalability.com noticed this and wrote the correct workaround:

Instead of calculating the results at query time, calculate them when you are adding the records. This means that displaying the results is just a lookup, and that the calculation costs are amortized over each record addition.


So we create an entity to store every request and another entity to store the count of requests per day for a web site. And we launch. Everything works fine and scales effortlessly.
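Here is a minimal sketch of the difference between counting at query time and counting at write time. It is plain Python, not App Engine code: the list and dictionary are in-memory stand-ins for the two kinds of entities, and every name in it (`record_request`, `count_by_scanning`, `hits_in_range`) is made up for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical in-memory stand-ins for the two entity kinds.
requests = []                    # one record per page view
daily_counts = defaultdict(int)  # (account, day) -> number of hits


def record_request(account, day, browser):
    """Store the raw request and bump the per-day counter at write time."""
    requests.append({"account": account, "day": day, "browser": browser})
    daily_counts[(account, day)] += 1


def count_by_scanning(account, start, end):
    """Query-time counting: work grows with the number of stored requests."""
    return sum(
        1
        for r in requests
        if r["account"] == account and start <= r["day"] <= end
    )


def hits_in_range(account, start, end):
    """Write-time counting: one counter lookup per day in the range."""
    total = 0
    day = start
    while day <= end:
        total += daily_counts.get((account, day), 0)
        day += timedelta(days=1)
    return total
```

Both functions return the same number, but the second one only touches one counter per day in the range, no matter how many requests were recorded.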

Now, I want to add a new feature: Show the user how many requests came from each browser type. I can store these counts in their own entities the same way as the counts per day, but I haven't already been storing them. Even though the data is in my request entities, I'll never be able to retrieve it and count it: I'm limited to a few seconds of processing in my front end code. I'm stuck.

This is where MapReduce would normally come in. I would write some code to process old request entities and build up browser count per day entities. Then, going forward I would collect this continuously. Alternatively, with a competitor like Amazon EC2, I would spin up a few extra machine instances to do the extra processing once, and then release them back into the cloud pool. Nothing like this exists with App Engine.
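To make the backfill concrete, here is a rough single-machine sketch of the map and reduce steps I have in mind. It is not Google's MapReduce API; the function names and the dictionary layout of a request are assumptions for illustration. A real job would run the same two steps spread across many machines.

```python
from collections import defaultdict


def map_request(request):
    """Map step: emit one ((account, day, browser), 1) pair per stored request."""
    yield (request["account"], request["day"], request["browser"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def backfill_browser_counts(stored_requests):
    """Run map over every old request, then reduce into per-browser-per-day counts."""
    pairs = (
        pair
        for request in stored_requests
        for pair in map_request(request)
    )
    return reduce_counts(pairs)
```

The result is one count per (account, day, browser) key, which is exactly what the new browser count entities would hold. The hard part on App Engine is not this logic but finding somewhere to run it over all the old data.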

There is still plenty that could be created with App Engine. Your standard CRUD (Create, Read, Update, Delete) application can generally be created with much more ease than trying to set up all the technologies yourself. I probably could create a bare-bones Tag-Board using App Engine in a few hours and I suspect it would scale just fine. But not everything would work this well.

Scaling isn't always as simple as having lots of machines.

May 6, 2008

The Problem with Voting, Part II



More friends privately emailed me about my last blog post (The Problem with Voting) than any post before it, although no blog comments. :(

A few people made a counter argument along the following lines:
One vote doesn't count, but people who think about voting the same way as me likely would think similarly to me on other issues, such as choosing the next president. If lots of people like me decide not to vote the same way I've done, that would be very bad for my interests. If instead we all voted, it would be very good for my interests.

This is a true statement, but it doesn't change the outcome. My voting or not voting has no effect on those other people out there. In statistical terms, each person's decision to vote or not is an independent decision. In economic terms, we see the tragedy of the commons.

The extension is a stronger argument: perhaps my decision is independent, but telling other people my decision and the logic behind it could affect their decision as well. This is certainly possible. If my comments reached a lot of people it would have a real effect. You could even do the math: estimate how many people I could reach by discussing my decision, estimate how many might change their mind, and then do the math in my previous post using a larger range of values for the binomial probability density function.

For me, on a very good day, my blog gets about 75 visits. Only about 10% hit my front page, so I may have a 5-person/day reach or so for this post. Between now and November, maybe I'll hit 500 people. Even if all of those people were already going to vote the same way as I would have, and even if I caused them all not to vote, the probability of affecting the election is still tiny. Without doing the full calculation, I could grossly overestimate the probability at 0.000013% x 500 = 0.0065%, the actual number being much much smaller. If I were a talk show host or a sports star, my chance of having an effect might be higher, but I'm making very generous assumptions anyway.
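The back-of-the-envelope number above is just one multiplication. A tiny sketch, assuming 0.000013% is the single-vote figure and 500 is the number of people reached (the variable names are made up for illustration):

```python
# Figures taken from the paragraph above; both are rough estimates.
p_single_vote_percent = 0.000013  # assumed chance (in percent) that one vote matters
people_reached = 500              # assumed number of readers between now and November

# Crude overestimate: scale the single-vote chance by the number of votes affected.
upper_bound_percent = p_single_vote_percent * people_reached

print(f"{upper_bound_percent:.4f}%")  # prints 0.0065%
```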

The Important Part:

I should have elaborated more on my conclusions before. Voting is participatory, symbolic, and has a lot of personal meaning. In the same way that me driving a Prius won't make a dent in climate change, it means something to me to do my little bit. Prius vs. voting isn't that great of a comparison - one vote will have zero effect on the election, one Prius will have a tiny but non-zero effect on climate change. I may indeed vote when it comes down to it, but if I end up voting it won't be because I expect to change the outcome, but rather because I want to "feel" that I'm part of the process.

My biggest gripe though is that this is where many people stop (I'm not referring to anyone who emailed me). They vote, and only once every 4 years. They feel that making change is someone else's job, yet they have strong opinions on what that change should be. The real truth is that voting is a pretty ineffective way for anyone to effect change, but there are other very effective ways out there. Not that I couldn't do more myself - I'm certainly as black as the kettle, but many kettles are pretending they aren't black by making *only* symbolic efforts.

My parents are role models in this regard. In 1976, they got involved in the Sierra Club in South Carolina, wrote letters, organized, and were able to keep Congaree Swamp from being logged; it was declared a National Monument. Later, it became America's 57th National Park. Imagine if all they did was vote for their favorite president and stick a Sierra Club bumper sticker on their car. Things would have been different.