I think there is a critical flaw in Google App Engine. I work with the Google technologies App Engine is built on, namely massive datacenters and Bigtable. The third peg of this at Google is MapReduce, and App Engine has no equivalent. Let me explain, starting with the limitations of App Engine:
From http://code.google.com/appengine/docs/python/sandbox.html:
An App Engine application cannot:
- spawn a sub-process or thread. A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.
So, the application code must be short-lived and stateless. This is actually a great way to develop a front end. Now, what does the back end support?
From http://code.google.com/appengine/docs/datastore/entitiesandmodels.html:
A data object in the App Engine datastore is known as an entity. An entity has one or more properties, named values of one of several supported data types.
Each entity also has a key that uniquely identifies the entity. The simplest key has a kind and a unique numeric ID provided by the datastore. The ID can also be a string provided by the application. For more information on keys, see Keys and Entity Groups.
An application can fetch an entity from the datastore by using its key, or by performing a query that matches the entity's properties.
Suppose I want to create a simple version of Google Analytics using App Engine. My front end code would receive a request each time a user views a web page, and I would store that request as an entity associated with the website/account.
The most obvious feature is reporting how many hits a site got during some time span. This looks easy: I perform a query with filters that select the entities from account X in date range Y-Z, then count the results. There is a problem though.
From http://code.google.com/appengine/docs/datastore/queryclass.html:
count() is somewhat faster than retrieving all of the data by a constant factor, but the running time still grows with the size of the result set. It's best to only use count() in cases where the count is expected to be small, or specify a limit.
Reading between the lines, I'm now running an O(N) query on what could be a large N. This doesn't scale. HighScalability.com noticed this and wrote the correct workaround:
Instead of calculating the results at query time, calculate them when you are adding the records. This means that displaying the results is just a lookup, and that the calculation costs are amortized over each record addition.
So we create an entity to store every request and another entity to store the count of requests per day for a web site. And we launch. Everything works fine and scales effortlessly.
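To make the two-entity idea concrete, here is a minimal sketch that uses plain in-memory Python structures as a stand-in for the datastore. The names `record_hit`, `hits_for_range`, `requests`, and `daily_counts` are hypothetical and are not App Engine APIs. In a real datastore the counter increment would also need to be done transactionally.

```python
import datetime
from collections import defaultdict

# In-memory stand-ins for two entity kinds (hypothetical, not the App Engine API).
requests = []                       # one record per page view
daily_counts = defaultdict(int)     # (account, date) -> hit count


def record_hit(account, browser, when):
    """Store the raw request and bump the per-day counter in the same step."""
    requests.append({"account": account, "browser": browser, "when": when})
    daily_counts[(account, when.date())] += 1


def hits_for_range(account, start, end):
    """Sum one counter per day, so the cost depends on days, not on hits."""
    total = 0
    day = start
    while day <= end:
        total += daily_counts.get((account, day), 0)
        day += datetime.timedelta(days=1)
    return total


record_hit("site-x", "Firefox", datetime.datetime(2008, 4, 10, 9, 30))
record_hit("site-x", "Safari", datetime.datetime(2008, 4, 10, 17, 5))
record_hit("site-x", "Firefox", datetime.datetime(2008, 4, 11, 8, 0))
total = hits_for_range("site-x", datetime.date(2008, 4, 10), datetime.date(2008, 4, 11))
```

Reading the total for a date range touches one counter per day, no matter how many request records were stored on those days.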
Now, I want to add a new feature: show the user how many requests came from each browser type. I can store these counts in their own entities the same way as the counts per day, but I haven't been storing them so far. Even though the data is in my request entities, I'll never be able to retrieve it and count it, because I'm limited to a few seconds of processing in my front end code. I'm stuck.
This is where MapReduce would normally come in. I would write some code to process the old request entities and build up browser-count-per-day entities. Then, going forward, I would collect these counts continuously. Alternatively, with a competitor like Amazon EC2, I would spin up a few extra machine instances to do the extra processing once, and then release them back into the cloud pool. Nothing like this exists with App Engine.
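As a rough illustration of the backfill job described above, here is a single-process sketch of the map and reduce steps over stored request records. The record layout and the names `map_request`, `reduce_counts`, and `backfill` are assumptions for illustration only. A real MapReduce would spread both steps across many machines, which is exactly the part App Engine does not offer.

```python
import datetime
from collections import defaultdict

# Hypothetical stored request records; in the post these would be datastore entities.
stored_requests = [
    {"account": "site-x", "browser": "Firefox", "when": datetime.datetime(2008, 4, 10, 9, 30)},
    {"account": "site-x", "browser": "Firefox", "when": datetime.datetime(2008, 4, 10, 12, 0)},
    {"account": "site-x", "browser": "Safari", "when": datetime.datetime(2008, 4, 10, 17, 5)},
    {"account": "site-x", "browser": "Firefox", "when": datetime.datetime(2008, 4, 11, 8, 0)},
]


def map_request(request):
    """Map step: emit one ((account, day, browser), 1) pair per request."""
    key = (request["account"], request["when"].date(), request["browser"])
    yield key, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def backfill(requests):
    """Run the map step over every old request, then reduce the emitted pairs."""
    pairs = (pair for request in requests for pair in map_request(request))
    return reduce_counts(pairs)


browser_counts = backfill(stored_requests)
```

Each key in `browser_counts` corresponds to one browser-count-per-day entity that the job would write back, after which new requests could update those counts as they arrive.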
There is still plenty that could be created with App Engine. Your standard CRUD (Create, Read, Update, Delete) application can generally be built with much more ease than setting up all the technologies yourself. I could probably create a bare-bones Tag-Board using App Engine in a few hours, and I suspect it would scale just fine. But not everything would work this well.
Scaling isn't always as simple as having lots of machines.
