May 16, 2008

Google AppEngine Critical Flaw

Punchline: I feel like there is a critical flaw in Google App Engine. But first, the requisite fine print: I work for Google, the opinions here are my own, not Google's, and I don't work on App Engine.

I work with the Google technologies App Engine is based upon, namely massive datacenters and Bigtable. The third peg of this at Google is MapReduce, and App Engine has no MapReduce equivalent. Let me explain, starting with the limitations of App Engine:

From http://code.google.com/appengine/docs/python/sandbox.html


An App Engine application cannot:
  • spawn a sub-process or thread. A web request to an application must be handled in a single process within a few seconds. Processes that take a very long time to respond are terminated to avoid overloading the web server.



So, the application code must be short-lived and stateless. This is actually a great way to develop a front end. Now, what does the back end support?

From http://code.google.com/appengine/docs/datastore/entitiesandmodels.html:


A data object in the App Engine datastore is known as an entity. An entity has one or more properties, named values of one of several supported data types.

Each entity also has a key that uniquely identifies the entity. The simplest key has a kind and a unique numeric ID provided by the datastore. The ID can also be a string provided by the application. For more information on keys, see Keys and Entity Groups.

An application can fetch an entity from the datastore by using its key, or by performing a query that matches the entity's properties.


I want to create a simple version of Google Analytics using App Engine. I would set up my front end code to get a request from each web page that a user has viewed. I would store each request as an entity associated with the website/account.

The most obvious feature is reporting how many hits a site got during some time span. This is easy: I perform a query using filters to select the entities from account X in date range Y-Z, then use the query to count the results. There is a problem, though.

From http://code.google.com/appengine/docs/datastore/queryclass.html:


count() is somewhat faster than retrieving all of the data by a constant factor, but the running time still grows with the size of the result set. It's best to only use count() in cases where the count is expected to be small, or specify a limit.
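
To make that cost concrete, here is a minimal Python sketch of the hit count. It is not the App Engine datastore API: the list REQUESTS and the function count_hits are hypothetical stand-ins of mine, and a plain list plays the role of the stored request entities. The real datastore can use an index to skip entities that don't match, but as the note above says, the work still grows with the number of matching results.

```python
from datetime import date

# Hypothetical in-memory stand-in for stored request entities.
# A real app would keep these in the App Engine datastore.
REQUESTS = [
    {"account": "X", "day": date(2008, 5, 14), "browser": "Firefox"},
    {"account": "X", "day": date(2008, 5, 15), "browser": "IE"},
    {"account": "X", "day": date(2008, 5, 16), "browser": "Firefox"},
    {"account": "W", "day": date(2008, 5, 15), "browser": "Safari"},
]


def count_hits(requests, account, start, end):
    """Count hits for one account in [start, end], one record at a time."""
    total = 0
    for r in requests:  # every candidate record is visited once
        if r["account"] == account and start <= r["day"] <= end:
            total += 1
    return total
```

With a handful of records this is instant. With millions of hits per account, the same loop is what runs into the few-seconds request limit.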


Reading between the lines, I'm now running an O(N) query on what could be a large N. This doesn't scale. HighScalability.com noticed this and wrote the correct workaround:

Instead of calculating the results at query time, calculate them when you are adding the records. This means that displaying the results is just a lookup, and that the calculation costs are amortized over each record addition.


So we create an entity to store every request and another entity to store the count of requests per day for a web site. And we launch. Everything works fine and scales effortlessly.
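
A minimal sketch of that write-time counting, again in plain Python rather than the datastore API. The names request_log, daily_counts, record_request, and hits_in_range are hypothetical, and the dictionary stands in for the per-day count entities. It also ignores concurrent writers, which a real counter entity would have to handle.

```python
from collections import defaultdict
from datetime import date

# Hypothetical in-memory stand-ins for the two entity kinds.
request_log = []                 # one record per page view
daily_counts = defaultdict(int)  # (account, day) -> number of hits


def record_request(account, day, browser):
    """Store the raw request and bump the per-day counter in the same step."""
    request_log.append({"account": account, "day": day, "browser": browser})
    daily_counts[(account, day)] += 1


def hits_in_range(account, days):
    """Reporting is one counter lookup per day, not a scan of every request."""
    return sum(daily_counts.get((account, d), 0) for d in days)
```

The cost of counting is paid one increment at a time as requests arrive, so the report only touches as many counters as there are days in the range.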

Now, I want to add a new feature: show the user how many requests were made per browser type. I can store these counts in their own entities the same way as the counts per day, but I haven't been storing them so far. Even though the data is in my request entities, I'll never be able to retrieve it and count it: I'm limited to a few seconds of processing in my front end code. I'm stuck.

This is where MapReduce would normally come in. I would write some code to process old request entities and build up browser count per day entities. Then, going forward, I would collect this continuously. Alternatively, with a competitor like Amazon EC2, I would spin up a few extra machine instances to do the extra processing once, and then release them back into the cloud pool. Nothing like this exists with App Engine.
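
The backfill I have in mind has the usual map and reduce shape. Below is a single-process Python sketch of it, not Google's MapReduce and not anything App Engine provides. The function names and the record layout (dictionaries with "account", "day", and "browser" keys) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date


def map_request(request):
    """Map step: emit one ((account, day, browser), 1) pair per old request."""
    yield (request["account"], request["day"], request["browser"]), 1


def reduce_counts(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def backfill_browser_counts(requests):
    """Run map then reduce over every stored request."""
    pairs = (pair for r in requests for pair in map_request(r))
    return reduce_counts(pairs)
```

The logic is trivial. What is missing on App Engine is a place to run it over every old request entity, since no single web request is allowed to live that long.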

There is still plenty that could be created with App Engine. Your standard CRUD (Create, Read, Update, Delete) application can generally be created with much more ease than trying to set up all the technologies yourself. I could probably create a bare-bones Tag-Board using App Engine in a few hours, and I suspect it would scale just fine. But not everything would work this well.

Scaling isn't always as simple as having lots of machines.

5 comments:

Evan said...

I wrote a bit about how to work around this here:

http://community.livejournal.com/evan_tech/248465.html

Aral said...

Hi Greg,

I ran into the counting issue you mention a little while ago. I messed up the counters and needed to rebuild the counts. What I ended up doing is splitting the long-running process into smaller ones and either timing them out after 2 seconds or limiting each request to 5 put operations.

Each request would then forward to a new URL with the current index/row in the path.

(I had to raise Firefox's redirect max to an arbitrarily large number to actually have it run without errors.)

But it did work.

It's a little like The Incredible Machine, but I'm finding that quite a few things are with App Engine (the app also gets hit every minute via a web-based cron service, for example).

Postgres said...

Greg,

You run into this same problem with counts even when running on most traditional RDBMS systems. Counts simply are not performant. Every time you need to add a new counter, simply write a routine that populates the counter key based on your existing data set, and then update it appropriately in your app going forward.

It's pretty much a hack but I haven't worked with any data storage system that lets you run constant COUNT() operations.

Greg said...

Postgres, I agree with you completely. However, my point is that the step of "simply write a routine that populates the counter key based on your existing data set" isn't possible in App Engine. At least, it isn't possible without a workaround like the one Evan suggests.

There may have been changes in this arena much more recently than my original post. I see some news that appengine offers additional purchasable quota which might satisfy this requirement. I didn't read enough to be sure. I assume that given time, they will offer some better solutions in this area.

Favorites of Tom Nielsen said...

Hey Greg. You posted this a while ago. Do you have any pointers to more discussions about adding sawmill-ish features?

Being an ex-google engineer, I would love to use GAE for projects; however, without mapreduce/sawmill to create reports it just doesn't make sense. Maybe this is why the dev community is struggling so much with GAE's non-relational database approach.