Sep 30, 2010

Goo.gl URL Shortener

Friends and coworkers of mine just launched Goo.gl, a URL shortener run on Google infrastructure.

To me, stability is the biggest selling point of Goo.gl over the alternatives - it's run by a company that isn't likely to disappear next year and is known for having some of the most scalable systems in the world. I know the team has worked hard on making this as scalable and reliable as most anything at Google, including search. I think it's safe to say Goo.gl links are anyone's best bet as far as future-proofing your shortened URLs. Also, because of Google's obsession with end-user speed, this is likely the fastest URL shortener you are ever going to find. You don't have to take my word for it - see this post from TechCrunch a few months ago that shows Goo.gl as the fastest and most reliable in two different third-party analyses.


[Chart: Reliability - greener is better]

[Chart: Speed - smaller is better]


Of course, I'm the first one to admit that I'm disappointed that we even need URL shorteners, but if you accept that you do, this is a great choice.

Sep 29, 2010

Why you should learn just a little Awk - An Awk Tutorial by Example

In grad school, I once saw a prof I was working with grab a text file and in seconds manipulate it into little pieces so deftly it blew my mind. I immediately decided it was time for me to learn awk, which he had so clearly mastered.

To this day, 90% of the programmers I talk to have never used awk. Knowing 10% of awk's already small syntax, which you can pick up in just a few minutes, will dramatically increase your ability to quickly manipulate data in text files. Below I'll teach you the most useful stuff - not the "fundamentals", but the 5 minutes worth of practical stuff that will get you most of what I think is interesting in this little language.

Awk is a fun little programming language. It is designed for processing input strings. A (different) prof once asked my networking class to implement code that would take a spec for an RPC service and generate stubs for the client and the server. This professor made the mistake of telling us we could implement this in any language. I decided to write the generator in Awk, mostly as an excuse to learn more Awk. Surprisingly to me, the code ended up much shorter and much simpler than it would have been in any other language I've ever used (Python, C++, Java, ...). There is enough to learn about Awk to fill half a book, and I've read that book, but you're unlikely to be writing a full-fledged spec parser in Awk. Instead, you might just want to do things like find all of your log lines that come from IP addresses whose components sum up to 666, for kicks and grins (I'll come back to that one at the end). Read on!

For our examples, assume we have a little file (logs.txt) that looks like the one below. If it wraps in your browser, this is just 2 lines of logs, each starting with an IP address.

07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"
123.125.71.19 [28/Sep/2010:04:20:11] "GET / HTTP/1.1" 304 -  "Baiduspider"

These are just two log records generated by Apache, slightly simplified, showing Bing and Baidu wandering around on my site yesterday.

Awk works like anything else (e.g. grep) on the command line. It reads from stdin and writes to stdout, so it's easy to pipe stuff in and out of it. The command line syntax you care about is just the command awk followed by a string that contains your program.

awk '{print $0}'

Most Awk programs will start with a "{" and end with a "}". Everything in between gets run once on each line of input. Most awk programs will print something. The program above prints the entire line that it just read; print appends a newline for free. $0 is the entire line. So this program is an identity operation - it copies the input to the output without changing it.

Awk parses the line into fields for you automatically, using any whitespace (space, tab) as a delimiter and merging consecutive delimiters. Those fields are available to you as the variables $1, $2, $3, etc.

echo 'this is a test' | awk '{print $3}'   # prints 'a'
awk '{print $1}' logs.txt


Output:
07.46.199.184
123.125.71.19

Easy so far, and already useful. Sometimes, though, I need to print fields counting from the end of the line instead. The special variable NF contains the number of fields in the current line. I can print the last field by printing the field $NF, or I can manipulate that value to identify a field based on its position relative to the end. I can also print multiple values simultaneously in the same print statement.

echo 'this is a test' | awk '{print $NF}'   # prints 'test'
awk '{print $1, $(NF-2) }' logs.txt

Output:
07.46.199.184 200
123.125.71.19 304

More progress - you can see how, in moments, you could strip this log file down to just the fields you are interested in. Another cool variable is NR, which is the number of the row currently being processed. While demonstrating NR, let me also show you how to format a little bit of output using print. Commas between arguments in a print statement put spaces between them, but I can leave out the comma and no spaces are inserted.

awk '{print NR ") " $1 " -> " $(NF-2)}' logs.txt

Output:
1) 07.46.199.184 -> 200
2) 123.125.71.19 -> 304
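
By the way, if you prefer printf-style formatting, awk also has a printf function that works much the way you'd expect. As a sketch, this should produce the same output as the print statement above (unlike print, printf does not append a newline for you, hence the \n):

awk '{printf "%d) %s -> %s\n", NR, $1, $(NF-2)}' logs.txt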

Powerful, but nothing hard yet, I hope. Now, not all files have fields that are separated with whitespace. Let's look at the date field:

$ awk '{print $2}' logs.txt

Output:
[28/Sep/2010:04:08:20]
[28/Sep/2010:04:20:11]

The date field is subdivided by "/" and ":" characters. I could handle this within a single awk program, but I want to teach you simple things that you can string together using more familiar unix piping, because it's quicker to pick up a small syntax (I'll show the single-program version below). What I'm going to do is pipe the output of the above command through another awk program that splits on the colon. To do this, my second program needs two {} components. I won't go into exactly what these mean; I just want to show you how to use them to split on a different delimiter.

$ awk '{print $2}' logs.txt  | awk 'BEGIN{FS=":"}{print $1}'

Output:
[28/Sep/2010
[28/Sep/2010

I just specified that I wanted a different FS (field separator) of ":" and that I wanted to then print the first field. No more time, just dates! The simplest way to get rid of that leading "[" character is with sed, which you are likely already familiar with:

$ awk '{print $2}' logs.txt  | awk 'BEGIN{FS=":"}{print $1}' | sed 's/\[//'

Output:
28/Sep/2010
28/Sep/2010
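
For the curious, the same date extraction can be done in a single awk program using the split function, which I haven't covered above. A sketch:

$ awk '{split($2, d, ":"); print d[1]}' logs.txt | sed 's/\[//'

Output:
28/Sep/2010
28/Sep/2010

Piping a few tiny programs together is usually easier to remember, though.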

I can further split this on the "/" character if I want using the exact same trick, but I think you get the point. Next, let's learn just a tiny bit of logic. If I want to return only the 200 status lines, I could use grep, but I might end up matching an IP address that contains 200, or a date from the year 2000. I could first grab the status field with Awk and then grep, but then I lose the rest of the line's context. Awk supports basic if statements. Let's see how I might use one:

$ awk '{if ($(NF-2) == "200") {print $0}}' logs.txt

Output:
07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"

There we go, returning only the lines (in this case only one) with a 200 status. The if syntax should be very familiar and require no explanation. Let me finish up by showing you one stupid example of awk code that maintains state across multiple lines. Let's say I want to sum up all of the status fields in this file. I can't think of a reason I'd want to do this for statuses in a log file, but it makes a lot of sense in other cases, like summing up the total bytes returned across all of the logs in a day (I'll sketch that one below). To do this, I just create a variable, which will automatically persist across multiple lines:

$ awk '{a+=$(NF-2); print "Total so far:", a}' logs.txt

Output:
Total so far: 200
Total so far: 504

Nothing to it. Obviously, in most cases I'm not interested in the cumulative values but only the final value. I can of course just use tail -n1, but I can also print stuff after processing the final line using an END clause:

$ awk '{a+=$(NF-2)}END{print "Total:", a}' logs.txt

Output:
Total: 504
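
And here's the bytes example I promised. In these logs, the bytes-returned field is the one right after the status, i.e. $(NF-1). A sketch (Apache logs "-" rather than a number when no bytes were returned; awk quietly treats that as 0 in arithmetic):

$ awk '{bytes += $(NF-1)} END {print "Total bytes:", bytes}' logs.txt

Output:
Total bytes: 0

Not very exciting on this two-line sample, where no bytes were returned at all, but on a real day's logs it does what you'd hope.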

If you want to read more about awk, there are several good books and plenty of online references. You can learn just about everything there is to know about awk in a day, with time to spare. Getting used to it is a bit more of a challenge, as it really is a slightly different way to code - you are essentially writing only the inner part of a for loop. Come to think of it, this is a lot like how MapReduce feels, which is also initially disorienting.
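
Oh, and as promised at the top, here's a sketch of finding the log lines whose IP address components sum to 666, using split and a for loop (neither of which I covered above, but you can see there isn't much more to learn):

$ awk '{n = split($1, ip, "."); sum = 0; for (i = 1; i <= n; i++) sum += ip[i]; if (sum == 666) print $0}' logs.txt

On our little sample this prints nothing (7+46+199+184 is 436 and 123+125+71+19 is 338), but point it at a real access log and you'll find your beastly visitors.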

I hope some of that was useful. If you found it to be so, leave a comment to let me know; I enjoy the feedback if nothing else.

Update Sep 30, 2010: There are some great comments elsewhere in addition to here. I wish they would end up in one place, but the best I can do currently is to link to them:

Update Jan 2, 2011: This post caught the interest of Hacker Monthly, which republished it in issue #8. You can grab the PDF version of this article courtesy of Lim Cheng Soon, Hacker Monthly's founder.

Birds of a Feather

If you are the type of person interested in Awk, you are probably the type of person I'd like to see working with me at Google. If you send me your resume (ggrothau@gmail.com), I can make sure it gets in front of the right recruiters and watch to make sure that it doesn't get lost in the pile that we get every day.

Sep 14, 2010

Solar Cell Backpacking

Recently, I've come to enjoy having my cell phone running while I'm backpacking for a variety of reasons. The problem is that my cell phone doesn't really last much more than a day or two on standby, let alone tracking GPS. If it's a 2 day trip, I can just take an extra battery, but in August I spent 7 days backpacking Kilimanjaro. Before the trip I did some research on various options to keep my cell phone running.

What I really wanted to do was set myself up with a solar panel. If it worked, I could use it for backpacking trips of any length. It isn't as if bringing a phone fits the definition of ultra-light anyway.

On a recommendation from the forums at backpacker.com, I ordered the Powerfilm USB Charger from REI. It seemed to have the features I was looking for:

  • lightweight (<5oz)
  • fairly small
  • rugged
  • easy to attach to my pack

It also charges up 2 AA rechargeable batteries, which you can then use to charge the phone via a USB port. This seemed more convenient than having to leave my phone hooked up all the time.

Before the big trip, Cristin and I did a 2-day side trip to Lassen. I took my Powerfilm for the journey. Unfortunately, while it has every feature you would want, it fails on the main one - it doesn't charge enough to keep my phone from eventually dying. It certainly helps, but not enough. I returned it. Sadly, I'm not sure that there is something on the market that will do the job.

The punchline is that I didn't have a solution for Kilimanjaro. Matt Cutts picked up a whole bunch of extra batteries, including some oversized Nexus One batteries. That worked well, but it didn't satisfy the geek in me - it seemed like cheating.

So, this got me wondering - how theoretically possible is it to power a cell phone from solar power while backpacking? If my math is wrong, please correct me.

The amount of solar energy hitting a square meter of earth in a day is approximately 4.8 kWh. The Nexus One's battery is 1400 mAh at 3.7 V, and I find that I need to charge it about once per day while backpacking. I don't want a solar panel much larger than the one mentioned above, which is ~7x4 in = 28 sq in = 0.018 sq meters.

1400 mAh x 3.7 V = 5.18 Wh = 0.00518 kWh

Now we have the same units. My little panel is much smaller than a square meter, so:

4.8 kWh / sq meter x 0.018 sq meters = 0.0864 kWh

So, the amount of solar energy hitting my little solar panel in a day is 0.0864 kWh, and the amount of energy I need for my phone in a day is 0.00518 kWh. That's about 16x more energy than I need! If I can convert just 1/16th of that power (6.25%) into battery juice, I'm set.
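
If you'd like to double-check that arithmetic, it's the kind of thing a quick awk one-liner is handy for (an illustrative sketch, nothing more):

awk 'BEGIN { panel = 4.8 * 0.018; phone = 1.4 * 3.7 / 1000; print panel, phone, panel / phone }'

which prints roughly 0.0864, 0.00518, and 16.7.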

Unfortunately, I think the best solar panels available are only around 40% efficient, and I suspect that within a backpacking product I probably won't get better than 10% efficiency. Also, efficiency decreases with heat, and the panels sit close to my head, in the sun. Also, I'm unlikely to have my panel perfectly pointed at the sun at any given time - and it will never be free of shade from clouds, trees, or my big head. So, I'd be lucky if I actually got half the energy hitting my panel that I could if I were carefully aligning it. Then of course there is more efficiency lost in transferring that energy into my phone's battery. Basically, had I done the math to begin with, I would have seen that things don't look very promising. Roughly speaking: 10% / 2 / 2 = 2.5%, well short of the 6.25% I need.

The good news is that we aren't that far off. With time, perhaps my phone will become less power hungry. Solar panels are also getting cheaper and more efficient. Perhaps one day we'll be able to get cheap backpacking panels that are as much as 60% efficient. Maybe they'll be cheap, flexible, rugged, and light enough that I can just make the top of my backpack out of the material instead of having a small square hanging off the back, giving me a much larger area to absorb energy.

I can imagine the day when the higher end backpacks come with a USB cable installed, just like a hydration system or an emergency whistle. You just plug in your digital camera, cell phone, and flashlight and your pack takes care of keeping everything charged. Would this be against the spirit of backpacking?

Update (16 September 2010):
There have been a number of great comments on the Buzz thread that got created off of this post. Questions about battery charging efficiency, piezo-electric chip alternatives, calories vs. Calories, 5000 farad capacitors, an estimate of the efficiency of my solar panels, etc. An Electrical Engineer buddy of mine, Andy Neff, weighed in and put my little bit of math to shame. If the above post was interesting, you should definitely check out the Buzz thread.

Sep 8, 2010

Instant Search - Subtle Details

The news is out this morning about Google's Instant Search UI. Basically, we now start showing search results as you are typing instead of waiting for you to press enter. I thought I'd point out a few subtle details that not everyone probably noticed, which go to show the polish that went into this launch:


  • Instant doesn't show the results for what you've typed so far, but rather for the most likely query we think you are going to type. So, if you have typed in [greg groth] but not yet hit enter, you'll see this site at the top of the results instead of a page about Gregory Groth, Attorney at Law.
  • As soon as your mouse touches one of the results, the URL bar changes so that the q= parameter contains the query the results were shown for, rather than the incomplete query that is actually in the search box. This way, web site analytics don't break and start showing lots of substrings of the real query.
  • If you are interested in what was actually in the search box, the oq= parameter gives you this information in your logs (there's an example URL after this list).
  • For the sake of counting "impressions" in either your Advertising Console or your Ad Console, not every set of results shown gets counted, since the user ignores many of them. A result gets counted as an "impression" in any of these three cases:
    • A user clicks anywhere on the result page (search result, ad, etc)
    • The user chooses a query interpretation by clicking on it, hitting enter, or pressing the "Search" button.
    • The results are displayed for a minimum of 3 seconds.
  • In any of the cases of the "impressions" above, an element is entered into your browser history, making the forward/back buttons work just like you'd expect.
  • Google was careful not to accidentally suggest [porn] if you are typing [por]; instead you get results for [porsche]. Similarly, if you type something with no real "safe" suggestions, such as [porn], you get no suggestions at all and have to explicitly complete the query first.
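
To make the q=/oq= distinction concrete, here is a made-up illustration (not copied from real logs): if you had typed [greg groth] and clicked a result while Instant was showing results for the predicted query [greg grothaus], the result page URL would look something like

http://www.google.com/search?q=greg+grothaus&oq=greg+groth

and your analytics would see the full predicted query rather than the fragment.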
Update: If this was interesting, you should also see today's blog post: Google Instant Behind the Scenes