Sep 29, 2010

Why you should learn just a little Awk - An Awk Tutorial by Example

In grad school, I once saw a prof I was working with grab a text file and, in seconds, slice it into little pieces so deftly it blew my mind. I immediately decided it was time for me to learn awk, which he had so clearly mastered.

To this day, 90% of the programmers I talk to have never used awk. Knowing 10% of awk's already small syntax, which you can pick up in just a few minutes, will dramatically increase your ability to quickly manipulate data in text files. Below I'll teach you the most useful stuff - not the "fundamentals", but the 5 minutes' worth of practical stuff that will get you most of what I think is interesting in this little language.

Awk is a fun little programming language. It is designed for processing input strings. A (different) prof once asked my networking class to implement code that would take a spec for an RPC service and generate stubs for the client and the server. This professor made the mistake of telling us we could implement this in any language. I decided to write the generator in Awk, mostly as an excuse to learn more Awk. Surprisingly to me, the code ended up much shorter and much simpler than it would have been in any other language I've ever used (Python, C++, Java, ...). There is enough to learn about Awk to fill half a book, and I've read that book, but you're unlikely to be writing a full-fledged spec parser in Awk. Instead, you might just want to do things like find all of your log lines that come from IP addresses whose components sum up to 666, for kicks and grins. Read on!

For our examples, assume we have a little file (logs.txt) that looks like the one below. If it wraps in your browser, it's just two lines of logs, each starting with an IP address.

07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"
123.125.71.19 [28/Sep/2010:04:20:11] "GET / HTTP/1.1" 304 -  "Baiduspider"

These are just two log records generated by Apache, slightly simplified, showing Bing and Baidu wandering around on my site yesterday.

Awk works like anything else (e.g. grep) on the command line. It reads from stdin and writes to stdout. It's easy to pipe stuff in and out of it. The command line syntax you care about is just the command awk, followed by a string that contains your program.

awk '{print $0}'

Most Awk programs will start with a "{" and end with a "}". Everything in between gets run once on each line of input. Most awk programs will print something. The program above will print the entire line that it just read; print appends a newline for free. $0 is the entire line. So this program is an identity operation - it copies the input to the output without changing it.
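
To see the once-per-line behavior for yourself, feed it a couple of lines - any multi-line input will do:

printf 'one\ntwo\n' | awk '{print $0}'  # prints 'one' then 'two', once per input line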

Awk parses the line into fields for you automatically, using any whitespace (space, tab) as a delimiter, merging consecutive delimiters. Those fields are available to you as the variables $1, $2, $3, etc.

echo 'this is a test' | awk '{print $3}'  # prints 'a'
awk '{print $1}' logs.txt


Output:
07.46.199.184
123.125.71.19

Easy so far, and already useful. Sometimes, though, I need to pick fields from the end of the line instead. The special variable NF contains the number of fields in the current line. I can print the last field by printing the field $NF, or I can just manipulate that value to identify a field based on its position from the last. I can also print multiple values simultaneously in the same print statement.

echo 'this is a test' | awk '{print $NF}'  # prints "test"
awk '{print $1, $(NF-2) }' logs.txt

Output:
07.46.199.184 200
123.125.71.19 304

More progress - you can see how, in moments, you could strip this log file down to just the fields you're interested in. Another cool variable is NR, which holds the number of the row currently being processed. While demonstrating NR, let me also show you how to format a little bit of output using print. Commas between arguments in a print statement put spaces between them; leave out the comma and no space is inserted.

awk '{print NR ") " $1 " -> " $(NF-2)}' logs.txt

Output:
1) 07.46.199.184 -> 200
2) 123.125.71.19 -> 304
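
By the way, there is also a printf function that works much the way you'd expect, if you prefer that form of formatting. As a quick sketch, here's the same report written with printf (note that printf, unlike print, does not append a newline for you):

awk '{printf "%d) %s -> %s\n", NR, $1, $(NF-2)}' logs.txt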

Powerful, but nothing hard yet, I hope. Now, not all files have fields that are separated with whitespace. Let's look at the date field:

$ awk '{print $2}' logs.txt

Output:
[28/Sep/2010:04:08:20]
[28/Sep/2010:04:20:11]

The date field is separated by "/" and ":" characters. I could do the following within one awk program, but I want to teach you simple things that you can string together using more familiar unix piping, because a small syntax is quicker to pick up. What I'm going to do is pipe the output of the above command through another awk program that splits on the colon. To do this, my second program needs two {} blocks. I don't want to go into exactly what these mean, just to show you how to use them for splitting on a different delimiter.

$ awk '{print $2}' logs.txt  | awk 'BEGIN{FS=":"}{print $1}'

Output:
[28/Sep/2010
[28/Sep/2010

I just specified that I wanted a different FS (field separator) of ":" and that I then wanted to print the first field. No more time, just dates! The simplest way to get rid of that leading "[" character is with sed, which you are likely already familiar with:

$ awk '{print $2}' logs.txt  | awk 'BEGIN{FS=":"}{print $1}' | sed 's/\[//'

Output:
28/Sep/2010
28/Sep/2010
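
As an aside, you can also set the field separator right on the awk command line with the -F flag, which saves typing the BEGIN block. This pipeline is equivalent to the one above:

awk '{print $2}' logs.txt | awk -F: '{print $1}' | sed 's/\[//'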

I can further split this on the "/" character if I want, using the exact same trick, but I think you get the point. Next, let's learn just a tiny bit of logic. If I want to return only the 200 status lines, I could use grep, but I might end up matching an IP address that contains 200, or a date from the year 2000. I could first grab the status field with Awk and then grep, but then I'd lose the whole line's context. Awk supports basic if statements. Let's see how I might use one:

$ awk '{if ($(NF-2) == "200") {print $0}}' logs.txt

Output:
07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"

There we go, returning only the lines (in this case only one) with a 200 status. The if syntax should be very familiar and require no explanation. Let me finish up by showing you one stupid example of awk code that maintains state across multiple lines. Let's say I want to sum up all of the status fields in this file. I can't think of a reason I'd want to do this for statuses in a log file, but it makes a lot of sense in other cases, like summing up the total bytes returned across all of the logs in a day. To do this, I just create a variable, which will automatically persist across multiple lines:

$ awk '{a+=$(NF-2); print "Total so far:", a}' logs.txt

Output:
Total so far: 200
Total so far: 504

Nothing to it. Obviously, in most cases I'm not interested in the cumulative values, only the final value. I can of course just use tail -n1, but I can also print stuff after processing the final line using an END clause:

$ awk '{a+=$(NF-2)}END{print "Total:", a}' logs.txt

Output:
Total: 504
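
For the more realistic bytes case mentioned above, the byte count in our sample logs is the field just before the user agent, i.e. $(NF-1). Conveniently, awk quietly treats the non-numeric "-" as 0 in arithmetic, which is exactly what we want here. A sketch:

awk '{bytes += $(NF-1)} END {print "Total bytes:", bytes}' logs.txt  # prints "Total bytes: 0" for our tiny sample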

If you want to read more about awk, there are several good books and plenty of online references. You can learn just about everything there is to know about awk in a day, with some time to spare. Getting used to it is a bit more of a challenge, as it really is a slightly different way to code - you are essentially writing only the inner part of a for loop. Come to think of it, this is a lot like how MapReduce feels, which is also initially disorienting.
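
Remember the 666 example from the beginning? With the pieces above plus awk's split function - which breaks a string into an array on a delimiter of your choice - it becomes a one-liner. A sketch, assuming the IP address is the first field, as in our logs:

awk '{split($1, ip, "."); if (ip[1]+ip[2]+ip[3]+ip[4] == 666) print $0}' logs.txt

Neither of our two sample addresses sums to 666 (they come to 436 and 338), so this prints nothing, but you get the idea.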

I hope some of that was useful. If you found it to be so, leave a comment to let me know, I enjoy the feedback if nothing else.

Update Sep 30, 2010: There are some great comments elsewhere in addition to here. I wish they would end up in one place, but the best I can do currently is to link to them:

Update Jan 2, 2011: This post caught the interest of Hacker Monthly, which republished it in issue #8. You can grab the pdf version of this article courtesy of Lim Cheng Soon, Hacker Monthly's founder.

Birds of a Feather

If you are the type of person interested in Awk, you are probably the type of person I'd like to see working with me at Google. If you send me your resume (ggrothau@gmail.com), I can make sure it gets in front of the right recruiters and watch to make sure that it doesn't get lost in the pile that we get every day.

65 comments:

RBerenguel said...

I completely agree with your take on awk. I picked it up quickly; any small tutorial will leave you with the building blocks to do that "take first column, add to third and remove anything else" in a matter of minutes.

Yes, you could write a C program doing that but it would take far longer.

Cheers,

Ruben

Kunal said...

Unix tools like sed/grep/awk are *true* gems! Anything that would take many lines (and yes, it could run to hundreds) in a general programming language can be done as a one-liner using sed/grep/awk - provided they are being used for the correct purpose! Don't ask sed to open a network connection!!

danielgoessweden said...

Thanks for the article! Great intro to awk.

I think I spotted a small mistake:
I can print the last line by printing the field $NF or I can just manipulate that value.
Shouldn't that read "the last field"?

Greg said...

Good catch. I fixed that up.

Mark Carmichael said...

Perl largely supersedes awk for this sort of text manipulation, but awk has an elegance in its base functionality that's hard to match.

The authors (A. W. and K.) maintain a web page for the language and their own excellent AWK book:

http://cm.bell-labs.com/cm/cs/awkbook/

The source code to the reference implementation of the language is available there, as are all the examples and exercises from the book.

That book is a model of concision, and the nature and quality of the examples and exercises make it worth a look even for folks who don't plan to make use of awk itself.

Greg said...

I'm not actually familiar with the history of either, although perhaps I should be. However, an unconfirmed comment on reddit (http://www.reddit.com/r/programming/comments/dkew8/why_you_should_know_just_a_little_bit_of_awk/) mentions that Awk appeared in 1977 and Perl in 1987?

smutticus said...

Good post. I'm gonna bookmark it for later reference. I'd like to suggest one small change to one of your examples.

Instead of
$ awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}'

I suggest
$ awk '{print $2}' logs.txt | awk -F: '{print $1}'

Fewer characters and easier to read.

pompomtom said...

s/\(ie\:/eg/

Kristian said...

Great post!

Can you make a similar post about sed?

I have always felt that sed is quite similar to awk.

:)

Mark Carmichael said...

Awk was created in 1977, per the preface of the authors' book, but probably didn't see wide distribution outside of Bell Labs until the release of Version 7 Unix in 1979.

As for Perl, its development was directly inspired by limitations in awk; the first public release was at the end of 1987.

Ironically, an updated version of awk that addressed those very limitations had been created in 1985 at Bell Labs, but wasn't widely released until 1986 or 1987, I think.

Perl has features that can effectively implement the pattern-action structure that is awk's hallmark. Perl distributions also come with a translator program, 'a2p', to aid migration from AWK to Perl.

stringbot said...

Great guide! Now I feel like a _real_ Unix hacker. :)

Daniel Wellman said...

This was very helpful; succinct, easy to understand, and each point built logically off the previous one. Thanks for writing this up!

Vivek said...

Great read to start learning awk!

Paddy3118 said...

The bird is the Auk with a 'u'.

I believe one of Perl's aims was to supplant Awk, and there are early Perl features that make it straightforward to recreate the Awk functionality or program in an Awk style, or even run the Awk2pl script to convert awk to perl.

Me, I learnt Awk first then Python as well as Perl and still use them.

Skunk said...

Why did you write awk "{print $0}" for your first example of awk source?

Isn't showing people awk with double quotes going to fuck them up? That line of code doesn't even work.

Greg said...

Skunk, Good call. Updated.

Greg said...

Paddy, bonus points if you know what the little creature representing Awk is that O'Reilly puts on their sed/awk book.

j.e.h. said...

This example:


awk '{if ($(NF-2) == "200") {print $0}}' logs.txt


Could have been better written as:


awk '$(NF-2) == "200" { print $0}' logs.txt


Or even the more succinct:


awk '$(NF-2) == "200"' logs.txt


Which would allow you to demonstrate the pattern matching behavior of awk (which leads to some of the truly powerful things one can do with it.)

darrowj said...

I used to use AWK/SED all the time. Then I discovered PERL and never used them again. That was back in 1999. I still have good memories with it though. Although the script I wrote needed to be maintained by others after me. I remember one person saying they would "kick my ass if they ever saw me again." :-)

Skunk said...

I apologize for the cursing.

Adam said...

Thanks for this! I have used sed for a while, but I was intimidated by awk, and the existing guides were either too basic or too advanced to give a feel for how one might actually use awk in day-to-day work. My hat's off to you, sir!

Brianna Dillon said...

For your example here:

awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}' | sed 's/\[//'

There is a much simpler (though more confusing for a new user) way to do it.

awk 'BEGIN{FS="[:[]"}{print $2}' logs.txt

or even less typing

awk -F'[:[]' '{print $2}' logs.txt

Fleas said...

You should mention the common gotchas between the awk versions (awk,gawk,nawk,etc). specifically awk versions that are not posix compliant and also those that use extensions.

ie. bsd vs gnu platforms

hint: awk --posix

Greg said...

Brianna, you raise a good point about being able to split on multiple characters.

A few of the commenters on reddit similarly pointed out that using "if($(NF-2) == "200") print $0" is not really the way of the awk, which is true. It's better to do line-matching, for example:

awk '$(NF-2) == "200" {print $0}' logs.txt

Your point about how these things are a little confusing to new users applies in both cases though. I wanted to keep things really simple and familiar, so that the examples were easy to understand and generalize. In the line matching case, you have to suddenly understand another concept, namely that there is both a match and a "code" portion to each statement. That seemed intimidating enough to want to leave out.

This post has generated a lot of interest though, I'll probably do a follow up in a couple days with some slightly more advanced tips.

Douglas Cuthbertson said...

Thank you for that introduction to awk. Most of the time I'm using Windows, which doesn't have anything remotely as useful as sed, awk and grep. I occasionally use FreeBSD and love tools like vi, sed and grep. I'm now going to add awk to my repertoire.

BTW, the creatures on the O'Reilly book are slender lorises.

Fleas said...

awk '$8 ~ /200/ {print $6}' "$1" ${APACHE_LOG} | sort | uniq -c | sort -rn | head -10

this lists the top 10 200's and the request; note the use of "~" - and if I wanted all status results except 200's, I would use "!~".

*my log format has http status codes in $8; yours will probably be different.

LB said...

I used to use it. Now, however, I have found that even though Perl requires a little more coding, it does the job much better.

jc said...

Douglas,

Gawk (the Gnu version of awk) is available for windows at
http://gnuwin32.sourceforge.net/packages/gawk.htm

Jonathan said...

Some people are offering shortcuts, which is cool. But for beginners, showing it the slightly longer way (as in Greg's article) is definitely clearer.

Don't change a thing! Except for bugfixes :-)

butu said...

yeah I too really love AWK. It is great for analyzing realtime apache log files.

I keep adding a few awk code snippets to my small site

http://codesnipr.com/tags/AWK?cat=basic

mnwcsult said...

A little awk goes a long way. The major difference between awk and perl is that Perl encapsulates many UNIX-like commands and regular expressions into its framework, while awk is really just another UNIX command that, when used in a shell script along with other UNIX commands, can closely replicate perl.

Years ago I taught a course that compared Fortran, C/C++, AWK, and PERL by solving the same problem. Something about air burst atomic explosions. I digress. What shook out was how amazingly similar Fortran and C were, and then how simple going from C to AWK to Perl actually was.

Here is a tidbit. What does AWK mean? No great flightless bird; instead, AWK is Aho, Weinberger, Kernighan - the Bell Labs crowd behind UNIX and the C language.

How cool is that?

Carl Trachte said...

Greg,
Thanks for the article. I saw it on LinkedIn.
Just enough to get started with AWK.

chamatkari baba said...

Great article! I was looking for an introduction to awk and this was perfect!! The longer versions definitely helped me and made more sense (to me) than the shorter, more advanced versions. It is more in sync with the popular programming languages. Cheers!

Joe said...

I have to admit that I absolutely love awk. Anytime I need to do some data massaging it's my goto tool. And besides, it's a fun language to use.

dan said...

I did a fairly thorough job of learning Perl many years back but have since abandoned it for all but one-off command line incantations. It excels in this use. It sounds like you use awk in a similar way.

It's important to have some tool available to you for quick one-off processing like this. It doesn't matter so much what it is. Some people use Perl, some use awk, some use pyline. Pick your poison, they'll all make you more effective.

Paddy3118 said...

I think that this animal: http://oreilly.com/catalog/9781565922259/preview#preview is one of these: http://www.kidcyber.com.au/topics/ayeaye.htm.

My son did a piece on the Galago for school which is similar and from Madagascar.

I don't know many birds, it's just that I had already looked into awk/auk.

- Paddy.

jbm said...

Awk is also an invaluable tool for machine learning. Most of the work we do is data exploration and massage. Familiarity with awk, egrep, and sort is the key to spending hours looking for patterns instead of days debugging your data extraction tools. It's basically the same problem as finding log lines, but, you know, messier.

Thanks for posting something I can point people at!

jbm said...
This comment has been removed by the author.
Liang said...

Great introduction! Thanks!

mohammad mahdi said...

thanks for the great introduction to awk. I checked the code from this post on Windows with "Gawk for Windows" and here are some differences within the scope of this post:
in cmd you have to use '"' instead of ''', RS instead of FS, and separate '{}' blocks with ; like:
awk "BEGIN{RS="sep"};{print $1}"

Richard Careaga said...

The old ways are the best ways (sometimes!). Thanks for the reminder. The virtues of awk are conciseness (think 115 baud glass teletypes that can't quite keep up with a fast typist) without being obscure. I don't use it as much as I used to, but you're right that it's handy.

jfromm said...

Modern communication: grep, awk, sed, vim! http://bit.ly/LzCCI

Adventurous said...

Sweeet
Thank you!

BeginLinux said...

Awk is such an incredible tool. This is an excellent awk intro for noobs like me. Thank you. Here's another one Introduction To Awk

Anirudh said...

Hi,

I recently had to process a lot (>2GB) of logs. After trying out awk and simple perl scripts - neither of which are things I have experience with - I gave up and wrote a small program to parse each line and add it to a MongoDB database.

This was actually fantastically useful. I could run much larger queries (and reasonably complex ones) using simple commands. And it ran blazing fast to boot.

Getting a top 10 of most common visitors?

db.find().sort({source:1})[10]


Getting a list of IP addresses who caused 500 errors (and thus possible attackers)

db.find({response:'500'}).sort({source:1})


Of course, this is no replacement for awk. I'm just saying there might be situations that warrant the use of bigger, albeit more complex and verbose tools, when the task can be solved with a relatively cryptic awk script.

P.S. Is there a way to make Blogger comments print preformatted code? Apparently "pre" and "code" tags aren't supported.

Samuel said...

Thanks for the material. I found it on LinkedIn. Just enough to get started with AWK. Thank you.


Tan said...

I never had a problem to solve using awk, so I never tried using it extensively. But this post has the perfect info to get started. Well written.

questionable-authority said...

A comparison of regular expression matching implementations for Perl, grep, awk, among others:

http://swtch.com/~rsc/regexp/regexp1.html

Sushant said...

Hi all, so nice an article this is! I observe, as all others, there is really a magical effect when one starts things small and simple. Like slow pegs of wine, it makes you high slowly and gradually...keeping you in tune all through! Sorry to be obscure if I am so!

Thanks for the article!

Zubin Mehta said...

Very nice short and clear article. I always felt awk is weird, actually it is a bit disorienting at the start but it is powerful :)

killa bee said...

I like AWK, really, but many of the things people use it (and Perl) for should instead involve the UNIX tools "cut" and "paste", "column", etc. which are even simpler, and totally forgotten about.

Tim Schaeffer said...

If you want to see some awk in the wild, read the source to the quilt patch management program; awk is used throughout its shell function library. quilt is also a great example of how to create an elaborate application purely in a shell script.

rodri said...

Brief, clear and useful, thank you!

charles said...

IP Addresses don't start with "07"

Leonid Volnitsky said...

@RBerenguel
> Yes, you could write a C program doing that but it would take far longer.

Not true. In SCC (simplified C++) all above examples would be shorter.

Max Zamkow said...

Great quick intro. Thanks!

Vasudev Ram said...

Interesting article. A couple of comments - AWK is a somewhat deeper language than a lot of people think, and some of its uses (related to those deeper aspects) can be pretty counter-intuitive to those who don't already know them. I got to realize some of this from reading some intermediate/advanced examples of AWK use in either or both of the books "Programming Pearls" and "More Programming Pearls" by Jon Bentley, who also wrote "Writing Efficient Programs", another great book, but about performance tuning, not AWK. In the first two books (which are about powerful programming techniques and tricks, and use many languages), he gives some examples of powerful AWK uses, particularly involving its associative arrays, which can reduce the equivalent C program of many more lines to a very few lines of AWK code. And what's worth emphasizing is that the resultant AWK code is better not only because it has fewer lines (that's not always an improvement in itself) but also because it is not cryptic at all, as long as you understand the concepts used (associative arrays are one) and how and why they are used in that particular example. Those two "Pearls" books are worth a look for anyone interested in improving their programming skills.
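
A classic illustration of those associative arrays is the word-frequency counter - here is a sketch (input.txt being whatever text file you like):

awk '{for (i = 1; i <= NF; i++) count[$i]++} END {for (w in count) print count[w], w}' input.txt

That's a complete program to count every distinct word in a file, something that takes many more lines in C.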

Robert Bram said...

Thank you Greg - excellent tutorial showing me quickly how to use the best bits of awk. I must admit that I have used awk in ignorance a lot, pulling together scraps of google-found logic that only half worked how I wanted. Now I know a bit more about how to understand awk syntax.

rmcc4444 said...

Good intro. However, I have an old habit of using grep and cut way too many times even when there is something better. :)

cyrano said...

Good read.
Short and to the point...what I would call a breakfast read.

Greg said...

Good catch charles. It probably was a copy/paste error. It's not relevant to the awk examples though so I won't bother fixing it up.

Greg said...

killa bee, you are definitely right. I underuse those unix utilities and use awk more often, but that's not always ideal. I think there is something to be said for using the right tool and something to be said for using what you are familiar with.

春晓的晓 said...

I'd like to know how to write one awk statement without func split() to replace that:

$ awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}'

Thanks.

nepaliboy said...

One of the best resources I came across. Thanks for sharing and making it lovable KISS.

Jimmy O'Donnell said...

From a biologist learning to love programming, thanks for writing this! Much easier to follow and turn into doing useful things than any other tutorial I have come across.