Goals and activities

There are two goals in today's exercises:
  1. To review some simple Unix text-processing utilities
  2. To write some throwaway programs for simple data analysis using Awk and the other text-processing tools.
Awk is named for its creators: Aho, Weinberger, and Kernighan, all of Bell Labs, who designed the language in the late 1970s. Although newer scripting languages like Perl, Python, and Ruby are much more powerful, there are many useful tasks that Awk can perform with one line of code. Mastering the art of the "1-liner" is a very useful skill, particularly when all you need to do is to extract a simple subset of data from a large file (or a large collection of files). It is vastly simpler to do such tasks with a combination of Awk and the other Unix text processing tools than to write an equivalent program in a conventional language like Fortran, C, or Java.

Unix text-processing utilities

You can do a lot of useful work with the Unix/GNU text processing commands, each of which has a manual page; man grep, for instance, prints the manual page for the grep command. Here's a thumbnail description of some of them:

Exercise 1. Since football season is starting, we'll use the Arizona Cardinals roster as an input data file for this exercise. Using any combination of the text-processing utilities listed above (you don't need Awk here), write a one-line shell command that performs each of the following tasks (one command per task):

Surface Meteorological Airways Format

The so-called METAR format is used to encode certain weather observations that are taken at ground level. Tables of these observations are updated hourly and can be downloaded from any National Weather Service web site. Here are the METAR data for stations in and near Arizona taken between 2 pm and 3 pm on Sept. 7, 2005.

Exercise 2. Save the above file as metar.dat in your working directory. Then give a grep command that lists the stations reporting scattered clouds.

The Awk programming language

The remaining exercises illustrate the basic use of Awk (more specifically, GNU Awk). While Awk is not as powerful as newer scripting languages like Perl and Python, it is a much smaller language that is ideal for quick, throwaway programs for simple data processing jobs.

Awk is a simple text processing language whose functionality can be expressed in pseudocode as follows:

read a line of text
do while(not end of file)
    process line
    read a line of text
enddo
All an Awk program specifies is the "process line" part; the Awk engine handles all the looping and I/O. For instance, the command
awk '{print $1, $2}' file
prints the first and second fields of each line in file. (If file is omitted, then Awk reads the standard input; this feature makes Awk useful in pipelines.)

After Awk reads a line of text, it automatically breaks up the line into fields, i.e., consecutive strings of nonblank characters separated by one or more blanks. (A blank is either a tab character or an ordinary space. It's possible to change the default separation character to something else, but we'll ignore that feature for now.) The first field in a given line is $1, the second is $2, and so on. Special case: $0 refers to the entire input line.
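
The field-splitting rule can be checked with a short sketch. The sample line is made up for illustration; its words are separated by a run of spaces and by a tab, and the variable names are hypothetical.

```shell
# Three fields separated by several spaces and one tab (made-up sample input).
fields=$(printf 'one   two\tthree\n' | awk '{print $3, $1}')
echo "$fields"    # prints: three one

# $0 is the whole input line, with its original spacing.
whole=$(printf 'one   two\tthree\n' | awk '{print $0}')
echo "$whole"
```

Note that print separates its comma-separated arguments with a single space, whatever the spacing in the input was.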

Awk has several useful built-in variables:

Example 1. The following command prints the contents of file, numbering each line:

awk '{print NR, $0}' file

The Awk program itself is enclosed in single quotes to prevent the shell from interpreting the characters. (Remember that strings like $1 are meaningful to the shell, too. Single quotes cause the shell to send the quoted text unchanged as the first command-line argument to Awk.)
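
The difference that the quotes make can be seen in the following sketch. The function quoting_demo is a hypothetical helper, introduced only so that the shell's own $1 has a known value (the function's first argument).

```shell
# quoting_demo is a hypothetical helper; inside it, the shell's $1 is the
# function's first argument.
quoting_demo() {
    # Single quotes: Awk receives {print $1} and prints the first field.
    echo 'alpha beta' | awk '{print $1}'
    # Double quotes: the shell substitutes its own $1 before Awk runs.
    echo 'alpha beta' | awk "{print $1}"
}

# Pass the literal text $2 as the shell's first argument.
quoting_out=$(quoting_demo '$2')
echo "$quoting_out"
# prints:
# alpha
# beta
```

In the second command Awk actually receives the program {print $2}, which is almost never what was intended; hence the habit of single-quoting every Awk program.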

More general Awk programs

More generally, an Awk program consists of one or more pairs of the form
pattern   {action}
(the curly braces delimit the action). Whenever an input line satisfies the pattern, Awk performs the associated action.

The pattern can be any condition or regular expression.

Example 2.

$1 > 0 { print log($1) }
If the first field of a line is a positive value, then print the natural logarithm of the value. Lines whose first field is not a positive number are skipped. (Of course, this program is sensible only if the first field of each line is numeric.)
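
As a quick check, the program in Example 2 can be run on a few made-up values supplied on the standard input (the variable name logs is illustrative):

```shell
# Made-up input, one value per line; -3 fails the pattern $1 > 0 and is skipped.
logs=$(printf '1\n-3\n100\n' | awk '$1 > 0 { print log($1) }')
echo "$logs"
# prints:
# 0
# 4.60517
```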

There are two special cases:
  • If pattern is omitted, then every input line is selected.
  • If action is omitted, then the default action is to print the entire line selected by the pattern.

Example 3. Print all lines in file.dat whose first field is a positive number:

    awk '$1 > 0' file.dat
    (The action is omitted, so the selected lines are simply printed.)

    Example 4. Print all lines in output.log containing the string error:

    awk '/error/' output.log
    The pattern in this case is a regular expression. Regular expressions are delimited by forward slashes. As in the previous example, the action is omitted, so it defaults to printing the matched lines.

    Example 5. Convert a file containing one x, y pair per line to pairs of the form log(x), y:

    awk '{print log($1), $2}'
    The pattern is omitted, so every line is selected.

    The syntax of actions is similar to that of C, C++, and Java. Successive statements can be delimited by semicolons. You can define your own variables. No declarations are required; a given variable simply springs into existence whenever it is mentioned. Uninitialized variables are treated as blank strings or as 0, depending on context.
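
A minimal sketch of these rules follows. The variables count and label are illustrative names that are never assigned before use; the BEGIN pattern simply lets the program run without reading any input.

```shell
# count is used in arithmetic, so its initial value acts as 0;
# label is used in string concatenation, so it acts as an empty string.
uninit=$(awk 'BEGIN { count = count + 1; print count; print "[" label "]" }')
echo "$uninit"
# prints:
# 1
# []
```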

    The print statement provides a default formatting to all values: strings are printed in their entirety, and floating-point numbers to 6 significant digits. (Awk supplies a printf statement that works much like the printf function in C, but you do not need this capability for these exercises.)
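
For comparison, here is a small sketch that prints the same value with print and with printf. The value 1000/3 is arbitrary, and the default output format is assumed to be unchanged.

```shell
# print uses the default numeric format; printf uses the format given.
fmt=$(awk 'BEGIN { x = 1000 / 3; print x; printf "%.2f\n", x }')
echo "$fmt"
# prints:
# 333.333
# 333.33
```

Unlike print, printf does not add a newline automatically, so the format string ends with \n.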

    Example 6. Awk scripts that are more than one line or that contain quotes can be placed in a separate file for convenience. Suppose that the file fruit.awk contains the following code:

    /apple/ {a += $2}
    /banana/ {b += $2}
    {f += $2}
    END {print "apples:", a, "bananas:", b, "total fruit:", f}
    and suppose the file inventory consists of
    apple 2
    kiwi 1
    banana 1
    cherry 7
    apple 3
    banana 2
    orange 6
    Then the command
    awk -f fruit.awk inventory
    prints
    apples: 5 bananas: 3 total fruit: 22
    Each time a line is read in, Awk first checks whether the line contains apple. If it does, then the variable a is incremented by the value of the second field. Next, Awk checks whether the line contains banana. If so, then the variable b is incremented by the value of the second field. Next, the variable f is always incremented by the value of the second field because the omitted pattern matches every line. When the end of the input is reached, the values of a, b, and f are displayed as indicated.

    Awk is most useful for processing text files that have a regular structure. The script in Example 6 works because each line consists of a (type, count) pair and because the variables a, b, and f begin with the value 0 when they are first mentioned.
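
To try Example 6 without creating any files, the same program and data can be supplied inline. This is only a sketch of an equivalent invocation: the program is passed as a quoted argument instead of with -f, and the inventory lines arrive on the standard input.

```shell
# Same program as fruit.awk, given inline; same data as the inventory file.
fruit_out=$(printf '%s\n' 'apple 2' 'kiwi 1' 'banana 1' 'cherry 7' \
                          'apple 3' 'banana 2' 'orange 6' |
awk '
/apple/  {a += $2}
/banana/ {b += $2}
         {f += $2}
END {print "apples:", a, "bananas:", b, "total fruit:", f}
')
echo "$fruit_out"    # prints: apples: 5 bananas: 3 total fruit: 22
```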

    General program development advice. When you write a large simulation or other modeling program, design the output so that it is readily parsable with tools like Awk, Perl, or Python. Such designs make it easy to scan through large numbers of log files when you want to explore, say, the estimated error in the output as a function of some set of input variables. The following examples are intended to illustrate the utility of this strategy.

    Awk variables

    Conceptually, Awk treats every variable as a string (as does the shell). In an arithmetic context, Awk converts the string to a number, performs the indicated arithmetic, then converts the result back to a string. For instance, the statements

    n = 0;
    n = n + 1;
    first assign n the string 0. Since the second statement specifies arithmetic, the string 0 is converted to the number 0, the value 1 is added, and the result is converted to the string 1. The net effect of these rules is that Awk appears to do ordinary arithmetic, and the internal conversions are ordinarily transparent to the programmer. (You can always force Awk to treat a variable in a numeric context by adding 0 to it.)
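
One place where the conversions are visible is in comparisons. The following sketch assigns two string constants (the names a, b, as_strings, and as_numbers are illustrative) and compares them with and without the add-0 trick:

```shell
cmp=$(awk 'BEGIN {
    a = "10"; b = "9"              # string constants
    as_strings = (a < b)           # "10" sorts before "9" as text: true (1)
    as_numbers = (a + 0 < b + 0)   # 10 is not less than 9: false (0)
    print as_strings, as_numbers
}')
echo "$cmp"    # prints: 1 0
```

Fields read from input that look like numbers are compared numerically, so this surprise mostly arises with string constants and concatenated values.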

    Exercise 3. Consider the Arizona Cardinals football roster used previously. Write an Awk script that displays all information on players whose jersey number is less than 10.

    Note: The goal of these exercises is to write simple, throwaway programs for exploratory data analysis. Consequently, there is no need to decorate your code with statements like
    print "The players whose jersey numbers are less than 10 are:";
    It suffices simply to output the corresponding lines of the data file without further decoration.

    As you become more proficient with software tools, you will find that such decoration is a nuisance. Avoid extraneous print statements except when debugging.

    The patterns BEGIN and END

    The special patterns BEGIN and END can be used for initialization before the first line of text is read and for finalization after the last line of text is read. To illustrate their use, consider the following Awk program, which, given input consisting of one positive value per line, prints the product of all the values:
    BEGIN { prod = 1 }
    { prod *= $1 }
    END { print prod }

    Here the variable prod is initialized to 1 before the first line of text is read. (Absent the BEGIN block, prod would start at 0, and the product would always be 0.) The second line omits the pattern, so the associated action is performed for all input lines (the running product is multiplied by the first field and nothing is printed). The action associated with the special END pattern is executed when the end of the input is reached; it prints the final product.

    To test this out, and before proceeding further, save this Awk program as product.awk and create a little data file with a text editor. Then run a command of the form

    awk -f product.awk datafile
    which executes product.awk using datafile as the input.
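
The same program can also be tried as a one-liner on made-up input. The sketch below runs it with and without the BEGIN block (the variable names with_begin and without_begin are illustrative):

```shell
# Made-up input: the values 2, 3, and 4, one per line.
with_begin=$(printf '2\n3\n4\n' |
    awk 'BEGIN { prod = 1 } { prod *= $1 } END { print prod }')
without_begin=$(printf '2\n3\n4\n' |
    awk '{ prod *= $1 } END { print prod }')
echo "$with_begin"       # prints: 24
echo "$without_begin"    # prints: 0
```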

    Exercise 4. For each of the following tasks, write a corresponding Awk script. (One script per task.)

    Exercise 5. Suppose that the output file from a simulation contains data in the form

    0.01 3.1312 4.3657 5.4893 1.034e-04
    0.02 3.2387 4.1908 5.6504 5.238e-04
    0.03 3.4568 3.9877 5.7384 9.193e-04
    0.04 3.8392 3.8957 5.8392 3.117e-03
    and so on. In this file, the first column is the output time t, the second, third, and fourth columns are the x, y, and z components, respectively, of a solution vector, and the last column is an estimate of the error in the solution.
    Give an example of a one-line, pipelined shell command that invokes stat.awk, together with any of the text processing programs that we have discussed so far (e.g., head, tail, sort, and awk), and prints the mean of the solution errors (i.e., the command prints one number).

    Exercise 6. Given a data file as in Exercise 5, write an Awk script that prints all times t, if any, at which the error exceeds 0.001.

    Exercise 7. Use Awk and the other text processing tools to write a script for each of the following tasks for the METAR data file given above. (One script per task.)

Copyright (c) 2007 by Eric J. Kostelich. All rights reserved.