Exercise 1. Since football season is starting, we'll use the Arizona Cardinals roster as an input data file for this exercise. Using any combination of the text-processing utilities listed above (you don't need Awk here), write a one-line shell command that performs each of the following tasks (one command per task):
Exercise 2. Save the above file as metar.dat in your working directory. Then illustrate a grep command that lists the stations reporting scattered clouds.
The remaining exercises illustrate the basic use of Awk (more specifically, GNU Awk). While Awk is not as powerful as newer scripting languages like Perl and Python, it is a much smaller language that is ideal for quick, throwaway programs for simple data processing jobs.
Awk is a simple text processing language whose functionality can be expressed in pseudocode as follows:
read a line of text
do while(not end of file)
  process line
  read a line of text
enddo
All an Awk program specifies is the process line part; the Awk engine handles all the looping and i/o. For instance, the command
awk '{print $1, $2}' file
prints the first and second fields of each line in file. (If file is omitted, then Awk reads the standard input; this feature makes Awk useful in pipelines.)
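As a small illustration of the pipeline use mentioned above, the following sketch pipes two lines into Awk instead of naming a file. The sample text is invented for the example.

```shell
# Two made-up input lines are piped into Awk, which reads standard input
# because no file name is given on the command line.
out=$(printf 'alpha beta gamma\none two three\n' | awk '{print $1, $2}')
echo "$out"
```

Each output line should contain only the first two fields of the corresponding input line.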
After Awk reads a line of text, it automatically breaks up the line into fields, i.e., consecutive strings of nonblank characters separated by one or more blanks. (A blank is either a tab character or an ordinary space. It's possible to change the default separation character to something else, but we'll ignore that feature for now.) The first field in a given line is $1, the second is $2, and so on. Special case: $0 refers to the entire input line.
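The field-splitting rule can be checked with a quick sketch. The sample line below is made up; it separates its three fields with several spaces and a tab to show that any run of blanks counts as one separator.

```shell
# The sample line mixes several spaces and a tab between its three fields.
line=$(printf 'red   green\tblue')
first=$(printf '%s\n' "$line" | awk '{print $1}')   # first field
third=$(printf '%s\n' "$line" | awk '{print $3}')   # third field
whole=$(printf '%s\n' "$line" | awk '{print $0}')   # entire line, blanks intact
echo "$first / $third / $whole"
```

Note that $0 preserves the original spacing, whereas the individual fields contain no blanks.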
Awk has several useful built-in variables:
Example 1. The following command prints the contents of file, numbering each line:
awk '{print NR, $0}' file
The Awk program itself is enclosed in single quotes to prevent the shell from interpreting the characters. (Remember that strings like $1 are meaningful to the shell, too. Single quotes cause the shell to send the quoted text unchanged as the first command-line argument to Awk.)
An Awk program consists of one or more statements of the form
pattern   {action}
(The curly braces delimit the action.) Whenever an input line satisfies the pattern, Awk performs the associated action.
The pattern can be any condition or regular expression.
Example 2.
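$1 > 0 { print log($1) }
If the first field of a line is a positive value, then print the natural logarithm of the value. Lines whose first field is not a positive number are skipped. (Of course, this program is sensible only if the first field of each line is numeric.) There are two special cases: if the pattern is omitted, the action is performed for every input line; if the action is omitted, every line that satisfies the pattern is printed.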
Example 3. Print all lines in file.dat whose first field is a positive number:
awk '$1 > 0' file.dat
(The action is omitted, so the selected lines are simply printed.)
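To compare a pattern with an action against a pattern alone, here is a minimal sketch that runs both on the same three made-up lines (the data are invented for illustration).

```shell
# Three made-up lines; only the first field matters to the pattern.
data=$(printf '1 a\n-2 b\n100 c\n')
# Pattern with an action: print the logarithm of each positive first field.
logs=$(printf '%s\n' "$data" | awk '$1 > 0 { print log($1) }')
# Pattern alone: the matching lines themselves are printed.
kept=$(printf '%s\n' "$data" | awk '$1 > 0')
printf '%s\n%s\n' "$logs" "$kept"
```

In both cases the line beginning with -2 is skipped; the only difference is what is printed for the lines that pass the test.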
Example 4. Print all lines in output.log containing the string error:
awk '/error/' output.log
The pattern in this case is a regular expression. Regular expressions are delimited by forward slashes. As in the previous example, the action is omitted, so it defaults to printing the matched lines.
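The following sketch applies the same pattern to a few invented log lines. It illustrates two points that are easy to overlook: the pattern matches anywhere in the line (so "errors" also matches), and matching is case-sensitive.

```shell
# Four made-up log lines stand in for output.log.
log_lines=$(printf 'run ok\nfatal error: disk full\n3 errors found\nError: retry\n')
# Select the lines that contain the lowercase string error.
matched=$(printf '%s\n' "$log_lines" | awk '/error/')
echo "$matched"
```

Only the second and third lines are selected; the line beginning with a capitalized Error is not.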
Example 5. Convert a file containing one x, y pair per line to pairs of the form log(x), y:
awk '{print log($1), $2}'
The pattern is omitted, so every line is selected.
The syntax of actions is similar to that of C, C++, and Java. Successive statements can be delimited by semicolons. You can define your own variables. No declarations are required; a given variable simply springs into existence whenever it is mentioned. Uninitialized variables are treated as blank strings or as 0, depending on context.
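A short sketch of these rules follows. The variable names count and name are hypothetical and are deliberately never assigned, so one is used as the number 0 and the other as an empty string.

```shell
# count is used arithmetically (treated as 0); name is concatenated
# between brackets (treated as an empty string). Statements are
# separated by semicolons.
out=$(awk 'BEGIN { total = count + 5; label = "[" name "]"; print total, label }')
echo "$out"
```

The BEGIN pattern is used here only so that the program runs without reading any input.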
The print statement applies a default format to all values: strings are printed in their entirety, integer values are printed as integers, and other floating-point numbers are printed to six significant digits. (Awk supplies a printf function with the same functionality as in C, but you do not need this capability for these exercises.)
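The default formatting can be seen in a one-line sketch, assuming the usual default output format (six significant digits for non-integer values).

```shell
# Print two non-integer quotients, an integer-valued quotient, and a string.
out=$(awk 'BEGIN { print 1/3, 100/3, 10/2, "text" }')
echo "$out"
```

The two non-integer values are rounded for display, the integer-valued result is printed without a decimal point, and the string is printed as is.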
Example 6. Awk scripts that are more than one line or that contain quotes can be placed in a separate file for convenience. Suppose that the file fruit.awk contains the following code:
/apple/ {a += $2}
/banana/ {b += $2}
{f += $2}
END {print "apples:", a, "bananas:", b, "total fruit:", f}
and suppose the file inventory consists of
apple 2
kiwi 1
banana 1
cherry 7
apple 3
banana 2
orange 6
Then the command
awk -f fruit.awk inventory
prints
apples: 5 bananas: 3 total fruit: 22
Each time a line is read in, Awk first checks whether the line contains apple. If it does, then the variable a is incremented by the value of the second field. Next, Awk checks whether the line contains banana. If so, then the variable b is incremented by the value of the second field. Next, the variable f is always incremented by the value of the second field because the omitted pattern matches every line. When the end of the input is reached, the values of a, b, and f are displayed as indicated.
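To try Example 6 without typing the files by hand, one possible approach is sketched below. It creates fruit.awk and inventory in a scratch directory (the use of mktemp is a choice made here, not part of the original example) and then runs the command.

```shell
# Work in a scratch directory so that no existing files are overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

# fruit.awk: the program from Example 6.
cat > fruit.awk <<'EOF'
/apple/ {a += $2}
/banana/ {b += $2}
{f += $2}
END {print "apples:", a, "bananas:", b, "total fruit:", f}
EOF

# inventory: the data file from Example 6.
cat > inventory <<'EOF'
apple 2
kiwi 1
banana 1
cherry 7
apple 3
banana 2
orange 6
EOF

result=$(awk -f fruit.awk inventory)
echo "$result"
```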
Awk is most useful for processing text files that have a regular structure. The script in Example 6 works because each line consists of a (type,count) pair and because the variables a and b begin with the value 0 when they are first mentioned.
General program development advice. When you write a large simulation or other modeling program, design the output so that it is readily parsable with tools like Awk, Perl, or Python. Such designs make it easy to scan through large numbers of log files when you want to explore, say, the estimated error in the output as a function of some set of input variables. The following examples are intended to illustrate the utility of this strategy.
Internally, Awk stores all variables as strings (as does the shell). In an arithmetic context, Awk internally converts the string to a number, performs the indicated arithmetic, then converts the result back to a string. For instance, the statements
n = 0;
n = n + 1;
first assign n the string 0. Since the second statement specifies arithmetic, the string 0 is converted to the number 0, the value 1 is added, and the result is converted to the string 1. The net effect of these rules is that Awk appears to do ordinary arithmetic, and the internal conversions are ordinarily transparent to the programmer. (You can always force Awk to treat a variable in a numeric context by adding 0 to it.)
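The one situation where the string/number distinction becomes visible is comparison. The sketch below uses two hypothetical variables, x and y, that are assigned string constants; comparing them directly gives a string (dictionary-order) comparison, while adding 0 forces a numeric one.

```shell
out=$(awk 'BEGIN {
    n = 0; n = n + 1           # the two statements discussed in the text
    x = "10"; y = "9"          # string constants
    # First comparison is by characters, second is numeric after adding 0.
    print n, (x < y), (x + 0 < y + 0)
}')
echo "$out"
```

A comparison prints as 1 when true and 0 when false. As strings, 10 sorts before 9; as numbers, it does not.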
Exercise 3. Consider the Arizona Cardinals football roster used previously. Write an Awk script that displays all information on players whose jersey number is less than 10.
Note: The goal of these exercises is to write simple, throwaway programs for exploratory data analysis. Consequently, there is no need to decorate your code with statements like
print "The players whose jersey numbers are less than 10 are:";
It suffices simply to output the corresponding lines of the data file without further decoration.
As you become more proficient with software tools, you will find that such decoration is a nuisance. Avoid extraneous print statements except when debugging.
BEGIN { prod = 1 }
{ prod *= $1 }
END { print prod }
Here the variable prod is initialized to 1 before the first line of text is read. (Absent the BEGIN block, prod would be initialized to 0.) The second line omits the pattern, so the associated action is performed for all input lines (the first field is multiplied into the running product and nothing is printed). The action associated with the special END pattern is executed when the end of the input is reached; it prints the final product.
To test this out, and before proceeding further, save this Awk program as product.awk and create a little data file with a text editor. Then run a command of the form
awk -f product.awk datafile
which executes product.awk using datafile as the input.
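One way to carry out this test is sketched below. The three-line data file is invented for the example, and the scratch directory is an assumption made here to avoid overwriting existing files. The last command reruns the program without the BEGIN block to show why the initialization matters.

```shell
# Work in a scratch directory so that no existing files are overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

cat > product.awk <<'EOF'
BEGIN { prod = 1 }
{ prod *= $1 }
END { print prod }
EOF

# A made-up data file with one number per line.
printf '2\n3\n4\n' > datafile

with_begin=$(awk -f product.awk datafile)
# Same program without the BEGIN block: prod starts as 0, so the product stays 0.
without_begin=$(awk '{ prod *= $1 } END { print prod }' datafile)
echo "$with_begin $without_begin"
```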
Exercise 4. For each of the following tasks, write a corresponding Awk script. (One script per task.)
Exercise 5. Suppose that the output file from a simulation contains data in the form
0.01 3.1312 4.3657 5.4893 1.034e-04
0.02 3.2387 4.1908 5.6504 5.238e-04
0.03 3.4568 3.9877 5.7384 9.193e-04
0.04 3.8392 3.8957 5.8392 3.117e-03
and so on. In this file, the first column is the output time t, the second, third, and fourth columns are the x, y, and z components, respectively, of a solution vector, and the last column is an estimate of the error in the solution.
Exercise 6. Given a data file as in Exercise 5, write an Awk script that prints all times t, if any, at which the error exceeds 0.001.
Exercise 7. Use Awk and the other text-processing tools to write a script for each of the following tasks for the METAR data file given above. (One script per task.)
Print a listing of the form
station  visibility
sorted in descending order of visibility.
Print a listing of the form
station time temperature dewpoint
where the latter two values are given in Fahrenheit. No special formatting of the time field is necessary, but please format the temperatures nicely. One little challenge is to format an entry like 24/M03, which means that the ambient temperature is 24 degrees Celsius but the dewpoint is -3 degrees Celsius. (Hint: look up the built-in functions split and sub in the Awk reference manual.) It's easy to imagine how such a script might be part of a larger one that reads a METAR file and automatically reformats it for a Web page intended to be accessed by the public.
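For readers who have not used the two functions named in the hint, here is a minimal sketch of how they behave on an unrelated, made-up date string. The array name part is hypothetical. split breaks a string into array elements at a separator and returns the number of pieces; sub replaces the first match of a regular expression in its target.

```shell
out=$(awk 'BEGIN {
    n = split("2024-03-15", part, "-")   # part[1]="2024", part[2]="03", part[3]="15"
    sub(/^0/, "", part[2])               # remove one leading zero from the month
    print n, part[1], part[2], part[3]
}')
echo "$out"
```

Applying the same two functions to the METAR temperature field is left as part of the exercise.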