In-class and homework exercises for Tuesday, Aug. 28

Assignment due on Tuesday, Sept. 4.

Introduction

The goal of this assignment is to do some exploratory data analysis in a Unix/Linux environment. The first problem asks you to work with some of the simple text processing utilities:
file manipulation: ls mv cp rm chmod
directory handling: cd pwd
text processing: head tail sort grep more wc
Each of these commands has a manual page; man grep, for instance, prints the manual page for the grep command. Here's a thumbnail description of some of the text processing commands, just to refresh your memory:

Exercise 1

Since football season is starting, we'll use the Arizona Cardinals roster as an input data file for this exercise. Use your browser to save this file to your home directory or other convenient location. (It doesn't matter what you call the file; I've named it cardinals.txt on the server, but you can change that to something else if you like.)

Using any combination of the text processing tools mentioned above, devise a command to perform each of the following tasks (one command per task):

Strive to do each task with a one-line command. Do not write a C or Java program to do these!

Exercise 2

Give an example of the sort command that sorts the roster in increasing order by player height.

Note added after lab: The behavior of sort has been changed recently. On older versions, the command

sort -n -k5.1 -k5.3 cardinals.txt
sorts the list by height, but this does not work with the version on the ECA lab machines. So skip this problem. I regret that I did not spot this difficulty sooner.

The sum program

The sum program is a Python script that accepts input lines consisting of at most one number per line. It computes and displays the sum of the input values. Save this file as sum and give it execute permission with
chmod +x sum
(This command turns the Python script into its own executable program. Experiment with it by creating your own little data file with one number per line.)
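The sum script itself is not reproduced here. As a rough illustration only, a script of this kind might look like the following minimal sketch. It assumes Python 3, reads from the standard input, and skips blank lines; the helper name total is hypothetical and does not come from the course's version.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a "sum"-style script; not the course's actual file.
import sys


def total(lines):
    """Return the sum of the numbers in lines (at most one number per line)."""
    result = 0.0
    for line in lines:
        text = line.strip()
        if text:                 # assumption: blank lines are ignored
            result += float(text)
    return result


if __name__ == "__main__":
    print(total(sys.stdin))
```

If a script like this is saved as sum and made executable, running ./sum < data.txt on a file with one number per line would print the total.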

Exercise 3

(a) Modify the sum script so that it prints the mean, standard error, and number of items in the input; call the new script stats. (One estimator of the standard error is the square root of E(X^2) - (E(X))^2, where E(X) is the mean of X.) Do not decorate your output with phrases like "The mean is"; simply print three numbers. (Decorated output gets in the way when you want to write a script to process a large number of files to identify those with the smallest means or largest standard errors, for example.)
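To make the formula concrete, here is a small worked example of the arithmetic, assuming Python 3. The data values are made up so that the numbers come out evenly, and the variable names are illustrative only. This is a sketch of the calculation, not a complete stats script: it does not read any input.

```python
import math

# Made-up sample values, chosen so the arithmetic is easy to follow by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n                              # E(X)   = 40 / 8  = 5.0
mean_of_squares = sum(v * v for v in values) / n    # E(X^2) = 232 / 8 = 29.0
spread = math.sqrt(mean_of_squares - mean ** 2)     # sqrt(29 - 25)    = 2.0

# Undecorated output: three numbers on one line.
print(mean, spread, n)
```

The quantity computed here, the square root of E(X^2) - (E(X))^2, is often called the population standard deviation. Be aware that when the mean is very large compared with the spread of the data, subtracting two nearly equal numbers in this way can lose precision.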

(b) What is the average weight (and corresponding standard error) of a player on the Cardinals football team? You can extract a list of the weights with

awk '{print $6}' cardinals.txt | ./stats
Awk is another very handy scripting language. You can read more:

Exercise 4

The objective here is to write some simple Python scripts to extract selected fields from the Cardinals roster, and, if appropriate, pipe them to stats or another program to answer each question. In other words, for each of the following tasks, you will write a command line of the form
python myscript.py < cardinals.txt
or
python myscript.py < cardinals.txt | other program
where the other program can be stats or one of the other standard Unix utilities. (The redirection symbol < is necessary if your scripts read only from the standard input, which is good enough for now.)

The Python material that you will need to know is in Chapters 4, 5, and 6 of the Learning Python text and in Sections 3 and 4 of Guido van Rossum's tutorial.

Important: Python is designed to allow you to start writing useful code quickly. So do not sit and read 100 pages of the textbook before you start to program! Instead, experiment, and if you can't get something to work or don't know how to do something, look in the index of the Python textbook or do an online search.

Field splitting. One common data-processing task is to split lines into fields (i.e., consecutive sequences of nonblank characters that are separated by whitespace). Python provides a simple facility for this:

import sys

for line in sys.stdin:    # any open file object works in place of sys.stdin
    field = line.split()
In this example, field is a Python list of strings: field[0] is the first text field, field[1] is the second, and so on; len(field) gives the total number of fields in line. (Try this out by adding a print statement and piping the first few lines of the Cardinals roster to your script.)
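As a quick way to see what split() produces, the sketch below runs the same loop over a single line held in memory, assuming Python 3. The sample line is invented for illustration; the actual layout of the roster file may differ, so check the real file before relying on any particular field position.

```python
import io

# Hypothetical roster-style line, invented for illustration only.
sample = io.StringIO("12 Sample Player QB 6-2 220\n")

for line in sample:              # in a real script, iterate over sys.stdin instead
    field = line.split()         # split on runs of whitespace
    # Print the number of fields, then the first and last field.
    print(len(field), field[0], field[-1])
```

Note that every field is a string, so a field holding a number must be converted with int() or float() before you do arithmetic on it.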

So, here we go (one Python script and/or command line per task):

Submission instructions

An annotated printout of your scripts (with associated output) is fine. The usual collaboration policy applies. Assignments are due at the beginning of class on Tuesday, Sept. 4.

Copyright (c) 2007 by Eric J. Kostelich. All rights reserved.