In-class and homework exercises for Tuesday, Aug. 28
Assignment due on Tuesday, Sept. 4.
Introduction
The goal of this assignment is to do some exploratory data analysis
in a Unix/Linux environment. The first problem asks you to work
with some of the simple text-processing utilities:
file manipulation: ls mv cp rm chmod
directory handling: cd pwd
text processing: head tail sort grep more wc
Each of these commands has a manual page; man grep, for instance,
prints the manual page for the grep command. Here's a thumbnail
description of some of the text processing commands, just to refresh your
memory:
- head -n count file prints the first count
lines of file. The default count is 10 lines, and head
reads from the standard input if no file is specified.
- tail works like head, but prints the last count lines
of the input.
- sort sorts text. See the manual page for the options
-n, -k, and -r. For instance,
sort -n -k3 sorts in numerical order using the third field
of each line as the key.
- more file displays file one screenful at a time;
the standard input is used if no file is specified. Hit the space bar
to advance by one screenful, the return key to advance by one line,
and q to quit.
- wc ("word count") displays a count of lines, words, and
characters in a file or list of files.
- grep pattern file prints all the lines
in file that contain pattern. In the simplest case,
pattern is a sequence of letters, digits, and spaces.
(If the pattern contains spaces, enclose it in quotation marks.)
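To make the sort options concrete, here is a rough Python sketch of what sort -n -k3 does to a few lines. The helper name sort_by_numeric_field and the sample lines are made up for illustration, and the sketch only approximates the real utility (for example, it ignores how sort breaks ties).

```python
def sort_by_numeric_field(lines, k):
    """Roughly mimic `sort -n -k<k>`: order lines by the number in field k (1-based)."""
    def key(line):
        fields = line.split()
        try:
            return float(fields[k - 1])
        except (IndexError, ValueError):
            # Assumption: a missing or non-numeric key is treated as 0.
            return 0.0
    return sorted(lines, key=key)

# Hypothetical three-field input lines.
sample = ["pear 3 40", "apple 1 7", "fig 2 100"]
for line in sort_by_numeric_field(sample, 3):
    print(line)
# prints:
# apple 1 7
# pear 3 40
# fig 2 100
```

Without -n the comparison is textual, so 100 would sort before 40 and 7.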
Exercise 1
Since football season is starting, we'll use the
Arizona Cardinals roster as an input
data file for this exercise. Use your browser to save this file
to your home directory or other convenient location. (It doesn't
matter what you call the file; I've named it cardinals.txt
on the server, but you can change that to something else if you like.)
Using any combination of the text processing tools mentioned above,
devise a command to perform each of the following tasks
(one command per task):
- (a) count the number of players on the roster (be sure that you don't
have a blank line at the end of the file!)
- (b) display the roster in order by jersey number
(indicated by 0 if no number was listed on the Cards' web site)
- (c) display the heaviest five players
- (d) display all players who went to college in Arizona
- (e) display all the quarterbacks (QB)
- (f) count the number of wide receivers (WR)
- (g) display the roster in increasing order by years of player
experience.
Strive to do each task with a one-line command.
Do not write a C or Java program to do these!
Exercise 2
Give an example of the sort command that sorts the roster
in increasing order by player height.
Note added after lab: The behavior of sort has been
changed recently. On older versions, the command
sort -n -k5.1 -k5.3 cardinals.txt
sorts the list by height, but this does not work with the version
on the ECA lab machines. So skip this problem. I regret that
I did not spot this difficulty sooner.
The sum program
The sum program
is a Python script that accepts input lines consisting of at most
one number per line. It computes and displays the sum of the input values.
Save this file as sum and give it execute permission
with
chmod +x sum
(This command turns the Python script into its own executable program.
Experiment with it by creating your own little data file with
one number per line.)
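If you want a picture of what such a script does before you open it, the following is a minimal sketch, not the actual sum file from the server. The function name total is an assumption, and the input is a small in-memory list so that the example runs on its own.

```python
def total(lines):
    """Add up the numbers found at most one per line; blank lines are skipped."""
    s = 0.0
    for line in lines:
        text = line.strip()
        if text:                 # ignore lines that hold no number
            s += float(text)
    return s

# Stand-in for a small data file with one number per line.
data = ["3\n", "4.5\n", "\n"]
print(total(data))   # prints 7.5
```

In a script run from the command line, you would import sys and pass sys.stdin to the function instead of the list, since iterating over sys.stdin yields one line at a time.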
Exercise 3
(a) Modify the sum script so that it prints the mean, standard error,
and number of items in the input; call the new script stats.
(One estimator of the standard error is the
square root of $E(X^2) - (E(X))^2$,
where $E(X)$ is the mean of $X$.)
Do not decorate your output with phrases like The mean is;
simply print three numbers. (Decorated output gets in the way
when you want to write a script to process a large number of files to
identify those with the smallest
means or largest standard errors, for example.)
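As a check on the arithmetic, here is a small sketch that applies the formula quoted above to a short list of made-up values. The function name mean_and_stderr is hypothetical; your stats script would read its values from the standard input rather than from a list.

```python
import math

def mean_and_stderr(values):
    """Return (mean, sqrt(E(X^2) - (E(X))^2), count) for a non-empty list."""
    n = len(values)
    mean = sum(values) / n                       # E(X)
    mean_sq = sum(v * v for v in values) / n     # E(X^2)
    # Guard against a tiny negative difference caused by rounding.
    spread = math.sqrt(max(mean_sq - mean * mean, 0.0))
    return mean, spread, n

# Made-up data: E(X) = 5 and E(X^2) = 29, so the square root is 2.
print(*mean_and_stderr([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
# prints: 5.0 2.0 8
```

Printing the three numbers with nothing else on the line keeps the output easy to feed to sort or another script.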
(b) What is the average weight (and corresponding standard error)
of a player on the Cardinals football team? You can extract a list
of the weights with
awk '{print $6}' cardinals.txt | ./stats
Awk is another very handy scripting language.
Exercise 4
The objective here is to write some simple Python scripts to extract selected
fields from the Cardinals roster, and, if appropriate, pipe them
to stats or another
program to answer each question. In other words,
for each of the following tasks, you will
write a command line of the form
python myscript.py < cardinals.txt
or
python myscript.py < cardinals.txt | other program
where the other program can be stats or one of the other
standard Unix utilities. (The redirection symbol < is
necessary if your scripts read only from the standard input---which is good
enough for now.)
The Python material that you will need to know is in Chapters 4, 5, and 6
of the Learning Python text and in Sections 3 and 4 of
Guido van Rossum's
tutorial.
Important: Python is designed to allow you to start writing useful
code quickly. So do not sit and read 100 pages of the textbook
before you start to program! Instead, experiment--and if you can't
get something to work or don't know how to do something, then look in the
index in the Python textbook or do an online search.
Field splitting. One common data-processing task is to split
lines into fields (i.e., consecutive sequences of nonblank characters
that are separated by whitespace). Python provides a simple facility for this:
    for line in f:
        field = line.split()
In this example, field is a Python list of strings:
field[0] is the first text field,
field[1] is the second, and so on; len(field) gives
the total number of fields in line. (Try this out by adding
a print statement and piping the first few lines of the Cardinals
roster to your script.)
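Here is what the split looks like on a single line. The sample line below is invented and is not taken from the roster, so the number and order of fields in cardinals.txt may differ.

```python
# A made-up input line; runs of spaces and the trailing newline are both whitespace.
line = "12  Doe, John  WR  6-1  205\n"

field = line.split()      # split on any run of whitespace
print(len(field))         # prints 6
print(field[0])           # prints 12
print(field[4])           # prints 6-1
```

Every element of field is a string, so convert with int() or float() before doing arithmetic on it.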
So, here we go (one Python script and/or command line per task):
- (a) How many rookies are on the Cardinals football team?
(A rookie is a player with no previous professional experience.)
- (b) What are the names and positions of the players with more than
8 years of professional experience?
- (c) What is the average height (and standard error) in inches
of a Cardinals football player? (You will find the construct
ht = field[4].split('-')
helpful.)
- (d) Display each player's name, position, and body mass
index. The body mass index is defined as
$\mathrm{BMI} = w / h^2$,
where w is weight (well, mass) in kilograms and h is
height in meters. One inch is exactly 0.0254 meters, and you may assume
that one kilogram equals 2.2 pounds.
- (e) What is the average age in years (and the corresponding standard error)
of a Cardinals football player? You may assume that each player was born
on the first of the month for the purpose of this calculation.
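The unit conversions in parts (c) and (d) can be sketched as follows. The function names and the sample player (6-3, 220 pounds) are made up for illustration; the conversion factors are the ones given above.

```python
def height_in_inches(text):
    """Convert a feet-inches string such as '6-3' to total inches."""
    feet, inches = text.split('-')
    return 12 * int(feet) + int(inches)

def bmi(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches."""
    kg = weight_lb / 2.2          # assumes 1 kg = 2.2 lb, as stated in the text
    m = height_in * 0.0254        # 1 inch is exactly 0.0254 m
    return kg / (m * m)

h = height_in_inches("6-3")
print(h)                          # prints 75
print(round(bmi(220, h), 1))      # prints 27.6
```

Keeping each conversion in its own function makes it easy to reuse the height code for part (c) and the BMI code for part (d).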
Submission instructions
An annotated printout of your scripts (with associated output) is fine.
The usual collaboration policy applies.
Assignments are due at the beginning of class on Tuesday, Sept. 4.
Copyright (c) 2007 by Eric J. Kostelich. All rights reserved.