STP 429, Spring 2009: Experimental Statistics

ASSIGNMENTS AND READINGS

For Jan 22:

  1. Download the R statistical computing package from www.r-project.org to your laptop.
  2. Review confidence intervals and hypothesis tests. Useful online references, if you no longer have your first statistics book, are The Little Handbook of Statistical Practice and the Rice Virtual Lab in Statistics. You may also find the Wikipedia articles useful in your review: Student's t-test and Confidence interval.
  3. Begin reading "An Introduction to R", available at http://www.r-project.org/ under Manuals.
  4. Other useful references on using R for statistics: Regression Using R and Mark Gardener's Using R for Statistical Analysis
For Jan 29:
  1. Read Chapters 1 and 2 of the text.
  2. Use www.amazon.com for this exercise. Pick a genre of books that you are interested in (for example, mysteries). Now pick two sub-genres (for example, British detectives and hard-boiled). For each sub-genre, take a simple random sample of size 20 of the paperback books that Amazon sells in that sub-genre (you'll have 40 books altogether). For each book, record the number of pages, copyright date, average customer review, number of customer reviews, and price. Now compare the number of pages for the two sub-genres you chose. Construct histograms, boxplots, and normal probability plots for your two samples. Do a two-sample t test for equality of means, and give a 95% confidence interval for the difference in means.

    Turn in your annotated R code (including the details on how you selected your samples), plots, and relevant output. Do the assumptions for a two-sample t test appear to be met for these data?

    Make sure you save these data for a future assignment.
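
The sampling and comparison steps for this exercise can be sketched in R roughly as follows. This is only an illustration: the listing sizes (350 and 420) and the page counts are made-up stand-ins, and the variable names are hypothetical. Replace them with the values you actually record.

```r
# Minimal sketch with simulated data; every number here is made up.
set.seed(429)

# Suppose the two sub-genre listings contain 350 and 420 paperbacks
# (hypothetical counts). Draw a simple random sample of 20 positions each.
idx_british    <- sample(1:350, size = 20)   # sample() draws without replacement
idx_hardboiled <- sample(1:420, size = 20)

# Stand-in for the page counts you would record for the sampled books
pages_british    <- round(rnorm(20, mean = 320, sd = 60))
pages_hardboiled <- round(rnorm(20, mean = 280, sd = 50))

# Histograms, side-by-side boxplots, and normal probability plots
par(mfrow = c(2, 3))
hist(pages_british, main = "British detectives", xlab = "Pages")
hist(pages_hardboiled, main = "Hard-boiled", xlab = "Pages")
boxplot(pages_british, pages_hardboiled,
        names = c("British", "Hard-boiled"), ylab = "Pages")
qqnorm(pages_british, main = "Q-Q: British");        qqline(pages_british)
qqnorm(pages_hardboiled, main = "Q-Q: Hard-boiled"); qqline(pages_hardboiled)

# Two-sample t test. R's default is the Welch (unequal variance) version;
# add var.equal = TRUE for the pooled-variance test.
tt <- t.test(pages_british, pages_hardboiled, conf.level = 0.95)
tt$p.value    # test of equal means
tt$conf.int   # 95% CI for the difference in means
```

Recording the seed and the sampled positions in your annotated code is one way to document how the samples were selected.
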
For Feb 12:
  1. Read Chapter 3 of the text.
  2. Install the "faraway" package in R. Go to the Packages menu and select "Install package(s)." Select a CRAN mirror in the U.S., then select faraway from the menu that appears.
  3. Type "library(faraway)" to load the library and then type "help(prostate)" to learn about the prostate data. Now consider using lcavol as the predictor variable (x), and lpsa as the response variable (y). Plot y vs. x, and fit the least squares regression line. Give the regression line, and display it on the data plot. Does the model appear to be a good fit? Give a 95% confidence interval for the slope, and interpret it. What is R^2 for this model, and what does it mean?
  4. Add the predictors lweight and svi to the model with lcavol. Do a hypothesis test that the coefficients of the extra parameters are both 0.
  5. Now construct all pairwise plots for all the variables in the prostate data. Fit a model with all the predictors in the data. What is R^2 for this model?
  6. Do problem 1 (a,c,d) in Chapter 3. For parts c and d, also construct the 95% CI for the mean response. How do you interpret each interval?
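
The R calls behind these questions look roughly like the sketch below. To keep it runnable without extra packages it uses the built-in mtcars data as a stand-in (mpg playing the role of lpsa, wt the role of lcavol); that substitution is an assumption for illustration, so swap in the prostate variables for the assignment.

```r
# Stand-in data: built-in mtcars, NOT the prostate data.
fit1 <- lm(mpg ~ wt, data = mtcars)

plot(mpg ~ wt, data = mtcars)
abline(fit1)                         # least squares line on the scatterplot

coef(fit1)                           # intercept and slope of the fitted line
confint(fit1, "wt", level = 0.95)    # 95% CI for the slope
summary(fit1)$r.squared              # R^2

# Add two predictors and test that both extra coefficients are 0
fit2 <- update(fit1, . ~ . + hp + am)
anova(fit1, fit2)                    # partial F test comparing nested models

# All pairwise scatterplots for a set of variables
pairs(mtcars[, c("mpg", "wt", "hp", "am")])

# CI for the mean response and prediction interval at a new x value
new <- data.frame(wt = 3)
ci_mean <- predict(fit1, new, interval = "confidence", level = 0.95)
pi_new  <- predict(fit1, new, interval = "prediction", level = 0.95)
ci_mean
pi_new
```

The prediction interval is always wider than the confidence interval for the mean response at the same x, because it also accounts for the variability of a single new observation.
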
For Feb 26:
  1. Read Chapter 4 of the text.
  2. Do problem 2 on page 51 in Chapter 3.
  3. For the prostate data, fit the model with all predictors. Compare the added variable plot for the predictor variable lcavol with the plot of lpsa vs. lcavol. Draw the line with slope from the multiple regression on the AVP, and the regression line for the regression of lpsa on lcavol on the plot of lpsa vs. lcavol. What information do you get from each plot? How do the slopes of the lines compare? How would you explain the difference in the plots to a medical doctor?
  4. Repeat the previous question for the predictor pgg45. Explain to a medical doctor why pgg45 is significant when considered as a single predictor but not in the multiple regression model.
  5. In November 2008, there was speculation about who would win the senate seat in Alaska. As of November 11, Mark Begich had a total of 103,337 counted votes and Ted Stevens had a total of 106,594 counted votes, leading many news organizations to project Stevens as the winner. But at that date, about 9200 early ballots remained to be counted, along with about 50,000 absentee ballots.

    Let's see if regression methods could have been used to predict the allocation among the uncounted ballots. The file alaska.csv has the following information for each of the 40 state House Districts in Alaska.

    Column 1: House District Number (HD)
    Column 2: Number of early votes counted for Begich (EarlyBegich)
    Column 3: Number of early votes counted for Stevens (EarlyStevens)
    Column 4: Number of election day ballots counted for Begich (EDBegich)
    Column 5: Number of election day ballots counted for Stevens (EDStevens)
    Column 6: Percent of early votes received by Begich (PercentEarlyB); set to missing value of -9 if there are fewer than 100 early votes in the house district
    Column 7: Percent of election day votes received by Begich (PercentEDB)
    Column 8: Number of uncounted early votes (Early)
    Column 9: Number of uncounted absentee ballots (Absentee)

    Fit a straight-line model predicting PercentEarlyB from PercentEDB. How would you characterize the strength of the linear relationship? Interpret the slope and intercept (and their statistical significance or lack thereof) for this data set. Make sure you include a plot of the data with the regression line drawn in.
  6. Now try a quadratic model. Is the quadratic term significant? What happens to R^2 when you add the quadratic term? Which model do you prefer for these data? Draw the fitted model with the quadratic term on the plot.
  7. For each of the 40 House Districts, find the predicted percentage of early ballots that would be cast for Begich.
  8. Now, using your regression model, allocate the early and absentee ballots to the two candidates according to your predicted percentage of early votes in each House District that would be for Begich. What are the predicted total vote tallies (including the votes counted on election day and the early ballots already counted, given in columns 2-5) for each candidate using the regression predictions? What assumptions are you making to form these predictions?
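
One way to organize the Alaska calculation is sketched below. The data frame is simulated so that the code runs on its own; it only borrows the column names described above, and none of its numbers are real. With the actual file you would start from read.csv("alaska.csv") instead, assuming the file has a header row with those names.

```r
# Simulated stand-in for alaska.csv (column names from the assignment,
# values made up and not internally consistent).
set.seed(2008)
n <- 40
PercentEDB <- runif(n, 25, 70)
alaska <- data.frame(
  HD            = 1:n,
  EarlyBegich   = rpois(n, 300),
  EarlyStevens  = rpois(n, 300),
  EDBegich      = rpois(n, 2300),
  EDStevens     = rpois(n, 2400),
  PercentEarlyB = PercentEDB + 5 + rnorm(n, sd = 4),
  PercentEDB    = PercentEDB,
  Early         = rpois(n, 230),
  Absentee      = rpois(n, 1250)
)
alaska$PercentEarlyB[c(3, 17)] <- -9   # pretend two districts had < 100 early votes

# Recode the -9 missing-value code to NA so lm() drops those districts
alaska$PercentEarlyB[alaska$PercentEarlyB == -9] <- NA

lin  <- lm(PercentEarlyB ~ PercentEDB, data = alaska)
quad <- lm(PercentEarlyB ~ PercentEDB + I(PercentEDB^2), data = alaska)
summary(quad)$coefficients             # t test for the quadratic term
c(linear = summary(lin)$r.squared, quadratic = summary(quad)$r.squared)

# Data with both fitted curves
plot(PercentEarlyB ~ PercentEDB, data = alaska)
abline(lin)
grid_x <- data.frame(PercentEDB = seq(20, 75, length.out = 100))
lines(grid_x$PercentEDB, predict(quad, grid_x), lty = 2)

# Predicted share for all 40 districts, including those with a missing response.
# The linear model is used here; substitute whichever model you prefer.
p_hat <- predict(lin, newdata = alaska) / 100

# Allocate the uncounted early and absentee ballots by predicted share
uncounted <- alaska$Early + alaska$Absentee
begich  <- sum(alaska$EarlyBegich + alaska$EDBegich)   + sum(uncounted * p_hat)
stevens <- sum(alaska$EarlyStevens + alaska$EDStevens) + sum(uncounted * (1 - p_hat))
round(c(Begich = begich, Stevens = stevens))
```

Note that this allocation treats absentee ballots as if they split the same way as early ballots in each district, which is one of the assumptions the last question asks you to state.
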
For Mar 5:
  1. Read Section 7.1 of the text.
  2. Do question 3 on page 75. You only need to do parts (a)-(e), since you constructed added variable plots in Assn 3.
  3. Consider the Alaska data from Assn 3. For the linear model, construct the residual plots and interpret them. What does each plot tell you about the assumptions for the regression model? Identify observations that have unusually high leverage or influence, and do the outlier test for each point.
  4. Construct the residual plots for the quadratic model. Identify observations that have unusually high leverage or influence, and do the outlier test for each point. Based on your residual plots and the results from Assn 3, which model would you recommend?
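
The diagnostics named in these questions map onto a handful of base R functions. The sketch below uses a model fit to the built-in mtcars data purely as a stand-in, and the 2p/n leverage cutoff is a common rule of thumb rather than something this assignment prescribes.

```r
# Stand-in model on built-in data; replace with your Alaska fits.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))               # number of parameters, intercept included

# Residuals vs fitted values, and a normal probability plot of residuals
par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
qqnorm(resid(fit)); qqline(resid(fit))

# Leverage: flag points with hat value above 2p/n (rule of thumb)
h <- hatvalues(fit)
names(h)[h > 2 * p / n]

# Influence: largest Cook's distances
cd <- cooks.distance(fit)
head(sort(cd, decreasing = TRUE), 3)

# Outlier test: externally studentized residuals compared with a
# Bonferroni-adjusted t critical value (two-sided, familywise level 0.05)
t_i  <- rstudent(fit)
crit <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
names(t_i)[abs(t_i) > crit]
```
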
For in-class exam on Mar 19:

The file golf.csv contains information about a sample of 84 golf courses in the United States. Columns in the file are separated by commas, so you should use read.csv to read the data in.

The variables in the data set are:

    state: state where course is located
    name: name of golf course
    type: 1 if public, 0 otherwise
    yearbuilt: year course was built
    weekend: cost to play all 18 holes on the weekend
    backtee: yardage from farthest tee to hole
    rating: course rating (higher means a more challenging course)
    par: par for course (higher generally, but not always, means more challenging)
    cart: golf cart rental fee
    caddy: 1 if caddies are available, 0 otherwise
    pro: 1 if a golf professional is available, 0 otherwise


Gus Golfer wants to know if the cost to play a course on a weekend (the variable weekend) can be predicted from the course features yearbuilt, backtee, rating, par, cart, caddy, and pro.

Fit a regression model with weekend as response, and yearbuilt, backtee, rating, par, cart, caddy, and pro as predictors. Plot the data using pairs. Plot the residuals for this model. What do you see?

Now look at the model with the same predictors, but with log(weekend) as response. Plot the data using pairs, and also construct the added variable plots. Why might you choose the log transformation? Plot the residuals for this model. Does the transformation help? Evaluate influential points and outliers in this model.
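
An added variable plot can be built by hand from two auxiliary regressions, which also makes clear what it displays. The sketch below uses built-in mtcars data with a log-transformed response as a stand-in; the variables are placeholders, not the golf variables.

```r
# Stand-in model with a log response (built-in data, not golf.csv)
full <- lm(log(mpg) ~ wt + hp + qsec, data = mtcars)

# Added variable plot for wt:
#   vertical axis   = residuals of the response regressed on the OTHER predictors
#   horizontal axis = residuals of wt regressed on the OTHER predictors
e_y <- resid(lm(log(mpg) ~ hp + qsec, data = mtcars))
e_x <- resid(lm(wt ~ hp + qsec, data = mtcars))

plot(e_x, e_y,
     xlab = "wt adjusted for hp, qsec",
     ylab = "log(mpg) adjusted for hp, qsec")
avp_line <- lm(e_y ~ e_x)
abline(avp_line)

# The slope of this line equals the coefficient of wt in the full model
coef(avp_line)["e_x"]
coef(full)["wt"]
```

Because the two slopes agree, the scatter around the line in the added variable plot shows how much information the predictor carries after the other predictors are accounted for.
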

Some other models you may want to consider and compare (use response variable log(weekend) for all of these):
Gus also wants to predict the log(weekend) cost for two golf courses that have missing data on cost, but have the following values for the predictor variables:
variable     new course 1   new course 2
type         1              1
yearbuilt    1961           1990
backtee      6388           7150
rating       71             67
par          71             67.5
cart         13.5           28
caddy        1              1
pro          1              0

Find the predicted values and prediction intervals for these golf courses, using the model with log(weekend) as the response and yearbuilt, backtee, rating, par, cart, caddy, and pro as predictors.
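
Prediction intervals for new observations come from predict() with a newdata data frame whose column names match the predictors. In the sketch below both the golf data and the two new courses are simulated, illustrative values (not the exam values), so only the pattern of calls carries over.

```r
# Simulated stand-in for golf.csv; all values are made up.
set.seed(84)
n <- 84
golf <- data.frame(
  yearbuilt = sample(1900:2000, n, replace = TRUE),
  backtee   = round(rnorm(n, 6700, 300)),
  rating    = round(rnorm(n, 72, 2), 1),
  par       = sample(70:72, n, replace = TRUE),
  cart      = round(runif(n, 10, 30), 1),
  caddy     = rbinom(n, 1, 0.3),
  pro       = rbinom(n, 1, 0.8)
)
golf$weekend <- exp(1 + 0.0004 * golf$backtee + 0.03 * golf$cart +
                    rnorm(n, sd = 0.3))

fit <- lm(log(weekend) ~ yearbuilt + backtee + rating + par + cart + caddy + pro,
          data = golf)

# Two hypothetical new courses (illustrative values only)
new_courses <- data.frame(
  yearbuilt = c(1975, 2001),
  backtee   = c(6500, 7000),
  rating    = c(71.0, 73.5),
  par       = c(71, 72),
  cart      = c(15, 25),
  caddy     = c(0, 1),
  pro       = c(1, 1)
)

# 95% prediction intervals on the log scale
pi_log <- predict(fit, newdata = new_courses, interval = "prediction", level = 0.95)
pi_log

# Back-transformed endpoints give an interval for the cost itself
exp(pi_log)
```
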

You should bring the following printed output to class:
Also bring your laptop to class, with the data loaded. You may use R during class to perform additional analyses.

For Apr 2:
  1. Read Section 5.3 and Chapter 8 of the text. Note that while the methods listed in Section 8.2 are of historical interest, we will not be using them in this class. We will use the R^2, adjusted R^2 and Cp methods for model selection. These criteria can be calculated using the leaps library available from the R website.
  2. Do question 5 on page 87, regarding multicollinearity in the prostate data. Which variables appear to be involved in the multicollinearity?
  3. For the prostate data with lpsa as the response, determine the "best" model using each of the criteria R^2, adjusted R^2 and Cp. Include your plots for these criteria in the homework. Based on these criteria, which model would you recommend?
  4. Return to the data you collected from amazon.com in Assignment 1. Construct an indicator variable for the two subgenres, e.g., genre = 1 if British detectives and genre = 0 if hard-boiled. Now fit the full model predicting price from genre, copyright date, average customer review, and number of pages. Look at the residual plots, and investigate the Box-Cox method to see if a transformation might help (we did Box-Cox in the church data). If called for, transform the response and refit the model. Is there multicollinearity among the explanatory variables? What is the "best" subset of variables for predicting price, and what criterion did you use? Examine the residual plots and influence statistics of the model you decide to adopt, and interpret these. Is the genre variable significant after you adjust for the other covariates?
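
To see what the selection criteria measure, they can be computed directly for every subset of a small predictor set using only base R, as in the sketch below. This is a hand-rolled illustration on built-in mtcars data, not the leaps-based workflow used in class; the variance inflation factors at the end are one common way to look for multicollinearity.

```r
# Stand-in data: mpg as response, four candidate predictors
dat   <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]
preds <- setdiff(names(dat), "mpg")
n     <- nrow(dat)

full        <- lm(mpg ~ ., data = dat)
sigma2_full <- summary(full)$sigma^2     # error variance estimate from full model

# Every non-empty subset of the predictors
subsets <- unlist(
  lapply(seq_along(preds), function(k) combn(preds, k, simplify = FALSE)),
  recursive = FALSE
)

crit <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = dat)
  p   <- length(coef(fit))               # parameters, intercept included
  rss <- sum(resid(fit)^2)
  data.frame(model = paste(vars, collapse = " + "),
             p     = p,
             R2    = summary(fit)$r.squared,
             adjR2 = summary(fit)$adj.r.squared,
             Cp    = rss / sigma2_full - n + 2 * p)
}))
crit[order(crit$Cp), ][1:5, ]            # five models with the smallest Cp

# Variance inflation factors: 1 / (1 - R^2) from regressing each predictor
# on the remaining predictors
vif <- sapply(preds, function(v) {
  r2 <- summary(lm(reformulate(setdiff(preds, v), response = v), data = dat))$r.squared
  1 / (1 - r2)
})
vif
```

R^2 never decreases as predictors are added, so it is only useful for comparing models of the same size; adjusted R^2 and Cp penalize extra parameters.
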


For Apr 9:
  1. Read Chapter 14 of the text.
  2. Read in the fruitfly data, in file fruitfly.csv. The variables are described in fruitflyR.txt. For now, ignore the information on thorax and sleep. We want to study the effects of the number and types of females (in factor group) on longevity. Construct side-by-side boxplots of the data, and carry out the F test for the hypothesis H0: μ1 = ... = μ5, for the means of longev of the five groups. Give, and interpret, your p-value for the test. What do you conclude? For the boxplot, you may want to rename the group labels to be something more informative.
  3. Plot the residuals for these data. Do you see any unusual features?
  4. Do a factor plot of the five group means. Carry out a multiple comparisons analysis of the means of longev. Which pairs of means are significantly different at the familywise 0.05 level using Bonferroni? Using Tukey? Plot the familywise confidence intervals for the Tukey method, and construct the lines plot for the Tukey method.
  5. What do you conclude from your data analysis?
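
The one-way ANOVA and multiple comparison steps can be sketched with base R as follows. The data here are simulated with made-up group means and an arbitrary 25 observations per group; only the variable names group and longev follow the assignment.

```r
# Simulated stand-in for the fruitfly data; all values are made up.
set.seed(14)
flies <- data.frame(
  group  = factor(rep(paste0("g", 1:5), each = 25)),
  longev = rnorm(125, mean = rep(c(63, 65, 64, 57, 39), each = 25), sd = 15)
)

boxplot(longev ~ group, data = flies, ylab = "Longevity")

# Overall F test of H0: all five group means are equal
fit <- aov(longev ~ group, data = flies)
summary(fit)

# Residuals vs fitted group means
plot(fitted(fit), resid(fit), xlab = "Fitted group mean", ylab = "Residual")
abline(h = 0, lty = 2)

# Pairwise comparisons with Bonferroni-adjusted p-values
pairwise.t.test(flies$longev, flies$group, p.adjust.method = "bonferroni")

# Tukey familywise 95% confidence intervals for all pairwise differences
tuk <- TukeyHSD(fit, conf.level = 0.95)
tuk
plot(tuk)
```

A pair of means differs significantly under Tukey's method when its interval in the plot excludes zero.
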


For Apr 23:
  1. Read Chapter 15 of the text.
  2. Use the rats data from the "faraway" package for questions 2-6. After loading the package, and typing "library(faraway)", you can find the variable descriptions by typing "help(rats)". Plot the data using stripplot, plot.factor, and an interaction plot, and comment on what you see. Fit a 2-way ANOVA model with interaction to the data, and plot the residuals. What do you see in the residual plots?
  3. Use the Box-Cox method (see the church data example) to determine an appropriate transformation for the data.
  4. Plot the data again, with the transformation, using stripplot, plot.factor, and an interaction plot, and comment on what you see. Refit the ANOVA model using the transformed response, and plot the residuals again. What do you see?
  5. What is the overall F statistic for the full model? Which factors are significant?
  6. If the interaction is significant, do a multiple comparisons analysis of the 12 treatments that arise if you view this as a 1-way ANOVA. If the interaction is not significant, refit the model without the interaction term, and do a multiple comparisons analysis for each factor separately. Which treatment(s) and poison(s) are "best" if you want to have the longest survival time (take your transformation into account when answering this question)? Does the best treatment differ for different poisons? If you were a poisoned rat, which treatments would you want to avoid?
  7. A student who wanted to find out how to maximize the popcorn volume when making microwave popcorn did a replicated 2^3 experiment, with data in popcorn.txt. The factors were: butter (regular or extra), brand (Act II or generic), and time (2 minutes 40 seconds or using the preset popcorn button on the microwave). The response was volume of popcorn (ml). Do a cubeplot of these data. Set up the design matrix in standard form (with -1's and +1's), and determine which factors and interactions are statistically significant. Do a normal probability plot of the factor effects. Which factor settings would you recommend to maximize popcorn volume?
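
For the popcorn question, a design matrix in standard order with -1/+1 coding can be generated with expand.grid, as sketched below. The responses are simulated, two replicates per run is an assumption, and which physical setting is coded -1 or +1 is your choice; none of this reflects the contents of popcorn.txt.

```r
# 2^3 design in standard order: the first factor changes fastest
set.seed(23)
design <- expand.grid(butter = c(-1, 1), brand = c(-1, 1), time = c(-1, 1))
design <- design[rep(1:8, each = 2), ]   # two replicates per run (assumed)

# Made-up response for illustration only
design$volume <- with(design,
  1500 + 100 * brand + 150 * time + 60 * brand * time + rnorm(16, sd = 50))

# Full factorial model: main effects and all interactions
fit <- lm(volume ~ butter * brand * time, data = design)
summary(fit)$coefficients                # t tests for each effect

X <- model.matrix(fit)                   # intercept plus 7 columns of -1/+1

# With -1/+1 coding, each factor effect is twice its regression coefficient
eff <- 2 * coef(fit)[-1]
eff

# Normal probability plot of the seven factor effects
qq <- qqnorm(eff, main = "Normal plot of factor effects")
text(qq$x, qq$y, labels = names(eff), pos = 3, cex = 0.7)
```

Effects that fall well off the straight-line pattern formed by the others in the normal plot are the ones likely to be real.
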


For Final exam:
  1. See evap.pdf on restricted web site. Note that in the first version I posted, I had a typo for model 1. For the first model, you should predict evap from variables 1-10. This is now (as of May 1) fixed in the file evap.pdf.



Last Modified on 1 May 2009
Copyright © 2009, Sharon L. Lohr