STP 429, Spring 2010: Experimental Statistics

ASSIGNMENTS AND READINGS

For Jan 25:

  1. Download the R statistical computing package from www.r-project.org to your laptop.
  2. Review confidence intervals and hypothesis tests. Useful online references, if you no longer have your first statistics book, are The Little Handbook of Statistical Practice and the Rice Virtual Lab in Statistics. You may also find these Wikipedia articles useful in your review: "Student's t-test" and "Confidence interval."
  3. Begin reading "An Introduction to R", available at http://www.r-project.org/ under Manuals. This file is also available through the Help menu of the R program.
  4. Other useful references on using R for statistics:
    1. Regression Using R
    2. Computing Primer for Applied Linear Regression, Third Edition Using R and S-Plus by S. Weisberg. This guide is tied to Weisberg's regression book, but it goes through many regression commands in R.
    3. Mark Gardener's Using R for Statistical Analysis
For Feb 1:
  1. Read Chapters 1 and 2 of the text.
  2. Use www.blockbuster.com for this exercise. Pick two genres of movies that you are interested in (for example, horror movies and romance movies). For each genre, take a simple random sample of size 20 of the movies that Blockbuster rents in that genre (you'll have 40 movies altogether). Note that the number of movies listed on each page is limited, so when you take a sample of the population using the sample function in R you need to make sure you select the movie from the appropriate page that corresponds to the movie in your sample. For each movie, record the theatrical running time in minutes, release date, MPAA rating (e.g. PG13), average customer rating, and number of customer ratings. Now compare the theatrical running times for the two genres you chose. Construct histograms, boxplots, and normal probability plots for your two samples. Do a two-sample t test for equality of means, and give a 95% confidence interval for the difference in means.

    Turn in your annotated R code (including the details on how you selected your samples), plots, and relevant output. Make sure you remove unnecessary material from your assignment; I do not want to see mistyped commands, computer mistakes, or pages that are irrelevant to the assignment. Do the assumptions for a two-sample t test appear to be met for these data? State the assumptions for a two-sample t test and say why the data do or do not appear to meet them.

    Make sure you save these data for a future assignment.
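
The sampling and comparison steps for this assignment can be sketched as follows. This is only an illustration under stated assumptions: the listing sizes (430 and 275 titles), the 25 titles per page, and the running times are all made up, and the helper locate() is a hypothetical name, not something from the text.

```r
# Hypothetical setup: genre A lists 430 titles and genre B lists 275,
# shown 25 per page. These counts are invented for illustration.
set.seed(429)                      # record the seed so the sample can be reproduced
per_page <- 25
pick_a <- sort(sample(430, 20))    # simple random sample without replacement
pick_b <- sort(sample(275, 20))

# Translate each sampled index into a page number and a position on that page.
locate <- function(idx) {
  data.frame(index = idx,
             page = (idx - 1) %/% per_page + 1,
             position = (idx - 1) %% per_page + 1)
}
where_a <- locate(pick_a)
where_b <- locate(pick_b)

# Simulated running times (minutes) stand in for the values you record.
time_a <- rnorm(20, mean = 95, sd = 12)
time_b <- rnorm(20, mean = 110, sd = 15)

# Graphical checks of shape, spread, and normality.
hist(time_a)
hist(time_b)
boxplot(list(A = time_a, B = time_b), ylab = "Running time (min)")
qqnorm(time_a); qqline(time_a)
qqnorm(time_b); qqline(time_b)

# t.test() does the Welch (unequal variance) test by default;
# add var.equal = TRUE for the pooled two-sample t test.
tt <- t.test(time_a, time_b)
tt$conf.int                        # 95% CI for the difference in means
```

Setting the seed and keeping the page and position table is one way to document how the sample was selected.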
For Feb 8:
  1. Read Chapter 3 of the text.
  2. Install the "faraway" package in R. Go to the Packages menu and select "Install package(s)." Select a CRAN mirror in the U.S., then select faraway from the menu that appears.
  3. Type "library(faraway)" to load the library and then type "help(prostate)" to learn about the prostate data. Now consider using lcavol as the predictor variable (x), and lpsa as the response variable (y). Plot y vs. x, and fit the least squares regression line. Give the regression line, and display it on the data plot. Does the model appear to be a good fit? Give a 95% confidence interval for the slope, and interpret it. What is R² for this model, and what does it mean in the context of this data set?
  4. Use the movie data you collected for Assignment 1. Perform a regression analysis predicting running time for the movie (y) as a function of the release year (x). Plot y vs. x, and fit the least squares regression line. Give the regression line, and display it on the data plot. Does the model appear to be a good fit? Give a 95% confidence interval for the slope, and interpret it. Is your slope significantly different from 0? What is R² for this model, and what does it mean?
  5. Describe an incident of regression to the mean that you have observed and explain why the phenomenon might plausibly be regression to the mean.
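
A minimal sketch of the simple linear regression steps asked for above. The data are simulated and the names x and y are placeholders; with the prostate data you would substitute lcavol and lpsa.

```r
set.seed(1)
x <- runif(50, 0, 4)                       # hypothetical predictor
y <- 1.5 + 0.7 * x + rnorm(50, sd = 0.8)   # hypothetical response

fit <- lm(y ~ x)               # least squares regression line
plot(x, y)
abline(fit)                    # display the fitted line on the data plot

coef(fit)                                  # intercept and slope
ci_slope <- confint(fit, "x", level = 0.95) # 95% confidence interval for the slope
r2 <- summary(fit)$r.squared               # R-squared
```

In simple linear regression R² equals the squared correlation between x and y, which gives a quick check on the output.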
For Feb 17 (in class):
  1. Read Chapter 4 in the text.
  2. Load the prostate data from the faraway library. Use lpsa as the response variable (y). Now fit the model with all predictors. Construct all pairwise plots, and the added variable plots for the data. Compare the added variable plot for the predictor variable lcavol with the plot of lpsa vs. lcavol. Draw the line with slope from the multiple regression on the AVP. Draw two lines on the plot of lpsa vs. lcavol: (1) the regression line for the regression of lpsa on lcavol, without the other predictors, and (2) a line with the slope for lcavol from the multiple regression which goes through the point (mean of lcavol, mean of lpsa). What information do you get from each plot? How do the slopes of the lines compare? How would you explain the difference in the plots to a medical doctor?
  3. In your full regression model, which predictors are significant at the .05 level?
  4. In the full regression model, compute a 95% confidence interval for the slope of lcavol. How does this relate to the p-value from a hypothesis test that the slope is 0?
  5. Perform a hypothesis test of whether the slopes of pgg45 and age are both 0. Give your F statistic and interpret your p-value.
  6. Consider an individual with lcavol=1.5, lweight=3.5, age=60, lbph=.28, svi=0, lcp=-.7, gleason=7, and pgg45=22. Give a 95% prediction interval for the response. What happens to the width of your prediction interval if you change pgg45 to 95? Why does it change? (Hint: Look at your pairs plot.)
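
The added variable plot, the test that two slopes are both 0, and the prediction interval can be sketched as below. The data frame d and the variables x1, x2, x3 are simulated stand-ins, not the prostate data, and the added variable plot is built by hand from two residual vectors rather than with a packaged function.

```r
set.seed(2)
n <- 97
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$x2 <- d$x2 + 0.6 * d$x1                        # make x1 and x2 correlated
d$y <- 2 + 0.6 * d$x1 + 0.3 * d$x2 + rnorm(n, sd = 0.7)

full <- lm(y ~ x1 + x2 + x3, data = d)

# Added variable plot for x1: residuals of y on the other predictors
# plotted against residuals of x1 on the other predictors.
ry <- resid(lm(y ~ x2 + x3, data = d))
rx <- resid(lm(x1 ~ x2 + x3, data = d))
avp <- lm(ry ~ rx)
plot(rx, ry)
abline(avp)                                      # slope equals the x1 slope in the full model

# Plot of y vs. x1 with (1) the simple regression line and (2) a line with
# the multiple regression slope through (mean of x1, mean of y).
simple <- lm(y ~ x1, data = d)
b <- coef(full)[["x1"]]
plot(d$x1, d$y)
abline(simple)
abline(a = mean(d$y) - b * mean(d$x1), b = b, lty = 2)

# F test that the slopes of x2 and x3 are both 0: reduced vs. full model.
ftest <- anova(simple, full)

# 95% prediction interval for a new individual (hypothetical values).
new <- data.frame(x1 = 1.5, x2 = 0.5, x3 = 0)
pi95 <- predict(full, newdata = new, interval = "prediction", level = 0.95)
```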
For Feb 22:
  1. Review the relevant sections of the text: pp. 16-18 for R², Sections 3.1-3.2 for F and t tests (pay special attention to the second and third paragraphs on page 30), Section 3.4 for confidence intervals for the slopes, Section 3.5 for confidence intervals for the mean response and prediction intervals for a future response, Sections 3.6-3.8 for interpreting regression models, page 72 for added variable (also called partial regression) plots (we are not doing partial residual plots in this class).
  2. Do question 1 (a, b, f) on page 23 and 3 on page 51. Also construct and interpret added variable plots for these data.
  3. For your movies data, fit a regression model predicting running time from the other variables you measured (genre, release date, average customer rating, and number of customer ratings). (For genre, include a variable that takes on the value 1 if in your first genre and 0 if in your second genre.) Interpret your regression coefficients. What is your F statistic and p-value for the null hypotheses that all slopes are 0? What is R²? Interpret these values. Do any of the variables you collected help explain the variability in running time?
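
One way to code the genre indicator and read off the overall F test is sketched here. The movies data frame is simulated, and the column names (genre, year, rating, runtime, g1) are assumptions for illustration only.

```r
set.seed(3)
movies <- data.frame(genre = rep(c("horror", "romance"), each = 20),
                     year = sample(1970:2009, 40, replace = TRUE),
                     rating = round(runif(40, 1, 5), 1))
movies$runtime <- 100 - 8 * (movies$genre == "horror") + rnorm(40, sd = 10)

# Indicator variable: 1 for the first genre, 0 for the second.
movies$g1 <- as.numeric(movies$genre == "horror")

fit <- lm(runtime ~ g1 + year + rating, data = movies)
s <- summary(fit)
s$coefficients            # slopes, standard errors, t statistics, p-values
s$r.squared               # R-squared

# Overall F statistic for H0: all slopes are 0, and its p-value.
fstat <- s$fstatistic
p_overall <- pf(fstat[["value"]], fstat[["numdf"]], fstat[["dendf"]],
                lower.tail = FALSE)
```

The coefficient of g1 estimates the difference in mean running time between the two genres, holding the other predictors fixed.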
For Mar 1:
  1. Review the relevant sections of the text: pp. 53-59 for residual plots, pp. 64-71 for influential observations and outliers.
  2. Do question 2 on page 74.
  3. For your movies data, use the full model predicting running time from genre, release date, average customer rating, and number of customer ratings that you fit for the previous homework. Identify outliers and influential points in your data set.
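
A sketch of the usual outlier and influence diagnostics in base R. The data are simulated with one unusual point planted on purpose. The 2p/n leverage cutoff and the Bonferroni cutoff for studentized residuals are common rules of thumb, used here as assumptions rather than as the only acceptable choices.

```r
set.seed(4)
n <- 40
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[n] <- 6; y[n] <- 0                 # plant one high-leverage, poorly fitted point

fit <- lm(y ~ x)
p <- length(coef(fit))               # number of regression parameters

lev <- hatvalues(fit)                # leverages
rst <- rstudent(fit)                 # externally studentized residuals
cd <- cooks.distance(fit)            # Cook's distances

high_lev <- which(lev > 2 * p / n)   # rule-of-thumb leverage cutoff

# Bonferroni cutoff for testing the largest |studentized residual|.
cutoff <- qt(1 - 0.05 / (2 * n), df = n - p - 1)
outliers <- which(abs(rst) > cutoff)

most_influential <- which.max(cd)    # observation with the largest Cook's distance

plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0)
```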
For in-class exam on Mar 8:
  1. The exam covers material in Chapters 1-4 of the text.
  2. Bring the printouts from your data analysis below to the exam. Also bring your laptop to class, with the data loaded. You may use R during class to perform additional analyses. You may bring the book and class notes to the exam, but turn the wifi off on your laptop.
  3. Read the blog on “What Makes Happy States?” by Richard Florida, at http://www.creativeclass.com/creative_class/2009/03/13/what-makes-happy-states/ (you may want to print this out and bring it to class on March 8 in case I ask questions about statements made in the column). Also read the Wikipedia article on The Ecological Fallacy (the book describes the ecological fallacy on pages 151-153), and think about how this concept might be relevant to the data considered here.
  4. The data in the file wellbeing.txt in the restricted web page are collected from a variety of sources, including the sources mentioned in Florida’s column, census.gov, statemaster.com, and statehealthfacts.org. Note that for some of the x variables the values in the data set are from a different year, and so are not exactly the same as Florida's data. The columns are as follows:

    Column  Name       Description
    1       state      State name
    2       abbrev     State abbreviation
    3       wellbeing  Average value of well-being index for persons surveyed in the state. The well-being index is computed from 40 questions asked about the person’s individual life situation, work environment, access to life’s necessities such as food, healthy behavior, physical health status, and emotional health status. Higher values are said to indicate higher “well-being.”
    4       income     Per capita personal income, in 2008 (dollars). This is the total personal income for the state divided by the number of residents.
    5       HHinc      Median household income, in 2007 (dollars)
    6       Walmart    Number of Walmart stores in state per 100,000 people
    7       foreign    Percent of population that is foreign born
    8       unemp      Unemployment rate, 2006
    9       obesity    Percent of population who are obese
    10      smoking    Adult smoking rate, in percent
    11      AIDS       AIDS case rate per 100,000 population, 2007

    The data are in .csv (comma-delimited) format. You can read them in R using the command

    well <- read.csv(file.choose(), header = TRUE)

    You can create a data structure with only the numeric columns by using the command

    welldata <- well[,3:11]

    This makes it easier to construct plots and compute correlations.

  5. Fit Model 1, with wellbeing as response, and income, HHinc, Walmart, foreign, unemp, obesity, smoking, and AIDS as predictors. Plot the data using pairs, and compute correlations among the variables. Look at the added variable plots for income, HHinc, obesity, and smoking. Plot the residuals and evaluate influential points and outliers in this model.

    Some other models you should consider and compare (use response variable wellbeing for all of these, and test the reduced models against the full model):
For Mar 10:

Professor Richard De Veaux is giving a special lecture on campus during our class meeting time. Instead of meeting in our regular classroom, you will attend the lecture and write about it for the next homework. The lecture is "Data Mining: Fool's Gold? Or the Mother Lode?" in Memorial Union Room 230 at 2pm on March 10.

For Mar 22:

  1. Write a short paragraph about Professor De Veaux's use of regression in data mining.
  2. Do question 5 on page 87.
  3. Use the Box-Cox method to explore possible transformations for the teengamb data. Note that some values of gamble are 0, so you should consider using response gamblep1 = gamble + 1 before exploring the transformations. Which transformation does the Box-Cox method lead you to try? Plot the residuals, and find outliers, high leverage points, and points with high Cook's distance, using the transformed data. Are the same points outliers and influential as in your homework for March 1?
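
A minimal sketch of the Box-Cox step with a shifted response, using boxcox() from the MASS package. The data frame dat is simulated and only imitates a response that contains zeros; it is not the teengamb data.

```r
library(MASS)   # boxcox() is in MASS, which ships with standard R installations

set.seed(5)
n <- 47
dat <- data.frame(x = runif(n, 0, 10))
# Simulated nonnegative response, truncated at 0 so that zeros can occur.
dat$gamble <- pmax(0, round(exp(0.3 * dat$x + rnorm(n, sd = 0.6)) - 2, 1))
dat$gamblep1 <- dat$gamble + 1           # shift so every response is strictly positive

fit <- lm(gamblep1 ~ x, data = dat, y = TRUE)

# Profile log-likelihood over a grid of lambda values (no plot drawn here;
# set plotit = TRUE to see the curve and its confidence interval).
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]      # lambda maximizing the log-likelihood

# Approximate 95% interval: lambdas within qchisq(0.95, 1)/2 of the maximum.
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, 1) / 2])
```

In practice you would round lambda to a convenient value inside the interval (for example 0 for the log, or 0.5 for the square root) before refitting.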

In-class activities, Mar 24:

  1. In your previous homework, you identified problems with multicollinearity in the prostate data (this is true even though the VIFs were small: if any of the diagnostics indicates multicollinearity, we say it's there). Now let's apply model-selection methods to the prostate data.
  2. For the prostate data with lpsa as the response, determine the "best" model using each of the criteria R², adjusted R², and Cp. Construct plots for each criterion vs. p. Based on these criteria, which model would you recommend?
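
The three criteria can also be computed directly in base R by fitting every subset of predictors, which may help in seeing what the leaps functions report. This is a sketch on simulated data with placeholder names x1 to x4; it is practical only for a small number of predictors.

```r
set.seed(6)
n <- 60
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 1 + 1.2 * d$x1 - 0.8 * d$x2 + rnorm(n)
preds <- c("x1", "x2", "x3", "x4")

full <- lm(y ~ ., data = d)
sigma2 <- summary(full)$sigma^2          # error variance estimate from the full model

# Every nonempty subset of the predictors.
subsets <- unlist(lapply(seq_along(preds),
                         function(k) combn(preds, k, simplify = FALSE)),
                  recursive = FALSE)

crit <- do.call(rbind, lapply(subsets, function(v) {
  fit <- lm(reformulate(v, response = "y"), data = d)
  s <- summary(fit)
  p <- length(v) + 1                     # number of parameters, including the intercept
  data.frame(model = paste(v, collapse = "+"), p = p,
             r2 = s$r.squared, adjr2 = s$adj.r.squared,
             cp = sum(resid(fit)^2) / sigma2 + 2 * p - n)   # Mallows' Cp
}))

plot(crit$p, crit$r2, xlab = "p", ylab = "R-squared")
plot(crit$p, crit$adjr2, xlab = "p", ylab = "Adjusted R-squared")
plot(crit$p, crit$cp, xlab = "p", ylab = "Cp")
abline(0, 1)                             # models near the line Cp = p fit well

crit[which.max(crit$adjr2), ]            # best model by adjusted R-squared
crit[which.min(crit$cp), ]               # best model by Cp
```

By construction the full model always has Cp equal to p, so the comparison of interest is among the smaller models.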

For Mar 29:

  1. Model selection methods are discussed in the text in Chapter 8. Note that while the methods listed in Section 8.2 are of historical interest, we will not be using them in this class. We primarily use the criterion-based procedures in Section 8.3: R², adjusted R² (R_a²), and Cp. These criteria can be calculated using the leaps library available from the R website.
  2. As you might have noticed from the coefficient sign changes, multicollinearity is an issue for the data set you looked at in exam 1. In this assignment we will examine the multicollinearity, and use criterion-based methods to choose a model.
  3. Calculate the condition numbers and variance inflation factors for the well-being data. Also look at the correlations among the explanatory variables. What are the indications of multicollinearity in this data set?
  4. Produce plots of R², adjusted R², and Cp for each value of p. Using these criteria, which model would you adopt for the data? Make sure you examine the residual plots and influence measures for candidate models (you do not have to include these in the homework, but tell me what information they give you).
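
The multicollinearity diagnostics in item 3 can be computed by hand in base R, as sketched below on simulated predictors (x2 is deliberately built to be nearly a copy of x1). The condition numbers here are taken from the eigenvalues of X'X for the unscaled predictor matrix without the intercept column; that convention is an assumption, and other definitions scale the columns first.

```r
set.seed(7)
n <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)     # nearly collinear with x1
x3 <- rnorm(n)
X <- cbind(x1, x2, x3)            # predictor matrix, no intercept column

round(cor(X), 2)                  # pairwise correlations among the predictors

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others.
vif_manual <- sapply(seq_len(ncol(X)), function(j) {
  1 / (1 - summary(lm(X[, j] ~ X[, -j]))$r.squared)
})
names(vif_manual) <- colnames(X)

# Equivalent shortcut: the diagonal of the inverse correlation matrix.
vif_check <- diag(solve(cor(X)))

# Condition numbers: sqrt(largest eigenvalue / each eigenvalue) of X'X.
ev <- eigen(crossprod(X))$values
cond <- sqrt(ev[1] / ev)
```

Large VIFs for x1 and x2 together with a large final condition number point to the same near-linear dependence.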

For Apr 12:

  1. Read Chapter 14 of the text.
  2. Read in the fruitfly data, in file fruitfly.csv. The variables are described in fruitflyR.txt. For now, ignore the information on thorax and sleep. We want to study the effects of the number and types of females (in factor group) on longevity. Construct side-by-side boxplots of the data, and carry out the F test for the hypothesis H0: μ1 = ... = μ5, for the means of longev of the five groups. Give, and interpret, your p-value for the test. What do you conclude? For the boxplot, you may want to rename the group means to be something more informative.
  3. Plot the residuals for these data. Do you see any unusual features?
  4. Do a factor plot of the five group means. Carry out a multiple comparisons analysis of the means of longev. Which pairs of means are significantly different at the familywise 0.05 level using Bonferroni? Using Tukey? Plot the familywise confidence intervals for the Tukey method, and construct the lines plot for the Tukey method.
  5. What do you conclude from your data analysis?
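
A sketch of the one-way ANOVA and multiple comparisons steps using base R functions. The group labels and longevity values are simulated for illustration; they are not the contents of fruitfly.csv.

```r
set.seed(8)
group <- factor(rep(paste0("g", 1:5), each = 25))
longev <- rnorm(125, mean = rep(c(64, 63, 65, 57, 39), each = 25), sd = 15)

boxplot(longev ~ group, ylab = "Longevity")      # side-by-side boxplots

fit <- aov(longev ~ group)
tab <- summary(fit)[[1]]                         # ANOVA table with F and p-value
pval <- tab[["Pr(>F)"]][1]

plot(fitted(fit), resid(fit), xlab = "Fitted", ylab = "Residuals")
abline(h = 0)

# Bonferroni-adjusted pairwise t tests (pooled standard deviation).
bonf <- pairwise.t.test(longev, group, p.adjust.method = "bonferroni")

# Tukey familywise 95% confidence intervals for all pairwise differences.
tuk <- TukeyHSD(fit, conf.level = 0.95)
plot(tuk)                                        # intervals that exclude 0 are significant
```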


For Apr 19:
  1. Read Chapter 15 of the text.
  2. Use the rats data from the "faraway" package for questions 2-6. After loading the package, and typing "library(faraway)", you can find the variable descriptions by typing "help(rats)". Plot the data using stripplot, plot.factor, and an interaction plot, and comment on what you see. Fit a 2-way ANOVA model with interaction to the data, and plot the residuals. What do you see in the residual plots?
  3. Use the Box-Cox method (see the church data example) to determine an appropriate transformation for the data.
  4. Plot the data again, with the transformation, using stripplot, plot.factor, and an interaction plot, and comment on what you see. Refit the ANOVA model using the transformed response, and plot the residuals again. What do you see?
  5. What is the overall F statistic for the full model? Which factors are significant?
  6. If the interaction is significant, do a multiple comparisons analysis of the 12 treatments that arise if you view this as a 1-way ANOVA. If the interaction is not significant, refit the model without the interaction term, and do a multiple comparisons analysis for each factor separately. Which treatment(s) and poison(s) are "best" if you want to have the longest survival time (take your transformation into account when answering this question)? Does the best treatment differ for different poisons? If you were a poisoned rat, which treatments would you want to avoid?
  7. A student who wanted to find out how to maximize the popcorn volume when making microwave popcorn did a replicated 2³ experiment, with data in popcorn.txt. The factors were: butter (regular or extra), brand (Act II or generic), and time (2 minutes 40 seconds or using the preset popcorn button on the microwave). The response was volume of popcorn (ml). Do a cubeplot of these data. Set up the design matrix in standard form (with -1's and +1's), and determine which factors and interactions are statistically significant. Do a normal probability plot of the factor effects. Which factor settings would you recommend to maximize popcorn volume?
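
The design matrix for a replicated 2³ experiment in -1/+1 coding can be set up as sketched here. The volumes are simulated, two replicates are assumed, and which physical setting is coded -1 or +1 is an arbitrary choice; none of this comes from popcorn.txt.

```r
set.seed(9)
# Eight runs in standard order: the first factor changes fastest.
design <- expand.grid(butter = c(-1, 1), brand = c(-1, 1), time = c(-1, 1))
design <- design[rep(1:8, times = 2), ]          # two replicates of each run
rownames(design) <- NULL

# Simulated response (ml) with a time effect, a brand effect, and one interaction.
design$volume <- 1500 + 60 * design$time - 40 * design$brand +
  30 * design$brand * design$time + rnorm(16, sd = 25)

# Full factorial model: main effects and all interactions.
fit <- lm(volume ~ butter * brand * time, data = design)
X <- model.matrix(fit)                           # design matrix with -1's and +1's
summary(fit)$coefficients                        # t tests for each effect

# With -1/+1 coding, each factor effect is twice its regression coefficient.
effects <- 2 * coef(fit)[-1]

# Normal probability plot of the seven factor effects.
qq <- qqnorm(effects)
text(qq$x, qq$y, labels = names(effects), pos = 4)
```

Because the columns of the design matrix are orthogonal, each effect is estimated independently of the others.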

Exam 2 (take-home). Due Apr 28

  1. Read the paper Effects of Smoking on Aging of Photoprotected Skin by Yolanda R. Helfrich et al., Archives of Dermatology, 2007, pp. 397-402. You should be able to download the pdf file directly from a computer connected to ASU wifi. Write a critique of the statistical methods used in the paper. Include a discussion of the research goals, experimental design, model selection methods, and model diagnostics (e.g. residual analysis and identification of influential points) used by the authors. Discuss the validity of the authors' conclusions. What could they have done to improve the design or analysis of this experiment?
  2. Read the material on Experiment 1 in the paper Disgust as Embodied Moral Judgment by Simone Schnall, Jonathan Haidt, Gerald L. Clore and Alexander H. Jordan, Personality and Social Psychology Bulletin, 2008, pp. 1096-1109. Experiment 1 is discussed on pages 1096-1099 with conclusions on pages 1105-1107. Write a critique of the statistical methods used in this paper, including a discussion of the research design, criteria for inclusion or exclusion in study, the analysis methods, the display of the results, and model diagnostics. Discuss the validity of the authors' conclusions. What could they have done to improve the design or analysis of this experiment?
  3. Both of your critiques should be typed and turned in at the beginning of class on April 28. You should have somewhere between 5 and 10 pages (double spaced) total.

Exam 3 (final exam).

In class on Monday, May 10, 12:10-2:00 pm. Bring your output from the data analysis to the exam. The data analysis is described in "STP 429 Final Exam Instructions" on the restricted page of the website.


Last Modified on 21 Apr 2010
Copyright © 2010, Sharon L. Lohr