Psychology 340: Multiple Regression

Psychology 340 Syllabus
Statistics for the Social Sciences

Illinois State University
J. Cooper Cutting
Fall 2002

Multiple Regression

The General Linear Model
Hypothesis testing with Multiple regression
- R²
- The ANOVA table
- The coefficients
- Comparing multiple models
Using SPSS for multiple regression

The General Linear Model

Last time we introduced the General Linear Model for bivariate regression (regression with only two variables). Now we will learn about regression with more than two variables (multiple regression). That is, we'll still be predicting one variable (Y), but now we'll use several explanatory variables (e.g., X₁, X₂, and X₃).

For bivariate regression we stated the model as:
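In the usual notation (an assumption here, using B for the parameters to match the rest of these notes and ε for the residual), the bivariate model can be written as:

```latex
\[
\text{DATA} = \text{FIT} + \text{RESIDUAL}
\qquad\Longrightarrow\qquad
Y = B_0 + B_1 X + \varepsilon
\]
```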

For multiple regression the FIT portion of the model gets more parts. For each additional explanatory variable we add a new Beta (the Betas are often called parameters).
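Written out for p explanatory variables, a standard form of the multiple regression model (the notation is an assumption, chosen to match the B and X symbols used in these notes) is:

```latex
\[
Y = \underbrace{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_p X_p}_{\text{FIT}}
    + \underbrace{\varepsilon}_{\text{RESIDUAL}}
\]
```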

Note: now B₁ is no longer interpreted simply as the slope of the line. Rather, it is a measure of how much its associated explanatory variable (X₁) contributes to the model predicting the response variable (Y).

Hypothesis testing with multiple regression

With multiple regression, it is typical to examine several models to see which set of variables offers the best prediction. Each model may come with several hypothesis tests. Basically, each model will have an R², an ANOVA result that tests the overall model, and a t-test result for each explanatory variable (and for the intercept, although this is still not usually of theoretical interest).

Squared multiple correlation (R²) is still a measure of how much of the variance in the response variable (Y) can be accounted for by the explanatory variables (now, X₁, X₂, ..., X_p). Typically it'll be the first result that you examine when comparing different models. Generally, the higher the R² the better the model. This gets balanced in practice with a parsimony principle, which states that the simpler the model (the fewer the explanatory variables), the better. So when comparing models, these two factors may trade off, and the researcher needs to decide how much of a change in R² is needed to pick a more complex model over a simpler one.

The computation of R² is:
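A standard way to write this computation, assuming the usual sums-of-squares labels (SSM for the model sum of squares and SST for the total sum of squares; the labels are assumptions rather than notation taken from these notes):

```latex
\[
R^2 = \frac{SSM}{SST}
    = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}
\]
```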

In addition to the descriptive statistic R², there is a statistical test of the overall model. The null hypothesis that the ANOVA is testing is that all of the Betas (except for the intercept) are equal to zero.

H₀: B₁ = B₂ = ... = B_p = 0

The alternative is that at least one beta is not equal to 0.

Here are the formulas that go into the different components of the ANOVA. For this class you won't have to do any of these computations by hand (so the table below is just for those who want to know more).
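For those who do want to see the pieces, here is a minimal Python sketch of the standard ANOVA quantities for a regression with an intercept. The data values are made up for illustration, and the helper names (`solve`, `fit_ols`) and the SSM/SSE/SST labels are assumptions; nothing here comes from the course data files or from SPSS output.

```python
# Made-up illustrative data (not from height.sav or CSDATA.sav).
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3, 4, 8, 7, 12, 11, 15, 16]


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination (partial pivoting)."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols(predictors, y):
    """Least-squares coefficients [B0, B1, ...] via the normal equations."""
    cols = [[1.0] * len(y)] + [list(map(float, p)) for p in predictors]
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    return solve(xtx, xty)


predictors = [x1, x2]
coefs = fit_ols(predictors, y)
n, p = len(y), len(predictors)

y_bar = sum(y) / n
y_hat = [coefs[0] + sum(b * col[i] for b, col in zip(coefs[1:], predictors))
         for i in range(n)]

ssm = sum((yh - y_bar) ** 2 for yh in y_hat)            # model sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error (residual) sum of squares
sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares

dfm, dfe, dft = p, n - p - 1, n - 1   # degrees of freedom
msm, mse = ssm / dfm, sse / dfe       # mean squares
f_stat = msm / mse                    # tests H0: all Betas except the intercept are 0
r_squared = ssm / sst

print(f"SSM={ssm:.3f} (df={dfm})  SSE={sse:.3f} (df={dfe})  SST={sst:.3f} (df={dft})")
print(f"F={f_stat:.3f}  R^2={r_squared:.3f}")
```

The sums of squares add up (SSM + SSE = SST), and so do the degrees of freedom. Turning F into a p-value needs the F distribution with (DFM, DFE) degrees of freedom, which this sketch leaves to a table or a statistics package.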

In addition to the overall ANOVA result, the statistical analysis of each model will include individual t-tests for each of the Betas (there will be one for each explanatory variable in the model). Unlike in bivariate regression (with only one explanatory variable, X) the Beta is no longer simply the slope of a line. Instead, the Beta should be thought of as a weighting of how much its paired explanatory variable contributes to the overall model. That is, it tells us whether the explanatory variable actually does any "explaining."
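As a rough illustration of where those t values come from, the sketch below computes each Beta, its standard error, and the resulting t statistic for a model with an intercept. The data are made up, and the function names (`solve`, `regression_tests`) are hypothetical helpers written for this example, not anything SPSS exposes.

```python
import math

# Made-up illustrative data (not from the course data files).
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
y = [3, 4, 8, 7, 12, 11, 15, 16]


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination (partial pivoting)."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def regression_tests(predictors, y):
    """Return (coefficients, t statistics, overall F) for an OLS fit with intercept."""
    n, p = len(y), len(predictors)
    cols = [[1.0] * n] + [list(map(float, c)) for c in predictors]
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    coefs = solve(xtx, xty)

    y_bar = sum(y) / n
    y_hat = [sum(b * col[i] for b, col in zip(coefs, cols)) for i in range(n)]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ssm = sum((yh - y_bar) ** 2 for yh in y_hat)
    mse = sse / (n - p - 1)

    t_stats = []
    for j, b_j in enumerate(coefs):
        unit = [1.0 if k == j else 0.0 for k in range(p + 1)]
        inv_jj = solve(xtx, unit)[j]              # j-th diagonal entry of (X'X)^-1
        t_stats.append(b_j / math.sqrt(mse * inv_jj))  # t = B_j / SE(B_j)
    f_stat = (ssm / p) / mse
    return coefs, t_stats, f_stat


coefs, t_stats, f_stat = regression_tests([x1, x2], y)
for name, b, t in zip(["intercept", "x1", "x2"], coefs, t_stats):
    print(f"{name}: B={b:.3f}, t={t:.3f} (df={len(y) - 3})")

# With a single explanatory variable, the ANOVA F equals the squared t of its Beta.
_, t_one, f_one = regression_tests([x1], y)
print(f"one predictor: t^2={t_one[1] ** 2:.3f}, F={f_one:.3f}")
```

Each t statistic has n - p - 1 degrees of freedom. Converting it to a p-value requires the t distribution, which is left out of this sketch.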

Using SPSS to perform multiple regression analyses

Multiple regression analyses in SPSS use essentially the same procedures that we used for Bivariate regression, except now we will add more than one independent variable.

To review:

Under the "Analyze" menu select "regression".
Under the "regression" submenu select "linear".
Enter your dependent (response) variable and your independent (explanatory) variables into the appropriate fields.
To follow along with the example in SPSS, you may download the height.sav datafile.
Let's start with a simple bivariate regression to predict height based on a person's average calcium intake in their first 5 years.
The output looks like this:
Rather than always present all of this output, I'll just report a summary of the results for different models. So for the output above, the summary would be:
Let's compare this model against another bivariate model. Let's predict height with weight. The summary of the results is as follows:
Compare the two models. Both are significant (both have significant ANOVAs). That means that both weight and calcium intake can predict a significant portion of the variability in height. The first model (calcium) accounts for 16.8% of the variance. The second model (weight) accounts for 63.1% of the variance in height. This suggests that weight is a better predictor of height than calcium intake in the first five years.
Now let's look at a multiple regression model that predicts height with both weight and calcium intake. The model summary is:
There are several things to note.
- The first is that there are two t-test results for Beta parameters, one for the weight variable and one for the calcium variable. In this model, both are statistically significant contributors to the model predicting height (although do note that calcium is only significant at the 0.05 level now). This means that both weight and calcium are contributing to the model.
- Overall the model is accounting for 67.0% of the variance in height. This is better than either Model 1 (calcium alone) or Model 2 (weight alone). So Model 3 is the "best" model so far.
- Now that there is more than one explanatory variable, it is no longer the case that the ANOVA F is the square of one of the t-tests of the betas.
Now let's look at a new variable, parents' average height.
The bivariate model has these results:
This model accounts for 64.1% of the variance in height. Relative to our other models, that seems like a lot. So let's add parents' average height to our Model 3 (calcium and weight) and see if we can account for (nearly) the rest of the variance.
Model 5 looks great.
- We're now accounting for 76.8% of the variance in height.
- But note that while weight and avg par hght have significant Betas, the Beta for calcium is not significant. Why is this?
  - Some of the variability that calcium was accounting for in Model 3 can be accounted for by our new variable (avg par hght). In fact if you were to compute the correlation between these two variables you'd find that they are strongly correlated (r = 0.52). As a result, with avg par hght included in the model, calcium no longer accounts for a significant amount of the variability in height.
In fact, let's look at a model which has only weight and avg par hght (dropping calcium).
This model accounts for just as much variability in height as the previous model (Model 5). Which is considered the "better" model? When selecting between two models that account for about the same amount of variance, the model that contains the fewest explanatory variables (the simplest) is generally considered the best model.
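The comparison logic used in this walkthrough (a higher R² weighed against fewer explanatory variables) can be sketched in Python. The numbers below are made up and the variable names x1, x2, x3 are placeholders; they are not the height.sav variables, and the helper names (`solve`, `r_squared`) are hypothetical.

```python
# Made-up illustrative data (not from height.sav).
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]
x3 = [5, 3, 6, 2, 7, 1, 8, 4]
y = [3, 4, 8, 7, 12, 11, 15, 16]


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination (partial pivoting)."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def r_squared(predictors, y):
    """R^2 of an OLS model (with intercept) predicting y from the given columns."""
    n = len(y)
    cols = [[1.0] * n] + [list(map(float, c)) for c in predictors]
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    coefs = solve(xtx, xty)
    y_bar = sum(y) / n
    y_hat = [sum(b * col[i] for b, col in zip(coefs, cols)) for i in range(n)]
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - sse / sst


# Nested models: each one adds a variable to the previous one.
models = {
    "x1": [x1],
    "x1 + x2": [x1, x2],
    "x1 + x2 + x3": [x1, x2, x3],
}
results = {name: r_squared(cols, y) for name, cols in models.items()}

previous = None
for name, r2 in results.items():
    gain = "" if previous is None else f"  (gain over previous: {r2 - previous:.3f})"
    print(f"{name:<14} predictors={len(models[name])}  R^2={r2:.3f}{gain}")
    previous = r2
```

Adding a variable can never lower R² for nested models like these, so the question is whether the gain is large enough to justify the extra explanatory variable.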

Advanced topic: There are a number of different methods of entering variables into the regression equation. The default is to enter them all at once, as specified in SPSS. To use other methods, use the menu box labeled Method, which offers five different methods of entering variables into the regression equation. Click on the down arrow to make them appear.

Enter: This is the forced entry option. SPSS will enter at one time all specified variables regardless of significance levels.
Forward: This method will enter variables one at a time, based on the significance value to enter.
Backward: This enters all independent variables at one time and then removes variables one at a time based on a preset significance value to remove.
Stepwise: This combines both forward and backward procedures. Since intercorrelations are complex, the variance due to certain variables will change when new variables are entered into the equation. This is the most frequently used of the regression methods.
Remove: This is the forced removal option. It requires an initial regression analysis using the Enter procedure. In the next block (Block 1 of 1) you may specify one or more variables to remove. SPSS will then remove the specified variables and run the analysis again.

Generally the best way to enter your variables is to enter them into the model in an order that is guided by your theory. That is, your theory should make some claims about what variables should be important and what variables should not be.

To follow along with the example in SPSS, you may download the CSDATA.sav datafile.

This is the data set that your textbook uses as the case study in chapter 11 (multiple regression). The suggested questions below are taken (roughly) from the chapter to help facilitate the connection between what you do in class with SPSS and what the book says (note: the book uses a different statistical package, so the output in the book is in a different format than your SPSS output will be).

The CSDATA data set has the following variables:

GPA: university GPA after the first 3 semesters for Computer Science majors (4 point scale, 4 = A)

SATM: Math SAT score

SATV: Verbal SAT score

HSM: Average High School Math grade (10 point scale, 10 = A, 9 = A-, etc.)

HSS: Average High School Science grade (10 point scale, 10 = A, 9 = A-, etc.)

HSE: Average High School English grade (10 point scale, 10 = A, 9 = A-, etc.)

Sex: The sex of the student

A good first step of most analyses is to compute the descriptive statistics of your data.

1) Using SPSS, compute the mean and standard deviations of the continuous variables (GPA, SATM, SATV, HSM, HSS, HSE).

2) Compute the correlations between the continuous variables (you can do this all at once in one big correlation matrix).
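For readers who want to see what these two steps compute, here is a small Python sketch of means, standard deviations, and a correlation matrix. The values are invented for illustration and are NOT the CSDATA.sav scores; only three of the variable names are borrowed, and `pearson_r` is a hypothetical helper.

```python
import statistics

# Made-up values for illustration only; these are NOT the CSDATA.sav scores.
demo = {
    "GPA": [3.2, 2.8, 3.6, 2.5, 3.9, 3.0],
    "HSM": [9, 7, 10, 6, 10, 8],
    "SATM": [620, 540, 700, 500, 720, 580],
}


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Step 1: descriptive statistics (stdev uses the n - 1 denominator).
for name, values in demo.items():
    print(f"{name}: mean={statistics.mean(values):.3f}, "
          f"sd={statistics.stdev(values):.3f}")

# Step 2: one big correlation matrix for all pairs of variables.
names = list(demo)
corr = {(a, b): pearson_r(demo[a], demo[b]) for a in names for b in names}
print("      " + " ".join(f"{b:>6}" for b in names))
for a in names:
    print(f"{a:<6}" + " ".join(f"{corr[(a, b)]:6.3f}" for b in names))
```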

Suppose that your theory is that High School grades should be better predictors (explanatory variables) of University GPA than standardized tests (SAT scores).

So you may want to start by comparing two multiple regression models.

Model 1 will use HSM, HSS, and HSE to predict GPA
Model 2 will use SATM and SATV to predict GPA

3) Using SPSS, compute the regression analysis for Model 1.

b) Which explanatory variables (if any) are significant predictors? Which do not explain any of the variance?

c) How might you develop a new "better" model based on Model 1?

4) Using SPSS, compute the regression analysis for Model 2.

b) Which explanatory variables (if any) are significant predictors? Which do not explain any of the variance?

c) How might you develop a new "better" model based on Model 2?

5) Compare Model 1 and Model 2. Which does a better job predicting University GPA?

6) Given the results of Model 1 and Model 2 how might you improve your prediction of University GPA (what other Model(s) might you try)?

7) In your opinion, is the Model which includes all of the variables (HSM, HSS, HSE, SATM, SATV) "better" than Model 1? Is it better than a Model that only includes HSM?
