Psychology 340 Syllabus
Statistics for the Social Sciences

Illinois State University
J. Cooper Cutting
Fall 2002



Simple Linear Regression

  • Basics
  • Computing the least squares regression line
  • Computing the residuals (error around the line)
  • Residual plots
  • The linear model
  • Hypothesis testing with regression
  • Using SPSS to do regression
  • Cautions about regression


    Basics

    As we discussed last time, a correlation tells us about three characteristics of the relationship between X and Y: its direction, its form, and its strength.

    The equation for a line

    Let's start by reviewing some old geometry. Consider the following graph.

    at X = 0, Y = 1
    at X = 1, Y = 1.5
    at X = 2, Y = 2.0
    at X = 3, Y = 2.5
    at X = 4, Y = 3.0

    So as X goes up by 1, Y goes up by 0.5. This amount of change is called the slope (b). It is a constant: it is the same everywhere along the line.

    The intercept (a) is the value of Y when X = 0. This is also a constant.

    We can describe the line in the following linear equation:

    Y = bX + a ---> Y = (.5)X + 1.0

    In other words, using the linear equation we can determine the value of Y if we know the values of X, b, and a.

    - recall that predicting Y based on X is one of the main things that this chapter is all about
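    The linear equation above can be sketched in a few lines of Python; this reproduces the table of (X, Y) values from the graph:

```python
# The line Y = bX + a with slope b = 0.5 and intercept a = 1.0
b, a = 0.5, 1.0

for x in range(5):
    y = b * x + a
    print(f"at X = {x}, Y = {y}")
```

    Note that changing a slides the whole line up or down, while changing b tilts it.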


    Scatterplots

    Okay, now let's return to our scatterplots. Let's start with the case of a perfect positive correlation (r = 1.0).

    When we do a regression analysis, what we are doing is trying to find the line (and linear equation) that best fits the data points. For this example it is pretty easy. There is only one possible line that makes sense to fit to this set of data.

    Now let's look at a case when the correlation is not perfect.
    Now it isn't as easy. Clearly no single straight line will fit each data point (that is, you can't draw a single line through all of the data points). In fact it is not too hard to imagine several different possible lines fitting to this data. What we want is the line (and linear equation) that fits the best.

    Getting SPSS to make a scatter plot and put a least squares regression line on our scatterplot


    Least Squares Regression

    What does it mean to be the line that "best fits" the data? The least squares criterion: the best-fitting line is the one that minimizes the total squared distance (in the Y direction) between the data points and the line.
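    The least squares slope and intercept can be computed directly from the data. A minimal sketch in Python, using hypothetical sample data (not the course's height example): the slope is the sum of products of deviations divided by the sum of squared X deviations, and the intercept follows from the means.

```python
# Least-squares fit by hand (hypothetical data):
#   b = SP / SSx,  a = mean(Y) - b * mean(X)
X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 4, 6]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

# SP: sum of products of deviations; SSx: sum of squared X deviations
SP = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
SSx = sum((x - mean_x) ** 2 for x in X)

b = SP / SSx              # slope of the best-fitting line
a = mean_y - b * mean_x   # intercept

print(f"Y-hat = {b:.2f}X + {a:.2f}")
```

    Any other choice of b and a would produce a larger total squared error around the line.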


    Calculating the residuals (the error around the line)

    So now we have our regression equation for the data from our example. We can use this equation to predict Y, given values of X. However, we need more information to complete our description of the relationship between the two variables.

    Any complete description of a predictive relationship between X and Y needs to include a measure of the error around the line.

    Consider these two scatterplots. The equation for the line is identical for both, however the relationship between the variables (X & Y) is different. The relationship is weaker in the first plot relative to the second. That's because there is more error around the line in the first. This is why we must report not only the equation for the line but also a measure of error.

    The standard error of the estimate describes the typical error in using Ŷ to estimate Y.

    To get this we'll look at each point, and compare the actual value of Y with the predicted value of Y (which is called Ŷ, pronounced "Y-hat").

    distance = Y - Ŷ

    SSerror = total squared error = Σ(Y - Ŷ)²

    We get the Ŷ values from the line, and the Y values from the actual data points.

    In principle we would need to compute SSerror for all possible values of a and b to find the pair that makes it smallest; fortunately, the least squares formulas give us that pair directly.

    How do we compute the standard error of the estimate? We divide SSerror by its degrees of freedom (n - 2) and take the square root:

    standard error of the estimate = √(SSerror / (n - 2))
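    Putting these pieces together, here is a short sketch (hypothetical data; b and a come from a least-squares fit to it) that computes the residuals, SSerror, and the standard error of the estimate:

```python
import math

# Residuals and standard error of the estimate (hypothetical data)
X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 4, 6]
b, a = 0.9, 1.3   # least-squares slope and intercept for this data

Y_hat = [b * x + a for x in X]                 # predicted values from the line
residuals = [y - yh for y, yh in zip(Y, Y_hat)]  # Y - Y-hat for each point
SS_error = sum(r ** 2 for r in residuals)      # total squared error

n = len(X)
s_est = math.sqrt(SS_error / (n - 2))          # standard error of the estimate

print(f"SSerror = {SS_error:.2f}, standard error of estimate = {s_est:.2f}")
```

    Notice that the residuals themselves sum to zero; it is their squares that add up to SSerror.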


    Residual Plots

    The sum of the residuals should always equal 0 (as should their mean). This is because the least squares regression line balances the error: the total error above the line equals the total error below the line.

    However, in addition to summing to zero, we also want the residuals to be randomly distributed. That is, there should be no pattern to the residuals. If there is a pattern, it may suggest that there is more than a simple linear relationship between the two variables.

    Residual plots are very useful tools for examining the relationship even further. These are basically scatterplots of the residuals (Y - Ŷ) against the explanatory (X) variable (note: plots of the residuals against other variables can also be enlightening). Consider the three examples below (note: the examples actually plot residuals that have been transformed into z-scores).

    The scatterplot shows a nice linear relationship. The residual plot shows
    that the residuals fall randomly above and below the line. Critically
    there doesn't seem to be a discernable pattern to the residuals.

    The scatterplot also shows a nice linear relationship. However, the residual
    plot shows that the residuals get larger as X increases. This suggests
    that the variability around the line is not constant across values of X. This is
    referred to as a violation of homogeneity of variance.

    The scatterplot shows what may be a linear relationship. The residual
    plot, however, suggests that a non-linear relationship may be more
    appropriate (see how a curved pattern appears in the residual plot).

    Getting residual plots in SPSS

    This is done when SPSS performs the regression analysis. At the bottom of the regression window there is a button labeled "save".
    When you click the save button, a new window opens. Check the "Unstandardized residuals" box (under "Residuals," in the upper right corner) to save the residuals as a new variable.


    The General Linear Model

    The General Linear Model brings samples and population issues back into the picture. The least squares regression line that we've discussed up until this point can be considered the sample estimate of the true regression equation for the population.

    The notation will change a little bit. Now we'll refer to the slope of the line as B1 and the intercept as B0. So the equation for the line is:

    Y = B0 + B1X

    As was discussed above, we also need a measure of the error (variability) around the line. This will be signified by the Greek letter epsilon (ε). So the full equation will be:

    Y = B0 + B1X + ε

    This equation describes a statistical model of the data. The portions of the equation are sometimes described as the FIT and the RESIDUAL. The FIT consists of the linear equation (B0 + B1X), and the RESIDUAL is the error (ε), the variability that isn't explained by the linear equation.


    Using Regression to Test for a Relationship

    The test of the regression model basically boils down to a test of whether the slope is equal to zero. The steps are similar to previous tests of hypotheses.


    Using SPSS to perform Regression

    If you'd like to follow along with the example using the SPSS data file that it is based on, you may download the height.sav datafile.

    We can also use our regression technique to test for a significant relationship between two variables. Remember that when we perform a regression, we calculate a slope (b) for the "best fit" line to describe the data. SPSS provides a test (a t-test) to determine if the slope (b) is significantly different from 0 (indicating that there is a linear relationship between the two variables).
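    The t-test of the slope that SPSS reports can be sketched by hand. Assuming the usual formulas (the standard error of the slope is the standard error of the estimate divided by √SSx, and the test has n - 2 degrees of freedom), with hypothetical data:

```python
import math

# t-test of the slope (hypothetical data): H0 is that the slope B1 = 0
X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 4, 6]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
SSx = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / SSx
a = mean_y - b * mean_x

SS_error = sum((y - (b * x + a)) ** 2 for x, y in zip(X, Y))
s_est = math.sqrt(SS_error / (n - 2))   # standard error of the estimate

SE_b = s_est / math.sqrt(SSx)           # standard error of the slope
t = b / SE_b                            # compare to critical t with df = n - 2

print(f"b = {b:.3f}, SE_b = {SE_b:.3f}, t = {t:.3f}")
```

    If the obtained t exceeds the critical value for df = n - 2, we reject the null hypothesis that the slope is zero and conclude there is a linear relationship.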

    To review:

    So for this relationship the linear equation (the FIT part of the model) is:

    Y = 1.2X - 12.9




    Some facts and cautions about correlation and regression

  • prediction is not the same as causation
  • Extreme values may radically distort things
  • Regression should not be used to make predictions beyond the range of values of X included in the data set. We discussed this last time when talking about correlations. The reasons are the same.
  • We assume the relationship between X and Y is linear
  • We assume the variance of Y is equal at all values of X (homoscedasticity)
  • As we already mentioned, unlike correlation, in regression the distinction between explanatory and response variables is very important. If you look back at the doing-regression-by-hand part of the lab you'll notice that we are only looking at the deviations from the line for the Y variable (in the Y direction). That is because we are trying to use X to predict Y, or to explain the variability in Y.

  • There is a close connection between correlation and the slope of the least-squares line. This was also discussed above.

  • The least-squares line always passes through the point (mean of X, mean of Y).
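    A quick numerical check of this last fact, with hypothetical data: plugging the mean of X into the fitted line recovers the mean of Y.

```python
# The least-squares line passes through (mean of X, mean of Y)
X = [1, 2, 3, 4, 5]
Y = [2, 3, 5, 4, 6]

n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n
SSx = sum((x - mean_x) ** 2 for x in X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y)) / SSx
a = mean_y - b * mean_x

# Evaluating the line at mean_x gives back mean_y
print(b * mean_x + a, mean_y)
```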

    For practice you may download the bear.sav file to answer the following questions. The datafile includes data about bears. There are 6 variables: age, neck diameter, length (from head to toe), chest diameter, weight, and gender.

    1) Relationships between variables


    Below, I've provided a link to a very nice tool for getting a feel for regression (and residuals). On this page, you can place points on a scatterplot. The page will automatically compute the least squares regression line corresponding to the points (you can have the line drawn on the plot as well). Additionally, you can have it open another window in which it will display a residual plot. I strongly suggest that you play with this.



    If you have any questions, please feel free to contact me at jccutti@mail.ilstu.edu.