Your textbook:
Correlation is a statistical technique that measures and describes the relationship between two variables.
Consider the following example:
Data Set:

  Person   X   Y
  A        1   1
  B        1   3
  C        3   2
  D        4   5
  E        6   4
  F        7   5

[Scatterplot: Y plotted against X for the six people above]
1) The direction of the relationship
positive correlation (a positive number) means that the two variables tend to move in the same direction. That is, as one gets larger, so does the other.
negative correlation (a negative number) means that the two variables tend to move in opposite directions. That is, as one gets larger, the other gets smaller.
2) The form of the relationship
we will focus on linear correlations (straight lines), but there are also other forms that the relationship can take.
[Example scatterplots: linear (e.g., height and weight) vs. non-linear (e.g., age and height)]
Why (and When) do we use correlations?
Prediction - if we know that two variables are strongly related, then we may be able to predict the value of one, based on the value of the other.
e.g., if you know that ultrasound measurements of a baby's head are positively correlated with birth weight, then you can make an educated guess of the baby's birth weight by measuring the baby's head from an ultrasound
Validity - if you develop a new test (TEST A) for X, and you want to know whether it is truly measuring X, then you can see if TEST A correlates with things that you already know correlate with X.
e.g., if you discover a new formula for predicting birth weight (imagine some magic formula that includes the height and weight of the mother and father combined), then this formula should also correlate with the ultrasound estimates of birthweight.
Reliability - if you use the same test twice on the same individuals, you can correlate the two sets of scores. If the test is reliable, then it should give similar results both times, giving you a high correlation
Theory Verification - many theories will predict that a relationship exists between different variables. So you can then go out, collect some data, and see if such a relationship exists.
Okay, so how do we quantify the idea of correlation? There are a number of different correlation measures; we will focus on the most common one, the Pearson product-moment correlation.
r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)
  = (covariability of X and Y) / (variability of X and Y separately)
remember that a "perfect correlation" is r = 1.0 (or -1.0). This means that the number in the numerator equals the number in the denominator. On the bottom we have two things: how much X changes and how much Y changes. On the top we have how much X and Y change together. If the covariability on top matches the separate variability on the bottom, then we have an r = 1.0.
now let's consider how we actually compute r.
need to introduce a new concept: sum of products of deviations (SP)
Consider the following:
  X     Y     X − X̄   Y − Ȳ   (devX)(devY)
  0     1     -6       -1       6
  10    3     +4       +1       4
  4     1     -2       -1       2
  8     2     +2        0       0
  8     3     +2       +1       2
  sum:  30    10                14
  mean: 6.0   2.0

So: SP = Σ(X − X̄)(Y − Ȳ) = 14

There is also a computational formula:

  X     Y     XY
  0     1     0
  10    3     30
  4     1     4
  8     2     16
  8     3     24
  sum:  30    10    74

SP = ΣXY − (ΣX)(ΣY)/n = 74 − (30)(10)/5 = 74 − 60 = 14
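Both ways of computing SP can be checked with a short script (not from the notes; just a sketch using the example data):

```python
# Data from the example above.
X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]
n = len(X)

mean_x = sum(X) / n   # 6.0
mean_y = sum(Y) / n   # 2.0

# Definitional formula: sum of the products of deviations.
SP_def = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))

# Computational formula: SP = sum(XY) - sum(X)*sum(Y)/n.
SP_comp = sum(x * y for x, y in zip(X, Y)) - sum(X) * sum(Y) / n

print(SP_def, SP_comp)  # both 14.0
```

The two formulas always agree; the computational one just avoids working with the deviations directly.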
Hopefully, SP reminds you of SS (Sum of Squares). The concepts are very similar. The basic difference is that with SS, we just had one variable (X), however with SP we have two variables (X & Y).
                  Sum of Squares (SS)        Sum of Products (SP)
  definitional:   SS = Σ(X − X̄)²             SP = Σ(X − X̄)(Y − Ȳ)
  computational:  SS = ΣX² − (ΣX)²/n         SP = ΣXY − (ΣX)(ΣY)/n
Okay, now let's compute the Pearson correlation (r).
r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)
  = (covariability of X and Y) / (variability of X and Y separately)

r = SP / √(SSX · SSY)
in other words, we've got SP on top, which is our measure of covariability of X and Y. On the bottom we've got our measure of variability of X alone and Y alone
so let's return to our example:
  X     Y     X − X̄   Y − Ȳ   (devX)(devY)   (X − X̄)²   (Y − Ȳ)²
  0     1     -6       -1       6              36          1
  10    3     +4       +1       4              16          1
  4     1     -2       -1       2               4          1
  8     2     +2        0       0               4          0
  8     3     +2       +1       2               4          1
  sum:  30    10                14              64          4
  mean: 6.0   2.0

So SP = 14; SSX = 64; SSY = 4

r = SP / √(SSX · SSY) = 14 / √(64 · 4) = 14 / 16 = 0.875
So there is a fairly strong positive correlation, as X goes up we can predict that Y will too.
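The whole r computation can be sketched in Python (same five data points as above):

```python
import math

# Data from the running example.
X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]
n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # covariability
SSX = sum((x - mean_x) ** 2 for x in X)                       # variability of X alone
SSY = sum((y - mean_y) ** 2 for y in Y)                       # variability of Y alone

# Pearson r: covariability over the separate variabilities.
r = SP / math.sqrt(SSX * SSY)
print(r)  # 0.875
```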
But there are some additional things that we need to consider.
Let's look at each point in a little more depth
4) Correlations describe a relationship between two variables, but DO NOT explain why the variables are related
e.g.,
a) Suppose that Dr. Steward finds that rates of spilled coffee and severity of airplane turbulence are strongly positively correlated.
correlationally speaking, one might argue that spilling coffee causes turbulence
b) Suppose that Dr. Cranium finds a positive correlation between head size and digit span (the number of digits a person can repeat back from memory).
correlationally speaking, one might argue that people with bigger heads have bigger digit spans (instead of something like, head size and digit span increase with age)
c) Suppose that Dr. Ruth finds a positive correlation between the number of babies born and the rate of stork sightings (I believe that such a correlation has been reported)
correlationally speaking, one might interpret this as support for the hypothesis that storks bring babies to homes
Suppose that in one study we look for a correlation between age and height, but we only test 0 to 10 yr olds. But in a second study we look for the same relationship but only test adults (say, 25 yr olds and up). In the first case we will probably find a strong positive correlation, but in the latter case we may find a near 0 correlation.
Which correlation is correct? Both are, if considered with respect to the range represented in the data. We should conclude that the strong positive correlation exists for a restricted range. That is, from years 0 to 10, there is a strong positive correlation between age and height. (note: a non-linear function is appropriate for this relationship)
[Two scatterplots: one with r = -0.05, one with r = +0.76]
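The restricted-range point can be demonstrated with made-up age/height numbers (the data below are invented purely for illustration; they are not from the notes):

```python
import math

def pearson_r(X, Y):
    """Pearson r computed as SP / sqrt(SSX * SSY)."""
    n = len(X)
    mx, my = sum(X) / n, sum(Y) / n
    sp  = sum((x - mx) * (y - my) for x, y in zip(X, Y))
    ssx = sum((x - mx) ** 2 for x in X)
    ssy = sum((y - my) ** 2 for y in Y)
    return sp / math.sqrt(ssx * ssy)

# Invented data: height climbs steadily through childhood...
ages_child   = [0, 2, 4, 6, 8, 10]
height_child = [50, 85, 100, 115, 128, 138]     # cm

# ...but is essentially flat (just individual variation) in adulthood.
ages_adult   = [25, 27, 29, 31, 33, 35]
height_adult = [172, 165, 180, 168, 175, 171]   # cm

r_child = pearson_r(ages_child, height_child)   # strong positive
r_adult = pearson_r(ages_adult, height_adult)   # near zero
print(round(r_child, 2), round(r_adult, 2))
```

Same variables, same formula; only the range of X changes, and the correlation changes dramatically.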
7) When considering "how good" a relationship is, we really should consider r², not just r.
r² is called the coefficient of determination
we'll talk more about this towards the end of this chapter. What it basically measures is how much of the variability in one variable can be determined by the other variable.
In other words, suppose that we find that the correlation (r) between height and weight is 0.76. We can use this information to predict a person's weight, if we know their height. But, notice that the correlation is not perfect, so we know that we may be off by a bit.
But we also know that we'll be close. The r² for this relationship is (0.76)² = .578. What we can conclude from this is that 57.8% of the variability in weight can be accounted for from the relationship that it has with height.
notice that if we do have a perfect correlation (r = ± 1.0), then r² = (1.0)² = 1.0. So 100% of the variance in Y can be accounted for by X.
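The arithmetic above, as a quick check (r = 0.76 is the hypothetical height/weight value from the text):

```python
# Coefficient of determination for the height/weight example.
r = 0.76
r_squared = r ** 2   # proportion of variance in weight accounted for by height
print(round(r_squared, 3))  # 0.578, i.e. 57.8%
```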
Yes. We can test predictions about whether or not there is a relationship and even about what direction the relationship has. At the population level, a relationship is represented by the Greek letter rho (ρ), and at the sample level by our familiar r.
What are the hypotheses?

Two-tailed:
  H0: ρ = 0
  H1: ρ ≠ 0

One-tailed:
  no positive rel.    no negative rel.
  H0: ρ ≤ 0           H0: ρ ≥ 0
  H1: ρ > 0           H1: ρ < 0
Why subtract 2? Because we know two values, X & Y, so we lose two degrees of freedom.
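The test statistic itself did not survive in the text above, but the standard t test for a Pearson r is t = r·√(n − 2) / √(1 − r²) with df = n − 2. A sketch using the running example (r = 0.875, n = 5):

```python
import math

# t test for a Pearson correlation: t = r * sqrt(n-2) / sqrt(1 - r^2).
# Values from the running example; the formula is the standard one,
# not quoted from the surviving text.
r, n = 0.875, 5
df = n - 2
t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
print(df, round(t, 2))
```

The resulting t is compared against the t distribution with n − 2 degrees of freedom.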
Linear Regression - a brief introduction
Let's start by talking about lines and graphs. Consider the following graph.
at X = 0, Y = 1.0
at X = 1, Y = 1.5
at X = 2, Y = 2.0
at X = 3, Y = 2.5
at X = 4, Y = 3.0

So as X goes up by 1, Y goes up by 0.5. This is called the slope (b). This is a constant. The intercept (a) is the value of Y when X = 0. This is also a constant. We can describe the line in the following linear equation:

Y = bX + a  --->  Y = (0.5)X + 1.0

in other words, using the linear equation, we can determine the value of Y, if we know the values of X, b, & a - recall that predicting Y based on X is one of the main things that this chapter is all about
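The line from the graph, written as a tiny function (just a sketch of the Y = bX + a idea):

```python
# Slope b = 0.5 and intercept a = 1.0, from the graph above.
def line(x, b=0.5, a=1.0):
    """Return Y for a given X on the line Y = bX + a."""
    return b * x + a

for x in range(5):
    print(x, line(x))  # reproduces the (X, Y) pairs listed above
```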
Okay, now let's return to our scatter plots. Let's start with the case of r = 1.0.
When we do a regression analysis, what we are doing is trying to find the line (and linear equation) that best fits the data points. For this example it is pretty easy. There is only one possible line that makes sense to fit to this set of data.
Now let's look at a case when the correlation is not perfect.
Now it isn't as easy. Clearly no single straight line will fit each data point (that is, you can't draw a single line through all of the data points). In fact it is not too hard to imagine several different possible lines fitting this data. What we want is the line (and linear equation) that fits the best.
What does it mean to be the line that best fits the data? "Best fit" is defined in terms of the distance between each actual data point (Y) and the corresponding point on the line (Ŷ):

distance = Y − Ŷ

SSerror = total squared error = Σ(Y − Ŷ)²

We get the Ŷ values from the line, and the Y values from the actual data points. The best-fitting line is the one with the smallest SSerror across all possible values of a and b.
  X     Y     X − X̄   Y − Ȳ   (devX)(devY)   (X − X̄)²   (Y − Ȳ)²
  0     1     -6       -1       6              36          1
  10    3     +4       +1       4              16          1
  4     1     -2       -1       2               4          1
  8     2     +2        0       0               4          0
  8     3     +2       +1       2               4          1
  sum:  30    10                14              64          4
  mean: 6.0   2.0

So SP = 14; SSX = 64; SSY = 4
slope = b = SP/SSX = 14/64 = .22
intercept = a = Ȳ − bX̄ = 2.0 - (.22)(6.0) = .68
So the regression equation is:
Ŷ = .22(X) + .68
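The slope and intercept calculations above, sketched in code (note that .22 and .68 in the text are rounded; the unrounded values are 0.21875 and 0.6875):

```python
# Data from the running example.
X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]
n = len(X)
mean_x, mean_y = sum(X) / n, sum(Y) / n

SP  = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))  # 14.0
SSX = sum((x - mean_x) ** 2 for x in X)                       # 64.0

b = SP / SSX             # slope: 14/64 = 0.21875  (~ .22)
a = mean_y - b * mean_x  # intercept: 2.0 - 0.21875*6.0 = 0.6875  (~ .68)

def predict(x):
    """Predicted Y for a given X from the regression equation."""
    return b * x + a

print(b, a, predict(10))
```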
So now we have our regression equation for these data. We can use this equation to predict Y, given values of X. However, there are some precautions that we will need to consider when interpreting the regression.
2) Regression should not be used to make predictions beyond the range of values of X included in the data set. We discussed this last time when talking about correlations. The reasons are the same.
SSerror = Σ(Y − Ŷ)²
Then we'll divide that by our degrees of freedom (which gives us a measure of variance, or mean squared error)
remember that df = n - 2
So in the end we end up with:
  X     Y     Ŷ      (Y − Ŷ)   (Y − Ŷ)²
  0     1     0.68    .32       .102
  10    3     2.88    .12       .014
  4     1     1.56   -.56       .314
  8     2     2.44   -.44       .193
  8     3     2.44    .56       .314
  sum:  30    10     10          0        .937
  mean: 6.0   2.0

So SP = 14; SSX = 64; SSY = 4; r = 0.875
Ŷ = .22(X) + .68
Serror = √(SSerror / df) = √(.9375 / 3) = √(.3125) = .559
An easier way to compute Serror is to use the correlational information.
SSerror = (1 − r²)SSY = (1 − (0.875)²)(4) = (1 − .766)(4) = .9375
Serror = √(SSerror / df) = √(.9375 / 3) = .559
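Both routes to the standard error of estimate can be checked in a few lines (values from the running example):

```python
import math

# Standard error of estimate via the correlational shortcut:
# SSerror = (1 - r^2) * SSY, then divide by df = n - 2 and take the root.
SSY, r, n = 4.0, 0.875, 5
df = n - 2

SS_error = (1 - r ** 2) * SSY       # (1 - 0.765625)(4) = 0.9375
s_error = math.sqrt(SS_error / df)  # sqrt(0.9375 / 3) = sqrt(0.3125)
print(round(s_error, 3))  # 0.559
```

This matches the value obtained by summing the squared residuals directly, which is the point of the shortcut.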