Consider the following example:
Suppose that you want to know if there really is a relationship between amount of time studying and test performance. So you get 6 of your fellow students to volunteer to report to you how much time they spent studying (in hours) for the exam, and what their score on the exam is (on a scale of 0 to 6, with 6 being the maximum score on the exam). The data are presented below in two ways. The table shows each person's exam score and the number of hours that they studied. The graph (a scatterplot) shows the same information in a different way. Each point corresponds to a person; the location of the point is determined by that person's values on the two variables (study time [X-axis] and exam score [Y-axis]).
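Data set:
| Person | Hrs (X) | Exam score (Y) |
| A | 1 | 1 |
| B | 1 | 3 |
| C | 3 | 2 |
| D | 4 | 5 |
| E | 6 | 4 |
| F | 7 | 5 |
[Scatterplot: hours studied (X-axis) against exam score (Y-axis), one point per person]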
Correlation is a statistical technique that measures and describes the relationship between two variables. (Notice that this means that there must be at least two scores from each individual, one for each of the two variables.)
1) The direction of the relationship
positive correlation (a positive number) means that the two variables tend to move in the same direction. That is, as one gets larger, so does the other.
negative correlation (a negative number) means that the two variables tend to move in opposite directions. That is, as one gets larger, the other gets smaller.
2) The form of the relationship
we will focus on linear correlations (straight lines), but there are also other forms that the relationship can take.
| linear (e.g., height and weight) | non-linear (e.g., age and height) |
Why (and When) do we use correlations?
Prediction - if we know that two variables are strongly related, then we may be able to predict the value of one, based on the value of the other.
e.g., if you know that ultrasound measurements of a baby's head are positively correlated with birth weight, then you can make an educated guess of the baby's birth weight by measuring the baby's head from an ultrasound
Validity - if you develop a new test (TEST A) for X, and you want to know whether it is truly measuring X, then you can see if TEST A correlates with things that you already know correlate with X.
e.g., if you discover a new formula for predicting birth weight (imagine some magic formula that includes the height and weight of the mother and father combined), then this formula should also correlate with the ultrasound estimates of birthweight.
Reliability - if you use the same test twice on the same individuals, you can correlate the two sets of scores. If the test is reliable, then it should give similar results both times, giving you a high correlation.
Theory Verification - many theories will predict that a relationship exists between different variables. So you can then go out, collect some data, and see if such a relationship exists.
Okay, so how do we quantify the idea of correlation? There are a number of different correlations; we will focus on the most common measure, the Pearson product-moment correlation.
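$$r = \frac{\text{degree to which } X \text{ and } Y \text{ vary together}}{\text{degree to which } X \text{ and } Y \text{ vary separately}} = \frac{\text{covariability of } X \text{ and } Y}{\text{variability of } X \text{ and } Y \text{ separately}}$$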
Remember that a "perfect correlation" is r = 1.0 (or -1.0). This means that the number in the numerator equals the number in the denominator (apart from its sign). On the bottom, we have two things: how much X changes and how much Y changes. On the top, we have how much X and Y change together. If X and Y change together exactly as much as they change separately, then we have an r = 1.0.
Now let's consider how we actually compute r.
$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$
Note: your book uses what looks like a very different formula. However, if you compare that formula with the one I'll use, you'll see that they really are the same thing [the book's formula pulls a factor of 1/(n-1) out of the summation, which turns our SSX & SSY into standard deviations (sX & sY)].
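As a quick check of that equivalence, here is a minimal Python sketch. It assumes the book's version is the covariance form, r = cov(X, Y) / (sX sY); that is a guess about the book, not something stated in these notes. The data are the five (X, Y) pairs from the worked example later in this section, and the variable names are made up for illustration.

```python
import math
import statistics

# Five (X, Y) pairs from the worked example in these notes.
X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]
n = len(X)
mean_x, mean_y = statistics.mean(X), statistics.mean(Y)

# Formula used in these notes: r = SP / sqrt(SSX * SSY)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
r_notes = sp / math.sqrt(ss_x * ss_y)

# Assumed textbook form: r = cov(X, Y) / (sX * sY), each dividing by n - 1.
cov_xy = sp / (n - 1)
r_book = cov_xy / (statistics.stdev(X) * statistics.stdev(Y))

print(round(r_notes, 3), round(r_book, 3))  # 0.875 0.875
```

The (n - 1) terms cancel between the numerator and the denominator, which is why the two versions agree.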
To compute r, we need to introduce a new concept: the sum of products of deviations (SP).
Consider the following:
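| X | Y | $X - \bar{X}$ | $Y - \bar{Y}$ | $(X - \bar{X})(Y - \bar{Y})$ |
| 0 | 1 | -6 | -1 | 6 |
| 10 | 3 | +4 | +1 | 4 |
| 4 | 1 | -2 | -1 | 2 |
| 8 | 2 | +2 | 0 | 0 |
| 8 | 3 | +2 | +1 | 2 |
$$SP = \sum (X - \bar{X})(Y - \bar{Y})$$
So: SP = 14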
Note: we can also compute SP with a computational formula:
$$SP = \sum XY - \frac{\sum X \sum Y}{n}$$
Hopefully, SP reminds you of SS (Sum of Squares). The concepts are very similar. The basic difference is that with SS, we just had one variable (X), however with SP we have two variables (X & Y).
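| Sum of Squares (SS) | Sum of products (SP) |
| $SS = \sum (X - \bar{X})^2$ | $SP = \sum (X - \bar{X})(Y - \bar{Y})$ |
| $SS = \sum X^2 - \frac{(\sum X)^2}{n}$ | $SP = \sum XY - \frac{\sum X \sum Y}{n}$ |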
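To make the SS/SP parallel concrete, here is a minimal Python sketch that computes SP both ways and treats SS as the special case where a variable is paired with itself. It assumes the standard computational forms (raw sums minus a correction term), and the function names are made up for illustration.

```python
# Five (X, Y) pairs from the worked example in these notes.
X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]


def sp_definitional(xs, ys):
    """SP as the sum of products of deviations from the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def sp_computational(xs, ys):
    """SP from raw sums: sum(XY) - sum(X) * sum(Y) / n."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n


def ss_definitional(xs):
    """SS is just SP of a variable with itself."""
    return sp_definitional(xs, xs)


def ss_computational(xs):
    """Same idea, using the raw-sum version."""
    return sp_computational(xs, xs)


print(sp_definitional(X, Y), sp_computational(X, Y))  # 14.0 14.0
print(ss_definitional(X), ss_computational(X))        # 64.0 64.0
print(ss_definitional(Y), ss_computational(Y))        # 4.0 4.0
```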
Okay, now let's compute the Pearson correlation (r).
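$$r = \frac{\text{degree to which } X \text{ and } Y \text{ vary together}}{\text{degree to which } X \text{ and } Y \text{ vary separately}} = \frac{\text{covariability of } X \text{ and } Y}{\text{variability of } X \text{ and } Y \text{ separately}}$$
$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$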
In other words, we've got SP on top, which is our measure of covariability of X and Y. On the bottom we've got our measure of variability of X alone and Y alone.
So let's return to our example:
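| X | Y | $X - \bar{X}$ | $Y - \bar{Y}$ | (devX)(devY) | $(X - \bar{X})^2$ | $(Y - \bar{Y})^2$ |
| 0 | 1 | -6 | -1 | 6 | 36 | 1 |
| 10 | 3 | +4 | +1 | 4 | 16 | 1 |
| 4 | 1 | -2 | -1 | 2 | 4 | 1 |
| 8 | 2 | +2 | 0 | 0 | 4 | 0 |
| 8 | 3 | +2 | +1 | 2 | 4 | 1 |
| sum = 30 | sum = 10 | | | sum = 14 | sum = 64 | sum = 4 |
| mean = 6.0 | mean = 2.0 | | | | | |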
So SP = 14; SSX = 64; SSY = 4
$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}} = \frac{14}{\sqrt{(64)(4)}} = \frac{14}{16} = +0.875$$
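The same computation can be sketched in a few lines of Python. This is a minimal illustration of r = SP / sqrt(SSX * SSY); the function name `pearson_r` is made up for this example. It is applied to the worked example above and to the study-time data from the start of this section.

```python
import math


def pearson_r(xs, ys):
    """Pearson r as SP / sqrt(SS_X * SS_Y), using deviation scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


# Worked example: SP = 14, SSX = 64, SSY = 4
r_example = pearson_r([0, 10, 4, 8, 8], [1, 3, 1, 2, 3])
print(r_example)  # 0.875

# Study-time data: hours studied (X) and exam score (Y) for persons A-F
r_study = pearson_r([1, 1, 3, 4, 6, 7], [1, 3, 2, 5, 4, 5])
print(round(r_study, 2))  # 0.77
```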
Now that we know how to compute a correlation, we need to consider how we interpret it. We already know the basics:
But there are some additional things that we need to consider.
Let's look at each point in a little more depth
4) Correlations describe a relationship between two variables, but DO NOT explain why the variables are related.
e.g.,
a) Suppose that Dr. Steward finds that rates of spilled coffee and severity of plane turbulence are strongly positively correlated.
b) Suppose that Dr. Cranium finds a positive correlation between head size and digit span (roughly the number of digits you can remember).
c) Suppose that Dr. Ruth finds a positive correlation between the number of babies born and the rate of stork sightings (I believe that such a correlation has been reported).
Suppose that in one study we look for a correlation between age and height, but we only test 0 to 10 year olds. In a second study we look for the same relationship, but only test 25 to 25 year olds. In the first case we will probably find a strong positive correlation, but in the latter case we may find a near-zero correlation.
Which correlation is correct? Both are, if considered with respect to the range represented in the data. We should conclude that the strong positive correlation exists for a restricted range. That is, from years 0 to 10, there is a strong positive correlation between age and height. (note: a non-linear function is appropriate for this relationship)
| r = -0.05 | r = +0.76 |
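The restricted-range point can be illustrated with a small Python sketch. The ages and heights below are made-up numbers chosen only for illustration (they are not data from these notes), and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson r as SP / sqrt(SS_X * SS_Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical children: height (cm) rises steadily with age (years).
child_age = [0, 2, 4, 6, 8, 10]
child_height = [50, 86, 102, 115, 128, 138]

# Hypothetical adults: height no longer changes systematically with age.
adult_age = [25, 26, 27, 28, 29]
adult_height = [172, 168, 170, 168, 172]

r_children = pearson_r(child_age, child_height)
r_adults = pearson_r(adult_age, adult_height)

print(round(r_children, 2))  # strong positive correlation
print(round(r_adults, 2))    # essentially zero
```

Both values describe their own age range correctly; neither one describes the relationship across the whole lifespan.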
7) When considering "how good" a relationship is, we really should consider r², not just r.
r² is called the coefficient of determination.
We'll talk more about this towards the end of this chapter. What it basically measures is how much of the variability in one variable can be determined by the other variable.
In other words, suppose that we find that the correlation (r) between height and weight is 0.76. We can use this information to predict a person's weight, if we know their height. But, notice that the correlation is not perfect, so we know that we may be off by a bit.
But we also know that we'll be close. The r² for this relationship is (0.76)² = .578. What we can conclude from this is that 57.8% of the variability in weight can be accounted for from the relationship that it has with height.
Notice that if we do have a perfect correlation (r = ± 1.0), then r² = (1.0)² = 1.0. So 100% of the variance in Y can be accounted for by X.
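One way to see what "accounted for" means is to fit a straight line to the data and ask how much of the variability in Y is left over. The Python sketch below does this for the worked example (X = 0, 10, 4, 8, 8; Y = 1, 3, 1, 2, 3). The least-squares line it uses is a preview of material not covered in this section, so treat that part as an assumption of the illustration.

```python
import math

X = [0, 10, 4, 8, 8]
Y = [1, 3, 1, 2, 3]
n = len(X)
mean_x = sum(X) / n
mean_y = sum(Y) / n

sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
ss_x = sum((x - mean_x) ** 2 for x in X)
ss_y = sum((y - mean_y) ** 2 for y in Y)
r = sp / math.sqrt(ss_x * ss_y)

# Least-squares line (assumed here, not derived in this section).
slope = sp / ss_x
intercept = mean_y - slope * mean_x

# Variability in Y that the line does NOT account for.
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(X, Y))
proportion_explained = 1 - ss_residual / ss_y

print(round(r ** 2, 4), round(proportion_explained, 4))  # 0.7656 0.7656
```

For these data r = 0.875, so about 76.6% of the variability in Y is accounted for by its linear relationship with X.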
Okay, what about hypothesis testing? Can we test hypotheses with correlations?
Yes. We can test predictions about whether or not there is a relationship and even about what direction the relationship has. At the population level, a relationship is represented by rho (ρ), and at the sample level by our familiar r.
So when is a correlation the appropriate analysis? Check the decision tree.
Again the logic is the same as before.
This means that we can state a null and alternative hypothesis for the population correlation ρ based on our predictions for a correlation. Let's look at how this works in an example.
Suppose that we wanted to know if students who live near campus have higher GPAs than students who live farther away and commute to campus. We could measure students' GPAs and also measure how far away they live by measuring the distance to their residence from the middle of the quad. These are the two measured variables we're interested in.
Now let's go through our hypothesis testing steps:
Step 1: State hypotheses and choose our α level
H0: ρ = 0
H1: ρ ≠ 0
| no positive rel. | no negative rel. |
| H0: ρ ≤ 0 | H0: ρ ≥ 0 |
| H1: ρ > 0 | H1: ρ < 0 |
Remember we're going to state hypotheses in terms of our population correlation ρ. In this example, we expect GPA to decrease as distance from campus increases. This means that we are making a directional hypothesis and using a one-tailed test. It also means we expect to find a negative value of r, because that would indicate a negative relationship between GPA and distance from campus. So here are our hypotheses:
H1: ρ < 0
H0: ρ ≥ 0
We're making our predictions as a comparison with 0, because 0 would indicate no relationship. Note that if we were conducting a two-tailed test, our hypotheses would be ρ = 0 for the null hypothesis and ρ ≠ 0 for the alternative.
We'll use our conventional α = .05.
Step 2: Collect the sample
Here are our sample data:
| Subject | GPA | Distance from campus (in miles) |
| A | 3.45 | 1.3 |
| B | 3.03 | .8 |
| C | 2.67 | 5.7 |
| D | 2.50 | .5 |
| E | 3.16 | 2.9 |
| Mean | 2.96 | 2.24 |
Step 3: Calculate test statistic
For this example, we're going to calculate a Pearson r statistic. Recall the formula for Pearson r:
$$r = \frac{SP}{\sqrt{SS_X \, SS_Y}}$$
The bottom of the formula requires us to calculate the sum of squares (SS) for each measure individually and the top of the formula requires calculation of the sum of products of the two variables (SP).
We'll start with the SS terms. Remember the formula for SS is:
$$SS = \sum (X - \bar{X})^2$$
We'll calculate this for both GPA and Distance. For our example, we get:
$SS_{GPA} = .58$ and $SS_{distance} = 18.39$
Now we need to calculate the SP term. Remember the formula for SP is
$$SP = \sum (X - \bar{X})(Y - \bar{Y})$$
SP = -.63
Plugging these SS and SP values into our r equation gives us
r = -.19
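The Step 3 arithmetic can be checked with a short Python sketch. It uses the five GPA and distance scores from the table in Step 2; the variable names are made up for illustration.

```python
import math

# Sample data from Step 2 (subjects A-E).
gpa = [3.45, 3.03, 2.67, 2.50, 3.16]
distance = [1.3, 0.8, 5.7, 0.5, 2.9]

n = len(gpa)
mean_gpa = sum(gpa) / n
mean_dist = sum(distance) / n

# Sum of squares for each variable, then the sum of products.
ss_gpa = sum((g - mean_gpa) ** 2 for g in gpa)
ss_dist = sum((d - mean_dist) ** 2 for d in distance)
sp = sum((g - mean_gpa) * (d - mean_dist) for g, d in zip(gpa, distance))

r = sp / math.sqrt(ss_gpa * ss_dist)

print(round(mean_gpa, 2), round(mean_dist, 2))  # 2.96 2.24
print(round(ss_gpa, 2), round(ss_dist, 2))      # 0.58 18.39
print(round(sp, 2), round(r, 2))                # -0.63 -0.19
```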
Now we need to find our critical value of r using a table like we did for our z and t-tests. We'll need to know our degrees of freedom, because like t, the r distribution changes depending on the sample size. For r,
df = n - 2
Why subtract 2? We have two variables, X & Y, so we lose two degrees of freedom.
So for our example, we have df = 5 - 2 = 3. Now, with df = 3, α = .05, and a one-tailed test, we can find $r_{critical}$ in the table of Pearson r values.
Our $r_{crit}$ = .805. We'd write $r_{crit}(3)$ = -.805 (negative because we are doing a one-tailed test looking for a negative relationship).
Step 4: Compare observed test statistic to critical test statistic and make a decision about H0
Our $r_{obs}(3)$ = -.19 and $r_{crit}(3)$ = -.805
Since -.19 is not in the critical region that begins at -.805, we cannot reject the null. We must retain the null hypothesis and conclude that we have no evidence of a relationship between GPA and distance from campus.
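The decision rule in Step 4 can be written out as a small Python sketch. The critical value still comes from the table of Pearson r values; the function name `reject_null` and its `tail` argument are made up for illustration.

```python
def reject_null(r_obs, r_crit, tail):
    """Return True if r_obs falls in the critical region.

    tail is 'negative' (H1: rho < 0), 'positive' (H1: rho > 0),
    or 'two' (H1: rho != 0). r_crit is the tabled critical value.
    """
    r_crit = abs(r_crit)
    if tail == "negative":
        return r_obs <= -r_crit
    if tail == "positive":
        return r_obs >= r_crit
    if tail == "two":
        return abs(r_obs) >= r_crit
    raise ValueError("tail must be 'negative', 'positive', or 'two'")


# GPA and distance example: df = 3, alpha = .05, one-tailed, tabled r = .805
decision = reject_null(r_obs=-0.19, r_crit=0.805, tail="negative")
print("reject H0" if decision else "retain H0")  # retain H0
```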
We can also use SPSS to conduct a hypothesis test with Pearson r. We could calculate the Pearson r with SPSS and then look at the output to make our decision about H0. The output will give us a p value for our Pearson r (listed under Sig in the Output). We can compare this p value with α to determine if the p value is in the critical region.
Under the Analyze menu you will find the Correlate submenu. From the Correlate submenu you want to select "bivariate"
In the bivariate correlation window, select the variables that you want correlated (you can have more than two at a time). For today's lab, make sure that Pearson is selected (the others are other kinds of correlations).
The output that you get is a correlation matrix. It correlates each variable against each variable (including itself). You should notice that the table has redundant information on it (e.g., you'll find an r for height correlated with weight, and an r for weight correlated with height; these two values are identical).
In SPSS you'll also get some additional information in the correlation matrix. This is the information we are now interested in. Look where it says "Sig. 2-tailed". This is where we'll find the p value we're looking for to compare with α. In this case, the given p is .000 (meaning p < .001). If this value is lower than α (which it is here), we can reject the H0. N is simply the number of paired scores that were in the comparison.
So in the correlation matrix above, height and weight have an r = .794. This is a fairly strong positive correlation.
(1) A high school counselor would like to know if there is a relationship between mathematical skill and verbal skill. A sample of n = 25 students is selected, and the counselor records achievement test scores in mathematics and English for each student. The Pearson correlation for this sample is r = +0.50. Do these data provide sufficient evidence for a real relationship in the population? Test at the .05 α level, two tails.
(2) It is well known that similarity in attitudes, beliefs, and interests plays an important role in interpersonal attraction. Thus, correlations for attitudes between married couples should be strong and positive. Suppose a researcher developed a questionnaire that measures how liberal or conservative one's attitudes are. Low scores indicate that the person has liberal attitudes, while high scores indicate conservatism. Here are the data from the study:
Couple A: Husband - 14, Wife - 11
Couple B: Husband - 7, Wife - 6
Couple C: Husband - 15, Wife - 18
Couple D: Husband - 7, Wife - 4
Couple E: Husband - 3, Wife - 1
Couple F: Husband - 9, Wife - 10
Couple G: Husband - 9, Wife - 5
Couple H: Husband - 3, Wife - 3
Test the researcher's hypothesis with α set at .05.
(3) A researcher believes that a person's belief in supernatural events (e.g., ghosts, ESP, etc.) is related to their education level. For a sample of n = 30 people, he gives them a questionnaire that measures their belief in supernatural events (where a high score means they believe in more of these events) and asks them how many years of schooling they've had. He finds that $SS_{beliefs}$ = 10, $SS_{schooling}$ = 10, and SP = -8. With α = .01, test the researcher's hypothesis.
(4) To measure the relationship between anxiety and test performance, a researcher asked his students to come to the lab 15 minutes before they were to take an exam in his class. The researcher measured the students' heart rates and then matched these scores with their exam performance after they had taken the exam. Use the data below and SPSS to conduct a hypothesis test for the correlation between anxiety and test performance in the population. Use α = .05.
| Student | Heart rate | Exam score |
| A | 76 | 78 |
| B | 81 | 68 |
| C | 60 | 88 |
| D | 65 | 80 |
| E | 80 | 90 |
| F | 66 | 68 |
| G | 82 | 60 |
| H | 71 | 95 |
| I | 66 | 84 |
| J | 75 | 75 |
| K | 80 | 62 |
| L | 76 | 51 |
| M | 77 | 63 |
| N | 79 | 71 |