Lesson 6
Correlation
What is a Correlation?
Thus far we’ve covered the key descriptive statistics—the mean, median, mode, and standard deviation—and we’ve learned how to test the difference between means. But often we want to know how two things (usually called "variables" because they vary from high to low) are related to each other.
For example, we might want to know whether reading scores are related to math scores, i.e., whether students who have high reading scores also have high math scores, and vice versa. The statistical technique for determining the degree to which two variables are related (i.e., the degree to which they co-vary) is, not surprisingly, called correlation.
There are several different types of correlation, and we’ll talk about them later, but in this lesson we’re going to spend most of the time on the most commonly used type: the Pearson Product Moment Correlation. This correlation, signified by the symbol r, ranges from –1.00 to +1.00. A correlation of 1.00, whether it’s positive or negative, is a perfect correlation. It means that as scores on one of the two variables increase or decrease, the scores on the other variable change in exact proportion, so that one score can be predicted perfectly from the other (something you’ll probably never see in the real world). A correlation of 0 means there’s no linear relationship between the two variables, i.e., when scores on one of the variables go up, scores on the other variable may go up, down, or whatever. You’ll see a lot of those.
A correlation of .8 or .9 is regarded as a high correlation, i.e., there is a very close relationship between scores on one of the variables and scores on the other. Correlations of .2 or .3 are regarded as low correlations, i.e., there is some relationship between the two variables, but it’s a weak one. Knowing people’s scores on one variable wouldn’t allow you to predict their scores on the other variable very well.
Computing the Pearson Product Moment Correlation
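Let’s do a correlation to see how the formula works and what it produces. The formula for the Pearson product moment correlation is:
$$r_{xy} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^2 - \left(\sum X\right)^2\right]\left[n\sum Y^2 - \left(\sum Y\right)^2\right]}}$$
Here $\sum$ means "add up across all the individuals in the sample."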
Where:
$r_{xy}$ is the correlation coefficient between X and Y.
$n$ is the size of the sample.
X is the individual’s score on the X variable.
Y is the individual’s score on the Y variable.
XY is the product of each X score times its corresponding Y score.
$X^2$ is the individual X score squared.
$Y^2$ is the individual Y score squared.
Let’s see what the correlation is between 30 students’ reading scores and their math scores. The data we need to compute the formula are given in Table 12.
Table 12
Reading and Math Scores and the Associated Data for Computing the Pearson Product Moment Correlation (N=30)
| | X (Reading Scores) | Y (Math Scores) | $X^2$ | $Y^2$ | XY |
|---|---|---|---|---|---|
| | 191 | 180 | 36481 | 32400 | 34380 |
| | 103 | 101 | 10609 | 10201 | 10403 |
| | 187 | 173 | 34969 | 29929 | 32351 |
| | 108 | 103 | 11664 | 10609 | 11124 |
| | 180 | 170 | 32400 | 28900 | 30600 |
| | 118 | 113 | 13924 | 12769 | 13334 |
| | 178 | 171 | 31684 | 29241 | 30438 |
| | 127 | 122 | 16129 | 14884 | 15494 |
| | 176 | 168 | 30976 | 28224 | 29568 |
| | 134 | 130 | 17956 | 16900 | 17420 |
| | 165 | 150 | 27225 | 22500 | 24750 |
| | 147 | 145 | 21609 | 21025 | 21315 |
| | 160 | 150 | 25600 | 22500 | 24000 |
| | 157 | 154 | 24649 | 23716 | 24178 |
| | 155 | 145 | 24025 | 21025 | 22475 |
| | 168 | 164 | 28224 | 26896 | 27552 |
| | 150 | 145 | 22500 | 21025 | 21750 |
| | 172 | 170 | 29584 | 28900 | 29240 |
| | 145 | 130 | 21025 | 16900 | 18850 |
| | 185 | 179 | 34225 | 32041 | 33115 |
| | 140 | 141 | 19600 | 19881 | 19740 |
| | 195 | 193 | 38025 | 37249 | 37635 |
| | 135 | 136 | 18225 | 18496 | 18360 |
| | 100 | 101 | 10000 | 10201 | 10100 |
| | 130 | 128 | 16900 | 16384 | 16640 |
| | 125 | 121 | 15625 | 14641 | 15125 |
| | 105 | 106 | 11025 | 11236 | 11130 |
| | 120 | 118 | 14400 | 13924 | 14160 |
| | 115 | 112 | 13225 | 12544 | 12880 |
| | 110 | 108 | 12100 | 11664 | 11880 |
| Total (Σ) | 4381 | 4227 | 664583 | 616805 | 639987 |
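So, we plug the numbers from this table into the formula, and do the math:
$$r_{xy} = \frac{30(639{,}987) - (4{,}381)(4{,}227)}{\sqrt{\left[30(664{,}583) - (4{,}381)^2\right]\left[30(616{,}805) - (4{,}227)^2\right]}}$$
or
$$r_{xy} = \frac{19{,}199{,}610 - 18{,}518{,}487}{\sqrt{\left[19{,}937{,}490 - 19{,}193{,}161\right]\left[18{,}504{,}150 - 17{,}867{,}529\right]}}$$
or
$$r_{xy} = \frac{681{,}123}{\sqrt{(744{,}329)(636{,}621)}}$$
or
$$r_{xy} = \frac{681{,}123}{688{,}371.6} \approx .989$$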
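If you’d rather let a computer do the arithmetic, the same raw-score calculation can be sketched in a few lines of Python. This is only an illustration: the scores are copied from Table 12, while the names `reading`, `math_scores`, and `pearson_r` are made up for this example.

```python
from math import sqrt

# Reading (X) and math (Y) scores for the 30 students in Table 12.
reading = [191, 103, 187, 108, 180, 118, 178, 127, 176, 134,
           165, 147, 160, 157, 155, 168, 150, 172, 145, 185,
           140, 195, 135, 100, 130, 125, 105, 120, 115, 110]
math_scores = [180, 101, 173, 103, 170, 113, 171, 122, 168, 130,
               150, 145, 150, 154, 145, 164, 145, 170, 130, 179,
               141, 193, 136, 101, 128, 121, 106, 118, 112, 108]


def pearson_r(x, y):
    """Pearson r from raw scores (sums of X, Y, XY, X squared, Y squared)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(reading, math_scores)
print(f"Sum of X = {sum(reading)}, sum of Y = {sum(math_scores)}")
print(f"r = {r:.3f}")
```

The printed sums should match the Total row of Table 12, which is a handy check that the scores were typed in correctly.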
In this case, the correlation between reading and math scores is remarkably high (because I concocted the numbers so it would turn out that way). With real scores, it would be high, but not that high. If you glance over the numbers in Table 12, even before we’ve computed the correlation you can easily see (in this small sample of 30) that high scores in reading tend to go with high scores in math, low reading scores tend to go with low math scores, and so on. But, of course, you wouldn’t be able to see that pattern if you had a sample of 500.
Positive and Negative Correlations
I pointed out above that a correlation can vary from +1.00 to –1.00. The correlation we just computed is a positive correlation. That is, high reading scores go with high math scores, low with low, and so on. However, we could have a negative correlation. This is not something bad; it simply denotes an association in which high scores on one variable go with low scores on the other. For example, if we were computing a correlation between, say, the amount of time students watch television and their achievement scores, we would probably find a negative correlation: high TV watching is associated with lower achievement scores, and vice versa. Such a correlation might be something like –.71.
Determining Statistical Significance
OK, so we have a correlation coefficient. What precisely does it mean, and how do we interpret it? It’s not a percent, as many people mistakenly think.
First, we can determine its statistical significance in the same way we did with the t test. We can look it up in a table in the appendices of any statistics text. In the case of our .98 correlation between reading and math scores, if we look that up in the table for correlations, we find that the value needed to reject the null hypothesis at the .01 level of significance (and declare that the correlation is statistically significant, or unlikely to be due to chance) for our sample of 30 is .45 (in this case using the one-tailed test because we expected the correlation to be positive).
So if we were stating this finding in a research report, we could say that the correlation of reading scores with math scores was r = .98, p < .01, with df = 28 (the number of pairs of scores minus 2). (Now see how smart you are because you know what all that means.)
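The table lookup can also be approximated by hand, because a correlation can be converted to a t statistic with df = n – 2. The sketch below is a rough illustration, not the lesson’s own procedure: the function names are invented, and the critical t of 2.467 (one-tailed, .01 level, df = 28) is an assumption copied from a standard t table.

```python
from math import sqrt


def r_to_t(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    df = n - 2
    return r * sqrt(df) / sqrt(1 - r ** 2)


def critical_r(t_crit, df):
    """Smallest correlation that reaches significance for a given critical t."""
    return t_crit / sqrt(t_crit ** 2 + df)


n = 30
df = n - 2
# Assumption: one-tailed critical t at the .01 level for df = 28,
# taken from a standard t table.
T_CRIT = 2.467

r_needed = critical_r(T_CRIT, df)
t_observed = r_to_t(0.98, n)  # .98 is the correlation reported in the lesson

print(f"r needed for significance: {r_needed:.2f}")
print(f"t for r = .98: {t_observed:.1f} (critical t = {T_CRIT})")
```

Under these assumptions the required r works out to roughly .42, in the same neighborhood as the value read from the table. Printed tables often list only selected df rows, so small differences like this are to be expected.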
Practical vs. Statistical Significance
But we have the same issue we had with the t-test: determining its practical vs. its statistical significance. We don’t have an effect test, as we did with the t-test, but we have something similar. It has an imposing name—the coefficient of determination—but you’ll be ecstatically happy to learn that it’s very simple.
The coefficient of determination is nothing more than $r^2$. You simply multiply r by itself, and you’ve got it. OK, you’ve got it, what does it mean? The coefficient of determination, $r^2$, tells us how much of the variance in one of the variables is accounted for by the variance in the other variable. Thus, if we have a correlation of .60 between, say, students’ achievement scores and a measure of their socioeconomic status, $r^2$ = .36. That means that 36% of the variance in the students’ achievement scores (not 60, which is the correlation) can be accounted for by variance in their socioeconomic status. But that also means that the remaining variance (64%) in achievement scores cannot be accounted for by socioeconomic status, but is attributable to many other factors, such as study time, intelligence, motivation, quality of instruction, and so on.
Other Correlations
All the correlations we’ve talked about so far have been based on what we call interval data, i.e., data where the distance between scores or values is the same. The distance between a 65 and 66 is assumed to be the same as the distance between a 14 and a 15. But many times we want to determine the relationship between two variables when that is not the case. Suppose, for example, we want to compute the correlation between students’ class rank in their junior year with their class rank in their senior year. Ranks are not the same as scores; there may be a much smaller (or bigger) difference between ranks 1 and 2 than between ranks 8 and 10 (like the difference between the first two teams and the last two teams in football or baseball). If the data we have are ranks rather than scores, we can’t use the product moment formula. But there is another correlation formula for use with ranks (it’s called rho).
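To give a feel for what a rank-based correlation looks like, here is a minimal sketch of one common way to compute rho. It assumes no tied ranks, the eight pairs of class ranks are hypothetical, and the name `spearman_rho` is invented for this example.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho from two lists of ranks (assumes no tied ranks)."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical class ranks for eight students (made up for illustration).
junior_rank = [1, 2, 3, 4, 5, 6, 7, 8]
senior_rank = [2, 1, 4, 3, 5, 7, 6, 8]

rho = spearman_rho(junior_rank, senior_rank)
print(f"rho = {rho:.2f}")
```

With these made-up ranks, rho comes out at about .93: most students hold nearly the same rank in both years, so the two sets of ranks are strongly related.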
And suppose we want to determine the relationship between two variables when one is based on what is called nominal or categorical data, and the other is interval data. An example would be correlating gender with achievement scores. Again, the product moment correlation can’t be used, but there is also a special formula for doing a correlation with these disparate types of data. In this case, it’s called the point biserial correlation.
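As a rough sketch of how a point biserial correlation might be computed, the example below codes a two-category variable as 0 and 1 and compares the mean scores of the two groups. The data are hypothetical and the function name `point_biserial` is invented for this illustration.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical data: 0/1 codes for a two-category variable,
# paired with interval-level achievement scores.
group = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [70, 75, 80, 85, 80, 85, 90, 95]


def point_biserial(codes, values):
    """Point biserial r: group mean difference scaled by the overall spread."""
    ones = [v for c, v in zip(codes, values) if c == 1]
    zeros = [v for c, v in zip(codes, values) if c == 0]
    p = len(ones) / len(values)   # proportion coded 1
    q = 1 - p                     # proportion coded 0
    # pstdev is the standard deviation computed with n (not n - 1).
    return (mean(ones) - mean(zeros)) / pstdev(values) * sqrt(p * q)


r_pb = point_biserial(group, scores)
print(f"point biserial r = {r_pb:.2f}")
```

The further apart the two group means are, relative to the overall spread of scores, the larger the coefficient.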
Table 13 displays the several different types of correlation for use with variables based on different levels of measurement. In this course, we’re not going to compute them. But with the knowledge and skills you’ve developed thus far, when you encounter situations where the variables you want to correlate are based on different levels of measurement (interval, ordinal, or nominal), you’ll be able to select the type you need.
Table 13
Alternative Types of Correlation for Different Levels of Measurement*
| Type of Measurement and Example (First Variable) | Type of Measurement and Example (Second Variable) | Correlation Being Computed | Type of Correlation |
|---|---|---|---|
| Interval (reading scores) | Interval (math scores) | Correlation between reading and math achievement | Pearson product moment (r) |
| Ordinal (class rank in the junior year) | Ordinal (class rank in the senior year) | Correlation between class rank in the last two years of high school | Spearman rank coefficient (rho or ρ) |
| Nominal (social class: high, middle, or low) | Ordinal (rank in high school graduating class) | Correlation between social class and rank in high school | Rank biserial coefficient ($r_{bs}$) |
| Nominal (family configuration, e.g., intact or single parent) | Interval (grade point average) | Correlation between family configuration and grade point average | Point biserial |
| Nominal (voting preference: Republican or Democrat) | Nominal (gender, i.e., male or female) | Correlation between voting preference and gender | Phi coefficient |
*This table was adapted from a similar one found in Neil Salkind’s Statistics for People Who (Think They) Hate Statistics, Sage Publications, 2000, p. 101.
Correlation and Cause
Before we conclude this lesson, we need to understand one of the most important facts about correlation, namely, that it does not necessarily indicate cause. It may be that one of the variables does in fact cause the other, but we don’t know that just from the fact that the two are correlated.
Smoking and Lung Cancer
It is now an established fact that smoking causes lung cancer, but that conclusion could not be reached simply because there is a correlation between the two. When the association between smoking and lung cancer first appeared, and many argued that it indicated that smoking caused lung cancer, the tobacco companies countered that there were other factors that could explain the relationship, e.g., smoking is higher among blue collar workers who also have greater exposure to other toxic elements, smokers drink more and lead more stressful lives, and so on. And logically they were right. It took other kinds of direct physiological evidence and animal experiments to prove that the association was indeed causal.
We often find strong correlations where clearly a causal relationship makes no sense. For example, we may find a strong correlation between car sales and college attendance. Neither one of these is causing the other; both increase during financially prosperous times.
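A small simulation can make this concrete. In the sketch below, which uses entirely made-up numbers, a hypothetical "prosperity" score drives both car sales and college attendance. Neither variable ever influences the other in the code, yet the two end up strongly correlated. The variable names and noise levels are assumptions chosen only for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson r computed from deviations around the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical third factor: how prosperous each of 200 periods is.
prosperity = [random.gauss(0, 1) for _ in range(200)]

# Each outcome depends on prosperity plus its own random noise.
# Car sales never enter the college formula, and vice versa.
car_sales = [p + random.gauss(0, 0.5) for p in prosperity]
college = [p + random.gauss(0, 0.5) for p in prosperity]

r_cars_college = pearson_r(car_sales, college)
print(f"r(car sales, college attendance) = {r_cars_college:.2f}")
```

The correlation is strong even though the only link between the two variables is the shared third factor.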
Wine Consumption and Heart Disease
But it is when two correlated variables seem likely to be causally related to one another that we tend to jump to the unsupportable conclusion that one causes the other. For example, when we hear about a correlation between an increase in stork nests and the birth rate in Germany, we laugh it off as clearly due to some unknown third factor. But when we hear that moderate wine consumption is associated with lower rates of heart disease, we’re ready to immediately conclude (especially if we’re wine lovers) that there is obviously some medically beneficial element in wine. But when these reports first came out, skeptics (they were probably statisticians) pointed out that other things could account for the association between moderate wine consumption and lower rates of heart disease. Moderate wine drinkers are likely to be better educated, to be non-smokers, to get more exercise, and to have lower rates of obesity. Again, as it has turned out, other kinds of physiological evidence do support the conclusion that moderate wine consumption is medically beneficial, but we can’t conclude that just on the basis of the correlation.
The Important Lesson About Correlation and Cause
The important lesson here is that the correlation coefficient is a highly useful statistic for determining the relationship between variables, but a correlation does not demonstrate a causal relationship between the variables.
The same holds for differences between means. If, for example, we give a pre-test and a post-test to students who have participated in a new reading program, and we find that the increase in the mean reading score is both statistically and practically significant, that does not entitle us to conclude that the new program caused the increase. Any number of other factors could account for the increase: the students were older, and they had been exposed to many other influences and experiences that could have improved their reading, and probably did. To determine how much, if any, of the improvement was caused by the new program, we would have to employ a control group (or some other method for determining "the expectation of non-treatment"). This would tell us how much improvement occurred in comparable students who had the same experiences except for the new reading program. For additional information on these and other designs that address this question, see the Ed Leaders Evaluation Web Site at http://edl.nova.edu/secure/EVASupport/index.html.
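The logic of the control group comes down to simple subtraction, sketched below with hypothetical mean scores. The numbers and the function name `gain` are invented for illustration, and a real evaluation would also need to test whether the difference is statistically significant.

```python
def gain(pre_mean, post_mean):
    """Improvement from pre-test to post-test."""
    return post_mean - pre_mean


# Hypothetical mean reading scores (made up for illustration).
program_gain = gain(pre_mean=60.0, post_mean=72.0)   # students in the new program
control_gain = gain(pre_mean=60.0, post_mean=68.0)   # comparable students without it

# The control group's gain is the improvement we would expect without the program.
program_effect = program_gain - control_gain

print(f"Program group gained {program_gain} points")
print(f"Control group gained {control_gain} points")
print(f"Gain not explained by other influences: {program_effect} points")
```

In this made-up case, only 4 of the 12 points gained by the program group go beyond what comparable students gained without the program.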