Statistics for the Behavioral Sciences, 4/e
Michael Thorne and Martin Giesen, Mississippi State University
Correlation and Regression
Chapter Overview

Correlation is defined as the degree of relationship between two or more variables. Although there are
many kinds of correlation, the chapter focuses on linear correlation, or the degree to which a straight line
best describes the relationship between two variables.
The degree of linear relationship between two variables may assume an infinite range of values, but it
is customary to speak of three different classes of correlation. Zero correlation is defined as no relationship
between variables. Positive correlation means there is a direct relationship between the variables, such that
as one variable increases, so does the other. An inverse relationship in which low values of one variable are
associated with high values of the other is called negative correlation.
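
As a brief illustration (not from the text), the sketch below generates hypothetical data for each class of
correlation; NumPy and all names in it are our own assumptions.

```python
# A hypothetical sketch (not from the text) of the three classes of
# correlation, using NumPy; all names and values here are invented.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
noise = rng.normal(size=100)

y_pos = x + 0.5 * noise        # direct relationship  -> positive r
y_neg = -x + 0.5 * noise       # inverse relationship -> negative r
y_zero = rng.normal(size=100)  # no relationship      -> r near zero

for label, y in [("positive", y_pos), ("negative", y_neg), ("zero", y_zero)]:
    print(f"{label:>8}: r = {np.corrcoef(x, y)[0, 1]:+.2f}")
```
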
A scatterplot is often used to show the relationship between two variables. Scatterplots are graphs in
which pairs of scores are plotted, with the scores on one variable plotted on the X axis and scores on the
other variable plotted on the Y axis. On the scatterplot, a pattern of points sloping upward to the right
indicates positive correlation, whereas a pattern sloping downward to the right indicates negative
correlation. Zero correlation appears as a random scatter of points with no discernible slope. High
correlation between two variables doesn’t necessarily mean that one variable caused the other.
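
For readers who want to draw such a plot, here is a minimal sketch, assuming NumPy and matplotlib; the
data are invented.

```python
# A hypothetical sketch (not from the text) of a scatterplot for positively
# correlated data; assumes NumPy and matplotlib are available.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = x + 0.5 * rng.normal(size=100)   # direct relationship

plt.scatter(x, y)                    # one point per (X, Y) pair of scores
plt.xlabel("Scores on X")
plt.ylabel("Scores on Y")
plt.title("Points sloping upward to the right: positive correlation")
plt.show()
```
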
When the data are at least interval scale, the Pearson product-moment correlation coefficient, or
Pearson r, is used to compute the degree of relationship between two variables. The Pearson r may be
defined as the mean of the z-score products for X and Y pairs, where X stands for one variable and Y stands
for the other. One approach to understanding the Pearson correlation is based on a close relative of
variance, the covariance, which is the extent to which two variables vary together. Covariance can be used
to derive a simple formula for the Pearson correlation, and we can think of the Pearson r as a standardized
covariance between X and Y.
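
To make the two definitions concrete, the following check (our own sketch, not the book's computational
formula) computes r both ways and confirms they agree:

```python
# A sketch (not the book's computational formula) verifying that the mean of
# the z-score products and the standardized covariance give the same r.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented scores
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
n = len(x)

# Definition 1: mean of the z-score products (SDs computed with N, not N - 1)
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()
r_zscores = (zx * zy).sum() / n

# Definition 2: covariance divided by the product of the two SDs
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / n
r_cov = cov_xy / (x.std() * y.std())

print(r_zscores, r_cov, np.corrcoef(x, y)[0, 1])   # all three agree: 0.8
```
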
The range of r is from –1 to +1. Restricting the range of either the X or the Y variable lowers the value
of r. The coefficient of determination, r², gives the proportion of variability in one variable that is
explained by variability in the other variable.
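
Both points can be demonstrated numerically. The sketch below (our construction, with invented data)
correlates X and Y over the full range and again after keeping only positive values of X:

```python
# A sketch (invented data, not from the text) of range restriction and r^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = x + rng.normal(size=500)

r_full = np.corrcoef(x, y)[0, 1]
keep = x > 0                         # restrict the range of X
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]

print(f"full range:       r = {r_full:.2f}, r^2 = {r_full ** 2:.2f}")
print(f"restricted range: r = {r_restricted:.2f}, r^2 = {r_restricted ** 2:.2f}")
```
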
After computing the Pearson r, we can test it for significance. First, we assume that our sample was
taken from a population in which there is no relationship between the two variables; this is just another
version of the null hypothesis. Then, we consult Table E, which contains critical values of r for different
degrees of freedom (df = N – 2) at the .05 and .01 significance levels. If our computed coefficient, in absolute value, is
equal to or greater than the critical value at the 5% level, we reject the null hypothesis and conclude that
our sample probably came from a population in which there is a relationship between the variables.
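
Where the book uses Table E, statistical software reports an exact p-value that plays the same role. A
minimal sketch, assuming SciPy is available and using invented scores:

```python
# A sketch (invented scores) of testing r for significance; SciPy's exact
# p-value stands in for the critical values in Table E.
from scipy import stats

x = [2, 4, 5, 6, 8, 9, 11, 12]
y = [1, 3, 4, 6, 7, 9, 10, 13]

r, p = stats.pearsonr(x, y)
df = len(x) - 2                      # degrees of freedom = N - 2

if p <= .05:
    print(f"r = {r:.2f}, df = {df}, p = {p:.4f}: reject the null hypothesis")
else:
    print(f"r = {r:.2f}, df = {df}, p = {p:.4f}: fail to reject")
```
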
From the definition of correlation as the degree of linear relationship between two variables, we can
use the correlation coefficient to compute the equations for the straight lines best describing the relationship
between the variables. The equations (one to predict X and one to predict Y) are called regression
equations, and we can use them to predict a score on one variable if we know a score on the other. The
general form of the equation is Y = bX + a, where b is the slope of the line and a is the Y intercept, the
point where the line crosses the Y axis. The regression line is also called the least squares line because it
minimizes the sum of the squared deviations of the actual scores from the predicted scores.
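
A short worked sketch (hypothetical data) shows how the slope and intercept for predicting Y follow from
r, the standard deviations, and the means; the formula b = r(sY/sX) is standard but is our addition here:

```python
# A sketch (invented scores) of the regression line for predicting Y from X,
# using the standard formulas b = r(s_Y / s_X) and a = mean(Y) - b * mean(X).
import numpy as np

x = np.array([2.0, 4.0, 5.0, 6.0, 8.0])
y = np.array([3.0, 5.0, 4.0, 7.0, 9.0])

r = np.corrcoef(x, y)[0, 1]
b = r * (y.std() / x.std())          # slope of the least squares line
a = y.mean() - b * x.mean()          # Y intercept

y_pred = b * 7.0 + a                 # predicted Y for a new X of 7
print(f"Y = {b:.2f}X + {a:.2f}; predicted Y at X = 7: {y_pred:.2f}")
```
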
The Spearman rank order correlation coefficient, rS, is a computationally simple alternative to r that is
useful when one or both variables are measured on an ordinal scale. Like the Pearson r, the
Spearman coefficient can be tested for significance. To test rS for significance, we compare its value with
critical values in Table F for the appropriate sample size; if our computed value is equal to or larger in
absolute value than the table value at the 5% level, we reject the null hypothesis and conclude that the two variables are
related.
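
For untied ranks, rS equals 1 − 6Σd²/[N(N² − 1)], where d is the difference between paired ranks; the
sketch below checks this formula against a library routine (our addition, with SciPy standing in for Table F):

```python
# A sketch (invented ranks, no ties) of the Spearman difference formula,
# checked against SciPy; the p-value stands in for Table F.
import numpy as np
from scipy import stats

rank_x = np.array([1, 2, 3, 4, 5, 6])   # ranks on one variable
rank_y = np.array([2, 1, 4, 3, 6, 5])   # ranks on the other

n = len(rank_x)
d = rank_x - rank_y                      # rank differences
r_s = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))

r_scipy, p = stats.spearmanr(rank_x, rank_y)
print(f"r_S = {r_s:.2f} (formula), {r_scipy:.2f} (SciPy), p = {p:.3f}")
```
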
Other correlation coefficients briefly considered in the chapter are the point biserial correlation (rpbis)
and the phi coefficient (φ). The former is useful when one variable is dichotomous (has only two values)
and the other variable is continuous or interval level, whereas the latter is used when both variables are
dichotomous. All of the inferential statistical methods covered in the text through this chapter can be tied
together under the general linear model, a general, relationship-oriented, multiple-predictor approach to
inference.
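
One concrete way to see the connection is that, with dichotomous variables coded 0/1, both coefficients
reduce to an ordinary Pearson r, as the following invented example shows:

```python
# A sketch (invented data): with dichotomous variables coded 0/1, both
# coefficients can be computed as an ordinary Pearson r.
import numpy as np

group = np.array([0, 0, 0, 1, 1, 1, 1])                 # dichotomous variable
score = np.array([4.0, 5.0, 6.0, 7.0, 8.0, 8.0, 9.0])   # continuous variable
r_pbis = np.corrcoef(group, score)[0, 1]                # point biserial

a = np.array([0, 0, 1, 1, 0, 1, 1])                     # two dichotomous
b = np.array([0, 1, 1, 1, 0, 0, 1])                     # variables
phi = np.corrcoef(a, b)[0, 1]                           # phi coefficient

print(f"r_pbis = {r_pbis:.2f}, phi = {phi:.2f}")
```
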