I've been struggling (a lot) with a series of posts I'm trying to write, and I recently realized the problem is I need to start at the beginning. These posts are supposed to be about how "correlation scores" are being misused and abused within the scientific community. The problem is, what are "correlation scores"? That's where we'll begin today.
At its simplest, a correlation score ought to reflect the relationship that exists between two variables. Sometimes these relationships are coincidental; everyone has heard the phrase, "Correlation is not causation." There are bigger concerns, though.
There are many types of correlation tests. The most commonly used is the Pearson correlation coefficient. When someone says "correlation score" without any further context, this is what they're referring to. In a programming language like R, one can calculate this correlation score with the simple command:

cor(x,y)
Which will return a numerical value indicating the correlation score of the x and y variables. What does that mean, though? What equation is this command using? Here is what Wikipedia gives for it:

cor(X,Y) = cov(X,Y) / (σ_X * σ_Y)

That is, the covariance of the two variables divided by the product of their standard deviations.
But if you don't know what covariance is, that's of little help. Let's break things down into simpler terms instead, starting with the top portion of the equation: the covariance. Before we can look at what that is, let's create some variables to use in our calculations:
x = c(3,5,7,9)
y = c(2,4,6,10)
This gives us four pairs of variables with (x,y) coordinates: (3,2), (5,4), (7,6), (9,10). We can see these two variables are highly related, a result we can check:
cor(x,y)
0.9827076
The covariance can be checked with a simple command:
cov(x,y)
8.666667
But that's not much help since we don't know what it means. Instead, let's calculate the average, or mean, of x and y:
mean(x)
6
mean(y)
5.5
Now, let's subtract those averages from each x and y value:
x - mean(x)
-3 -1 1 3
y - mean(y)
-3.5 -1.5 0.5 4.5
Next we multiply the corresponding x and y values together:
(x - mean(x)) * (y - mean(y))
10.5 1.5 0.5 13.5
We add these values together, giving us the sum:
sum((x - mean(x)) * (y - mean(y)))
26
Now, I'm going to gloss over one thing here. In statistics, we often have to distinguish between a "population" and a "sample." Put crudely, a "population" is when you have all possible data, while a "sample" is just that: a sample of the population.
Since we only have a finite number of measurements, we are working with a sample, and thus, we have to take one further step of dividing the number above by n-1, where n is the number of data points. Since we have four data points, we divide by 4-1 = 3:
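To see how little the n-1 business changes the mechanics, here is a quick sketch of the covariance steps so far. It's written in Python rather than the post's R, purely so readers without R can verify the arithmetic; nothing here is special to either language.

```python
# Sample vs. population covariance: the only difference is the divisor.
x = [3, 5, 7, 9]
y = [2, 4, 6, 10]
n = len(x)

mean_x = sum(x) / n  # 6.0
mean_y = sum(y) / n  # 5.5

# Subtract the means, multiply the pairs, and sum the products.
products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
total = sum(products)  # 26.0

sample_cov = total / (n - 1)  # divide by 3, since this is a sample
population_cov = total / n    # divide by 4, if it were the whole population

print(sample_cov)      # 8.666..., matching R's cov(x,y)
print(population_cov)  # 6.5
```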
26/3
8.666667
Which is the covariance score we calculated before:
cov(x,y)
8.666667
To go from covariance to correlation, we simply divide by the product of the standard deviations of the two variables:
cov(x,y) / (sd(x) * sd(y))
0.9827076
Which as you can see matches the correlation score we calculated before:
cor(x,y)
0.9827076
If you want to know how to calculate the standard deviation, that's straightforward as well. Remember when we subtracted the average value of x from each x value? To calculate the standard deviation, square each of those values:
(x - mean(x))^2
9 1 1 9
Then take the sum of them and divide by n-1 (3):
sum((x - mean(x))^2)/3
6.666667
The last step is to take the square root of that result, giving you the standard deviation:
sqrt(sum((x - mean(x))^2)/3)
2.581989
sd(x)
2.581989
As you can see, it is not difficult to calculate the standard deviation or covariance of variables, and it isn't difficult to use those values to calculate the correlation coefficient of two variables. You could do all of this by hand. All the computer program does is save us a lot of time by doing the arithmetic for us.
Given that, there is no reason to treat correlation scores as a magical black box you put numbers into and get results out of. In fact, we can use the steps above to let us do more than just find a single correlation score.
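To drive the "no black box" point home, the entire walk-through above fits in one short function. This is a sketch in Python rather than the post's R (the function name pearson is mine, not a library's), but it is the same arithmetic: covariance divided by the product of the standard deviations, with every divisor being n-1.

```python
import math

def pearson(x, y):
    """Pearson correlation built from the steps in the post: means,
    deviations, covariance, standard deviations, then one division."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Covariance: sum of paired deviation products, divided by n-1.
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
    # Standard deviations: root of squared deviations over n-1.
    sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    return cov / (sd_x * sd_y)

print(pearson([3, 5, 7, 9], [2, 4, 6, 10]))  # ~0.9827076, matching cor(x,y)
```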
For instance, what if we wanted to know how much each data point contributed to the correlation score? That's a good thing to check because we don't want results to depend on outliers and/or questionable data points.
Fortunately, this is easy to do. Remember how we took the sum of a set of values and then divided the result by other terms? Mathematically speaking, that's no different than dividing each value and then taking the sum: (2+4)/2 is the same as (2/2) + (4/2). These results are the same:
sum((x - mean(x)) * (y - mean(y))/3)
8.666667
sum((x - mean(x))/3 * (y - mean(y)))
8.666667
It doesn't matter for calculating the covariance if you sum the values then divide by n-1 (3) or divide those values by n-1 (3) then sum them. We can even do the same thing with the division of standard deviations in calculating the correlation score:
sum((x - mean(x)) * (y - mean(y))/(3*sd(x)*sd(y)))
0.9827076
We could simplify the formulation of that line by introducing new terms (like the dot product), but the point should be clear enough: we calculate the correlation score by taking the sum of a set of values, each of which corresponds to an (x,y) pair.
This means each pair of (x,y) values makes a contribution to the correlation score, which we can find simply by not taking the sum. Instead of adding all the contribution values together, we can look at them individually:
(x - mean(x)) * (y - mean(y))/(3*sd(x)*sd(y))
0.39686270 0.05669467 0.01889822 0.51025204
To make things easier to examine, let's assign those values to a variable then display our (x,y) pairings with their corresponding contribution:
contribution = (x - mean(x)) * (y - mean(y)) / (3*sd(x)*sd(y))
cbind(x,y,contribution)
     x  y contribution
[1,] 3  2   0.39686270
[2,] 5  4   0.05669467
[3,] 7  6   0.01889822
[4,] 9 10   0.51025204
And there you have it. A correlation score is merely the result you get if you create a table like this for two variables then take the sum of the contribution column. If you want to know how much any particular data point contributes to your results, this is all it takes to find out.
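For readers following along in Python instead of R, the same per-point breakdown can be sketched like this (the helper name contributions is mine, mirroring the R one-liner above, not a standard library function):

```python
import math

def contributions(x, y):
    """Each (x, y) pair's share of the correlation score: the per-pair
    deviation products divided by (n-1)*sd(x)*sd(y), before summing."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((xi - mx) ** 2 for xi in x) / (n - 1))
    sd_y = math.sqrt(sum((yi - my) ** 2 for yi in y) / (n - 1))
    denom = (n - 1) * sd_x * sd_y
    return [(xi - mx) * (yi - my) / denom for xi, yi in zip(x, y)]

c = contributions([3, 5, 7, 9], [2, 4, 6, 10])
print(c)       # ~[0.3969, 0.0567, 0.0189, 0.5103], one share per data point
print(sum(c))  # ~0.9827076, the correlation score itself
```

Summing the list recovers the correlation score exactly, which is the whole point: the score is nothing more than the total of these per-point shares.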
Anyone can do this. There is no complicated math involved; it's just a somewhat lengthy series of simple arithmetic operations. This is important because it means the origin of correlation scores is an objective fact. There is no uncertainty about how the numbers come about.
People may come to different conclusions about what a given data point's contribution means, but there is simply no room for doubt about what the numbers actually are. This same process can be repeated for any data set where someone wishes to calculate correlation scores.
And it should be done. Correlation scores should not be treated as magical black boxes we can't understand. Doing so will lead to mistakes. In the next post, I'll discuss some examples of where it has.