What is Correlation?

I've been struggling (a lot) with a series of posts I'm trying to write, and I recently realized the problem is I need to start at the beginning. These posts are supposed to be about how "correlation scores" are being misused and abused within the scientific community. The problem is, what are "correlation scores"? That's where we'll begin today.

At its simplest, a correlation score ought to reflect the relationship that exists between two variables. Sometimes these relationships are coincidental. Everyone has heard the phrase, "Correlation is not causation." There are bigger concerns though.

There are many types of correlation tests. The most commonly used one is the Pearson correlation coefficient. When someone says "correlation score" without any further context, this is what they're referring to. In a programming language like R, one can calculate this correlation score with the simple command:

cor(x, y)

Which will return a numerical value indicating the correlation score of the x and y variables. What does that mean though? What equation is this command using? Here is what Wikipedia gives for it:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$$

But if you don't know what covariance is, it's of little help. Let's look at things involving simpler terms instead. Let's start with the top portion of the equation, the covariance. Before we can look at what that is, let's create some variables to use in our calculations:

x = c(3,5,7,9)
y = c(2,4,6,10)

This gives us four pairs of values with (x,y) coordinates: (3,2), (5,4), (7,6), (9,10). We can see these two variables are highly related, a result we can check:

cor(x,y)
[1] 0.9827076

The covariance can be checked with a simple command:

cov(x,y)
[1] 8.666667

But that's not much help since we don't know what it means. Instead, let's calculate the average, or mean, of x and y:

mean(x)
[1] 6
mean(y)
[1] 5.5

Now, let's subtract those averages from each x and y value:

x - mean(x)
[1] -3 -1  1  3
y - mean(y)
[1] -3.5 -1.5  0.5  4.5

Next we multiply the corresponding x and y values together:

(x - mean(x)) * (y - mean(y))
[1] 10.5  1.5  0.5 13.5

We add these values together, giving us the sum:

sum((x - mean(x)) * (y - mean(y)))
[1] 26

Now, I'm going to gloss over one thing here. In statistics, we often have to distinguish between a "population" and a "sample." Put crudely, a "population" is when you have all possible data while a "sample" is just that, a sample of the population.

Since we only have a finite number of measurements, we are working with a sample, and thus, we have to take one further step of dividing the number above by n-1, where n is the number of data points. Since we have four data points, we divide by 4-1 = 3:

26/3
[1] 8.666667

Which is the covariance score we calculated before:

cov(x,y)
[1] 8.666667
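
The steps above can be collected into one small function. This is only a sketch: the name sample_cov is made up for this illustration and is not something R provides. It also shows what the result would be if we divided by n instead of n-1, which is what treating the data as a full population would amount to.

```r
# Hypothetical helper (not a built-in): sample covariance from first principles.
sample_cov <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n <- length(x)
  # Sum the products of the paired deviations, then divide by n - 1.
  sum((x - mean(x)) * (y - mean(y))) / (n - 1)
}

x <- c(3, 5, 7, 9)
y <- c(2, 4, 6, 10)

sample_cov(x, y)                        # 26 / 3
all.equal(sample_cov(x, y), cov(x, y))  # TRUE

# Dividing by n instead of n - 1 treats the four points as a whole population.
population_cov <- sum((x - mean(x)) * (y - mean(y))) / length(x)
population_cov                          # 26 / 4 = 6.5
```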

To go from covariance to correlation, we simply divide by the product of the standard deviations of the two variables:

cov(x,y) / (sd(x) * sd(y))
[1] 0.9827076

Which as you can see matches the correlation score we calculated before:

cor(x,y)
[1] 0.9827076

If you want to know how to calculate the standard deviation, that's straightforward as well. Remember when we subtracted the average value of x from each x value? To calculate the standard deviation, square each of those values:

(x - mean(x))^2
[1] 9 1 1 9

Then take the sum of them and divide by n-1 (3):

sum((x - mean(x))^2)/3
[1] 6.666667

The last step is to take the square root of that result, giving you the standard deviation:

sqrt(sum((x - mean(x))^2)/3)
[1] 2.581989
sd(x)
[1] 2.581989
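
Putting the pieces together, the whole calculation can be done without calling cov(), sd() or cor() at all. The sketch below assumes nothing beyond the steps already shown; the function names sample_sd and pearson_by_hand are invented for this example.

```r
# Hypothetical helpers: neither of these names exists in base R.
sample_sd <- function(v) {
  # Square the deviations, sum them, divide by n - 1, take the square root.
  sqrt(sum((v - mean(v))^2) / (length(v) - 1))
}

pearson_by_hand <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  # Sample covariance, computed the same way as in the walkthrough.
  covariance <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
  # Divide by the product of the two standard deviations.
  covariance / (sample_sd(x) * sample_sd(y))
}

x <- c(3, 5, 7, 9)
y <- c(2, 4, 6, 10)

all.equal(sample_sd(x), sd(x))               # TRUE
all.equal(pearson_by_hand(x, y), cor(x, y))  # TRUE
```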

As you can see, it is not difficult to calculate the standard deviation or covariance of variables, and it isn't difficult to use those values to calculate the correlation coefficient of two variables. You could do all this by hand. All the computer program does is save us a lot of time by doing the arithmetic for us.

Given that, there is no reason to treat correlation scores as a magical black box you put numbers into and get results out of. In fact, we can use the steps above to let us do more than just find a single correlation score.

For instance, what if we wanted to know how much each data point contributed to the correlation score? That's a good thing to check because we don't want results to depend on outliers and/or questionable data points.

Fortunately, this is easy to do. Remember how we had to take the sum of each data value then divide the result by other things? Mathematically speaking, that's no different than dividing each value then taking the sum. (2+4)/2 is the same as (2/2) + (4/2). These results are the same:

sum((x - mean(x)) * (y - mean(y))) / 3
[1] 8.666667
sum((x - mean(x)) * (y - mean(y)) / 3)
[1] 8.666667

It doesn't matter for calculating the covariance if you sum the values then divide by n-1 (3) or divide those values by n-1 (3) then sum them. We can even do the same thing with the division of standard deviations in calculating the correlation score:

sum((x - mean(x)) * (y - mean(y))/(3*sd(x)*sd(y)))
[1] 0.9827076

We could simplify the formulation of that line by introducing new terms (like dot product), but the point should be clear enough. We calculate the correlation score by taking the total sum of a set of values that each correspond to an (x,y) pair of values.
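
For anyone curious what the dot product version might look like, here is one possible sketch. It standardizes each variable first (subtract the mean, divide by the standard deviation), and the variable names zx, zy and r_dot are just ones chosen for this example.

```r
x <- c(3, 5, 7, 9)
y <- c(2, 4, 6, 10)
n <- length(x)

# Standardize each variable: subtract its mean, divide by its sample sd.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# The correlation is the dot product of the standardized values over n - 1.
r_dot <- sum(zx * zy) / (n - 1)

# crossprod() computes the same dot product; drop() turns the 1x1 matrix
# it returns into a plain number.
r_crossprod <- drop(crossprod(zx, zy)) / (n - 1)

all.equal(r_dot, cor(x, y))        # TRUE
all.equal(r_crossprod, cor(x, y))  # TRUE
```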

This means each pair of (x,y) values has a contribution to the correlation score we can find simply by not taking the sum. Instead of adding all the contribution values together, we could look at them individually:

(x - mean(x)) * (y - mean(y))/(3*sd(x)*sd(y))
[1] 0.39686270 0.05669467 0.01889822 0.51025204

To make things easier to examine, let's assign those values to a variable then display our (x,y) pairings with their corresponding contribution:

contribution = (x - mean(x)) * (y - mean(y)) / (3*sd(x)*sd(y))
cbind(x,y,contribution)
     x  y contribution
[1,] 3  2   0.39686270
[2,] 5  4   0.05669467
[3,] 7  6   0.01889822
[4,] 9 10   0.51025204

And there you have it. A correlation score is merely the result you get if you create a table like this for two variables then take the sum of the contribution column. If you want to know how much any particular data point contributes to your results, this is all it takes to find out.
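
To repeat this on other data, the table can be wrapped in a function. This is a rough sketch rather than a polished tool, and correlation_contributions is a name invented here. The second data set is made up purely to show that contributions are not always positive.

```r
# Hypothetical helper (not part of R): per-point contributions to Pearson's r.
correlation_contributions <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n <- length(x)
  contribution <- (x - mean(x)) * (y - mean(y)) / ((n - 1) * sd(x) * sd(y))
  data.frame(x = x, y = y, contribution = contribution)
}

x <- c(3, 5, 7, 9)
y <- c(2, 4, 6, 10)
tab <- correlation_contributions(x, y)
tab

# The contribution column always sums to the correlation score.
all.equal(sum(tab$contribution), cor(x, y))   # TRUE

# Made-up data for illustration: the fourth point sits above the mean of x
# but below the mean of y, so it pulls the correlation down.
x2 <- c(1, 2, 3, 4, 5)
y2 <- c(2, 1, 4, 3, 6)
tab2 <- correlation_contributions(x2, y2)
tab2$contribution[4] < 0                      # TRUE
```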

Anyone can do this. There is no complicated math involved. It's just a somewhat lengthy series of simple arithmetic operations. This is important because it means the origin of correlation scores is an objective fact. There is no uncertainty about how the numbers come about.

People may come to different conclusions about what a given data point's contribution means, but there is simply no room for doubt about what the numbers actually are. This same process can be repeated for any data set where someone wishes to calculate correlation scores.

And it should be done. Correlation scores should not be treated as magical black boxes we can't understand. Doing so will lead to mistakes. In the next post, I'll discuss some examples of where it has.

5 comments

  1. This post wasn't scheduled to go live for another couple hours, but oh well. Hopefully I caught all the errors in my earlier round of proofreading. Feel free to speak up if you see any problems/have any questions.

  2. Hi Brandon,
    The contribution array should be all positive #s. The centered x & y arrays are both {neg, neg, pos, pos}, so each product term should be a positive value.

    Found the bug. You have
    contribution = (x - mean(x))/(3*sd(x)*sd(y)) * (y - mean(y)/(3*sd(x)*sd(y)))
    Should be
    contribution = (x - mean(x)) * (y - mean(y)) / (3*sd(x)*sd(y))

  3. Thanks HaroldW. You threw me off when you said all the contribution values should be positive (in this case, that's true but it isn't in general), but it only took me a second to spot my error once you pointed it out. I've updated the post, and it should (hopefully!) be correct now. I should have known I couldn't write a post with this many math steps without making a boneheaded error.

    I was uncertain about the last couple commands for this post, but I figured it was just fatigue and the fact it's much less clear when typing on a computer than when writing on paper. The reason I thought that was that the final result came out correct. For the mathematically inclined, a fun challenge is to figure out why the final result still came out correct. It's both interesting for the math and fun because you get to see how bad my mistake was.

  4. Brandon -
    Yes, I wrote that a little sloppily, glad you were able to figure it out.

    And yes, it's fun to figure out why it "works" and the sum of the contribution array still comes out to the correct value of 0.98+. It reminds me of "Lucky Larry" problems, where a student performs some mathematical process incorrectly, yet ends up with the correct result by coincidence. Google "Lucky Larry arithmetic" or such. One example found here is a student, reducing the fraction 19/95, cancels the 9's to give 1/5, and does the same with 16/64=1/4.
