Last week, I demonstrated that the mathematics that goes into calculating "correlation scores" is relatively simple. Today, I'd like to look at some of the steps involved to better understand what correlation scores actually mean.
In our previous example, we had four (x,y) data points. We wanted to find how the x and y values correlated. Having done so, we went a step further and examined how each (x,y) data point contributed to that correlation. The result was this table:
     x  y contribution
[1,] 3  2   0.39686270
[2,] 5  4   0.05669467
[3,] 7  6   0.01889822
[4,] 9 10   0.51025204
As we can see, x - y = 1 for three of the four data points. The one data point where that is not true is the fourth, where x - y = -1. That makes it an "outlier," a point notably different from the rest.
Notice how the outlier gets the greatest weight of all the data points (.51 as opposed to .40, .06 and .02). One might wonder, why does this happen? To understand, we can look back at how these contribution scores were calculated:
contribution = (x - mean(x)) * (y - mean(y)) / (3*sd(x)*sd(y))
You can go back to the previous post if you need a refresher on how we got this formula (the 3 in the denominator is n - 1 for our four data points). The important thing to understand is that for each data point, we subtract the average value of that variable: from each x we subtract mean(x), and from each y we subtract mean(y).
That difference sits in the numerator (the top portion of the fraction). The larger the numerator, the larger the result. Conversely, the closer a value is to its average, the smaller the result of subtracting the mean, and the smaller that point's contribution.
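The arithmetic is easy to verify. Here is a sketch in Python (the post's code is R; this is just the same formula restated) using our four data points:

```python
import statistics

x = [3, 5, 7, 9]
y = [2, 4, 6, 10]
n = len(x)
mx, my = statistics.mean(x), statistics.mean(y)

# contribution = (x - mean(x)) * (y - mean(y)) / ((n - 1) * sd(x) * sd(y))
denom = (n - 1) * statistics.stdev(x) * statistics.stdev(y)
contributions = [(xi - mx) * (yi - my) / denom for xi, yi in zip(x, y)]

for c in contributions:
    print(round(c, 8))
print(round(sum(contributions), 7))  # summing the contributions gives the correlation
```

The four printed values match the contribution column in the earlier table, and their sum is the correlation between x and y.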
What this tells us is data points which are less like the rest will tend to get more weight in correlation calculations. There are further complications. Outliers also shift the mean values you calculate, and they inflate the standard deviations the numerator is divided by. Additionally, since the x and y deviations in a pair are multiplied together, having only one of the two values in a pair be an outlier has a different effect than having both be outliers.
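To see the mean and standard deviation effects concretely, here is a small Python sketch. The fifth value is a made-up extreme point, not anything from our data:

```python
import statistics

base = [3, 5, 7, 9]
with_outlier = base + [25]  # hypothetical extreme value added for illustration

# the outlier drags the mean upward and inflates the standard deviation
print(statistics.mean(base), statistics.stdev(base))                  # 6 and ~2.58
print(statistics.mean(with_outlier), statistics.stdev(with_outlier))  # 9.8 and ~8.79
```

One extreme value pulls the mean from 6 up to 9.8 and more than triples the standard deviation, which in turn changes every other point's contribution.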
Understanding those details can help us quantify the effect of outliers, but we don't need them to understand the concept. The reason outliers get more weight in correlation calculations is the assumption that the data has a normal distribution. To understand what that is, here is an image from Wikipedia:
This image shows three different curves, each with a "normal distribution." Each curve has a single peak, with values tapering off at the same rate in either direction. How steep the peak is or how quickly the values taper off on either side can vary, but normal distributions will all have this same general shape. For correlation calculations, it is assumed the data has such a shape.
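We can also see this shape numerically. Here is a Python sketch evaluating the normal density function, which peaks at the mean and tapers off identically in both directions:

```python
import math

def normal_pdf(z, mu=0.0, sigma=1.0):
    # density of a normal distribution with mean mu and standard deviation sigma
    return math.exp(-((z - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# the density is largest at the mean (0 here) and the same at +k and -k
for k in [0, 1, 2, 3]:
    print(k, round(normal_pdf(k), 4), round(normal_pdf(-k), 4))
```

Points one standard deviation out are far more common than points three standard deviations out, which is why a correlation calculation treats the latter as carrying more information.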
The reason outliers get more weight in the calculations is that we expect there to be far fewer of them. We don't want to give each of the 100 points near the peak of the curve the same weight as the one or two points near the thin "tails" of the curve.
Of course, that assumes the outlier is a valid data point. Often, outliers are not. When an outlier is caused by some sort of data error, giving it extra weight can be harmful. In fact, the larger the data error, the more weight you'd give the outlier. Issues like that mean we cannot give a perfect answer to every question just by looking at the math.
But looking at the math can help us learn a lot about our data sets. An outlier which doesn't fit on a normal distribution curve will cause problems for our calculations, but that's not the only time our data will fail to have a normal distribution. To demonstrate, let's look at these results from a question on a survey:
This is clearly not a normal distribution. 1,145 people took this survey. For the question seen in this example, they were asked how much they agreed with a statement, with 1 being "strongly disagree" and 4 being "strongly agree." The idea they were asked to agree or disagree with? NASA faked the moon landing.
Of course people nearly universally disagreed with the idea NASA faked the moon landing. That the results don't have a normal distribution is to be expected. The question is, what effect would this lack of a normal distribution have on any correlation scores? To try to get an idea, let's look at another histogram of responses to the same survey:
The distribution of this data is a bit closer to a normal distribution, but it is still clearly "skewed" toward one side. Why is that? Well, people were asked to agree or disagree with the idea the federal government knew the attack on Pearl Harbor was going to happen but let it happen so the United States would join World War II. That isn't as crazy as the idea of NASA faking the moon landing.
Looking at these two histograms, you can probably guess there would be a positive "correlation score" between responses to these two survey items. Let's check. We start by assigning our variable:
x = surv$CYMoon
y = surv$CYPearlHarbor
And check the results with the built-in correlation calculation function:
cor(x,y)
[1] 0.2252181
This tells us there is a positive correlation between people's agreement or disagreement with the idea NASA faked the moon landing and the idea the federal government of the 1940s allowed the Japanese fleet to bomb a United States military base in order to justify the United States entering the war.
But what does that correlation score mean? Does it mean people who believe in one conspiracy theory are inclined to believe the other? That's easy enough to check. Let's make a table showing how people responded to both survey items:
table(x,y)
   y
x     1   2   3   4
  1 420 524 112  11
  2   2  49  17   0
  3   1   1   0   2
  4   0   2   2   2
A total of 10 people responded with a 3 (agree) or 4 (strongly agree) with the idea NASA faked the moon landing. Of them, four disagreed with the idea of the federal government intentionally allowing the Pearl Harbor attack to happen. Only six people claimed to agree with both conspiracy theories.
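Those tallies are easy to verify in code. Here is a Python sketch using the counts copied from the cross-tab above:

```python
# rows = CYMoon responses 1-4, columns = CYPearlHarbor responses 1-4
counts = [
    [420, 524, 112, 11],
    [  2,  49,  17,  0],
    [  1,   1,   0,  2],
    [  0,   2,   2,  2],
]

moon_agree = sum(sum(row) for row in counts[2:])            # answered 3 or 4 on CYMoon
both_agree = sum(c for row in counts[2:] for c in row[2:])  # answered 3 or 4 on both items
print(moon_agree, both_agree, sum(map(sum, counts)))        # 10 6 1145
```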
If only six people claimed to agree with both conspiracy theories, why then do we find a positive correlation between the answers to each survey item? To find out, let's calculate the correlation scores as we did before. I don't want to show a table of all 1,145 responses. Instead, I'll shamelessly crib some code from Steve McIntyre, proprietor of Climate Audit, to make a cleaner table:
N = nrow(surv)
Stat = data.frame(CYMoon = rep(1:4, times = 4),
                  PearlHarbor = rep(1:4, each = 4),
                  count = c(with(surv, table(CYMoon, CYPearlHarbor))))
m = c(mean(surv$CYMoon), mean(surv$CYPearlHarbor)); m
Stat$dot = (Stat$CYMoon - m[1]) * (Stat$PearlHarbor - m[2])
Stat$normdot = Stat$dot / (sd(surv$CYMoon) * sd(surv$CYPearlHarbor)) / (N - 1)
Stat$contribution = Stat$normdot * Stat$count
The math is the same as before, but McIntyre's code is cleaner. If you'd prefer to use the previous formula and generate a table of over 1,000 lines, that will give the same results. You could also clean up the code I've written to make it more modular. Whatever you do, you'll get the same final results (I'm excluding the fourth and fifth columns as they are intermediary steps):
Stat[,c(1:3,6)]
   CYMoon PearlHarbor count contribution
1       1           1   420  0.098971237
2       2           1     2 -0.005269441
3       3           1     1 -0.005505087
4       4           1     0  0.000000000
5       1           2   524 -0.036637985
6       2           2    49  0.038306391
7       3           2     1  0.001633446
8       4           2     2  0.004970258
9       1           3   112 -0.042054369
10      2           3    17  0.071370195
11      3           3     0  0.000000000
12      4           3     2  0.026691422
13      1           4    11 -0.007491562
14      2           4     0  0.000000000
15      3           4     2  0.031821024
16      4           4     2  0.048412587
Like in our previous example, if we sum the contribution scores:
sum(Stat$contribution)
[1] 0.2252181
We get the correct correlation score. Now that we know that, let's check where this correlation score came from. The largest contribution score is in the (1,1) coordinate pairing, where 420 people said they "strongly disagree" with both conspiracy theories. This contributed 0.099 of the 0.225 total correlation. Another 0.032 and 0.048 came from the four people who claimed to agree with both conspiracies. Another 0.071 came from 17 people who said they disagreed with one conspiracy but agreed with the other.
This tells us that, of the 1,145 people who took this survey, the correlation between these two survey items stems primarily from the 420 people who strongly disagreed with both conspiracies, the 17 people who agreed with one but disagreed with the other, and the four people who agreed with both conspiracies.
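For anyone without the surv data handy, here is a sketch of the same per-cell arithmetic in Python (not the post's R), working from nothing but the frequency counts shown above:

```python
# counts[i][j] = respondents answering i+1 on CYMoon and j+1 on CYPearlHarbor,
# copied from the frequency table above
counts = [
    [420, 524, 112, 11],
    [  2,  49,  17,  0],
    [  1,   1,   0,  2],
    [  0,   2,   2,  2],
]
N = sum(map(sum, counts))  # 1145 respondents

# grouped means of each item, weighting each response by its count
mx = sum((i + 1) * c for i, row in enumerate(counts) for c in row) / N
my = sum((j + 1) * c for row in counts for j, c in enumerate(row)) / N

# grouped sample variances (dividing by N - 1, as R's sd() does)
vx = sum(c * (i + 1 - mx) ** 2 for i, row in enumerate(counts) for c in row) / (N - 1)
vy = sum(c * (j + 1 - my) ** 2 for row in counts for j, c in enumerate(row)) / (N - 1)

denom = (N - 1) * vx ** 0.5 * vy ** 0.5
contribution = [[c * (i + 1 - mx) * (j + 1 - my) / denom
                 for j, c in enumerate(row)] for i, row in enumerate(counts)]

print(round(contribution[0][0], 6))           # the (1,1) cell, ~0.099
print(round(sum(map(sum, contribution)), 7))  # the full correlation, ~0.2252
```

The (1,1) cell and the grand total match the R output above, confirming the whole correlation score is just these per-cell contributions added together.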
Let's try the same with a different pairing of conspiracies. This time, we'll use the moon landing conspiracy and the idea global warming is a hoax. For the sake of space, I'll skip some of the steps. Here is a frequency table showing how people responded to the two items:
table(surv$CYMoon, surv$CYClimChange)
      1   2   3   4
  1 892  53  65  57
  2  39  20   5   4
  3   2   1   0   1
  4   2   2   0   2
The histogram showing how many people agreed global warming is a hoax:
The correlation score between the two items:
cor(surv$CYMoon, surv$CYClimChange)
[1] 0.1265264
And finally, the table showing how this correlation score arises:
   CYMoon ClimChange count contribution
1       1          1   892  0.081511056
2       2          1    39 -0.039846588
3       3          1     2 -0.004269590
4       4          1     2 -0.006495765
5       1          2    53 -0.008748525
6       2          2    20  0.036911683
7       3          2     1  0.003856235
8       4          2     2  0.011733771
9       1          3    65 -0.027398354
10      2          3     5  0.023564379
11      3          3     0  0.000000000
12      4          3     0  0.000000000
13      1          4    57 -0.038643707
14      2          4     4  0.030320669
15      3          4     1  0.015838294
16      4          4     2  0.048192843
892 people said they strongly disagreed with the idea NASA faked the moon landing and strongly disagreed with the idea global warming is a hoax. That contributed 0.082 to the total correlation score of 0.127. Two people said they strongly agreed with both ideas, contributing 0.048. One person claimed to strongly agree with one idea and agree with the other, contributing 0.016.
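As before, these tallies can be checked directly against the frequency table. A Python sketch using the counts from above:

```python
# rows = CYMoon responses 1-4, columns = CYClimChange responses 1-4
counts = [
    [892, 53, 65, 57],
    [ 39, 20,  5,  4],
    [  2,  1,  0,  1],
    [  2,  2,  0,  2],
]

both_agree = sum(c for row in counts[2:] for c in row[2:])  # 3 or 4 on both items
both_disagree = counts[0][0]                                # strongly disagree on both
print(both_agree, both_disagree, sum(map(sum, counts)))     # 3 892 1145
```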
Of 1,145 respondents, a total of three people claimed to believe NASA faked the moon landing and global warming is a hoax. 892 people found both ideas laughable. That is what the math indicates. What then, should one say about this math?
A competent researcher would notice this data is so heavily skewed that any correlation scores we might calculate are meaningless; the assumption of a normal distribution is violated to an extreme. That's not what was done, though. This is the title of a paper published based on this survey:
NASA Faked the Moon Landing—Therefore, (Climate) Science Is a Hoax
That paper got its authors a great deal of attention and established them as "experts" in their field. That's a shame. In Part Three, I will prove these authors would have gotten the same results they published if nobody taking their survey had claimed to believe NASA faked the moon landing, and the same results if nobody taking the survey had claimed to believe global warming is a hoax.
These authors published results which didn't require any data that supported them. They did so because they used naive correlation tests on data with a heavily skewed distribution, violating the requirements of those tests. They then chose not to examine their data to find out where their results came from despite it being very easy to do.
And they're not the only ones. Many "scientists" are doing the exact same thing.