I've written a post titled "Correlation is Meaningless" once before. It makes the same basic point I made in a recent post discussing the PhD dissertation of Kirsti Jylhä. Today I'm going to continue my discussion of Jylhä's work to examine more of a phenomenon in which people misuse simple statistics to produce all sorts of bogus results. In Jylhä's case, it undercuts much of the value of her PhD.
To briefly recap the last post on this topic, the first paper Jylhä relies on for her thesis contains two studies. The results of the first study are given by this table in the paper:
As well as a discussion of a follow-up analysis. For now, I want to keep focusing on this first table. It supposedly demonstrates a statistically significant relationship between "climate change denial" and various political/social ideologies. However, when we visualize the data, we see this relationship is an artifact. Here is the data plotted to show the supposed relationship between "climate change denial" and "social dominance orientation," an ideology which accepts inequality amongst groups of people:
A small jitter value has been added to the data to make its density visible. As this graph shows, there is no meaningful relationship between climate change denial and social dominance orientation. People who strongly believe in global warming strongly oppose inequality, but there is no evidence the reverse holds.
The reason the authors can claim there is a "statistically significant" correlation between these two traits is that they collected almost no data from anyone who "denies" climate change. The approach the authors have taken is to draw a line through their data, which is how you normally calculate the relationship between two variables, and then extrapolate it far beyond where their data extends.
There are a lot of ways of describing this approach. When I've previously said correlation is meaningless, I used an example in which I demonstrated a "statistically significant" correlation between belief in global warming and support for genocide. It was completely bogus. I was able to do it because I used the same approach the authors used. Namely:
1) Collect data for any group of people.
2) Determine views that group holds.
3) Find a group which is "opposite" the group you study.
4) Assume they must hold the opposite view of the group you studied on every issue.
This will work with literally any subject and any group of people. You can reach basically any conclusion you want because this approach doesn't require you to have any data for the group of people you're drawing conclusions about.
You can have some data for them, though. This same approach works as long as your data set is heavily skewed toward one group. When there is some data for the target group, the results are not quite as easy to dismiss out of hand. Just plotting the data on a simple chart is no longer sufficient. We need to use a bit of math.
The correlation score between "climate change denial" and "social dominance orientation" was given as 0.53. With a bit of math, we can determine exactly where that number comes from. First, let's try examining a contingency table for the data:
                       Social_dominance_orientation
Climate_change_denial     1    2    3    4    5
                    1    20   19    1    0    0
                    2    22   38    3    0    1
                    3     3   15   11    0    0
                    4     0    1    0    1    0
I need to caution readers that I have rounded this data. Each trait, like "climate change denial," is generated via a 10+ item questionnaire. I was unable to get the raw data, instead being provided only the averages for each trait. The result is there are a ton of values like 1.81, 2.06 and so forth.
Everything I'm saying and any patterns I highlight will hold true for the un-rounded data, but the rounding will cause the exact numbers to differ. For instance, the correlation score changes from 0.53 to 0.41. All underlying patterns remain, but it would be too cumbersome to work with 30+ unique values per item for this post.
With that said, this contingency table shows the same thing we saw when we visualized the data. There is practically no data from anyone who denies climate change or favors inequality between social groups. The highest value (5) for climate change denial doesn't even show up in the contingency table because nobody scored that high. That should be enough to cause anyone to doubt the idea there is a relationship between these two variables.
We can go further though. Given this contingency table, we can ask ourselves, "How much does each pairing contribute to the correlation score?" As the first step to answering this question, let's change our contingency table to a list (CCD = climate change denial, SDO = social dominance orientation):
   CCD SDO count
1    1   1    20
2    2   1    22
3    3   1     3
4    4   1     0
5    1   2    19
6    2   2    38
7    3   2    15
8    4   2     1
9    1   3     1
10   2   3     3
11   3   3    11
12   4   3     0
13   1   4     0
14   2   4     0
15   3   4     0
16   4   4     1
17   1   5     0
18   2   5     1
19   3   5     0
20   4   5     0
I've excluded the rows for a value of 5 on the CCD trait since there are none. To find the relative contribution of each pairing to the total correlation score, we use something called the "dot product." You don't need to worry about the details. What matters is that the dot product is the standard method for determining the angle between two vectors, and in simple terms, we can use it to calculate the weight for each pairing. The table below shows the dot product and the normalized (centered) dot product:
   CCD SDO count           dot       normdot
1    1   1    20   0.772565158  0.0106728842
2    2   1    22  -0.042249657 -0.0005836734
3    3   1     3  -0.857064472 -0.0118402309
4    4   1     0  -1.671879287 -0.0230967885
5    1   2    19  -0.175582990 -0.0024256555
6    2   2    38   0.009602195  0.0001326530
7    3   2    15   0.194787380  0.0026909616
8    4   2     1   0.379972565  0.0052492701
9    1   3     1  -1.123731139 -0.0155241952
10   2   3     3   0.061454047  0.0008489794
11   3   3    11   1.246639232  0.0172221541
12   4   3     0   2.431824417  0.0335953288
13   1   4     0  -2.071879287 -0.0286227350
14   2   4     0   0.113305898  0.0015653058
15   3   4     0   2.298491084  0.0317533466
16   4   4     1   4.483676269  0.0619413874
17   1   5     0  -3.020027435 -0.0417212747
18   2   5     1   0.165157750  0.0022816322
19   3   5     0   3.350342936  0.0462845392
20   4   5     0   6.535528121  0.0902874461
The normalized version is what we're interested in: it tells us the relative contribution each possible pairing makes to the correlation score. Each person who is rated a 1 on both the climate change denial and social dominance orientation scales adds 0.010 to the total correlation score. Each person who scores a 3 on both scales, meaning they express no opinion on either topic, adds 0.017. A person who scores 4 on both scales, meaning they deny climate change and favor social inequality, adds a whopping 0.062.
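To make the table's arithmetic concrete, here is a short Python sketch. The counts come from the rounded contingency table above; the helper names `dot` and `normdot` simply mirror the column headings and are otherwise my own:

```python
# Rounded contingency counts: (CCD, SDO) -> number of respondents.
counts = {
    (1, 1): 20, (2, 1): 22, (3, 1): 3,  (4, 1): 0,
    (1, 2): 19, (2, 2): 38, (3, 2): 15, (4, 2): 1,
    (1, 3): 1,  (2, 3): 3,  (3, 3): 11, (4, 3): 0,
    (1, 4): 0,  (2, 4): 0,  (3, 4): 0,  (4, 4): 1,
    (1, 5): 0,  (2, 5): 1,  (3, 5): 0,  (4, 5): 0,
}

n = sum(counts.values())                                 # 135 respondents
mean_x = sum(x * c for (x, _), c in counts.items()) / n  # mean CCD score
mean_y = sum(y * c for (_, y), c in counts.items()) / n  # mean SDO score

# Sums of squared deviations, used to normalize the dot products.
ss_x = sum(c * (x - mean_x) ** 2 for (x, _), c in counts.items())
ss_y = sum(c * (y - mean_y) ** 2 for (_, y), c in counts.items())
norm = (ss_x * ss_y) ** 0.5

def dot(x, y):
    """Centered cross-product for one (CCD, SDO) pairing."""
    return (x - mean_x) * (y - mean_y)

def normdot(x, y):
    """That pairing's per-person weight in the correlation score."""
    return dot(x, y) / norm

print(round(dot(1, 1), 6))      # 0.772565, matching row 1 of the table
print(round(normdot(1, 1), 6))  # 0.010673
```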
Of course, the normalized dot product only tells us what each pairing would add to the total correlation score. It doesn't consider how many people actually fit in that pairing. To find that out, we multiply the normdot column by the count column to get the total contribution of each pairing. Let's do that and sort the data by most influential pairings:
   CCD SDO count           dot       normdot  contribution
1    1   1    20   0.772565158  0.0106728842   0.213457685
11   3   3    11   1.246639232  0.0172221541   0.189443695
16   4   4     1   4.483676269  0.0619413874   0.061941387
7    3   2    15   0.194787380  0.0026909616   0.040364424
8    4   2     1   0.379972565  0.0052492701   0.005249270
6    2   2    38   0.009602195  0.0001326530   0.005040815
10   2   3     3   0.061454047  0.0008489794   0.002546938
18   2   5     1   0.165157750  0.0022816322   0.002281632
4    4   1     0  -1.671879287 -0.0230967885   0.000000000
12   4   3     0   2.431824417  0.0335953288   0.000000000
13   1   4     0  -2.071879287 -0.0286227350   0.000000000
14   2   4     0   0.113305898  0.0015653058   0.000000000
15   3   4     0   2.298491084  0.0317533466   0.000000000
17   1   5     0  -3.020027435 -0.0417212747   0.000000000
19   3   5     0   3.350342936  0.0462845392   0.000000000
20   4   5     0   6.535528121  0.0902874461   0.000000000
2    2   1    22  -0.042249657 -0.0005836734  -0.012840814
9    1   3     1  -1.123731139 -0.0155241952  -0.015524195
3    3   1     3  -0.857064472 -0.0118402309  -0.035520693
5    1   2    19  -0.175582990 -0.0024256555  -0.046087455
Sum up the values in the final column, and you'll get 0.41, the total correlation score (remember, this is different from the 0.53 reported by the authors due to me rounding the data). This means we can now determine, mathematically, the source of the reported correlation with complete certainty. We don't have to rely on visualization or perception. This is cold, hard math.
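For readers who want to reproduce this, the entire decomposition fits in a few lines of Python. This is a sketch built on the rounded counts from the contingency table above, so it recovers the 0.41 figure rather than the paper's 0.53:

```python
# Rounded contingency counts: (CCD, SDO) -> number of respondents.
counts = {
    (1, 1): 20, (2, 1): 22, (3, 1): 3,  (4, 1): 0,
    (1, 2): 19, (2, 2): 38, (3, 2): 15, (4, 2): 1,
    (1, 3): 1,  (2, 3): 3,  (3, 3): 11, (4, 3): 0,
    (1, 4): 0,  (2, 4): 0,  (3, 4): 0,  (4, 4): 1,
    (1, 5): 0,  (2, 5): 1,  (3, 5): 0,  (4, 5): 0,
}

n = sum(counts.values())
mean_x = sum(x * c for (x, _), c in counts.items()) / n
mean_y = sum(y * c for (_, y), c in counts.items()) / n
ss_x = sum(c * (x - mean_x) ** 2 for (x, _), c in counts.items())
ss_y = sum(c * (y - mean_y) ** 2 for (_, y), c in counts.items())
norm = (ss_x * ss_y) ** 0.5

# Contribution of each pairing = its count times its normalized dot product.
contributions = {
    (x, y): c * (x - mean_x) * (y - mean_y) / norm
    for (x, y), c in counts.items()
}

# The contributions sum to the Pearson correlation itself.
r = sum(contributions.values())
print(round(r, 2))  # 0.41 on the rounded data

# The two dominant cells are (1,1) and (3,3), just as in the sorted table.
top_two = sorted(contributions, key=contributions.get, reverse=True)[:2]
print(top_two)  # [(1, 1), (3, 3)]
```

The key point is that nothing here is an approximation: grouping the respondents by their (CCD, SDO) pairing and summing each cell's centered cross-product is algebraically identical to computing the Pearson correlation directly.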
A few months ago I contacted Kirsti Jylhä about this paper, and during the following discussion, I reported these same results to her. I then said:
With that established, we can see in this contingency table most of the .41 correlation we calculate (after rounding the responses for simplicity) comes from two types of response combinations. .21 of the .41 comes from people answering 1 on both the SDO and Denial scales. That is, half of the correlation comes from respondents who fully believe in global warming and completely reject the idea of social dominance. Another .19 of the .41 score comes from people who gave a neutral (3) response to both items.
If we total up the contribution to the calculated correlation score from people who answered on the "denial" side (greater than 3), we see it amounts to a much smaller .07 (the positive values total more than the calculated correlation score because there are some negative contributions as well). That contribution comes from only two respondents, one of whom actually gave a response of 2 on the SDO scale, meaning they reject the idea of social dominance.
She has told me several times she is too busy to look at this. I had hoped she would find the time before I discussed this publicly, but after several follow-ups over a few months, I decided I had waited long enough. As I went on to tell her:
This same process can be repeated for each correlation score you reported in this paper. I've also tested it on the correlation scores you reported for the second paper discussed in your thesis. I haven't had time to do the same calculations for the third paper, but I expect the pattern may hold for it as well. As an example, here is a sorted contingency table for the "denial" and RWA items for Study 1 of the 2014 paper:

    D RWA count       normdot       normsum
 9  1   3     7 -0.0139782293  -0.097847605
 7  3   2    18 -0.0025399557  -0.045719203
 3  3   1     1 -0.0205870096  -0.020587010
 6  2   2    47 -0.0001252091  -0.005884827
 2  2   1     3 -0.0010148526  -0.003044558
 4  4   1     0 -0.0401591666   0.000000000
 8  4   2     0 -0.0049547024   0.000000000
10  2   3    14  0.0007644344   0.010702082
 5  1   2    23  0.0022895376   0.052659364
12  4   3     2  0.0302497619   0.060499524
11  3   3    10  0.0155070981   0.155070981
 1  1   1    10  0.0185573044   0.185573044
Again, we see the calculated correlation score comes primarily from responses to both items being either 1 (fully believe in climate change, fully reject right-wing authoritarianism) or 3 (neutral on both issues). The only contribution from anyone on the "denial" side of things is that of two individuals, who contribute only .06 while people not on the "denial" side of things contribute ~.40.
Put simply, your data does not show a correlation between denying global warming and RWA or SDO. What it shows is a correlation between not denying global warming and rejecting RWA or SDO. The reason this happens is what I mentioned near the start of this e-mail - your data is highly skewed. Because of how few respondents were on the "denial" side of things, your correlation scores are weighted heavily toward people who do not deny global warming.
This is a fairly well-known mathematical phenomenon. It stems from the fact that simple correlation tests, like the bivariate Pearson correlation test, assume bivariate normality in the data. A skewed data set like the one you used fails to fulfill this assumption. As such, the results of the test will be in error, to a degree that depends in part on how skewed the dataset is.
Your datasets are skewed enough to produce seemingly "significant" correlation scores even though there is no demonstrable linear relationship between the variables in question. Had an appropriate test been used, no significant results would have been found, because there was simply not enough data from people on the "denial" side of things.
To demonstrate the full breadth of this problem, I should stress that if we removed every response which indicated any amount of "denial" from your data set, we would still find statistically significant correlations between climate denial and things like SDO or RWA. In fact, you could remove everyone who gave a high response to either item and still get "statistically significant" correlations.
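That claim can be checked against the rounded contingency table given earlier in this post. This Python sketch recomputes the correlation after dropping the "denial" side responses (the `pearson` helper is my own, written out so the calculation is explicit):

```python
# Rounded contingency counts: (CCD, SDO) -> number of respondents.
counts = {
    (1, 1): 20, (2, 1): 22, (3, 1): 3,  (4, 1): 0,
    (1, 2): 19, (2, 2): 38, (3, 2): 15, (4, 2): 1,
    (1, 3): 1,  (2, 3): 3,  (3, 3): 11, (4, 3): 0,
    (1, 4): 0,  (2, 4): 0,  (3, 4): 0,  (4, 4): 1,
    (1, 5): 0,  (2, 5): 1,  (3, 5): 0,  (4, 5): 0,
}

def pearson(table):
    """Pearson correlation for a {(x, y): count} table."""
    n = sum(table.values())
    mx = sum(x * c for (x, _), c in table.items()) / n
    my = sum(y * c for (_, y), c in table.items()) / n
    num = sum(c * (x - mx) * (y - my) for (x, y), c in table.items())
    ssx = sum(c * (x - mx) ** 2 for (x, _), c in table.items())
    ssy = sum(c * (y - my) ** 2 for (_, y), c in table.items())
    return num / (ssx * ssy) ** 0.5

r_all = pearson(counts)  # ~0.41 on the rounded data

# Drop every respondent on the "denial" side (CCD > 3): only two people.
r_no_deniers = pearson({k: c for k, c in counts.items() if k[0] <= 3})

# Drop everyone who gave a high response (> 3) on either item.
r_no_high = pearson({k: c for k, c in counts.items()
                     if k[0] <= 3 and k[1] <= 3})

print(round(r_all, 2), round(r_no_deniers, 2), round(r_no_high, 2))
```

Even with every "denier" removed, the remaining 133 respondents still yield a correlation near 0.38, which a standard significance test on a sample that size would flag as highly significant.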
Boiled down to its simplest form, this flaw in the methodology would allow us to draw conclusions about groups of people via "statistically significant" results even when we had absolutely no data for those groups of people. To demonstrate, imagine you had asked these same people these two questions on one of your surveys:
Do you like ice cream?
Do you believe global warming is real?
If you have respondents answer on a 5-point Likert scale, you will find a "statistically significant" correlation between liking ice cream and believing global warming is real. From this, you could conclude people who reject global warming don't like ice cream. That would be wrong. It would only happen because you didn't ask many people who reject global warming.
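This scenario is easy to simulate. In the Python sketch below every count is invented purely for illustration; the only constraint is that nobody answers below neutral (3) on either question, mimicking a sample with no skeptics and no ice-cream haters:

```python
# Hypothetical 5-point Likert responses: (ice_cream, warming) -> count.
# All counts are made up; note that no respondent scores below 3.
counts = {
    (3, 3): 20, (4, 4): 40, (5, 5): 40, (4, 5): 10, (5, 4): 10,
}

n = sum(counts.values())
mx = sum(x * c for (x, _), c in counts.items()) / n
my = sum(y * c for (_, y), c in counts.items()) / n
num = sum(c * (x - mx) * (y - my) for (x, y), c in counts.items())
ssx = sum(c * (x - mx) ** 2 for (x, _), c in counts.items())
ssy = sum(c * (y - my) ** 2 for (_, y), c in counts.items())
r = num / (ssx * ssy) ** 0.5

print(round(r, 2))  # 0.84: a strong "correlation", even though not one
                    # respondent rejected either proposition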
This is not particularly complicated. The most complex aspect of what I have discussed so far is understanding the relationship between the dot product of two vectors and the correlation coefficient for them.* You don't even need to understand that though. Even if you don't know why it works, it's still easy to see that it does. And even if you don't trust that, simply plotting the data demonstrates the same point.
It is regrettable Jylhä couldn't find time to examine this issue during the last few months. I imagine having this mistake pointed out publicly could be somewhat humiliating. Still, this fundamental error of extrapolating data for one group of people to draw conclusions about a different group of people is an important one many scientists in her field are making. It is also one which shows up time and time again in her dissertation. It cannot be ignored.
Future posts will discuss how this error influences other aspects of Jylhä's dissertation, often masked by more complicated statistical tests that were applied without checking whether the data could appropriately be examined with them.
*This was something I hadn't realized until a few years ago when Steve McIntyre, proprietor of the Climate Audit blog, demonstrated the approach with a different data set. It seems obvious in retrospect, but I had never thought about it before. I should also point out I've cannibalized the code he used for my own use. Why reinvent the wheel?