Please Help Me Find A Better Explanation

I do not like making accusations of dishonesty. I have made them plenty of times, but each time, I first put significant effort into trying to find an alternative explanation. Today's post is part of that effort. I have encountered data with properties I cannot explain, and I am hoping someone can find an explanation for me that isn't, "Someone fabricated data."

For those who don't know, I have been examining a PhD dissertation by one Kirsti Jylha due to its misuse of correlation calculations. I came upon this dissertation because it claims to demonstrate climate change deniers possess certain characteristics, using a methodology I have criticized others for using. I suggest you read a recent post of mine for some background information on why "statistically significant correlations" like those published by Jylha mean nothing. For a hugely simplified explanation:

There are a lot of ways of describing this approach. When I've previously said correlation is meaningless, I used an example in which I demonstrated a "statistically significant" correlation between belief in global warming and support for genocide. It was completely bogus. I was able to do it because I used the same approach the authors used. Namely:

1) Collect data for any group of people.
2) Determine the views that group holds.
3) Find a group which is "opposite" the group you studied.
4) Assume they must hold the opposite view of the group you studied on every issue.

This will work with literally any subject and any group of people. You can reach basically any conclusion you want because this approach doesn't require you to have any data for the group of people you're drawing conclusions about.
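
To make this concrete, here is a toy simulation of my own (the code and numbers are invented by me, not taken from any paper) showing how mirroring one group's answers manufactures a "statistically significant" correlation out of thin air:

    # Toy simulation: survey one group whose answers cluster at the high end
    # of a 1-5 scale, then fabricate an "opposite" group by mirroring them.
    set.seed(1)
    group_a <- data.frame(belief = runif(100, 4, 5),  # surveyed group
                          other  = runif(100, 4, 5))  # unrelated item
    group_b <- 6 - group_a                            # assumed "opposite" group
    combined <- rbind(group_a, group_b)
    cor.test(combined$belief, combined$other)         # "significant" by construction

Within group_a the two items are completely independent, yet the combined data produces an enormous, highly "significant" correlation.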

I'm not going to rehash any of that today. Nor am I going to harp on other points I've made before, such as the fact that Jylha and others performed a study to examine how they could change people's views on various issues, including global warming, without bothering to ask people their views on global warming prior to conducting the experiment. These issues are worth consideration, but I don't want to get bogged down.

Instead, I'd like to ask a simple question. Before we get to it, let's consider what the second paper discussed in Jylha's dissertation has to say about the demographics of its survey respondents:

The sample consisted of 221 participants (aged between 18 and 72 years, M = 28.45, SD = 10.78, 66% women) who were recruited by announces on a webpage, notice boards and face-to-face.

A previous paper discussed in the thesis had 60% females in one study and 68% females in another. I hadn't thought too much about it at the time, but this second paper seeks to find correlations involving variables like "empathy." It is well known that women tend to describe themselves as more empathetic than men do when responding to surveys.

I don't know if research has determined whether this is because of actual differences in how empathetic the two genders are. It might be due to things like social biases where men feel answering questions in certain ways would be "unmanly." It doesn't really matter. What matters is that the authors published their table of correlations without saying anything about what effect their skewed gender sample might have on the results. If women are more likely to show "empathy" in a survey and there are twice as many women as men, the dataset will be skewed to show more "empathy" than if the genders were evenly distributed. If there were similar gender biases in other variables, that could easily skew the results of a correlation table like this.
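
As a quick illustration of how such a composition effect works, here is a toy example of mine (the numbers are invented, not the paper's): two variables that are unrelated within each gender become strongly correlated once a 2:1 female sample is pooled.

    # Toy example: A and B are independent within each gender, but women
    # score higher on A and lower on B than men do.
    set.seed(42)
    women_sim <- data.frame(A = rnorm(146, mean = 4, sd = 0.5),
                            B = rnorm(146, mean = 2, sd = 0.5))
    men_sim   <- data.frame(A = rnorm(75, mean = 3, sd = 0.5),
                            B = rnorm(75, mean = 3, sd = 0.5))
    round(cor(rbind(women_sim, men_sim)), 2)  # pooled A and B now correlate strongly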

To test this possibility, I started by replicating the table: I loaded the data into R and computed a correlation table of my own.
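
A minimal sketch of that computation (the file name "study2.csv" is my own placeholder for the spreadsheet linked later in this post):

    # Load the averaged responses and compute pairwise correlations,
    # rounded to two decimals to match the published table.
    data <- read.csv("study2.csv")
    round(cor(data, use = "pairwise.complete.obs"), 2)

The result: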

                    ID cc_denial   SDO Nature_dominance System_just pol_orient Domineering Empathy Openness Anxiety_avoid   Sex
ID                1.00     -0.04 -0.07             0.13        0.01      -0.11       -0.08    0.09     0.00          0.04  0.01
cc_denial        -0.04      1.00  0.37             0.30        0.20       0.24        0.11   -0.16    -0.21          0.28  0.11
SDO              -0.07      0.37  1.00             0.22        0.26       0.25        0.42   -0.34    -0.29          0.18  0.25
Nature_dominance  0.13      0.30  0.22             1.00        0.25      -0.02        0.13   -0.17    -0.11          0.14  0.25
System_just       0.01      0.20  0.26             0.25        1.00       0.16        0.11   -0.03    -0.15          0.20  0.08
pol_orient       -0.11      0.24  0.25            -0.02        0.16       1.00        0.08   -0.05    -0.20          0.18  0.06
Domineering      -0.08      0.11  0.42             0.13        0.11       0.08        1.00   -0.24    -0.12          0.05  0.08
Empathy           0.09     -0.16 -0.34            -0.17       -0.03      -0.05       -0.24    1.00     0.18         -0.11 -0.29
Openness          0.00     -0.21 -0.29            -0.11       -0.15      -0.20       -0.12    0.18     1.00         -0.23 -0.06
Anxiety_avoid     0.04      0.28  0.18             0.14        0.20       0.18        0.05   -0.11    -0.23          1.00  0.10
Sex               0.01      0.11  0.25             0.25        0.08       0.06        0.08   -0.29    -0.06          0.10  1.00

It was good to see the results all match the authors' (for the variables they reported), because in a previous study I found the numbers published in one of the authors' tables were wrong. I don't know how the authors got those numbers, but it would be impossible to get those results with the tests they claimed to have used.

With the authors' results replicated, I felt comfortable moving on to examine the effect gender has on them. My table shows a "statistically significant" positive correlation between gender and "Nature dominance" as well as between gender and "Social dominance orientation." It also shows a similar negative correlation between gender and Empathy measurements. Finally, there is a positive correlation between gender and climate change denial that meets the 90% level the authors also use.
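
For reference, here is roughly where those significance thresholds fall, assuming the usual two-tailed test of a Pearson correlation (my assumption; the paper does not spell the test out):

    # Critical |r| for a two-tailed Pearson correlation test.
    crit_r <- function(n, level) {
      t <- qt(1 - (1 - level) / 2, df = n - 2)
      t / sqrt(n - 2 + t^2)
    }
    crit_r(221, 0.95)  # ~0.13 for the full sample
    crit_r(221, 0.90)  # ~0.11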

Given gender appears to show correlation with other variables, an obvious next step was to see what this correlation table would look like with one gender excluded.
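
A sketch of that subsetting (the coding of 1 for men and 2 for women in the Sex column is my assumption about the spreadsheet, not something stated in the paper):

    # Split the sample by gender and recompute the correlations. cor()
    # returns NA for the now-constant Sex column, hence the NA row and
    # column in the tables below.
    men   <- subset(data, Sex == 1)
    women <- subset(data, Sex == 2)
    round(cor(men, use = "pairwise.complete.obs"), 2)
    round(cor(women, use = "pairwise.complete.obs"), 2)

Here are the results for men: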

                    ID cc_denial   SDO Nature_dominance System_just pol_orient Domineering Empathy Openness Anxiety_avoid Sex
ID                1.00      0.16  0.12             0.23        0.14      -0.19        0.11   -0.07     0.04         -0.01  NA
cc_denial         0.16      1.00  0.39             0.25        0.35       0.35        0.34   -0.15    -0.36          0.49  NA
SDO               0.12      0.39  1.00             0.10        0.30       0.33        0.43   -0.24    -0.43          0.15  NA
Nature_dominance  0.23      0.25  0.10             1.00        0.19      -0.09        0.28   -0.19     0.11          0.05  NA
System_just       0.14      0.35  0.30             0.19        1.00       0.12        0.36   -0.20    -0.28          0.18  NA
pol_orient       -0.19      0.35  0.33            -0.09        0.12       1.00        0.10   -0.13    -0.30          0.28  NA
Domineering       0.11      0.34  0.43             0.28        0.36       0.10        1.00   -0.19    -0.13          0.22  NA
Empathy          -0.07     -0.15 -0.24            -0.19       -0.20      -0.13       -0.19    1.00     0.12         -0.11  NA
Openness          0.04     -0.36 -0.43             0.11       -0.28      -0.30       -0.13    0.12     1.00         -0.27  NA
Anxiety_avoid    -0.01      0.49  0.15             0.05        0.18       0.28        0.22   -0.11    -0.27          1.00  NA
Sex                 NA        NA    NA               NA          NA         NA          NA      NA       NA            NA   1

Here are the results for women:

                    ID cc_denial   SDO Nature_dominance System_just pol_orient Domineering Empathy Openness Anxiety_avoid Sex
ID                1.00     -0.17 -0.18             0.07       -0.05      -0.05       -0.18    0.19    -0.02          0.07  NA
cc_denial        -0.17      1.00  0.34             0.31        0.12       0.15       -0.01   -0.13    -0.11          0.14  NA
SDO              -0.18      0.34  1.00             0.21        0.23       0.19        0.41   -0.32    -0.20          0.16  NA
Nature_dominance  0.07      0.31  0.21             1.00        0.27       0.01        0.03   -0.05    -0.23          0.16  NA
System_just      -0.05      0.12  0.23             0.27        1.00       0.17        0.01    0.07    -0.08          0.20  NA
pol_orient       -0.05      0.15  0.19             0.01        0.17       1.00        0.06    0.02    -0.13          0.10  NA
Domineering      -0.18     -0.01  0.41             0.03        0.01       0.06        1.00   -0.26    -0.12         -0.05  NA
Empathy           0.19     -0.13 -0.32            -0.05        0.07       0.02       -0.26    1.00     0.21         -0.06  NA
Openness         -0.02     -0.11 -0.20            -0.23       -0.08      -0.13       -0.12    0.21     1.00         -0.20  NA
Anxiety_avoid     0.07      0.14  0.16             0.16        0.20       0.10       -0.05   -0.06    -0.20          1.00  NA
Sex                 NA        NA    NA               NA          NA         NA          NA      NA       NA            NA   1

There are a number of interesting, and arguably important, differences to be found when comparing these two tables. My plan for today was to discuss exactly that. Only, I noticed something troubling.

See the column and row labeled "ID"? That's just a number from 1 to 221 included in the data set so each survey respondent has their own ID number. After making a couple of these tables, I realized I should have filtered those values out, as there obviously shouldn't be any correlation between the order people take a survey and what their results are.

Emphasis on the word "shouldn't." You see, while there should be no correlation between respondent ID and responses, these tables show there is. In fact, there is a "statistically significant" correlation between respondent ID and responses to a number of questions.

For men, there were only 75 respondents, so the statistical power of any test would necessarily be limited. Perhaps because of this, only two variables show a "statistically significant correlation" with respondent ID at a 90% level, and none do so at a 95% level (though one comes in at 94.9%).

There is far more data for women (146 respondents). Perhaps because of this, there are "statistically significant correlations" between female respondent ID and the "Climate Change Denial," "Social Dominance Orientation," "Domineering" and "Empathy" traits.
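
Using the crit_r helper sketched earlier, the thresholds for the two subsamples come out to roughly:

    crit_r(75, 0.90)   # ~0.19 for the 75 men
    crit_r(75, 0.95)   # ~0.23
    crit_r(146, 0.95)  # ~0.16 for the 146 women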

That's two out of nine pairings for men which reach the 90% level and four out of nine pairings for women which reach the 95% level. Moreover, look at these correlation scores with respondent ID side by side:

                   Men Women
ID                1.00  1.00
cc_denial         0.16 -0.17
SDO               0.12 -0.18
Nature_dominance  0.23  0.07
System_just       0.14 -0.05
pol_orient       -0.19 -0.05
Domineering       0.11 -0.18
Empathy          -0.07  0.19
Openness          0.04 -0.02
Anxiety_avoid    -0.01  0.07
Sex                 NA    NA

They are largely opposite one another, which explains why there are no "statistically significant correlations" when data for the genders are combined.

How could this happen? From a mathematical perspective, the odds of this happening by chance are extreme. If we randomly assign each respondent a new number, we won't get results anything like these. Conversely, if we re-number men and women individually (so men are 1-75 and women are 1-146), the results remain the same.
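
To put a rough number on "extreme," here is a simple permutation check of my own (a sketch, not anything from the paper): shuffle the IDs many times and see how often a correlation as large as the observed one turns up by chance.

    # Empirical p-value for the ID vs. climate change denial correlation
    # among women, via 10,000 random re-orderings of the IDs.
    observed <- cor(women$ID, women$cc_denial)
    shuffled <- replicate(10000, cor(sample(women$ID), women$cc_denial))
    mean(abs(shuffled) >= abs(observed))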

So how did this happen? The results aren't sorted by any of the data columns. As far as I can tell, they're not sorted by anything. You're welcome to look for yourself to see if you can find a pattern. The data is available here (data for this paper is in the third tab). It only contains the averages for each set of questions as I wasn't given the raw data (even though I asked), but that shouldn't matter.

The order in which people take a survey should not have any effect on their results. I don't know if all 221 people took the survey one-by-one in a single room, if they all took it at the same time in a large room, if they all took it online at different times or what. It doesn't matter. There is absolutely no reason my taking a survey after a hundred other people have taken it should make me more likely to deny global warming. I shouldn't become more or less empathetic just because you've asked other people how empathetic they are first.

As a sanity check, I ran these same calculations on the first study of the first paper covered in this PhD dissertation. It doesn't have correlations like these. I also randomly re-ordered the respondent IDs; again, there were no correlations like these. I've done every test I can think of, and I simply cannot come up with a data set that has these sorts of correlations without manually altering the data.
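
For anyone who wants to repeat the re-ordering test, it is a one-liner on top of the earlier sketch:

    # Shuffle the respondent IDs and recompute the ID row of the table;
    # the correlations with ID vanish, as they should.
    data$ID <- sample(data$ID)
    round(cor(data, use = "pairwise.complete.obs")[1, ], 2)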

Am I missing something? Could there maybe have been some sort of data processing error? Can anyone offer any explanation why we would find "statistically significant correlations" between a respondent's ID number and their responses to a survey other than, "Someone has tampered with the data"?

I don't want to accuse anyone of fraud. I just cannot fathom why the 146th woman to take this survey should be expected to give different responses than the first woman to take it.

11 comments

  1. The three most striking correlations are climate change denial, SDO and domineering (whatever that means!). In these cases the correlations are basically opposite in men and women, as you note.

    I separated genders and wrote some simple code:
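    # each plot puts the response value on the x-axis and the respondent ID on the y-axis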
    plot(men$Climate_change_denial, men$ID)
    plot(women$Climate_change_denial, women$ID)
    plot(men$Social_dominance_orientation, men$ID)
    plot(women$Social_dominance_orientation, women$ID)
    plot(men$Domineering, men$ID)
    plot(women$Domineering, women$ID)

    Especially in the case of cc denial a pattern is clear: the lower right part of the plot is empty for men, whereas for women it's the upper part that's empty. So the 'denialist' responses happened in the early IDs for women, but in the late IDs for men. In the former case it's almost like there was a cutoff at ID 130 or so - there are virtually no denialist responses above that.
    https://www.dropbox.com/s/a9vy7ig0f61yjss/Rplot1.png?dl=0
    https://www.dropbox.com/s/qmcyfzclr7dpian/Rplot.png?dl=0

    Also: in table 1, how can a variable's correlation with itself be less than 1? e.g. natural dominance only has 0.77 correlation with variable 2.

  2. R student, thanks for the comment. Your results match what I see when I look at the data, though I like that you put the survey scale on the x-axis instead of the respondent ID. I hadn't thought about doing that. I think it looks cleaner that way. It shows just how different the distribution becomes as the respondent ID increases. I can't come up with any non-troubling explanation for it.

    As for Table 1, you have to be careful. If you look at the lines under the table, you'll see the diagonal isn't the correlation of a variable with itself, but rather the Cronbach's alpha for that variable. That's supposed to be a measure of how reliable the scale is.
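
    For anyone curious, Cronbach's alpha is computed from the individual item responses, which aren't in the shared spreadsheet (it only has per-scale averages), so it can't be recomputed here. With item-level data, a common way to get it in R is the psych package:

    # Sketch only: "empathy_items" is a hypothetical data frame holding
    # the individual empathy questions, which we don't have.
    library(psych)
    alpha(empathy_items)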

  3. The people were recruited in different ways. Isn't it likely that ID is related to how they were recruited? Could that explain why some results covary with ID...?

  4. A olkoot, the authors make no mention of the respondents of this survey being recruited in different ways. The people surveyed in a different paper discussed in the dissertation (the third one) were taken from two different groups, but the authors explicitly say so and indicate it in their data. For this paper, they describe the surveyed population as a single group. If it is true the people who took this survey were recruited in different ways, then the authors were obligated to say so and indicate which respondents were recruited in which way.

    I can't rule out the possibility that what you say is true, but if it is, then the authors did wrong. If they recruited people in different ways and those different recruitment processes led to different responses, then there is an important confounding factor the authors effectively hid by failing to provide essential information. That's not okay. It wouldn't be as bad as tampering with data, though.

  5. Szilard, that's funny. That sentence in the law is a bad one. If the Oxford comma reading was their intended meaning, the authors should have used one there (exception to the rule, and whatnot); if it wasn't, they should have rewritten the sentence. Because they did neither, tons of money and time are being wasted. That shows why pedantry can be a good thing!

    For the record though, I don't consider myself a partisan in that debate. I prefer not to use the Oxford comma myself, but I have no problem with people using it. All I care about is that people are consistent in their decision.

  6. Suppose that all subjects were surveyed at the same time and those that finished first (took the least time, needed the least amount of thinking, had strong preconceived opinions) received low ID numbers. Just ordering by completion time could introduce a bias.

  7. I should have said, "Just ordering by completion time could introduce a CORRELATION with some of the variables."

  8. Paul, that's an interesting idea, though it seems implausible to me. The same effect isn't seen in other surveys by the same research group even though they use many of the same question sets. That doesn't surprise me, as I'm pretty sure they didn't have 200+ people take the survey at the same time.

    I went ahead and plotted a bunch of the response results to investigate this. In some cases, the higher ID numbers tended to avoid middling values. This can be seen in the female responses to climate change questions as shown by R Student. In other cases, low ID values tended to avoid middling values. An example of this can be seen in R Student's plot for men's responses to climate change questions. I don't think any completion time effect is causing that.

    But plotting this data reminds me just how pointless posts like this might be. While it would be good to resolve this mystery, take a look at this quick, crude set of charts I made while examining this issue:

    http://www.hi-izuru.org/wp_blog/wp-content/uploads/2017/03/3_18_plots.png

    The female responses to three sets of questions are on top. The male responses to the same questions are on bottom. There are some differences between these, but that's not the point. The point is the white space. These responses are so heavily skewed they can't possibly be used to support the authors' conclusions. The authors have practically no data from people who deny climate change under their criteria. They have practically nobody who supports inequality (social dominance orientation). They have practically nobody who expresses a lack of empathy.

    If the authors don't have any meaningful amount of data for these groups, why are they claiming to be able to draw conclusions about these groups? Did they just not look at their data to see what groups of people they managed to survey? Or did they look at their data, see they had no data for various groups of people and thought, "That's okay, correlations tests are wondrous forms of magic that let us come up with results we have no data for"?

    I'd like to know why this data set has patterns it shouldn't have, but at times I find it difficult to get past the fact this work is complete garbage dependent entirely upon misusing and abusing tests of linear relationships on data which is heavily skewed (non-normal). I don't understand how entire fields of science can accept and even embrace this. Anyone with an understanding of what these correlation tests are should understand the tests are built upon assumptions of normality in the data which these data sets violate.

    Or in other words, it is completely inappropriate to use these tests on these data sets. Any results one might get from doing so will almost certainly be wrong.
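
    A quick way to check the normality point for yourself, using the column names from my tables above (a sketch; shapiro.test is in base R's stats package):

    # Test and visualize the (non-)normality of one of the scales;
    # a tiny p-value means the normality assumption is violated.
    shapiro.test(women$Empathy)
    hist(women$Empathy, xlab = "Empathy (scale average)")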
