# Statistical Woes - Impossible Results

I've owed you guys a post for a little while now, and I apologize for how long it's taken. I just can't get past a certain problem. As you may recall, I recently discussed how "correlation is meaningless" in relation to a paper which claimed to demonstrate climate change "deniers" possess certain characteristics. For a quick refresher:

The reason the authors can claim there is a "statistically significant" correlation between these two traits is they collected almost no data from anyone who "denies" climate change. The approach the authors have taken is to draw a line through their data, which is how you normally calculate the relationship between two variables, then extrapolate it out far beyond where their data extends.

There are a lot of ways of describing this approach. When I've previously said correlation is meaningless, I used an example in which I demonstrated a "statistically significant" correlation between belief in global warming and support for genocide. It was completely bogus. I was able to do it because I used the same approach the authors used. Namely:

1) Collect data for any group of people.
2) Determine views that group holds.
3) Find a group which is "opposite" the group you study.
4) Assume they must hold the opposite view of the group you studied on every issue.

This will work with literally any subject and any group of people. You can reach basically any conclusion you want because this approach doesn't require you to have any data for the group of people you're drawing conclusions about.
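To make this concrete, here's a minimal sketch with made-up numbers (not the paper's data): fit a line to survey responses clustered at one end of a 1–7 scale, then extrapolate it to the other end, where there are no observations at all.

```python
import numpy as np

# Hypothetical survey responses: every respondent scores 5-7 on belief in
# global warming, so the low end of the scale is essentially unsampled.
belief = np.array([5.0, 5.5, 6.0, 6.5, 7.0])
other_trait = np.array([3.0, 3.2, 3.5, 3.6, 3.9])

# Ordinary least-squares line through the observed range.
slope, intercept = np.polyfit(belief, other_trait, 1)

# Extrapolate to belief = 1 (a "denier"), far outside the sampled range.
predicted_at_1 = slope * 1 + intercept
print(slope, intercept, predicted_at_1)
```

The fitted line happily produces a "prediction" for people at belief = 1 even though nobody in the sample scored below 5, which is exactly the trick being described.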

Today I want to move beyond simple correlation coefficients and get into some of the more complex modeling the authors performed. There's a problem though. You see, the results the authors published are impossible to achieve.

If you've been following along, you may remember I've previously discussed my inability to replicate certain results these authors published. The authors report:

We then conducted a stepwise regression analysis entering climate change denial as the dependent and the ideology variables as independent variables. The results showed that SDO was the strongest predictor of denial (b = .46, p < .001, R2 = .28). Also, left–right political orientation made a significant contribution in predicting denial (b = .21, p = .007, R2 = .04). The effect of RWA was not significant (b = .09, p = .28). The model accounted for a total of 32% of the variance in denial.

But as I showed before, my results are:

```
Residuals:
     Min       1Q   Median       3Q      Max
-1.13218 -0.33581 -0.03661  0.33352  1.65996

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.54759    0.22601   2.423  0.01676 *
SDO          0.44488    0.09334   4.766 4.92e-06 ***
RWA          0.11835    0.10914   1.084  0.28021
L-R          0.10147    0.03668   2.766  0.00649 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5612 on 131 degrees of freedom
Multiple R-squared:  0.322,     Adjusted R-squared:  0.3065
F-statistic: 20.74 on 3 and 131 DF,  p-value: 4.671e-11
```

Which are reasonably close. The R2 value I get matches what the authors publish (.32), meaning my model explains the same amount of variance as the authors'. The p-value I get for the "social dominance orientation" parameter matches what the authors publish (p < .001). The p-value I get for "right wing authoritarianism" matches what the authors publish (p = .28). Even the p-value I get for "left-right political orientation" matches if you allow for improper rounding (p = .007 and p = .00649). However, as I said before:

It seems peculiar I could match the authors' statistical significance (p) values while not matching the calculated coefficients.

Today I can go beyond saying this "seems peculiar." I can safely say it is impossible for the authors to have gotten the model parameters they publish in their paper. One simple test we can use to prove this is to examine how much variance the model the authors publish would explain. To do so, we can plug the authors' parameters into the model:

`authors_model = b[,2] * .46 + b[,3] * .09 + b[,4] * .21 `

I used "b" as the name of the authors' data table. It has the columns: "Climate_change_denial," "Social_dominance_orientation," "Right_wing_authoritarianism," "Political_orientation," "Sex." As you can see, Column 2 is Social dominance orientation, which the authors say has a parameter value of .46. We multiply that column by its parameter value, do the same for the next two columns (the Sex column is excluded) and see how well they match up with Column 1, the climate change denial column.

While we're at it, let's also do the same for the model I came up with:

`actual_model = b[,2] * .445 + b[,3] * .118 + b[,4] * .102 `

To see how much variance in the climate change denial variable these models explain, we perform a simple correlation test to get the r value then multiply that by itself to get the r2 (or r squared) value. First, we'll try the authors' model:

```
> cor(authors_model, b[,1])^2
 0.3023379
```

This shows if we use the model parameters listed in the authors' paper, we can explain 30% of the variance in the data. The authors claim to explain 32%. That doesn't match up. However, if we use the correct parameters:

```
> cor(actual_model, b[,1])^2
 0.3219958
```

This shows the correct model, which I got when I tried to replicate the authors' results, explains 32% of the variance, just as the authors say. Combined with the fact the authors managed to get the correct p-values, this would seem to indicate they came up with the correct model and reported most of its results correctly. However, they somehow managed to report the model's parameter values incorrectly.
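There's also a general reason this check works: for any dataset, the ordinary least-squares coefficients are by definition the ones that maximize the variance explained, so any other set of coefficients, plugged in the same way, explains the same amount or less. Here is a sketch of that on synthetic data (the paper's dataset isn't reproduced here; the column roles and coefficient values are stand-ins):

```python
import numpy as np

# Synthetic stand-in for the paper's data table: three predictors and an
# outcome built from them plus noise.
rng = np.random.default_rng(42)
n = 135                                   # sample size matching the replication
X = rng.normal(size=(n, 3))               # SDO, RWA, L-R stand-ins
y = 0.45 * X[:, 0] + 0.12 * X[:, 1] + 0.10 * X[:, 2] \
    + rng.normal(scale=0.5, size=n)

# Fit OLS with an intercept, mirroring R's lm().
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted_combo = X @ coef[1:]

# Any other coefficient vector (like a set of misreported parameters) can
# only explain the same amount of variance or less than the fitted one.
other_combo = X @ np.array([0.46, 0.09, 0.21])
r2_fitted = np.corrcoef(fitted_combo, y)[0, 1] ** 2
r2_other = np.corrcoef(other_combo, y)[0, 1] ** 2
print(r2_fitted, r2_other)
```

Since the published .46/.09/.21 explains less variance (30%) than the model's stated 32%, it cannot be the coefficient vector the fitted model actually produced.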

Let's compare the correct values with the values the authors published:

```
        My value   Published
SDO      0.44488   .46
RWA      0.11835   .09
L-R      0.10147   .21
```

These aren't particularly large differences, but they are rather mysterious. I have no explanation for how they came about. What I do know is we can be certain the values the authors published are incorrect. If anyone doubts this, you can go beyond what I've done in this post and examine other diagnostic results.

For instance, we can use t-scores (along with the number of observations and parameters) to calculate p-values, or vice versa. The relationship between them remains constant no matter how the parameter value and/or standard error might change. If the authors' model and my model have the same p-values, they must have the same t-scores.

This is important because a t-score is simply the parameter value divided by its standard error. If right wing authoritarianism's parameter value is 0.11835 and its standard error is 0.10914, we can divide the two and get a t-score of 1.084. This corresponds to a p-value of 0.28, which both the authors and I got. Now, if the authors' published results were correct, they would have had to get a t-score of 1.084 with a parameter value of 0.09. Do the math, and that means they had to have gotten a standard error of about 0.083.
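That arithmetic is easy to verify:

```python
# A t-score is the parameter estimate divided by its standard error.
t_rwa = 0.11835 / 0.10914
print(round(t_rwa, 3))        # 1.084, matching the regression output

# For the published coefficient (.09) to yield the same t-score,
# its standard error would have to be:
implied_se = 0.09 / t_rwa
print(round(implied_se, 3))   # 0.083
```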

We could repeat this process for each parameter and each diagnostic result to generate a full diagnostic report based on what the authors published, then examine that table to see if it made any sense. It's more work than it's worth, though, so I'll just spoil the result for you and repeat what I said before: the authors' published results are impossible to achieve.
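For anyone who wants to run that check themselves, here is a sketch of one direction of it in Python (scipy assumed available; the 131 residual degrees of freedom come from the replicated model above): recover the t-score implied by each published two-sided p-value, then back out the standard error the published coefficient would require.

```python
from scipy import stats

df = 131  # residual degrees of freedom from the replicated model

# Published coefficients and p-values. SDO is omitted because the paper
# only reports p < .001 for it, not an exact value.
published = {"RWA": (0.09, 0.28), "L-R": (0.21, 0.007)}

for name, (b, p) in published.items():
    t_score = stats.t.ppf(1 - p / 2, df)  # two-sided p -> implied |t|
    implied_se = b / t_score              # SE the published b would need
    print(f"{name}: t = {t_score:.3f}, implied SE = {implied_se:.4f}")
```

Comparing the implied standard errors against the ones the regression actually produces is another way to see the published parameter values can't all be right at once.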

Now, I'm not harping on this strange error just to go, "Gotcha!" The reason I want to spend some time on it is the difference between the two models I discuss in this post has a meaningful impact on the next step of our analysis. One of the questions we've looked at is which parts of the data set cause the authors to get the results they get. As it turns out, the answer to that changes quite a bit if you naively use the model the authors provide rather than the correct one.

Anyway, now that we've established what the correct results are, we can get back to the real issues. Stay tuned for the next post.