Gergis et al, (Again) Failing to Do What They Claim

Readers familiar with the 2012 Gergis et al paper will likely remember the paper was withdrawn after it had been accepted for publication (but prior to actually being published) because it turns out the authors did not do what they claimed to have done. Specifically, they claimed to use detrended series for their screening to try to avoid the "screening fallacy" when in reality they hadn't detrended anything. Today I'd like to show the authors have once again failed to do what they claim to have done.

For some background on this issue, there's a good post up at Climate Audit about this paper, and I'm going to try not to rehash the points it covers. There are also two posts I've previously written on the subject, here and here. Being familiar with these posts should be helpful but not necessary as the central problem I want to discuss is really as simple as, "They didn't do what they claim to have done."

As I mentioned in my previous post, a central issue when creating temperature reconstructions is what data you use. Gergis et al tested multiple approaches to determining which data series to use, as outlined:

Here we describe the proxy screening approach that was used to develop the reconstructions. In section S1.1.1 (S1.1.2) in the supplemental material, we discuss the theoretical implications and strengths and weaknesses of using raw versus detrended data (field mean temperatures vs local temperature variations) for proxy screening. The sensitivity of proxy selection to these choices, and the subsequent effect on the results, are then quantitatively assessed in section S1.2 in the supplemental material.

The final approach the authors went with was to use non-detrended data compared to "local" temperatures. "Local" in this case means the temperatures of any 5°x5° grid cell within 500 kilometers. They portray this choice as irrelevant by saying no matter which choice they make, there are only minor differences. Of course, those "minor" differences are like:


The blue line in this image is what the authors got if they used detrended data and compared it to the temperature of the "field mean," or the entire area they were trying to reconstruct temperatures for. This is the approach Gergis et al had claimed to use in their 2012 paper. As you can see, it terminates just prior to 1600 AD. That is, if they used the approach they claimed to have used in 2012, they would only get a ~420 year reconstruction, not a 1000 year reconstruction.

That's a pretty big difference. It effectively demonstrates the authors couldn't produce the results they wanted with their intended methodology. Because of that, they decided to change their approach. They explain:

For predictor selection, both proxy climate and instrumental data were linearly detrended over 1931–90. As detailed in appendix A, only records that were significantly (p < 0.05) correlated with temperature variations in at least one grid cell within 500 km of the proxy’s location over the 1931–90 period were selected for further analysis

While one might criticize a post hoc change of methodology done solely to allow one to "get" useful results like this, the issue I find more interesting is the authors simply didn't do this. Like with their 2012 paper, they claimed to do one thing but actually did another. Table S1.3 of the Supplementary Material shows the results of the test they actually did. It's a large table so you'll want to click on it to enlarge it:


Calculating the correlation between two series is one of the easiest things to do. That's why I was surprised when I was unable to replicate any of these correlation scores. The first problem is while the authors archived their proxy series and the "field mean" temperature series, they did not archive the gridded temperature data they used. This is a bit of an issue as they used a temperature data set that's about three years old. It is impossible to know how much that data set has changed over the last few years. Because the authors didn't archive the data they used, it's impossible to determine if any differences in results are due simply to different versions of the modern instrumental temperature data set.

To avoid that problem, I decided to look only at Options #2-4. These options examine the correlation to the regional temperature (field mean), which the authors archived. Option #2 was the obvious place to start. It doesn't require detrending any data or accounting for autocorrelation (AR1). All it should take is comparing the proxy series to the instrumental series and getting a correlation score. Here's an excerpt of a table showing how my results match up to the authors':

		Gergis	Replicated
Mt Read		0.63	0.6
Law Dome d180	0.1	-0.01
Oroko		0.26	0.31
Palmyra		-0.4	-0.23
Talos		-0.25	-0.03

At first I wasn't troubled at my results not matching up. I figured there were still a few minor things to account for. For instance, I had used the data file the authors say did not have the proxy series lagged. That is, the authors explain:

To account for proxies with seasonal definitions other than the target SONDJF season (e.g., calendar year averages), the comparisons were performed using lags of -1, 0, and +1 years for each proxy.

I struggle to imagine how a proxy could reflect temperatures from the next year, but at first I just wanted to replicate their results. Whether or not people agree with one another about what should be done, everyone should be able to agree about what was done. To try to account for that, I reran the correlation tests with proxies adjusted for each lag. This is what I got:

		Gergis	1930-89	1931-90	1932-91
Mt Read		0.63	0.56	0.6	0.55
Law Dome d180	0.1	0.06	-0.01	-0.04
Oroko		0.26	0.29	0.31	0.16
Palmyra		-0.4	-0.34	-0.23	-0.22
Talos		-0.25	0.02	-0.03	-0.12
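For anyone who wants to try this kind of test themselves, here is a minimal sketch of it. It is not the authors' code. The dict-of-years layout, the function names and the synthetic series are all assumptions made for illustration, and the sign convention (a lag of +1 pairs temperature year t with proxy year t + 1) is my reading of the columns above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def lagged_correlations(proxy, temps, start, end, lags=(-1, 0, 1)):
    """Correlate temps[start..end] with the proxy shifted by each lag.

    proxy and temps are dicts mapping year -> value (hypothetical layout).
    A lag of +1 pairs temperature year t with proxy year t + 1.
    """
    years = range(start, end + 1)
    t = [temps[y] for y in years]
    return {lag: pearson([proxy[y + lag] for y in years], t) for lag in lags}

# Synthetic data only: the proxy is the temperature series delayed one year.
temps = {y: math.sin(y * 0.7) + 0.01 * (y - 1900) for y in range(1900, 2000)}
proxy = {y: temps[y - 1] for y in range(1901, 2000)}

result = lagged_correlations(proxy, temps, 1931, 1990)
best_lag = max(result, key=lambda lag: abs(result[lag]))
print(best_lag, round(result[best_lag], 2))
```

Because the synthetic proxy is just the temperature series shifted by a year, the lag of +1 wins here. With real proxies, picking the best of three lags gives each proxy three chances to pass.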

I got stuck at this point for some time because it seemed no matter what I did, the results just didn't match up. Eventually I decided to start from scratch and redo every step in case I had missed something. I reread the paper, re-downloaded the data and just couldn't find anything. After being frustrated for some time, I finally happened to notice something I had missed. Take a close look at the Gergis et al correlation results:


Because I couldn't get the correlation results to match up, I hadn't paid much attention to some of the other columns. I had looked at the "sel" column, as it indicates which proxies were and were not used (1 = used) for that particular reconstruction, but I hadn't thought to look at the df column. That column indicates the "degrees of freedom" in the data, which is how many values in the final test could vary.

I hadn't thought much about those values because I knew they were wrong. When you shift your proxies around for a +1 or -1 lag to find the optimal correlation, you obviously increase the number of things that can vary. Similarly, when you test against many "local" grid cells, you increase the number of things that can vary. The authors didn't account for this in their tests, and as a result, I knew the values they provided would be wrong (as would the p values, which change with the number of degrees of freedom).

Eventually though, the repeated number "68" caught my eye. If you're not accounting for any other factors, the degrees of freedom should equal your number of data points minus 2. With the 60 years of annual data between 1931-1990, you would have 58 degrees of freedom. When it finally sank in that the authors repeatedly wrote 68 degrees of freedom instead of 58, I had a thought. Here is what happens if we repeat the calculations I did above, using the 1921-1990 period instead of the 1931-1990 period:

		Gergis	1920-89	1921-90	1922-91
Mt Read		0.63	0.61	0.63	0.58
Law Dome d180	0.1	0.1	0.02	-0.06
Oroko		0.26	0.29	0.26	0.18
Palmyra		-0.4	-0.4	-0.27	-0.25
Talos		-0.25	-0.14	-0.18	-0.25
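The degrees-of-freedom arithmetic that gave the period away is easy to check. This sketch assumes the simple case described above (annual data, no adjustment for autocorrelation or lags); the function name is mine.

```python
def degrees_of_freedom(start_year, end_year):
    """Degrees of freedom for a simple correlation on annual data: n - 2."""
    n_years = end_year - start_year + 1  # both end years are included
    return n_years - 2

print(degrees_of_freedom(1931, 1990))  # 58, the stated screening period
print(degrees_of_freedom(1921, 1990))  # 68, the value printed in the table
```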

Taking the largest (absolute) value of the three tested periods (1920-1989, 1921-1990, 1922-1991) gave almost the exact results published by Gergis et al. In one case, Oroko Swamp, the authors explain:

If significant correlations were identified for more than one lag, the lag with the highest absolute correlation was used for the reconstruction (with the exceptions of the proxies Mount Read and Oroko Swamp, which are already calibrated temperature reconstructions; hence only lag 0 was selected).

Since they didn't lag Oroko Swamp, you have to take its lag-0 (1921-1990) result even if it isn't the highest. Accounting for that makes my results match up perfectly except for one proxy, Law Dome Accumulation (not d180), whose results I haven't been able to replicate no matter what I try.
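The selection rule can be written out in a few lines. This is a sketch using the five rows from the 1921-1990 table above, not the authors' code. Reading the three columns as lags -1, 0 and +1 is my interpretation, and the names `LAG0_ONLY` and `select` are made up for illustration.

```python
# Correlations from the 1921-1990 table above, read as (lag -1, lag 0, lag +1).
correlations = {
    "Mt Read": (0.61, 0.63, 0.58),
    "Law Dome d180": (0.10, 0.02, -0.06),
    "Oroko": (0.29, 0.26, 0.18),
    "Palmyra": (-0.40, -0.27, -0.25),
    "Talos": (-0.14, -0.18, -0.25),
}

# Proxies the paper says were only tested at lag 0.
LAG0_ONLY = {"Mt Read", "Oroko"}

def select(name, values):
    """Return the lag-0 value for exempt proxies, else the largest absolute value."""
    if name in LAG0_ONLY:
        return values[1]
    return max(values, key=abs)

selected = {name: select(name, values) for name, values in correlations.items()}
for name, r in selected.items():
    print(f"{name}: {r}")
```

For these five proxies the selected values are the same as the "Gergis" column of that table.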

Regardless of that one result I couldn't match, I was happy to make progress replicating. I then did the same thing with detrended data, again using the 1921-1990 period, and was again able to replicate their results. It turned out all I had to do to get their results was understand that when they wrote:

For predictor selection, both proxy climate and instrumental data were linearly detrended over 1931–90. As detailed in appendix A, only records that were significantly (p < 0.05) correlated with temperature variations in at least one grid cell within 500 km of the proxy’s location over the 1931–90 period were selected for further analysis

They didn't actually use the 1931-1990 period. Instead, they used the 1921-1990 period. Quantifying the full effect of this will be difficult given the authors didn't archive the gridded instrumental data they used, but it is definitely not inconsequential. In the first figure of this post, we saw what happened if the authors used the stated criteria from their 2012 paper on the 2016 data set. Only nine proxies passed that test. Here is how the correlation scores of those proxies change if we use the period the authors claim to have used (and accounting for optimal lags save for Mt Read):

		Gergis	Correct	Start Year
Mt Read		0.36	0.25	1000
Kauri		0.32	-0.09	1577
Fiji AB		-0.31	-0.07	1617
Rarotonga d180	-0.32	-0.13	1761
Fiji 1f d180	-0.27	-0.23	1781
Mentawi		0.34	0.17	1858
Bunaken		-0.37	-0.34	1863
Ningaloo		-0.3	-0.32	1878
Laing		-0.43	-0.42	1884

Half of these proxies would not have passed the correlation test the authors claim to have used. It was only because they used the 1921-1990 period instead of the 1931-1990 period that those proxies passed.

I've previously talked about how using this approach to screening, the one Gergis et al claimed to have used in 2012, makes it so you can't create a reconstruction that goes back beyond ~1577. This table shows that if you used that approach on the 1931-1990 period (even allowing for lags), you couldn't get a reconstruction that went back beyond ~1800.

Mind you, this is if we allow for the optimal lag for each proxy without adjusting our statistical significance tests to account for the extra options - a lag which can change based upon the form of screening we use. For instance, the proxies Mangawhero and Kauri have the same stated lag (+1) for the main result of the Gergis et al paper. When these proxies are used in detrended screening, these are the correlation scores:

		1920-89	1921-90	1922-91
Mangawhero	0.46	0.35	0.28
Kauri		0.22	0.27	0.53

Gergis et al list these two proxies as having detrended correlation scores of 0.46 and 0.53, the results you get if you test Mangawhero's correlation for 1920-1989 and Kauri's correlation for 1922-1991 (both tests against the instrumental period of 1921-1990). Those are the results Gergis et al used for these proxies in their detrended calculations even though for their non-detrended calculations, Gergis et al tested both proxies for the 1922-1991 period.
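For readers unfamiliar with why detrended screening matters, here is a minimal sketch of linear detrending followed by a correlation test. The series are synthetic and the function names are mine; nothing here uses the paper's data. Two series that share nothing but an upward trend correlate strongly until the trend is removed.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def detrend(values):
    """Subtract the least-squares straight line from an evenly spaced series."""
    n = len(values)
    xs = range(n)
    mx = (n - 1) / 2
    my = sum(values) / n
    slope = (sum((x - mx) * (v - my) for x, v in zip(xs, values))
             / sum((x - mx) ** 2 for x in xs))
    return [v - (my + slope * (x - mx)) for x, v in zip(xs, values)]

# Synthetic 60-year series: same trend, unrelated year-to-year wiggles.
a = [0.1 * i + math.sin(1.3 * i) for i in range(60)]
b = [0.1 * i + math.cos(2.1 * i) for i in range(60)]

raw_r = pearson(a, b)
detrended_r = pearson(detrend(a), detrend(b))
print(round(raw_r, 2), round(detrended_r, 2))
```

The raw correlation is high only because both series rise over time. After detrending, what is left is the year-to-year relationship, which in this made-up case is close to nothing.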

It's not clear to me how detrending a proxy should shift its temporal relation to temperature by two years, but that happens with this paper in many cases. It's also not clear to me just how much all these various issues would affect the results of this paper. With the authors not having archived the gridded instrumental data they used, it's difficult to discern the full effects of these issues.

It is not, however, difficult to see the authors claimed to screen their proxies on the 1931-1990 period but actually used the 1921-1990 period. It is also not difficult to see this has a material impact on what proxies pass at least some of their screening tests.

Additionally, it is easy to see this choice of period has an effect on other aspects of this paper. The authors wrote things like:

For calibration of the reconstruction statistical models, we use subsets of the instrumental data from 1931 onward because prior to this time data were generally only present over southeastern Australia. The data from 1900 to 1930 are used for a separate, independent verification of the temperature reconstructions.


The reconstructions were calibrated with the instrumental data over the 1931–90 period, and the 1900–30 period was used for independent early verification

This shows they meant to keep the 1900-1930 period separate from the 1931-1990 period. Using the 1921-1990 period instead of the 1931-1990 period for screening ruins this independence. By screening their proxies over the 1921-1990 period, they used a third of their "independent verification" period as part of creating their reconstructions. That means the "independent verification" period was not actually independent.
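The size of the overlap is simple to count. This sketch just treats each period as a set of calendar years; the variable names are mine.

```python
verification = set(range(1900, 1931))      # 1900-1930 inclusive
screening_stated = set(range(1931, 1991))  # 1931-1990, what the paper says
screening_used = set(range(1921, 1991))    # 1921-1990, what the table implies

overlap_stated = sorted(verification & screening_stated)
overlap_used = sorted(verification & screening_used)
fraction = len(overlap_used) / len(verification)

print(len(overlap_stated))  # 0 years shared under the stated period
print(len(overlap_used))    # 10 years shared (1921-1930)
print(round(fraction, 2))   # 0.32, roughly a third of the verification period
```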

There is some humor in all this. The 2012 version of this paper was withdrawn because the authors didn't do what they claimed to have done. Now, in this new version of the paper, it turns out the authors didn't do what they claimed to have done either. One would think after four years the authors could have figured out just what they did or did not do.

In any event, I'll detail more of the effect this has on the various screening tests the authors used in a future post. I don't know if a full replication/examination is possible without the authors archiving the gridded instrumental data they used, but I'll do as much as I can. I'm sure it will be a lot more than the reviewers of this paper did.


  1. Excellent reverse engineering!

    The multiplicity of correlations seems to be "trawling for significance". When the correlation scores differ so greatly between 1921-90-ish vs. 1931-90-ish (cf. your examples of Kauri, Fiji AB, Rarotonga d180, Mentawi), it seems a logical conclusion that the method has stumbled upon a spurious correlation, and/or these are not good proxies of temperature.

  2. Great job.
    The good news is that you will not need to inform Gergis and co of this. They will have independently discovered this issue several days ago.

  3. Thanks guys. I was going to make the same joke AndyL made in this post, but I hadn't explained the backstory to it, so I figured it'd be out of place.

    And yeah, if adding 10 years of data changes a proxy from having almost no correlation to a significant correlation, it's difficult to see how that correlation could be meaningful. That's especially true if you have to pick and choose lags to get those correlations. If you allow for a certain amount of fiddling, it's hardly remarkable you are able to find a "significant" correlation.

  4. Wow. I did not realize until now that this paper was such complete garbage. I'm starting to wonder if the reviewers even read the paper before recommending it for publication.

    Good grief.

  5. Carrick, according to Joelle Gergis, the paper "was reviewed by seven reviewers and two editors, underwent nine rounds of revisions, and was assessed a total of 21 times." I cannot begin to imagine what the reviewers might have said. I'm particularly curious about one reviewer who Gergis quotes:

    One reviewer even commented that we had done “a commendable, perhaps bordering on an insane, amount of work”.

    Things like this make me doubt the value of peer review.

  6. There's a HadCRU3 gridded data series that is still online. I've downloaded and extracted austral summer month data. To calculate a summer average, one also needs to decide how many months of summer data can be missing in order to calculate an average. It's hard to figure what they've done. I wonder if they've used an infilled dataset.

  7. One has to wonder if the seven reviewers and two editors, instead of just identifying flaws, made helpful suggestions to patch this thing together.

  8. Well, done, Brandon. Of all of the manifold errors in this paper, this is the most hilarious one. They've inadvertently data snooped a third of their "independent" test data.

    Now, having inadvertently snooped my own data more than once, I do feel some compassion ... but dang, in this one the pull of the schadenfreude towards the dark side is nearly overwhelming ...

    All the best,


    PS—Let me say that my style for testing tuned models is to split the dataset in half, early and late. Then I develop the tuned model on one half and test it on the other. Then I reverse the halves and repeat the test. Obviously, if the model is valid the parameters in the two test should be close to the same values ... not probative, but it will detect obvious problems.

  9. Great work.
    "a central when creating temperature reconstructions" ... a missing word, "decision" ?

  10. Stephen McIntyre:

    There's a HadCRU3 gridded data series that is still online. I've downloaded and extracted austral summer month data. To calculate a summer average, one also needs to decide how many months of summer data can be missing in order to calculate an average. It's hard to figure what they've done. I wonder if they've used an infilled dataset.

    The problem that worries me is the HadCRUTv3 record they use gets updated over time, so if you use a different version than they did, you could potentially get different results. That makes it difficult to tell what the problem is if you fail to replicate their results. Are you doing something different than them, or are you just using different data?

    It wouldn't be such an issue except we've already seen the authors inaccurately described what they've done in this paper. What if there are other inaccuracies we don't know about yet?


    Great work.
    "a central when creating temperature reconstructions" ... a missing word, "decision" ?

    Yup, thanks. I fixed it.

  11. Brandon:

    Things like this make me doubt the value of peer review.

    Well, "seven reviewers and two editors, underwent nine rounds of revisions" reads of a paper that was universally recognized to be a turd.

    It likely got published then through political pressure on the editor. Wonder if the editor changed at some point, creating a gateway to publication?

    I have seen papers get accepted where all of the reviewers universally panned the paper, but the editor accepted it in spite of that.

    It is for papers like this that I think the referee and editor comments need to be part of the publication record. "Aren't enough papers of this sort and we need to be encouraging to authors" trumped "completely wrong" in this case.

  12. Let me rephrase this so it makes a bit more sense: "Wonder if the editor changed at some point, creating a gateway to publication?" => "Wonder if the fact that the editor changed at some point created a gateway to publication?"

  13. I actually stopped doing peer review a decade ago because I kept getting asked to review papers where I didn't know the niche very well. From a reviewer's point of view, it is almost always a thankless job, unless there is something new to learn. Replication is often beyond the possible or feasible because of access to the simulation codes used and/or lack of knowledge of all the details of how to use them. It can be a complex task.

  14. Thanks jonrt. Unfortunately, there is quite a bit more work to be done before contacting anyone. One of the most obvious questions still to be answered is, "Did they use the 1921-1990 period for their 'local' screening?" It seems probable the answer is yes given they used the 1921-1990 period for the screening against field mean (regional) temperatures, but it hasn't been proven yet.

    It would be easy to tell what period they used if they had archived the gridded instrumental data they used. They didn't. As a result, we're stuck trying to replicate the results they got with a data set from four years ago. That makes it very difficult to tell if differences in results are due to not having a copy of the data the authors used or something else.

    Resolving that question and detailing exactly what effect this choice of screening has on the authors' proxy networks are things I'd like to spend more time on before contacting anyone. It's time-consuming though.

  15. Brandon,

    Your detective work is brilliant. My only regret is that your talent is diverted to Gergis rather than making a breakthrough in fusion energy or cancer (or maybe finding Hillary's missing/deleted emails). Gergis surely feels the same.

    Glad to see you back in full form.

    Best, Ron

  16. Guarded admiration. You realise, as above, others will be reading this.
    Whether you contact the journal or not the journal and Gergis et al will be aware.
    Can they do another correction, will they have to withdraw the paper?
    One feels they should if the error/subterfuge is proven real.

  17. For what it's worth, I have absolutely no doubt the authors of this paper are aware of what's been written about their paper. I have no idea what their reaction to it might be though. That they've chosen not to say anything about the error I've highlighted speaks poorly of them, I think.
