In Defense of a Hockey Stick

I'm a huge critic of Michael Mann's infamous hockey stick. I've written about it time and time again, going so far as to say Mann and his co-authors committed fraud while making it. I wrote two (relatively short) eBooks to give an introductory guide for people who wanted to know just what was wrong with it. I've also criticized a ton of other work said to "confirm" Mann's hockey stick, pointing out the systematic biases and flaws in the paleoclimate field used to reconstruct temperatures back 500-2,000 years in the past.

Another one of these papers was published recently. After I read it, I wrote:

While there are definitely things which deserve praiseabout this paper, especially the archiving of the data used, it is rather unfortunate the authors offered no explanation for how they actually choose which series to use. This post highlights the quote:

For N-TREND, rather than statistically screening all extant TR [tree ring] chronologies for a significant local temperature signal, we utilise mostly published TR temperature reconstructions (or chronologies used in published reconstructions) that start prior to 1750. This strategy explicitly incorporates the expert judgement the original authors used to derive the most robust reconstruction possible from the available data at that particular location.

But what it fails to notice is those "original authors" referred to obliquely in that quote are largely the co-authors of this paper. I haven't checked the entire list of data series used yet, but I counted something like 30 of them that were provided by the authors of the paper themselves. That means this quote is, to a not insignificant effect, saying "We picked series using a stratagey which explicitly incorporates our own, personal judgment, which is quite expert."

In fact, it appears a full third of the series are taken from a single paper, the PAGES 2K reconstruction for Asia. It'd be natural for people to wonder why those were all used while other data was not. My understanding is those 17 series were gridded results created by a reconstruction which used 229 tree ring series (that are not publicly archived), meaning it may be okay to use them all due to their gridded nature meaning they're spread out, but... some explanation of the decision seems warranted. If a concerted effort from a large group of people is going to be carried out, they would presumably have a plan and process they could share.

Incidentally, I feel kind of weird praising the authors of this paper for archiving their data. Because they're frequently the source of the series they used in this paper, they're also the ones responsible for data used to create some of these series not being archived 😛

While there are a somewhat embarrassing number of typos in that comment, I believe the point comes through quite clearly. After an author of the paper, Rob Wilson, responded to me I added this:

Unfortunately, that criteria which Wilson states demonstrates the exact problem I expressed with this paper. Because the co-authors of the paper are the ones who created nearly all the data sets used in the paper, the co-authors of this paper are ultimately able to pick and choose which data sets to use by simply choosing which data sets to publish.

This doesn't have to involve any nefarious intent to be a serious issue. There are tons of decisions that are involved in making these data sets. Which data gets included in them, and how that data is handled, involves a lot of subjective decisions. This is well-demonstrated by the fact there are data sets in this paper where one could point to six or more different versions that could potentially be used. The "latest PUBLISHED version" may happen to be the one an author of this paper thinks is best, but simply relying on his opinion that it is the best makes for a poor criteria. I'm sure the authors have reasons for their opinions, but that doesn't change the fact this paper's results depend largely upon the opinions of its authors.

The lack of objective selection criteria means any potential mistakes and biases by these people can have an unknowable effect on its results. In the most mundane manner, a person may examine a number of potential data sets but only take the time to publish ones which seem to give good results. That's understandable. However, it means (some of the) agreement between the resulting data sets may not accurately reflect the real-world signals, but rather, the fact people had similar expectations.

There was more to my comment, but that is the central part. As you can see, I do not have a high opinion of this paper. In fact, I would call it useless. When people can arbitrarily choose what data to use, based on whatever criteria they want, they can produce basically any results they might want. I don't think what they come up with tells us much of anything, other than perhaps what they expected to come up with.

But that's not what I want to talk about today. This post's title refers to a defense of a hockey stick, and that's what I want to write about. Well really, I'd rather not have to write about it, but because I saw some ridiculous and stupid criticisms of the paper over at Watts Up With That (WUWT) by a person who clearly didn't read the paper, I feel I am obliged to. I just want to be clear that while I think the WUWT piece is terrible and it is absurd the author wrote his post without reading the paper, me saying so doesn't mean I like this paper. I don't. I just dislike glory-hounds spreading misinformation even more. That sort of behavior helped Mann and his co-authors get away with fraud, and it can only help Wilson and his co-authors get away with publishing a terrible paper.

The WUWT post was written by one Willis Eschenbach, who I've written about on this site before. Oddly enough, I wrote about him because he had called authors of another paper dishonest even though he... hadn't read their paper either. It seems to be a matter of form with him. I'm not going to rehash past issues though. I want to focus on Eschenbach's new post, where he says:

At first I was stoked that they had included an Excel spreadsheet with the proxy data. Like they say in the 12-step programs, Hi, my name’s Willis, and I’m a data addict … anyhow, here’s a graph of all of the data, along with the annual average in red.

[Figure: 53 proxies, Wilson 2016]

But as always, the devil is in the details. I ran across a couple of surprises as I looked at the data.

First, I realized after looking at the data for a bit that all of the proxies had been “normalized”, that is to say, set to a mean of zero and a standard deviation of one. This is curious, because one of the selling points of their study is the following (emphasis mine):

This is an interesting thing for Eschenbach to "realize[] after looking at the data for a bit" given it was explicitly stated by the authors in their paper in the second sentence of their methodology description:

3. Reconstruction methodology

A similar iterative nesting method (Meko, 1997; Cook et al., 2002), as utilised in D’Arrigo et al. (2006) and Wilson et al. (2007), was used to develop the N-TREND2015 NH temperature reconstruction. This approach involves first normalising the TR data over a common period (1750 – 1950)...

Personally, I would read a paper before spending much time looking at its data. Or I would at least read enough of the paper to see what the data they're providing is. It seems silly to grab someone's data without even knowing what it is and try to draw conclusions from it. To each their own though. This isn't a big deal. That Eschenbach hadn't read the paper before looking at the data doesn't mean he never got around to reading the paper. I bring it up, however, because Eschenbach then goes on to say:

Like the song says, “Well, it was clear as mud but it covered the ground” … I was reminded of a valuable insight by Steve McIntyre, which was that at the end of the day all these different systems for combining proxies are simply setting weights for a weighted average. No matter how complex or simple they are, whether it’s principal components or 37 backwards nests and 17 forwards nests, all they can do is weight different points by different amounts. This is another such system.

In any case, that explained why they put the normalized data in their spreadsheet. This normalized data was what they used in creating their reconstruction.

I got my second surprise when I plotted up their reconstruction from the data given in their Excel worksheet. I looked at it and said “Dang, that looks like the red line in Figure 1”. So I plotted up the annual average of the 53 normalized proxies in black, and I overlaid it with a regression of their reconstruction in red. Figure 2 shows that result:

[Figure: average and iterative reconstruction, 53 proxies]

All I can say is, I hope they didn’t pay full retail price for their Nested Reconstruction Integratomasticator. Other than the final data point, their nested reconstructed integrated results are nearly identical to a simple average of the data.

That sounds like a rather serious problem until you realize the reason the iterative nesting method used by the authors is almost the same as simply averaging the data is... it's basically just a way of averaging the data. After they normalize their data, the next step is:

averaging the series to derive a mean series and iteratively removing the shorter series to allow the extension of the reconstruction back (as well as forward) in time.

So the authors begin by averaging the proxies over the period in which they all overlap. They then remove the shorter series and repeat the process, creating a separate average for each period based on the data available for that period. On its own, that would be identical to simply averaging all the data. The primary difference between their methodology and simple averaging arises from the fact:

Each nest is then scaled to have the same mean and variance as the most replicated nest (hereafter referred to as NEST 1) and the relevant time-series sections of each nest spliced together to derive the full-length reconstruction.

If you have four proxies with an average temperature of 5C and another proxy with an average temperature of 10C, the average between them would be 6C. Suppose, however, the warmer proxy only went back to 1500 AD while the others went back to 1000 AD. For the 1000-1500 AD period, the average temperature of the proxies would be 5C, not 6C. That shift doesn't reflect any change in past temperatures though. It just reflects a change in what data is available.

So rather than take a simple average of the data, the authors attempted to account for changes in the data like that by recentering each segment. Basically, if the average for one period is 6C but is 5C for another period, you might subtract out that 6C and 5C from both and just set them on an equal baseline of 0C.
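To make the arithmetic concrete, here is a minimal sketch in Python using the made-up numbers from the example above. The proxies are flat, hypothetical series rather than anything from the paper, and the recentering is the simple subtract-the-segment-mean version just described, not necessarily the authors' exact procedure.

```python
from statistics import mean

# Hypothetical, perfectly flat proxies: four at 5C covering 1000-2000 AD,
# one at 10C that only starts in 1500 AD. None of this is real data.
years = range(1000, 2001)
proxies = [{y: 5.0 for y in years} for _ in range(4)]
proxies.append({y: 10.0 for y in years if y >= 1500})

def simple_average(year):
    """Average whatever proxies have a value in the given year."""
    return mean(p[year] for p in proxies if year in p)

early = [simple_average(y) for y in years if y < 1500]
late = [simple_average(y) for y in years if y >= 1500]

# The simple average steps up in 1500 AD even though no proxy changed.
print(mean(early), mean(late))

# Recentering: subtract each segment's own mean so both share a 0C baseline.
early_recentered = [v - mean(early) for v in early]
late_recentered = [v - mean(late) for v in late]
print(mean(early_recentered), mean(late_recentered))
```

The step in the simple average comes entirely from the fifth proxy becoming available. Once each segment is recentered, the step disappears.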

The same thing is true for variance in the data. The more data you have, the smaller fluctuations in it will tend to be. That means periods with less data will tend to have greater fluctuations than ones with more data. The authors didn't want their results to change based solely on when data was available though, so they adjusted each period so their fluctuations would be roughly equal in size. I don't think that's actually the right way to address the issue, but still, it is quite simple.
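A small simulation shows the effect. This is only a sketch under loud assumptions: the "proxies" are independent random noise with unit standard deviation, the series counts (16 versus 2) are arbitrary, and `rescale` is a hypothetical helper that matches one segment's mean and standard deviation to another's, which is my reading of the scaling step rather than the authors' code.

```python
import random
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable
n_years = 2000

def noise_series():
    """A hypothetical proxy: independent noise with unit standard deviation."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_years)]

many = [noise_series() for _ in range(16)]  # a well-replicated period
few = many[:2]                              # a poorly replicated period

def average(series_list):
    """Year-by-year average across a list of equal-length series."""
    return [mean(vals) for vals in zip(*series_list)]

avg_many = average(many)
avg_few = average(few)

# Fewer series means larger swings in the average (roughly 1/sqrt(n)).
print(pstdev(avg_many), pstdev(avg_few))

def rescale(segment, reference):
    """Give `segment` the same mean and standard deviation as `reference`."""
    m, s = mean(segment), pstdev(segment)
    ref_m, ref_s = mean(reference), pstdev(reference)
    return [(v - m) / s * ref_s + ref_m for v in segment]

avg_few_scaled = rescale(avg_few, avg_many)
print(pstdev(avg_few_scaled))
```

After rescaling, the two-series average fluctuates by the same amount as the sixteen-series average. Whether forcing that equality is appropriate is a separate question, as noted above.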

Neither of these steps is anything particularly special or difficult to understand. Ideally, neither should have much impact on the results. The point of this methodology is to use something very simple, on the basis an average of the data should be enough to find the signals of interest. As such, it should come as no surprise the methodology's results are little different from those gotten by simply averaging the data. That is the entire point. The methodology does do a little bit more:

For each nest, separate average time series were first generated for 4 longitude quadrats (Fig. 1). These continental scale time series were then averaged (after again normalising to 1750 – 1950) to produce the final large-scale hemispheric mean to ensure it is not biased to data rich regions in any one continent. 37 backward nests and 17 forward nests were calculated to produce the full reconstruction length from 750 to 2011.

Which is intended to account for spatial sampling so that having more data in one area than in another doesn't bias the results, but again, this methodology is very similar to simply averaging the data because that's the point. The authors didn't want to use any fancy or complex methodologies because doing so raises concerns that it's the methodology which creates the results, not the data.
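As a rough illustration of why averaging by region first matters, consider a single year with invented values. The quadrant labels and numbers below are hypothetical, and the sketch leaves out the re-normalisation to 1750-1950 the authors apply to the regional series.

```python
from statistics import mean

# Hypothetical values for one year, grouped into four longitude quadrants.
# The labels and numbers are invented purely for illustration.
quadrants = {
    "Q1": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],  # data-rich region
    "Q2": [0.0],
    "Q3": [0.0],
    "Q4": [-1.0],
}

# Pooling every series lets the data-rich quadrant dominate.
pooled = mean(v for vals in quadrants.values() for v in vals)

# Averaging within each quadrant first gives each region equal weight.
regional = mean(mean(vals) for vals in quadrants.values())

print(pooled, regional)
```

The pooled average is pulled toward the quadrant with six series, while the regional average treats the four quadrants equally. It is still just an average, only with different weights.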

I don't know why Eschenbach failed to understand this. As his comparison of the two approaches shows, the differences are minor, but they are important for avoiding anomalous results like the one at the final point. With a simple average of the proxies, that point is abnormally high because the proxies which were available for it have a different baseline than the ones which weren't.
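Putting the quoted steps together, a bare-bones version of nested averaging might look like the sketch below. Everything here is an assumption on my part: the function names are hypothetical, every series is assumed to cover the whole common period, each nest is scaled using its statistics over that common period, and the quadrant averaging is left out. It is meant to show how little machinery is involved, not to reproduce the authors' code.

```python
import math
from statistics import mean, pstdev

def normalise(series, start, end):
    """Set a {year: value} series to mean 0, std 1 over start..end inclusive."""
    ref = [v for y, v in series.items() if start <= y <= end]
    m, s = mean(ref), pstdev(ref)
    return {y: (v - m) / s for y, v in series.items()}

def nested_mean(proxies, common_start, common_end):
    """Splice per-nest averages, each scaled to the most replicated nest."""
    proxies = [normalise(p, common_start, common_end) for p in proxies]
    all_years = sorted(set().union(*proxies))
    # A "nest" here is simply the set of series available in a given year.
    members = {y: frozenset(i for i, p in enumerate(proxies) if y in p)
               for y in all_years}
    nest1 = max(set(members.values()), key=len)  # most replicated nest
    common = range(common_start, common_end + 1)

    def nest_average(nest, year):
        return mean(proxies[i][year] for i in nest)

    ref = [nest_average(nest1, y) for y in common]
    ref_m, ref_s = mean(ref), pstdev(ref)

    result = {}
    for y in all_years:
        nest = members[y]
        # Statistics of this nest's average over the common period.
        cal = [nest_average(nest, yy) for yy in common]
        m, s = mean(cal), pstdev(cal)
        # Match the nest to NEST 1's mean and variance, then keep this year.
        result[y] = (nest_average(nest, y) - m) / s * ref_s + ref_m
    return result

# Three made-up series of different lengths sharing 1900-1909.
a = {y: math.sin(y / 3.0) + 5.0 for y in range(1880, 1910)}
b = {y: math.cos(y / 2.0) * 2.0 for y in range(1890, 1910)}
c = {y: math.sin(y / 5.0 + 1.0) - 3.0 for y in range(1900, 1910)}
recon = nested_mean([a, b, c], 1900, 1909)
print(len(recon))
```

In the years where every series is present, the output is exactly the simple average of the normalised series. The scaling only changes anything in the years where some series are missing.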

Still, Eschenbach failing to understand the methodology doesn't mean he didn't read the paper. Maybe he just didn't understand it. Only, there's one important detail I haven't mentioned. While Eschenbach says things like:

• Whatever their iterative nested method might be doing, it’s not doing a whole lot.

He's basing his conclusions on the data provided by the authors which was already normalized. The first step of the iterative nesting method is to normalize all the data over a common baseline. That means you cannot tell what the full effect of the methodology is by looking at the normalized data since part of the methodology is to normalize the data. If that first step hadn't been used, who knows what the results of averaging the data might have been? Certainly not Eschenbach.
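To see why that matters, here is a toy comparison of averaging raw series against averaging normalised ones. The two series are invented and deliberately sit on very different scales; the names and values are hypothetical, nothing is taken from the paper's underlying data, and the normalisation is done over the whole series rather than a 1750-1950 window.

```python
from statistics import mean, pstdev

# Two hypothetical series over the same eight years, on different scales.
# Neither is real data; the values are chosen only to show the effect.
ring_width = [100, 160, 100, 160, 100, 160, 100, 160]       # e.g. 0.01 mm
density_index = [0.8, 0.8, 1.2, 1.2, 0.8, 0.8, 1.2, 1.2]    # dimensionless

def normalise(series):
    """Rescale to mean 0 and standard deviation 1 over the whole series."""
    m, s = mean(series), pstdev(series)
    return [(v - m) / s for v in series]

def correlation(a, b):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    za, zb = normalise(a), normalise(b)
    return mean(x * y for x, y in zip(za, zb))

raw_avg = [mean(pair) for pair in zip(ring_width, density_index)]
norm_avg = [mean(pair)
            for pair in zip(normalise(ring_width), normalise(density_index))]

# The raw average is essentially the large-scale series on its own.
print(correlation(raw_avg, ring_width), correlation(raw_avg, density_index))
# The normalised average gives both series equal weight.
print(correlation(norm_avg, ring_width), correlation(norm_avg, density_index))
```

Averaging the raw values lets the series with the bigger numbers swamp the other one, while averaging after normalisation weights them equally. Starting from data which has already been normalised, you can only ever see the second result.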

Now normally, this might just be an embarrassing mistake. Eschenbach claimed the methodology doesn't do much while comparing the output of one step of the methodology to the output of another. That's silly, and it makes him look like he has no idea what he's talking about, but it's not a big deal in and of itself. It does become a big deal, however, when Eschenbach puts a great deal of focus on the normalization step in question. For instance, when he says:

So to summarize the whole process: for most of the data used, it started out as various kinds of proxies (ring width, wood density, “Blue Intensity”).

Then it was transformed using the “expert judgement of the original authors” into temperature estimates in degrees celsius.

Then it has been transformed again, this time using the expert judgement of the current authors, into standard deviations based on the mean and standard deviation of the period 1750-1950. Why this exact period? Presumably, expert judgement.

Finally, it will be re-transformed one last time, again using the expert judgement of the current authors, back into temperatures in degrees celsius

This strikes me as … well … a strangely circuitous route. I mean, if you start with proxy temperatures in degrees C and you are looking to calculate an average temperature in degrees C, why change it to something else in between?

He asks why the authors would even use this first, normalization step. Portraying the step as wrong, or even just strange, makes his mistake far more serious. The truth is Eschenbach doesn't actually have any idea what effect the authors' methodology has on their results because he's never actually examined what effect the first step of the methodology, this normalization step he derides, has. This gets worse when one realizes despite Eschenbach explicitly stating all the proxies used in this paper were given in units of temperature, going so far as to quote the authors:

This strategy explicitly incorporates the expert judgement the original authors used to derive the most robust [temperature] reconstruction possible from the available data at that particular location.

While inserting the word "temperature" into their text to make it more clear all these proxies were in units of temperature, the reality is... that's a fabrication on Eschenbach's part. Not all these proxies were in temperature units.

In fact, most of the series used in this paper were processed via a methodology known as Regional Curve Standardisation (RCS), as stated by the authors. Anyone who knows anything about RCS knows it doesn't produce a series in temperature units. It's difficult to say much more than that. There's really nothing else to it. Willis Eschenbach completely fabricated the claim this data was initially in temperature units, then he used this fabricated claim to say the authors did a weird thing of converting series from temperature units to normalized series back to temperature units.

In reality, RCS is a method for detrending tree ring series to try to account for the fact trees grow at different rates based on their age. Eschenbach cites Steve McIntyre, so as proof of this, one can look at this post where McIntyre discusses his replication of the RCS methodology, complete with code and figures. The figures McIntyre provides do not show series in temperature units. The code he provides does not output anything in temperature units. This entire idea of Eschenbach's is a fantasy. Or perhaps delusion would be the better word. Whatever you want to call it, it is clearly absurd for Eschenbach to ask:

But since that is the case, since they are depending on their own prior transformation of a record of, e.g., tree ring width in mm into an estimated temperature in degrees C, then why on earth would they convert it out of degrees C again, and then at the end of the day convert it back into degrees C? What is the gain in that?

And it is just obscene for Eschenbach to modify the quote he provides from the authors to falsely insert the claim they said the underlying series for the paper are temperature reconstructions. So while Eschenbach says:

In closing let me add that this post is far from an exhaustive analysis of difficulties facing the Wilson 2016 study. It does not touch any of the individual proxies or the problems that they might have. I hope Steve McIntyre takes on that one, he’s the undisputed king of understanding and explaining proxy minutiae. It also doesn’t address the lack of bright-line ex-ante proxy selection criteria. Nor does it discuss “data snooping”, the practice of (often unconsciously or unwittingly) selecting the proxies that will support your thesis. I can only cover so much in one post.

The reality is I'm not sure anything he wrote was right or remotely useful. He didn't understand the methodology used by the authors, causing him to take similarities between the output of one step of the methodology and the output of another step as proof the methodology doesn't do much. He didn't understand the paper's data, causing him to falsely claim it was all temperature reconstructions, a false claim he even attributed to the authors of the paper.

To be fair to Eschenbach, I have no way to know if he actually read the paper or not. Maybe I'm wrong to say he didn't. However, it is clear if Eschenbach did read the paper, he didn't try to understand it. He didn't try to understand the methodology described in the paper, and he didn't try to understand the description the paper gives of its data. Despite that, he wrote a post condemning the paper. Naturally, his post was riddled with inaccuracies and fabrications.

This paper has problems, and those problems deserve to be highlighted. However, people like Willis Eschenbach aren't doing that. They're being lazy and saying stupid things that have no basis in common sense, facts or... reality as a whole. I'm sure it'll get them attention and adoration, but other than that, posts like Eschenbach's won't accomplish anything. Or at least, not anything good. Many WUWT readers praised Eschenbach's post, but the fact he was able to bamboozle people into thinking he had any idea what he was talking about isn't a good thing.

People like Eschenbach should shut up until they do the work necessary to understand the things they talk about. As long as they don't, they're only going to help the people they criticize. False accusations only make it easier for a person to get away with doing things wrong. The greatest ally anyone could ask for is someone like Eschenbach as an enemy.


  1. Unfortunately Hans Erren, it is nowhere near as simple as you make out. Trees do, in a vague and general sense, grow better at higher CO2 levels, but that is no absolute statement of truth. Different trees will have different factors limiting their growth. Some trees may have CO2 levels as their primary factor in limiting growth, and for them, increasing CO2 levels will cause them to grow faster. Other trees, however, may find CO2 levels are high enough the effect is saturated. For them, increasing CO2 levels will not cause them to grow faster. This is particularly true for trees whose growth is most limited by temperature or precipitation levels, but it can be true for all sorts of other trees as well. For instance, there will be trees where increasing CO2 levels will not change how quickly they grow unless you also increase nitrogen levels in the soil.

    Because the problem is much more complex than simply saying, "More CO2 = More Tree Growth," tree ring series are generally not adjusted for CO2 fertilization. There are some exceptions, with the Mann 1999 paper having my favorite. In it, Mann and co-authors wound up adjusting their data for 1000-1400 AD to account for CO2 fertilization beginning around 1850. Why data for 1000-1400 AD would need to be adjusted for an effect which only began half a millennium later is an interesting topic for discussion, but I don't think delving into it would help address your question.

    To address your question, rather than try to adjust data for rising CO2 levels, what authors will normally try to do is pick out tree ring data from areas where rising CO2 levels wouldn't be expected to cause an increase in growth rates. They'll do this by doing things like looking for areas where it is so cold/warm the primary limitation on tree growth rate is temperature. The idea is because temperature is (expected to be) so important to the trees they examine, any change in CO2 levels would be expected to have, at most, a minuscule effect.

    How well they actually accomplish that is a topic I can't speak to. I'm not familiar enough with the biology of trees to know just when CO2 would be the primary factor limiting growth. I do know, however, that dendroclimatologists discuss the issue regularly and are well aware of it. People more interested in the issue than me could find plenty of material regarding it.

  2. By the way, I should point out data being processed with RCS doesn't mean the resulting chronology couldn't have then been converted to temperature. That RCS was used on many of these series doesn't inherently mean they are not given in temperature units since some other method could have been applied afterward.

    That detail doesn't change anything about the points I made in this post, but it's worth bringing up to clarify things in case any of the series processed with RCS were then converted to temperatures and archived that way. I don't know that any were, but I also haven't checked all of them.

  3. What did Wilson say? I responded at Bishop Hill's when you posted and I think at WUWT that perhaps the authors posting their own works is because they were added as authors in order to use their work.

  4. MikeN, you can see Rob Wilson's responses in the thread not far after mine. He didn't really address what I said, save to explain the series were chosen in the manner I depict. Namely, each series is made by someone (or someones) based on a variety of processing decisions that involve a significant amount of subjectiveness, and Wilson et al just used whichever ones they came up with. He had no response or rebuttal to the point I made, that this lack of anything resembling an objective selection criteria means it is impossible to know the results of the paper actually reflect real-world signals.

    As for the idea the authors were included so their data could be used, this paper is stated to be made by a consortium. One would expect a large number of authors using their own data in such a paper. That it is expected doesn't make it okay, much less good, though. The result of it is the entire paper rests on a series of arbitrary choices made by its authors, and nobody reading the paper could possibly know what effect those choices may have had. Even the authors can't know.
