How BEST Overestimates its Certainty, Part 1

I'm supposed to be working on the follow-up to the little eBook I published last month explaining the Hockey Stick Controversy (see here). My goal has always been to get the second part finished by the end of January. Unfortunately, I keep getting distracted. It had been bothering me I'd never gotten around to filing a complaint with the IPCC about the backchannel editing in its latest report so I spent some time writing and sending that complaint (see here).

I've also been bothered by people saying 2014 was the hottest year since that claim was based in part on the BEST temperature record, which has undergone a number of undisclosed/undocumented changes. I wrote a simple little post about that as well. Discussing that post led me to getting interested in more issues with the BEST temperature record, and now I'm thoroughly distracted. I have no choice but to take some time to discuss just how wrong the BEST approach is.

The last post I wrote about BEST highlighted the fact there have been a multitude of different versions of the BEST temperature record, none of which have been archived by BEST for comparison purposes. It showed the differences between versions can exceed the stated uncertainty in the BEST record, calling into question BEST's claims of precision. A previous post called into question the breakpoint calculations used by BEST, suggesting they artificially inflate BEST's calculated precision.

Today's post is going to focus on a more central issue. This is a graph I made two years ago:


It shows the uncertainty levels published by BEST for its temperature record beginning in 1900. I made the graph to highlight the step change in BEST's uncertainty levels. At about 1960, the uncertainty levels plummet, meaning BEST is claiming we became more than twice as certain of our temperature estimates practically overnight. Here is an updated version of the graph, with better formatting and using more recent BEST results:


As you can see, the problem still exists. In fact, the new version of the graph shows there is more to the problem. There is a clear seasonal cycle in the uncertainty levels prior to the step change. After the step change, there is almost no seasonal cycle. In fact, there is practically no autocorrelation at all. Here is a comparison of the autocorrelation of BEST's uncertainty from 1910-1960 to that of 1960-2010:


There is a seasonal cycle in the uncertainty's autocorrelation after 1960, but it is tiny in comparison to that present prior to 1960. This shows the dramatic changes in BEST's uncertainty levels at about 1960 aren't limited just to the amount of uncertainty. They also affect the nature of that uncertainty.
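For anyone who wants to run this sort of comparison themselves, here is a minimal sketch in Python. It does not use BEST's data. The two series are made-up stand-ins (one with a 12-month cycle, one without), and the amplitudes and noise levels are assumptions chosen only to make the contrast visible.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at a positive lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.dot(d[:-lag], d[lag:]) / np.dot(d, d))

rng = np.random.default_rng(0)
months = np.arange(600)  # 50 years of monthly values

# Hypothetical uncertainty series, NOT BEST's numbers.
# "seasonal" mimics a record with a strong annual cycle in its uncertainty.
seasonal = 0.10 + 0.03 * np.cos(2 * np.pi * months / 12) + rng.normal(0, 0.005, 600)
# "flat" mimics a record whose uncertainty is just noise around a constant.
flat = 0.05 + rng.normal(0, 0.005, 600)

acf_seasonal_12 = autocorr(seasonal, 12)  # one full cycle apart
acf_seasonal_6 = autocorr(seasonal, 6)    # half a cycle apart
acf_flat_12 = autocorr(flat, 12)
```

A series with a seasonal cycle gives a strongly positive value at a 12-month lag and a strongly negative one at 6 months, while the flat series gives values near zero at both. Swapping in the published monthly uncertainty values for the two periods would give the real comparison.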

The reason for this is quite simple. BEST's code sets two variables:

% Benchmark years used for determining the record alignment in the
% statistical analysis.
options.StatisticalUncertaintyBenchmarkMinDate = 1960;
options.StatisticalUncertaintyBenchmarkMaxDate = 2010;

These values determine what period is used to align, or calculate baselines for, series in the BEST uncertainty analysis (the second variable was set to 2000 in earlier versions). They are used in a function called alignGroup, which is described in its comments as follows:

% Align a group of time series to share the same mean over a reference
% interval.

f = ( times > options.StatisticalUncertaintyBenchmarkMinDate & ...
times < options.StatisticalUncertaintyBenchmarkMaxDate );

The first two lines are comments explaining the function aligns "time series to share the same mean over" a reference interval. The other two show the code which sets that interval to 1960-2010. In other words, the BEST code sets its baseline period for these calculations to 1960-2010, the exact same period in which we see changes in the nature of BEST's uncertainty levels. That's not a coincidence. BEST describes its method for calculating uncertainty, known as "jackknifing":


Basically, BEST removes 1/8th of its data eight times (with no overlap), repeating its analysis each time. It then takes the variance in the eight "independent" series generated as its uncertainty.* Doing so requires combining the eight series, but those series do not all have the same baseline values. To account for that, BEST aligns all eight series to a common period, 1960-2010.
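To make the procedure concrete, here is a rough sketch of that kind of eight-group jackknife in Python. It is not BEST's code. The station data are synthetic, the "analysis" is just a plain average, and the uncertainty is taken as the plain variance across the eight runs as described above. BEST's actual weighting and scaling may differ, and the alignment step is left out here.

```python
import numpy as np

rng = np.random.default_rng(1)
n_stations, n_months, n_groups = 80, 120, 8

# Hypothetical station data: a shared signal plus independent noise per station.
signal = np.linspace(0.0, 1.0, n_months)
stations = signal + rng.normal(0.0, 0.5, size=(n_stations, n_months))

# Assign each station to exactly one of eight non-overlapping groups.
group = rng.permutation(n_stations) % n_groups

# Drop one group at a time and redo the "analysis" (here just a plain mean
# of the remaining 7/8ths of the stations).
runs = np.array([stations[group != g].mean(axis=0) for g in range(n_groups)])

# Spread across the eight runs, month by month.
jackknife_var = runs.var(axis=0, ddof=1)
```

Each row of runs is one of the eight series built from 7/8ths of the data, and jackknife_var is the month-by-month variance across them.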

Readers may already be aware of why that is bad. The problem with this has been discussed on climate blogs before because it came up with the Marcott et al temperature reconstruction (see here). I'll demonstrate it with a simple example. I created five series with linear trends and white noise. This is what they look like:


Each series has 100 data points, going up at a rate of 0, 0.02, 0.04, 0.06 and 0.08 per "month." These will be fake temperature stations I can apply some basic tests to. The first test will be to simply align the five series. The following graph shows two ways to do so:


As you can see, how you align the series determines where the variance in the series comes through. The period you use to align the series will have less variance than any segments outside that period. There is no "right" period to use for alignment. What matters is that the period you choose will affect the visual impression you create.
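A test like this is easy to reproduce. The sketch below builds five series with the slopes given above and aligns them two ways: over the entire record, and over the most recent 20 points. The noise level, the random seed and the 20-point window are assumptions made for illustration, not the values behind the graphs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
t = np.arange(n)
slopes = [0.0, 0.02, 0.04, 0.06, 0.08]

# Five fake stations: linear trend plus white noise (noise level assumed).
series = np.array([s * t + rng.normal(0.0, 0.5, n) for s in slopes])

def align(data, start, stop):
    """Shift each row so its mean over columns start:stop is zero."""
    return data - data[:, start:stop].mean(axis=1, keepdims=True)

def spread(data, start, stop):
    """Across-series variance, averaged over columns start:stop."""
    return float(data[:, start:stop].var(axis=0, ddof=1).mean())

full = align(series, 0, n)         # baseline = entire record
recent = align(series, n - 20, n)  # baseline = last 20 points

for name, data in [("unaligned", series), ("full", full), ("recent", recent)]:
    print(name, spread(data, 0, 20), spread(data, 40, 60), spread(data, 80, 100))
```

The three numbers printed for each case are the spread among the series in the early, middle and late parts of the record. Whichever window is used as the baseline ends up with the smallest spread.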

That is only a matter of visual impressions. The choice of alignment matters much more once you start doing actual calculations. To demonstrate, let's consider what happens if you use BEST's jackknife approach with the three forms of alignment used in the previous graph. We'll start with no alignment:


The colored lines in this graph show what happens if you remove one series and average the rest (each color representing a different series removed). The black line shows what would happen if you just averaged all five series. The variance in these lines represents the "uncertainty" calculated via jackknifing.

The results are what you'd expect. As the series diverge, the uncertainty levels increase. That is not the case, however, when we perform the same calculations after aligning the series. The following graph shows the same calculations performed with the two alignments shown before:


Aligning the series over their entire record causes the uncertainty levels in the middle portion to be smaller than in the early portion even though the early portions of the five series were the most similar. Even stranger, aligning the series over their most recent portion causes the recent portion to have almost no uncertainty even though that is the portion where the series diverged the most.

The reason is that when you align series over a particular period, you force them to match in that period. Matching means there is little variance. According to the jackknife calculations, that means there is little uncertainty. That is why BEST's uncertainty levels have a step change at ~1960. By aligning the runs of its jackknife calculations to the 1960-2010 period, BEST artificially deflated the variance in the 1960-2010 period. This artificially reduced BEST's uncertainty levels in the modern period.
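The effect on the jackknife itself can be checked with a toy version of the five-series test. In this sketch, each run drops one series and averages the other four, and the "uncertainty" is the variance across the five runs. As before, the noise level, the seed and the 20-point baseline window are assumptions, and none of this is BEST's code.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
t = np.arange(n)
slopes = [0.0, 0.02, 0.04, 0.06, 0.08]
series = np.array([s * t + rng.normal(0.0, 0.5, n) for s in slopes])

def align(data, start, stop):
    """Shift each row so its mean over columns start:stop is zero."""
    return data - data[:, start:stop].mean(axis=1, keepdims=True)

def leave_one_out_var(data):
    """Variance, at each time step, across averages formed by dropping one series."""
    k = data.shape[0]
    runs = np.array([np.delete(data, i, axis=0).mean(axis=0) for i in range(k)])
    return runs.var(axis=0, ddof=1)

def window_mean(v, start, stop):
    """Mean of v over positions start:stop."""
    return float(v[start:stop].mean())

unaligned = leave_one_out_var(series)
aligned_recent = leave_one_out_var(align(series, n - 20, n))

print("unaligned      early/late:", window_mean(unaligned, 0, 20), window_mean(unaligned, 80, 100))
print("aligned recent early/late:", window_mean(aligned_recent, 0, 20), window_mean(aligned_recent, 80, 100))
```

Without alignment, the jackknife variance is largest at the end of the record, where the series diverge most. After aligning to the last 20 points, that same stretch shows the smallest variance and the early part shows the largest.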

This is a trivial mistake. It's remarkable BEST made it. It's even more remarkable since BEST's code for calculating uncertainties has this:

% Uncertainty in global average
stat_uncertainty.global_average = mean(baseline_unc);

This shows BEST is aware there is uncertainty in the baselines it uses (baseline_unc stands for "baseline uncertainty"). Despite this, when it calculates its uncertainty levels:

% Build complete uncertainty from the two halves...
sp = results.spatial_uncertainty;
st = results.statistical_uncertainty;
for m = 1:length(types)
sp_unc = sp.(['unc_' types{m}]);
st_unc = st.(['unc_' types{m}]);
unc = sqrt( st_unc.^2 + sp_unc.^2 );

results.(['uncertainty_' types{m}]) = unc;

It ignores that uncertainty. BEST takes its calculation of the spatial uncertainty (which I haven't discussed in this post), combines it with the statistical uncertainty generated via its jackknife process, and that's it. BEST simply discards a major source of uncertainty in its uncertainty calculations, and by doing so, it creates an artificial step change in its uncertainty levels which makes it appear we can be more certain of modern temperatures than the data actually lets us be.

*This claim of independence is incredibly misleading. BEST estimates breakpoints prior to running "the entire Berkeley Average machinery." It does so by examining every station in its data set, comparing each to the stations around it. This is effectively a form of homogenization (BEST even stores the code for it in a directory named Homogeniety).

That means BEST homogenizes its data prior to performing its jackknife calculations. Whatever series are removed in the jackknife calculations will still have influenced the homogeneity calculations, meaning they are not truly removed from the calculations as a whole.

It's trivially easy to show homogenizing a data set prior to performing jackknife calculations means those calculations cannot reflect the actual uncertainty in that data set. I'm not going to do so here simply because of how long the post has already gotten. Plus, I really would like to get to work on my eBook again at some point.

January 1st, 3:00 AM Edit: I intentionally got this post wrong as part of a trick. You can find an explanation of it here, but the short version is the baseline issue I highlight in this post is real but is not the cause of the step change I showed. I intentionally messed up this post to provoke BEST into acknowledging the problems I claimed exist with its uncertainty calculations (which have smaller effects than I portrayed).

I apologize to anyone who is bothered by this, but I want to stress the fact it worked. BEST has acknowledged I was right about this problem's existence, making it the first time BEST has ever disclosed the fact the problem exists.


  1. This is less a "mistake" than an undocumented assumption. BEST assumes that sometime after 1960 the accuracy of the world's weather stations reached a plateau of stable, "good-enough" certainty, while prior to that time many stations were inaccurate, or of undocumented and unverifiable accuracy, although generally improving. From the part you quote: "there exist regions of the world poorly sampled during interesting time period" -- there USED to be poorly sampled places, but by 1960 there were very few such.

    This may or may not be a valid assumption. It'd be nice to see the question explored.

  2. Pouncer, the issue you raise is a relevant one, but it is tied to spatial uncertainty, not statistical uncertainty. In the code:

    sp_unc = sp.(['unc_' types{m}]);
    st_unc = st.(['unc_' types{m}]);
    unc = sqrt( st_unc.^2 + sp_unc.^2 );

    Issues with how much of the globe is covered by temperature stations would go in the sp_unc variable. The issues I'm discussing would go in the st_unc variable.

    That said, I get that the improved coverage for the 1960-2010 period might make someone think that is the best choice of baseline period. The problem is that as soon as you choose a segment of your record to be your baseline period for comparing series (such as the eight series created in the BEST jackknife calculations), you reduce the variance of that period and inflate the variance outside that period. That distorts your uncertainty levels. Even worse, it distorts them in a way which fits your expectations (e.g. decreasing the uncertainty in the 1960-2010 period while increasing the uncertainty before it) so you're less likely to notice it.

    The bias this mistake causes may fit BEST's assumptions, but you can't just take any answer that fits your assumptions as proof of those assumptions. If BEST hadn't made a boneheaded mistake here, the uncertainty of modern times would be higher.
