Outstanding Issues With BEST

I've long believed one of the worst things a scientist can do is overstate the certainty of his results. Unfortunately, it happens all the time. Today I'm going to discuss one of my favorite examples, the Berkeley Earth Surface Temperature (BEST) record.

There's a lot of history to BEST which I won't go into, but basically, previous efforts to estimate the planet's temperature over the last ~200 years or so were widely criticized by global warming skeptics. BEST was created to address these criticisms. One of those criticisms was a lack of openness and transparency. BEST was supposed to address this criticism by being completely open and transparent. Instead, BEST has hidden a number of things it felt were inconvenient.

Perhaps the best example can be found in BEST's discussion of the temperatures of 2014. The media was quite interested in whether or not 2014 was the hottest year on record. Different groups said different things about this, and BEST jumped in by publishing a report. The key result can be seen in this table:

[Image: 3-30-BEST-Table]

Which shows the margin of uncertainty in the BEST results is too large to say which year was the hottest on record. BEST refers to this uncertainty, saying:

[Image: 3-30-BEST-Statement]

Which sounds fairly impressive. Only, it's incredibly misleading. BEST wants people to believe it knows the planet's temperature to the hundredth, or even thousandth, of a degree, based upon its uncertainty calculations which go to the thousandth of a degree. That's complete bunk, and BEST knows it. BEST knows those numbers are not true, but it promotes them anyway.

I'll give an example. Back when BEST released its preliminary results, blogger Jeff Id discovered a problem in their uncertainty calculations. To estimate uncertainty, BEST removes a portion of its data and reruns its calculations. It does this multiple times then looks at how different the results are. This variance is considered to be an estimate of "uncertainty." It's a fairly typical approach, but as Jeff Id noted:

In their case, the weights are calculated 8 times with 1/8th of the data removed. Equation 36 creates an upweighted version of the residual differences between the full reconstruction and the reduced data reconstruction. The reduced data reconstructions contain temperature stations which are re-weighted to produce the trends. Now, the authors claim that this variation in result represents the true uncertainty of the total method mean temperature, but I disagree. What this represents is the ability of the model to choose (upweight/downweight) the same stations in the absence of a small fraction of the data. The resampling methods will necessarily generate a very small CI from this but the truth is that their algorithm is generating the same mean values within the CI’s as presented in the paper. Think about that. They always get the same result within that CI so they are getting the same result inside a very narrow band. So are these methods a true representation in our confidence in the mean temp?

The problem is that equation 36 generates independent datasets by upweighting residuals from different runs containing the same data. Each run though changes the weight of the root data by reweighting individual temperature series. The central values (series most like the mean) of the reduced data runs are effectively upweighted more when data is removed while outliers experience the opposite effect. The central value is therefore non-normally and non-linearly preferred, invalidating the assumptions of the subsampling methods. More simply this weighting of the preferred values means that you really don’t have 1/8 less data which is THE central assumption of the Jackknife methods. Because of the weighting algorithm, they have functionally removed less than 1/8th. This is likely the primary reason why subsampling produced even tighter CI’s than Jackknife, as mentioned in the paper. This is a significant error in the methods paper which will require a rework of the CI calculation methods and a re-write of the CI portion of the methods paper.
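To make the approach concrete, here is a minimal sketch of that kind of resampling, using made-up data and a plain unweighted average standing in for BEST's actual weighting scheme (which is far more elaborate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 "stations", each reporting an anomaly series of 100 "years".
n_stations, n_years = 200, 100
true_signal = np.linspace(0.0, 1.0, n_years)          # gradual warming
data = true_signal + rng.normal(0.0, 0.5, (n_stations, n_years))

def reconstruct(series):
    """Stand-in for the full reconstruction: here just an unweighted mean
    across stations. BEST's actual method re-weights stations iteratively."""
    return series.mean(axis=0)

full = reconstruct(data)

# Jackknife-style subsampling: drop 1/8 of the stations at a time, redo the
# reconstruction, and treat the spread of the results as the uncertainty.
n_folds = 8
folds = np.array_split(rng.permutation(n_stations), n_folds)
reduced = np.array([
    reconstruct(np.delete(data, fold, axis=0)) for fold in folds
])

# Spread of the reduced reconstructions around the full one, per year.
spread = reduced.std(axis=0)
print("typical year-to-year spread:", spread.mean())
```

Jeff Id's objection, roughly, is that because BEST re-weights the remaining stations in each reduced run, the reduced reconstructions end up hugging the full one more tightly than they should, so the spread understates the real uncertainty.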

How significant is this problem? Nobody knows. Jeff Id tried to talk to people at BEST about the problem, but they were not very cooperative. This is a shame because the problem is obvious and is one they should have addressed from the start. Instead, years later, BEST still hasn't squarely addressed it. All they've done is include this in one of their papers:

[Image: 3-20-BEST-Note1]

The details of these tests have never been published. BEST has never publicly discussed them or invited people to verify them. The specific results of the tests haven't been posted to their site with an explanation of what was done. BEST does have an SVN repository (username installer, password temperature) which may have material for the tests, but if so, BEST has done nothing to draw attention to the fact.

But that's just one way uncertainty in BEST's results is understated. And to BEST's credit, they have acknowledged it to some extent. Similarly, they acknowledge their spatial uncertainty calculations are misguided, to some extent, in one of their papers:

[Image: 3-20-BEST-Note2]

Put simply, BEST assumes the spatial relations (correlation structure) observed for the planet in recent times have been constant for the last few hundred years. This assumption underlies BEST's estimates of spatial uncertainty for the past, when we don't have data for the entire planet. It's also stupid.

One of the key aspects of global warming is that the entire planet is not expected to warm at the same rate. The very phenomenon BEST is trying to measure contradicts an assumption it relies on for its uncertainty estimates. Even worse, the fact the assumption is wrong is obvious to anyone who has bothered to examine BEST's results. It is trivially easy to see the correlation structure in BEST's results changes over time. Either that proves BEST's assumption is wrong, or it shows BEST's results are very inaccurate.
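Checking whether the correlation structure is actually stable is not hard in principle. Here is a rough sketch of the kind of comparison I mean, using synthetic regional series rather than BEST's gridded output: compute the inter-regional correlations in an early window and a late window and see whether they agree.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for regional temperature anomaly series (rows = regions).
n_regions, n_years = 10, 160
data = rng.normal(size=(n_regions, n_years))
# Make the regions more correlated in the later half than in the earlier half,
# mimicking a correlation structure that changes over time.
shared_late = rng.normal(size=n_years // 2)
data[:, n_years // 2:] += shared_late

early = np.corrcoef(data[:, : n_years // 2])
late = np.corrcoef(data[:, n_years // 2:])

# Average off-diagonal correlation in each window.
mask = ~np.eye(n_regions, dtype=bool)
print("mean inter-region correlation, early window:", early[mask].mean().round(2))
print("mean inter-region correlation, late window: ", late[mask].mean().round(2))
```

Run that sort of comparison on BEST's own output and the early and late correlation structures visibly differ; assuming a single, fixed correlation structure for the whole record ignores exactly that.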

I've brought that point up with BEST team members multiple times. I've never gotten any sort of meaningful response. Nobody has even attempted to discuss or disclose how it affects BEST's uncertainty levels. I find that baffling. I don't understand how a group can be created on a platform of openness and transparency then turn around and ignore obvious problems it knows exist.

One could make an excuse in this case. After all, BEST does at least acknowledge the problem exists. It doesn't always do that. A couple months ago, I tricked BEST into admitting it knows its uncertainty levels are underestimated. You'll recall BEST estimates statistical uncertainty in its results by seeing how those results change when it removes portions of its data.

A problem with this is that when you compare multiple series to see how different they are, you have to choose a "baseline" on which to align them. BEST uses a 1960-2010 baseline. Aligning its test series over the 1960-2010 period forces those series to agree most closely in 1960-2010, reducing the variance for that period. This in turn increases the (relative) variance outside the 1960-2010 period.
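To see the effect, here is a small sketch with synthetic series (not BEST's actual statistical samples) showing what aligning an ensemble over a common baseline does to its spread:

```python
import numpy as np

rng = np.random.default_rng(2)

# An ensemble of 50 noisy series covering "years" 1850-2014.
years = np.arange(1850, 2015)
ensemble = rng.normal(0.0, 0.3, (50, years.size)).cumsum(axis=1) * 0.05

# Align every series to a 1960-2010 baseline, as in the statistical samples.
baseline = (years >= 1960) & (years <= 2010)
aligned = ensemble - ensemble[:, baseline].mean(axis=1, keepdims=True)

# Spread of the ensemble in each year, inside and outside the baseline window.
spread = aligned.std(axis=0)
print("mean spread inside 1960-2010: ", spread[baseline].mean().round(3))
print("mean spread outside 1960-2010:", spread[~baseline].mean().round(3))
```

Any choice of baseline does this. The spread you measure, and thus the uncertainty you report, depends on where you choose to pin the series together.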

This is a well-known problem. It's the sort of simple thing anyone with much statistical knowledge would be mindful of. I was surprised BEST's results would be subject to it. As such, I decided to write a post about it. Only, I didn't feel like getting blown off like pretty much everyone who has pointed out problems with BEST's work has been thus far. As such, I decided to play a simple trick. As I explained afterward:

You see, for the last couple years I’ve tried to draw attention to a number of problems with the BEST temperature record. It didn’t work. I couldn’t get anyone at BEST to care, much less to acknowledge the problems. I won’t rehash the whole history here. Suffice to say I’ve tried talking to BEST members Steven Mosher, Zeke Hausfather and Robert Rhode about some issues that are pretty much indisputable. Zeke was friendly but didn’t know enough to answer, Rhode didn’t respond and Mosher was… well, Mosher. I think it’d be best if I don’t say more.

The point is I’ve been unable to get meaningful responses to simple points for over two years now. It’s been said insanity is doing the same thing over and over expecting different results. That was on my mind when I discovered some new problems with the BEST temperature record. I considered writing those last two posts about BEST in an accurate way, but I knew I’d just get ignored if I did.

So I decided to play a trick. I knew BEST members would just ignore me if I pointed out these problems in a normal fashion. I also knew at least one BEST member, Mosher, would quickly respond to me if I made a bad argument he knew he could rebut. The solution was obvious. If I wanted a response to my valid points, all I needed was to make my valid points in a bad argument.

That's what I did in this post. I pointed out BEST's choice of baseline was arbitrary and introduced temporal biases in its uncertainty calculations, explaining in some detail how that happens. I then intentionally exaggerated the effect of the problem by making a stupid mistake. My hope was that, in correcting the stupid mistake, BEST would acknowledge the actual problem. It worked. BEST team members responded to say I was wrong, and part of what they said was:

2) In the statistical calculation, the choice of a 1960-2010 baseline was done in part for a similar reason, the incomplete coverage prior to the 1950s starts to conflate coverage uncertainties with statistical uncertainties, which would result in double counting if a longer baseline was chosen. The comments are correct though that the use of a baseline (any baseline) may artificially reduce the size of the variance over the baseline period and increase the variance elsewhere. In our estimation, this effect represents about a +/- 10% perturbation to the apparent statistical uncertainties on the global land average.

Unlike with the previous two issues affecting its uncertainty calculations, BEST had never warned anyone about this one. Prior to this point, BEST had never said a word about this choice of baseline. According to their response to me, they've looked at the issue, but for some reason they've just never disclosed it. In what world is it "open" or "transparent" to not tell people about problems you know exist in your results?

But there's more. Part of BEST's methodology for trying to determine the planet's temperature is something called the "scalpel." With it, BEST splits individual temperature station records when it believes there is a problem with the data. The idea is if something has happened to a temperature station, such as it being moved to a different location, there is no reason to adjust the station's data. Instead, you can just treat it as two separate stations.
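As a rough illustration of the concept (this is not BEST's code, and the 1975 station move is invented for the example), the scalpel simply splits a record at a known breakpoint rather than adjusting it:

```python
import numpy as np

# A single station record: years and temperature anomalies, with a documented
# station move in 1975 that introduces a step change.
years = np.arange(1950, 2001)
temps = 0.01 * (years - 1950) + np.where(years >= 1975, 0.5, 0.0)

def scalpel(years, temps, break_year):
    """Split one record into two at a breakpoint instead of adjusting it.
    Each piece is then treated as an independent station in the averaging."""
    before = years < break_year
    return (years[before], temps[before]), (years[~before], temps[~before])

(seg1_years, seg1), (seg2_years, seg2) = scalpel(years, temps, 1975)
print("segment 1:", seg1_years[0], "-", seg1_years[-1])
print("segment 2:", seg2_years[0], "-", seg2_years[-1])
```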

The scalpel makes sense. It's a form of what is known as homogenization, and I have no problem with the idea. What I do have a problem with, however, is the idea that there is no uncertainty in this process. That's effectively what BEST claims. You see, while BEST estimates uncertainty by rerunning calculations with portions of its data removed, it never reruns its homogenization calculations. That means BEST's uncertainty levels assume there is no uncertainty in its homogenization calculations.

That's obviously wrong. It's also something which should obviously be disclosed. It wasn't. BEST never bothered to warn anyone about that assumption. It was only after I pulled the trick I mention above that BEST even acknowledged it. In a response, BEST addressed the post I wrote about this issue by saying:

3) With regards to homogenization, the comments are only partially correct. The step that estimates the timing of breakpoints is presently run only once using the full data set. However, estimating the size of an apparent biasing event is a more general part of our averaging code and gets done separately for each statistical sample. Hence the effect of uncertainties in the magnitude, but not the timing, of homogeneity adjustments is included in the overall statistical uncertainties.

I get BEST says this issue is small. I get BEST says the last issue only causes a "+/- 10% perturbation" in its statistical uncertainties. What I don't get is why BEST feels that explains not disclosing these issues. BEST happily gives its uncertainty levels to the thousandth of a degree, calling them "remarkably small." How can it do that then turn around and say it's okay to not warn people those uncertainty levels are underestimated?

And why should anyone believe BEST when it says these issues don't matter? Left to its own devices, BEST wouldn't even admit these issues exist. Are we supposed to just take the word of people who hid a problem that the problem doesn't matter? It's not like BEST provided anything to support what it says. It didn't say what tests it performed, much less provide the code, data or results for such tests.

And so what if these issues don't matter? It's only by chance anyone found out about them. How many other issues are there BEST hasn't acknowledged? Even if these issues don't matter, BEST hid them from everyone. BEST hid them from everyone while publishing "remarkable" results the very problems it hid show to be inaccurate.


But it gets worse. The problems above all deal with BEST's uncertainty calculations. They don't necessarily affect its temperature estimates. It turns out other issues do. Most notably, it turns out BEST's "homogenization" process has a significant effect on its results. Here is a graph showing BEST's results with and without homogenization (blue without, red with full homogenization):

[Image: Best-Homogenization]

Some people have said that difference doesn't matter because it doesn't disprove global warming. That's silly. We're talking about a difference of something like 20%. BEST publishes uncertainty calculations to the thousandth of a degree. You can't do that then turn around and say a change in your results of 20% is irrelevant.

Even worse, there is no apparent basis for this change in BEST's results. One of the central issues I've had with BEST is that its homogenization process creates ridiculous results. I've written multiple posts about this (e.g. here and here), but it can be summed up with one remark:

I get almost everybody seems to agree BEST gets things right at the global scale, but couldn’t we all agree there’s a problem if BEST can’t come close to the right answer when looking at entire continents?

The point I've been making for over a year now is that BEST's results show very little difference from area to area. Entire continents seem to warm at the same rate, something no other group studying the planet's temperatures finds. I've long suggested this is due to BEST's homogenization process, and that was recently confirmed with this image:

[Image: 2-9-BEST-Homog-Maps]

The first map shows BEST's results if it doesn't do any homogenization. There is significant spatial variability. The second map shows what happens if BEST does homogenization only when it has documented reasons to do so. It has a bit less variability than the first map, but the differences are fairly minor.

The third map is where things get crazy. It shows what happens when BEST performs homogenization by estimating "empirical breakpoints." These breakpoints are where BEST believes a problem in a temperature station's data exists. Here is an example of what effect "correcting" for such problems can have (black = raw, red = adjusted):

[Image: 5-3_bp_ex-1]

I'm not kidding with that image. There are tons of temperature stations BEST feels need changes like that. Then there are stations where BEST believes 30+ adjustments are needed. Why? I don't know. As far as I've been able to tell, BEST's "empirical breakpoints" have nothing to do with any data problems. They're just arbitrary adjustments which make the data more homogeneous... and increase the total amount of warming found by ~20%.
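For context, empirical breakpoint detection generally works by comparing a station to some reference, such as an average of its neighbors, and flagging sustained offsets. Here is a toy sketch of that general idea; the synthetic data and the simple "largest shift" rule are my own illustrative choices, not BEST's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setup: one target station plus 20 "neighbors" sharing a regional signal.
n_years = 100
regional = rng.normal(0.0, 0.2, n_years).cumsum() * 0.1
neighbors = regional + rng.normal(0.0, 0.3, (20, n_years))
target = regional + rng.normal(0.0, 0.3, n_years)
target[60:] += 0.8          # an undocumented step change at "year" 60

# Difference series: target minus the average of its neighbors.
diff = target - neighbors.mean(axis=0)

# Flag a breakpoint where the mean of the difference series shifts the most.
window = 10
shifts = np.array([
    abs(diff[i:i + window].mean() - diff[i - window:i].mean())
    for i in range(window, n_years - window)
])
break_index = window + shifts.argmax()
print("largest apparent shift at year index:", break_index)
```

The question isn't whether that kind of detection can be useful in principle. It's why BEST's version of it produces changes like the one in the image above, sometimes 30+ of them per station, with no documented cause, and adds ~20% to the warming it finds.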

And BEST didn't bother to tell anyone about this effect. Two months ago, nobody knew BEST's "homogenization" increased the amount of warming it finds by so much. At least, nobody outside BEST knew. BEST has apparently known this for quite a while. It just didn't tell anyone. In its effort to be completely "open" and "transparent," BEST chose not to inform people that its decision to adjust for "empirical breakpoints" increases the amount of warming it finds by ~20%.

And 20% may be an underestimate. That ~20% figure is for BEST's results back to 1850. BEST's temperature record goes back to ~1750, and BEST hasn't bothered to show the effect of its "homogenization" back that far. Why? I don't know. BEST hasn't said. I guess this is just more of BEST's "openness" and "transparency."


I won't claim this list is exhaustive. There are other issues with BEST. There's even one where a BEST team member, Steven Mosher, says I am wrong, which I'd like to resolve. But this post is long enough already (nearly 3,000 words). The point should be clear.

BEST claims to be completely open and transparent. At the same time, it publishes uncertainty levels to the thousandth of a degree despite knowing those uncertainty levels are underestimated, in part due to effects it simply never told anyone about. One of those effects is BEST ignores uncertainty in its homogenization process, a process which increases the amount of warming it finds by something like 20%. As I've said before:

I think that’s something BEST should have highlighted, or at least disclosed. I think BEST should have been up-front about this and said, “About 20% of the warming we find is due to us adjusting our data.” I think it should have been made clear most of those adjustments are made without any documented reason for them.

But it wasn't. Because BEST is all about being "open" and "transparent."

8 comments

  1. I was taught that the best (heh) you could do was one decimal place smaller than the lowest accuracy of your measurements. Despite electronic devices in many places that can measure to hundredths of a degree, the old-style thermometer readings included in the calculations can only be accurate to half a degree. So aren't numbers at two decimal places at the edge of credibility? Or beyond it?

  2. Gary, I don't know if that rule is necessarily true. I think it's more of a rule of thumb. Given a sufficient number of data points, I'd imagine you could violate it. For instance, if you have a million data points that are either 0 or 1, you can probably get better precision than a tenth of a degree.

    That said, that rule of thumb seems like it might well be accurate in this case. There are so many confounding factors in creating a temperature record I don't see how anyone could hope to estimate uncertainty to a hundredth of a degree. We don't even have a precise definition of what groups like BEST are trying to measure. There is no literal temperature for the planet (or region). It's just some artificial index that's been created to attempt to measure far more complicated things. That's okay, and it does have its uses, but if there's no precise definition of what is being measured, how can the results be that precise?
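    A quick back-of-the-envelope sketch of what I mean, with purely illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(4)

# A million readings that are each either 0 or 1 (think of an instrument that
# can only resolve whole degrees, sampling a true value of 0.3).
readings = (rng.random(1_000_000) < 0.3).astype(float)

estimate = readings.mean()
standard_error = readings.std() / np.sqrt(readings.size)
print("estimate:", round(estimate, 4), "+/-", round(standard_error, 4))
```

    The standard error comes out around 0.0005, far finer than the whole-degree resolution of the individual readings, at least as far as precision goes.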

  3. "For instance, if you have a million data points that are either 0 or 1, you can probably get better precision than a tenth of a degree."

    I think you should differentiate between accuracy and precision.

  4. I concur with Russ, more or less. It is a common thing in electronics measurements. 10 measurements by the same instrument increases precision but not accuracy; the instrument could still be off by 2 percent (for instance). It is a problem of quantization error when going from analog (measurements) to digital. The same phenomenon exists in digitized audio, music for instance, so a 1/2 step jitter or noise is added to the least significant bit so you don't have this sudden drop-off quantum-like effect. It is also known as dithering.

    Anyway, if you make ten measurements with ten different instruments, then you increase both precision and accuracy, provided one assumes random variation in calibration. Consequently, ten thousand measuring stations ought to be averageable to 0.001 degree precision and maybe accuracy, unless of course all thermometers are calibrated to the same standard.

    By itself it won't mean anything but until someone changes something I suppose you could get some sort of trend out of it.

    "One way to find the random error in a particular measurement is to measure the same quantity many times in the same manner. The higher the number of trials, the closer the average gets to the true value (assuming no systematic errors and a normal distribution of errors about this true value). "
    http://www.physics.uc.edu/~bortner/labs/Physics%201%20experiments/Measurement%20and%20Uncertainty/Measurment%20and%20Uncertainty%20web.htm

  5. Brandon, I am becoming a fan. Your writing is clear and to the point, a refreshing break from a seeming contest to use the most abstruse means possible to make the obvious impenetrably technical. Because of this I hesitate to remark that perhaps you meant to say BEST's uncertainty margin is too high to make a claim about which rank order the years on the list fall in. You wrote "Which shows the margin of uncertainty in the BEST results is too small..."

    The USA has the highest density of documented 20th-21st century weather reporting, so it should be the most accurate. The only reason one might think BEST did better with the world data is because they must not have manipulated it from the already published values.

  6. Russ R, a few weeks ago I had a fairly lengthy exchange because of doing that. I clearly distinguished between the two, but a person came along using a different definition of accuracy (which was also an acceptable definition) and complained. It was a huge mess. Did you really have to remind me of it? >.<

    Michael 2, I agree with everything you say. Getting that sort of increase in both precision and accuracy isn't likely. It just isn't impossible. That's why I'd say what Gary described is a good rule of thumb, but we can't take it as an absolute truth. I'd wager there are times when it is not only possible, but actually used to solve some problem.

    Ron Graf, glad to hear it! When I first started writing about things in the global warming debate, I decided my main goal would be to make things more accessible. There was a lot of material out there I knew most people wouldn't understand if they stumbled upon it. It seemed "interpreting" that material for the average person would be more useful than trying to create new material. I've obviously branched out, but I've tried to keep the same goal in mind. It's good to hear it's working.

    And yeah, you're right. I said "small" when I should have said "large." I wrote so much about BEST's published uncertainty levels being too small the wrong word came out in this case. I'll get that fixed.
