How BEST Overestimates its Certainty, Part 2

My last post had this footnote:

*This claim of independence is incredibly misleading. BEST estimates breakpoints prior to running “the entire Berkeley Average machinery.” It does so by examining every station in its data set, comparing each to the stations around it. This is effectively a form of homogenization (BEST even stores the code for it in a directory named Homogeniety).

That means BEST homogenizes its data prior to performing its jackknife calculations. Whatever series are removed in the jackknife calculations will still have influenced the homogeneity calculations, meaning they are not truly removed from the calculations as a whole.

It’s trivially easy to show homogenizing a data set prior to performing jackknife calculations means those calculations cannot reflect the actual uncertainty in that data set. I’m not going to do so here simply because of how long the post has already gotten. Plus, I really would like to get to work on my eBook again at some point.

It occurs to me I ought to demonstrate this is true rather than just claim it. I tried to show what effect this has on BEST's results by fixing BEST's mistake and rerunning the analysis, but I couldn't because my laptop doesn't have enough memory to handle all the processing. As such, I'll just provide a couple of excerpts from BEST's code to help show what is done.
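
Since I can't rerun BEST's analysis, here is a toy sketch of the principle itself. This is my own MATLAB, not anything from BEST's code. The "homogenization" in it is just a crude stand-in: every station is nudged toward the full-network mean, the way neighbor-based adjustments pull stations toward the consensus of a network that includes the very stations the jackknife later drops. The jackknife formula is the textbook grouped (remove one group at a time) version, not BEST's weighted one.

% Toy illustration (mine, not BEST's code) of homogenizing before jackknifing.
rng(1);
n = 800;                                  % pretend stations
x = 15 + 3*randn(n, 1);                   % raw station values around a common mean

alpha = 0.8;                              % strength of the crude "homogenization"
xh = x - alpha*(x - mean(x));             % nudge every station toward the full-network mean

g = 8;                                    % eightfold, remove-1/8 jackknife
groups = mod(0:n-1, g)' + 1;              % assign each station to one of 8 groups

datasets = {x, xh};
se = zeros(1, 2);
for d = 1:2
    v = datasets{d};
    theta = zeros(g, 1);
    for k = 1:g
        theta(k) = mean(v(groups ~= k));  % estimate with one eighth of the stations removed
    end
    se(d) = sqrt((g - 1)/g * sum((theta - mean(theta)).^2));
end

fprintf('Jackknife SE on raw data:         %.3f\n', se(1));
fprintf('Jackknife SE on homogenized data: %.3f\n', se(2));
fprintf('Full-network estimate either way: %.3f vs %.3f\n', mean(x), mean(xh));

The full-network estimate is identical whether or not you homogenize, but the jackknife run on the homogenized values reports a standard error several times smaller. The adjustment already used every station, including the ones each subsample pretends to remove, so the spread the jackknife sees is artificially small.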

The main entry point for BEST's code is a file named BerkeleyAverage.m. The bulk of the work is done by a file it calls, BerkeleyAverageCore.m. Here is the code for that call:

% The following code executes the "scalpel" method of breaking records into
% multiple pieces at metadata indicators and empirical breaks.
[se, sites, start_pos, break_flags, back_map] = scalpelData( se, sites, options );

% This is where the real heavy lifting is actually done
if nargout > 1
    [results, adj_flags] = BerkeleyAverageCore( se, sites, options );
else
    results = BerkeleyAverageCore( se, sites, options );
end    

As you can see, prior to calling the main file, BEST calls the scalpelData.m file. This file holds the code which determines where BEST believes breakpoints exist. It is only once those breakpoints have been determined that BEST passes the results to the main processing file.

The scalpelData.m file is never called again. When BEST performs its jackknife calculations, it does so with this call:

        % The real effort for statistical uncertainty
        results.statistical_uncertainty = computeStatisticalUncertainty( se, sites, options, results );

The parameters passed to computeStatisticalUncertainty are the same ones that were passed to BerkeleyAverageCore.m, plus the results that call generated. computeStatisticalUncertainty.m doesn't call scalpelData. It just goes through some loops to see what happens if you remove 1/8th of the data used in the main calculations and repeat those calculations.
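
To put that in outline form, here is roughly what the jackknife stage amounts to as I read it. This is a paraphrase, not the actual contents of computeStatisticalUncertainty.m, and dropOneEighth is a placeholder name I made up for however the subsampling is actually implemented:

% Paraphrase only -- not the actual code of computeStatisticalUncertainty.m.
% Note the loop works entirely on the se/sites that scalpelData already cut up.
n_groups = 8;
jackknife_results = cell(n_groups, 1);
for k = 1:n_groups
    % dropOneEighth is a hypothetical helper removing the k-th eighth of the stations
    [se_k, sites_k] = dropOneEighth( se, sites, k, n_groups );
    % Repeat the main calculation on what remains; the breakpoints found from
    % the full network are already baked into se_k and sites_k.
    jackknife_results{k} = BerkeleyAverageCore( se_k, sites_k, options );
end
% The spread among jackknife_results{1..8} becomes the reported statistical uncertainty.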

BEST does not recalculate its breakpoints when doing its jackknife calculations. It does not examine how much variance there is in the underlying data set; it only looks at how much variance there is within its homogenized data set. This causes BEST to underestimate the actual uncertainty in its temperature record.
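
For contrast, here is a minimal sketch of what it would take for the jackknife to see breakpoint uncertainty: re-run the scalpel on each subsample of the raw, pre-scalpel records before repeating the average. This is hypothetical, not something in BEST's code; se_raw, sites_raw and dropOneEighth are stand-in names of mine, and this is the rerun my laptop couldn't handle.

% Hypothetical fix, not BEST's code: re-scalpel each subsample of the
% unhomogenized records so the breakpoint step sits inside the jackknife.
for k = 1:n_groups
    [se_k, sites_k] = dropOneEighth( se_raw, sites_raw, k, n_groups );  % stand-in names
    [se_k, sites_k] = scalpelData( se_k, sites_k, options );            % breakpoints per subsample
    jackknife_results{k} = BerkeleyAverageCore( se_k, sites_k, options );
end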

It is incredibly misleading to take the amount of variance in a homogenized data set as the amount of variance in the unhomogenized data set. It's baffling BEST does so.

11 comments

  1. https://noconsensus.wordpress.com/2011/10/30/overconfidence-error-in-best/
    https://noconsensus.wordpress.com/2011/11/01/more-best-confidence-interval-discussion/
    https://noconsensus.wordpress.com/2011/11/20/problems-with-berkeley-weighted-jackknife-method/

    I have not reviewed the BEST code to see how it works on individual passes, but it seems to me that the jackknife method won't be accurate with re-weighting of data per the BEST homogenization process. From the papers, it reads as though they remove 1/8, homogenize, etc. for the jackknife process. It basically makes the result of the CI in the procedure an unexplored function of the distribution. Re-weighting through re-homogenization means that you aren't truly removing 1/8 of the information. It could be greater or less than 1/8 each time.

    Hopefully that makes sense because I have made no progress on it with the BEST group. I like the series better than the others still, I just don't trust the CI calculation.

  2. Jeff Id, what you say makes sense. They do their jackknifing on a homogenized data set (which my trick got them to admit) rather than the underlying data set, violating the assumption the subsamples in the jackknife are independent. They also reweight the series between jackknife runs, meaning the samples do not have direct, linear relations like their calculations assume. How much of an effect those two issues have is unclear. BEST seems to believe the issues don't matter, but they've done absolutely nothing to demonstrate that is true.

    In any event, I don't like BEST more than the other records because BEST clearly screws up its regional results. I've written a few posts about this. My favorite example is all the data agrees the southeast United States has seen cooling while most of the country has seen warming. The signal is clear if you look at the station records. GISS, CRU and everybody else agrees about this. BEST has even published a figure showing it is true.

    But if you look at the BEST results, they show the opposite. Despite the data showing SE US has a clear cooling trend, BEST finds a strong warming trend for the entire area. That's baffling. We're talking about an area 1/3rd the size of Europe. If BEST is smearing warming so much it can't get things right on that large a scale, I can't see any reason to prefer it to something like GISS.

  3. Thanks Brandon. It's nice to know someone else understands what I was saying, I was starting to wonder. The problem really jumped out at me when I read their paper and I thought it would be something they would want to correct.

  4. Jeff Id, I'm just happy you raised that point in the first place. It was good to see BEST's poor handling of people's feedback wasn't limited just to me. There's always that concern things are happening just because people don't like me!

    On the issue of blending trends, here is a quick road map to help you find what's been said. My first post here on BEST was a simple one which showed BEST systematically adjusted the trends for all of Illinois in an upward direction:

    http://www.hi-izuru.org/wp_blog/2013/12/illinois-sucks-at-measuring-temperature/

    Later, I wrote a post drawing attention to the fact BEST's calculations for "empirical breakpoints" seem to be messed up, finding breakpoints where there are none (and failing to find them where there are). The more interesting issue came up in comments where, thanks to discussion with Carrick, it was discovered BEST changes all of the Southeast United States from having a cooling trend to having a warming trend. I'd start about here:

    http://www.hi-izuru.org/wp_blog/2014/04/a-small-challenge/#comment-994

    A couple months later I brought that issue up when discussing a peculiar aspect of the BEST temperature set. Namely, I showed (for my area) GISS and BEST produce results which are nearly identical in their high frequency component yet BEST adds a strong warming trend. I thought it was fascinating the low frequency components of the sets are completely different yet the high frequency components are virtually identical. One could be forgiven for thinking all BEST does differently than the other groups, at least in some areas, is get the long-term trend wrong.

    http://www.hi-izuru.org/wp_blog/2014/07/is-best-really-the-best/

    But that result was shown only for my local area. To address the possibility of it being a strange artifact or a case of cherry-picking, I then wrote a post offering to compare BEST and GISS results for any portion of the world. A number of people took me up on the offer, and the results were found to hold in each case:

    http://www.hi-izuru.org/wp_blog/2014/07/pick-a-spot/

    Finally, I decided to approach the matter in a more systematic manner, writing code to let me examine the trends of the entire globe for any period people were interested in. There were some interesting points about the shape of the distribution of trends between GISS and BEST, but the most interesting thing (to me) is I found absolutely no part of the world showed cooling since 1960 in the BEST data set:

    http://www.hi-izuru.org/wp_blog/2014/07/cooling-is-not-impossible/

    That is incredible. I am confident it is an artifact of the BEST methodology, and I have a couple ideas as to what may cause it. Unfortunately, I was doing this about the same time I had found the "secret" data for the Cook et al consensus paper. Because of that, I got distracted and wound up not pursuing things further. I regret that because I think the issues I was examining are important. Also, I had just gotten code ready to allow me to compare BEST and GISS to the other temperature sets as well, code which I've now managed to lose.

    Anyway, I know that's a fair amount of material, and you can jump to the end if you'd like. I just figure it might be good to see how my thoughts on this issue have developed.

    Plus, there's a bit of ego involved. I like pointing out how long I've been discussing issues most people seem to be completely unaware of!

  5. Carrick, I'm not sure what you mean. The only time I know of where 1960 should matter is that BEST uses 1960-2010 as its baseline period for aligning its jackknifed series. I think it is used in their spatial uncertainty calculations as well. That's not an issue for what I've said in this post though. The only time 1960 came up here is in reference to my previous post. I forget just why I picked 1960 for my first test in that post, but I tested a lot of periods when writing it:

    Of course, this is just for one period. I tested to see if this pattern held for other periods. Try periods as far back to 1900, I found the same thing. The world simply didn’t cool.

    I just realized the typo in that third sentence. I don't know why I wrote "try" instead of "trying". That bugs me. I think I need to fix it.

    Anyway, I don't think 1960 is particularly special for this. BEST has said they used the 1900-2000 period to calculate their climatology field. We know there was a general warming trend over the 1900-2000 period. My current supposition is by calculating their baselines over a period with a known warming trend, BEST biased its results toward having a warming trend in that period. That wouldn't cause a warming trend to appear out of nothing in their global results, but it could certainly influence the trends of individual locations.

    If I'm right, the influence would manifest as a bias toward showing warming trends in all locations over the 1900-2000 period. Any comparisons of other periods, like the 1960-2013 one shown in my previous post, would be influenced depending upon how much overlap there is with the baseline period (relative to how much non-overlap there is). 1960-2013 would have 40 years overlap, 13 years non-overlap. That would mean it is more heavily influenced by the bias I suggest exists than the 1980-2013 period with its 20/13 breakdown.

    That does fit with what BEST's results show, but I'm not sure how conclusive it is. One could always argue using shorter periods will result in more variance in your trends. One can look at how the variance in BEST's trends changes depending on what period you use (keeping a constant period length), but that has a confounding factor in that the quality/amount of temperature data isn't the same over time.

    What I'm basically saying is it's really tricky to test for an artifact you think exists due to a 100-year calibration period if you only have a 13-year verification period.
