So if you've read my last few posts, you can probably tell I could use a bit of a break. I was considering taking a few days off. I probably should. I just don't know if I could bring myself to. So instead, I'm going to try doing something different. I'm going to talk about math.
You might not know it, but I love math. I think math is beautiful. One of my biggest regrets in life is that I never got much formal training in math. Even so, I think I still understand it in a more fundamental way than the average person. That's because to me, math is more philosophical than practical. I like the logical structure it imposes on arguments. I like how it makes you think in a rigorous manner.
So today's post is about math. Specifically, it's about some work by Richard Tol, and if you've followed my blog, you'll know there is history there. That's not important though. The data errors previously found in Tol's work aren't important either. All that matters is the implicit argument found in one of his papers: the less data you have, the more certain you are of your results.
That argument is obviously absurd. We could just laugh it off and move on. That'd be a bad idea though. Silly ideas often get disguised with verbose terminology and complicated equations. Once that happens, it stops being easy to tell they're silly. Then you can't just laugh them off. Then you have to be able to explain what makes them wrong.
To begin with, Tol uses this data:
More or less. The data might not match up exactly because the data set has been changed more than half a dozen times due to people finding errors in it. That's pretty remarkable given the values are supposed to just be taken straight out of papers. As in, Tol was supposed to have read ~20 papers, looked at numbers in them, copied them down and plotted them on a graph.
Regardless, any differences between this version and the one used in the paper we're looking at should be irrelevant for our purposes. What matters is we have ~20 data points plotted against an x-axis ranging from 0 to 5, with almost all of the y-values below 0. Only one point is above 0 by any meaningful amount, and it sits at 1 on the x-axis. Suppose you wanted to determine what relation this data indicated between the variables on the x and y axes. What would you do?
One option would be to draw a line through the data. Tol did this. The first time he did it, it was with an earlier version of this data set. That got him this image:
Later on, after he corrected some errors in the data and added new data in, he tried drawing a new line. He got this one:
That's a pretty different looking image. The reason is when you only have ~20 data points, it's hard to tell what relationship there is between two variables. You pretty much just have to guess. Tol guessed there was a quadratic relationship. That is, he guessed: f(x) = ax^2 + bx. But the relationship could instead have been a pure quadratic with no linear term: f(x) = ax^2. Or linear: f(x) = ax. Or cubic: f(x) = ax^3 + bx^2 + cx.
All of these are "models" he's "fitting" to the data. That is, he's guessing at a relationship the two variables might have to one another, and he's then finding out what numbers would best fit that relationship. There are an infinite number he could try, and none of them are "right." It's just a matter of which give results that seem to be reasonable.
So with that in mind, it's obvious a model that changes greatly when a small amount of data changes is not a very good model. In order to have much confidence in one's work, it should be robust to the removal of small amounts of data. Since Tol's earlier model wasn't, finding a new model seems wise. The paper I want to discuss today does exactly that. The new model is fairly complicated, though, and I'm not going to go through it.
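One quick way to check that kind of robustness (my own sketch, not anything taken from Tol's paper) is to refit the model with each data point left out in turn and see how far the fitted curve moves. A robust fit barely budges; a fragile one swings when a single point, usually an outlier, is dropped.

```python
import numpy as np

# Made-up data again, with one extreme value at the end.
x = np.array([1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.0, 3.5, 4.0, 5.4])
y = np.array([0.5, -0.2, -0.5, -0.9, -1.1, -1.4, -1.6, -2.2, -3.0, -11.5])

def fit_quadratic(x, y):
    """Fit f(x) = a*x^2 + b*x by least squares; return (a, b)."""
    design = np.column_stack([x**2, x])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs

grid = np.linspace(1.0, 5.4, 50)
a, b = fit_quadratic(x, y)
full_fit = a * grid**2 + b * grid

# Drop each point in turn and measure how far the refitted curve moves.
for i in range(len(x)):
    a_i, b_i = fit_quadratic(np.delete(x, i), np.delete(y, i))
    shift = np.max(np.abs(a_i * grid**2 + b_i * grid - full_fit))
    print(f"drop point {i} (x={x[i]}, y={y[i]}): max curve shift = {shift:.2f}")
```

On toy data like this, dropping most points barely changes anything, while dropping the extreme one moves the whole curve. That's exactly the sort of fragility the earlier model showed.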
Confused about why I'd bring up a paper whose model I won't go through? Don't be. The model itself isn't important (which might be why Tol keeps changing it from paper to paper). What is important is the output of the model. I'll show you that in a minute. Before I do though, I need to explain one thing real quick. Tol splits his data up into four groups labeled AR2, AR3, AR4 and AR5. Those refer to the Second, Third, Fourth and Fifth IPCC Assessment Reports. Each group includes only those papers published before that particular report. You can see the breakdown in this table:
So AR2 includes the first four data points, AR3 includes the first 9 data points, AR4 includes the first 15 data points, and AR5 includes all 21 data points. With that in mind, here is the first set of outputs from Tol's model:
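In code terms, those four groups are just nested, cumulative slices of one data set ordered by publication date. The arrays below are placeholders; only the group sizes come from the table above.

```python
import numpy as np

# Placeholder arrays standing in for 21 (warming, impact) estimates,
# ordered by publication date. The values here are dummies.
warming = np.zeros(21)
impact = np.zeros(21)

# Each report's group contains every estimate published before that report,
# so the groups are nested: AR2 is a subset of AR3, AR3 of AR4, and so on.
cutoffs = {"AR2": 4, "AR3": 9, "AR4": 15, "AR5": 21}
groups = {name: (warming[:n], impact[:n]) for name, n in cutoffs.items()}

for name, (w, im) in groups.items():
    print(f"{name}: first {len(w)} data points")
```

The nesting matters for what comes next: going from one group to the next never removes data, it only adds it.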
Ah-hah, you say. There is more data, but the bands shrink. That means I was wrong, right? The bands represent uncertainty levels, right? Well, let's see. Tol says:
Figure 1 shows the restricted Nadaraya-Watson kernel regression and its 95% confidence interval for the studies published before the Second Assessment Report, the Third, the Fourth and the Fifth, respectively. Before AR4, estimates of the impact [of climate change] were limited to warming of 2.5°C and 3.0°C. The kernel regression is therefore valid only for a limited range of climatic changes. This range shrinks between AR2 and AR3 as the number of observations rises from 4 to 9, and the standard deviations shrink accordingly.
So yeah, that would seem to contradict what I said. As more data was added between AR2 and AR3, the confidence interval shrank, meaning we grew more certain of our results. That's exactly what we would hope would happen.
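If the phrase "Nadaraya-Watson kernel regression" means nothing to you, the plain (unrestricted) estimator is simple enough to sketch: the estimate at any point is just a weighted average of the y-values, with nearby points weighted more heavily. The confidence band below is a crude plug-in band I wrote purely for illustration; it is not the "restricted" version or the interval construction Tol actually uses, and the data points are once again made up.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x0, x, y, h):
    """Nadaraya-Watson estimate at x0 with bandwidth h, plus a rough 95% band.

    The band is a simple plug-in: a locally weighted residual variance scaled
    by the squared kernel weights. It shows the mechanics, nothing more.
    """
    w = gaussian_kernel((x - x0) / h)
    w = w / w.sum()                       # normalized weights
    m = np.sum(w * y)                     # the kernel regression estimate
    local_var = np.sum(w * (y - m) ** 2)  # locally weighted spread of the data
    se = np.sqrt(local_var * np.sum(w ** 2))
    return m, m - 1.96 * se, m + 1.96 * se

# Made-up data: mostly negative impacts, one big positive and one big negative outlier.
x = np.array([1.0, 2.5, 2.5, 2.5, 3.0, 3.0, 3.0, 3.2, 5.4])
y = np.array([2.3, -0.1, -0.5, -4.8, -1.4, -1.9, -2.5, -11.5, -6.1])

for x0 in (1.0, 2.5, 3.0):
    m, lo, hi = nadaraya_watson(x0, x, y, h=0.8)
    print(f"warming {x0}: estimate {m:.2f}, rough 95% band [{lo:.2f}, {hi:.2f}]")
```

Notice where the uncertainty comes from in a setup like this: the band's width is driven by how spread out the nearby y-values are. Keep that in mind for what follows.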
But let me ask you something, do you think it's just chance Tol singled out the jump from AR2 to AR3? What about the jump from AR3 to AR4? How about the one from AR4 to AR5? Do you think he just wanted to save space and felt they weren't important?
It'd be easy enough to check. Let's go ahead. Here's the output for AR4 and AR5:
Well would you look at that. The confidence intervals for AR5 are huge. They are way larger than the ones for AR4. They're way larger than the ones for AR3 or AR2 too. How is that possible? AR5 had the most data. How could it be that having more data decreased our certainty? More importantly, how could having less data increase our certainty? How could knowing less mean being more sure of what we know?
It's simple. It's because of how Tol defined uncertainty. A common way to define uncertainty is to look at variance. The more different your data points are from one another, the more uncertain you are of your results. It's a common sense approach. If all of your data points are very different from one another, you obviously can't be too certain of any results you draw from them.
The problem is the inverse is not inherently true. We all know data can be biased. If your data is biased, then constantly getting the same results doesn't mean those results are accurate. But that's not the (only) problem here. The problem here is, you only have 21 data points!
The values on the x-axis range from 1 to 5.4. The values on the y-axis range from -11.5 to 2.3. With all the possible combinations you could have between those, 21 data points is nowhere near enough. It's even worse if you then start splitting the data up into four groups.
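Here's a toy illustration of how a variance-based definition of uncertainty can reward having less data. The numbers are made up by me and only loosely echo the ranges above; the point is the mechanism, not the values.

```python
import numpy as np

def mean_and_interval(y):
    """Mean plus a naive 95% interval based on the sample standard deviation."""
    m = y.mean()
    se = y.std(ddof=1) / np.sqrt(len(y))
    return m, m - 1.96 * se, m + 1.96 * se

# Four early estimates that happen to agree closely (made-up values)...
early = np.array([-1.4, -1.5, -1.6, -1.9])
# ...versus a larger set that contains the early ones plus one extreme value.
later = np.array([-1.4, -1.5, -1.6, -1.9, -0.1, -0.5, -0.9, -2.5,
                  -3.0, -4.8, 2.3, 0.5, -1.1, -0.3, -2.2, -0.2,
                  -1.8, -3.6, -6.1, -0.4, -11.5])

for name, data in (("4 points", early), ("21 points", later)):
    m, lo, hi = mean_and_interval(data)
    print(f"{name}: mean {m:.2f}, 95% interval [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
```

The four agreeing points produce a much tighter interval than the twenty-one points that contain them, even though the larger set plainly tells you more about how scattered these estimates really are.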
Now, the primary reason for the increase in uncertainty in the AR5 chart is the one data point at -11.5. That point is a clear outlier. It actually shouldn't be in the data set, as it is given in PPP GDP instead of nominal GDP.* Still, if a single outlier can cause such a dramatic problem for your model, it's clear your model has serious problems.
Also, if one negative outlier causes serious problems, it's likely one positive outlier could too. As I pointed out before, there is only one data point that is notably above 0. That one is introduced in the AR4 group, and as we can see, that is when the model output becomes positive in its early portions. It would appear the model is highly sensitive to both major outliers.
In fact, you can see the model is so sensitive to outliers that its uncertainty increases when they are added, and it increases primarily because they were added.
Normally, if one data point is very unlike the rest, you assume the outlier is more likely to be wrong and give more weight to the rest of the data. This model does the opposite. This model assumes the outlier is more likely to be right and gives it more weight.
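To show what "give more weight to the rest of the data" looks like in practice, here's the simplest version of the standard approach: downweight points that sit far from the bulk of the data. This is generic robust-statistics boilerplate (a Huber-style cutoff based on the median absolute deviation), not a suggestion for how Tol's model should work, and the impacts are once again made-up numbers.

```python
import numpy as np

def huber_weights(y, k=1.345):
    """Weights that shrink toward zero for points far from the median, in MAD units."""
    med = np.median(y)
    mad = np.median(np.abs(y - med)) * 1.4826   # robust estimate of the spread
    z = np.abs(y - med) / mad
    return np.clip(k / np.maximum(z, k), 0.0, 1.0)

# Made-up impact estimates with one extreme value at -11.5.
y = np.array([-0.1, -0.5, -0.9, -1.4, -1.6, -1.9, -2.5, 2.3, -11.5])

w = huber_weights(y)
for yi, wi in zip(y, w):
    print(f"impact {yi:6.1f}  weight {wi:.2f}")
print(f"plain mean {y.mean():.2f} vs downweighted mean {np.sum(w * y) / np.sum(w):.2f}")
```

With a scheme like this, the -11.5 and the 2.3 each get only a fraction of the weight the clustered points get, so a single extreme estimate can't dominate the result.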
*It turns out this error was corrected in Tol's subsequent paper. Given I pointed the error out last year, it appears Tol corrected the error because of me. I'm not sure if he gave me credit. I am sure he introduced at least one other data error into that paper though. It's incredible really. This is at least the tenth revision of the data set I've seen now.