2012-03-18 19:35:15 Looking forward to the Quality Control stage
John Cook


We are smoking through the rating and before we know it, we'll have completed that stage and be onto the quality control section. So I will need to do some programming so that when we finish rating (in approximately a week's time), we can jump straight into quality control. The concept for quality control is so far a little fuzzy, so I'd like to pin down the process now and get more specific. Here is what I have in mind.

Once all the papers have been rated twice, I will add a new section to TCP: "Disagreements". This page will show all the instances where someone has rated a paper differently to you, in the following form:

Title                                | Other Rating      | Your Rating
-------------------------------------|-------------------|----------------------
Paper Title (mouseover for abstract) | Category          | Dropdown: Category
                                     | Endorsement Level | Dropdown: Endorsement
                                     |                   | Textbox: Comment
                                     |                   | Checkbox: Confirmed

I've suggested a new feature - a "confirmed" checkbox. What I suggest happens here is we all look through all the instances where we disagree with another rating and see what ratings/comments the other rater has. If we agree with their rating (perhaps ours was an early rating from before some of our clarifying discussion, or just a mistake), then we update our rating to make it consistent with the other rating and it disappears from the list. If we're happy with our existing rating, then we click "confirmed" and perhaps add a clarifying comment for the other rater.

What might happen then is both raters confirm their own rating, so we end up with two ratings where neither rater will budge. Then I suggest we bring in one or more extra judges to also post their rating on the contentious paper. I'm not sure exactly how this will play out - ideally just having a third set of eyes will offer enough perspective to come to a consensus, but it may be we need more voices, which may add clarity or just more noise. Perhaps it comes down to this: if 5 or more ratings have been cast on a single paper, majority rules?

Any paper with 2 or more ratings is "locked"; that becomes the "official TCP rating".

In the forum header, I'll have a display like:

Locked Papers: 9343
Papers Yet to be locked: 3343

And maybe something specific like:

My Ratings that disagree with others: 343
My Unconfirmed Ratings: 121 

So the goal is to whittle your own unconfirmed ratings down to 0.

Thoughts, comments?

2012-03-18 22:43:39
Ari Jokimäki


If there is a disagreement, I suggest we take those cases directly to third parties without any confirmation phase. It is possible that someone might change their rating just because someone else rated differently, not necessarily because they think their previous rating is inaccurate. I think it would be safest to keep each individual's ratings hidden from others until we have an official rating for all papers. In any case, I think it might be a good idea to take a backup of phase 1 ratings before proceeding, so that we can keep track of what happened, which might be important when we consider uncertainties of the rating phase.

Also, I think the first phase of quality control has to be everyone looking back at their own ratings to see if there is something to correct. For example, we changed the rules on mitigation papers during the rating phase, so at least I need to check back on those.

2012-03-19 03:52:54


John's scheme has the advantage that we are forced to check our ratings, but only when there's disagreement. It should not be taken as criticism or error checking, only as a sign that there may be some ambiguity in the classification. If the disagreement remains, ask for a third opinion.

Having some time to review our ratings, as Ari says, would be good. If it is possible to add numbering when we search for our rated papers, we can keep track of where we are with the process and do the job in small steps.

2012-03-19 04:38:42
Andy S


Perhaps it would be possible to fix category disagreements first. I would imagine that those should be fairly quick to resolve. That exercise may provide some insight as to how to homogenize the ratings. I have probably underreported both "methods" and "opinion" categories.

I suspect that we have all felt our rating criteria drift with time, and it would be useful to review those ratings where there is disagreement. Some rating discrepancies may also be fat-finger entry mistakes and those, of course, need to be fixed. It may be helpful to have some stats, for example, the percentage of certain ratings that we have done compared to the group overall. This may reveal some systematic bias of certain individuals - well, maybe not bias but failure to grasp the criteria correctly. E.g., I have awarded very few #1's and it's entirely possible that others have understood the criteria differently, and perhaps more correctly, than me.

I'm not sure whether all rating disagreements need to be resolved, as long as they are adjacent on the scale. Some cases are genuinely borderline and it might be better, more accurate even, to classify an abstract as a #2.5 if there is one #2 and one #3 vote. Of course, this would make the results more complicated to assess. Getting more votes to force a consensus to a single value, as John has suggested, may be simpler.

Regardless, Ari's comment about keeping backups is exactly right (I'm sure John was planning to do this anyway). Basically, what we will be doing in QC is a filtering process and it needs to be transparent and fully documented. The favourite (unjust) criticisms of SkS by "skeptics" are that we are biased, that we bury dissent and that we change our posts on the sly.

2012-03-19 15:04:55
Sarah Green

Following Andy, I would like to see the percentage I rated in each category relative to the whole sample, i.e. 50% of mine are neutral vs 55% of everyone's, 10% are methods, etc. I imagine this as a bar graph (or two: one for category, one for rating) with a way to choose "time" periods (rating number) so we can see drifts since we started.
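
The per-rater breakdown Sarah describes is easy to compute once ratings are exported as (rater, category) pairs. A minimal sketch with made-up data (the names and categories here are hypothetical, not pulled from the TCP database):

```python
from collections import Counter

# Hypothetical (rater, category) pairs; in TCP these would come from the ratings table.
ratings = [
    ("sarah", "neutral"), ("sarah", "neutral"), ("sarah", "methods"),
    ("ari", "neutral"), ("ari", "neutral"), ("ari", "implicit"),
]

def category_percentages(pairs, rater=None):
    """Percent of ratings in each category, overall or for a single rater."""
    counts = Counter(cat for who, cat in pairs if rater is None or who == rater)
    total = sum(counts.values())
    return {cat: 100.0 * n / total for cat, n in counts.items()}

mine = category_percentages(ratings, rater="sarah")
overall = category_percentages(ratings)
for cat in sorted(set(mine) | set(overall)):
    print(f"{cat}: {mine.get(cat, 0.0):.0f}% of mine vs {overall.get(cat, 0.0):.0f}% overall")
```

Slicing the input list by rating number before calling the function would give the drift-over-time view.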

John, do you have any sense of how many disagreements have shown up? 5%, 10%, 40%? I'm guessing 15-20%. (Already I disagree with myself maybe 2% of the time, based on how many I change when I look at the last 50 lists.)

If 15% is the ballpark range then the re-rating system sounds fine. I don't think we'll change ratings just to match the other one. Mainly I'd change mine based on an evolving understanding of the rating system, simple errors, and missing details from sometimes reading too quickly. 

If disagreements account for more than 25% of the total then asking for third ratings immediately might be better. 

I suppose if you wanted to speed the process you could simultaneously ask for 3rd ratings while resolving disagreements on the theory that additional input can't hurt. Might lead to complications in explaining the process, though; better not.

I suspect that the vast majority of disagreements will be between 'implicit' and 'neutral'. If you can pull out data on which pairs of non-agreement were most common, perhaps we can re-iterate/discuss/clarify the categorization rules for those.

Does it make sense to ask for a 3rd opinion for all rejection papers? We really want to be sure those are firm. (and there aren't many!)

2012-03-19 16:48:32
Ari Jokimäki


My point is that at the end of the rating phase we have 2 ratings for each paper, but if we start consulting others' ratings, there is a possibility of some unwanted convergence toward the other rating, which means that after this confirmation phase we would actually have fewer than 2 ratings per paper (as one consensus rating cannot be counted as two individual ratings). This would make our already bad polling statistics worse (and introduce yet another impossible-to-quantify source of uncertainty). On the other hand, if we take third ratings for all disagreements without a confirmation phase, we get 2 or more ratings for all papers.

I also think that it would save us some time if we started the third-person rating immediately (after the personal editing phase I suggested above) without a confirmation phase.

2012-03-20 07:09:22



Will papers rated "not climate related" be deleted anyway, or in case of disagreement will they be passed to the 2nd phase? A few papers I rated were borderline, and if they get deleted I'd like to re-check them.

2012-03-20 12:52:37 Okay, brace yourself for these numbers
John Cook

Have done a database query of how many disagreements there are between ratings so far - both for category ratings and endorsement level ratings:

Number of Category disagreements: 4850

Number of Endorsement disagreements: 5738

However, I also did an SQL query for all endorsement ratings that agree with each other and got 11,056. So I'm not sure how that works and whether I'm doing the SQL query correctly.
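
One common pitfall with this kind of self-join is counting each pair of ratings twice (once as A-vs-B and once as B-vs-A). A sketch of a query that counts each pair only once, using toy data in sqlite3 (the table and column names are guesses, not TCP's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ratings (id INTEGER PRIMARY KEY, paper_id INT, rater TEXT, endorsement INT);
INSERT INTO ratings (paper_id, rater, endorsement) VALUES
  (1, 'a', 3), (1, 'b', 4),   -- disagreement
  (2, 'a', 2), (2, 'b', 2),   -- agreement
  (3, 'a', 4), (3, 'b', 3);   -- disagreement
""")

# r1.id < r2.id counts each pair of ratings exactly once, and also
# rules out comparing a rating with itself.
disagreements = conn.execute("""
    SELECT COUNT(*) FROM ratings r1
    JOIN ratings r2 ON r1.paper_id = r2.paper_id
                   AND r1.id < r2.id
                   AND r1.rater <> r2.rater
    WHERE r1.endorsement <> r2.endorsement
""").fetchone()[0]
print(disagreements)  # 2 disagreeing pairs in this toy data
```

If the real query used `r1.id <> r2.id` instead of `<`, every disagreeing pair would be reported twice, which would roughly halve the figures above.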

A few follow-up points:

  • Yes, will definitely back-up ratings before we start quality control
  • Having everyone review their own ratings sounds like a good idea and I need to beef up the My Rating page to make this easier to do.
  • Ari's idea is that we get a third rating for any disagreements. So if the 5000+ disagreements figure is correct (not yet confirmed), we're looking at re-rating at least half our sample again - possibly close to the entire sample. Who's up for another month of abstract rating? I don't particularly relish the idea; I'm keen to get to the result in the most efficient manner.
  • The way I see it, the most practical and efficient approach is to go directly to disputed ratings and sort all those out. I see the "checking disagreements" as a way of flagging ratings where we may have suffered from rating drift, or not understanding the guidelines or simply papers that are tricky to categorise. But I also want to include the "confirm my rating" checkbox so there's no pressure to have to conform to another's rating. That way, you can fly through all your disagreements, either confirming your original rating or if you realise your ratings drifted or you misunderstood the guideline, then update the rating. If we all go through our own disagreements, then we will cull the more egregious disagreements and be left with genuine disagreements where both raters have confirmed their rating. It will be those "impasse" disagreements where we bring in a third party.
  • I can start this process with category disagreements as they are not as "crucial" as endorsement levels and as Andy says, probably less contentious and easier to do.
  • Lastly, this is a pretty fundamental question but I always assumed that the final result would be discrete values for each paper. Eg - a paper will be rated a 2 or a 3 but not taking the average of multiple ratings to get a 2.5. The purpose of multiple rating wasn't to generate a statistical rating for each paper (we'd need a lot more ratings per paper to get that) but as a form of quality control so we're all checking each other. The upcoming phase 2 is the implementation of that concept. Is this assumption of discrete ratings valid? Have others been operating under a different assumption?
  • BTW, hypothetically, if we want to do third person rating immediately as Ari suggests, we can just keep motoring on after we've crossed the 24,544 mark and I can filter them to only show disagreement papers. But let's resolve this "do we reduce our disagreements" question first.
  • Riccardo, good point - we should eliminate opinion papers from the analysis before we start agonising over disagreements over those papers. There are probably a few hundred opinion papers.
2012-03-20 13:02:43
Andy S


Hmm, that's a lot more disagreement than I expected. John, did you exclude the unrated papers from your query? If you didn't, that would remove about 4000 from both those disagreement totals, which would be more in the ballpark.

2012-03-20 13:28:57
Sarah Green

I've been mulling on this all day (night for some of you).

I see Ari's point that if this were like an opinion poll we'd need to avoid influencing each other and changing ratings. And we'd want at least two independent ratings. In that case adjusting disagreements would not be appropriate (except re-visiting our own, perhaps). 

But, this is clearly not an independent poll, nor really a statistical exercise. We are just assisting in the effort to apply defined criteria to the abstracts with the goal of classifying them as objectively as possible. Disagreements arise because neither the criteria nor the abstracts can be 100% precise. We have already gone down the path of trying to reach a consensus through the discussions of particular cases. From the start we would never be able to claim that ratings were done by independent, unbiased, or random people anyhow. The goal should be to match the criteria as closely as possible.

So, I'm leaning toward John's initial scheme. 

Wow, over 30% disagreement- that's high. Can you pull out how many disagreed by more than one point?  Or how many are '2 vs 3' disagreements?

(I don't want to ask who is most disagreeable....)

2012-03-20 13:33:18
Andy S


OK I made some assumptions about John's calculation.

  • I assumed that he didn't remove the unrated papers.
  • I assumed that the papers that agree with each other is a double count.

This means that (at the time I did this):

8175 papers had been graded twice

933 of these papers had a category discrepancy (11%)

1821 of these papers had a ratings discrepancy (22%)

5528 were graded consistently (68%)

2647 were graded differently in one or more ways (32%)

107 were graded differently in both ways (1%)

Which seems more like it and, if I'm right (big if) then the next stage should be easier. It also means that our measurement error is not that bad.
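
For what it's worth, these figures hang together under inclusion-exclusion: the papers differing in one or more ways are the category disagreements plus the rating disagreements, minus the ones counted in both groups. A quick check:

```python
# Counts from Andy's breakdown of papers graded twice (at the time of his post).
graded_twice = 8175
category_diff = 933   # pairs disagreeing on category
rating_diff = 1821    # pairs disagreeing on endorsement level
both_diff = 107       # pairs disagreeing on both

# Inclusion-exclusion: disagree on category OR endorsement, without
# double-counting the papers that disagree on both.
either_diff = category_diff + rating_diff - both_diff
consistent = graded_twice - either_diff

print(either_diff)   # 2647 graded differently in one or more ways
print(consistent)    # 5528 graded consistently, matching Andy's count
```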

2012-03-20 14:00:46
Sarah Green

Andy- thanks for digging deeper into the numbers. That range is closer to what I expected.

2012-03-20 14:06:15
John Cook


Andy, no, I didn't do it the way you think. I basically did a query of all pairs of ratings of the same paper done by different people (to exclude comparing your own rating to itself) where the two ratings were different.

Okay, one piece of good news. I realised that my query was counting every pair twice. So I updated my query to make sure it didn't double dip and the updated quantity of endorsement disagreements is now 2876.

I decided to dig a little deeper and probe how big the disagreements are. So here are the numbers on how many "rating pairs" are different by 1 endorsement level or 2 endorsement levels:

# of disagreements by 1 endorsement level: 1112

# of disagreements by 2 endorsement levels: 146
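
This breakdown is just a histogram of the absolute gap between the two ratings of each paper. A sketch with hypothetical rating pairs (not real TCP data):

```python
from collections import Counter

# Hypothetical (first, second) endorsement pairs for papers rated twice.
pairs = [(3, 4), (2, 2), (4, 3), (4, 6), (1, 1), (3, 7)]

# Gap 0 means the two raters agreed; larger gaps are bigger disputes.
gaps = Counter(abs(a - b) for a, b in pairs)
for gap in sorted(g for g in gaps if g > 0):
    print(f"# of disagreements by {gap} endorsement level(s): {gaps[gap]}")
```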

2012-03-20 17:12:11
Ari Jokimäki


I wonder how long the confirmation phase will last? John's numbers suggest about 25% disagreement. That would make my disagreements around 765. At a pace of a hundred per day it would take me about a week.

After that would be the third-party rating phase, which would then last how long? Hard to say how many disagreements would be resolved in the confirmation phase, but speaking for myself, I don't think I'll change my ratings that much.

Would the confirmation phase + third-party rating phase really be quicker than just a full third-party rating phase?

2012-03-20 22:11:38
John Cook


Yes, much quicker. The confirmation phase would enable us to relatively quickly knock off a significant chunk of the disagreements.

However, the third party rating phase - who knows how long and complicated that's going to be. I'm not quite sure what's going to happen there. So the fewer of those we have to handle as third-party ratings, the better.

I think it would be quite useful to go through all my disagreements to see what others have rated and read their comments, which may cause me to relook at my own ratings. If I decide I still stand by my ratings, I'll confirm my own rating and post a comment, which will allow the other rater to see my thoughts and then decide what to do. It seems the quickest, most efficient way of getting consensus "official ratings".

It will also be a collaborative process where we directly interact over difficult ratings - I see it as being like all of us getting into a room together and nutting out the hard-to-rate papers.

2012-03-20 22:58:24
Ari Jokimäki


I'm not sure why you think third party rating would be complicated. It's just like the rating phase we are doing now, right?

2012-03-21 00:12:45
John Cook


Well, if we assume that our final results will be to assign discrete endorsement levels to each paper, rather than a statistical average, then how exactly do you think the third party phase will work? Deciding vote? Keep getting votes until a consensus emerges?

2012-03-21 01:43:45
Ari Jokimäki


Yes, I think of the third party rating as a deciding vote. It is likely that in most disagreements the third vote will match one of the original two votes. In some cases you get a third, different vote, so you need to get a fourth. However, the plus side is that you get a sense of uncertainty in those cases that require more than three votes.
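
The deciding-vote rule amounts to: keep collecting ratings until one value has a strict majority. A minimal sketch (the function name is mine, not TCP's):

```python
from collections import Counter

def official_rating(ratings):
    """Return the value with a strict majority of votes, or None if still contested."""
    value, votes = Counter(ratings).most_common(1)[0]
    return value if votes > len(ratings) / 2 else None

print(official_rating([2, 3]))     # None - tied, needs a third vote
print(official_rating([2, 3, 3]))  # 3 - the third vote decides
print(official_rating([2, 3, 4]))  # None - three different votes, get a fourth
```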

I'm not surprised that there are disagreements that are 4 levels apart. I'm surprised to see that there aren't any disagreements even further apart. I have seen cases where the AGW contribution has been defined as exactly 50%. I have rated these neutral (the rules say higher than 50% is an endorsement and lower than 50% is a minimization, but the rules don't mention what to do with exactly 50% - perhaps a minor error in the classification rules), but some others might have rated them explicit quantifications either way, so you could get cases that are 7 (?) apart.

I also want to see differences from my ratings, but only after rating has been completed. I had one thought relating to this: if you are going to go with the confirmation phase, one option would be to do it so that there would only be an indication of disagreement but no information on the other rating. This way you would get pointers to cases needing confirmation, but you wouldn't get biased by other ratings (at least not that much). This would be a sort of compromise between your suggestion and mine.

2012-03-21 14:19:00
Dana Nuccitelli

I'd like to look at how my ratings compared to others and see if I agree with theirs or think mine are correct (i.e. the John method).  Maybe we can even add a text box to explain why we think our rating was correct in a disagreement, if that's the case?

I imagine this would be the faster method, as we'd be looking at abstracts we've already read.

2012-03-21 16:10:46
Sarah Green

I'm still thinking about how this process differs from a poll or survey. In an opinion poll you ask the same question to a large number of people. In this project we are asking 11,000 different 'questions' to two people each. So we can't really do any kind of statistical analysis based on differences in rating a specific paper. The stats have to come from the overall collection of ratings. Therefore it's ok to collude and adjust to resolve differences. Using a 3rd (or 4th) opinion to break a dispute is fine.

Actually, one could argue that John or Ari should decide on disputed papers with the following reasoning:
If the goal is to apply the criteria objectively then the person who wrote the criteria should be the final arbitrator (John). Or if the goal is consistency, then perhaps the person who rated the most papers should arbitrate (Ari).
Ok, I'm not seriously suggesting that. (whew)

But let's consider that the disagreements actually contain some information. Certainly they reflect the clarity of the criteria, and maybe other interesting things as well.

  • Which categories of papers have the most disagreements? 
  • Which rating levels are most disputed?
  • Is agreement better at the ends of the scale? (1-2, 6-7)
  • Can we make any generalizations about disputed ones? Do papers on a particular topic show up more often? Are some inherently falling between the definitions? 

e.g. I often have trouble with those looking at soil carbon stocks. Some look at the impact of warming on soil carbon stocks or emissions (impacts, neutral or implicit); some look at the impact of fertilization, etc. on C emissions from soil (mitigation, usually implicit); some are inventory papers (methods); some combine various aspects (?); some are focused on feedbacks (impacts or mitigation, neutral or implicit, depending on details).

And I'm sure we all bring our own slants as well. I put very few into 'not climate related'. I also had few 'explicit>50%', because I didn't find many that quantified to that degree and I didn't try to guess.

A practical issue: if 2 people disagree, then only one needs to change to fix the issue. If that happens will the paper disappear from the list of disputed ones? Or will it still show up for the second rater? (who could also change their rating!)

The large number of 4+ differences baffle me. I know I have found a few simple inattention errors in mine where I just mis-hit the drop-down list, but I hope that's not many. 

I'm willing to go ahead with the disagreement check.

2012-03-21 17:58:35 Green thumb for disagreement check
John Cook


The general "consensus" is to have the disagreement check, so I will begin programming that shortly. I will use my own ratings as a beta test and will share any lessons learned from that experience here. Of course, we still have 3000+ ratings to go and our progress is slowing, so let's get onto the ratings while I'm coding.

Sarah, the idea was if someone changes their rating to agree with another, then that particular disagreement disappears off the disputed list.

I suspect that the discussion from the 'Disagreement Check' phase will lead to clarifications of some of the guidelines.

2012-03-21 18:17:37
Ari Jokimäki


It feels to me that both parties should be granted the possibility to view and act on the disagreement, even if it means that both change their rating.

2012-03-21 18:30:53
Andy S

I struggled with a lot of the mitigation papers between neutral and implicit and I probably have not been that consistent. And I agree with Sarah, the soil papers were sometimes tough, too. I was amazed at the number of botany papers that looked at the effects of either CO2 or temperature increases, but rarely both at once. I scored all the single-factor papers neutral and the rare ones that looked at both at once implicit.

I think we'll have to find a way to deal with the noise. If we try too hard to eliminate it, we run the risk of making the study open to the criticism that the data have been fudged. These are subjective calls using at times insufficient data (the abstracts only). The results should be noisy. That's why we must keep and present the raw data. I'm sure that the main result, overwhelming consensus, will emerge nicely from that anyway.

2012-03-22 12:56:36
Sarah Green

I agree with Andy that it's ok to allow some noise to remain. I would hope that most disagreements by the end will be +/- 1, not 4, though.

Three ratings could both keep the noise (by averaging to, say, 3.3) and allow it to be conveniently rounded away when discrete values are more useful (3.3 rounded to 3).