2012-01-22 18:09:04
Nuts and bolts of how we're going to collect ratings
John Cook


While we hash out the categorisation of papers, to keep the ball rolling, I'd like to simultaneously discuss the nuts and bolts of how we will collect and quality-control the ratings. This is a general idea of what I had in mind:

Rating papers

There will be a "TCP Hub" where you are given a random batch of papers, possibly 25 or 50 at a time (this figure can be tweaked). Papers that have received the fewest ratings are shown first. The papers are listed by title, author and journal, and mousing over the title will produce the abstract. You rate each paper on two axes: a category (impacts, mitigation, paleoclimate, methods, opinion) and an endorsement level (endorse, neutral, reject; exact details still being discussed). I may include a little text box so you can record notes with your rating if required. Note it takes me around one hour to rate 100 papers, so 12,000 papers would be about 120 hours of work.
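As a rough sketch of the "fewest ratings first" batching, something like the following could work (function and data names are hypothetical, not the actual TCP Hub code):

```python
import random

def next_batch(papers, ratings, batch_size=25):
    """Pick a batch of papers for a rater, fewest-rated first.

    papers:  list of paper ids
    ratings: dict mapping paper id -> number of ratings so far
    """
    # Shuffle first, then stable-sort by rating count, so that
    # papers tied on count are served in a different order to
    # different raters rather than always the same order.
    shuffled = random.sample(papers, len(papers))
    shuffled.sort(key=lambda p: ratings.get(p, 0))
    return shuffled[:batch_size]
```

Serving the least-rated papers first is what pushes every paper towards the "at least two ratings" goal instead of piling ratings onto the same popular papers.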

As we all rate papers, the "TCP Hub" webpage will spit out how many papers have received ratings, how many are yet to go, etc. Along the way, there will probably need to be clarification of the guidelines on how to categorise as we encounter specific papers that may be ambiguous.

Update: also, the methodology is that we'll rate papers based solely on the paper title and abstract. The exception is rejection papers. To prevent a rejection paper slipping through the net, if there's a suspicion that a paper MIGHT be a rejection paper, we should read the full paper just to be sure. This will impose a bias towards rejections, as we won't be reading every possible endorsement paper, but that's a bias we should be prepared to accept (and Phase 3 should remove it).

The goal is for each paper to be rated at least twice so we can cross-check ratings against each other.

Quality Control

There are several ways we can approach quality control and maybe do a combination of the following options:

  1. If two people rate a paper differently, the paper is flagged and a third person can rate it as well. The three or more raters can then discuss the paper and hopefully come to a consensus.
  2. You have the option of looking at all papers where someone has rated differently from you, comparing your ratings to theirs, and, if you still decide your rating is correct, discussing it with the other rater or involving another person.
  3. This quality control can happen throughout the rating process, or we can remain oblivious to what other people are rating until we've gotten through the whole process. There are pros and cons either way, but having quality control during the process would mean that if someone is approaching their rating incorrectly, it can be identified early, rather than all of us going in different directions and only finding out after we've done thousands of ratings each.
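Option 1 amounts to a simple disagreement check. A minimal sketch of how papers could be flagged for a third rating (the data shapes here are invented for illustration):

```python
def flag_disagreements(ratings):
    """Group ratings by paper and flag papers whose raters disagree.

    ratings: list of (paper_id, rater, category, endorsement) tuples.
    Returns the set of paper ids needing a third rating / discussion.
    """
    by_paper = {}
    for paper, rater, category, endorsement in ratings:
        by_paper.setdefault(paper, set()).add((category, endorsement))
    # A paper is flagged if its raters produced more than one
    # distinct (category, endorsement) combination.
    return {paper for paper, seen in by_paper.items() if len(seen) > 1}
```

The same grouping could drive option 2 by filtering the flagged set down to papers involving a particular rater.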

Displaying Results

It's hard to speculate on how best to display the results until we've actually done the rating (although I do have a few ideas). But the intriguing thing is there really is a multitude of ways we can skin this cat. So I suggest that when the rating is done, I'll make the data available for everyone to import into their stats packages or Excel to play around with various ways to display the data.

2012-01-22 20:51:12
Glenn Tamblyn


Can I simply lodge a basic objection here. Phrases like "The exception is rejection papers" raise my hackles in a purely objective sense. The serious issue I see here is that a bias is being built into the study. We all know which side the frog will jump on, but ANY bias in methodology will undermine the benefits of the process. This process MUST BE BRUTALLY IMPARTIAL in its methods. We MUST NOT INCLUDE ANY bias in the analysis methodology. Make it utterly bullet-proof and let the data speak for itself.

Oreskes was attacked - validly or not is irrelevant - over methodology. The effort required to assess so many papers must not be squandered for want of a brutally clinical methodology.

Imagine taking this beyond the current 12K papers to everything in the IPCC's four reports and so on, including all the old papers.

The purpose of this exercise is not to demonstrate a consensus. It is to discover it! We know it's there, but the methods we use must totally ignore that, including being open in the methodology to the possibility that there is no consensus. Have faith that the data will demonstrate this. The analytical techniques, code assignments etc. must be clinically and brutally dispassionate.

We don't get many chances at something like this. We can't blow it.

2012-01-22 21:15:56
Ari Jokimäki


To me it also seems that being more thorough with rejection papers introduces a bias to the study. Also, basing the classification only on the abstract removes the need to hunt down or buy full texts of papers. I also think that the title of the paper should not be used in classification, because titles can be misleading. I would suggest that in order for a paper to be classified, an abstract must be available.

I would like to request an exception in the selection of which papers to classify. I would like to have papers from certain journals only, if that's easy to do in your data.

2012-01-22 21:59:38
The alternative to this pro-rejection bias
John Cook

I understand the objection. The problem is that by going on the abstract only for potential rejections, we may miss a rejection. If there were a rejection paper in our sample that we didn't rate as a rejection, it would expose us to a lot of criticism. But I guess we cross that bridge when we get to it. For practical purposes, asking raters to stop regularly to consult full papers is not workable when poring over 12,000 papers.

The SkS database does contain journal info. But on what basis do you want to exclude certain journals? And which journals?

2012-01-22 22:27:11
"To be determined" category?
BaerbelW


To avoid getting stuck on papers needing more time to analyse, how about a marker of some kind for "To be determined later"? These papers could then be checked out in more detail, perhaps by a separate team of reviewers who actually read the paper in full?

That way, these papers will not hold up evaluation of the large number of papers which can be tackled quickly. Also, the tricky papers might be the ones where it makes sense to actually ask the original author(s) how they would rate them.

2012-01-23 01:22:38
Kevin C


Agree with Glenn.

Also, BaerbelW's 'need to read the full paper' is a good option to have.

Also on QC, approach 3 - blind rating - is the only acceptable option. You don't get to see any of the data until you've finished rating. Once you've looked at the data, you are forbidden from any further rating. If you want to intervene and correct someone who is rating incorrectly, then that has to be done by someone who is no longer doing ratings. Even so, that is still bad practice - better to simply accept that you may have to bin some raters as a hazard of the process. The other option would be to do a dummy run on a different data set - older papers, say.

I don't think two ratings per paper is enough.

Another cross-check - don't exclude a paper from being rated more than once by the same person. You'd be surprised how often the same person will give the same data a different rating based on mood, tiredness etc., especially if there is a reasonable gap between seeing the paper each time. You won't get much data this way unless you rig the system to insert, say, one duplicate in ten after the first 100.
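That duplicate-seeding scheme could be wired into whatever serves the next paper along these lines; the names and exact rule here are illustrative, not a spec:

```python
import random

def maybe_insert_duplicate(rater_history, next_paper, rate=0.1, warmup=100):
    """Occasionally re-serve a paper the rater has already rated.

    rater_history: list of paper ids this rater has rated, in order.
    Returns the paper to show next: usually next_paper, but once the
    rater is past the warm-up period, roughly `rate` of the time a
    randomly chosen earlier paper is served again instead.
    """
    if len(rater_history) > warmup and random.random() < rate:
        return random.choice(rater_history)
    return next_paper
```

As noted above, the re-served ratings would be kept out of the per-paper tallies and used only to measure each rater's self-consistency.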

The absolutely fundamental issue is cognitive bias - we all have it. Those of us who agree with the AGW consensus have it in some cases just as much as the deniers - just in different places. That's why double-blind, symmetry and consistency checks are simply not negotiable. This needs to be approached with the same rigour as a drug trial. Anything short of this and a real statistician with no dog in the fight will come along and tear the whole thing apart. (In fact, if you know a statistician who knows something about survey design, now would be a good time to get their advice.)
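One standard measure a survey statistician would likely suggest for these consistency checks is Cohen's kappa, which corrects the raw agreement between two raters for the agreement expected by chance. A self-contained sketch (added for illustration, not part of the original plan):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters over the same set of papers.

    ratings_a, ratings_b: equal-length lists of labels, e.g.
    'endorse' / 'neutral' / 'reject', one entry per paper.
    Returns 1.0 for perfect agreement, ~0.0 for chance-level.
    """
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters pick the same label
    # if each rated at random with their own label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / n ** 2
    return (observed - expected) / (1 - expected)
```

The same statistic applied to a rater's original and duplicate ratings measures intra-rater consistency.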

2012-01-23 01:42:56
Tom Curtis


It is more work, and may raise issues about accessing texts for some of us (including me), but I recommend papers be rated based on a reading of their abstract, introduction and conclusions. I have seen some apparently innocuous papers (based on their abstract) hide a denier kick which you don't get to until you reach the conclusion. I think expanding the workload like that would make the rating much more secure.

Secondly, BaerbelW is correct about the need for an ability to flag borderline cases for more detailed analysis by a larger number of people.

Kevin C suggests a double-blind system, but to achieve one we would need to receive the abstract (or abstract plus introduction plus conclusion) identified only by a randomly assigned number which changes with each review. (Obviously the article will have to retain a constant identifying number not known to reviewers.) In that manner, reviewers would not know the title, authors, or journal in which the paper was published. Nor would they know from the identification number whether they have reviewed that paper before, or which other people might be reviewing it. Further, for a double-blind system, it is important that we do not discuss the papers we are reviewing among ourselves, except (perhaps) when papers need more extensive classification.

2012-01-23 08:30:48
Abstract, introduction + conclusions
John Cook


Tom, that's just not a practical option with 12,000 papers - it would require accessing the full PDF of all 12,000 papers and capturing that information. Out of our 12,000 papers, about 400 came into the database without abstracts. Jim tracked down around 200 of them and I've been tracking down some as well - just getting the abstract can be incredibly time-consuming. There was one French paper that I simply couldn't get online, even with my UQ library access, and I had to email the author; I'm still waiting to hear back from him.

Rating based on the abstract is just going to have to be the framework we are constrained to. It will mean there are a lot of false negatives with endorsement papers, where the paper doesn't endorse the consensus in the abstract but does in the full paper. But that's what Phase 3 will do - track down the false negatives. So one of the disclaimers of Phase 2 is that it restricts itself to the abstract and hence underestimates the consensus (but that will make the result all the more powerful).

Baerbel, when Tom, Dana and I rated the category of each paper (impacts, etc.), we had an Undecided category into which we shoved papers for later inspection rather than get bogged down. I can also add an Undecided category to the endorsements so we can breeze through the papers and tag the tough ones for later without tripping over them all the time.

I'm not so sure about one person rating the same paper twice - doesn't that add a bias also by weighting the result towards one person's judgement?

Tom & Ari, I'm not convinced that the title should be excluded from the rating process - surely the title and abstract should be combined to give the rater the maximum amount of information. Titles can be misleading. Abstracts can be misleading. But the title is the best single indication of a paper's content, as chosen by the author - it is the most succinct communication of what the paper is about.

2012-01-23 14:00:38
Tom Curtis


I understand the difficulty with going beyond abstracts. If I can make a suggestion, perhaps we could have a three-stage rating process.


Stage 1: Email the lead author requesting that they rate their own paper based on a standard questionnaire, without advising them of our rating;

Stage 2: Independently rate the title, abstract, introduction and conclusion of papers for which we have all four, with reviewers being kept unaware of the authors' ratings, if any;

Stage 3: Independently rate the title and abstract of all papers, with different reviewers from those used in Stage 2, and reviewers kept unaware of prior ratings of the papers. This means papers for which we have the full paper will be reviewed in-house at least four times (and possibly more if suggestions of more than two reviews are taken up).

In the paper you then report the results for all three stages, including the error rate and the rates of false positives and false negatives, with Stage 1 ratings being considered more accurate than Stage 2, and Stage 2 more accurate than Stage 3. In that way we gain the accuracy of self-rating, but can also extend the results to all the papers, with a good estimate of the reliability of the extension.
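Tom's stage comparison boils down to measuring, on the subset of papers rated under two methods, how often the abstract-only pass misses what the fuller reading finds. A hypothetical sketch of that estimate (names invented for illustration):

```python
def miss_rate(abstract_ratings, fuller_ratings, label):
    """Fraction of papers carrying `label` under the fuller reading
    (self-rating, or intro+conclusion) that the abstract-only pass
    failed to give that label.

    abstract_ratings, fuller_ratings: dicts of paper id -> label,
    restricted to papers rated under both methods.
    """
    truly = [p for p, r in fuller_ratings.items() if r == label]
    if not truly:
        return 0.0
    missed = sum(1 for p in truly if abstract_ratings.get(p) != label)
    return missed / len(truly)
```

Computed once for 'reject' and once for 'endorse', this gives the two miss rates that would let the paper quantify how much the abstract-only pass underestimates each side.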

The extra work involved would make the methodology bullet-proof. We could even extend the process by asking the lead authors to rate not only their paper but also their own opinions, so that we could report not just a consensus in papers, but also a consensus of scientific opinion.

2012-01-23 17:15:33
Glenn Tamblyn


John is right about the workload of sourcing and extracting the intros/conclusions. But Tom's idea has real merit. So how about a compromise process:

Contact the authors. Report their rating and whether they replied at all.

Rate by Title/Abstract on every paper.

Rate a subset of papers on intro/conclusion as well. Ideally this would be a completely random subset, but access to the text of a paper is the limiting factor, so this allows a balance between workload and robustness. If we can do enough papers this way, we can start to develop a metric for how many other papers may be mis-rated on the abstract alone.

Idea: although you need an account to get past a paywall, there is an old-fashioned way of getting access to papers - they are sitting in physical libraries all over the world. If we were targeting, say, a certain subset of the papers from PNAS, all you need is a university library that holds the full series of PNAS and you can work your way through them. Surely among SkSers there are enough of us who could sit in a library occasionally and compare the intro/conclusion with the abstract - laptop or iPad, put in the details and off you go.

2012-01-23 18:31:05
Kevin C


John: when a rater gives multiple ratings for the same paper, you only take one of their ratings - the first - to maintain consistency with their other ratings. The purpose is to gain an indication of how consistently that person is rating papers, and of how well the rating options correspond to objective categories of papers, not to gain more ratings for the paper. That does need to be taken account of in the analysis software, of course!

2012-01-23 18:57:53
Brian Purdue


This is another very important reason why the Consensus Project must succeed in getting the truth out.

The obvious question is whether the project is going to drill down far enough to expose the consensus on the 101 questions in Plimer's book, so students can stop asking them of teachers and risking expulsion.

It would be great if the project can, but that might be a lot more work.

2012-01-23 22:04:50
Ari Jokimäki


John: "The SkS database does contain journal info. But on what basis do you want to exclude certain journals? And which journals?"

This relates to my own project, which I have told you about, where I go through all papers journal by journal. I'm currently going through Journal of Climate and Climatic Change. If I'm going to take part in your project as a classifier, I would like to do it in a way that also advances my own project. So I would like to take my classification load starting with papers from Journal of Climate and Climatic Change, so I could do the classification for both projects at the same time, if that can be arranged easily.

2012-01-23 22:15:31
Ari Jokimäki


On using titles and full text:

I understand your point of view on using titles, but unfortunately the authors don't always construct the title as you describe. It might not be a good reflection of the content. My objection is not to using the title as additional information as such; I just want to make sure that we don't classify any paper by its title alone.

Using the full text for only some papers introduces a bias to the study. There has been discussion of going through possible rejection papers by looking at their full text, but if you don't do the same for all papers, your analysis of rejection papers will differ from the analysis of other papers. If you have possible rejection papers, you also have possible endorsement papers. After going through rejection papers more thoroughly, you end up with more uncertainty in the analysis of endorsement papers than in the analysis of rejection papers. I don't understand the need to introduce this bias.

2012-01-24 07:45:27
Journal project
John Cook

Ari, how about at the end I set up a data feed that outputs all results for your journals - a comma-separated feed of all proAGW, neutral and antiAGW ratings per paper? You don't want to reinvent the wheel.
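That per-journal feed could be as simple as a CSV dump filtered by journal; a sketch, with field names invented for illustration:

```python
import csv
import io

def journal_feed(papers, journals):
    """Write ratings for selected journals as comma-separated text.

    papers: list of dicts with keys 'id', 'journal', 'title', 'rating'
            (rating being proAGW / neutral / antiAGW).
    journals: set of journal names to include.
    """
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(['id', 'journal', 'title', 'rating'])
    for paper in papers:
        if paper['journal'] in journals:
            writer.writerow([paper['id'], paper['journal'],
                             paper['title'], paper['rating']])
    return out.getvalue()
```

The resulting text imports directly into Excel or any stats package, which also fits the earlier plan of releasing the full dataset for everyone to analyse.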

Agree re title alone, will have to use title and abstract together.

There was one simple reason for the idea of looking at full PDFs for rejections: if we miss an actual rejection paper, all hellfire will descend upon us from the denialosphere. Think Benny Peiser's attack on Oreskes 2004, but this time with a valid point. But I understand the need for an unbiased methodology.

2012-01-24 11:38:50
Tom Curtis


I'll again draw attention to my suggestion above. The three-stage process allows us to quantify the rate of false negatives (as well as the rate of false positives) without introducing any bias to the analysis. If a Peiser-style attack is then made, we can reply that, yes, we missed that paper, but the more thorough analysis indicated that we were likely to miss x% of rejection papers and y% of endorsement papers by analysing the abstract alone (where presumably y > x). That would satisfy any reasonable person. We need not try to satisfy the deniers, because nothing other than blatant falsehoods they have scripted would.