The mechanisms of peer review were defined in a simpler time, before big data, ever more complex algorithms, and millions of papers published annually. Here’s how to fix it.
Peer review is not that old. By all accounts, it acquired its modern form in the early 1970s, although it had existed in various forms for about 150 years before that. Its premise is that the quality of a scientific publication can be ensured by peer evaluation of manuscripts. On the face of it this seems reasonable, of course, but when you delve into the details of how the process works, you might see it quite differently.
Let’s start with a definition of what goes into a high-quality scientific publication. Assuming it involves empirical research, the most important aspect is the data itself and whether it was accurately obtained. This is not trivial. Next is the issue of how the data is analyzed. Long gone are the days of simple sums and averages; for good science, particularly where the volume and complexity of the data are large, algorithms and analytical tools are ever more complex. Assuming the data and analysis are sound, the next most important thing is whether the results are logically represented and make a sound argument. Then, of course, there is the question of whether these results are ‘novel’. Some journals further claim to make judgements about the importance of the work as a threshold to review. Peer review today approaches the evaluation the other way around – first judging whether the work is novel and important (initially by the editor, or gatekeeper), then seeking the review of the manuscript by two or three peers in the field, who opine on whether it is indeed novel and important and evaluate the logical consistency of the manuscript. For the most part, the foundational details of data and algorithms are wholly ignored. Let’s break this down and look at each of these aspects, starting with what peer review does do.
Is it novel and important?
Relative to even 50 years ago, journals contend with a huge volume of manuscript submissions. To handle this, most journals have installed a hierarchy of editors who act as gatekeepers. It is their job to decide whether a manuscript is sufficiently novel and important to send on for peer review. These editors do have PhDs but, having opted out of active research, are not current in the practical intricacies and advances of its practice. Further, in the less specialized publications, it is not atypical to have an editor with just one publication from their PhD in a field like stem cells passing judgment on the novelty and importance of a paper in EEG. New journals such as eLife have tried to solve this by bypassing the step altogether and appointing only practicing scientists as editors, and open-access journals such as PLOS One claim to withhold judgement on ‘importance’ in their review process. However, considering that there are some 50,000 EEG papers and almost 300,000 papers on stem cells, the significant point is that such judgement is not possible even within an editor’s own field, let alone another.
Even as an editor speed-reading three papers a day, you cannot cover even 5% of recent research in any field (read more here on the Diminishing Value of the Scientific Journal Article). Keyword searches help narrow down the literature of comparison but are not always enough, since a result buried in Fig. 6 of a paper from four years ago may be the same as the main point of the paper under editorial judgement – and who would ever know? Consequently, gatekeeping decisions are based on heuristics such as the reputation of the university and the past publication history of the authors. They are also likely influenced by the editor’s own sphere of knowledge, which will naturally be attributed greater importance (as Daniel Kahneman describes in Thinking, Fast and Slow, we have a natural bias to attribute greater importance and scale to knowledge easily available to us). Who then, expert or non-expert, can really claim such all-knowing judgement?
See related post EEG and fMRI Journal Stats
Is it logically consistent and well represented?
Once past the gatekeeper, the next step is to identify the appropriate peer experts to review the manuscript and comment on its logical consistency and whether it adequately proves its claims. Online journals that now publish huge numbers of papers each month have put together large editorial review boards in different fields and use automated systems to reach out to reviewers, who can click a button to accept or decline for reasons such as ‘I don’t have time at the moment’ or ‘this is not my field of expertise’. Authors are also asked to suggest three experts in their field. When the authors are well known, or at least known to the editor, the editor may make a personal request to specific people to review a paper to speed things along. The question is – who is the expert peer reviewer for any particular paper?
Manuscripts today have become an ever more complex cobbling together of results to make a compelling story, and some can involve complex combinations of measurements and analysis. The more novel, innovative, and interdisciplinary the work, the more difficult it is to find an expert in all its parts. The same person who is intimately familiar with the type of measurements or instruments used may have no background in the analysis applied. Take, for instance, someone applying a novel analysis to EEG measurements from people with schizophrenia. Who is the expert? Psychiatrists who treat schizophrenia and understand intimately the dimensions of the disease? Signal-processing experts who understand intimately the mathematics of the time series? EEG experts who are familiar with the instrumentation and measurement? Each has expertise in one dimension but not all, and each of them may decline to review on account of not having the necessary expertise, or else make a valiant effort anyway but miss essential mistakes or inconsistencies outside their area of expertise. A paper published in 2015 by a group of statisticians reported that over half of all published psychology papers had at least one statistical error – perhaps arising from reviewers’ lack of expertise in this domain. Ideally, you would want a different person to review each different aspect of a paper, but this is not how it works. The result of peer review is therefore, for lack of a better word, a crapshoot, depending on who you happen to get.
Are the analysis and algorithms sound?
Unlike editors, peer reviewers are not paid. Given that they volunteer time they have little of, amid their own efforts to publish as much as possible, time is of the essence. Consequently, peer review takes a cursory approach that does not go beyond what is written in the manuscript. The methods used to analyze data are presented in a methods section but may fail to mention their many parameter choices. Sometimes an algorithm of over a hundred lines of code is represented by a single sentence; it just has to sound reasonable. Code and Excel files are not submitted along with manuscripts, so mistakes, which become ever more likely as computations grow more complex, will not be detected. There is the famous example of NASA code that, due to a typo, failed to apply a simple smoothing to a function, causing thrusters to fire too soon and destroy the rocket. You could not have found this from the methods section of a manuscript. Small analysis and code errors can fundamentally change the nature of the results, and the more complex the algorithms and parameter choices, the more chances for mistakes and error. Unlike the NASA example, the analytical methods and algorithms behind most papers are never put to the test, so it is anyone’s guess which papers have correct analysis.
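To make the point concrete, here is a toy sketch (the signal and numbers are invented for illustration, not drawn from any real study) of how a single hidden parameter can move a paper’s headline result. A methods section might simply say “the signal was smoothed before peak detection,” yet the choice of smoothing window – never stated, never reviewed – determines where the peak lands:

```python
# Toy illustration: "the signal was smoothed before peak detection" hides
# a parameter (the window size) that changes the reported result.

def moving_average(x, window):
    """Centered running mean; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(x)):
        chunk = x[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# A made-up recording with one sharp spike at index 4.
signal = [0, 1, 0, 0, 9, 0, 0, 2, 3, 2, 0, 0]

for w in (3, 9):
    smoothed = moving_average(signal, w)
    peak = smoothed.index(max(smoothed))
    print(f"window={w}: detected peak at index {peak}")
```

With a narrow window the detected peak sits near the true spike; with a wide window the edge samples dominate and the “peak” moves elsewhere entirely. A reviewer reading only the sentence in the methods section has no way to know which result was reported.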
Is the data high quality?
Finally, the bedrock of all empirical research is the data itself, yet data is never part of the peer review process. As anyone will attest, going through someone else’s data is an excruciating and time-consuming exercise, and no one ever wants to do it. Even if one assumes that all researchers are honorable and always represent their data accurately, the instrumentation used in science today is ever more complex, and measured signals are easily subject to errors and artifacts arising from the complexity of measurement parameters. Take EEG, for instance: every manufacturer builds hardware with its own electronics, signal referencing, and software, some of which is a black box but can easily influence outcomes. Further, if the signal has not been properly processed, the results can be entirely misleading. Such issues are never checked.
But more important, let’s revisit the assumption that all data is at least represented in good faith. Given that novelty, importance, and logical consistency are the primary criteria of peer review, and data and analysis are never looked at, the incentives of science stack against accuracy in favor of good storytelling. Indeed, science has had its share of shocking data scandals, and these are only the minuscule number of studies that are ever investigated. Based on my own experience, I am skeptical about the literature. I run a company where over 1,500 people collect primary survey data from individuals. When we first ran a data audit to estimate the accuracy of the data collected, we found that it was a horrifying 12%. And this is not specific to my company: a recent study, based on statistical tests rather than close field auditing, suggests that 1 in 5 surveys contain fabricated data. When the incentives for staff revolved around how many surveys a person completed rather than whether they were accurate, it was simply easier to take shortcuts than to expend the effort to collect data accurately. Today my company has a separate 100-person data verification team that samples from the work of each of the 1,500 people. Now that each person knows they are audited and gets a data accuracy score reported to them each quarter, we are up to accuracy levels above 85%. It was not easy to get there. Scientists may be a more honorable group than the average population, so their number may be higher than 12%. Still, even if scientists are as much as five times more honest, that is still only 60%. That’s scary!
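The audit approach above can be sketched as a small simulation (all numbers are hypothetical, chosen only to mirror the figures in the text): rather than re-checking every record, the verification team draws a random sample from each collector’s work and uses the sample to estimate their true accuracy:

```python
import random

# Toy sketch of sample-based auditing (hypothetical numbers):
# a collector with an unknown true accuracy is audited on a random
# subset of records, yielding an estimate without checking everything.
random.seed(0)

TRUE_ACCURACY = 0.85   # unknown to the auditor; assumed here for simulation
N_RECORDS = 1000       # one collector's quarterly output
SAMPLE_SIZE = 50       # records the verification team actually re-checks

# Each record is either accurate (True) or not (False).
records = [random.random() < TRUE_ACCURACY for _ in range(N_RECORDS)]

# Audit a random sample and report the estimated accuracy.
audited = random.sample(range(N_RECORDS), SAMPLE_SIZE)
estimate = sum(records[i] for i in audited) / SAMPLE_SIZE
print(f"estimated accuracy from {SAMPLE_SIZE}-record audit: {estimate:.0%}")
```

The point of the sketch is the economics: a 50-record audit costs a small fraction of full verification, yet gives each collector a quarterly score they know they will receive – which is what changed the incentives.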
What can be done about it?
First, requiring scientists to submit and publish large manuscripts telling complex stories, rather than individual results, plays a big part in making the peer review process too complex to handle. Second, peer review could specifically request different people to evaluate different aspects of a result; as in the example of the EEG and schizophrenia study described above, this could be divided up by instruments and data collection, analytical method, and the particular disease. Finally, until things like code and data are shared and available for the community to validate and comment on, we may be building a house of cards. That absolutely should be priority No. 1.
For this to work, the first line of presentation of scientific results should not be the publishing of journal articles. Rather, this needs to give way to platforms for publishing individual results together with their data and analysis files, where members of the community can be alerted to posted results that require their area of expertise. By allowing open comments and validation of results, such platforms could become a constructive forum for iterating on results until they are broadly accepted and validated. This is a radical change, but eventually science has to evolve. There will still be a place for articles, where community-reviewed and validated results can be taken together and spun into stories for journals, whose editors can continue to make judgements on their logic and story value.