Curating Scientific Manuscripts for Truth

This is the third of four blog posts on the current state of science and the promise of web3 technologies to improve it.

Part 1, Why Science is Facing a Credibility Crisis, discusses how the incentives of scientists in the current system contribute to the widespread replication crisis in science.

Part 2, The Business Model of Scientific Journals, explains the problems arising from the current business model of scientific publishers.

Here, we contemplate what a better system should look like.

Part 4, How Web3 Technologies Can Help Improve the Scientific Record, takes a deep dive into DeSci and how web3 technologies can offer a path to substantial improvements in science.


To improve the current publication system, it would be useful to define an objective function that describes what journals or DeSci environments should select for to maximize the contribution of publications to the scientific record. Based on such an objective function, different selection mechanisms could be compared and ranked in their ability to contribute to the creation of knowledge. This is what we attempt to do here.

As a first step, we can conceptualize journals as prediction pipelines designed to sort and classify scientific work according to its expected value. Each participant in the evaluation process of a journal has a model of the world, or more precisely – of what constitutes valuable science. Participants may or may not agree on what they view as valuable science. And, typically, neither referees nor editors are explicit about what their personal evaluation criteria are. Let us call these potentially heterogeneous models of the world “black boxes”.

At each stage of the scientific publication process, these black boxes produce signals which are combined into a final prediction of expected scientific value by the editor. Provided the expected scientific value exceeds a certain journal-set standard, the work is accepted for publication. If it misses the mark, the work is rejected or invited for resubmission, provided the referees’ requests can be thoroughly addressed.

Machine-learning framework: Scientific journals as ensemble learning

The current scientific publication system can be thought of as a 3-stage predictive engine that combines predictions from different black-box algorithms.

Stage 1: The editor, generally a senior scientist, performs a first prediction (“the desk”), which constitutes the initial filter based on expected scientific impact.

Stage 2: Passing the desk brings a paper into the next stage, in which the submission is sent out to peer reviewers. The reviewers form their own predictions of the expected scientific value of the work.

Stage 3: In the final stage, the editor weighs and aggregates these signals together with their own to form the final prediction.

In machine learning, this is known as ensemble learning. Ensemble learning is the process of combining different predictive algorithms to increase predictive accuracy (1,2).
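As a toy illustration of this three-stage ensemble (not a description of any actual journal’s workflow), the sketch below combines a desk filter, referee predictions, and the editor’s aggregation into a single decision. All scores, weights, and thresholds are invented for illustration.

```python
# Toy sketch of the journal pipeline as a three-stage ensemble.
# All scores, weights, and thresholds below are invented for illustration.
from statistics import mean

def desk_filter(editor_score: float, desk_threshold: float = 0.5) -> bool:
    """Stage 1: the editor's initial prediction of expected impact."""
    return editor_score >= desk_threshold

def final_decision(editor_score: float,
                   referee_scores: list[float],
                   editor_weight: float = 0.4,
                   accept_threshold: float = 0.7) -> str:
    """Stages 2 and 3: referees predict expected value; the editor
    aggregates their signals with their own prediction."""
    referee_signal = mean(referee_scores)              # Stage 2: peer review
    combined = (editor_weight * editor_score           # Stage 3: aggregation
                + (1 - editor_weight) * referee_signal)
    return "accept" if combined >= accept_threshold else "reject or revise"

# A submission that passes the desk and is sent to two referees.
editor_score = 0.8
if desk_filter(editor_score):
    print(final_decision(editor_score, referee_scores=[0.6, 0.75]))  # accept
```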

Agent-based framework: Effort and truth are necessary to prevent noise, collusion and sabotage

In an ideal world, every “black box” (i.e. each referee and the editor) involved in the review process a) expends maximum effort and b) truthfully reports its prediction.

Expending maximum effort is required because evaluating the soundness of the methodology and the justification for the conclusions is detailed, minute, and time-consuming work. Every submission is a high-dimensional input that needs to be broken down and evaluated on multiple dimensions to determine its expected scientific impact. If insufficient effort is expended, the prediction degrades into noise. For example, a cursory glance at the title and abstract of a paper is not sufficient to determine its quality.

Truthfully reporting predictions is important because otherwise we run into the risk of unwarranted gatekeeping. For example, a negative review of a scientifically sound paper that does not align with a referee’s view of the world is unwarranted gatekeeping, a form of sabotage. Likewise, there is a threat of collusion between authors and peer reviewers who provide each other with inflated reviews; friends or colleagues writing reviews for each other is one such form of collusion. Noise, sabotage, and collusion are three failure modes of the peer-review process and can only be averted through effort and honesty. This is a particularly acute problem because peer reviewers (and often editors) work pro bono for established scientific journals, and there is little to no benefit in providing honest and effortful reviews (3,4).

Formalizing the scientific journal

In an abstract sense, we can think of research work as determining the truth of a hypothesis by offering new evidence that is, ideally, very convincing (but may in fact not be so). A hypothesis has the form that condition X leads to outcome Y. The quality of the research contribution (Q) depends on how much we learn (L), i.e. how much the information increases our confidence in the hypothesis, and how important the hypothesis is to the scientific enterprise overall (V). That is, let Q=V∙L.

The value of new knowledge depends on its implications, given our existing knowledge base, and on the potential proceeds from those implications, for example, new inventions. These things are difficult to observe. Even similarly qualified referees and editors may disagree to an extent on what V is, because of their subjective understanding of current knowledge, their skill and imagination in envisioning future impact, and their perception as to which problems are most important to solve. We just take it as given here that there is a meaningful true V, and that readers of scientific work “guess” at it. Greater ability tends to produce better guesses.

How much we learn can be formalized by Bayes’ rule, P(Y|X) = P(Y)∙P(X|Y) / P(X), where P(Y) is the prior likelihood that outcome Y occurs, and P(Y|X) is the posterior likelihood (when condition X holds in the data). P(Y|X) measures the strength of the inference that X entails Y. We denote this by R. P(X|Y)/P(X) measures how much more likely it is that condition X is observed when the outcome is Y. In other words, P(X|Y)/P(X) captures the information contained in X about Y. We define P(X|Y)/P(X) = 1+I, so that I=0 reflects that X is as likely to occur with Y as without Y, and therefore nothing was learned from studying condition X. If I is different from 0, then X changes our expectation of Y. We can write P(Y) = R/(1+I), and therefore L = P(Y|X) - P(Y) = R - R/(1+I). (Here we assume that positive relationships between X and Y are being tested, i.e. I≥0. There is no loss of generality, since Y can always be relabeled as the opposite outcome to make a negative relationship positive.)

The quality of a contribution can now be expressed as Q = V∙(R - R/(1+I)), where V is the (projected) value of being able to predict outcome Y, R is the degree to which Y depends on condition X, and I captures how our beliefs about Y changed due to this research. Note that R and I both affect Q positively, and Q ≤ V. When nothing new was learned (I=0), or when the condition does not predict the outcome (R=0), or when predicting the outcome is irrelevant (V=0), then Q=0.
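To make the bookkeeping concrete, here is a minimal numeric sketch. The prior P(Y), the information factor 1+I, and the value V are invented for illustration; the only substance is the identity L = P(Y|X) - P(Y) = R - R/(1+I) and the resulting Q = V∙L.

```python
# Minimal numeric sketch of Q = V * (R - R / (1 + I)).
# The probabilities and the value V are invented for illustration.

def quality(V: float, R: float, I: float) -> float:
    """Q = V * L, with L = R - R / (1 + I)."""
    return V * (R - R / (1 + I))

# Suppose outcome Y has prior P(Y) = 0.2 and observing condition X doubles
# its likelihood, i.e. P(X|Y) / P(X) = 1 + I = 2, so I = 1.
P_Y = 0.2
I = 1.0
R = P_Y * (1 + I)            # Bayes: R = P(Y|X) = P(Y) * (1 + I) = 0.4
L = R - R / (1 + I)          # learning: L = P(Y|X) - P(Y) = 0.2
V = 10.0                     # projected value of predicting Y (arbitrary units)

print(quality(V, R, I))      # 2.0, and always Q <= V
print(quality(V, R, 0.0))    # 0.0: nothing new was learned
print(quality(V, 0.0, I))    # 0.0: the condition does not predict the outcome
print(quality(0.0, R, I))    # 0.0: predicting the outcome is irrelevant
```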

Note that replication of a prior result can be a quality contribution since it might significantly increase support for a hypothesis, especially when it is one of the first replications (5,6). A negative result (where Y does not occur under condition X) can also be a quality contribution if it corrects the current prior.

An interesting, and probably common, case arises when a paper reports surprising results that are potentially paradigm-shifting, but the results turn out to be false. Intuitively, Q might be smaller than zero in this case, because an influential result that is false can do substantial damage, both in terms of time and effort wasted by scientists and in terms of the welfare consequences for society. For example, irreproducible pre-clinical research creates indirect costs for patients and society (7). Furthermore, future research that builds on the false discovery may not only waste resources, it may also derail scientific progress into further false discoveries.

In terms of the Bayesian model, an error means that the evidence does not justify the conclusions. Suppose the hypothesis is misspecified and the relationship between condition and outcome is actually negative (I<0), but mistakenly reported as positive. Then L = R - R/(1+I) < 0, which makes the quality of the contribution Q negative.
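Continuing the illustrative numbers from the sketch above, a negative underlying relationship (here I = -0.5, so observing X actually halves the likelihood of Y) yields a negative learning term and hence a negative Q:

```python
# Same illustrative V as above, but a negative relationship: I = -0.5.
# Then the prior is P(Y) = R / (1 + I) = 0.8 while the posterior is R = 0.4.
V, R, I = 10.0, 0.4, -0.5
L = R - R / (1 + I)    # 0.4 - 0.8 = -0.4: beliefs should move downward
Q = V * L              # -4.0: reporting the result as positive subtracts value
print(L, Q)            # -0.4 -4.0
```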

Selecting scientific manuscripts for quality - the ideal case

If we think of scientific progress as a linear process, a positive Q value implies that the new discovery makes some kind of positive contribution to scientific advancement. A false discovery may not only fail to contribute to our knowledge; it may also add confusion and entropy, resulting in scientific regress. Nevertheless, an editor might publish such a paper, misjudging Q.

The stated purpose of scientific journals is to publish contributions that advance knowledge (Q > 0). It is useful at this point to differentiate between what journals should evaluate in order to advance knowledge (i.e. the normative case) and what journals actually do in practice (i.e. the descriptive case).

In the normative case (i.e. an ideal world), the predictive algorithm of journals should try to identify papers that have high Q values. This is complicated by the fact that the true value of a contribution is inherently difficult to assess and influenced by subjective insight and preferences. In addition, referees and editors need to exert effort to confirm the objective validity of the analysis, but they are not rewarded for doing so.

We shall denote the predicted quality of the contribution by Q' = f(V', R', I'), where primes indicate estimated quantities. Referees and editors will assign subjective weights to each. V' is obviously subjective, depending on what editors and referees believe to be “important” lines of research. R' and I' can in principle be determined more objectively, but getting them right is effort-intensive, so the task is left mostly to referees. The referees make a report m, the accuracy of which depends on effort e ∈ [0,1]. In general, m(e) = t + ρ∙(1-e), where t is the true value and ρ is a random variable that is symmetrically (e.g. normally) distributed around zero. Note that the larger the effort, the smaller the potential error ρ∙(1-e).
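As a minimal sketch of this report model, assuming ρ is drawn from a normal distribution (the true value t, the noise scale, and the number of simulated reports are invented for illustration):

```python
# Sketch of the referee report m(e) = t + rho * (1 - e), with rho a zero-mean
# (here: normally distributed) error. All numbers are illustrative.
import random

def referee_report(t: float, effort: float, noise_sd: float = 1.0) -> float:
    """At effort e = 1 the report equals the true value t;
    at e = 0 it is t plus the full noise term rho."""
    rho = random.gauss(0.0, noise_sd)
    return t + rho * (1.0 - effort)

random.seed(0)
t = 0.6  # hypothetical true quality of a submission
for e in (0.0, 0.5, 1.0):
    reports = [referee_report(t, e) for _ in range(10_000)]
    rmse = (sum((m - t) ** 2 for m in reports) / len(reports)) ** 0.5
    print(f"effort {e:.1f}: typical deviation from t ≈ {rmse:.2f}")
# The deviation shrinks roughly in proportion to (1 - e) and vanishes at e = 1.
```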

Thus, in an ideal world, referees and editors correctly assess the strength of the presented evidence (I' = I) and weigh the finding by the perceived importance (V') of the hypothesized effect (R'). This work requires substantial effort (e = 1) and complete objectivity.

What happens in practice

If journals overemphasize the novelty and strength of findings, this can contribute to a replication crisis in science that threatens both human progress and the legitimacy of science. On the other hand, if replicability and the strength of the presented evidence are over-emphasized, the literature would be dominated by true findings, but there would be few or no advances in what we reliably know. Therefore, it is important that both the novelty and the replicability of a reported finding are appropriately taken into account.

In practice, most journals primarily evaluate the (subjective) importance of a studied topic (V') and the effect size of the studied condition (I'), but they attach too little importance to the strength of the presented evidence (R'), especially to whether the reported results are likely to replicate or not.

The adverse incentives we described in the first blog post imply that journal editors tend to bias their decisions toward novelty and against replications. Controversial, or otherwise attention-grabbing, results will tend to garner more citations as researchers try to verify them. If maximizing reputation through citations is a goal, then inducing verification in other journals is a more effective strategy than publishing replication work in one’s own journal.

Replications also suffer from the dilemma that they are, provocatively put, “not interesting” or “not credible.” If a replication study confirms the original result or negates a result that was published recently and is not yet widely known, it may not be viewed as noteworthy. If it fails to confirm a well-known result, it will likely face doubt. Moreover, if only negative replications are “novel” enough to be publishable in a well-regarded journal, researchers face substantial risk (as well as bias) in attempting such a study, given that it might yield a positive result.

These aspects suggest that the “estimated quality” of an article will be based on weights that do not correspond to the Bayesian learning framework and may reflect differences in priorities between the editor and the referees, who are less motivated to generate future citations for the journal. Ultimately, referee judgments may be reflected in the final decision to a lesser extent than appears, and this would further reduce the referees’ incentive to commit effort.

To summarize the above points:

  • Editors and referees will not necessarily evaluate articles according to consistently weighted criteria, and their judgments may well deviate from the best possible prediction of true quality.
  • In particular, editors have incentives to weigh novelty more strongly than replicability, and referees have incentives to limit their efforts to verify scientific accuracy. This can lead to published literature with many low-quality papers (even if referees exert maximum effort due to intrinsic motivations).

Furthermore, given the small number of referees and editors who evaluate each paper at each journal, and given their potential heterogeneity, the realized quality of publications will have a high variance across journals, rendering each submission of a paper to a different journal akin to a lottery draw. Since journals require that the papers they evaluate are not simultaneously under consideration at a different journal, this implies a substantial loss of time between the first submission and the point at which an article actually gets published. It also implies substantial costs for the authors, given that many journals have different formatting requirements. Thus, the current practice of curating and evaluating scientific contributions is inefficient and a waste of (public) resources.

We conclude that in an ideal world where journals or DeSci environments achieve their stated objective of publishing papers of the highest possible quality:

(a) A logically derived rule is employed for predicting quality from the estimated value and novelty of research work.

(b) Referees are given incentives to put effort into verification and report truthfully.

(c) Replication efforts are explicitly rewarded, given their importance for verifying the truth of reported novel discoveries (5,6).

If (a) - (c) are fulfilled, progress in the scientific literature would be faster if journals were to allow the simultaneous submission of papers to different publication outlets, and if more researchers were involved in the evaluation process.

In the fourth blog post, we discuss the potential of web3-enabled technologies to get the scientific curation mechanism closer to the optimal state.

Authors: Philipp Koellinger (A,B,C), Christian Roessler (D), Christopher Hill (A,B)

A - DeSci Foundation, Geneva, Switzerland

B - DeSci Labs, Wollerau, Switzerland and Amsterdam, The Netherlands

C - Vrije Universiteit Amsterdam, School of Business and Economics, Department of Economics, Amsterdam, The Netherlands

D - Cal State East Bay, Hayward, CA, USA

References

  1. Polikar, R. Ensemble Learning. in Ensemble Machine Learning: Methods and Applications (eds. Zhang, C. & Ma, Y.) 1–34 (Springer US, 2012).
  2. Sagi, O. & Rokach, L. Ensemble learning: A survey. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8, e1249 (2018).
  3. Aczel, B., Szaszi, B. & Holcombe, A. O. A billion-dollar donation: estimating the cost of researchers’ time spent on peer review. Res Integr Peer Rev 6, 14 (2021). doi.org/10.1186/s41073-021-00118-2
  4. Biagioli, M. & Lippman, A. Gaming the Metrics: Misconduct and Manipulation in Academic Research. (MIT Press, 2020).
  5. Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2, e124 (2005).
  6. Moonesinghe, R., Khoury, M. J. & Janssens, A. C. J. W. Most Published Research Findings Are False—But a Little Replication Goes a Long Way. PLoS Med. 4, e28 (2007).
  7. Begley, C. G. & Ellis, L. M. Raise standards for preclinical cancer research. Nature 483, 531–533 (2012).