Why Science is Facing a Credibility Crisis

This is the first of four blog posts on the current state of science and the promise of web3 technologies to improve it. In this first article, we discuss how the incentives of scientists in the current system contribute to the widespread replication crisis in science.

Part 2, The Business Model of Scientific Journals, explains the problems arising from the current business model of scientific publishers.

Part 3, Curating Scientific Manuscripts for Truth, asks the normative question of what scientific journals or DeSci environments should focus on in the evaluation of research papers to advance science.

Part 4, How Web3 Technologies Can Help Improve the Scientific Record, takes a deep dive into DeSci and how web3 technologies can offer a path to substantial improvements in science.


The philosopher of science David Deutsch has argued that the purpose of science is the discovery of explanatory knowledge about the world that is both true (i.e. replicable and universal) and “hard to vary” (i.e. non-arbitrary, empirically falsifiable, and not reliant on appeals to authority or doctrine) (1).

We believe that the current paradigm of scientific production is at least partly misaligned with this purpose of science because of how the current system works and the incentives that it creates.

Scientific journals as ranking and curating devices

In the current system, scientists must constantly provide evidence of their “productivity” in order to advance their careers (i.e. get hired or promoted) and to obtain funding for their future research plans, because this is how they are evaluated by their employers and funding agencies.

One obstacle in the scientific evaluation process is that evaluators hardly ever have the time to engage fully with the body of research each scientist has produced. Thoroughly studying all the previous work of just one scientist could take days, weeks, or even months. This is an unrealistic demand on even the most diligent and well-intentioned evaluators. Their time constraints therefore force them to rely on heuristics that make it easier to assess a scientist’s body of work.

One popular proxy for the importance and quality of a scientific publication (its “impact”) is the number of citations it receives. The more citations an article receives, the more important it is perceived to be for the scientific discourse in a particular field. Citations are easy to count and to compare and have thus become popular quantitative heuristics for judging how successful scientists are.

An obvious issue with using citations as a proxy for impact or quality (2,3) is that scientific work takes time to disseminate and to accrue citations. On average, scientific papers reach their citation peak 2-5 years after publication (4,5). This makes it impractical to use citation counts to evaluate the impact of scientists’ most recent work.

Because funders and institutions need to make allocation decisions before a discovery’s citation lifecycle has run its course, a more immediate cue is used instead: the prestige of the journals in which a scientist’s recent work appears. In many fields, it is almost impossible for a scientist to get hired or promoted without one or several recent publications in “top journals”, i.e. those perceived as most prestigious and most difficult to get into.

Prestigious scientific journals have therefore emerged as the gatekeepers of scientific legitimacy in many fields of science (see here for an example of how journals are ranked). The main role of scientific journals is the selection and publication of important scientific contributions. The selection process rests on editors’ decisions about which submissions fall within the journal’s scope and are “good enough” to be evaluated in detail via peer review. If an article sparks an editor’s interest, the editor decides who is invited to review the paper. Based on these reviews, the editor then makes a final decision about whether the article is accepted, invited for revision, or rejected. The editors and referees of prestigious journals thus exercise a great deal of influence in the scientific world (6).

A good overview of the current state of the peer review system is the 2018 report by Publons. Notably, most peer reviews are anonymous (i.e. the authors don’t know who their referees were), and they are not shared publicly even when an article is accepted for publication. This implies a lack of accountability in the review process, exposing this critical part of the scientific production function to turf wars, sloppiness, arbitrariness, and conflicts of interest that can easily be hidden.

Furthermore, the review process of journals is typically slow, taking months or years until a submitted paper is finally published. It is also often riddled with journal-specific, arbitrary submission requirements such as formatting instructions that waste scientists’ time as they bounce their submissions around from journal to journal until they finally find a publication outlet (7).

Thus, scientific journals play a key gatekeeping role in science, yet the way in which articles are selected or rejected is typically opaque to the public, inefficient, and unaccountable.

The impact factor

The most salient proxy for the prestige of a journal is its impact factor (8): the mean number of citations received in a given year by the articles the journal published in the two preceding years.
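To make this definition concrete, the calculation can be written out as follows (the notation is our own illustration of the standard definition and does not appear in the original):

\[
\mathrm{IF}_{Y} = \frac{C_{Y}(Y-1) + C_{Y}(Y-2)}{N_{Y-1} + N_{Y-2}},
\]

where \(C_{Y}(y)\) is the number of citations received in year \(Y\) by the items the journal published in year \(y\), and \(N_{y}\) is the number of citable items (e.g. articles and reviews) it published in year \(y\). For example, a journal whose 2021 and 2022 items together received 1,000 citations in 2023, and which published 200 citable items over those two years, would have a 2023 impact factor of 1,000 / 200 = 5.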

By design, the impact factor pools reputation across all the papers published in a journal, irrespective of their individual quality and impact. However, the distribution of citations within journals is typically highly skewed: about half of the citable papers in a journal tend to account for 85% of the journal’s total citations (3).

Because of these dramatic differences in citation patterns between articles published in the same journal, the impact factor is only a very crude proxy for the quality and importance of any individual paper a journal publishes (9). Furthermore, the impact factor of small journals can be highly sensitive to the inclusion of one or a few articles that quickly amass a large number of citations.
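To see why a mean-based metric can be so misleading under a skewed distribution, here is a minimal numerical sketch in Python; the citation counts are invented purely for illustration and are not taken from any real journal.

```python
import statistics

# Hypothetical citation counts for the 10 papers a small journal published
# in its two-year impact factor window (illustrative numbers only).
citations = [60, 30, 15, 10, 8, 6, 5, 4, 3, 2]

mean_citations = statistics.mean(citations)      # 14.3 -- what an impact-factor-style average reflects
median_citations = statistics.median(citations)  # 7.0  -- closer to the "typical" paper

# Share of all citations accounted for by the most-cited half of the papers.
top_half_share = sum(sorted(citations, reverse=True)[:5]) / sum(citations)  # ~0.86

print(f"Mean citations per paper:   {mean_citations:.1f}")
print(f"Median citations per paper: {median_citations:.1f}")
print(f"Top half's share of citations: {top_half_share:.0%}")
```

In this toy example, the most-cited half of the papers accounts for roughly 86% of all citations, and the journal-level mean is about twice the median, so the average says little about what a typical paper in the journal achieves.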

Journal impact factors also vary substantially across fields, partly as a function of the prevailing citation culture and the absolute size of an academic discipline, but also as a function of journal size and which types of publications are counted (e.g. letters, editorials, news items, reviews) (4). Thus, the impact factor of a journal is partly driven by aspects that are unrelated to the quality of the articles it publishes.

The impact factor was not originally intended for its current use as a proxy for journal quality. Eugene Garfield first devised it in the early 1960s; librarians soon adopted it to help decide which journals to subscribe to, and from there it grew into a widely recognized signal of journal quality (8).

As the impact factor has become an important part of journals’ reputations, for-profit subscription-based journals have learned to optimize it using a wide variety of tactics to game the system (10). When a metric becomes a target to be optimized, it often ceases to be a good measure of the underlying object of interest (i.e. the quality and importance of scientific publications) (10).

With the impact factor widely used by journals, scientists have in turn adopted the norm (even while often decrying it) as a result of institutional demand for immediate proxies of scientific productivity. The problems associated with optimizing against this crude yardstick are well documented (10), and despite calls to abandon journal rank as a measure of scientific productivity and quality (11), it remains the most widely used metric for that purpose, partly due to a lack of agreement about what should replace it (12,13).

Under the current incentive structure, novelty beats replicability

Independent replication of empirical results is critical to the scientific quest for better explanations of how the world works (14, 15). Without replicability, novel findings can rest on error or fabrication, and we are essentially relying on someone’s authority rather than on verifiable evidence.

Unfortunately, replicability does not score nearly as high in the prestige hierarchy of scientific publications as novel and surprising results. For example, only 3% of all journals in psychology explicitly encourage the submission of replication studies, while many journals explicitly state that they do not publish replications (16).

One of the core issues with using citations and impact factors as metrics of scientific productivity is that they do not account for the reproducibility of published discoveries. Novel, surprising, and provocative results are more likely to receive attention and citations and are therefore sought after by editors and journals, even though such findings are also less likely to be true.

Thus, scientists have little or no incentive to produce replicable research results. Instead, they face a “publish-or-perish” or even an “impact-or-perish” culture. Their ability to produce novel and impactful findings, whether or not those findings are replicable, is what shapes their success in the academy (10).

The decoupling of replicability from commonly-used performance indicators has contributed to a raging replication crisis in many fields of science (14, 17–21). The incentives for scientists to produce novel, attention-grabbing results are so strong that many cases of downright data manipulation and fraud have been reported (22–24). Furthermore, poor research designs and data analysis, as well as many researcher degrees of freedom in the analysis of data, encourage false-positive findings (14, 17, 25, 26).

Recent large-scale replication studies of high-impact papers in the social sciences found that only ~60% of the original results could be replicated (18, 20). More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments (27). In other words, widely circulated results are not much more likely to be right than wrong. To make matters worse, non-replicable studies tend to be cited more often than replicable ones (28), and citation patterns adjust only modestly after strong contradictory replication results (29).

As a result of this bias in favour of novelty and against replicability, the scientific endeavour does not self-correct efficiently. Because citations in published articles only look backwards in time (i.e. they only reflect which parts of the previously published literature were cited), it is nearly impossible for readers of an article to ascertain whether a study’s novel findings are replicable and trustworthy. Journals are also disincentivized to facilitate replications: successful replications are not novel enough to garner much attention (i.e. impact and citations), while unsuccessful replications undermine journals’ claims of quality assurance.

Thus, the incentives for scientists are heavily skewed towards producing novel, attention-grabbing results, at the expense of robustness and replicability. This is a significant impediment to scientific progress, and a solution is urgently needed.

In the interest of proposing such a solution, a related post considers what an ‘ideal’ evaluation criterion might look like, one that maximizes the value of the overall research enterprise. We also explain why replications, particularly the first few (15), should receive significantly more weight in an ideal system of science evaluation.

Authors: Philipp Koellinger (A,B,C), Christian Roessler (D), Christopher Hill (A,B)

A - DeSci Foundation, Geneva, Switzerland
B - DeSci Labs, Wollerau, Switzerland and Amsterdam, The Netherlands
C - Vrije Universiteit Amsterdam, School of Business and Economics, Department of Economics, Amsterdam, The Netherlands
D - Cal State East Bay, Hayward, CA, USA

References

  1. Deutsch, D. The Beginning of Infinity: Explanations That Transform the World. (Penguin Books, 2012).
  2. MacRoberts, M. H. & MacRoberts, B. R. Problems of citation analysis. Scientometrics 36, 435–444 (1996).
  3. Adam, D. The counting house. Nature 415, 726–729 (2002).
  4. Amin, M. & Mabe, M. A. Impact factors: use and abuse. Medicina 63, 347–354 (2003).
  5. Min, C., Bu, Y., Wu, D., Ding, Y. & Zhang, Y. Identifying citation patterns of scientific breakthroughs: A perspective of dynamic citation process. Inf. Process. Manag. 58, 102428 (2021).
  6. Goldbeck-Wood, S. Evidence on peer review—scientific quality control or smokescreen? BMJ 318, 44–45 (1999).
  7. Huisman, J. & Smits, J. Duration and quality of the peer review process: the author’s perspective. Scientometrics 113, 633–650 (2017).
  8. Garfield, E. The History and Meaning of the Journal Impact Factor. JAMA vol. 295 90 (2006).
  9. Aistleitner, M., Kapeller, J. & Steinerberger, S. Citation patterns in economics and beyond. Sci. Context 32, 361–380 (2019).
  10. Biagioli, M. & Lippman, A. Gaming the Metrics: Misconduct and Manipulation in Academic Research. (MIT Press, 2020).
  11. Brembs, B., Button, K. & Munafò, M. Deep impact: Unintended consequences of journal rank. Front. Hum. Neurosci. 7, 291 (2013).
  12. Seglen, P. O. Why the impact factor of journals should not be used for evaluating research. BMJ 314, 498–502 (1997).
  13. Moed, H. F. Citation analysis of scientific journals and journal impact measures. Curr. Sci. 89, 1990–1996 (2005).
  14. Ioannidis, J. P. A. Why most published research findings are false. PLoS Med. 2, e124 (2005).
  15. Moonesinghe, R., Khoury, M. J. & Janssens, A. C. J. W. Most published research findings are false, but a little replication goes a long way. PLoS Med. 4, e28 (2007).
  16. Martin, G. N. & Clarke, R. M. Are psychology journals anti-replication? A snapshot of editorial practices. Front. Psychol. 8, 523 (2017).
  17. Smaldino, P. E. & McElreath, R. The natural selection of bad science. R Soc Open Sci 3, 160384 (2016).
  18. Camerer, C. F. et al. Evaluating replicability of laboratory experiments in economics. Science 351, 1433–1436 (2016).
  19. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
  20. Camerer, C. F. et al. Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. Nat Hum Behav 2, 637–644 (2018).
  21. Dreber, A. et al. Using prediction markets to estimate the reproducibility of scientific research. Proc. Natl. Acad. Sci. U. S. A. 112, 15343–15347 (2015).
  22. Verfaellie, M. & McGwin, J. The case of Diederik Stapel. American Psychological Association https://www.apa.org/science/about/psa/2011/12/diederik-stapel (2011).
  23. Grieneisen, M. L. & Zhang, M. A comprehensive survey of retracted articles from the scholarly literature. PLoS One 7, e44118 (2012).
  24. Callaway, E. Report finds massive fraud at Dutch universities. Nature 479, 15 (2011).
  25. Schweinsberg, M. et al. Same data, different conclusions: Radical dispersion in empirical results when independent analysts operationalize and test the same hypothesis. Organ. Behav. Hum. Decis. Process. 165, 228–249 (2021).
  26. Menkveld, A. J. et al. Non-Standard Errors. Tinbergen Institute Discussion Paper, doi:10.2139/ssrn.3981597 (2021).
  27. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
  28. Serra-Garcia, M. & Gneezy, U. Nonreplicable publications are cited more than replicable ones. Sci Adv 7, (2021).
  29. Hardwicke, T. E. et al. Citation patterns following a strongly contradictory replication result: Four case studies from psychology. Adv. Methods Pract. Psychol. Sci. 4, 251524592110408 (2021).