A series of scandals in 2011-2012, ranging from data fabrication to ordinary errors, sparked a debate on scientific practice that reached far beyond scientific circles. The problems and the debate are much older, of course, and resurface every now and then.
The first elaborate study of scientific misconduct (among NIH-financed scholars in the USA), which listed a range of behaviors and their frequencies, from plagiarism to data polishing, was published in 2005 (Martinson et al. 2005). To see how difficult or easy it is to publish gibberish papers, a spoof medical paper was submitted to 305 open-access journals and accepted by 157 of them (Bohannon 2013). Moreover, a computer program designed to detect fake papers written by another computer program found 120 of them in scientific journals and conference proceedings. Even prestigious journals publish non-replicable results, because the social sciences have no generally accepted standards and values, and editors do not perceive their own biases and blind spots; see Starbuck’s (2016) important but shocking review of bad editorial practices, senior scholars being the wrong role models, and junior researchers becoming cynical careerists. All this looks very bad indeed.
When focusing on the social sciences one may ask, among other things, what percentage of published articles contains incorrect statistics, for whatever reason. Roughly half of controlled experimental studies turn out to be replicable (Camerer et al. 2018; Open Science Collaboration 2015). On top of errors and fraud, a serious problem is that positive results, which confirm hypotheses, are much easier to publish than a lack of confirmation, even though the latter is equally important for the audience to know. All this leads to a systematic bias in published results (Fanelli 2010a, 2010b; Franco et al. 2014). A main culprit is the misuse of p-values as evidence for hypothesized effects (Nuzzo 2014). What you want to know is how likely it is that your results are true, but that depends much more strongly on other factors, in particular the validity of your theory (Colquhoun 2018), than on p-values, which in turn say next to nothing if something is amiss in these other factors. Currently fashionable big data can contain hidden flaws, for example, and limited access to those data (e.g. Facebook) hinders critical inspection by others (Huberman 2012; Ruths and Pfeffer 2014). Yet the problems are most severe outside the social sciences, in biomedicine, which suffers from a poisonous combination of small samples, small effect sizes, few replications, and enormous financial interests (Carpenter 2012).
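The claim that the credibility of a significant result depends more on the plausibility of the hypothesis than on the p-value can be made concrete with a small calculation. What follows is a minimal sketch, in the spirit of the argument but not taken from the cited papers. It assumes a simple model in which a fraction `prior` of all tested hypotheses is true, tests detect a true effect with probability `power`, and results count as significant at level `alpha`. The function name and the example numbers are illustrative assumptions.

```python
def prob_true_given_significant(prior: float, power: float, alpha: float = 0.05) -> float:
    """Share of significant results that reflect a real effect.

    Assumed model: a fraction `prior` of tested hypotheses is true,
    true effects are detected with probability `power`, and false
    hypotheses yield a significant result with probability `alpha`.
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Same significance threshold and power, different plausibility of the theory.
    for prior in (0.5, 0.1, 0.01):
        share = prob_true_given_significant(prior, power=0.8)
        print(f"prior = {prior:<4}  P(true | significant) = {share:.2f}")
```

Under these assumed numbers, the share of significant results that are true drops from about 0.94 to 0.64 to 0.14 as the prior plausibility falls from 0.5 to 0.1 to 0.01, although the significance threshold never changes.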
Fraud typically starts small but escalates easily (Crocker 2011). The damage of fraud and errors can be large, because spectacular or much-wanted results are cited more often, and may have policy implications until long after the flaw has been exposed and debunked (Miguel et al. 2014). Some flawed findings, such as the Hawthorne effect and Protestants’ higher suicide rate than Catholics’, are too pleasant to give up, and continue to be reproduced in many sociology textbooks (Nolan 2003). The mnemonic TRAGEDIES (Temptation, Rationalization, Ambition, Group and authority pressure, Entitlement, Deception, Incrementalism, Embarrassment and Stupid systems) spells out where scholars can go wrong (Gunsalus and Robinson 2018).
What are the remedies? PhD advisors, journal editors, data providers and practicing scientists can all be agents of change. PhD advisors should socialize their students appropriately before they become professional scientists (Neaves 2012). This involves teaching them to document their work well, and to set up a transparent structure that makes their work easily replicable (in theory, that is, as in practice few scholars undertake to replicate substantial portions of existing work). Whereas forcing a change of scientists’ minds or permanently checking their behavior is not feasible, establishing a new work ethic is viable. Exposing results, with the requisite reporting of methods and data, to criticism and replication (Fanelli 2013; Stojmenovska et al. 2017; Freese and Peterson 2017) can further strengthen the scientific body, either by discovering scientific fraud in the worst-case scenario (see for example the case of LaCour and Green 2014; Broockman et al. 2015) or simply, in the case of successful replication, by boosting confidence in a particular finding. The dedicated R package statcheck can help authors and critics alike; see also the general manifesto for reproducible science (Munafo et al. 2017). Complementary to replication, new findings should also be consistent with well-established theory. There are exceptions, and received wisdom is sometimes wrong, but then there are even stronger reasons to replicate.
There has been visible progress towards embracing replication, with journals such as the American Economic Review requiring authors to submit data and replication files, and with the Reproducibility Project in Psychology (Open Science Collaboration 2015). Yet a reluctance to replicate and/or be replicated lingers on. An important cause of this reluctance appears to be the fear of reputational damage upon unsuccessful replication. This fear, however, appears not to be well grounded according to Fetterman and Sassenberg’s (2015) study of hypothetical failed replication scenarios. They find that scientists tend to overestimate the reputational damage of a failed replication, and that admitting to having made a mistake is the best thing one can do.
There are also intrinsic difficulties in replicating. Field observations, for example in remote villages, are often impossible to replicate, and well-replicated experimental results may not hold true in the field. These issues were discussed in five papers from different fields in a special issue of Science 334 (2 December 2011). Nature (16 May 2013) also had a special on the reproducibility of research. Moreover, not all failures to replicate are unambiguous grounds for rejection, and this can lead to fierce debates, e.g. on social priming (Abbott 2013). Finally, in some types of qualitative research, knowledge is seen as situated in the researcher. Standpoint theory (Harding 1991), for example, stipulates that when a study is conducted by a different scholar, the results are bound to be different, thereby rendering any replication failed by default.
To reduce the problem of data mining in the natural and the social sciences, it has been proposed that journals require plans for data analysis to be submitted before the actual analysis is conducted and the article written (see Olken 2015 for economics). This decreases the likelihood that articles with statistically significant and sensational findings will feature disproportionately in journals, or that authors will attempt to inflate the significance of their findings post hoc (Brodeur et al. 2016). Another remedy is encouraging authors to discuss substantive significance, that is, the effect size of findings, rather than, or along with, statistical significance. In a review of all articles published in the European Sociological Review in 2000-2004 and 2010-2014, Bernardi et al. (2017) found that only one in three articles discussed the findings in terms of effect sizes, while half of the articles incorrectly interpreted statistically insignificant regression coefficients as zero effects. These practices are slowly changing, though, and ever more scholars are becoming convinced that discussing findings solely in terms of statistical significance is inappropriate.
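Why a statistically insignificant regression coefficient should not be read as a zero effect can be illustrated with a confidence interval. The following is a minimal sketch, assuming a normal approximation for the sampling distribution of the estimate. The coefficient and standard error are made-up numbers for illustration, not values from Bernardi et al. (2017), and the function names are hypothetical.

```python
from statistics import NormalDist


def confidence_interval(estimate: float, std_error: float, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a regression coefficient."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level = 0.95
    return estimate - z * std_error, estimate + z * std_error


def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for the null hypothesis of a zero coefficient."""
    z = abs(estimate / std_error)
    return 2.0 * (1.0 - NormalDist().cdf(z))


if __name__ == "__main__":
    # Hypothetical estimate: a sizeable coefficient that is imprecisely measured.
    b, se = 0.30, 0.20
    low, high = confidence_interval(b, se)
    print(f"p = {two_sided_p(b, se):.3f}")      # above 0.05, so "not significant"
    print(f"95% CI = [{low:.2f}, {high:.2f}]")  # contains 0, but also large effects
```

With these assumed numbers the p-value is above 0.05, yet the interval runs from slightly below zero to more than twice the point estimate. The data are therefore compatible with no effect and with a substantial one, which is why reporting the effect size and its uncertainty says more than the significance verdict alone.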
If falsehood has been established by an independent committee, flawed papers should be listed in a public database so that the ongoing diffusion of false knowledge can be brought to a halt (Flutre et al. 2011). Moreover, if the flaw is exposed in the same journal as the original paper, and if the author admits the flaw, the effect on the audience is larger than if it is exposed in another journal or the author denies it (Eriksson and Simpson 2013). Fortunately, ever more journals require public access to data. However, subjects’ privacy must be guaranteed (Wicherts 2011). Software scripts, too, often contain errors, or worse, and should be made public together with the publications for which they are used, and possibly be peer-reviewed (Joppa et al. 2013). To make this possible, these authors recommend improvements in education and propose that journals educate too, by publishing tutorials. Finally, the review process itself should be reviewed, and reviewer-author and editor-author interactions should be opened to investigators (Lee and Moher 2017).
If nothing else helps and you do find results that are too good to be true, blow the whistle! (Yong and Van Noorden 2013)
Editorial note: Misconduct, mistakes and replication were newly debated from 2011 onwards, and important enough for a dedicated AISSR webpage. Now that these topics have progressively become mainstream, I (J.B.) won’t continue to keep up with the rapidly growing literature, and will only update this site with a selection of papers from 2019 onwards. See for example the continually updated collection at Nature.