Solutions for enhancing Research Integrity – keynote by Prof. Jelte Wicherts

Last modified: May 24, 2018

The keynote lecture of the second NRIN Research Conference 2018 was given by Prof. Jelte Wicherts. He explained some of the faulty shortcuts scientists take on the route from formulating a hypothesis to presenting results, and offered solutions for incentivizing the parties involved to carry out proper research. These solutions “would be implemented immediately, if I were the King of Science”, as he put it.
He demonstrated how research should be done by showing the empirical cycle, which consists of five steps. First, deduce from the literature a specific hypothesis that can be tested empirically. Second, design an experiment in which you make a very specific prediction based on the hypothesis. Third, collect and analyze the data to see whether that prediction is borne out by the data. Fourth, evaluate the results. Fifth, produce new literature. In practice, however, researchers often skip steps or change their order.

Ways to reduce questionable research practices

Photo: Fenneke Blom

Research misconduct, or simply fraud, is not very prevalent, but it is severe. Here, the step of collecting and analyzing data is skipped entirely. A famous example of fraud in the Netherlands is the case of Diederik Stapel, who admitted to having fabricated data in over fifty articles. Prof. Wicherts shared his ideas for dealing with researchers who are willing to lie in order to get ahead in science. For instance, one can improve regulations, procedures and codes of conduct. Another idea is to promote responsible conduct of research, including through training. One could also change the descriptive norms of how science should be carried out and thereby lower the tolerance for questionable research practices (QRPs). Currently, QRPs are not yet clearly defined and can differ across fields. Another solution is to stimulate open science practices. When data are shared alongside articles, the raw data are exposed to the world, which makes it very difficult to fake data. Sharing also contributes to reproducibility, since other researchers can use the data to carry out more research and compare each other’s findings. When the data are available for others to scrutinize, it is no longer possible to analyze them in many different ways and present only the one positive outcome. This helps prevent the reporting of false positives.


Hypothesizing after the results are known (HARKing) is another way of not following the empirical cycle correctly. The results are presented as if they had been hypothesized from the start, although the hypothesis was formulated only after data acquisition. A random white noise signal illustrates why HARKing is a problem. This particular noise signal has a dip at time point t = 365. When hypothesizing after the data have been collected, one could come up with a hypothesis that explains why there is such a dip exactly after 365 days, or one year. But noise is not reproducible, so in this case obtaining more data will not confirm the effect. Hence, HARKing can lead to false positive results. Prof. Wicherts calls scientists who engage in HARKing cowboys: they explore the data and are not bound by the rules of the empirical cycle. He added that exploratory research is a perfectly valid research format that can be crucial for finding novel effects. However, it should be carried out and reported in a format other than the empirical cycle, in order to prevent HARKing.
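
The white noise argument can be tried out in a few lines of Python. This is only a minimal sketch and not the demonstration shown in the talk: the series length of 730 points, the 200 replications, the seed and all function names are assumptions chosen for illustration. It finds the deepest dip in one noise series and then counts how often fresh noise has its deepest dip at the same time point.

```python
import random


def white_noise(n_points, rng):
    """Return n_points independent draws from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n_points)]


def deepest_dip(series):
    """Return the time index at which the series takes its lowest value."""
    return min(range(len(series)), key=lambda t: series[t])


def replication_rate(n_points, n_replications, seed=0):
    """Find the dip in one noise series, then count how often new noise
    series have their deepest dip at exactly the same time index."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    dip = deepest_dip(white_noise(n_points, rng))
    hits = sum(
        deepest_dip(white_noise(n_points, rng)) == dip
        for _ in range(n_replications)
    )
    return dip, hits / n_replications


dip, rate = replication_rate(n_points=730, n_replications=200)
print(f"deepest dip in the original series at t = {dip}")
print(f"share of replications with the same dip: {rate:.3f}")
```

Every noise series has a deepest dip somewhere, so a story can always be built around it afterwards. Because each of the 730 time points is equally likely to hold the dip in new data, the "effect" is expected to reappear at the same point only about once in 730 replications.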

Observer bias & underpowered studies

Improperly designed studies can have many confounds that can bias the interpretation of a study. A famous example is the horse that seemed to be able to do arithmetic, giving its answers by tapping its hoof. In reality, it responded to the experimenter’s reactions as the taps approached the right answer. It is a classic example of observer bias (the tendency to see what you expect to see). Another aspect of poor design is an underpowered study. As the sample size increases, the power to detect an effect increases as well. The typical power in psychology is below fifty percent. This means that even if all researchers are right about their hypotheses, they should find significant results in no more than about 50% of their studies. This is a problem because research by Fanelli (2010) shows that in this field more than 90% of published studies were reported to be successful. In neuroscience the power is very low, with a median of 8%, yet most of these studies are published with significant results. Underpowered studies can be fixed by using a larger sample size, by using more powerful designs with better controls, by taking higher-quality measurements and by collaborating with other researchers.
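
The relation between sample size and power can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions, not a method from the talk: it uses a normal approximation for a two-sided comparison of two equally sized groups, and the standardized effect size of 0.5, the significance level of 0.05 and the function name approx_power are all chosen here for the example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison.

    Uses the normal (z-test) approximation; effect_size is the
    standardized mean difference between the two groups.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    # Expected value of the test statistic when the effect is real.
    shift = effect_size * sqrt(n_per_group / 2.0)
    # Probability of landing in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


for n in (10, 20, 50, 100, 200):
    print(f"n per group = {n:3d}  power = {approx_power(0.5, n):.2f}")
```

If every tested hypothesis were true, the power would also be the expected share of studies that come out significant. Under these assumptions, a literature in which more than 90% of studies report success would require far larger samples than a power below fifty percent implies.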

Publication bias

Prof. Wicherts mentioned another bias that has negative consequences for the scientific system: publication bias, which produces a substantial amount of research waste. Most non-significant results are not published, and hence end up in the file drawer or are deleted. Negative results remain unknown, although they would be important knowledge. Eighty percent of psychologists appear to overestimate the power of their studies. Even if a hypothesis is true, a low-powered study has a high chance of yielding non-significant results. These results are then disregarded as evidence: the study is regarded as a failure and will not be published. A solution to this problem is to publish all studies that are carried out as planned. More and more journals assess methodological rigor and do not treat the results as a criterion for publication.

Selective outcome reporting

Peer reviewers and editors exert pressure for perfect, significant results, and thereby incentivize the reporting of positive outcomes. One study showed that many of the clinical trial variables documented in the preregistration protocol were not reported in the published paper. Additionally, approximately as many new variables were silently added. The pressure to publish positive results can lead to another problem: miscalculating the p-value. One study showed that 50% of the investigated articles contained an error in a p-value, and 12.5% of the articles had one or more gross inconsistencies: results that were said to be significant when in fact they were not. One way to counteract selective outcome reporting is to fix the reporting guidelines. Another is to have peer reviewers check whether there was a protocol and compare the protocol against the reported results. One could also let an algorithm, such as Statcheck, check for simple discrepancies.
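
The kind of automated check mentioned above can be sketched as follows. This is not Statcheck's actual implementation; it is a simplified illustration that assumes a two-sided z-test, a significance level of 0.05 and p-values reported to two decimals, and the function names are hypothetical. It recomputes the p-value from the reported test statistic and flags a gross inconsistency when the recomputed value changes the conclusion about significance.

```python
import math

ALPHA = 0.05  # assumed significance threshold


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def check_reported_p(z, reported_p, decimals=2):
    """Compare a reported p-value with the one implied by the z statistic."""
    recomputed = two_sided_p_from_z(z)
    # Consistent if both values agree after rounding to the reported precision.
    consistent = round(recomputed, decimals) == round(reported_p, decimals)
    # Gross inconsistency: the mismatch flips the significance decision.
    gross = (not consistent) and ((reported_p < ALPHA) != (recomputed < ALPHA))
    return {
        "recomputed_p": recomputed,
        "consistent": consistent,
        "gross_inconsistency": gross,
    }


# Hypothetical reported results, made up for illustration only.
for z, p in [(1.96, 0.05), (2.50, 0.02), (1.50, 0.04)]:
    print(z, p, check_reported_p(z, p))
```

In this sketch the third made-up result is the serious case: the reported p-value is below 0.05, while the test statistic implies a p-value well above it.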


P-hacking is also related to the pressure to publish positive results. When the planned analysis does not yield a significant result, more analyses are tried until one is found. The difference in results between analyses is especially large for studies with low power. It is a serious problem, considering the many ways there are to analyze data. For example, for a randomized controlled trial with 34 different choices that you can make, there are 2^34 (about 17 billion) ways of analyzing the data. Observer bias is dangerous in a poorly designed study, since researchers and participants themselves can influence the outcome depending on their own predictions. It is not yet known how severe this bias is. To find out, data from studies need to be reanalyzed to verify the results. Prof. Wicherts is anxious to see the results of such a study. He wonders how many papers contain p-hacking, when already 12.5% of papers have resorted to the less elegant solution of simply misreporting the p-value.
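
The arithmetic behind this concern can be spelled out in a short sketch. Two assumptions are made here that the report does not state: each of the 34 choices is treated as a binary option, and the attempted analyses are treated as independent tests at a significance level of 0.05. Real analyses of the same data are correlated, so the second number is an upper-end illustration rather than an estimate. The function names are hypothetical.

```python
def n_analysis_paths(n_binary_choices):
    """Number of distinct analyses when every choice has two options."""
    return 2 ** n_binary_choices


def chance_of_false_positive(n_analyses, alpha=0.05):
    """Probability that at least one of n independent analyses is
    significant at level alpha when there is no true effect."""
    return 1.0 - (1.0 - alpha) ** n_analyses


print(f"analysis paths with 34 binary choices: {n_analysis_paths(34):,}")
for k in (1, 5, 10, 20):
    print(f"{k:2d} analyses tried: "
          f"chance of at least one false positive = {chance_of_false_positive(k):.2f}")
```

Even a handful of attempts, a vanishingly small fraction of the available paths, raises the chance of a spurious significant result well above the nominal 5%.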

Preregistration can help

For a couple of decades, the medical field has required the preregistration of studies to prevent some of these problems. The practice is not yet ideal, since there are many discrepancies between preregistered plans and reported results. Prof. Wicherts is convinced that preregistration is the only right way to carry out hypothesis testing. An ideal preregistration consists of a testable hypothesis and an analysis plan, both specified a priori. The preregistration should be published in a fixed form. If you deviate from your plan, openly indicate how and why you did so. Another option is to publish a registered report, in which peer review focuses on the rationale, hypotheses and methods, and the article is published regardless of the results. Currently, around a hundred journals have adopted this format.

Researchers’ incentives for misbehaving

Prof. Wicherts went on to discuss why there are so many wrongdoings in science. On the one hand, most scientists are willing to do the right thing. They want to help society by getting results. On the other hand, they have many incentives to produce positive, significant results, because generally only such results get published, and in the scientific world it is publish or perish. When unexpected results are obtained, the researcher might genuinely believe that the data points are wrong. They believe their results should be different, based on their hypothesis and ideas about the experiment. These biases in turn lead researchers to take one of the shortcuts in the empirical cycle, often without knowing that taking this shortcut is wrong.
Almost all the incentives for carrying out research incorrectly benefit the researcher conducting it: publishing in a good journal, getting recognition, getting funding, obtaining tenure, etc. Often these researchers work alone and are not regulated in any way. Hence, it is very important to implement the proposed solutions for enhancing Research Integrity. Researchers who want to carry out research in a valid way, but need a helping hand, will then prevail, and the scientific cowboys who deliberately do wrong will be weeded out of science.

Big Money

One last problem with the empirical cycle lies not with the individual researcher but with the publishers. The key issue is that the business models of big publishers are not always in line with the goals of good science. Publishers mainly focus on novel, interesting results, and many other stakeholders in science focus on publications in high-impact journals. Hence, the big money is not going to steer researchers in the right direction: these monetary rewards motivate researchers to take high risks in pursuit of novel results, which in turn leads to irresponsible conduct of research. A solution for this could be non-profit publishing. However, open access publishers can accept numerous articles, with the disadvantage of lower quality standards and the risk of irrelevant research. Another solution is to ask, when making decisions about grants and tenure, how important the journal in which a researcher has published really is.
Prof. Wicherts noted that it is important to involve all stakeholders in creating incentives for conducting research in a good way. One such initiative was the ‘open data badge’, introduced by a high-impact psychology journal in 2014. Researchers who published their data openly received a virtual sticker that was put on the front page of the article. This initiative led to a rise in data sharing from 3% to over 40%. This also demonstrates that researchers want to do well and present rigorous research.


We should publish null findings

There was time for a few questions. One attendee argued that it is good practice to publish all failed experiments. Firstly, we should not let scientists judge whether an experiment has failed. Secondly, data from failed experiments can be useful for detecting biases. An example is the Odyssey project, in which clinical data on drugs and side effects that were completely unrelated were correlated in order to find biases. Prof. Wicherts agreed that we should learn from failures. However, we should respect privacy laws by asking research participants for consent.

What can reviewers and editors really do?

Another attendee asked: How well equipped are peer reviewers and editors to address the issues mentioned in the talk? And what can they do as gatekeepers to make sure the results are of high quality? The response was that reviewers are already working with all the resources they have, which is a very limited amount of time. Hence, they need to use simple heuristics to decide whether to approve a paper. Therefore, it is not so much about what they can do, but more about what we as a community can do to improve the incentives for peer review. Prof. Wicherts asked rhetorically whether we should pay or train reviewers. Prof. Bouter chimed in, mentioning a development in which some journals publish reviews with an accompanying DOI.

Is Bayes the solution?

Lastly, Prof. Wicherts was asked for his thoughts on moving away from the focus on the p-value, since there are other statistical approaches for drawing conclusions from data. He acknowledged that there are more aspects to the statistical debate. However, discussing these additional aspects would not solve the deficiencies of research integrity in science, because we would still be overlooking the bigger picture. He merely used the problems around the p-value as an example to illustrate that we should move towards a system with more openness about how data are obtained and processed.

We are grateful to Prof. Wicherts for giving this overview of the most prevalent problems in RI. Perhaps he is not the King of Science, but here at the NRIN we can offer him a virtual sheriff’s badge for his work on Research Integrity. (Apparently, scientists like badges.)

By Sanne Joon

Do you want to read more?

Discussion 1: Future of scholarly communication
Discussion 2: Do we need to redefine questionable research practices?
Closing remarks
Taxonomy of non-integrity – a reflection on the conference
