P-Hacking, HARKing, and Science’s Replication Crisis - SimpliFaster (2024)


You might have heard that science has a replication crisis. Both experienced researchers and the lay press have commented on the fact that many research findings cannot be replicated by researchers redoing the same experiments. Because replication is a fundamental aspect of scientific research, allowing us to double-check the initial study, such a discovery suggests that many of the research findings we take for granted might not actually be true.

Most famously, this was shown with power posing, where early research suggested that holding a specific pose produced improvements in mood and performance. This led to a best-selling book and a highly popular TED Talk. Only… it turned out that these findings could not be replicated, leading many to believe that they might not be true.

P-hacking is believed to be one of the main drivers of the (alleged) replication crisis in science, says @craig100m.

More recently, Brian Wansink, a leading food researcher from Cornell University, was found to have committed scientific misconduct, and was made to leave his post. Wansink’s papers, which have over 20,000 citations, have been largely discredited, with 13 being retracted from the journals they were published in. Wansink was accused of p-hacking, a practice that is largely believed to be one of the main drivers of the (alleged) replication crisis in science. So, just what is p-hacking? In this article, I aim to find out.

The Scientific Method

First, a reminder on how science should work, with specific mention of p-values. When I conduct an experiment, I should, in theory (though nobody really states this explicitly anymore), create two hypotheses: the null hypothesis and the alternative hypothesis. The idea of a study is to set out to falsify the null hypothesis; while we can’t “prove” that something has an effect, we can say that an effect is very likely if we are able to reject the null hypothesis. (The approach detailed here is an example of Null Hypothesis Significance Testing [NHST]. It is perhaps the most well-known method in scientific research, but it’s not the only one, so keep this in mind.)

Let’s work through an example: I have 20 athletes, and I want to understand whether caffeine improves their 1 repetition maximum (1RM) bench press. My plan is to have them all do a 1RM bench press test without caffeine and a 1RM bench press test with caffeine. If they lift more with caffeine than without, then I can state that caffeine enhances 1RM bench press performance.

If I were a really good scientist, I’d randomize the order in which they did the tests; some athletes would do the caffeine-free test first, followed by the caffeine test, and some the other way around. I would also want to blind the athletes as to whether or not they had consumed caffeine, because, through expectancy or placebo effects, knowing you had consumed caffeine could affect your performance independently of any impact of the caffeine itself. Having reviewed the previous research on caffeine and performance, I think that caffeine likely would enhance bench press performance. So, in this case, my null hypothesis is “caffeine will not enhance performance” and my alternative hypothesis is “caffeine will enhance performance.”

After setting up my experiment and my hypotheses, in order to show an effect of caffeine, I want to try to reject the null hypothesis. To do this, I can use a variety of different statistical methods, but the most common and basic is the t-test. One of the outputs from these statistical tests is a p-value. We can use this p-value to guide us on whether or not we can safely reject the null hypothesis. What the p-value tells us (and this is commonly misunderstood) is the chance (or probability, hence “p”) of getting a result at least as extreme as the one we observed, assuming the null hypothesis is true.
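To make this concrete, here is a minimal sketch of that analysis in Python using SciPy’s paired t-test. The 1RM numbers are invented purely for illustration; they are not data from any real study.

```python
# A minimal sketch of the caffeine example using a paired t-test.
# The numbers below are made up purely for illustration.
from scipy import stats

# Hypothetical 1RM bench press results (kg) for the same 20 athletes
placebo  = [115, 120, 118, 125, 122, 119, 121, 117, 124, 120,
            116, 123, 118, 121, 119, 122, 120, 117, 125, 118]
caffeine = [118, 124, 119, 129, 125, 121, 126, 118, 127, 124,
            117, 128, 120, 124, 122, 126, 123, 119, 130, 121]

# Paired (repeated-measures) t-test: each athlete is tested under both conditions
t_stat, p_value = stats.ttest_rel(caffeine, placebo)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p is below our chosen threshold, we would reject the null hypothesis
# that caffeine has no effect on 1RM bench press performance.
```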

The p-value tells us the probability of getting a result at least this extreme, assuming the null hypothesis is true, says @craig100m.

Let’s return to the caffeine example. I’ve done the 1RM bench press testing of my athletes under both conditions (caffeine and placebo). The average 1RM score when the athletes didn’t have caffeine was 120kg. The average when they did have caffeine was 130kg. We want to use our statistical tests to understand whether the difference in means is likely “real” (i.e., caffeine does enhance performance—the alternative hypothesis) or “false” (i.e., caffeine does not actually improve performance, and the difference in means is likely due to chance, random variation, etc.—the null hypothesis).

Remember, the p-value tells us the chance of getting a result this extreme if the null hypothesis is true. If I had a p-value of 0.1, then, assuming caffeine truly has no effect, there would be a 10% chance of seeing a difference of 10kg or more between the two trials purely by chance. Similarly, if the p-value was 0.01, there would be a 1% chance.
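A short simulation can make that interpretation tangible. The sketch below assumes 20 athletes and a spread of individual caffeine-minus-placebo differences of 15 kg (my own assumption, not a figure from the article), and asks how often pure chance produces a mean difference of at least 10 kg when caffeine truly does nothing.

```python
# Simulation sketch: what "the chance of a result this extreme, if the null is true" means.
import numpy as np

rng = np.random.default_rng(0)
n_athletes, n_sims = 20, 100_000
observed_diff = 10.0          # the 130 kg vs. 120 kg difference from the example
sd_of_differences = 15.0      # assumed spread of caffeine-minus-placebo differences

# Simulate many experiments in which the null hypothesis is TRUE (mean difference = 0)
null_diffs = rng.normal(0, sd_of_differences, size=(n_sims, n_athletes)).mean(axis=1)

# How often does chance alone produce a mean difference at least as extreme as 10 kg?
p_approx = np.mean(np.abs(null_diffs) >= observed_diff)
print(f"Approximate p-value: {p_approx:.3f}")
```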

False Positives vs. False Negatives

Still with me? Now I need to introduce type I and type II errors. A type I error is where we reject the null hypothesis, but the null hypothesis is actually true. In the caffeine example, we would state that caffeine does have an effect, while in actual fact it does not. We can consider type I errors to be false positives.

A type II error is the opposite; here, we accept the null hypothesis, when the alternative hypothesis is actually true. In the caffeine example, this would be saying that caffeine has no effect on 1RM bench press strength, when in fact it actually does. We can consider type II errors to be false negatives.

The p-value threshold we choose essentially sets the risk of committing a type I error that we are willing to accept. So, the big question is: What is an acceptable risk of committing a type I error? In this case, what should we set our p-value threshold as, before we can reject the null hypothesis and say that caffeine does have a performance-enhancing effect on 1RM bench press strength?

We could argue about this all day long, but the general consensus is that a p-value of 0.05 is the appropriate threshold. With a threshold of 0.05, if the null hypothesis is true, there is a 5% chance of getting a result at least as extreme as the one we did. So, if we reject the null hypothesis whenever p<0.05, what we’re effectively saying is that we accept a 5% chance of stating there is an effect of caffeine when actually there isn’t (i.e., a false positive). Some researchers recommend using a much stricter threshold, such as 0.001 (while others believe p-values are a largely outdated method). There is a balancing act here: The stricter the threshold we choose to accept for a p-value—and therefore, the lower the chance of committing a type I error—the greater the chance of committing a type II error.
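The simulation sketch below illustrates that balancing act. The true effect (5 kg) and the spread of individual responses (10 kg) are assumed numbers chosen only to show the pattern: tightening the threshold from 0.05 to 0.001 cuts false positives but inflates false negatives.

```python
# Sketch: how the significance threshold trades type I errors against type II errors.
# Assumed numbers: 20 athletes, true effect of 5 kg, within-athlete SD of 10 kg.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sims = 20, 5_000
true_effect, sd = 5.0, 10.0

# p-values when the null is TRUE (no effect) and when it is FALSE (a 5 kg effect)
p_null = np.array([stats.ttest_1samp(rng.normal(0, sd, n), 0).pvalue for _ in range(sims)])
p_alt  = np.array([stats.ttest_1samp(rng.normal(true_effect, sd, n), 0).pvalue for _ in range(sims)])

for alpha in (0.05, 0.001):
    type1 = np.mean(p_null < alpha)   # false positives: effect claimed when there is none
    type2 = np.mean(p_alt >= alpha)   # false negatives: real effect missed
    print(f"alpha={alpha}: type I rate ~{type1:.3f}, type II rate ~{type2:.3f}")
```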

Ok, now we’re getting closer to the crux of the problem with p-hacking. If I set my p-value threshold at 0.05, then I’m accepting a 5% chance of claiming that caffeine has a performance-enhancing effect when it actually doesn’t. This means that, if caffeine truly has no effect and I repeat this experiment 20 times, I should expect roughly one of those occasions (i.e., 5% of the trials) to hand me a false positive: I would be saying caffeine has an effect when, actually, it does not. This has implications for larger, more complex experiments, where we might have to run multiple statistical tests.

Returning to the caffeine example, let’s now introduce 10 genes that we believe might affect how much caffeine influences performance. For each gene, I want to know whether people with one version see a greater performance enhancement than people with the other version—so I need two statistical tests for each gene, and I have 10 genes, leading to 20 statistical tests. If I keep my p-value threshold at 0.05 for each of these, then, by virtue of running many tests, I’m greatly increasing the chances of committing a type I error; the chance of a false positive is 1 in 20 per test, and I’ve done 20 tests (this is an oversimplification, but it helps to demonstrate the point, as the quick calculation below shows).
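Under the simplifying assumption that the 20 tests are independent, the arithmetic looks like this:

```python
# Sketch: chance of at least one false positive across 20 independent tests,
# each run at alpha = 0.05, when the null hypothesis is true for all of them.
alpha, n_tests = 0.05, 20

prob_at_least_one_false_positive = 1 - (1 - alpha) ** n_tests
print(f"{prob_at_least_one_false_positive:.2f}")   # ~0.64, not 0.05
```

So across the whole family of tests, the chance of at least one false positive is roughly 64%, not 5%.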

There are a number of ways researchers can correct for this, including the Bonferroni correction. Here, the accepted p-value is divided by the number of significance tests carried out: If I did 20, then my p-value threshold becomes 0.0025 (0.05 ÷ 20); if my p-value is above this, then I accept the null hypothesis as per usual. This is what I should do, if I’m being honest and scientifically robust.
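As a minimal sketch, the Bonferroni correction is just a division. The 20 p-values below are invented for illustration.

```python
# Sketch of a Bonferroni correction for the 20 hypothetical gene tests.
# The p-values below are invented for illustration.
p_values = [0.03, 0.20, 0.45, 0.61, 0.75, 0.88, 0.94, 0.33, 0.52, 0.07,
            0.41, 0.66, 0.29, 0.83, 0.99, 0.58, 0.72, 0.37, 0.91, 0.25]

alpha = 0.05
bonferroni_threshold = alpha / len(p_values)   # 0.05 / 20 = 0.0025

significant = [p for p in p_values if p < bonferroni_threshold]
print(f"Threshold: {bonferroni_threshold}, tests surviving correction: {len(significant)}")
# With these invented numbers, nothing survives the corrected threshold.
```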

Manipulating Probability

What p-hacking entails is doing a number of statistical tests, seeing which are significant, and then selectively reporting only those tests in your paper. So, in my caffeine and genotype example, I would have done 20 statistical tests, with a p-value of 0.05 as my threshold. Having done these tests, I found that subjects with a certain version of one gene, CYP1A2, saw a greater performance enhancement from caffeine than those with the other version of that gene.

The p-value for this statistical test is 0.03, below my threshold of 0.05. All the other tests I ran showed p-values of anywhere between 0.2 and 1.0, meaning that, for those genes, I cannot reject the null hypothesis, and so I have to state that those genes have no effect on the size of performance enhancement seen following intake of caffeine. Because null results aren’t as interesting as positive results, and because journals are biased toward publishing interesting results, I decide to write my paper by just looking at CYP1A2 and caffeine. In my paper, I therefore “pretend” that I’ve only carried out one statistical test. I report the p-value as 0.03, below the threshold of significance (0.05), thereby “demonstrating” that caffeine is more performance enhancing in some people than others.

Of course, what I should have done was correct my p-values for multiple hypothesis testing; I actually ran 20 tests—even if I didn’t publish them all—meaning my threshold should have been 0.0025. In reality, this gene had no clear effect on the size of performance enhancement following caffeine consumption, but by selectively reporting the statistical tests I did, I can make it seem like it did. And this, in a nutshell, is what p-hacking is.
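The simulation below sketches why this matters. Every “gene” test is run on pure noise (no gene truly matters), yet if you only report the best-looking test, you will “find” a significant gene in well over half of your studies. The group sizes and number of simulated studies are arbitrary choices made for illustration.

```python
# Sketch: why selective reporting misleads. All 20 "gene" comparisons are run
# on pure noise, yet reporting only the smallest p-value produces a
# "significant" finding far more than 5% of the time.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_per_group, n_tests, sims = 10, 20, 2_000

found_something = 0
for _ in range(sims):
    # 20 two-group comparisons, all drawn from the same null distribution
    p_min = min(
        stats.ttest_ind(rng.normal(0, 1, n_per_group),
                        rng.normal(0, 1, n_per_group)).pvalue
        for _ in range(n_tests)
    )
    found_something += p_min < 0.05

print(f"'Significant' result found in {found_something / sims:.0%} of studies")  # roughly 64%
```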

If you perform multiple tests and selectively report just the significant ones, you are p-hacking, says @craig100m.

There are other ways I can p-hack. I might carry out my analysis, find that I’m very close to a significant p-value (whether that’s 0.05 or something else), and then go back into the data and make changes to engineer a “successful” p-value. For example, in my caffeine study, I might find that in the caffeine trial my subjects lifted more weight, but with a p-value of 0.08—close to my threshold of 0.05, but not quite there.

So, I go back and play around with the data: What happens when I remove males from the analysis? Or what if I remove those with more than four years of training history? Or perhaps subject 17 only had a 0.5% improvement while all the others had a 6% improvement, leading me to believe that he/she didn’t really try, so I remove them from the analysis. There are often legitimate and innocent reasons for removing some subjects from data analysis, which is fine—provided it’s not being done to manufacture a significant result.
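Here is a sketch of what that process can look like in practice. The data and exclusion rules are invented; the point is the procedure of trying analysis path after analysis path until one happens to cross the 0.05 line, not the particular numbers.

```python
# Sketch of the "keep trying exclusions until p < 0.05" form of p-hacking.
# All data are simulated with NO true caffeine effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 20
diff = rng.normal(0, 6, n)                  # caffeine-minus-placebo differences (kg)
is_male = rng.integers(0, 2, n).astype(bool)
training_years = rng.integers(1, 8, n)

candidate_subsets = {
    "everyone":             np.ones(n, dtype=bool),
    "females only":         ~is_male,
    "< 4 years training":   training_years < 4,
    "drop smallest change": diff != diff.min(),
}

for label, keep in candidate_subsets.items():
    p = stats.ttest_1samp(diff[keep], 0).pvalue
    print(f"{label:22s} p = {p:.3f}")
    if p < 0.05:
        # With enough candidate exclusions, one analysis usually "works"
        print("…and this is the analysis that gets written up.")
        break
```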

Hypothesizing After the Results Are Known

P-hacking also has a close cousin: HARKing, which stands for Hypothesizing After the Results are Known. Here, researchers generate a hypothesis after they have analyzed their data. Again, this is frowned upon—the purpose of a statistical test is to test a hypothesis, which means the hypothesis has to exist before the test is run. Similar to p-hacking, HARKing increases the risk of a type I error, which is why replicating such research often proves impossible—hence the replication crisis.

The world of science is well aware of these issues, and of the danger that they could undermine public confidence. A number of practices are being put in place by various journals in an attempt to guard against both p-hacking and HARKing. These include open data sharing, where researchers upload their raw data as a supplement to their paper, for all to analyze. A second approach is the pre-registration of study designs; here, researchers state what they are going to do, what their hypothesis is, and how they’re going to analyze their data, before they actually do it—guarding against both p-hacking and HARKing.

Another potential solution that has been proposed is to make the threshold required for statistical significance more stringent (although not everyone agrees).

Various journals are putting practices in place to guard against both p-hacking and HARKing, says @craig100m.

Finally, we could drop p-values altogether. This is an approach that has gained popularity in sports science research in recent years, in part because p-values might not be all that useful to researchers in the field, with researchers instead reporting effect sizes alongside a probability that the effect is practically important, rather than p-values.
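As one simple illustration of an effect-size-first report, the sketch below computes Cohen’s d for paired data, reusing the invented 1RM scores from earlier. Cohen’s d is just one common effect-size measure chosen for illustration; it is not the specific statistic any particular journal or method (such as MBI, discussed next) mandates.

```python
# Sketch: reporting an effect size (Cohen's d for paired data) alongside,
# or instead of, a p-value. The data are the hypothetical 1RM scores from earlier.
import numpy as np

placebo  = np.array([115, 120, 118, 125, 122, 119, 121, 117, 124, 120,
                     116, 123, 118, 121, 119, 122, 120, 117, 125, 118])
caffeine = placebo + np.array([3, 4, 1, 4, 3, 2, 5, 1, 3, 4,
                               1, 5, 2, 3, 3, 4, 3, 2, 5, 3])

diff = caffeine - placebo
cohens_d = diff.mean() / diff.std(ddof=1)   # mean change relative to its variability
print(f"Mean improvement: {diff.mean():.1f} kg, Cohen's d = {cohens_d:.2f}")
```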

Perhaps the most popular approach here is that of magnitude-based inference (MBI), developed by Will Hopkins and Alan Batterham. However, the use of MBI has recently been heavily criticized by other statisticians, with at least one journal stating it won’t accept papers that utilize the method. Nevertheless, the approaches detailed here will hopefully help address the replication crisis and increase public confidence in the scientific process. Given how important science is to society as a whole, this matters hugely.




FAQs

How does the replication crisis relate to p-hacking? ›

Replication failures are sometimes simply due to bad luck, but more often, they are caused by p-hacking - the use of fishy statistical techniques that lead to statistically significant (but misleading or erroneous) results.

What is the difference between p-hacking and HARKing? ›

While researchers are expected to look for significant results to confirm their hypotheses, some engage in intentional or unintentional HARKing (Hypothesizing After Results are Known) and p-hacking (repeated tinkering with data and retesting).

What is the main problem with p-hacking? ›

P-hackers, however, commit 2 cardinal sins against the scientific method. They use P values to back into a hypothesis after the fact. And they fail to conduct additional studies with a separate dataset to validate the findings. This approach leaves the hypothesis essentially untested.

What is the p-hacking fallacy? ›

When researchers engage in p-hacking, they conduct multiple hypothesis tests without correcting for the α-error accumulation, and report only significant results from the group of tests. This practice dramatically increases the percentage of false-positive results in the published literature [18].

Why is HARKing bad? ›

HARKing prevents hypotheses from being falsified, prevents the research community from identifying already falsified hypotheses, leads to irreproducibility (the “replication crisis”), and increases the probability that the findings are not reproducible or generalizable in the population of interest.

What is HARKing in psychology? ›

This article considers a practice in scientific communication termed HARKing (Hypothesizing After the Results are Known). HARKing is defined as presenting a post hoc hypothesis (i.e., one based on or informed by one's results) in one's research report as if it were, in fact, an a priori hypothesis.

How do you identify HARKing? ›

In other words, research reports suffer from HARKing if they include one or more post hoc hypotheses (that is, hypotheses developed after the results of the data analysis are known) that are misrepresented as a priori (that is, as developed prior to the data analysis) or if they exclude one or more a priori hypotheses ...

What is the difference between p-hacking and publication bias? ›

For simplicity, we refer to “publication bias” as behaviors that reviewers and editors (i.e., the peer review process) engage in that are skewed in favor of statistical significance, while “p-hacking” refers to behaviors engaged in by authors.

What is p-hacking in medical research? ›

P-hacking is a QRP wherein a researcher persistently analyzes the data, in different ways, until a statistically significant outcome is obtained; the purpose is not to test a hypothesis but to obtain a significant result.

What caused the replication crisis? ›

Historical and sociological causes

The replication crisis may be triggered by the "generation of new data and scientific publications at an unprecedented rate" that leads to "desperation to publish or perish" and failure to adhere to good scientific practice.

How does p-hacking impact our daily lives? ›

P-hacking can distort the validity of research findings, leading to incorrect conclusions and potentially harmful consequences for society. One major consequence of p-hacking is the dissemination of false or misleading information.

What are the best practices to avoid p-hacking? ›

Pre-registration of the study

It is one of the best methods to avoid p-hacking. Using pre-registration can help you avoid tweaking data after recording it. This requires a comprehensive test plan, statistical tools, and examination methods to be applied to the data.

What is another word for p-hacking? ›

P-value hacking, also known as data dredging, data fishing, data snooping or data butchery, is an exploitation of data analysis in order to discover patterns which would be presented as statistically significant, when in reality, there is no underlying effect.

What is the false discovery rate of p-hacking? ›

Specifically, about 57% of experimenters p-hack when the experiment reaches 90% confidence, and p-hacking increases the false discovery rate (FDR) from 33% to 42% among experiments p-hacked at 90% confidence.

What does the questionable research practice of HARKing refer to? ›

HARKing (hypothesizing after the results are known) is an acronym coined by social psychologist Norbert Kerr that refers to the questionable research practice of "presenting a post hoc hypothesis in the introduction of a research report as if it were an a priori hypothesis".

What is p-hacking and what does it say about how we as a society relate to data and statistics? ›

This is a technique known colloquially as 'p-hacking'. It is a misuse of data analysis to find patterns in data that can be presented as statistically significant when in fact there is no real underlying effect.

What are 3 possible reasons for the replication crisis? ›

There are many proposed causes for the replication crisis.
  • Historical and sociological causes.
  • Problems with the publication system in science.
  • Publication bias.
  • Mathematical errors.
  • "Publish or perish" culture.
  • Standards of reporting.
  • Procedural bias.
  • Cultural evolution.

How might research results be implicated as a result of p-hacking? ›

P-Hacking Definition

P-hacking is a set of statistical decisions and methodology choices during research that artificially produces statistically significant results. These decisions increase the probability of false positives—where the study indicates an effect exists when it actually does not.

What does the replication crisis that confronts psychology relate to? ›

The replication crisis psychology faces is one seen in many of the social science fields due to the lack of open access to information on previous studies, the lack of internal validity of some studies, biases in publication along with the pressure to publish novel research, manipulation of data and/or hypotheses, and ...
