A lot of published scientific findings don't replicate. In 2015, the Open Science Collaboration attempted to reproduce 100 psychology studies and found that only 36% of the replications produced statistically significant results, compared to 97% of the originals [1]. In medicine, the situation is at least as bad – John Ioannidis famously argued in 2005 that most published research findings are false [2].
There are many reasons for this, but one of the most important, and most vilified in scientific circles, is the practice of p-hacking.
What is a p-value, anyway?
A p-value answers a specific question: if there were no real effect here – if the null hypothesis were true – how likely would I be to see a result at least as extreme as this one?
A small p-value means that your test statistic would be very surprising if there were no real effect. (It's not the probability that your hypothesis is true or false.)
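This definition can be checked with a quick simulation. When the null hypothesis really is true, p-values are uniformly distributed, so about 5% of them fall below 0.05 purely by chance. A minimal sketch in Python, using a one-sample z-test with known variance for illustration:

```python
import random
from statistics import NormalDist

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, known sigma = 1 (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(0)
trials = 10_000
# Draw samples from N(0, 1) -- the null is true by construction.
hits = sum(
    z_test_p([random.gauss(0, 1) for _ in range(30)]) < 0.05
    for _ in range(trials)
)
print(hits / trials)  # close to 0.05
```

Even with no effect at all, roughly one test in twenty comes out "significant". That baseline false-positive rate is exactly what α = 0.05 buys you.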
The 0.05 problem
The scientific community has settled on a particular threshold (α = 0.05) at which we collectively decide that a result is "statistically significant." Below 0.05, your finding gets taken seriously. Above it, it mostly doesn't.
This threshold is, to be clear, arbitrary. There's nothing magical about 5%. Ronald Fisher suggested it as a convenient default in 1925, and it stuck – not because of any deep statistical reasoning, but because the field needed a line in the sand and this one was as good as any [3].
The problem is what the threshold does to incentives. Journals overwhelmingly publish significant results. Careers are built on significant results. Grants are awarded on the basis of significant results. So researchers are under enormous pressure to produce p-values that fall below 0.05 – and that pressure has shaped how science actually gets done.
What p-hacking looks like
P-hacking is the practice of fiddling with an analysis until you get a p-value below the threshold. This can take many forms, most of them subtle:
- Trying different subsets of the data until one gives a significant result
- Testing multiple outcome variables but reporting only the one that worked
- Adding or removing control variables until the p-value drops
- Continuing data collection until the result becomes significant, or stopping early when it is
- Adjusting preprocessing decisions (how outliers are handled, which transformations are applied, how missing data is coded) until the p-value comes out right
- Running the same data through different statistical tests until one clears the bar
Most researchers who do this aren't deliberately committing fraud. They're making "reasonable" analytic choices – each one defensible in isolation – that collectively amount to searching for significance. Simmons, Nelson and Simonsohn showed that these kinds of researcher "degrees of freedom" can produce false-positive rates of over 60%, even with the nominal α set at 5% [4].
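The inflation is easy to simulate for even one of these degrees of freedom. With five independent outcome variables and no real effect anywhere, the chance that at least one clears p < 0.05 is 1 − 0.95⁵ ≈ 23%, not 5%. A sketch, again using an illustrative z-test with known variance:

```python
import random
from statistics import NormalDist

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, known sigma = 1 (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(1)
trials, n_outcomes, n = 2_000, 5, 30
# For each "study", test 5 independent null outcomes and keep the best p.
false_pos = sum(
    min(z_test_p([random.gauss(0, 1) for _ in range(n)])
        for _ in range(n_outcomes)) < 0.05
    for _ in range(trials)
)
print(false_pos / trials)  # close to 1 - 0.95**5, about 0.23
```

And this is only one degree of freedom. Stack several of them, as Simmons et al. did, and the false-positive rate climbs far higher.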
And the evidence suggests this is widespread. Head et al. analysed the distribution of p-values across the scientific literature and found a suspicious bump just below 0.05 – exactly what you'd expect if researchers are nudging their results over the line [5].
Why it matters
Results that have been p-hacked are unlikely to generalise. The claims in the paper are overfit to the specific analysis – the particular combination of variable selections, subgroup choices, and statistical tests that happened to produce p < 0.05. Change any of those choices and the significance disappears. The finding was an artefact of the analysis, not a feature of the world.
This is the core of the replication crisis. A result that was found in one dataset under one specific set of analytic decisions will very often fail to appear in a new dataset, because the pattern was never real in the first place – it was a statistical coincidence that was selected for.
How Disco avoids this
Disco is designed with the goal that every pattern it finds should be as likely as possible to hold up on data it hasn't seen. We care about generalisability, not significance theatre – and that's reflected in the architecture of the pipeline.
Holdout validation. As soon as Disco ingests data, it sets aside a large chunk of it as a holdout set that plays no part at all in the pattern discovery process. It is completely ignored until the end of the pipeline; we are vigilant against data leakage. All patterns are tested against this holdout set, and any that don't generalise are rejected, regardless of how impressive they looked on the training data.
This directly addresses overfitting. Any pattern-finding process – whether it's a human fiddling with analysis parameters or an automated pipeline searching a large feature space – can surface patterns that are artefacts of the training data. Testing against a separate holdout protects against this. If a pattern doesn't generalise to data it's never seen, Disco discards it.
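A toy version of this search makes the point concrete: pick the "best" of many pure-noise features on a training split, then retest that same feature on fresh holdout data. This is a sketch of the general idea only, not Disco's pipeline, and the z-test helper is purely illustrative:

```python
import random
from statistics import NormalDist

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, known sigma = 1 (z-test)."""
    n = len(sample)
    z = (sum(sample) / n) * n ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(2)
trials, n_features, n = 200, 50, 50
train_sig = holdout_sig = 0
for _ in range(trials):
    # "Discovery": among 50 pure-noise features, keep the one with the
    # smallest p-value on the training split.
    train = [[random.gauss(0, 1) for _ in range(n)] for _ in range(n_features)]
    best = min(range(n_features), key=lambda i: z_test_p(train[i]))
    train_sig += z_test_p(train[best]) < 0.05
    # Validation: retest that same feature on fresh holdout data.
    holdout = [random.gauss(0, 1) for _ in range(n)]
    holdout_sig += z_test_p(holdout) < 0.05
print(train_sig / trials, holdout_sig / trials)
```

On the training split the selected pattern looks significant the vast majority of the time (1 − 0.95⁵⁰ ≈ 92%), but on holdout data it collapses back to the 5% baseline – because there was never anything there.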
In future, we plan to automatically find more data in the same domain from a completely different source to further validate the patterns we find.
Multiple testing correction. When you find hundreds of potential patterns – as Disco does – some of them will appear significant by chance. If you find 100 patterns at α = 0.05, you'd expect about 5 false positives even if nothing real is going on.
Disco corrects for this using Benjamini-Hochberg false discovery rate (FDR) correction [6] on all reported p-values. This adjusts each p-value upward to account for the number of tests performed, controlling the expected proportion of false discoveries among the patterns that are reported, and discards any that no longer appear significant.
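The BH step-up procedure itself is short. The following is a generic minimal implementation of the correction, not Disco's actual code: sort the p-values, scale each by m/rank, and take a cumulative minimum from the largest down so that adjusted values stay monotone.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg FDR correction.

    Returns (adjusted p-values, reject flags), both in the input order.
    A hypothesis is rejected iff its adjusted p-value is <= alpha.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a cumulative minimum
    # of p * m / rank so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    reject = [p <= alpha for p in adjusted]
    return adjusted, reject

adjusted, reject = benjamini_hochberg([0.001, 0.01, 0.03, 0.5])
print(adjusted, reject)  # first three survive correction, 0.5 does not
```

Note what the raw numbers would have said: all of 0.001, 0.01 and 0.03 clear 0.05 on their own, but so might several chance findings in a large batch. The adjusted values (0.004, 0.02, 0.04, 0.5 here) are what Disco reports.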
These are two defences against two different problems. Holdout validation guards against overfitting – patterns that looked real in the pattern-discovery subset but don't generalise. FDR correction guards against multiplicity – patterns that appear significant purely because you tested enough hypotheses that some were bound to clear the bar by chance.
References
[1] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://doi.org/10.1126/science.aac4716
[2] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
[3] Wasserstein, R. L. & Lazar, N. A. (2016). The ASA's statement on statistical significance and p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. https://doi.org/10.1080/00031305.2016.1154108
[4] Simmons, J. P., Nelson, L. D. & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
[5] Head, M. L. et al. (2015). The extent and consequences of p-hacking in science. PLOS Biology, 13(3), e1002106. https://doi.org/10.1371/journal.pbio.1002106
[6] Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57(1), 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x