J Rumbelow

Data-Driven Discovery vs Hypothesis-Driven Research: Why Starting With the Data Works

Article
3.23.2026

A landmark study by Bloom, Jones, Van Reenen, and Webb found that the number of researchers required to achieve a given rate of economic or scientific progress has increased by a factor of roughly 23 over the past century [4]. We're running faster just to stay in place. That is exactly what you'd expect when the research methodology doesn't scale.

The problem with starting from a hypothesis

The standard scientific workflow – formulate a hypothesis, design an experiment to test it, analyse the results, publish – is so deeply embedded in scientific culture that questioning it feels almost impolite. But it has a fundamental epistemological problem: the hypothesis constrains what you can find.

When you begin with a hypothesis, every subsequent decision is shaped by it. Which variables to measure. Which controls to include. Which subgroups to analyse. Which results to highlight. Most of this happens unconsciously. Confirmation bias is one of the most robust findings in all of psychology, and scientists are not immune to it. When you believe something is true, you unconsciously design experiments to confirm it, weight evidence that supports it, and discount evidence that doesn't.

Ioannidis (2005) made the uncomfortable case that, given typical statistical power, publication rates, and the flexibility researchers have in how they analyse data, the majority of published research findings are probably false [1]. The paper has been cited over 10,000 times. Its core logic has never been seriously rebutted – only elaborated.
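To make the logic concrete, here is a minimal sketch of the kind of calculation behind that claim, with illustrative numbers I've chosen rather than figures taken from the paper: the probability that a statistically significant finding reflects a true effect, given power, the significance threshold, and the prior odds that a tested hypothesis is correct.

```python
# A worked instance of the Ioannidis-style argument (illustrative numbers,
# assumed rather than taken from the paper): the post-study probability that
# a "significant" finding is true, given power, alpha, and the prior
# probability that a tested hypothesis is correct.
def positive_predictive_value(prior: float, power: float, alpha: float) -> float:
    """P(hypothesis true | p < alpha), ignoring bias and analytic flexibility."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

# If only 1 in 20 tested hypotheses is actually true, with 80% power at alpha = 0.05:
print(round(positive_predictive_value(prior=1 / 20, power=0.8, alpha=0.05), 2))
# ~0.46 -- under these assumptions, most "positive" findings are false, and
# realistic power (often well below 80%) plus analytic flexibility make it worse.
```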

The replication crisis is the empirical proof of this argument. Across psychology, medicine, ecology, and economics, large-scale replication efforts have found that between 40% and 65% of published findings fail to replicate. The Open Science Collaboration's reproducibility project reproduced only 36 of 97 psychology studies [2]. The CONSORT reporting guidelines and the AllTrials campaign have documented how selective reporting in clinical trials distorts the evidence base used to make treatment decisions. This is not a peripheral problem – it sits at the centre of how science gets done.

Publication bias amplifies everything. Journals prefer positive results. Researchers know this. So negative results go in the file drawer, marginally significant results get nudged over the p < 0.05 threshold, and the published record becomes systematically skewed toward effects that aren't there. Meta-analyses built on this literature inherit the skew. Clinical guidelines built on those meta-analyses inherit it further, and the error compounds.

Path dependence and the narrowing of idea space

There's a subtler problem, too. Hypotheses don't come from nowhere – they come from prior literature. Which means that what gets studied today is heavily shaped by what got studied (and published) in the past. Citation network analyses show that science is becoming progressively more concentrated: a smaller fraction of papers accounts for a larger fraction of citations, and researchers increasingly work within established paradigms rather than challenging them [3].

This is path dependence in research – the tendency for prior choices to constrain future ones, independent of whether those prior choices were correct. When the literature that seeds today's hypotheses is itself riddled with non-replicating findings, the compounding effect is severe. Researchers build on foundations that haven't been properly stress-tested, pursuing incremental refinements in directions that may already be wrong.

The Bloom et al. finding on declining research productivity becomes less mysterious in this light. If you're systematically exploring a biased, path-dependent slice of the available idea space – guided by a flawed literature – you should expect diminishing returns.

Why LLMs don't fix this

The intuitive response is to reach for AI. Large language models trained on scientific literature can synthesise findings across papers, generate hypotheses, and if you hook them up to robots they can even run experiments.

And I think this will be good for science. But not as good as it could be, because LLMs inherit the biases of the literature they're trained on. Fundamentally, LLMs trained on papers are hypothesis generators that draw from the same path-dependent corpus as the researchers themselves. They accelerate traversal of the existing idea space, but they're still trapped inside it.

The data-first alternative

If you measure everything you possibly can about a phenomenon, without filtering by prior expectation, and then apply machine learning to find patterns in that data, the patterns you find are constrained not by what you expected but by what's actually there. The model has no hypothesis to protect.

What machine learning makes possible

This approach has been technically feasible for only a short time. Exploratory data analysis has always been part of science, but traditional statistical methods struggle badly with high-dimensional, non-linear interactions. If you have 50 variables and you want to understand every possible combination of them, you're looking at a combinatorial explosion that no human analyst can navigate – and standard correlation or regression analyses will miss interactions that only manifest under specific joint conditions.
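To put a rough number on that explosion – an illustration with assumed figures, not a statistic from any study cited here – the count of possible k-way interactions among 50 variables grows with the binomial coefficient, and the number of variable subsets grows as 2^50.

```python
# Illustration of the combinatorial explosion described above (assumed figures,
# not from the article): the number of k-way interactions among 50 variables.
from math import comb

n_vars = 50
for k in (2, 3, 4):
    print(f"{k}-way interactions: {comb(n_vars, k):,}")
# 2-way interactions: 1,225
# 3-way interactions: 19,600
# 4-way interactions: 230,300
# ... and 2**50 (roughly 1e15) possible variable subsets overall.
```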

Deep learning changes this. Modern neural networks can learn complex, non-linear relationships between arbitrary features and targets, including interaction effects that are invisible to standard statistical tests. The challenge – and it's been a serious one – is that neural networks have historically been black boxes. They can predict well without being able to explain why they're predicting well.
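Here is a small illustration of that point on simulated data, using a random forest for brevity rather than a neural network; the argument applies to any sufficiently flexible non-linear model. The outcome depends only on the product of two variables, so each variable on its own is uncorrelated with it, yet the joint pattern is easy to recover on held-out data.

```python
# A toy illustration (not from the article) of an interaction that marginal
# tests miss: y depends on the *product* of x1 and x2, so each variable alone
# is uncorrelated with y, yet a flexible model recovers the relationship.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
y = np.sign(X[:, 0]) * np.sign(X[:, 1]) + rng.normal(scale=0.1, size=5000)

# Marginal (one-variable-at-a-time) correlations are near zero.
print("corr(y, x1):", round(np.corrcoef(y, X[:, 0])[0, 1], 3))
print("corr(y, x2):", round(np.corrcoef(y, X[:, 1])[0, 1], 3))

# A non-linear model evaluated on held-out data finds the joint pattern.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", round(model.score(X_te, y_te), 3))  # close to 1.0
```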

AI interpretability solves this problem. When you can interrogate a trained model and extract the patterns it has learned to see – make those patterns human-readable and domain-legible – you get something qualitatively new: data-driven science grounded in empirical pattern-finding rather than prior belief.
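As one deliberately simple example of what interrogating a trained model can mean, here is permutation importance: shuffle each input in turn and measure how much held-out performance drops. This is an illustrative technique only, not a claim about which interpretability methods Disco uses.

```python
# One simple interpretability technique, shown for illustration (permutation
# importance; the article does not specify which methods its tooling uses):
# shuffle each feature and measure how much held-out performance drops,
# turning a black-box model into a ranked, inspectable list of signals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))                        # 5 candidate variables
y = X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=3000)   # only x0 * x1 matters

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"x{i}: importance {result.importances_mean[i]:.3f}")
# x0 and x1 rank far above the irrelevant variables.
```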

This is what Disco does. It ingests raw data, trains models appropriate to the structure of that data, applies interpretability to extract the patterns those models have learned, and presents them as falsifiable, human-readable findings. The process is fully automated. A dataset goes in; out comes a structured report of patterns – with effect sizes, statistical significance, and hold-out validation – contextualised against the existing literature.

The epistemological case

The philosophy of science here is closer to exploratory data analysis as Tukey imagined it – but powered by machinery that Tukey couldn't have anticipated. Let observations precede theory. Measure broadly. Be genuinely open to finding something unexpected.

What's different now is that the machinery – the pattern recognition, the interpretability, the automation – is finally good enough to make this practical without a dedicated data science team.

The result is a scientific method that is structurally less prone to the failure modes that have produced the replication crisis, the path dependence, and the declining productivity that characterise modern research. Not immune to them – no method is – but less vulnerable by design.

If you're working with data and want to see what's in it, without deciding in advance what to look for, Disco is available now and free to use.

References

[1] Ioannidis, J. P. A. (2005). Why most published research findings are false. PLOS Medicine, 2(8), e124. https://pmc.ncbi.nlm.nih.gov/articles/PMC1182327/

[2] Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716. https://www.science.org/doi/10.1126/science.aac4716

[3] Park, M., Leahey, E., & Funk, R. J. (2023). Papers and patents are becoming less disruptive over time. Nature, 613, 138–144. https://doi.org/10.1038/s41586-022-05543-x

[4] Bloom, N. et al. (2020). Are ideas getting harder to find? American Economic Review, 110(4), 1104–1144. https://doi.org/10.1257/aer.20180338

Get started

Your data has more to tell you.

Upload a dataset and get ranked, validated discoveries in minutes. Free for open data.

Try Disco