C Voss

What to Do With a Failed Phase III Trial

Article
3.23.2026

Your Phase III trial missed its primary endpoint. The press release is drafted. The programme review is scheduled. And in your data management system sits a dataset that cost somewhere north of $500 million to generate – patient outcomes, biomarker panels, genetic data, adverse event profiles, concomitant medications – that you're about to archive.

That dataset probably contains something worth knowing. Maybe something worth millions. Will you find it?

The $500M Problem

Around half of all Phase III trials fail [1]. The reasons vary – wrong dose, wrong population, wrong endpoint, wrong timing – but the data doesn't disappear when the trial does. You have measurements on hundreds or thousands of patients, taken at multiple timepoints, across dozens or hundreds of variables. That's a rare and expensive resource.

After a failure, the instinct is to do one of two things: kill the programme entirely, or argue loudly that the trial wasn't truly negative. Both are understandable. Neither is a rigorous response to what the data actually contains.

The problem with declaring complete defeat is obvious – you may be discarding a drug that works for a specific population, or that has a measurable effect masked by population heterogeneity. The problem with advocacy is subtler but more dangerous: the people making the case for continuation are typically the same people who designed the trial, believe in the drug, and have spent years on the programme. The evidence they produce in support of continuation will be viewed – rightly – with suspicion.

What's needed is something different: a principled, systematic exploration of the data that can tell you what's actually there, rather than what you hoped would be there.

Why Standard Post-Hoc Analysis Fails

The standard approach to a failed trial is to run post-hoc subgroup analyses – slice the patient population by age, gender, disease severity, biomarker status, and see if anything looks better. This is well-intentioned but deeply problematic, and the scientific and regulatory communities know it.

The core issue is multiple comparisons. If you test enough subgroups, you'll find something that looks significant purely by chance. With even a modest number of binary variables – say, twenty – you're generating over a million possible subgroup combinations. At a significance threshold of 0.05, roughly one in twenty of those tests will come back "significant" even if the drug does nothing at all. The literature is sceptical of these findings for good reason [2]: most don't replicate, most aren't pre-specified, and most exist because someone went looking for a rescue narrative.
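To make that concrete, here's a minimal simulation – illustrative only; the sample size, covariates, and subgroup rule are invented, not drawn from any trial – showing how pure noise yields "significant" subgroups at the 0.05 threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patients, n_covariates, n_subgroups_tested = 600, 20, 1000

# A trial in which the drug truly does nothing: outcomes are pure noise,
# independent of treatment arm and of every baseline covariate.
treatment = rng.integers(0, 2, n_patients)             # 0 = placebo, 1 = drug
outcome = rng.normal(size=n_patients)                   # e.g. change from baseline
covariates = rng.integers(0, 2, (n_patients, n_covariates))

false_positives = 0
for _ in range(n_subgroups_tested):
    # Define a subgroup by requiring two randomly chosen covariates to equal 1,
    # mimicking an analyst slicing the population post hoc.
    idx = rng.choice(n_covariates, size=2, replace=False)
    in_subgroup = (covariates[:, idx] == 1).all(axis=1)
    drug = outcome[in_subgroup & (treatment == 1)]
    placebo = outcome[in_subgroup & (treatment == 0)]
    if len(drug) > 10 and len(placebo) > 10:
        _, p = stats.ttest_ind(drug, placebo)
        false_positives += p < 0.05

print(f"'Significant' subgroups found in pure noise: {false_positives} of {n_subgroups_tested}")
# Roughly 5% of the tests cross p < 0.05 despite there being no true effect anywhere.
```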

The edaravone story illustrates this well. The drug's first Phase III trial in ALS was negative overall, but a post-hoc subgroup analysis identified a subset of patients – those in earlier disease stages with preserved respiratory function – who appeared to benefit [3]. A second, smaller trial was designed around that subgroup and showed a positive result, leading to FDA approval. But the approach has been contested from the start: the subgroup definition was derived post-hoc from the failed trial, the confirmatory study was small, and independent commentators have questioned whether the effect is robust or an artefact of the selection process [4]. Whether the drug genuinely works for some patients remains unclear – and that ambiguity is a direct consequence of the analytical approach.

So how do you generate subgroup signals that people will actually trust?

What Principled Exploration Looks Like

Standard post-hoc analysis looks for subgroups by hand, using human intuition about which slices might be interesting, without a systematic framework, and without any mechanism to distinguish real signal from noise.

A principled alternative has a few defining features.

Systematic search, not selective search. Rather than asking "do patients over 65 do better?", the approach lets the data surface its own structure – identifying patient clusters, interaction effects, and predictive patterns without pre-specifying which variables to examine. The researcher's thumb comes off the scale.

Discovery separated from validation. The dataset is split at the outset: one portion for pattern discovery, one held back for testing. Any signal found in the discovery phase gets tested against data the model has never seen – the only reliable way to distinguish a real finding from an artefact.
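In code, the discipline is simple to state. Here is a sketch of the split-then-confirm pattern – the file path, column names, and threshold values are hypothetical placeholders, not a prescribed workflow:

```python
import pandas as pd
from scipy import stats
from sklearn.model_selection import train_test_split

# One row per patient: baseline covariates, a treatment-arm flag, and the
# primary outcome. File and column names here are illustrative placeholders.
df = pd.read_csv("trial_patient_level.csv")

# Split ONCE, up front, before anyone looks for patterns. The holdout is locked
# away and touched only to confirm (or kill) a hypothesis found in discovery.
discovery, holdout = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["treatment_arm"]
)

# ...pattern discovery runs on `discovery` only and yields a candidate subgroup,
# say "biomarker_x > 12 and age < 60"...
in_subgroup = (holdout["biomarker_x"] > 12) & (holdout["age"] < 60)
sub = holdout[in_subgroup]

# Confirmatory test: does the treatment effect hold in patients the discovery
# step never saw?
t, p = stats.ttest_ind(
    sub.loc[sub["treatment_arm"] == "drug", "outcome"],
    sub.loc[sub["treatment_arm"] == "placebo", "outcome"],
)
print(f"Holdout subgroup treatment effect: t = {t:.2f}, p = {p:.3f}")
```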

High-dimensional interactions. Human-designed subgroup analyses are almost always univariate – older patients, or patients with low baseline ALSFRS-R, or patients with a particular SNP. But many biologically meaningful effects are multivariate: a drug might work specifically in patients who are older and have a particular biomarker profile and haven't been on prior immunotherapy. Standard statistical methods struggle to detect these interactions. Machine learning handles them natively.
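A toy example shows why – the data below is simulated, not from any trial. A benefit that exists only in patients who are biomarker-positive and treatment-naïve contributes little to any single-variable comparison, but a tree-based learner picks the interaction up without being told where to look:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 2000
treated = rng.integers(0, 2, n)
biomarker_pos = rng.integers(0, 2, n)
prior_therapy = rng.integers(0, 2, n)
age = rng.normal(60, 10, n)          # pure noise with respect to response

# The drug helps only patients who are biomarker-positive AND treatment-naive:
# a three-way interaction with treatment that one-variable-at-a-time slicing
# dilutes to near invisibility.
outcome = rng.normal(size=n) + 1.5 * treated * biomarker_pos * (1 - prior_therapy)

X = np.column_stack([treated, biomarker_pos, prior_therapy, age])
model = GradientBoostingRegressor(random_state=0).fit(X, outcome)

imp = permutation_importance(model, X, outcome, n_repeats=10, random_state=0)
for name, score in zip(["treated", "biomarker_pos", "prior_therapy", "age"],
                       imp.importances_mean):
    print(f"{name:14s} importance {score:.3f}")
# The three interacting variables surface with non-trivial importance; age does not.
```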

Interpretable outputs. A black-box model that says "these 47 patients would have responded" isn't useful on its own – you need to know why, in terms that are biologically coherent and defensible to regulators. Interpretability methods can render model decisions legible: this subgroup is defined by these variables, in these ranges, with these interactions.
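One common way to get that legibility is a surrogate decision tree fitted to the model's per-patient benefit estimates. The sketch below simulates those estimates and uses invented variable names; real pipelines often layer on richer interpretability methods:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(2)
n = 1500
biomarker_pos = rng.integers(0, 2, n)
prior_therapy = rng.integers(0, 2, n)
age = rng.normal(60, 10, n)

# Suppose an upstream model has produced a per-patient estimate of treatment
# benefit (simulated here: benefit concentrated in biomarker-positive,
# treatment-naive patients). A shallow surrogate tree turns those estimates
# into explicit, checkable rules.
estimated_benefit = 1.5 * biomarker_pos * (1 - prior_therapy) + rng.normal(0, 0.3, n)

X = np.column_stack([biomarker_pos, prior_therapy, age])
surrogate = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, estimated_benefit)

print(export_text(surrogate, feature_names=["biomarker_pos", "prior_therapy", "age"]))
# The printed tree reads as nested if/else rules – "biomarker_pos > 0.5 and
# prior_therapy <= 0.5 -> high predicted benefit" – a form that can be checked
# for biological coherence before anyone acts on it.
```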

The FDA's guidance on enrichment strategies [5] is explicit that prospective-retrospective analyses – where the subgroup hypothesis is generated from one dataset and validated on an independent one – are the appropriate framework for informing follow-up trials. The goal is to generate a rigorous, independently testable hypothesis that can inform what comes next – not to rescue the failed trial.

Four Possible Outcomes

If you approach the data systematically, there are four meaningful things you might find.

A responder subgroup

The most commercially valuable finding: a subset of patients who showed a genuine, robust treatment effect, definable by variables you measured. If the effect size in this subgroup is clinically meaningful, and the subgroup is large enough to be practical, this is the basis for a biomarker-enriched follow-up trial with a narrower indication.

This is how precision oncology progressed from broad chemotherapy to targeted therapies. Many of those targeted drugs failed their initial trials in broad populations – it was only by identifying who responded that they reached the clinic at all. The lesson generalises: a drug that works for 20% of patients works. The trial was just unenriched.

The regulatory path here is well-established. The FDA's enrichment guidance describes exactly this scenario – a failed trial generating a prospectively-defined hypothesis for a follow-up study in a molecularly or clinically defined subgroup. Reprioritisation, not rescue.

A predictive biomarker

Distinct from a responder subgroup is the case where a biomarker predicts response in a way that's mechanistically coherent and potentially measurable in clinical practice. This is the foundation of a companion diagnostic – a test that stratifies patients before treatment, allowing you to prescribe only to those likely to benefit.

Companion diagnostics have transformed oncology (EGFR mutation status and erlotinib; HER2 amplification and trastuzumab; PD-L1 expression and checkpoint inhibitors) and are beginning to do the same in other indications. A failed trial that surfaces a predictive biomarker has discovered what it should have known before it started. That biomarker is a platform asset.

A genuine null result

Not every failed trial contains a hidden responder. Some drugs genuinely don't work, in any meaningful subgroup, at any dose, in any population measurable by the variables you collected. This is useful too – it's just less exciting.

A systematic analysis that searches for signal and finds none is the most credible possible basis for a kill decision. It removes the ambiguity that haunts programmes where the data was never properly interrogated: the nagging question of whether someone who looked harder would have found something. If you searched systematically and found nothing, you can close the programme with confidence.

This matters for portfolio allocation. Programmes that are killed cleanly release resources. Programmes that linger in a state of contested ambiguity absorb resources and attention indefinitely.

A licensable insight

Even where the primary drug hypothesis is exhausted, a rich dataset may contain findings of value to someone else. A competitor pursuing a different mechanism in the same indication might be interested in the natural history data, the biomarker profiles, or the evidence about which patient characteristics predict disease progression independent of treatment. Biotech and pharma companies increasingly treat failed trial data as a separable asset – one that can be licensed, partnered, or used to anchor a new collaboration.

This requires the data to be well-characterised and the findings to be credibly documented. A systematic analysis produces exactly that.

The Governance Problem

There's a structural issue that sits beneath all of this.

The people best placed to analyse a failed trial's data are the same people who designed the trial, believe in the drug, and have the most invested in a particular outcome. This naturally creates a conflict. When the Head of Translational Medicine runs the post-hoc analyses, the results will be viewed sceptically – by the board, by potential partners, by regulators, and by the scientific community. This is rational scepticism. The analysis was done by someone with strong prior beliefs and the latitude to make analytical choices that favour those beliefs, whether or not they intended to.

Independent analysis changes this dynamic. When the exploration is done by a system that searches the data systematically – without access to the programme's history, without a stake in the outcome, with a documented and reproducible methodology – the findings carry different epistemic weight. The system isn't infallible, but it removes the most obvious source of bias from the process.

This is the governance argument for systematic data analysis, and it's often more compelling than the scientific one. The science makes the case for what's possible to find. The governance case is about what findings can actually be used – what you can take to a board, to a regulatory agency, to a licensing partner, and defend without being accused of p-hacking your way to a narrative.

Both matter. But in the weeks after a failed trial, when the question is whether to kill or continue, the governance case is often the deciding one.

Doing This in Practice

The practical barriers are real. You need to clean and harmonise the dataset, select an appropriate modelling approach, pre-specify the discovery-validation split, render outputs in a form that clinicians can actually evaluate – and make the whole process reproducible and auditable.
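One lightweight way to make the pre-specification auditable is to freeze the analysis plan as a versioned artefact before the exploratory work starts. A sketch, with illustrative field names rather than any required schema:

```python
import hashlib
import json
from datetime import date

# Analysis plan frozen and committed to version control before the data is
# opened for exploration. Hashing the plan gives a tamper-evident record of
# what was decided up front. Field names and values are illustrative.
analysis_plan = {
    "date_frozen": str(date.today()),
    "discovery_fraction": 0.7,
    "holdout_fraction": 0.3,
    "split_seed": 42,
    "primary_validation_test": "two-sided t-test on primary endpoint, alpha = 0.05",
    "candidate_models": ["gradient boosting", "penalised regression"],
    "interpretability_output": "surrogate decision tree, max depth 3",
}

plan_bytes = json.dumps(analysis_plan, sort_keys=True).encode()
print("Plan fingerprint:", hashlib.sha256(plan_bytes).hexdigest()[:16])
```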

This is where most organisations get stuck. They have the data, and they have statisticians who can run subgroup analyses, but they don't have the infrastructure to do this kind of systematic, ML-driven exploration at the speed and rigour the situation requires. Building it from scratch – custom preprocessing pipelines, model selection, interpretability methods, validation frameworks – takes months and requires expertise that most internal teams don't have on standby.

Disco is designed for exactly this situation. It takes tabular clinical data, automatically identifies the structure of patient responses, surfaces subgroup signals with validation built into the pipeline, and generates interpretable outputs that describe what the model found and why. It runs end-to-end without months of custom development, and it produces outputs in a form that can be reviewed, interrogated, and defended.

You're not going to recover a failed trial by running it again on wishful thinking. But a dataset that cost hundreds of millions of dollars to generate deserves a serious attempt at finding what's in it – done properly, documented clearly, and separated from the advocacy pressures that make post-hoc analysis so unreliable in practice.

If you have a failed trial you're reviewing – or a programme in late-stage development where you'd want to understand the responder landscape before you get there – we'd like to talk.

References

[1] Hwang, T. J. et al. (2016). Failure of investigational drugs in late-stage clinical development and publication of trial results. JAMA Internal Medicine, 176(12), 1826–1833. https://doi.org/10.1001/jamainternmed.2016.6008

[2] Sun, X., Briel, M., Busse, J. W. et al. (2012). Credibility of claims of subgroup effects in randomised controlled trials: systematic review. BMJ, 344, e1553. https://doi.org/10.1136/bmj.e1553

[3] Edaravone (MCI-186) ALS 16 Study Group. (2017). A post-hoc subgroup analysis of outcomes in the first phase III clinical study of edaravone (MCI-186) in amyotrophic lateral sclerosis. Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, 18(sup1), 11–19. https://doi.org/10.1080/21678421.2017.1363780

[4] Turnbull, J. (2018). Is edaravone harmful? (A placebo is not a control). Amyotrophic Lateral Sclerosis and Frontotemporal Degeneration, 19(7–8), 477–482. https://doi.org/10.1080/21678421.2018.1517179

[5] U.S. Food and Drug Administration. (2019). Enrichment strategies for clinical trials to support approval of human drugs and biological products. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/enrichment-strategies-clinical-trials-support-approval-human-drugs-and-biological-products
