Your statistical analysis plan tests what you hypothesised. It doesn't test what's actually in the data.
When a reviewer asks whether you looked at a particular biomarker × treatment interaction, the honest answer is usually that you didn't know to look. The SAP was locked before the data told you anything. And once the trial is unblinded, any subgroup analysis you run is exploratory at best, and fishing at worst – both in your eyes and the regulator's.
This is the fundamental tension in clinical trial analysis. Pre-specification gives findings credibility. But pre-specification requires hypotheses, and hypotheses require prior knowledge. When the interaction you missed would have changed the interpretation of the trial – and sometimes it would – that's a patient outcomes problem.
Why standard approaches fall short
The established response to this is exploratory subgroup analysis. Run the treatment comparison within each level of a handful of baseline variables, look for heterogeneous treatment effects, apply a Bonferroni correction or two, and report what you find with appropriate caveats.
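To make the standard approach concrete, here is a minimal sketch of that workflow: one treatment × covariate interaction test per baseline variable, Bonferroni-corrected across the tests actually run. The data is synthetic and the column names (`crp`, `ecog`, etc.) are purely illustrative, with one interaction deliberately planted so the screen has something to find.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "age": rng.normal(60, 10, n),
    "crp": rng.lognormal(1.0, 0.5, n),
    "ecog": rng.integers(0, 3, n),
})
# Planted treatment x CRP interaction, so the screen has a true positive to detect
df["outcome"] = (0.5 * df["treatment"]
                 + 0.3 * df["treatment"] * df["crp"]
                 + rng.normal(0.0, 1.0, n))

covariates = ["age", "crp", "ecog"]
pvals = {}
for cov in covariates:
    # One OLS fit per covariate: outcome ~ treatment + cov + treatment:cov
    fit = smf.ols(f"outcome ~ treatment * {cov}", data=df).fit()
    pvals[cov] = fit.pvalues[f"treatment:{cov}"]

alpha = 0.05 / len(covariates)  # Bonferroni across the tests actually run
flagged = sorted(c for c, p in pvals.items() if p < alpha)
print(flagged)
```

Note that the correction only covers the tests you chose to run: an interaction involving a variable you never screened stays invisible, which is exactly the limitation discussed below.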
This works, up to a point. But it has well-documented limitations that the biostatistics literature has been grappling with for decades.
First, it's manual and assumption-driven. You choose which subgroups to look at based on clinical intuition or prior literature. If the interaction involves a variable you didn't think to check, or a combination of variables that wouldn't have seemed plausible a priori, you won't find it. A 2016 paper in BMC Medical Research Methodology examined the different purposes that subgroup analyses serve in confirmatory clinical trials, and found that methods and purposes are frequently conflated in practice – a distinction that matters, because different purposes require fundamentally different methodological approaches [1].
Second, it doesn't scale. Pairwise interaction testing across even a modest number of covariates – say, 20 or 30 baseline variables – generates hundreds of tests. Three-way interaction effects, where the treatment response depends on the combination of two baseline characteristics simultaneously, are almost never systematically explored. They're too expensive to test exhaustively, too difficult to visualise, and easy to rationalise skipping. But three-way interactions are real, and they matter – particularly in oncology, where response to treatment often depends on a constellation of tumour and patient characteristics acting together.
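The scaling claim is easy to verify with a quick count: one treatment × X test per covariate, plus one treatment × X × Y test per covariate pair.

```python
from math import comb

# Tests implied by p baseline covariates: p two-way (treatment x X)
# plus C(p, 2) three-way (treatment x X x Y) interaction tests.
for p in (20, 30):
    two_way = comb(p, 1)
    three_way = comb(p, 2)
    print(f"p={p}: {two_way} two-way + {three_way} three-way = "
          f"{two_way + three_way} tests")
# p=20: 20 two-way + 190 three-way = 210 tests
# p=30: 30 two-way + 435 three-way = 465 tests
```

And that is before considering continuous-variable cut-points or four-way combinations, each of which multiplies the count again.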
Third, you can't validate on hold-out data after the fact. Once the trial is unblinded and you've run your exploratory analyses, there's no clean held-out set to test whether a finding is real or artefactual.
A landmark paper in the New England Journal of Medicine documented just how often subgroup analyses in major trials are reported without adequate pre-specification or correction for multiplicity, and how this leads to overconfident claims about differential treatment effects that don't replicate [2]. The problem isn't that people do subgroup analysis badly – the standard toolkit makes it structurally difficult to do well.
What systematic interaction discovery looks like
The alternative is to treat interaction discovery as a high-dimensional pattern recognition problem – which it is.
Methods like iFOR (iterative Feature and Outcome Regression) and HDSI (High-Dimensional Subgroup Identification) approach this more systematically, using regularisation and recursive partitioning to search the interaction space more completely than manual methods allow. They represent a step in the right direction. But they still require expertise to apply, generate outputs that are hard to interpret clinically, and don't naturally extend to discovering the structure of interactions – not just whether an interaction exists, but what the underlying pattern looks like, and whether it's robust across subsets of the data.
Machine learning changes what's tractable here. A deep learning model trained on your tabular clinical data will, if it achieves genuine predictive performance, have learned something about how baseline characteristics combine to determine treatment response. The question is whether you can extract that knowledge in a form that's scientifically and regulatorily useful.
Interpretability is the key step. Not SHAP alone – SHAP is excellent at marginal feature importance but doesn't naturally surface the interaction structure you care about. What you need is a method that identifies which combinations of baseline variables are jointly predictive of heterogeneous treatment effects, ranks them by effect size and statistical significance, and renders them in terms a clinician can evaluate. The combination of, say, elevated CRP and a low baseline ECOG performance status might jointly predict a differential response that neither variable predicts alone. That's the kind of thing that gets missed.
Hold-out validation is what makes a finding defensible. If a model trained on 70% of your data identifies a biomarker × treatment interaction, and that interaction validates on the remaining 30% with maintained statistical significance, you have something qualitatively different from a post-hoc subgroup analysis. You have evidence that the pattern generalises, that it's not an artefact of multiple testing, and that it was discoverable from the data before you looked at outcomes.
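The discovery-then-validation split can be sketched in a few lines: screen every treatment × covariate interaction on a 70% discovery split, then run a single confirmatory test of the top candidate on the held-out 30%. Synthetic data again; variable names are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "treatment": rng.integers(0, 2, n),
    "age": rng.normal(60, 10, n),
    "crp": rng.lognormal(1.0, 0.5, n),
    "ecog": rng.integers(0, 3, n),
})
# Planted treatment x CRP interaction
df["outcome"] = (0.5 * df["treatment"]
                 + 0.3 * df["treatment"] * df["crp"]
                 + rng.normal(0.0, 1.0, n))

train, holdout = train_test_split(df, test_size=0.3, random_state=0)

covariates = ["age", "crp", "ecog"]
# Discovery: screen all candidate interactions on the training split only
train_p = {c: smf.ols(f"outcome ~ treatment * {c}", data=train).fit()
                 .pvalues[f"treatment:{c}"]
           for c in covariates}
best = min(train_p, key=train_p.get)

# Validation: ONE pre-chosen test on the untouched hold-out split
p_holdout = (smf.ols(f"outcome ~ treatment * {best}", data=holdout).fit()
             .pvalues[f"treatment:{best}"])
print(best, p_holdout < 0.05)
```

Because exactly one hypothesis reaches the hold-out split, the confirmatory p-value needs no multiplicity correction – that is what makes the validated finding qualitatively different from a post-hoc scan.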
The workflow that matters: before Phase III
The correct time to run systematic interaction discovery is before you lock the Phase III SAP.
The evidence base for this is your Phase I/II data, or existing prior trials where you have individual patient data. Run Disco on that data. Let it search the full interaction space – pairwise and higher-order – across all the baseline variables you've collected. Use interpretability to identify the patterns that are strongest, most clinically coherent, and most likely to represent genuine heterogeneity of treatment effect. Validate on held-out subsets. Then take the ranked list of validated findings to your SAP, and pre-specify the ones that meet your credibility threshold.
The output is a prioritised set of candidates for pre-specification – hypotheses generated from the data itself, validated before the pivotal trial runs, and defended on that basis. These feed directly into SAP design as inputs to pre-specification, not as results to be reported alongside the primary analysis.
The practical workflow looks like this: gather all available pre-Phase III data with treatment assignment and outcomes. Upload it to Disco. The system automatically handles preprocessing, trains models that capture non-linear and interaction effects, applies interpretability to surface the most robust patterns, and validates findings on held-out data. The output is a structured report listing the relationships found – ranked by effect size, statistical significance, and prevalence in the dataset. You bring that report to your SAP design meeting and decide which findings to pre-specify.
The whole process – from data upload to actionable output – takes hours, not weeks.
The regulatory angle
The FDA's 2019 guidance on adaptive clinical trial designs explicitly encourages prospective use of adaptive and model-informed approaches to trial design, including pre-specification of subgroup analyses informed by prior data [3]. The guidance recognises that evidence-based pre-specification – hypotheses derived from prior trials or early-phase data, validated before the pivotal study – is different in kind from post-hoc exploration, and can be treated accordingly.
This creates a genuine regulatory opening. If you can show that a subgroup interaction was identified from prior data, validated on held-out samples from that data, and pre-specified before the pivotal trial began, you're in a defensible position even if the analysis wasn't in the original protocol from day one. The FDA understands that what you include in the SAP is bounded by what you knew when you wrote it. What you cannot do is mine the pivotal trial data, find an interaction, and claim it was prospectively motivated.
The validation piece is critical to the regulatory argument. A finding that survives hold-out validation on prior data has a prior probability of being real that a purely exploratory post-hoc analysis doesn't. Regulators know the difference, and it shows in how they respond to these analyses at review.
There's also a practical benefit during review. When a reviewer asks whether you looked at a particular subgroup interaction, "we ran a systematic data-driven analysis of our Phase II data before locking the Phase III SAP, and this interaction wasn't among the findings that validated" is a much stronger answer than "we didn't think to look." It shows that the absence of a finding was informative, not just an omission.
What this doesn't solve
Systematic data-driven interaction discovery finds what's in your prior data. That comes with real constraints. If the Phase I/II population is materially different from the Phase III population, interactions found in one may not translate to the other. If your earlier trials were small, the interaction estimates will be noisy even after hold-out validation. And the method can only find interactions that are present in the variables you collected – heterogeneous treatment effects driven by unmeasured biological mechanisms won't appear.
There's also the question of how many findings to pre-specify. The temptation is to pre-specify everything the discovery process flags, which reproduces the multiplicity problem you were trying to solve. The right answer is to apply a credibility threshold – probably a combination of effect size and biological coherence – and pre-specify only the findings that clear it. This requires judgment, and it's better exercised before unblinding than after.
None of this replaces the primary analysis or the pre-specified secondary endpoints that define the trial's inferential architecture. It's a method for using available prior evidence more completely when designing the SAP.
Try it on your data
Disco handles this workflow end-to-end. Upload your Phase I/II tabular data – treatment assignments, baseline characteristics, outcome measures – and the system will automatically search for interaction effects across the full feature space, including three-way and higher-order combinations, and validate findings before you ever look at pivotal trial outcomes. The output is a structured report you can take directly into your statistical analysis design process.
If you have data from a prior trial and a Phase III SAP in preparation, we'd like to run this for you. Get in touch or try Disco directly.
References
[1] Tanniou, J., van der Tweel, I., Teerenstra, S., & Roes, K. C. B. (2016). Subgroup analyses in confirmatory clinical trials: time to be specific about their purposes. BMC Medical Research Methodology, 16(20). https://doi.org/10.1186/s12874-016-0122-6
[2] Wang, R., Lagakos, S. W., Ware, J. H., Hunter, D. J., & Drazen, J. M. (2007). Statistics in medicine – reporting of subgroup analyses in clinical trials. New England Journal of Medicine, 357(21), 2189–2194. https://doi.org/10.1056/NEJMsr077003
[3] U.S. Food and Drug Administration. (2019). Adaptive designs for clinical trials of drugs and biologics: Guidance for industry. https://www.fda.gov/media/78495/download