J Rumbelow

The patterns that agents miss

Article
4.6.2026

Every AI data analysis tool makes the same promise: upload your data, ask a question, get insights.

The promise matters because the problem is real – most organisations are sitting on valuable data that they don't have the time, expertise, or tooling to properly analyse. Clinical researchers have trial data with subgroup effects they'll never find by hand. Operations teams have process data with combinatorial failure modes buried in the noise. Scientists have experimental results where the interesting patterns are non-linear interactions between variables – the kind you'd never think to test for unless you already suspected they were there.

Automated data analysis could change this. If an AI agent can reliably find the patterns that matter in a dataset – not just the obvious correlations, but the complex, conditional, combinatorial relationships – that could be transformative for decision-making and discovery across science, medicine, and industry. A doctor gets subgroup-specific treatment guidance instead of population averages. A plant biologist finds the gene-environment interaction that explains why a cultivar fails under specific conditions. A manufacturing team discovers the particular combination of process parameters that causes defects.

But only if the analysis is actually right.

We built a dataset where we know exactly what the key patterns are, and put 18 data analysis agents to the test.

The Dataset

We generated 5,000 synthetic clinical trial records with recovery_score as the outcome variable. The data contains 17 features (treatments, demographics, biomarkers, lifestyle factors, etc.) – and five planted signals that influence recovery_score.

Two of these patterns are a bit tricky – the signals that matter most aren't in individual feature values, but in specific cross-feature combinations.

In real data, this is common. In genetics, ignoring interactions between genes may account for the majority of "missing heritability" in complex diseases [1,2]. In pharmacology, roughly one in five patients is exposed to potentially interacting drug-drug-gene combinations [3]. In any sufficiently complex dataset, from materials science to user behaviour, the patterns that matter most often involve specific combinations of variables, not individual features in isolation. These kinds of relationships are often the hardest to find, unless you already know what to look for.

In this dataset, we have two combinatorial patterns of this kind:

  1. A two-way interaction. When response_1 > 3.0 AND response_2 < 1.5, recovery score jumps by +30 points. Neither variable has any planted marginal effect – the signal only exists in the combination.

  2. A three-way interaction. When age > 55 AND bmi < 27 AND treatment_A = True, recovery score drops by −35 points. Again, none of these features have any planted individual effect. Treatment A does nothing to ~90% of patients but costs 35 recovery score points for the ~10% who are older and lean.

The remaining three signals are smaller, simpler effects – the kind that any competent analysis should pick up:

  • smoking_pack_years has a counterintuitive positive linear relationship with recovery (r ≈ 0.44).
  • systolic_bp > 150 adds +6 points (a threshold effect).
  • treatment_B = True adds +4 points.

Everything else – treatment C, treatment D dosage, diabetes status, blood type, cholesterol, resting heart rate, alcohol consumption, day of treatment cycle – is pure noise.

The interactions account for ~72% of the variance in the outcome. They are, by far, the most important thing in the dataset with respect to determining recovery_score. An agent that misses them is missing the point.

The Protocol

We uploaded the CSV to each tool and typed the same prompt: "I need to understand what influences recovery_score."

If the agent offered to dig further into any particular feature or relationship, we said yes. If it offered choices, we said do all of them. If the analysis ended without a summary, we asked for one. When configuration or plan options were available, we used the free plan with default settings – for at least some of these tools, upgrading might improve performance.

The Results

| Tool | 2-way threshold interaction (r1 × r2) | 3-way interaction (age × bmi × A) | Note |
|---|---|---|---|
| Disco | Found | Found | Are you surprised? |
| Edison | Partial – interaction detected, no thresholds | Partial – "gatekeeper" structure detected, no thresholds | Continuous coefficients only; didn't consider threshold effects |
| TextQL | Missed | Partial – A×age and A×BMI found separately, never unified for full effect | |
| Delphina | Partial – found r1 > 2.9 and r2 < 1.5 thresholds separately | Partial – found age > 55 and BMI ~27 thresholds separately | Concluded "most relationships are non-linear" but didn't tell us what they were |
| Claude | Identified relevant quartiles in the two variables separately | Missed | Suggested interaction modelling as a next step, but ran out of budget |
| Hex | Found | Partial, but fragmented into inconsistent 2-ways | A×BMI direction backwards! |
| Dremio | Missed | Tangled up with smoking, but sort of there | Seems extremely keen on stratifying everything by treatment, which is sensible? |
| Wobby | Missed | Missed | Univariate only – never discovered subgroups |
| Veezoo | Missed | Missed | Suggested interactions as follow-up but didn't run them |
| Zerve | Missed | Missed | Noticed response_2 ranks differently across methods but didn't investigate |
| Julius | Missed | Missed | Gave good general advice, though |
| Science Machine | Missed | Missed | Noted the R² gap implies interactions, then didn't look for them |
| ChatGPT | Missed | Missed | No effect sizes, no R², nothing auditable |
| Graphite Note | Missed | Missed | Turned a +6 pt synthetic BP effect into a 158–170 mmHg clinical target |
| Domo | Missed | Missed | Hallucinated findings from variable names |
| Quadratic | Missed | Missed | Spent entire analysis budget investigating the smoking marginal |
| Mammoth | Missed | Missed | More of a data profiling tool, really |
| Alteryx | Missed | Missed | Segmented by noise variables (diabetes, blood type) |
| GoodData | Missed | Missed | Returned the mean recovery_score. That's it. |

Disco was the only tool that found every pattern, including the marginal effects. You can see our results here.

Tool-by-tool breakdown

Edison

Edison results

Edison ran a tree-based model, computed SHAP interaction values, and stratified correlations across subgroups. It correctly identified Treatment A as a "gatekeeper" that activates the effects of age and BMI, and flagged response_1 × response_2 as the strongest pairwise interaction (SHAP = 2.38), noting the interaction is "synergistic". For the Treatment A interaction, it reported that BMI and age are "only predictive when Treatment A is given" (r = 0.37 vs ≈ 0 for BMI, r = −0.36 vs ≈ 0 for age). It also correctly identified the noise variables as non-significant.

Where it fell short: continuous coefficients lack precision. Edison reported that response_1 and response_2 interact strongly, but couldn't pin down "when response_1 > 3.0 AND response_2 < 1.5, recovery jumps by +30." Without those thresholds, it's hard to act on the finding (though we imagine a follow-up question would help). The three-way interaction was similarly detected but never unified – it came through as two separate pairwise SHAP interactions (Treatment A × BMI, Treatment A × age) without the conjunction.

TextQL

TextQL results

TextQL was the only tool that explicitly tested interaction terms in a regression. It ran OLS with Treatment A × age and Treatment A × BMI as separate interaction terms, plus Mann-Whitney U tests on the subgroups. The results were correct: Treatment A × age 60+ showed −14.5 points (p < 0.001), Treatment A × BMI underweight showed −13.8 points (p < 0.001). It correctly noted that Treatment A's negative effect "essentially disappears in obese patients."

But it fell short in three ways. First, while it found two parts of the three-way interaction, it never unified them – never said "older AND leaner AND treated." Second, it never tested response_1 × response_2 at all, presumably because Treatment A had the largest marginal effect and the analysis focussed on that. Third, despite finding the conditional structure, the report still framed Treatment A as consistently harmful across subgroups, with the interaction qualifying the magnitude rather than changing the recommendation. The correct interpretation is that Treatment A is harmless for ~90% of patients and only costs recovery points for the specific older-and-leaner subgroup.

Delphina

Delphina results

Delphina ran a random forest (R² = 0.96) and produced a clean summary table with correct features, directions, and magnitudes. It identified threshold effects on response_1 (~3), response_2 (<1.5), age (>55), BMI (~27), and systolic BP (>150) – all five planted breakpoints, individually approximately correct. It even flagged that "most relationships are non-linear" and that variables exhibit "threshold/step effects rather than gradual slopes."

All the pieces were there, but it reported each threshold independently and concluded: "Treatment A and B are additive – no interaction. The best treatment combination is A=No + B=Yes." In reality, Treatment A is harmless for ~90% of patients and only matters in the specific three-way combination.

Hex

Hex results

Hex started with a standard random forest importance ranking – smoking, Treatment A, response_2, age, BMI, response_1 – all in the right ballpark. It correctly found response_1 × response_2 as the strongest interaction pair and correctly noted that the two response variables "function as a pair" with threshold and step-function behaviour.

But for pattern 2, instead of the unified three-way rule (age > 55 AND BMI < 27 AND Treatment A), Hex fragmented it into three separate two-way interactions and reported Treatment A × BMI in the wrong direction – claiming the harm was concentrated in BMI > 27, when it's actually < 27. It concluded that "Treatment A may be counterproductive for older and/or higher-BMI patients." Getting one interaction right and the other backwards is arguably worse than missing both, because it looks like a complete analysis.

Claude

Claude results

Claude is our AI of choice at Leap Laboratories, but for this test we used a clean account with code execution enabled. It ran both correlation analysis and a random forest, and produced correct feature rankings with directions and approximate effect sizes. It also identified important quartiles in the response variables, and suggested that the smoking and BMI findings "warrant a deeper look for confounders or interaction effects with treatment A", reasoning that they run counter to typical clinical intuition.

But it missed all of the interactions that most strongly predict recovery_score.

Dremio

Dremio results

Dremio took a different approach – stratified analysis. It computed within-stratum treatment comparisons (adjusting for age, BMI, smoking) and produced a treatment recommendation table broken down by subgroup. The table actually contains the interaction structure implicitly: you can see that Treatment A's effect varies dramatically across strata. It even gave a specific example: "Age > 65, BMI 18.5–24.9, non-smoking, Treatment A = −34 points within that stratum."

Dremio got closer to the answer than most tools, but the strata are too coarse to find the actual ranges that matter. So close, yet so far.

Veezoo

Veezoo results

Veezoo generated univariate breakdowns – strongest positive correlations, BMI and age buckets, treatment group means – and summarised most of them correctly. Smoking positive, Treatment A negative (~7 point drop), BMI inverse, age decline. It missed the systolic BP threshold entirely. It flagged that "the relationship with smoking and BMI is counterintuitive and may warrant further investigation to understand whether these are true causal relationships or confounding factors."

It then offered follow-up suggestions – "Recovery Score by Treatment A and Treatment B," for instance – but didn't run them unprompted. As a BI tool designed for guided exploration, this is a reasonable design choice. But for autonomous analysis, the result was a correct univariate summary with the most important patterns left on the table.

Wobby

Wobby results

Similar to Veezoo – univariate breakdowns by feature. Wobby identified Treatment A as having "a notable negative effect" (−7 points), Treatment B as positive (~4 points), Treatment C as negligible. Age decline was correctly mapped with per-decade breakdowns (18–49 at ~56, 50–59 at ~53, 60–69 at ~49, 70+ at ~48.5). BMI was flagged as "a surprising pattern" where higher BMI associated with higher recovery.

Wobby concluded that Treatment A and Treatment B were "the most actionable signals." Sure, you could tell it to check subgroups – but you'd need to already know which subgroups matter, which is the hard part.

Zerve

Zerve results

The most methodologically thorough of the tools that missed the interactions. Zerve ran gradient boosting and random forest, then computed three separate importance rankings: correlation-based, tree-based, and permutation importance. The consensus ranking was correct – smoking, response_2, response_1, Treatment A, age, BMI at the top. It flagged that these top six features "account for about 96% of the model's predictive power" and that the rest could be dropped.

The interesting near-miss: Zerve observed that response_2 ranks differently across the three importance methods. That's exactly the signature of a variable that only matters conditionally – its marginal correlation is modest, but its permutation importance is high because the model relies on it for the interaction. Zerve flagged the discrepancy and even suggested "identify two-feature combinations with low R²" as a follow-up – but never ran it.
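That signature is easy to reproduce. In the toy sketch below (our illustration, not Zerve's pipeline), an outcome depends on response_2 only inside response_1's active range, so response_2's correlation with the outcome strengthens sharply once you condition on that range – exactly the kind of method-to-method discrepancy Zerve noticed:

```python
import math
import random

random.seed(1)

def pearson(xs, ys):
    """Plain Pearson correlation, stdlib only."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Outcome depends on response_2 only when response_1 is in its active range.
rows = []
for _ in range(4000):
    r1 = random.uniform(0, 6)
    r2 = random.uniform(0, 3)
    y = 50 + (30 if r1 > 3.0 and r2 < 1.5 else 0) + random.gauss(0, 3)
    rows.append((r1, r2, y))

# Marginal correlation of response_2 with the outcome, across all rows.
overall = pearson([r2 for _, r2, _ in rows], [y for _, _, y in rows])

# Correlation conditioned on response_1's active range.
active = [(r2, y) for r1, r2, y in rows if r1 > 3.0]
conditional = pearson([r2 for r2, _ in active], [y for _, y in active])

print(f"marginal r = {overall:.2f}, conditional r = {conditional:.2f}")
```

A marginal ranking sees only the first number; a permutation-style method, which breaks the variable's role inside the trained model, effectively sees the second.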

Julius

Julius results

Julius ran a random forest (R² = 0.97) and reported the top features by permutation importance: BMI, age, Treatment A, response_1, smoking, response_2, systolic BP, Treatment B. The rankings are roughly correct, and the R² proves the interaction effects are captured in the trees – but the analysis stopped there, concluding that "recovery is shaped far more by patient characteristics and a specific treatment than by most other variables in the dataset".

It followed up with some advice on controlling for baseline patient factors when comparing treatments (sensible), and asked whether the treatments were randomised or observational (nice!) before running out of credits.

Science Machine

Science Machine results

Science Machine ran both linear regression (R² = 0.46) and random forest (R² = 0.96), correctly identified the feature rankings, and – crucially – explicitly noted that the gap "suggests important non-linear relationships and feature interactions." It noted that response_1 and response_2 have strong opposing effects on recovery, and generated six output files: correlation heatmap, regression coefficients, feature importance, categorical boxplots, scatter plots, and a recovery score distribution.

The final key takeaway: the large gap between linear and RF R² "suggests important non-linear relationships and feature interactions." Correct – and then it delivered a report that suffers from exactly that problem.

ChatGPT Data Analyst

ChatGPT results

ChatGPT ran a correlation analysis and a "machine learning feature importance model," which produced correct feature rankings and directions. Smoking was identified as the strongest predictor, Treatment A as a strong negative, response variables as very influential. It noted that smoking's positive direction was "surprisingly positive in this dataset."

But the output was entirely qualitative. No model type named, no coefficients, no R², no effect sizes, no plots – just "strongest influence," "strong negative impact," "very influential." It offered to "build a regression equation or visualization" but didn't do it unprompted. This was on the free plan – paid plans likely allow for more in-depth analysis.

Graphite Note

Graphite Note results

Graphite Note produced correct feature importance rankings from an ML model, then fed them into a "Strategic Guidance Document" – a five-page report with a purpose statement, key points, strategic directions, KPIs, a "Five Whys" root cause analysis, and implementation recommendations. The feature rankings are fine. Everything built on top of them is not.

The systolic BP effect (+6 points above 150) became a clinical recommendation to "maintain systolic BP within 158–170 mmHg." That's stage 2 hypertension. The smoking effect became a smoking cessation programme recommendation – in the same document that flagged smoking's direction as anomalous. The report reads like a consulting deliverable generated from variable names and effect directions, with no assessment of whether the underlying effects are causal, clinically meaningful, or even real.

Quadratic

Quadratic results

Quadratic showed the most analytical initiative of any tool that fully missed the interactions. It stratified smoking by treatment group, ran partial correlations controlling for confounders, tested Treatment D dosage interactions, and produced multiple diagnostic plots. The instinct to investigate confounding was correct and the methodology was sound.

The problem: it spent its entire analysis budget on the smoking signal. After extensive stratification, partial correlation, and dose-response checks, it concluded the counterintuitive smoking effect was "genuine and independent." It never looked at response_1 × response_2 or age × BMI × Treatment A. All that analytical energy pointed at the third-most-important marginal effect while the two patterns that account for 72% of the variance went completely unexamined.

But, if you like spreadsheets, you'll love the UI (we did)!

Domo

Domo results

Domo returned a plausible but incorrect summary. It claimed Treatment B and Treatment C together produce moderate recovery scores (Treatment C is pure noise), claimed both very low and very high resting heart rates "can produce strong scores" (resting heart rate is pure noise), and referenced "cardiovascular and metabolic markers" with clinical significance that doesn't exist in the data.

The output reads like an LLM given the column names and asked to write a plausible-sounding clinical summary. Most of the specific claims – the Treatment B + C interaction, the heart rate U-shape, the metabolic marker patterns – are fabricated. Oh dear.

Mammoth

Mammoth results

Mammoth returned six data profiling observations: response scores cluster around certain values, treatment assignments are roughly balanced, response_1 and response_2 have overlapping distributions.

These are statements about column distributions, not analysis of what drives recovery_score. The mention of "overlapping response_1 and response_2 distributions" is the closest it got to noticing anything – and that's a marginal distribution observation, not the interaction. Probably not intended to be an automated analysis tool?

Alteryx

Alteryx results

Alteryx produced a BI-style "Auto Insights" dashboard. The summary tile reads "recovery_score was 263,857" – that's the sum, not the mean, which is an odd default for a continuous outcome variable. The top "insights" were segment breakdowns: "#1 none in diabetes_status" (56.97%), "#2 150mg in treatment_D" (21.67%), "#3 B1 in blood_type" (13.18%). These are the largest categorical segments in the data – all noise variables, ranked by group size. Possibly we're missing some functionality here.

GoodData

GoodData results

GoodData returned a single tile: Recovery Score Analysis: 52.77. That's the mean of the outcome variable. No features examined, no model, no breakdown. It listed the columns in a sidebar and offered to create visualisations, but in response to "I need to understand what influences recovery_score," it computed the average and stopped.

What went wrong

Every tool that ran a random forest got the feature importance rankings roughly right. Smoking and Treatment A at the top, response variables and age in the middle, noise features at the bottom. R² was above 0.96 in most cases. So the limitation here isn't modelling – most tools do that pretty well – but interpreting what the model learned.

You might look at these results and think the fix is obvious: just test interaction terms. Several tools even said as much – Claude suggested "interaction modelling" as a next step, Science Machine noted that the R² gap "suggests important feature interactions," Zerve flagged that response_2 ranks differently across importance methods. You might add some stratification to identify important ranges, and you're done, right? (Spoiler: this isn't how Disco does it.)

The problem is that "test interactions" and "find ranges" is the beginning of a combinatorial search problem. There are tools that can help – SHAP interaction values, for instance, will tell you that two features interact and by how much, and Edison actually computed these. But knowing that BMI and treatment_A interact isn't the whole story: you still don't know when they interact, which direction the effect goes, or what the subgroup looks like. And for higher-order interactions, even approximating SHAP interaction values becomes intractable, so the more complex multivariate patterns are missed entirely.

This dataset has 17 features – it's pretty small. Twelve of those are continuous. To find threshold rules, you could split each continuous variable into decile bins. That gives you 136 feature pairs, but each pair isn't a single test – two continuous variables at decile resolution give you 100 subgroups to check, a continuous variable crossed with blood type (e.g. 8 categories) gives 80. Across all 136 pairs, that's roughly 9,000 subgroups, each one a candidate for a pattern like "BMI above the 80th percentile and blood type O+ has elevated recovery." Three-way combinations push it past 400,000. Four-way: 11 million. And every additional test raises the multiple-comparison correction bar, so the exhaustive approach actively penalises itself – the more subgroups you check, the stronger the evidence you need in each one.
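The arithmetic above is easy to reproduce. In the sketch below, the categorical cardinalities are assumptions for illustration (three binary treatments, an 8-level blood type, a 4-level dosage), not the dataset's exact values, so the counts land in the same ballpark rather than matching exactly:

```python
from itertools import combinations
from math import comb

# 17 features: 12 continuous (split into decile bins) and 5 categorical.
bins = 10
sizes = [bins] * 12 + [2, 2, 2, 8, 4]  # assumed categorical cardinalities

pairs = comb(17, 2)                    # 136 feature pairs
pair_subgroups = sum(a * b for a, b in combinations(sizes, 2))
triple_subgroups = sum(a * b * c for a, b, c in combinations(sizes, 3))

print(pairs, pair_subgroups, triple_subgroups)
```

Under these assumptions: 136 pairs, roughly nine thousand two-way subgroups, and hundreds of thousands of three-way subgroups – each one a hypothesis competing for the same multiple-comparison budget.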

No agent is doing this. So what actually happens is good old hypothesis-driven iteration: the agent looks at the marginals, picks the most interesting-looking feature, investigates it, and moves on. Quadratic spent three rounds investigating whether smoking was confounded. TextQL tested Treatment A interactions but never looked at the response variables. Edison computed SHAP interaction values but never considered that specific ranges might matter. Each agent followed a plausible analytical path and missed the patterns that weren't on that path.

This is the same problem that hypothesis-driven research has always had – you find what you think to look for.

What Disco Found

Disco recovered both interactions with near-ground-truth effect sizes (+29 and −32, vs the planted +30 and −35). It correctly framed Treatment A as conditionally harmful – specifically for older, leaner patients – rather than broadly harmful. It also found all three smaller effects with their correct magnitudes.

Disco doesn't rely on a better predictive model – under the hood it uses multiple predictors, everything from trees to tabular foundation models, chosen according to data properties and predictive performance.

What's different is what happens after the model is trained. Instead of ranking features by marginal importance and handing the list to an agent for further investigation, Disco uses specialised interpretability methods to extract learned patterns from the trained models directly, and then ties those patterns back to actual samples in the data. So we're able to give you not only "these features are important", but also tell you when and how they interact, how strong the effect is, which samples the pattern is present in, and how confident you should be.

We also hold out a test dataset from the modelling process and actually check that the patterns generalise to unseen data. We provide p-values corrected for multiple testing. We contextualise everything we find with published literature. And we do all of this with no hypothesis and no confirmation bias – we just find signal in the numbers, irrespective of the column names. We want to find what's really there, not just what you think to look for.
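The held-out validation step itself is simple in principle. Here's a stdlib sketch – our illustration, not Disco's implementation – of checking one candidate subgroup rule on held-out data, with a Bonferroni correction sized to the search that produced it:

```python
import math
import random

random.seed(2)

def subgroup_effect_p(rows, rule, n_tests):
    """Two-sample z-test of the outcome inside vs outside the rule,
    Bonferroni-corrected for the number of candidate rules screened."""
    inside = [y for x, y in rows if rule(x)]
    outside = [y for x, y in rows if not rule(x)]
    m1, m0 = sum(inside) / len(inside), sum(outside) / len(outside)
    v1 = sum((y - m1) ** 2 for y in inside) / (len(inside) - 1)
    v0 = sum((y - m0) ** 2 for y in outside) / (len(outside) - 1)
    z = (m1 - m0) / math.sqrt(v1 / len(inside) + v0 / len(outside))
    p = math.erfc(abs(z) / math.sqrt(2))       # two-sided normal p-value
    return m1 - m0, min(1.0, p * n_tests)      # Bonferroni correction

# Simulated records with the article's two-way interaction planted.
rows = []
for _ in range(5000):
    x = {"r1": random.uniform(0, 6), "r2": random.uniform(0, 3)}
    y = 50 + (30 if x["r1"] > 3.0 and x["r2"] < 1.5 else 0) + random.gauss(0, 5)
    rows.append((x, y))

train, test = rows[:2500], rows[2500:]         # hold out half for validation

def rule(x):                                   # rule "found" on the train split
    return x["r1"] > 3.0 and x["r2"] < 1.5

effect, p_adj = subgroup_effect_p(test, rule, n_tests=9000)
print(f"held-out effect = {effect:.1f}, corrected p = {p_adj:.2g}")
```

A genuinely planted +30 effect survives the correction easily; a rule cherry-picked from noise on the train split generally won't.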

And if you want to give your agent these capabilities, you can! Data-driven science for all mankind, now available via API, for agents and humans.

References

[1] Zuk, O. et al. (2012). The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences, 109(4), 1193–1198. https://doi.org/10.1073/pnas.1119675109

[2] Mackay, T. F. C. & Moore, J. H. (2014). Why epistasis is important for tackling complex human disease genetics. Genome Medicine, 6(6), 42. https://doi.org/10.1186/gm561

[3] Wittwer, N. L. et al. (2025). The Prevalence of Potential Drug-Drug-Gene Interactions: A Descriptive Study Using Swiss Claims Data. Pharmacogenomics and Personalized Medicine, 18, 197–208. https://doi.org/10.2147/PGPM.S527556

Get started

Your data has more to tell you.

Upload a dataset and get ranked, validated discoveries in minutes. Free for public analyses — no credit card required.

Try Disco