
Claude Meets Discovery Engine, Part I

Case Study
11 min read · 9.3.2025

Picture Claude in a lab coat. Your trusty research assistant. You hand it a spreadsheet with 2,170 rows of experimental catalyst data and ask it to find patterns that could revolutionise green hydrogen production.

What happens next reveals a fundamental truth about AI for frontier science, and why the right tools matter just as much as intelligence.

The Task: Understanding Catalyst Performance

Meta's OCx24 is a benchmark dataset bridging computational and experimental catalyst discovery for clean energy applications. Each entry includes an elemental composition and the current density at which it was tested, along with electrochemical performance metrics including voltage and Faradaic efficiency (FE) – a key measure of catalyst selectivity that captures what fraction of the electrical charge goes toward producing specific products.
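
If you'd like a feel for the data, here's a minimal sketch of loading and inspecting a dataset like this with pandas. The file name and column names (element symbols for composition fractions, plus current_density and fe_liquid) are assumptions for illustration and may not match the actual OCx24 release.

```python
import pandas as pd

# Hypothetical file and column names – the real OCx24 export may differ.
df = pd.read_csv("ocx24_experimental.csv")

print(df.shape)                        # roughly (2170, n_columns)
print(df["current_density"].unique())  # e.g. 50–300 mA/cm²
print(df["fe_liquid"].describe())      # Faradaic efficiency for liquid products

# Composition columns hold elemental fractions, which should sum to ~1 per row.
element_cols = ["In", "Pd", "Sn", "Ga", "Ag", "Cd", "Cu"]
print(df[element_cols].sum(axis=1).describe())
```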

CO₂ reduction catalysts make it possible to convert captured carbon dioxide into useful products like fuels, chemicals, or feedstocks – and enable carbon recycling at scale. Understanding the relationships between different elements, compositions, and operating conditions is key to designing better catalysts, but the complexity of the data makes this a major challenge.

Where to begin? Do we look at every possible composition? Computationally infeasible. Do we follow our intuition, or start with a literature search to identify promising avenues? Intuition is great, but we all have blind spots and assumptions – and the literature often doesn’t replicate. Exploratory data analysis, then. But that takes weeks or months.

Claude?

Claude Goes It Alone

Let’s see what Claude Opus 4.1 (Anthropic’s latest model in thinking mode) makes of the data. We asked it to look for insights that could help guide the development of more promising catalysts, contextualise anything it found with reference to existing literature, and generate plots to validate its findings.

For the sake of brevity, we asked for only the three most important insights.

Over to you, Claude!


Let me analyze this dataset to extract insights about the relationship between catalyst composition and Faradaic efficiency for liquid products in CO2 reduction. I need to focus on very high or low values of Faradaic efficiency.

First, let me understand the data structure:
- The dataset has 2170 rows
- Key columns include composition fractions for various elements (In, Pd, Sn, Ga, Ag, Cd, Cu, etc.)
- Current density ranges from 50 to 300 mA/cm²
- The target variable is fe_liquid (Faradaic efficiency for liquid products)

Let me think about what patterns might be important:
1. Which elements or combinations lead to high/low Faradaic efficiency
2. The role of current density
3. Specific ratios or binary/ternary combinations

I should look for:
- Compositions with very high fe_liquid values (e.g., >80%)
- Compositions with very low fe_liquid values (e.g., <5%)
- The effect of specific elements like In, Sn, Ga, Pd, Cu, etc.
- The influence of current density on performance

Let me write code to validate these patterns and search for literature support.

Looking at your CO2 reduction catalyst dataset, I'll analyze the relationship between catalyst composition and Faradaic efficiency for liquid products, focusing on extreme values. Let me identify the three most important insights:

Claude Alone: Insight #1

Indium-containing catalysts achieve exceptionally high Faradaic efficiency for liquid products (>80%) at high current densities.

Catalysts containing indium (In) show dramatically higher Faradaic efficiency for liquid products, with values exceeding 80% at current densities of 200-300 mA/cm². Pure indium and indium-rich compositions consistently outperform all other catalyst compositions in the dataset. This aligns with literature showing that indium is highly selective for formate production in CO2 reduction, with recent studies reporting Faradaic efficiencies above 90% for formate on indium-based catalysts [1][2]. The high selectivity is attributed to indium's weak binding to CO intermediates, favoring the formate pathway over CO evolution [3].

Evaluation:

This is partially correct. On average, indium absolutely does increase efficiency, and some indium-containing catalysts are indeed the highest performing in the dataset – but as we can see, many have poor performance. Nor is it the case that current densities of 200-300 mA/cm² show dramatically higher performance: some indium-containing catalysts perform well at much lower densities, and many perform poorly at all densities.
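
Claims like this are also cheap to sanity-check directly. A rough sketch (using the same hypothetical column names as above) that tabulates the spread of liquid-product FE for indium-containing catalysts at each current density:

```python
import pandas as pd

df = pd.read_csv("ocx24_experimental.csv")  # hypothetical file/column names

has_in = df["In"] > 0
summary = (
    df[has_in]
    .groupby("current_density")["fe_liquid"]
    .agg(["count", "mean", "min", "max"])
)
print(summary)
# A wide min–max range at every current density would confirm that many
# indium-containing catalysts perform poorly, not only the low-density ones.
```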


Claude Alone: Insight #2

Binary Sn-Ga catalysts with Sn:Ga ratios between 0.3:0.7 and 0.5:0.5 show consistently low Faradaic efficiency (<15%) for liquid products.

The dataset reveals that binary combinations of tin and gallium result in particularly low selectivity for liquid products, with Faradaic efficiencies typically below 15%. The Sn-0.324-Ga-0.676 and Sn-0.5-Ga-0.5 compositions show the lowest liquid product formation. This contrasts with literature showing that pure Sn catalysts can achieve moderate formate selectivity [4], suggesting an antagonistic effect when Ga is present. The low selectivity may be due to gallium promoting the hydrogen evolution reaction, as reported in recent studies [5], while the Sn-Ga interface may favor gaseous products like CO over liquid products.

Evaluation:

Unfortunately, this is hallucinated – as the plots show, the data doesn’t actually contain any Sn-Ga compositions.
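
Hallucinated compositions are even easier to catch – a two-line check (same hypothetical column names) confirms whether any Sn–Ga rows exist at all:

```python
import pandas as pd

df = pd.read_csv("ocx24_experimental.csv")  # hypothetical file/column names

sn_ga = df[(df["Sn"] > 0) & (df["Ga"] > 0)]
print(len(sn_ga))  # 0 here: the dataset contains no Sn–Ga compositions
```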


Claude Alone: Insight #3

High palladium content (>0.75 fraction) combined with copper at Cu:Pd ratios below 0.15:0.85 results in moderate but stable liquid product formation (15-20%) across all current densities

Catalysts with palladium fractions exceeding 0.75, particularly when combined with small amounts of copper (Cu fraction < 0.15), show remarkably stable Faradaic efficiency for liquid products between 15-20% regardless of current density (50-300 mA/cm²). This current density independence is unusual and contrasts with literature reporting that Pd typically favors formate production with strong current density dependence [6]. The stabilizing effect of small Cu additions to Pd has been reported to modify the d-band center and improve formate selectivity stability [7], though the specific Cu:Pd ratio of approximately 1:6 observed here as optimal has not been previously reported in the literature.

Evaluation:

This is misleading. We can see that high palladium with a small amount of copper results in increased efficiency, but it is no more stable than palladium with high copper (in fact, high-copper palladium catalysts exhibit smaller variance in efficiency). Nor is FE independent of current density – Claude's own chart shows that for Cu:Pd < 0.2 catalysts, Faradaic efficiency rises as current density increases.
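
Both halves of this claim – stability and current-density independence – can be probed with a quick groupby, again under the hypothetical column names used above:

```python
import pandas as pd

df = pd.read_csv("ocx24_experimental.csv")  # hypothetical file/column names

pd_cu = df[(df["Pd"] > 0) & (df["Cu"] > 0)].copy()
pd_cu["group"] = (pd_cu["Cu"] / pd_cu["Pd"] < 0.2).map(
    {True: "Cu:Pd < 0.2", False: "Cu:Pd >= 0.2"}
)

# Is the low-copper group actually more stable? Compare the spread of FE.
print(pd_cu.groupby("group")["fe_liquid"].agg(["count", "mean", "std"]))

# Is FE really independent of current density for the low-copper group?
low_cu = pd_cu[pd_cu["group"] == "Cu:Pd < 0.2"]
print(low_cu[["current_density", "fe_liquid"]].corr())
```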


As shown by the plots that Claude itself generated, none of these findings are well supported by the data. What happened here?

Claude did what many humans do when confronted with complex data: it looked for the patterns that it was expecting to see, whether or not they were actually present. Claude alone, like all LLMs, just isn’t equipped to find patterns in this kind of data – and why would we expect it to be? It’s a language model. (Sorry Claude, we love you!).

Perhaps we should give Claude the tools it needs to do well at this kind of task.

Claude + Discovery Engine

What if Claude had access to a tool that would give it the ability to systematically extract insights from data? Let’s find out. As above, we asked for only the three most important insights – and the only difference between the two prompts is that this time, we asked Claude to primarily use the output of Discovery Engine, instead of analysing the data itself.


Looking at the Discovery Engine output, I need to extract the most important insights from the patterns found. The data is about catalyst compositions and their Faradaic efficiency for liquid products in CO2 reduction.

Let me analyze the key patterns found:

1. Novel patterns with high Faradaic efficiency:
- When both palladium and gold are absent from the catalyst composition, the Faradaic efficiency for liquid averages around 30, which is notably higher than when either metal is absent individually (means of 21-23).
- When catalysts contain high amounts of copper (between 0.36 and 1.00) and no gold, they achieve moderate Faradaic efficiency for liquid CO2 reduction with values ranging from 17.91 to 35.48, showing a higher average efficiency (26.70).

2. Negative synergistic effects (patterns showing low efficiency):
- When both copper (Cu) content is between 0.24 and 0.52 and gold (Au) content is between 0.58 and 0.69, the Faradaic efficiency for liquid drops to a mean of 5.99, which is notably lower than when either condition occurs individually.
- When catalysts contain no cadmium and copper content between 0.24 and 0.52, the Faradaic efficiency for liquid tends to be lower (mean of 10.59) compared to catalysts with just zero cadmium content (mean of 18.85).
- When catalysts contain no cadmium, no tin, and indium levels between 0 and 0.39, the Faradaic efficiency for liquid ranges between 6.44 and 29.79, with a mean of 18.11.
- High palladium content (between 0.59 and 1.00) results in Faradaic efficiency for liquid ranging between 7.20 and 22.85, with an average of 15.02.
- Existing research confirms some key observations about catalyst performance, highlighting specific metal combinations that optimize CO2 reduction efficiency.

I've identified critical design strategies for improving catalyst performance. The research reveals clear guidelines: prioritize copper-rich compositions without gold, eliminate palladium and gold simultaneously, and carefully manage metal interactions to maximize Faradaic efficiency. These insights provide a roadmap for developing more effective catalysts for liquid product generation in CO2 reduction processes.

Based on the Discovery Engine analysis of your CO2 reduction catalyst dataset, here are the three most important insights for maximizing Faradaic efficiency for liquid products:

Claude + Discovery Engine: Insight #1

Eliminate Both Palladium and Gold for Optimal Performance
The absence of both palladium and gold in catalyst compositions leads to significantly higher Faradaic efficiency (~30%) compared to when only one is absent (21-23%). This novel finding suggests a negative synergistic effect when either precious metal is present. [1][2][3]

Evaluation:

Not bad! It would have been nice to plot the effect of gold (Au) and palladium (Pd) used together. And the phrasing "a negative synergistic effect when either precious metal is present" is a bit confusing – a "synergistic" effect when only one of the two is present? But the data does indeed show that when neither gold nor palladium is present, FE increases.
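
This is also the kind of pattern that's easy to verify against the raw data – a hedged sketch (assuming an "Au" composition column alongside the others) comparing mean FE when both, or only one, of the two metals is absent:

```python
import pandas as pd

df = pd.read_csv("ocx24_experimental.csv")  # hypothetical file/column names

groups = {
    "no Pd, no Au": (df["Pd"] == 0) & (df["Au"] == 0),
    "no Pd, Au present": (df["Pd"] == 0) & (df["Au"] > 0),
    "no Au, Pd present": (df["Au"] == 0) & (df["Pd"] > 0),
}
for name, mask in groups.items():
    print(f"{name}: mean FE (liquid) = {df.loc[mask, 'fe_liquid'].mean():.1f}")
```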


Claude + Discovery Engine: Insight #2

Optimize Copper Loading at High Levels (36-100%) Without Gold
High copper content (0.36-1.0 fraction) combined with zero gold produces consistently good Faradaic efficiency (mean ~27%), outperforming the individual effects of these conditions. This established relationship in MEA-based CO2 reduction confirms copper's effectiveness for liquid product formation. [1][2]

Evaluation:

The effect here is subtle, but supported by the data – the combination of high copper and zero gold, while not producing the highest Faradaic efficiency, consistently performs above the mean (and outperforms high copper with gold present).


Claude + Discovery Engine: Insight #3

Avoid Detrimental Metal Combinations
Several metal combinations create strong negative synergies that should be avoided:
- Copper (24-52%) with Gold (58-69%): Drops efficiency to ~6% (vs. 11-16% individually)
- High Palladium (>59%): Limits efficiency to ~15%
- Cadmium absence with moderate copper (24-52%): Reduces efficiency to ~11%

These novel findings reveal that certain mid-range compositions create unexpected antagonistic effects. [1][2][3]

Evaluation:

Some nuance is lost in the description – high palladium doesn't always limit efficiency to ~15%, only on average. But in general, there appears to be solid evidence in the data for these findings: bravo, Claude!


With Discovery Engine, every insight is supported by the data (as expected – it returns validated patterns), and Claude did a great job of exploring the most interesting results. LLMs in general have a pretty good idea of what will be interesting to us – they just need help getting those insights out of the data.

LLMs need data-driven discovery just as much as humans.

Making claims that aren't supported by the data is far from ideal for a research assistant. It burdens scientists with either carefully checking the work (which rather defeats the point) or eating the cost of wasted time and failed experiments. But it's about more than just whether the claims are true or false: the real difference between Claude alone and Claude + Discovery Engine is the kinds of insights we can find.

Without specialised tools, language models either propose broad, general patterns (the kind you might expect from reading the literature without actually examining the data too closely), or massively over-index on a few data points. This is the kind of stuff that humans do! We too are not good at systematically analysing large, numeric datasets. And this is exactly what we'd expect: LLMs inherit from us, and like us they tend to overgeneralise, misinterpret outliers, or fall back on vague heuristics. But Discovery Engine transforms all that complexity into a structured list of validated patterns – filtering noise and surfacing the kinds of non-obvious, non-linear, combinatorial relationships that are so hard for us to find alone. Free from the assumptions and biases of previous publications, it systematically pulls meaningful patterns out of data.

With that scaffold in hand, we can let language models do what they're best at: automation, synthesis, and contextualisation against what we already know.

This is a toy example: one model, prompted single-shot, on a single dataset. It is far from a rigorous study of the ability of language models to make sense of data. And no doubt multi-agent systems with access to more structured knowledge would do better – but that is exactly the point. It is structure, frameworks, and better tools that will allow LLMs to contribute to scientific discovery.

These examples are one illustration of how language models can accelerate scientific discovery, when given the right tools. And there are many more to come: Discovery Engine is no longer just for humans.

Stay tuned for part 2!


All prompts, thinking blocks, and references can be found here.

If you're a materials scientist and you'd like to apply Discovery Engine to your experimental data, you can request a pilot here.
