Scientific Discovery in the Age of Artificial Intelligence
Jessica Rumbelow
Co-founder & CEO
For many people, including me, the real promise of AI is massively accelerated scientific discovery. Chatbots, vibe coding, video generation: these things are magical, but what I really want is superhuman medicine, radical life extension, humanity blossoming out into the universe. Understanding the universe. Is this the path we’re on?
Recently Leap had the opportunity to get involved in LongHack, a day of talks followed by a hackathon weekend, organised by London Longevity. Mo Elzek kindly invited me to present something about AI for science – an invitation which I accepted, hoping to meet some longevity researchers who might be able to use our discovery engine to advance their work on understanding aging. Lots of people asked for slides, which you can find here (bit hard to make sense of without my pontificating) – but it’s an important enough topic that I thought I’d write up my notes.
In this post I hope to provide a (brief, incomplete) overview of how AI is being applied to scientific discovery today (as of May 2025); some discussion of the limitations I see; and a few suggestions for improvement.
The Scientific Method

Most science happens through this kind of process. You begin by understanding the state of the art from published work in your field. Then, you think of a new hypothesis to test – a question to ask that, if answered, would meaningfully increase your (the world's) knowledge in some way. You then design an experiment and go out into the world, or into a lab, to gather data to test this hypothesis. Once you have the data, you'll analyse it to see if you managed to answer your question. Hopefully you got an interesting result – which you then publish – allowing this new knowledge to inform everyone's future work. Of course, there are now many AI applications that hope to accelerate, or even replace, parts of this process.
Literature Review
AI tools for literature review are typically driven by a two-stage process: semantic search, followed by synthesis. Semantic search works by taking your query (e.g. “what are the most promising candidates for biomarkers of aging?”) and converting it to an embedding that captures meaning. It then finds papers with similar embeddings, which are parsed by a large language model (LLM) to generate summaries and extract specific information.
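To make that pipeline concrete, here's a minimal sketch of the two-stage pattern using sentence-transformers for the retrieval step – the paper snippets are made up, and the final summarisation call is a hypothetical placeholder rather than the internals of any particular tool.

```python
# Hedged sketch: semantic search over paper abstracts, then LLM synthesis.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

papers = [
    "Epigenetic clocks as biomarkers of biological aging...",
    "DNA methylation patterns predict mortality risk...",
    "A survey of photocatalysts for hydrogen production...",
]
query = "What are the most promising candidates for biomarkers of aging?"

# Stage 1: semantic search - embed query and papers, rank by cosine similarity
paper_embs = encoder.encode(papers, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_emb, paper_embs)[0]
top_idx = scores.topk(k=2).indices.tolist()

# Stage 2: synthesis - hand the retrieved abstracts to an LLM to summarise
retrieved = [papers[i] for i in top_idx]
# summary = llm.summarise(retrieved)  # hypothetical call; swap in a real client
```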
There are a couple of things to consider here. Firstly, coverage isn't perfect: these tools can miss paywalled or non-English content. Secondly, retrieval and summarisation are only 80-90% accurate (as reported by Elicit) – which means that 10-20% of your lit review could be made up.
But there's a bigger problem – one that underpins much of science. Academic readers will be all too familiar with the replication crisis. While peer-reviewed publication is seen as the gold standard, sadly, you can't really take any single publication at face value. Many papers are just wrong or irreplicable – and peer review, which is supposed to filter out the bad work and amplify the good, is [extremely unreliable](https://pmc.ncbi.nlm.nih.gov/articles/PMC11804526/#:~:text=A%20meta-analysis%20of%2045,conclusions%20%2838%2C%2039%29.). AI-aided literature review tools typically provide some imperfect heuristics (citation count, reputability of publisher), but these are no guarantee of reliability. We'll see this problem come up again and again.
Knowledge Graphs

Another way of understanding the state of the art in a given domain is by attempting to map knowledge in graph form.
Knowledge graphs work by using LLMs, or other natural language processing (NLP) methods, to extract named entities and relationships between them: genes, compounds, diseases; inhibits, causes, treats.
The hope is to identify previously unknown relationships through multi-hop reasoning – for example, if compound A inhibits inflammation, and inflammation causes some disease, you might hope to make the connection that this compound could be used to treat that disease.
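As a toy illustration of that multi-hop step, here's a hedged sketch using networkx – the entities and relations are invented; real systems extract them from the literature at scale.

```python
# Hedged sketch: two-hop reasoning over a toy knowledge graph.
import networkx as nx

kg = nx.DiGraph()
kg.add_edge("compound_A", "inflammation", relation="inhibits")
kg.add_edge("inflammation", "disease_X", relation="causes")

# If A inhibits something that causes disease_X, A is a candidate treatment.
for _, mediator, r1 in kg.out_edges("compound_A", data=True):
    if r1["relation"] != "inhibits":
        continue
    for _, disease, r2 in kg.out_edges(mediator, data=True):
        if r2["relation"] == "causes":
            print(f"compound_A may treat {disease} (via {mediator})")
```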
The knowledge graph depicted above is a real example from Benevolent AI. They were looking for covid treatments, and were particularly interested in AAK1 inhibitors, which interrupt the passage of the virus into cells in the body. Using their own knowledge graph, they were able to identify no fewer than 378 candidate drugs – and of those, six inhibited AAK1 with high affinity. Some of those were oncology drugs with serious side effects, and so not appropriate for patients sick with covid – but Baricitinib was tolerated well, and ultimately trialled and approved. It reduces mortality in covid patients and is now widely prescribed for this purpose. A success story!
However, it's worth considering that only six of the 378 candidate AAK1 inhibitors in the knowledge graph actually worked well. This is because named entity extraction is not a perfect process – a 10% error rate is typical for state-of-the-art approaches, and that 10% compounds pretty quickly if you're looking at a chain of three or four entities and their relationships (at 90% accuracy per step, a four-hop chain is only about 0.9⁴ ≈ 66% likely to be entirely correct). Of course, there's also the problem of untrustworthy literature – even if your named entity extraction works perfectly, the relationships you extract might simply be wrong.
Hypothesis Generation

On top of knowledge graphs, there are now AI tools that will generate a hypothesis for you. Their goal is to find novel, tractable questions by applying solutions from one field to analogous problems in another; by identifying unasked questions and gaps in the literature; and by predicting logical next steps given an existing body of work. Google’s Co-scientist is one example. It works by using a (somewhat baroque) system of multiple LLM agents to generate, rank and iteratively improve hypotheses and proposals. It was applied to drug repurposing, finding novel treatment targets, and understanding antimicrobial resistance mechanisms – but public results are limited. In the case of the latter, they claim to have successfully recreated a previously validated (but not publicly known) hypothesis:
“expert researchers instructed the AI co-scientist to explore a topic that had already been subject to novel discovery in their group, but had not yet been revealed in the public domain, namely, to explain how capsid-forming phage-inducible chromosomal islands (cf-PICIs) exist across multiple bacterial species. The AI co-scientist system independently proposed that cf-PICIs interact with diverse phage tails to expand their host range. This in silico discovery … had been experimentally validated in the original novel laboratory experiments performed prior to use of the AI co-scientist system…”
This is all very well but, as with knowledge graphs (and somewhat glossed over in the report), the fact is that many of the generated hypotheses weren't worth testing. Humans are still very much required to filter the system's proposals – but I think there's value here as a brainstorming tool. Of course, this too is dependent on existing literature.
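For concreteness, here's a rough sketch of what a generate-rank-refine loop can look like. This is a generic pattern, not the Co-scientist system; llm() is a canned stand-in for a real model client.

```python
# Hedged sketch: generate hypotheses, rank them, refine the best, repeat.
import random

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM client; returns canned text here."""
    if prompt.startswith("Rate"):
        return str(round(random.uniform(0, 10), 1))
    return f"[model output for: {prompt[:40]}...]"

def generate_hypotheses(topic: str, n: int = 10) -> list[str]:
    return [llm(f"Propose a testable hypothesis about {topic}.") for _ in range(n)]

def score(hypothesis: str) -> float:
    return float(llm(f"Rate this hypothesis 0-10 for novelty and tractability: {hypothesis}"))

def refine(hypothesis: str, critique: str) -> str:
    return llm(f"Improve this hypothesis given the critique.\n{hypothesis}\n{critique}")

def hypothesis_loop(topic: str, rounds: int = 3) -> list[str]:
    pool = generate_hypotheses(topic)
    for _ in range(rounds):
        ranked = sorted(pool, key=score, reverse=True)
        keep = ranked[: len(ranked) // 2]                 # tournament-style selection
        pool = keep + [refine(h, llm(f"Critique: {h}")) for h in keep]
    return sorted(pool, key=score, reverse=True)[:5]      # top proposals for human review

print(hypothesis_loop("biomarkers of aging"))
```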
Robotic Labs

A key bottleneck for AI in science is physical experimentation: closing the loop in many domains means bringing robots into the lab. One successful example of this comes from the University of Liverpool: they built an autonomous system that set out to optimise photocatalysts for hydrogen production. It worked within a ten-variable space, selected its own experiments via a batched Bayesian search, and ran for eight days, logging 688 experiments in total. By the end it had uncovered a catalyst mixture that performed roughly six times better than the starting formulation. More interesting to me is that it did this by testing some extreme-pH conditions that human chemists tend to skip – it's a nice illustration of how automation can help us step outside our usual assumptions.
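The closed loop itself is easy to sketch in software. Below is a hedged, generic ask/tell Bayesian optimisation loop using scikit-optimize – it is not the Liverpool team's code, and measure_hydrogen_yield is a synthetic stand-in for dispatching a formulation to the robot and reading back a result.

```python
# Hedged sketch: batched Bayesian optimisation over a ten-variable formulation space.
import random
from skopt import Optimizer
from skopt.space import Real

# Ten continuous formulation variables (illustrative ranges, not the real ones)
space = [Real(0.0, 1.0, name=f"component_{i}") for i in range(10)]
opt = Optimizer(space, base_estimator="GP", acq_func="EI")

def measure_hydrogen_yield(formulation):
    """Synthetic stand-in for a robot-run experiment (higher is better)."""
    return -sum((x - 0.3) ** 2 for x in formulation) + random.gauss(0, 0.01)

for batch in range(10):                           # the real system logged 688 runs over 8 days
    candidates = opt.ask(n_points=8)              # propose a batch of experiments
    yields = [measure_hydrogen_yield(c) for c in candidates]
    opt.tell(candidates, [-y for y in yields])    # skopt minimises, so negate the yield

print("best (negated) yield so far:", min(opt.yi))
```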
If you don’t have a robot in-house, services like Emerald Cloud Lab offer another route – their platform gives you a single software interface to a fully automated, round-the-clock facility stocked with standard chemistry and biology instruments. So your AI system could design experiments, push them to the cloud lab, and return the results. Not tried it but looks cool!
Machine Learning for Science

Okay, so once we've got our data, what can we do with it? We can use it to train ML models, of course. AlphaFold is a shining example of this. The latest version uses the Evoformer architecture (attention-based like transformers, but adapted for spatial/biological data), trained on amino acid sequences to predict 3D protein structures. The AlphaFold database now contains nearly all known proteins (200+ million), and applications to date are many and varied: from COVID vaccines to understanding drug-resistant bacteria; from designing plastic-degrading enzymes to drug discovery in general – it's spectacular.
Here, massive value comes from automation. Humans can find out how proteins fold (this is where the training data comes from) but it’s expensive, and can take years. Enter ML for science: train a model to do a task that you can already do – now you can do it at speed and scale.
There are many other subfields of ML for science – a couple that I find particularly interesting are physics-informed neural networks (PINNs), and surrogate models.
PINNs embed governing equations directly into the loss function, so that training simultaneously reduces prediction error and penalises violations of known physical laws – the intention being to constrain the model to learn only things that are physically possible. Practical success hinges on correctly weighting these two losses: get the balance wrong and the model can become extremely difficult to train, or still produce physically impossible results.
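Here's a minimal PyTorch sketch of that idea for the 1D heat equation u_t = α·u_xx – the architecture, coefficient and loss weighting are illustrative assumptions, not a reproduction of any published PINN.

```python
# Hedged sketch: a PINN-style loss combining a data term and a physics residual.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
alpha = 0.01         # diffusion coefficient, assumed known
lambda_phys = 1.0    # relative weight of the physics residual

def pinn_loss(x_data, t_data, u_data, x_col, t_col):
    # Supervised term: match observed values of u at measured (x, t) points
    u_pred = net(torch.stack([x_data, t_data], dim=-1)).squeeze(-1)
    data_loss = ((u_pred - u_data) ** 2).mean()

    # Physics term: penalise violations of u_t - alpha * u_xx = 0 at collocation points
    x = x_col.clone().requires_grad_(True)
    t = t_col.clone().requires_grad_(True)
    u = net(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    physics_loss = ((u_t - alpha * u_xx) ** 2).mean()

    return data_loss + lambda_phys * physics_loss

# Shapes-only example call (random data; a real setup would train on measurements)
loss = pinn_loss(torch.rand(32), torch.rand(32), torch.rand(32), torch.rand(128), torch.rand(128))
loss.backward()
```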
With surrogate models, you run a high-fidelity but computationally heavy simulator (say, weather, or fluid dynamics) to generate datasets, then train a much lighter-weight model to reproduce the final state from the initial state, or similar. The aim is to get within a few percent of the original simulation's accuracy while delivering orders-of-magnitude speed-ups (at the cost of rendering the process opaque). The speed increase also means you can do things like provide probability estimates over the outcome, since it's no longer infeasible to run the 'simulation' multiple times.
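A hedged sketch of that workflow, with run_simulator standing in for a heavy solver (here a cheap synthetic function so the example actually runs):

```python
# Hedged sketch: pay the simulation cost once offline, then query a fast surrogate.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def run_simulator(initial_state: np.ndarray) -> float:
    """Stand-in for an expensive high-fidelity solver (hours per call in reality)."""
    return float(np.sin(initial_state).sum() + 0.05 * np.random.randn())

# Offline: build a training set from simulator runs (8 illustrative input parameters)
X_train = np.random.rand(500, 8)
y_train = np.array([run_simulator(x) for x in X_train])

surrogate = GradientBoostingRegressor().fit(X_train, y_train)

# Online: near-instant predictions, cheap enough to run thousands of times
# (e.g. to put a probability distribution over outcomes)
X_query = np.random.rand(10_000, 8)
y_pred = surrogate.predict(X_query)
print(y_pred.mean(), y_pred.std())
```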
There are many other approaches, and even more applications – from cancer diagnosis to predicting material properties. And there’s a mature ecosystem with lots of tools and frameworks available to make this easier. I recommend the book Supervised Machine Learning for Science for a good overview if you’d like to learn more.
Machine Learning + Interpretability

How we make discoveries at Leap
Say you've trained a neural network to diagnose cancer automatically, maybe more accurately than humans (which does happen – neural networks are just really good at finding complex predictive patterns). That's a great ability you have now, and maybe that's where you stop: automating a difficult task provides a lot of value (see e.g. AlphaFold, above).
I think there's a more interesting question here, though: how? What has the model learned that enables this ability? What patterns is it using? Is it just more consistent than humans, but doing pretty much the same thing that a pathologist would – or does it know something that we don't know? Of course, this is what we're interested in at Leap: accessing superhuman pattern recognition for knowledge discovery, by interpreting neural networks. Read more in our whitepaper.
I think Chris Olah was the first to propose this kind of approach, back in 2015, with reference to his work on visualising learned features in vision models:
“The visualizations are a bit like looking through a telescope. Just like a telescope transforms the sky into something we can see, the neural network transforms the data into a more accessible form. One learns about the telescope by observing how it magnifies the night sky, but the really remarkable thing is what one learns about the stars. Similarly, visualizing representations teaches us about neural networks, but it teaches us just as much, perhaps more, about the data itself.”
— Chris Olah
At Leap, our process looks something like this:
- Scientists gather maximum data about a phenomenon of interest, without a particular hypothesis
- A neural network (or any kind of model really) is trained on that data, to predict something about the phenomenon
- That model is then interpreted, to see what it has learned
- profit!
In this way you find patterns you’d otherwise miss, because you don’t have to limit your investigation to confirming a single hypothesis, and you don’t have to do manual exploratory data analysis, which can take months (and it’s extremely difficult to find complex, combinatorial patterns, unless you already know what to look for). In practice, we replicate a lot of existing knowledge using this approach – and very often find new stuff, too.
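As a generic, hedged illustration of the train-then-interpret pattern (not Leap's actual tooling), here's a sketch using a standard dataset and permutation importance as the interpretability step:

```python
# Hedged sketch: train a model on broadly-collected data, then ask what it relies on.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)

# Interpret: which measurements actually drive the model's predictions?
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
top = sorted(zip(X.columns, result.importances_mean), key=lambda p: -p[1])[:5]
for name, importance in top:
    print(f"{name}: {importance:.3f}")
```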
One really nice property is that there's no reliance on existing literature. Biases do still exist – in the priors baked into whatever model architecture you use, and naturally in the data acquisition process – but you can explore the space of possible discoveries much more fully, and with far less bias.
However, it’s entirely reliant on data quality. And depending on what kind of model you want to use, data quantity might be an issue too. In some cases, it can also be expensive – in contrast to foundation models, which you can train once on a gigantic dataset and then rely on transfer learning with a bit of fine-tuning for future tasks, you typically want to train from scratch each time, so the interpretability artefacts don’t inherit patterns from the base model that aren’t actually in the dataset of interest.
AI Scientists

What about end-to-end systems? Some recent work uses agents to automate the entire scientific process: hypothesis generation → experimentation → analysis → writing/reviewing → publication. Sakana's AI Scientist is one example, made possible because the field of study was computer science – so the agent can actually run its own experiments. FutureHouse, an extremely well-funded non-profit, are doing something similar – only with humans in the loop to do experiments IRL. It's impressive, but also kind of worrying. LLM-generated papers look really legit, but often contain subtle errors. This is pernicious, because if you're at the frontier of research, it can be really hard to spot – most readers won't know the ground truth.
Many human papers don't replicate in the first place anyway. So now we’ll have maybe-hallucinating (how can you tell?) LLMs, trained on some-large-percentage of non-replicating papers, generating really convincing but kinda made up preprints. Imagine arXiv (maybe just arXiv, for now – but not for long, I bet) flooded with extremely plausible but not-quite-true papers. You can’t replicate everything, and ML in particular moves too fast these days for peer review (which is questionable, anyway). What do you trust? What do you build on? Only your friends’ work?
To their credit, the Sakana team discuss some of this. I recommend you read their paper. I think the key insight here is that LLMs trained on human language probably inherit our problems, and they only know stuff that we've published – which is often, sadly, irreplicable. And maybe that's okay, up to a limit – maybe you can cleverly filter the LLM hypotheses, maybe you can afford to test them all, maybe models will hallucinate less and less and get better at not believing their own bullshit. But IDK if that'll happen before we completely corrupt preprints, at least. The point is not to speed up irreplicable science, and I'm worried that's what we'll end up with if we let LLMs run the show.
Aaaaanyway, if you'd like to give this approach a go, there are many general purpose agent orchestration tools available: AG2, CrewAI, LangChain, and many more.
What now?
Okay, so we’ve talked about problems with reliance on literature. I want to acknowledge that some domains are much better than others in this regard (shout out to the economists!).
But honestly, I think the highest-impact thing we can do to apply AI (especially LLMs, which everyone seems pretty keen on these days) to science well is to improve the underlying data and fix our literature base (easy-peasy then). Some ideas:
- Pre-registration of papers/experiments, finally, please
- Journal requirements for code/data (and actually testing it – LLMs?)
- Pay peer reviewers – hire experts to do this, so they can really spend time on it! Train LLMs on those reviews.
- Failing that, maybe we could use LLMs to review papers from first principles (please don't train them on human reviews! Human reviewers disagree with each other all the time – it's not a good signal)
- Incentivise publication of high-quality datasets for training AI models. I think the best data is siloed in industry right now. (Can we automate cleaning and aggregating open source datasets?)
- More knowledge graphs based on data, not literature
- Figure out how to stop incentivising publication count – are there other ways to measure research talent? (Hard problem)
Basically, we need to sort out incentives and clean up the literature. The current publish-or-perish setup does not give us good science. And it makes AI for science so much harder than it should be.
In summary: I reckon that relying on academic publications to capture the state of human knowledge doesn’t really work, and language is a super lossy abstraction over reality, anyway. This is massively retarding AI for science. GIGO. Data gets us closer. I strongly advocate using LLMs for what they're good at (generating plausible language, oft-repeated code, some kinds of reasoning), and considering other (data-driven) approaches for everything else. I still think we might have a shot at superhuman science within our lifetimes, but these are hard problems, and you should work on them.
