The new era of predictive biology
Will AI-powered prediction replace the need to understand the individual components of complex biological systems?
By Jacob Kimmel, 13 May 2026
Describing someone as a biologist tells you surprisingly little about their skills, day-to-day work or epistemic principles. Do they study the herding patterns of African elephants on the savannahs of East Africa or the structural basis for regulation of a ligand’s activity in a dark crystallography room?
Over the past century biology has arborised into subfields that address distinct problems, mirroring physics and chemistry before it. Many of these subfields are distinct enough that they represent their own intellectual disciplines. Not only do they value different questions, but they approach problems using different cognitive tools. Fields are often born at the confluence of two ancestral disciplines. Molecular biology, for example, emerged from physics and biochemistry; systems biology arose at the intersection of genomics and statistical mechanics.
I propose that a new field has emerged in the last five years with roots in molecular biology and machine learning: predictive biology. (The term predictive biology has previously been used by others, but I believe each of these uses is distinct from the definition provided here.)
Predictive biology is focused on inferring the outcomes of future experiments using quantitative models trained on a corpus of past data. Implicitly, predictive biologists hypothesise that biological systems contain a large amount of mutual information, so that the present and future state of one system (say, a cell’s shape) can be predicted from a description of another system (say, a cell’s gene expression profile).
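To make the idea of mutual information concrete, here is a minimal Python sketch that computes it from a table of co-occurrence counts. The counts are invented for illustration and are not data from any experiment, and the two-state description of expression and shape is an assumption chosen to keep the example small.

```python
import math


def mutual_information(joint_counts):
    """Return the mutual information I(X;Y) in bits from a table of counts.

    joint_counts[i][j] is how often state i of X was seen with state j of Y.
    """
    total = sum(sum(row) for row in joint_counts)
    row_totals = [sum(row) for row in joint_counts]
    col_totals = [sum(col) for col in zip(*joint_counts)]
    mi = 0.0
    for i, row in enumerate(joint_counts):
        for j, count in enumerate(row):
            if count == 0:
                continue  # empty cells contribute nothing to the sum
            p_xy = count / total
            p_x = row_totals[i] / total
            p_y = col_totals[j] / total
            mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# Hypothetical counts of 100 cells.
# Rows: expression state (low, high). Columns: shape (round, elongated).
informative = [[45, 5], [5, 45]]        # expression mostly determines shape
uninformative = [[25, 25], [25, 25]]    # expression says nothing about shape

print(mutual_information(informative))
print(mutual_information(uninformative))
```

Under these made-up counts the first table carries roughly 0.53 bits and the second carries none. In the essay's terms, only the first pair of measurements would let a model predict a cell's shape from its expression profile.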
Where molecular biology is often reductionist, predictive biology is emergent, assuming that many complex biological phenomena cannot be explained without the interactions of many components.
Where systems biology argues that mapping the individual interactions within a system will yield understanding, predictive biology counters that predicting the future state of a system is understanding. Where molecular biology was enabled by nucleic acid biochemistry and systems biology by early computers, predictive biology is built on AI tools that learn to explain biology from data.
Predictive biology is not superior or inferior to the fields that came before it, but it is distinct. These distinctions have enabled scientists to ask new questions, build new institutions and found new companies. For potentially the first time in biology’s history, this frontier may be pioneered largely in for-profit ventures rather than traditional academic institutions. And I believe that these approaches will shape the future of biology.
The power of prediction
One way to frame the direction of the field is in terms of a causal graph. If we imagine all the nodes in a graph as biological molecules, systems biologists hope to measure and annotate all of the edges between nodes. By quantifying all these connections, systems biologists hope that one day we will be able to understand and then redesign such systems.
The tools of systems biology have, unfortunately, failed to scale beyond the simplest interactions. There are few differential equations that can predict complex cellular behaviours – such as development, immunity or drug responses – with meaningful fidelity.
Predictive biology defines prediction as the core task of a biological study, rather than cataloguing the functions and relationships of molecules in order to build towards prediction. The cataloguing approach reasons that if we know the function of a gene and its relationships to all others, we can hopefully infer what will happen if we activate or repress the gene.
Predictive biologists are willing to eschew the intermediary catalogues in pursuit of the understanding that arises from predictive power. Phrased differently, predictive biologists are more concerned with measuring the ‘mutual information’[1] between two biological phenomena than they are with measuring direct causality. Where molecular biology takes inspiration from the epistemology of classical physics, predictive biology borrows the cognitive tools of computer science and information theory.
This approach has only been made possible by modern machine learning methods.
The first generation of models in the 1990s enabled researchers to extract more insights from emerging high-throughput experiments, but largely could not predict the outcomes of experiments based on their inputs alone. Early DNA sequence models enabled researchers to search for and align similar sequences, but could not predict the effect of a previously unobserved mutation. Simple models of gene expression could infer cell types or cancer outcomes, but could not predict the effect of inhibiting a gene on cell functions.
Computational constraints prevented early models from capturing sufficient biological context. Without this context models were limited to making relatively local predictions, hindering applications to the most complex problems in biology. Deep representation learning tools, enabled by advances in computing, broke through this barrier in the 2010s. It is now possible for researchers to create models that capture a rich input context – long sequences of life’s code, thousands of expression profiles, the covariates of paired drug treatments and images capturing hundreds of cells across a half-dozen different phenotypic dimensions.
By capturing a more detailed portrait of biological systems, a second generation of predictive biology models enables in silico hypothesis testing. These capabilities change both the questions predictive biologists explore and the experimental approaches they use to render new truths from a range of latent possibilities.
Asking bigger questions
Biology is rife with ‘hypothesis spaces’ that are too large to ever search exhaustively. For example, testing all possible 100-base-pair DNA sequences for enhancer activity – the ability to promote expression of a gene – would require in the region of 10⁶⁰ experiments. Testing just all the combinations of two-gene perturbations in a simple cell line would require in the region of 10⁸ experiments.
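The orders of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below assumes four possible bases at each of 100 positions, and a round figure of 20,000 genes for the cell line. The gene count is an assumption for illustration and is not a number given in this article.

```python
import math

N_BASES = 4          # A, C, G, T
SEQ_LENGTH = 100     # base pairs in the candidate enhancer
N_GENES = 20_000     # assumed gene count, for illustration only

# Every position can independently take any of the four bases.
n_sequences = N_BASES ** SEQ_LENGTH

# Unordered pairs of distinct genes that could be perturbed together.
n_gene_pairs = math.comb(N_GENES, 2)

print(f"possible 100-bp sequences: {n_sequences:.2e}")
print(f"possible two-gene combinations: {n_gene_pairs:.2e}")
```

Under these assumptions the sequence space comes to about 1.6 x 10^60 and the pairwise space to about 2 x 10^8, in line with the figures in the text.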
The traditional tools of molecular and cell biology are insufficient to explore all of these possibilities by many, many orders of magnitude. Simple questions such as ‘what is the strongest possible enhancer for the expression of a gene?’ or ‘what pairs of genes are essential for a cell to divide?’ are surprisingly inaccessible.
Molecular biology and its immediate descendants have made progress in the face of these daunting numbers through local searches, where researchers use their intuitions and prior knowledge to guess which hypotheses are the most fruitful to test. While we can’t test every 100-base-pair DNA sequence for enhancer activity, if we know several strong enhancers of about that size, a clever molecular biologist is likely to try testing mutants initialised from those promising starting points.
The very best researchers have an instinct that enables them to guess correctly which hypotheses will be fruitful. However, if the space of known strong enhancers is actually quite far from the global optimum, a molecular biologist is nonetheless unlikely to find any sequence that comes close to the true strongest enhancer.
Predictive biology models enable researchers to take a different approach. Rather than using intuition to navigate a local hypothesis space, researchers can focus on gathering data to train models that enable a global search.
These experiments might look quite different from those that a traditional molecular or systems biologist would employ. Speaking loosely, a predictive biologist might allocate more of their budget to gathering diverse data that spans the range of possibilities within a hypothesis space.
Picking up our example of the 100-base-pair enhancer sequence, a predictive biologist might run an experiment to test the activity of thousands of random sequences to promote gene expression, then train a model to predict the activity from the sequence directly. They might then use this in silico model to search for optimal sequences across the full range of possibilities, predicting the global optimum, which may be far from the range of those previously known. While this example is stylised, real-world experiments to design new proteins have achieved just such results[2].
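The stylised workflow above (measure random sequences, fit a model, then search in silico) can be sketched in Python with NumPy. Everything here is a toy under stated assumptions: the sequences are only 8 bases long so that all 65,536 can be enumerated, the hidden "true" activity is a made-up additive function, the `assay` function is a hypothetical stand-in for a wet-lab measurement, and a ridge regression replaces the deep models used in practice.

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
BASES = "ACGT"
LENGTH = 8  # toy length so the whole sequence space can be scored


def one_hot(seq):
    """Encode a sequence as a flat vector with one indicator per position and base."""
    x = np.zeros((len(seq), 4))
    x[np.arange(len(seq)), [BASES.index(b) for b in seq]] = 1.0
    return x.ravel()


# Hypothetical ground truth: each base at each position adds a fixed amount.
true_weights = rng.normal(size=LENGTH * 4)


def true_activity(seq):
    return float(one_hot(seq) @ true_weights)


def assay(seq):
    """Stand-in for an experiment: the true activity plus measurement noise."""
    return true_activity(seq) + rng.normal(scale=0.1)


# 1. Measure a few hundred random sequences (under 1% of the space).
train_seqs = ["".join(rng.choice(list(BASES), size=LENGTH)) for _ in range(500)]
X = np.array([one_hot(s) for s in train_seqs])
y = np.array([assay(s) for s in train_seqs])

# 2. Fit a ridge regression from sequence to measured activity.
ridge = 1e-3
weights = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

# 3. Score every possible sequence in silico and pick the predicted optimum.
all_seqs = ["".join(p) for p in itertools.product(BASES, repeat=LENGTH)]
scores = np.array([one_hot(s) for s in all_seqs]) @ weights
predicted_best = all_seqs[int(np.argmax(scores))]

# For comparison: the true optimum, knowable here only because the truth is simulated.
true_best = "".join(
    BASES[int(np.argmax(true_weights[i * 4:(i + 1) * 4]))] for i in range(LENGTH)
)
best_measured = max(train_seqs, key=true_activity)

print("best sequence actually measured:", best_measured, true_activity(best_measured))
print("model's predicted optimum:      ", predicted_best, true_activity(predicted_best))
print("true optimum:                   ", true_best, true_activity(true_best))
```

The additive ground truth makes this problem far easier than any real enhancer, where interactions between positions matter. The point is only the shape of the loop: the model is trained on a broad random sample and then queried across the entire space, rather than on mutants of a few known starting points.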
Predictive institutions
Disciplines beget institutions in their image. Molecular biology led to the creation of the MRC Laboratory of Molecular Biology, Cold Spring Harbor Laboratory and the original four horsemen of biotech – Genentech, Biogen, Genzyme and Amgen. Systems biology spawned the Broad Institute, UW Genome Sciences, Illumina, Myriad Genetics and Millennium Pharmaceuticals.
Predictive biology’s institutions are still being rendered. Previous disciplines have often germinated in academic centres before giving rise to commercial firms, but predictive biology may be offering an inverse example. Few academic organisations are configured to explore this intersection today, but new institutes such as the Arc Institute and the Eric and Wendy Schmidt Center offer examples of where the future may blossom. By contrast, a large number of techbio firms have already emerged across diagnostics and therapeutics[3].
Predictive biology has the potential to be the first biological discipline truly driven by industrial rather than academic scientists. Unlike molecular biology, where problems can often be addressed by a single investigator with a modest budget, predictive biology is most productive when data can be generated at scale and compute is abundant. These conditions are often easier to achieve in a for-profit endeavour.
I feel privileged to be living through a phase transition in my field. From the dawn of early biotech, scientists have dreamed of manipulating biology to craft a better world. We have extended lives and grown wonders once difficult to imagine, but we have yet to tame disease or design our environment.
Even the simplest cell is more complex than our most sophisticated computers. There are far more layers of abstraction than a human mind can conceive. Predictive biology’s promise is that perhaps we need not be limited by the human mind’s ability to connect nodes on a graph, but rather by our ability to observe patterns sufficient to guide our search and our will to do so with vigour.
This is an edited version of a post by Jacob Kimmel on Substack – see @jacobkimmel for more.
References
[1] wikipedia.org/wiki/Mutual_information
[2] See protein binders designed with RFdiffusion; novel proteins designed with Chroma models; and the demonstration that ESM3 was able to find a functional green fluorescent protein about as distant from known proteins as any newly discovered protein.
[3] In diagnostics: Freenome and GRAIL. In therapeutics: BigHat, Dyno, Enveda, Excentia, Generate, Recursion, Xaira.