An artificial intelligence (AI) model trained on genomic data can predict disease-causing mutations and design synthetic viruses.
The development of the Evo 2 model – which was trained on trillions of DNA bases from the genomes of more than 100,000 species, including animals, plants and microbes – is described in Nature. By learning patterns conserved through evolution, the model can analyse DNA sequences and infer how genetic changes might affect biological function.
Study co-lead author Dr Brian Hie, assistant professor of chemical engineering at Stanford University, California, said: 'Just as the world has left its imprint on the language of the internet used to train large language models, evolution has left its imprint on biological sequences. These patterns, refined over millions of years, contain signals about how molecules work and interact.'
Using these evolutionary signals, Evo 2 correctly predicted the effects of certain human genetic mutations – including variants in BRCA1, a gene associated with breast cancer – with more than 90 percent accuracy.
The model builds on an earlier genomic language model, Evo, which was trained primarily on microbial genomes such as bacterial and viral DNA, and was used to design synthetic biological tools including CRISPR-based genome-editing systems (see BioNews 1266).
In the new study, researchers used Evo 2 to design synthetic bacteriophages – viruses that infect bacteria – some of which were capable of infecting and killing the host bacterium.
'Our development of Evo 1 and Evo 2 represents a key moment in the emerging field of generative biology,' said Dr Patrick Hsu, assistant professor of bioengineering at the University of California, Berkeley, co-founder of the Arc Institute and co-senior author of the paper. 'Evo 2 has a generalist understanding of the tree of life that's useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life.'
One of Evo 2's major technical advances is its ability to analyse very long stretches of DNA, up to one million bases at once. This is made possible by a new architecture known as StripedHyena 2, which allows the model to detect relationships between distant regions of the genome.
A different team of researchers has developed a related model called Evo2HiC, described in a separate preprint on bioRxiv that has not yet been peer-reviewed. Evo2HiC takes account of how DNA folds into 3D structures inside the cell nucleus in order to detect interactions across the genome, while running around 500 times faster than Evo 2.
Despite such advances, the authors of the Nature study note that experimental validation remains a major challenge. Synthesising large DNA molecules and inserting them into living cells is expensive and time-consuming, limiting how quickly AI-designed genomes can be tested in the laboratory.
The ability to design synthetic DNA also raises biosecurity and ethical concerns. To reduce potential misuse, the researchers excluded viruses that infect humans from the training datasets. When Evo 2 was prompted to generate sequences for these viruses, the outputs consisted only of random DNA, rather than functional designs.
Various other projects are currently exploring the science and ethics of synthetic DNA technologies. These include the UK's Synthetic Human Genome project and its accompanying social research initiative Care-full Synthesis (see BioNews 1295), and the WritingLife project hosted at the University of Oslo (see BioNews 1314).
Sources and References
- Genome modelling and design across all domains of life with Evo 2
- Evo 2: One Year Later
- AI can write genomes — how long until it creates synthetic life?
- AI trained on nine trillion DNA letters predicts harmful mutations and designs new genomes
- AI model trained on 100,000+ species learns to read and design genetic code
- Large genome model: Open source AI trained on trillions of bases
- Evo2HiC: a multimodal foundation model for integrative analysis of genome sequence and architecture


