An artificial intelligence (AI) model has been developed to identify previously unseen disease-causing genetic variants and rank them by severity.
The machine learning model popEVE can analyse these variants from a single human genome and rank them by the severity of the disease they may contribute to. It combines evolutionary genetic data from thousands of different species with human genomic data. The model needed to distinguish between milder genetic variants that cause adult-onset diseases and more harmful variants that lead to childhood deaths.
'Our goal was to develop a model that ranks variants by disease severity – providing a prioritised, clinically meaningful view of a person's genome,' said Professor Debora Marks, from the Blavatnik Institute at Harvard Medical School, Boston, Massachusetts, and joint senior author of the study published in Nature Genetics.
PopEVE is the first model to identify and rank harmful variants by severity across an entire proteome, improving on its predecessor EVE (see BioNews 1119). EVE, developed in 2021, could identify harmful genetic variants, but it could not compare variants within the same genome against one another.
In a cohort of 30,000 patients with undiagnosed developmental conditions, popEVE identified the genetic variant responsible for each patient's disorder with 98 percent accuracy. In addition, the model identified 123 genes previously unknown to be associated with developmental diseases.
The new study focused specifically on missense variants, which result from a change in a single DNA nucleotide base that alters one amino acid in the resulting protein. Most humans carry around 20,000 to 30,000 such variants.
PopEVE uses a large language model trained on many amino-acid sequences to help determine general rules and trends in protein structure and function. The model was trained to examine which amino-acid positions are conserved across species and thus how important each position is to the protein's overall function. Human genomic data were then used to examine which missense variants have been seen in human genomes and which have not. This use of human genomes was key to calibrating the evolution-based predictions so that genetic variants could be ranked by severity across different genes.
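As a loose illustration of the conservation idea, the Python sketch below scores each position in a toy protein alignment by how little it varies across species. It is not popEVE's method. The function name `column_conservation`, the four-sequence alignment and the entropy-based score are all assumptions chosen for illustration only.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column from 0 (highly variable) to 1 (fully conserved).

    Illustrative assumption: conservation is measured as 1 minus the
    normalised Shannon entropy of the amino acids seen in that column.
    """
    n_cols = len(alignment[0])
    max_entropy = math.log2(20)  # 20 standard amino acids; gaps are ignored in this toy
    scores = []
    for i in range(n_cols):
        # Count how often each amino acid appears at position i across species
        counts = Counter(seq[i] for seq in alignment)
        total = sum(counts.values())
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        scores.append(1 - entropy / max_entropy)
    return scores


# Hypothetical aligned fragments of the same protein from four species
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKAL"]
scores = column_conservation(toy_alignment)

for position, score in enumerate(scores, start=1):
    print(f"position {position}: conservation {score:.2f}")
```

In this toy example the first and last positions never change and so receive the maximum score. The third position varies most and scores lowest, which suggests that a change there is more likely to be tolerated.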
Furthermore, compared to EVE, popEVE produced fewer false positives – genetic variants incorrectly considered pathogenic. In reality, many such variants are found in the genomes of people from underrepresented groups, such as those of non-European ancestry. Because genomes from these groups have not been adequately included in databases, their genetic diversity has not been fully represented, and normal, harmless DNA variants have appeared to be pathogenic. PopEVE does not rely on how frequently a genetic variant occurs in human genomes, only on whether it has been seen at all.
'No one should get a scary result just because their community isn't well represented in global databases. PopEVE helps fix that imbalance, something the field has been missing for a long time,' said Dr Jonathan Frazer, from the Centre for Genomic Regulation in Barcelona, Spain, and joint senior author of the study.

