Njuus

News, summarized

Open source AI model trained on trillions of genetic bases
Photo: Generated by Flux
Machine learning, AI, Science

Researchers have released Evo 2, an open-source AI system trained on 8.8 trillion DNA bases from bacteria, archaea, eukaryotes, and viruses. The model can identify key genome features such as regulatory DNA and splice sites across all domains of life without specialized fine-tuning.

A research team has unveiled Evo 2, the successor to the earlier Evo system, which was limited to bacterial genomes. The new model was trained on the OpenGenome2 dataset, which contains 8.8 trillion bases from all three domains of life as well as bacteriophages. Two versions were developed: a smaller model with 7 billion parameters trained on 2.4 trillion bases, and a full version with 40 billion parameters trained on the complete dataset.

Unlike its predecessor, Evo 2 learned to recognize complex genome features found in eukaryotes, such as introns, regulatory sequences, and splice sites. These features are only weakly defined at the sequence level and are scattered across large stretches of DNA. The system uses a convolution-based neural network architecture called StripedHyena 2, trained in two stages: first to identify important features, then to recognize large-scale patterns. Testing showed the model could detect mutations affecting transcription and translation sites, assess mutation severity, and recognize which genetic code a species uses. For some tasks, such as splice site identification, it outperformed specialized software.
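The article does not spell out how the model assesses mutation severity. One common approach with sequence models is to compare how likely the model finds the original sequence with how likely it finds the mutated one. The Python sketch below illustrates that idea only. It uses a toy k-mer Markov model as a stand-in, which is an assumption made for illustration: it is not Evo 2, does not use its API, and every function name in it is hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, k=2, pseudocount=1.0):
    """Fit a toy order-k Markov model (hypothetical stand-in for a DNA language model).

    Returns a function log_prob(context, base) giving the smoothed
    log-probability of `base` following the k-base `context`.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1.0

    def log_prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudocount) / total)

    return log_prob


def sequence_log_likelihood(log_prob, seq, k=2):
    """Sum the log-probability of each base given the k bases before it."""
    return sum(log_prob(seq[i - k:i], seq[i]) for i in range(k, len(seq)))


def mutation_score(log_prob, reference, position, alt_base, k=2):
    """Log-likelihood of the mutant minus that of the reference.

    A strongly negative score means the model finds the mutated
    sequence much less plausible than the original.
    """
    mutant = reference[:position] + alt_base + reference[position + 1:]
    return (sequence_log_likelihood(log_prob, mutant, k)
            - sequence_log_likelihood(log_prob, reference, k))


if __name__ == "__main__":
    # Toy training data: a perfectly periodic sequence, so any
    # substitution breaks a pattern the model has learned.
    model = train_markov(["ACGT" * 50], k=2)
    reference = "ACGTACGTACGT"
    print(mutation_score(model, reference, 5, "T"))  # C -> T substitution
```

In this toy setting, a substitution that breaks the learned pattern gets a negative score, while "mutating" a base to itself scores exactly zero. A real genomic model would replace the Markov model with its own per-base likelihoods.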

The researchers made Evo 2 fully open source, including model parameters, training code, inference code, and the dataset. When tested on generating new sequences, the system produced regulatory DNA with activity in specific cell types, though results were modest: only 17 percent showed significant differential activity. The team designed the release so that the broader research community can explore potential applications, including protein design and genome annotation. It is also possible that Evo 2 has identified previously unknown genome features.