Artificial Intelligence Solved the Puzzle of DNA Activation Code for Identifying 75% of Human Genes

Researchers at the University of California San Diego used artificial intelligence to identify a DNA activation code that had long been regarded as "an enigma" and that could eventually be used in biotechnology and biomedical applications.

Scientists have long known that human genes are switched on by instructions delivered by the precise order of DNA, which is written in four "bases" coded as A, C, G, and T. The study was published in the journal Nature on September 9.

Nearly 25% of genes are activated through the TATAAA sequence, also called the TATA box. How the remaining 75% are activated has remained unclear because of the enormous number of possible DNA base sequences.
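To make the TATA box idea concrete, a fixed six-base motif such as TATAAA can be located with a plain text scan. The sketch below is illustrative only: the function name `find_motif` and the example sequence are made up for this article, and exact matching is a simplification, since the article does not describe how the researchers searched for the motif.

```python
from typing import List


def find_motif(sequence: str, motif: str = "TATAAA") -> List[int]:
    """Return the 0-based start positions of every exact match of motif."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        # Advance by one base so overlapping matches are also reported.
        start = sequence.find(motif, start + 1)
    return positions


# Made-up sequence for illustration; not taken from the study.
example = "GGCTATAAAGGCGCTATAAACC"
print(find_motif(example))  # [3, 14]
```

A scan like this works only because the motif is short and well defined, which is the contrast the article draws with the much longer DPR.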

[Image: AI helped identify a DNA activation code previously regarded as an enigma. Credit: National Cancer Institute]

However, the researchers have now found an activation code that helps account for these genes. The UC San Diego News Center reported that it is called the downstream core promoter region (DPR).

James T. Kadonaga, a professor in UC San Diego's Division of Biological Sciences and the paper's senior author, said that identifying the DPR is a key step in understanding the activation of up to a third of human genes.

"The DPR has been an enigma - it's been controversial whether or not it even exists in humans," said Kadonaga, adding that the team used artificial intelligence and machine learning to "solve this puzzle."

In 1996, Kadonaga and his colleagues, working in fruit flies, found a new gene activation sequence, which they called the DPE. It corresponds to a portion of the DPR and allows genes to be activated even without the TATA box.

[Image: James T. Kadonaga. Credit: American Academy of Arts and Sciences]

In 1997, they also found a single DPE-like sequence in humans. However, they were unable to work out the details and frequency of the human DPE.

About 23 years later, Kadonaga worked with lead author and post-doctoral scholar Long Vo Ngoc, Cassidy Yunjing Huang, Claudia Medrano, and Jack Cassidy, a retired computer scientist who helped the team leverage powerful artificial intelligence tools.

This study was backed by the National Institutes of Health's National Institute of General Medical Sciences.

Achieving "absurdly good" results

In what Kadonaga describes as "fairly serious computation," the researchers made a pool of 500,000 random versions of DNA sequences and evaluated the DPR activity of each. They then used 200,000 of those versions to create a machine learning model that could accurately predict DPR activity in human DNA.
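The data preparation behind a workflow like this can be sketched in a few lines: build a pool of random sequences, set part of it aside for model building, and turn each sequence into numbers a model can read. Everything below is an assumption made for illustration: the pool is scaled down from 500,000 to 500, the sequence length of 19 is borrowed from the DPR length mentioned later in this article, and the helper names are hypothetical. The article does not say which encoding or model the researchers used, and no activity measurements are simulated here.

```python
import random
from typing import List, Tuple

BASES = "ACGT"


def random_sequences(n: int, length: int, seed: int = 0) -> List[str]:
    """Generate n random DNA strings of the given length (reproducible)."""
    rng = random.Random(seed)
    return ["".join(rng.choice(BASES) for _ in range(length)) for _ in range(n)]


def one_hot(sequence: str) -> List[List[int]]:
    """Encode each base as a 4-element 0/1 vector in A, C, G, T order."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence]


def split_pool(pool: List[str], n_train: int, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Shuffle a copy of the pool and split it into a training part and the rest."""
    rng = random.Random(seed)
    shuffled = pool[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Scaled-down stand-in for the 500,000-sequence pool (assumed sizes).
pool = random_sequences(n=500, length=19)
train, rest = split_pool(pool, n_train=200)

print(len(train), len(rest))  # 200 300
print(one_hot("ACGT"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

In a real experiment each training sequence would be paired with its measured activity before a model is fitted. That step is left out because the article gives no measurement data.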

Kadonaga described the results as "absurdly good," so good that the team created a similar machine learning model as a new way to identify TATA box sequences. When the new models were evaluated on thousands of test cases, both the TATA box and DPR models showed "incredible" predictive ability.

The results also revealed that human genes have a DPR motif. While identifying the six bases of the TATA box was fairly simple, cracking the code for the DPR's 19 bases was much tougher.

Kadonaga explained that the DPR could not be found earlier because it has no clear sequence pattern. The professor said it was like having encrypted information in the DNA sequence that humans cannot decipher, but the machine learning model can.

"A lot of things that are unexplained could now be explainable," he said, adding that artificial intelligence may be used further to analyze DNA sequence patterns and to enhance researchers' ability to understand and control gene activation in human cells.

This is owned by Tech Times

Written by CJ Robles