Columbia University researchers have developed a new AI Model for Predicting Gene Activity in Human Cells
Jan 15, 2025, 06:43

Researchers at Columbia University Vagelos College of Physicians and Surgeons, led by Dr. Raul Rabadan, have developed a new predictive AI system called the General Expression Transformer (GET) to analyze gene activity in human cells with remarkable accuracy.

This AI model aims to enhance the understanding of cellular mechanisms in both healthy and diseased states. GET is designed to “uncover regulatory grammars” across 213 different human fetal and adult cell types by utilizing chromatin accessibility data alongside genomic sequences. By training on millions of cells collected from normal human tissues, GET has learned how cells typically function, enabling it to predict gene expression patterns effectively in various contexts, including both normal and diseased cells.

“It’s really a new era in biology that is extremely exciting; transforming biology into a predictive science,” said Raul Rabadan.

On January 8th, 2025, the model and analysis were published in Nature.

A foundation model of transcription across human cell types

Authors: Raul Rabadan et al.

The study introduces the General Expression Transformer (GET), a foundation model designed to unravel the complexities of transcriptional regulation across various human cell types. Transcriptional regulation is important for numerous biological processes, including those related to genetic diseases and cancers.

This regulation is orchestrated by a network of transcription factors (TFs), coactivators, and RNA polymerase II, which interact with regulatory sequences to modulate gene expression. Despite the conserved nature of these interactions, our understanding has often been limited to specific cell types, making it challenging to generalize findings across different contexts.

“Predictive generalizable computational models allow [us] to uncover biological processes in a fast and accurate way. These methods can effectively conduct large-scale computational experiments, boosting and guiding traditional experimental approaches,” said Rabadan.

GET predicts gene expression in both familiar and novel cell types, including tumor cells, and adapts to various sequencing platforms and assay types.

The model not only predicts gene expression but also identifies long-range regulatory elements associated with fetal hemoglobin and their corresponding TFs. The design of GET is based on the principles of transcription regulation, focusing on how genomic regions interact with TFs and how accessible these regions are in specific cell types. This interaction shapes the chromatin environment, which influences how RNA polymerase II drives gene expression.

GET employs a two-stage training process: self-supervised pretraining to learn the interactions between regulatory elements, followed by fine-tuning to predict gene expression without needing paired expression measurements. GET has shown high accuracy in predicting gene expression for unseen cell types, surpassing previous models and underscoring the importance of DNA sequence specificity in transcription regulation. GET also generalizes to adult cell types when trained solely on fetal data.

GET’s zero-shot prediction ability was validated using a lentivirus-based massively parallel reporter assay (lentiMPRA), where it identified regulatory elements in a cell-type-specific context without prior exposure to relevant data. This capability allows GET to predict regulatory activity across different genetic sequences.

The model interpretation techniques used in GET enable researchers to derive scores indicating the contribution of specific regions or motifs to gene expression across various cell types.

“The vast majority of mutations found in cancer patients are in so-called dark regions of the genome. These mutations do not affect the function of a protein and have remained mostly unexplored. The idea is that using these models, we can look at mutations and illuminate that part of the genome,” said Rabadan.

GET has proven effective in discovering coregulating TFs by analyzing correlations between motifs. The model’s causal discovery algorithms have uncovered both known and novel interactions between TFs, confirming its potential for elucidating complex regulatory mechanisms. GET represents an advancement in transcriptional modeling, offering insights into regulatory elements, upstream regulators, and TF interactions across diverse human cell types. Its broad applicability positions it as a valuable tool for understanding gene regulation in health and disease contexts. Future enhancements could further integrate various biological data layers to provide an even more comprehensive view of transcriptional regulation dynamics.
