AI in Genomics Research
The Genomics Data Challenge
The cost of sequencing a human genome has fallen from roughly $3 billion for the first genome, completed in 2003, to under $200 in 2026. This cost reduction of more than ten-million-fold has created an explosion of genomic data. The UK Biobank contains whole-genome sequences from 500,000 participants. The All of Us Research Program aims for one million. Cancer genomics projects have sequenced tumors from hundreds of thousands of patients. Each genome generates roughly 100 gigabytes of raw sequencing data, and the processed variant files still occupy gigabytes per person.
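To put the storage figures in perspective, here is a back-of-envelope calculation using only the cohort sizes and per-genome volume quoted above. The decimal unit conversion (1 petabyte = 1,000,000 gigabytes) and the function name are assumptions made for illustration.

```python
# Back-of-envelope storage estimate from the figures quoted in the text.
RAW_GB_PER_GENOME = 100      # approximate raw sequencing data per genome
GB_PER_PETABYTE = 1_000_000  # assumption: decimal (SI) units


def raw_storage_petabytes(n_genomes: int,
                          gb_per_genome: float = RAW_GB_PER_GENOME) -> float:
    """Total raw data volume, in petabytes, for a cohort of n_genomes."""
    return n_genomes * gb_per_genome / GB_PER_PETABYTE


print(raw_storage_petabytes(500_000))    # UK Biobank cohort: 50.0
print(raw_storage_petabytes(1_000_000))  # All of Us target: 100.0
```

Even before any analysis, a single large cohort therefore implies tens of petabytes of raw data.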
The analytical challenge is proportionally enormous. Each person carries about 4 to 5 million genetic variants compared to the reference genome. Of those, perhaps 20,000 to 50,000 alter protein-coding sequences. Of those, a handful (or sometimes none) are relevant to the disease being studied. Finding those few needles in a haystack of millions requires methods that can evaluate each variant's likely functional impact, and that is where AI excels.
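The narrowing funnel described above can be sketched as a chain of filters. This is a toy illustration, not a real annotation pipeline: the `Variant` fields, the consequence labels, the `impact_score` column, and the 0.5 cutoff are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    consequence: str      # hypothetical annotation label
    gene: Optional[str]   # None for intergenic variants
    impact_score: float   # hypothetical predicted impact in [0, 1]


# Hypothetical set of labels treated as altering protein-coding sequence.
CODING_CONSEQUENCES = {"missense", "nonsense", "frameshift"}


def prioritize(variants: Iterable[Variant],
               disease_genes: Set[str],
               min_score: float = 0.5) -> List[Variant]:
    """Apply the funnel: all variants -> coding -> disease genes -> high impact."""
    coding = [v for v in variants if v.consequence in CODING_CONSEQUENCES]
    in_panel = [v for v in coding if v.gene in disease_genes]
    candidates = [v for v in in_panel if v.impact_score >= min_score]
    # Highest predicted impact first, so reviewers see the best candidates early.
    return sorted(candidates, key=lambda v: v.impact_score, reverse=True)
```

The last filter is where a learned model would plug in: `impact_score` stands for whatever per-variant prediction the analysis uses.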
Variant Calling and Quality Control
Variant calling, the identification of genetic differences between a sequenced genome and the reference genome, is the first step in any genomic analysis. Traditional variant callers use statistical models to distinguish true genetic variants from sequencing errors. Google's DeepVariant, released in 2017 and continuously improved since, replaced these statistical models with a deep convolutional neural network. It converts the sequencing reads around each candidate site into an image-like representation and classifies the site as homozygous reference, heterozygous variant, or homozygous variant.
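To make the "image-like representation" concrete, the toy sketch below turns gapless read alignments into a grid with one row per read and one column per reference position. It is not DeepVariant's actual encoding, which uses several channels per pixel, and the allele-fraction rule at the end is only a placeholder for the neural network. All function names and thresholds here are assumptions.

```python
from typing import List, Tuple

NO_COVERAGE, MATCH, MISMATCH = 0, 1, 2


def encode_pileup(reference: str,
                  reads: List[Tuple[int, str]],
                  window_start: int) -> List[List[int]]:
    """Encode gapless reads, given as (start_position, bases), as a grid."""
    width = len(reference)
    grid = []
    for start, bases in reads:
        row = [NO_COVERAGE] * width
        for offset, base in enumerate(bases):
            col = start - window_start + offset
            if 0 <= col < width:
                row[col] = MATCH if base == reference[col] else MISMATCH
        grid.append(row)
    return grid


def naive_genotype(grid: List[List[int]], col: int,
                   het_low: float = 0.2, het_high: float = 0.8) -> str:
    """Placeholder classifier: threshold the fraction of mismatching reads."""
    covering = [row[col] for row in grid if row[col] != NO_COVERAGE]
    if not covering:
        return "no_call"
    alt_fraction = covering.count(MISMATCH) / len(covering)
    if alt_fraction < het_low:
        return "hom_ref"
    if alt_fraction <= het_high:
        return "het"
    return "hom_alt"
```

A convolutional network replaces `naive_genotype` by looking at the whole grid at once, which lets it weigh context such as neighboring mismatches that a single-column threshold ignores.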
DeepVariant has outperformed traditional variant callers in many benchmarking studies, achieving higher accuracy for single nucleotide variants (SNVs) and for small insertions and deletions. Its advantage is largest for complex regions of the genome where traditional callers struggle, such as repetitive sequences, regions with high GC content, and areas near structural variants. DeepVariant's accuracy in these challenging regions reduces false positives that waste downstream analysis time and false negatives that miss clinically relevant variants.
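Benchmark comparisons of this kind usually come down to counting false positives and false negatives against a trusted truth set. The sketch below computes precision, recall, and F1 for a call set, assuming each variant can be represented as a (chromosome, position, ref, alt) tuple and matched exactly. Real benchmarking tools handle equivalent representations of the same variant, which this sketch does not.

```python
from typing import Dict, Iterable, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


def benchmark(calls: Iterable[VariantKey],
              truth: Iterable[VariantKey]) -> Dict[str, float]:
    """Compare a call set with a truth set by exact matching."""
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)   # called and real
    fp = len(call_set - truth_set)   # called but not real
    fn = len(truth_set - call_set)   # real but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}
```

Low precision corresponds to the wasted downstream effort mentioned above, and low recall to missed clinically relevant variants.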
Structural variant detection, identifying large-scale rearrangements like deletions, duplications, inversions, and translocations, has also benefited from AI. Tools like SVcnn and DeepSV use convolutional neural networks to detect structural variants from patterns in sequencing data, achieving better sensitivity than traditional methods, particularly for variants in repetitive genomic regions.
Variant Effect Prediction
Once variants are identified, the next question is "what do they do?" Most genetic variants have no functional consequence. A small fraction alter protein structure, disrupt regulatory elements, or affect splicing. AI models predict the functional impact of each variant, prioritizing the ones most likely to be biologically or clinically significant.
AlphaMissense, released by DeepMind in 2023, scores every possible single amino acid change across the human proteome and classifies each as likely pathogenic, likely benign, or ambiguous. The model builds on AlphaFold's learned representation of protein structure and evolutionary conservation and was fine-tuned on variant frequencies observed in human and primate populations rather than on clinical pathogenicity labels. It classified 89% of 71 million possible missense variants as either likely pathogenic or likely benign, providing a comprehensive catalog that researchers can query instantly. Before AlphaMissense, determining the pathogenicity of a novel missense variant often required months of experimental characterization or reliance on indirect evidence.
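A pathogenicity catalog of this kind is typically consumed as a score per variant plus two cutoffs, with scores between the cutoffs left unclassified. The sketch below shows that pattern. The cutoffs (0.34 and 0.564) are assumptions for illustration and should be checked against the published release before use; the function names are hypothetical.

```python
from typing import Iterable

# Assumed cutoffs for illustration; verify against the published release.
BENIGN_MAX = 0.34
PATHOGENIC_MIN = 0.564


def classify(score: float,
             benign_max: float = BENIGN_MAX,
             pathogenic_min: float = PATHOGENIC_MIN) -> str:
    """Map a pathogenicity score in [0, 1] to one of three classes."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < benign_max:
        return "likely_benign"
    if score > pathogenic_min:
        return "likely_pathogenic"
    return "ambiguous"


def fraction_classified(scores: Iterable[float]) -> float:
    """Share of variants that receive a benign or pathogenic label."""
    labels = [classify(s) for s in scores]
    if not labels:
        return 0.0
    return sum(label != "ambiguous" for label in labels) / len(labels)
```

The "classified" share quoted for the full catalog corresponds to `fraction_classified` computed over every possible missense variant.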
SpliceAI predicts how genetic variants affect RNA splicing, the process by which the cell removes non-coding sequences (introns) from the RNA transcript. An estimated 15 to 30% of disease-causing variants act by disrupting splicing, but they are difficult to detect from sequence alone because the splicing code is complex and context-dependent. SpliceAI, a deep neural network trained to predict annotated splice sites directly from pre-mRNA sequence, scores the effect of any variant on nearby splice sites with high accuracy, identifying disease-causing splicing variants that other methods miss.
Regulatory variant prediction is harder but equally important. Most disease-associated variants identified by genome-wide association studies (GWAS) fall in non-coding regions, where they presumably affect gene regulation rather than protein structure. AI models like Enformer and Sei predict the regulatory impact of non-coding variants by modeling the relationship between DNA sequence and gene expression, chromatin accessibility, and transcription factor binding. These models are trained on thousands of functional genomics experiments and can predict the effect of a single base change on gene expression in specific tissues.
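Sequence-based models are usually applied to variants by in silico mutagenesis: score the reference sequence, score the same sequence with the variant substituted in, and take the difference. The sketch below shows that procedure with a deliberately trivial scoring function (counting occurrences of one motif) standing in for a trained model. The motif, function names, and sequences are hypothetical.

```python
from typing import Callable


def toy_regulatory_score(seq: str, motif: str = "TATAAA") -> float:
    """Stand-in for a trained model: count (overlapping) motif occurrences."""
    hits = sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq[i:i + len(motif)] == motif)
    return float(hits)


def variant_effect(ref_seq: str, pos: int, alt_base: str,
                   score_fn: Callable[[str], float]) -> float:
    """Predicted effect of a single-base substitution: score(alt) - score(ref)."""
    if ref_seq[pos] == alt_base:
        raise ValueError("alt base equals the reference base")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return score_fn(alt_seq) - score_fn(ref_seq)


ref = "GGTATAAAGG"
print(variant_effect(ref, 4, "C", toy_regulatory_score))  # breaks the motif: -1.0
print(variant_effect(ref, 0, "C", toy_regulatory_score))  # outside the motif: 0.0
```

With a real model, `score_fn` would return a predicted expression or accessibility value for a specific tissue, and the same difference would be computed per output track.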
Gene Expression and Single-Cell Analysis
Gene expression analysis measures which genes are active in a sample, and AI has become indispensable for interpreting expression data. Differential expression analysis (finding genes that are more or less active in disease versus healthy tissue) traditionally uses straightforward statistical tests, but the downstream interpretation, understanding which biological pathways are affected and how they interact, benefits enormously from machine learning.
Single-cell RNA sequencing (scRNA-seq) measures gene expression in individual cells, revealing the heterogeneity within tissues that bulk sequencing averages out. A tumor that looks uniform in bulk sequencing might contain distinct subpopulations of cancer cells, immune cells, and stromal cells, each with different gene expression profiles. Graph-based clustering algorithms such as Leiden identify these subpopulations automatically, often after a deep generative model such as scVI has learned a low-dimensional representation of the cells, while UMAP projects the result into two dimensions for visualization. Cells are grouped by their expression patterns without any prior knowledge of what cell types to expect.
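Graph-based grouping of cells can be sketched in two steps: connect each cell to its nearest neighbors in expression space, then find groups in the resulting graph. The toy code below uses connected components as a deliberately crude stand-in for a community detection method such as Leiden, which can also split a connected graph into densely linked groups. The expression values and function names are made up for illustration.

```python
import math
from typing import List, Sequence, Set


def knn_graph(cells: Sequence[Sequence[float]], k: int) -> List[Set[int]]:
    """For each cell, the indices of its k nearest cells (Euclidean distance)."""
    neighbors = []
    for i, a in enumerate(cells):
        dists = sorted((math.dist(a, b), j)
                       for j, b in enumerate(cells) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors


def connected_components(neighbors: List[Set[int]]) -> List[int]:
    """Label each cell with the id of its connected component."""
    n = len(neighbors)
    adjacency = [set(s) for s in neighbors]
    for i, linked in enumerate(neighbors):   # make the graph undirected
        for j in linked:
            adjacency[j].add(i)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:                          # depth-first traversal
            node = stack.pop()
            for other in adjacency[node]:
                if labels[other] == -1:
                    labels[other] = next_label
                    stack.append(other)
        next_label += 1
    return labels


# Hypothetical expression of two genes in six cells.
cells = [(10, 0), (9, 1), (11, 0), (0, 10), (1, 9), (0, 11)]
print(connected_components(knn_graph(cells, k=2)))  # [0, 0, 0, 1, 1, 1]
```

No cell type labels are supplied anywhere: the two groups emerge from the expression values alone, which is the sense in which the clustering is unsupervised.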
Trajectory analysis uses AI to infer developmental or disease progression paths from single-cell data. If you sequence cells at different stages of differentiation, methods such as Monocle (pseudotime ordering) and RNA velocity reconstruct the temporal order and identify branching points where cells diverge into different fates. This reveals the molecular events that drive cell fate decisions, providing insights into development, regeneration, and cancer progression.
Cell type annotation, labeling each cluster of cells with its biological identity, increasingly uses AI models trained on reference atlases. The Human Cell Atlas project is building comprehensive reference datasets for every tissue in the body. AI models trained on these references can automatically annotate cell types in new datasets, reducing the need for manual expert annotation and improving consistency across studies.
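The simplest form of reference-based annotation assigns each cluster the label of the most similar reference profile. The sketch below does this with cosine similarity over a small marker gene panel. It is a nearest-centroid baseline, not a trained atlas model, and the expression values are invented. The gene order (CD3E, CD79A, MS4A1) is chosen because CD3E is a T cell marker and the other two are B cell markers.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two expression vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def annotate(cluster_profile: Sequence[float],
             reference: Dict[str, Sequence[float]]) -> str:
    """Return the reference cell type most similar to the cluster profile."""
    return max(reference,
               key=lambda name: cosine_similarity(cluster_profile, reference[name]))


# Hypothetical mean expression per cell type; genes are CD3E, CD79A, MS4A1.
REFERENCE = {
    "T cell": [9.0, 1.0, 0.0],
    "B cell": [0.0, 8.0, 2.0],
}

print(annotate([7.0, 2.0, 0.0], REFERENCE))  # T cell
print(annotate([1.0, 9.0, 3.0], REFERENCE))  # B cell
```

Atlas-trained models follow the same contract (expression profile in, label out) but learn from thousands of genes and can report when a cluster matches no known type.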
Genome-Wide Association Studies
GWAS compare the genomes of thousands of individuals with a disease to thousands of healthy controls, identifying genetic variants that are more common in the disease group. Traditional GWAS use simple statistical tests applied independently to each variant, but this misses interactions between variants and non-linear effects. Machine learning methods that consider variants jointly can detect genetic associations that single-variant tests miss.
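The "simple statistical test applied independently to each variant" can be as small as a chi-square test on a 2x2 table of allele counts. The sketch below implements that test with one degree of freedom using only the standard library. The counts in the example are invented, and the 5e-8 threshold is the conventional genome-wide significance level, included here as an assumption since the text does not state one.

```python
import math
from typing import Tuple

GENOME_WIDE_ALPHA = 5e-8  # conventional threshold; an assumption here


def allelic_chi_square(case_alt: int, case_ref: int,
                       ctrl_alt: int, ctrl_ref: int) -> Tuple[float, float]:
    """Chi-square statistic (1 degree of freedom) and p-value for a 2x2 table."""
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    denominator = ((case_alt + case_ref) * (ctrl_alt + ctrl_ref)
                   * (case_alt + ctrl_alt) * (case_ref + ctrl_ref))
    if denominator == 0:
        return 0.0, 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denominator
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p = allelic_chi_square(case_alt=60, case_ref=40, ctrl_alt=40, ctrl_ref=60)
print(chi2, p, p < GENOME_WIDE_ALPHA)
```

Because the test looks at one variant at a time, two variants that only matter in combination would each produce an unremarkable table, which is the gap joint models aim to close.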
Polygenic risk scores (PRS) combine the effects of many genetic variants into a single number that predicts disease risk. AI improves PRS by learning the complex, non-linear interactions between variants that linear models cannot capture. Published AI-based PRS models for diseases like coronary artery disease, type 2 diabetes, and breast cancer have reported better predictive performance than traditional linear PRS, which could identify more individuals at high risk who would benefit from early intervention.
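For reference, the traditional linear PRS that AI-based models are compared against is a weighted sum: each variant's effect size multiplied by the number of risk alleles a person carries (0, 1, or 2). The sketch below computes that baseline. The variant identifiers and effect sizes are hypothetical, and treating missing genotypes as zero is a simplifying assumption.

```python
from typing import Dict


def linear_prs(dosages: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum of risk-allele dosages; missing genotypes count as 0."""
    score = 0.0
    for variant_id, weight in weights.items():
        dosage = dosages.get(variant_id, 0)
        if dosage not in (0, 1, 2):
            raise ValueError(f"invalid dosage for {variant_id}: {dosage}")
        score += weight * dosage
    return score


# Hypothetical effect sizes (for example, log odds ratios from a GWAS).
WEIGHTS = {"variant_a": 0.2, "variant_b": -0.1, "variant_c": 0.05}

person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}
print(round(linear_prs(person, WEIGHTS), 3))  # 0.3
```

Each variant contributes independently of the others here. A non-linear model replaces the sum with a learned function of the whole dosage vector, which is how it can represent interactions.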
A major challenge is that most GWAS have been conducted in populations of European ancestry, making the resulting AI models less accurate for other populations. Efforts to build more diverse genomic databases and develop AI models that transfer across populations are critical for ensuring that AI-driven genomic medicine benefits everyone equitably. Transfer learning techniques that adapt models trained on one population to work for another show promise but are not yet mature enough for clinical deployment.
Pharmacogenomics
Pharmacogenomics uses genetic information to predict how individual patients will respond to specific drugs. AI models trained on combined genomic and clinical data can predict which patients will respond to a cancer therapy, which will experience severe side effects, and what dose will be most effective. This enables precision medicine, where treatment is tailored to the individual rather than applied uniformly.
In oncology, AI analysis of tumor genomic data guides treatment selection. Tumors with specific mutations respond to targeted therapies: BRAF mutations in melanoma respond to vemurafenib, HER2 amplification in breast cancer responds to trastuzumab, EGFR mutations in lung cancer respond to erlotinib. AI models go beyond individual mutations to consider the full mutational landscape, identifying patients who are likely to respond to immunotherapy based on tumor mutation burden, microsatellite instability, and gene expression signatures.
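The single-mutation rules and the genome-wide markers mentioned above can be combined in a few lines. The sketch below encodes the three drug pairings from the text as a lookup table and computes tumor mutation burden as somatic mutations per megabase of sequenced territory. The cutoff of 10 mutations per megabase, the alteration labels, and the function names are assumptions for illustration, and gene expression signatures are left out. This is not clinical decision logic.

```python
from typing import Iterable, List

# Pairings taken from the text; the string labels are hypothetical.
TARGETED_THERAPIES = {
    ("melanoma", "BRAF mutation"): "vemurafenib",
    ("breast cancer", "HER2 amplification"): "trastuzumab",
    ("lung cancer", "EGFR mutation"): "erlotinib",
}


def tumor_mutation_burden(n_somatic_mutations: int,
                          covered_megabases: float) -> float:
    """Somatic mutations per megabase of sequenced territory."""
    if covered_megabases <= 0:
        raise ValueError("covered_megabases must be positive")
    return n_somatic_mutations / covered_megabases


def treatment_flags(tumor_type: str, alterations: Iterable[str],
                    tmb: float, msi_high: bool,
                    tmb_cutoff: float = 10.0) -> List[str]:
    """List candidate treatment categories for review; assumed cutoff of 10/Mb."""
    flags = []
    for alteration in alterations:
        drug = TARGETED_THERAPIES.get((tumor_type, alteration))
        if drug:
            flags.append(f"targeted therapy candidate: {drug}")
    if tmb >= tmb_cutoff or msi_high:
        flags.append("immunotherapy candidate")
    return flags
```

A learned model would replace the fixed rules with a prediction from the full mutational profile, but the inputs are the same ones computed here.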
AI is essential for genomics because the data volumes and complexity exceed human analytical capacity. From variant calling to effect prediction to single-cell analysis, machine learning transforms raw sequencing data into biological and medical insights. The field's biggest challenge is ensuring that AI models work equitably across diverse populations.