Add MULTI LAYER BIOLOGICAL FOUNDATION MODEL
MULTI_LAYER_BIOLOGICAL_FOUNDATION_MODEL.md
# Multi-Layer Biological Foundation Model

## A Blueprint for Cross-Omic Intelligence Applied to Longevity

Biology runs on a multi-layer information stack: DNA encodes potential, epigenetic marks regulate access, RNA carries the transcribed instructions, and proteins execute function. No single layer tells the full story. Yet virtually every biological foundation model today is trained on just one layer. This document makes the case for, and sketches the architecture of, a model trained across all layers simultaneously, with a specific focus on understanding and intervening in biological aging.

---

## Table of Contents

1. [The Multi-Layer Intelligence Gap](#1-the-multi-layer-intelligence-gap)
2. [Current Landscape: Single-Layer Models](#2-current-landscape-single-layer-models)
3. [Emerging Cross-Modal Work](#3-emerging-cross-modal-work)
4. [The Case for a Unified Multi-Layer Model](#4-the-case-for-a-unified-multi-layer-model)
5. [Architecture Design](#5-architecture-design)
6. [Training Data and Strategy](#6-training-data-and-strategy)
7. [Applications to Longevity Research](#7-applications-to-longevity-research)
8. [Technical Challenges](#8-technical-challenges)
9. [Concrete Project Proposals](#9-concrete-project-proposals)
10. [Infrastructure and Roadmap](#10-infrastructure-and-roadmap)

---

## 1. The Multi-Layer Intelligence Gap

### 1.1 The Central Dogma Is a Multi-Layer System

```
DNA → Epigenetic Marks → RNA → Protein → Metabolites → Phenotype
 ↑          ↑                  ↑         ↑            ↑
 └──────────┴──────────────────┴─────────┴────────────┘
              Feedback loops at every level
```

The central dogma of molecular biology is not a one-directional pipeline; it is a densely interconnected network with regulatory feedback at every level:

- **DNA** encodes the genome (the instruction set), but its expression is controlled by everything downstream
- **Epigenetic marks** (DNA methylation, histone modifications, chromatin accessibility) determine which genes are active in each cell type and age state
- **RNA** (mRNA, ncRNA, miRNA) is the transcribed output, but its stability, splicing, and translation are regulated by RNA-binding proteins and epigenetic context
- **Proteins** are the functional executors, but their folding, modification, localisation, and degradation depend on cellular context
- **Metabolites** are the biochemical output, and they feed back to regulate gene expression, enzyme activity, and epigenetic marks

**Aging perturbs all layers simultaneously.** DNA accumulates mutations. Epigenetic marks drift. Transcription patterns shift. Protein homeostasis degrades. Metabolic output declines. No single-layer model can capture the cascading, multi-directional dysfunction that constitutes biological aging.

### 1.2 The Current Situation

As of early 2026, we have impressive foundation models for individual layers:

- **Protein:** ESM-2 (15B params), ESM3 (98B params), xTrimoPGLM (100B params)
- **DNA/Genomics:** Evo 2 (40B params, 9.3T nucleotides), AlphaGenome (1Mb context)
- **Single-cell transcriptomics:** Geneformer (104M cells), TranscriptFormer (112M cells, 12 species), scFoundation (50M cells, 100M params)
- **Epigenomics:** MethylGPT (226K profiles), CpGPT (100K+ samples)

But **almost no model genuinely learns joint representations across the central dogma in a unified pre-training process.** The layers are modelled in isolation, missing the very cross-layer interactions that drive both normal biology and disease.

This is the gap. And it is enormous.

---

## 2. Current Landscape: Single-Layer Models

### 2.1 Protein Language Models (pLMs)

Protein language models are the most mature class of biological foundation models. They are trained on protein sequences using masked language modelling (like BERT) or autoregressive generation (like GPT).

| Model | Organisation | Params | Training Data | Key Innovation |
|-------|-------------|--------|---------------|----------------|
| **ESM-2** | Meta FAIR | 15B | 250M protein sequences (UniRef50) | Largest masked protein LM; learned structure emerges from sequence alone |
| **ESM3** | EvolutionaryScale | 98B | 2.78B proteins, 771B tokens | Multi-modal: sequence + structure + function jointly tokenized; VQ-VAE for 3D structure as discrete tokens; can generate novel functional proteins |
| **xTrimoPGLM** | BioMap | 100B | 1T tokens from protein sequences | Largest pLM; unified MLM + GLM objectives for both understanding and generation |
| **AlphaFold2/3** | DeepMind | ~600M | PDB + sequence databases | Structure prediction, not a language model per se; AF3 extends to DNA, RNA, ligands, ions |
| **ProtTrans** | TU Munich | Up to 11B | UniRef/BFD | Family of protein transformers (ProtBERT, ProtT5, ProtXLNet) |
| **ProGen2** | Salesforce | 6.4B | Protein sequences + metadata | Autoregressive protein generation with controllable properties |
| **RFdiffusion** | Baker Lab | – | PDB structures | Diffusion model for de novo protein structure design |
| **EvoDiff** | Microsoft | 640M | Evolutionary data | Diffusion model for protein sequence generation |
| **GPT-4b micro** | OpenAI/Retro Bio | – | Protein sequences + text + 3D structural data | Redesigned Yamanaka factors with >50-fold improvement in reprogramming markers |

**Key insight:** ESM3's joint tokenization of sequence, structure, and function is a template for multi-modal biological modelling. It demonstrates that discrete tokenization of continuous biological information (3D coordinates → VQ-VAE tokens) enables standard transformer architectures to reason across modalities.

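
The tokenization idea in the key insight above can be illustrated with a toy vector-quantization step. This is a minimal sketch, not ESM3's tokenizer: it shows only the nearest-codebook lookup that turns a continuous vector into a discrete token ID. The function name `quantize`, the 2-D codebook, and the feature values are all illustrative assumptions.

```python
import math


def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (Euclidean)."""
    best_index, best_dist = -1, math.inf
    for index, code in enumerate(codebook):
        dist = math.dist(vector, code)
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


# Hypothetical 2-D codebook. A trained VQ-VAE would learn these entries
# and work with much higher-dimensional vectors.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Made-up continuous per-residue features become a sequence of discrete tokens.
features = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]
tokens = [quantize(f, CODEBOOK) for f in features]
```

Once structure is expressed as integer tokens like these, it can share a vocabulary-style embedding table with sequence tokens, which is what lets a standard transformer consume both.
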
### 2.2 Genomic / DNA Language Models (gLMs)

These models are trained on raw DNA sequences. The central challenge is context length: regulatory elements can lie hundreds of kilobases from their targets, so models need contexts of hundreds of thousands to millions of tokens.

| Model | Organisation | Params | Training Data | Context Length | Key Innovation |
|-------|-------------|--------|---------------|----------------|----------------|
| **Evo 2** | Arc Institute | 40B | 9.3T nucleotides, 128K genomes | 1M nucleotides | StripedHyena 2 architecture (gated multi-hybrid: short/medium/long convolutions + attention); DNA/RNA/protein predictions from DNA alone |
| **Evo** (v1) | Arc Institute | 7B | 2.7M prokaryotic/phage genomes | 131K nucleotides | StripedHyena architecture; byte-level DNA tokenization |
| **AlphaGenome** | DeepMind | – | Human + mouse functional genomics | 1Mb (1M bp) | U-Net CNN encoder + transformer + decoder; predicts gene expression, chromatin, TF binding, splice sites, 3D contacts at single-bp resolution |
| **Enformer** | DeepMind | – | Human/mouse epigenomics | 196K bp | Transformer for gene expression prediction from DNA; predecessor to AlphaGenome |
| **Nucleotide Transformer v2** | InstaDeep | 500M | Multi-species genomes | 12K tokens (6-mer) | Segment NT extends to 25K tokens; fine-tuned for 18 genomic tasks |
| **DNABERT-2** | – | 117M | Multi-species genomes | – | First BPE-tokenized DNA model (replaces fixed k-mers) |
| **HyenaDNA** | Stanford (Hazy Research) | 6.6M–1.6B | Human genome | Up to 1M nucleotides | Hyena operator (sub-quadratic long convolution); single-nucleotide resolution at genomic scale |
| **Caduceus** | – | Up to 7B | Multi-species | – | Bidirectional Mamba (state-space model) with reverse-complement equivariance |
| **CpG Transformer** | – | – | Methylation profiles | – | Transformer for CpG methylation prediction using DNA context |

**Key innovations in this space:**

- **Sub-quadratic architectures** (Hyena, Mamba, StripedHyena) address the long-context problem, which standard quadratic attention handles poorly
- **Reverse-complement equivariance** (Caduceus) builds in the biological constraint that DNA is double-stranded
- **Zero-shot cross-modality** (Evo 2) shows a DNA-only model can predict RNA structure and protein function, suggesting that deep genomic understanding implicitly captures downstream biology

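
To make the double-strand constraint concrete, the sketch below shows the simpler property of reverse-complement *invariance* for a hand-built feature function: the features are identical whichever strand is read. This is not how Caduceus implements equivariant layers; the k-mer featurization and the function names are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_counts(seq, k=3):
    """Count overlapping k-mers on a single strand."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def rc_invariant_counts(seq, k=3):
    """Strand-symmetric features: sum the k-mer counts of both strands."""
    merged = kmer_counts(seq, k)
    for kmer, n in kmer_counts(reverse_complement(seq), k).items():
        merged[kmer] = merged.get(kmer, 0) + n
    return merged


forward = "ATGCGT"
reverse = reverse_complement(forward)
same_features = rc_invariant_counts(forward) == rc_invariant_counts(reverse)
```

A learned model can get the same guarantee either by construction (weight sharing across strands) or by augmentation; the architectural route removes the need to learn the symmetry from data.
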
### 2.3 Single-Cell Foundation Models

These models are trained on single-cell RNA-seq (scRNA-seq) data and learn cell-type-specific gene expression patterns across millions of cells.

| Model | Organisation | Params | Training Data | Key Innovation |
|-------|-------------|--------|---------------|----------------|
| **TranscriptFormer** | CZI | 444M (633M with frozen embeddings) | 112M cells, 12 species spanning 1.5B years of evolution | Uses ESM-2 embeddings as gene tokens; expression-aware attention (log-count bias); cross-species transfer |
| **Geneformer** | Theodoris lab (Broad Institute / Gladstone) | 10M → larger v2 | 30M cells (v1) → 104M cells (v2) | Rank-value encoding of gene expression; transfer learning for rare diseases |
| **scFoundation** | BioMap | 100M | 50M cells | Asymmetric encoder-decoder (large encoder, Performer decoder) for efficiency; handles 20K genes |
| **scGPT** | Bo Wang lab | 50M+ | 33M cells | Generative pre-training for single-cell; cell embeddings, gene network inference |
| **GREmLN** | CZI/Columbia | 10.3M | 11M cells | Graph-aware attention using gene regulatory networks as positional encoding; outperforms models 10-100x larger |
| **CellFM** | – | 800M | 100M cells | Modified RetNet (retention mechanism) for efficient training |
| **UCE** (Universal Cell Embeddings) | Stanford | – | 36M cells, 8 species | Uses ESM-2 protein embeddings as gene representations; no re-training needed for new species |
| **GeneCompass** | – | 120M | Mouse + human cells | Embeds 4 types of biological prior: GRNs, promoters, gene families, co-expression |

**Key insight:** GREmLN's 10.3M parameter model outperforming 100M+ parameter models demonstrates that **biological inductive biases can substitute for raw scale.** Architecture informed by gene regulatory network structure is more efficient than brute-force attention over unstructured gene lists.

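
Rank-value encoding, listed for Geneformer in the table above, is easy to misread, so here is a simplified sketch of the idea: each cell becomes a list of gene tokens ordered by expression relative to a corpus-wide baseline, which pushes ubiquitously high housekeeping genes down the ranking. The function name, the median values, and the `max_len` cutoff are assumptions for illustration, not Geneformer's actual pipeline.

```python
def rank_value_encode(counts, corpus_median, max_len=2048):
    """Order a cell's expressed genes by count normalised by a corpus median."""
    scored = [
        (gene, count / corpus_median[gene])
        for gene, count in counts.items()
        if count > 0 and corpus_median.get(gene, 0) > 0
    ]
    # Highest normalised expression first; gene name breaks ties deterministically.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [gene for gene, _ in scored][:max_len]


# Made-up counts for one cell and made-up corpus-wide medians.
cell = {"ACTB": 500, "GAPDH": 300, "TP53": 12, "CDKN2A": 9, "MYC": 0}
medians = {"ACTB": 400.0, "GAPDH": 300.0, "TP53": 4.0, "CDKN2A": 1.5, "MYC": 5.0}
tokens = rank_value_encode(cell, medians)
```

In this toy cell the raw counts are dominated by ACTB and GAPDH, but after normalisation CDKN2A and TP53 rank first, since they are the genes most elevated relative to their usual level.
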
### 2.4 Epigenomic Models

Epigenomic models are the least developed category, even though epigenetic changes are among the most robust biomarkers of aging.

| Model | Organisation | Params | Training Data | Key Innovation |
|-------|-------------|--------|---------------|----------------|
| **MethylGPT** | – | 1.6B | 226K methylation profiles (7.6B CpG tokens) | Largest methylation-specific model; ~450K CpG sites; sample-level embeddings |
| **CpGPT** | – | – | >100K methylation samples, 1500+ datasets | Predicts intervention effects on methylation age |
| **CpG Transformer** | – | – | Paired DNA + methylation | Predicts CpG methylation from sequence context |

**Gap:** No model jointly pre-trains on methylation data and gene expression from the same cells. This is critical because methylation regulates expression, and understanding the methylation → expression → phenotype cascade is fundamental to aging.

### 2.5 RNA Models

RNA biology is increasingly recognised as critical (non-coding RNAs, mRNA stability, splicing regulation), but RNA foundation models lag behind DNA and protein models.

| Model | Organisation | Params | Training Data | Key Innovation |
|-------|-------------|--------|---------------|----------------|
| **RiNALMo** | – | 650M | 36M non-coding RNA sequences | Largest RNA language model |
| **Uni-RNA** | – | – | ~1B RNA sequences | Broad RNA model |
| **RNA-FM** | – | – | ncRNA sequences | Foundation model for non-coding RNA |
| **SpliceBERT** | – | 19M | 2M pre-mRNA sequences | Pre-mRNA splicing prediction |

### 2.6 Protein Design for Longevity

A striking proof-of-concept has already emerged:

**GPT-4b micro** (OpenAI + Retro Biosciences, January 2025): A biology-specialised variant of GPT-4o trained on protein sequences, biological text, and tokenized 3D structural data. Applied to redesigning the Yamanaka reprogramming factors (SOX2, KLF4):

- More than 50-fold improvement in pluripotency marker expression over wild-type
- More than 30% of AI-designed SOX2 variants outperformed the natural protein
- Validated across multiple donors, cell types, and delivery methods
- Full pluripotency and genomic stability confirmed in derived iPSC lines

This demonstrates that **cross-modal training (sequence + text + structure) already produces actionable results for longevity-relevant biology.**

---

## 3. Emerging Cross-Modal Work

### 3.1 Modular Systems

**AIDO** (Artificial Intelligence-Driven Digital Organism, GenBio AI, December 2024):

- The most ambitious multi-layer system to date
- Architecture: Modular system of interconnected foundation models covering DNA (AIDO.DNA, up to 7B params; AIDO.DNA2 with MoE), RNA (AIDO.RNA, 650M), protein (AIDO.Protein, up to 16B), single-cell (AIDO.Cell, 120M), and tissue (AIDO.Tissue)
- Key innovation: **Hierarchical representation propagation**, in which embeddings from one layer's model feed into the next (DNA embeddings → RNA model → protein model → cell model → tissue model)
- Trained on 300+ datasets across modalities
- Status: Component models available; full cross-modal integration still in progress

**CZI Virtual Cell Platform** (Chan Zuckerberg Initiative, 2024-2025):

- Ecosystem of complementary models: TranscriptFormer (single-cell), GREmLN (gene regulatory networks), rBio (reasoning agent), plus connections to ESM3 for protein representations
- NVIDIA partnership (October 2025) to scale to petabytes of data spanning billions of cells
- Not a single unified model, but a platform for connecting multiple specialised models

### 3.2 Cross-Modal Fusion

| Project | Year | Modalities | Approach |
|---------|------|------------|----------|
| **IsoFormer** | 2024 | DNA + RNA + protein | Cross-attention fusion of pretrained encoders for transcript isoform expression prediction |
| **BioLangFusion** | 2025 | DNA + mRNA + protein | Codon-level alignment of pretrained DNA, mRNA, and protein language models; three fusion strategies; no additional pre-training needed |
| **mosGraphGPT** | 2024 | Multi-omic signaling graphs | Graph-based integration of multi-omic data on TCGA cancer + Alzheimer's disease data |
| **TMO-Net** | 2024 | Pan-cancer multi-omics | Cross-omics interaction learning with incomplete omics inference |

### 3.3 Aging-Specific Multi-Modal Models

**Precious3GPT (P3GPT)** (Insilico Medicine / Galkin et al., July 2024):

- The first transformer specifically designed for aging research
- Trained on >2 million data points from public omics datasets spanning mice, rats, monkeys, and humans
- Integrates transcriptomics, methylation, proteomics, and laboratory blood tests, plus biomedical text and knowledge graphs
- Capabilities: age prediction across species, target discovery, tissue/sex/disease classification, drug sensitivity prediction, omics response simulation
- Uses Retrieval-Augmented Generation (RAG) for literature integration
- Available on HuggingFace

**ClockBase Agent** (2024):

- AI-driven screening of 43,529 interventions using 2M+ samples and 40 aging clocks
- Identified 5,756 statistically likely age-modifying interventions
- Demonstrates the scale at which computational aging research can operate

### 3.4 What's Still Missing

Despite these advances, **no existing model genuinely learns joint representations across the central dogma in a unified pre-training process.** Current approaches either:

1. Train separate modality-specific models and connect them post-hoc (AIDO, CZI Virtual Cell, IsoFormer)
2. Focus on one modality but incorporate limited information from others (TranscriptFormer using ESM-2 embeddings)
3. Use bulk-level multi-omic data with limited molecular resolution (Precious3GPT, mosGraphGPT)

A unified multi-layer model, one that simultaneously learns DNA regulatory grammar, epigenetic control logic, transcriptomic patterns, and protein function from matched multi-omic data, does not yet exist.

---

## 4. The Case for a Unified Multi-Layer Model

### 4.1 Why Post-Hoc Connection Isn't Enough

Connecting pre-trained single-layer models with cross-attention or fusion layers is the current pragmatic approach. But it has fundamental limitations:

1. **Representation misalignment.** Each model learned its own latent space independently. DNA embeddings, protein embeddings, and gene expression embeddings live in different geometric spaces with different organising principles. Post-hoc alignment (like CLIP-style contrastive learning) captures surface correspondences but misses deep structural relationships.

2. **Missing cross-layer features.** Some of the most important biological phenomena are inherently cross-layer and cannot be captured by any single-layer model:
   - How a SNP in a promoter region changes chromatin accessibility, which alters transcription factor binding, which changes gene expression, which changes protein levels, which alters cellular function
   - How a drug that binds a protein (protein layer) changes a signalling cascade (protein-protein interactions) that ultimately changes gene expression (transcriptome) and epigenetic marks (epigenome)
   - How aging-associated epigenetic drift changes which genes are accessible, which proteins are made, and which metabolic pathways are active

3. **Emergent cross-layer representations.** In NLP, transformer models learn syntactic and semantic representations that are not present in any individual word. Similarly, a truly joint multi-layer biological model could learn emergent "biological concepts" (representations of regulatory programs, cellular states, or aging trajectories) that are invisible to any single-layer model.

### 4.2 The Evo 2 Precedent

Evo 2 (trained only on DNA) demonstrates zero-shot capability on RNA structure prediction and protein function prediction. A DNA sequence implicitly encodes the RNA that will be transcribed and the protein that will be translated. The fact that a DNA-only model partially captures downstream biology suggests that a model explicitly trained on all layers could capture considerably more.

### 4.3 The Aging Imperative

Aging is perhaps the strongest argument for multi-layer modelling:

- **Epigenetic clocks** measure aging at the methylation layer but don't tell you which genes are affected or what proteins change
- **Transcriptomic signatures** of aging identify expression changes but don't tell you what regulatory elements drove them
- **Proteomic aging** reveals functional decline but doesn't connect back to the genetic and epigenetic causes

A model that understands the full cascade (DNA variant → epigenetic regulation → transcription → protein → function) would enable:

- **Causal aging clocks** that identify root causes, not just correlates
- **Intervention simulation** that traces how a drug propagates through all layers
- **Personalised aging assessment** that integrates an individual's genome, epigenome, transcriptome, and proteome into a unified aging trajectory

---

## 5. Architecture Design

### 5.1 Design Principles

The architecture must solve several challenges simultaneously:

1. **Multi-resolution.** DNA operates at single-nucleotide resolution, epigenomics at CpG/histone resolution, transcriptomics at gene resolution, and proteomics at protein resolution. The model must handle all of these.

2. **Variable context length.** Genomic context requires millions of bases. A cell's transcriptome is ~20,000 genes. A protein is typically hundreds of amino acids. The model must handle vastly different sequence lengths.

3. **Cross-layer attention.** The model must learn which DNA regions regulate which genes, which genes produce which proteins, and how these relationships change across cell types and aging states.

4. **Scalable training.** Paired multi-omic data is scarce. The model must be able to learn from abundant unpaired data while exploiting paired data when available.

### 5.2 Proposed Architecture: Hierarchical Cross-Modal Transformer

```
┌─────────────────────────────────────────────────────────────────────┐
│                      MULTI-LAYER BIOLOGICAL FM                      │
│                                                                     │
│      ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐      │
│      │ DNA/Epi  │   │ RNA /    │   │ Protein  │   │ Cell     │      │
│      │ Encoder  │   │Transcript│   │ Encoder  │   │ State    │      │
│      │          │   │ Encoder  │   │          │   │ Encoder  │      │
│      │ StripedH │   │ Transform│   │ ESM-style│   │ GREmLN-  │      │
│      │ -yena +  │   │ -er +    │   │ + VQ-VAE │   │ style +  │      │
│      │ attention│   │ express- │   │ structure│   │ GRN-aware│      │
│      │          │   │ ion-aware│   │ tokens   │   │ attention│      │
│      └────┬─────┘   └────┬─────┘   └────┬─────┘   └────┬─────┘      │
│           │              │              │              │            │
│           └──────┬───────┴──────┬───────┴──────┬───────┘            │
│                  │              │              │                    │
│      ┌───────────┴──────────────┴──────────────┴───────────┐        │
│      │              CROSS-MODAL FUSION LAYERS              │        │
│      │                                                     │        │
│      │  Cross-attention between all modality pairs         │        │
│      │  Hierarchical: DNA↔RNA (codon-level alignment)      │        │
│      │                RNA↔Protein (translation mapping)    │        │
│      │                All↔Cell State (expression context)  │        │
│      │                                                     │        │
│      │  Shared latent "biological concept" space           │        │
│      └──────────────────────────┬──────────────────────────┘        │
│                                 │                                   │
│      ┌──────────────────────────┴──────────────────────────┐        │
│      │                 TASK-SPECIFIC HEADS                 │        │
│      │                                                     │        │
│      │  - Aging clock (multi-omic biological age)          │        │
│      │  - Perturbation prediction (intervention effects)   │        │
│      │  - Cell type annotation                             │        │
│      │  - Gene expression prediction from DNA + epigenomics│        │
│      │  - Protein function prediction                      │        │
│      │  - Variant effect prediction                        │        │
│      │  - Drug target identification                       │        │
│      └─────────────────────────────────────────────────────┘        │
└─────────────────────────────────────────────────────────────────────┘
```

### 5.3 Modality-Specific Encoders

Each layer gets a specialised encoder that captures its unique structure:

**DNA/Epigenome Encoder:**

- Architecture: StripedHyena 2-style (from Evo 2) for long-context genomic sequences, hybridized with attention layers
- Input: Nucleotide sequence + CpG methylation values + histone modification tracks + chromatin accessibility (ATAC-seq)
- Tokenization: Single-nucleotide with epigenetic annotations encoded as continuous features on each token
- Context: 1Mb+ (sufficient to capture distal regulatory elements)
- Innovation: Methylation and chromatin state are encoded as continuous features attached to each nucleotide token, not as separate modalities. This forces the model to learn how epigenetic marks on specific DNA positions affect function.

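
One way to picture "continuous features attached to each nucleotide token" is as a per-position feature vector: a one-hot base plus epigenetic channels. The sketch below is an assumption about how such an input could be laid out, not a specification from this design; the channel order, the "measured" flag for methylation, and the function names are all hypothetical.

```python
NUCLEOTIDES = "ACGT"


def encode_position(base, methylation=None, accessibility=0.0):
    """Build one token's features: one-hot base + epigenetic channels.

    A separate 'measured' flag distinguishes an unmeasured site from a
    site measured as unmethylated (an illustrative design choice).
    """
    one_hot = [1.0 if base == n else 0.0 for n in NUCLEOTIDES]
    measured = 0.0 if methylation is None else 1.0
    meth_value = 0.0 if methylation is None else float(methylation)
    return one_hot + [meth_value, measured, float(accessibility)]


def encode_sequence(seq, methylation_by_pos, accessibility_track):
    """Encode a sequence with sparse methylation calls and a dense ATAC-like track."""
    return [
        encode_position(base, methylation_by_pos.get(i), accessibility_track[i])
        for i, base in enumerate(seq)
    ]


# Toy input: the C at position 1 (part of a CpG) has a methylation beta value of 0.8.
features = encode_sequence("ACGT", {1: 0.8}, [0.1, 0.5, 0.5, 0.2])
```

Histone-mark tracks would add further channels in the same way; the point is that sequence and epigenetic state arrive at the encoder as a single token stream rather than two modalities to be aligned later.
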
**RNA/Transcriptome Encoder:**

- Architecture: TranscriptFormer-style with expression-aware attention
- Input: Gene expression values (rank-encoded or continuous-embedded), assay metadata
- Innovation: Use pre-computed ESM-2/ESM3 protein embeddings as gene representations (following TranscriptFormer and UCE), giving each gene a rich functional descriptor rather than a learned-from-scratch token
- Gene regulatory network structure encoded in attention mask (following GREmLN)

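
The two attention modifications named in the list above can be combined in a few lines. The following is a minimal sketch under stated assumptions: the expression bias is taken to be an additive `log(1 + count)` term on the attention logits, and the regulatory network is applied as a hard mask. The exact formulations used by TranscriptFormer and GREmLN may differ; `attention_weights` and its arguments are hypothetical names.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, counts, grn_mask):
    """Bias logits by log expression and mask pairs absent from the GRN.

    scores[i][j]  : raw query-key logit between gene i and gene j
    counts[j]     : expression count of gene j in this cell
    grn_mask[i][j]: True if gene i may attend to gene j
    Each row must allow at least one gene (for example, itself).
    """
    weights = []
    for i, row in enumerate(scores):
        biased = [
            s + math.log1p(counts[j]) if grn_mask[i][j] else -math.inf
            for j, s in enumerate(row)
        ]
        weights.append(softmax(biased))
    return weights


# Three genes with equal raw scores; only expression and the mask differ.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
counts = [0, 9, 99]
grn_mask = [
    [True, True, True],    # gene 0 may attend to every gene
    [False, True, False],  # gene 1 attends only to itself
    [True, False, True],
]
weights = attention_weights(scores, counts, grn_mask)
```

With equal raw scores, gene 0's attention is split in proportion to `1 + count`, so highly expressed genes dominate, while masked pairs receive exactly zero weight.
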
**Protein Encoder:**

- Architecture: ESM3-style multi-track transformer
- Input: Amino acid sequence + VQ-VAE structure tokens + functional annotation tokens
- Pre-trained weights: Initialise from ESM3 or ESM-2 and fine-tune
- Innovation: Integrate post-translational modification (PTM) predictions as additional token tracks

**Cell State Encoder:**

- Architecture: GREmLN-style graph-aware transformer
- Input: The combined output of the other three encoders for a given cell, plus cell metadata (tissue, age, sex, species)
- Innovation: This encoder reasons over the integrated state; it does not see raw data, only the representations produced by each modality-specific encoder

### 5.4 Cross-Modal Fusion

Cross-modal fusion is the critical architectural innovation. There are several options:

**Option A: Hierarchical Cross-Attention (Recommended)**

- DNA/Epi encoder outputs feed into RNA/Transcript encoder as context (cross-attention)
- RNA/Transcript outputs feed into Protein encoder as context
- All feed into Cell State encoder
- This follows the biological information flow: DNA → RNA → Protein → Cell
- Reverse connections (protein → DNA regulation) handled by iterative passes or bidirectional cross-attention

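
The forward order of Option A can be sketched as a chain of cross-attention calls. This is a deliberately stripped-down illustration: a single head, no learned projection matrices, and embeddings of equal width across modalities are all simplifying assumptions, and `cross_attend` and `hierarchical_pass` are hypothetical names.

```python
import math


def cross_attend(queries, context):
    """Update each query with a softmax-weighted average of context vectors."""
    dim = len(context[0])
    updated = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim) for c in context]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        mixed = [
            sum(w * c[d] for w, c in zip(weights, context)) / total
            for d in range(dim)
        ]
        # Residual connection keeps the query's own information.
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated


def hierarchical_pass(dna, rna, protein):
    """One forward sweep following DNA -> RNA -> Protein -> Cell."""
    rna = cross_attend(rna, dna)          # RNA tokens read DNA/Epi context
    protein = cross_attend(protein, rna)  # protein tokens read updated RNA
    cell_context = dna + rna + protein    # everything feeds the cell-state encoder
    return rna, protein, cell_context


# One toy 2-D embedding per modality.
rna_out, protein_out, cell_context = hierarchical_pass(
    dna=[[1.0, 0.0]], rna=[[0.0, 1.0]], protein=[[0.5, 0.5]]
)
```

The reverse connections mentioned in the last bullet would correspond to running a second sweep in the opposite direction (or repeating the sweep), which this sketch omits.
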
**Option B: All-to-All Cross-Attention**

- Every encoder attends to every other encoder at each layer
- Maximum information flow but O(N²) in number of modalities and computationally expensive
- May be unnecessary if the hierarchical approach captures sufficient cross-layer information

**Option C: Shared Latent Space (Contrastive)**

- Each encoder projects into a shared "biological concept" space
- Contrastive loss aligns representations of the same biological entity across modalities (e.g., a gene's DNA locus, its RNA expression, and its protein product should map to nearby points)
- CLIP-style but for biology
- Computationally efficient but potentially loses fine-grained cross-modal relationships

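
A CLIP-style objective for Option C can be written compactly. The sketch below implements a symmetric InfoNCE loss over two lists of embeddings, where index `i` in both lists refers to the same biological entity. The temperature value and the function names are assumptions; a real implementation would operate on batched tensors with learned projection heads.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean cross-entropy of picking positives[i] for anchors[i] among all positives."""
    loss = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_norm - logits[i]
    return loss / len(anchors)


def symmetric_contrastive_loss(x, y, temperature=0.1):
    """Average the loss in both directions, as in CLIP-style training."""
    return 0.5 * (info_nce(x, y, temperature) + info_nce(y, x, temperature))


# Toy embeddings for two genes in two modalities.
rna_emb = [[1.0, 0.0], [0.0, 1.0]]
protein_aligned = [[1.0, 0.0], [0.0, 1.0]]  # same gene order
protein_swapped = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs

loss_aligned = symmetric_contrastive_loss(rna_emb, protein_aligned)
loss_swapped = symmetric_contrastive_loss(rna_emb, protein_swapped)
```

The loss is near zero when matching entities already sit together and large when they are mismatched, which is the pressure that pulls a gene's representations from different modalities toward the same region of the shared space.
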
**Recommended approach: Hierarchical cross-attention (A) as the primary mechanism, with a shared latent space (C) as an auxiliary training objective to improve alignment.**

### 5.5 Key Architectural Innovations to Incorporate

Drawing from the best ideas in existing models:

- **Reverse-complement equivariance** (from Caduceus) in the DNA encoder
- **Expression-aware attention** (from TranscriptFormer) in the RNA encoder: expression magnitudes bias the attention matrix rather than being tokenized
- **VQ-VAE structure tokenization** (from ESM3) in the protein encoder: enables joint sequence-structure reasoning
- **GRN-aware attention** (from GREmLN) in the cell state encoder: gene regulatory network structure as positional encoding
- **Asymmetric encoder-decoder** (from scFoundation) for computational efficiency: large encoder, small decoder
- **Codon-level alignment** (from BioLangFusion) for DNA↔protein cross-attention: align at the natural unit of the genetic code

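
Codon-level alignment addresses a length mismatch: a coding sequence has three nucleotide tokens for every amino acid token. A minimal sketch of one way to bring the two to the same length is shown below. Mean pooling over each triplet is an assumption made here for illustration; it is not a description of BioLangFusion's actual method, and `pool_codons` is a hypothetical name.

```python
def pool_codons(nucleotide_embeddings):
    """Average each run of three nucleotide vectors into one codon vector."""
    if len(nucleotide_embeddings) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = []
    for start in range(0, len(nucleotide_embeddings), 3):
        triplet = nucleotide_embeddings[start:start + 3]
        # zip(*triplet) walks the embedding dimensions column by column.
        codons.append([sum(column) / 3 for column in zip(*triplet)])
    return codons


# Six toy nucleotide embeddings (two codons) pooled to two codon vectors,
# which then line up one-to-one with two amino-acid tokens.
nucleotide_emb = [[1, 0], [2, 0], [3, 0], [0, 3], [0, 6], [0, 9]]
codon_emb = pool_codons(nucleotide_emb)
```
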
---

## 6. Training Data and Strategy

### 6.1 The Paired Data Problem

The biggest practical challenge is data. Truly paired multi-omic data (multiple measurements from the same cells) is orders of magnitude scarcer than single-modality data:

| Data Type | Available Scale | Notes |
|-----------|----------------|-------|
| Protein sequences | ~2B+ | Abundant |
| DNA sequences | 9.3T+ nucleotides | Abundant |
| scRNA-seq (unpaired) | 100M+ cells | Abundant (CELLxGENE) |
| Methylation arrays (bulk) | 226K+ profiles | Moderate |
| 10X Multiome (scRNA + scATAC, paired) | Millions of cells | Growing rapidly |
| CITE-seq (scRNA + surface proteins, paired) | Hundreds of thousands | Moderate |
| scNMT-seq (scRNA + methylation + chromatin, trimodal) | Thousands of cells | Very limited |
| Full quad-modal (DNA + methylation + RNA + protein, same cell) | Essentially none | The bottleneck |

### 6.2 Three-Phase Training Strategy

To address the data scarcity, training proceeds in phases:

**Phase 1: Modality-Specific Pre-Training (use abundant unpaired data)**

- DNA/Epi encoder: Pre-train on 9.3T+ nucleotides (Evo 2 scale) with epigenomic annotations from ENCODE/Roadmap
- RNA encoder: Pre-train on 100M+ cells from CELLxGENE
- Protein encoder: Initialise from ESM3 (98B params, 2.78B proteins)
- Each encoder learns rich representations within its modality

**Phase 2: Cross-Modal Alignment (use paired and pseudo-paired data)**
|
||||||
|
- Train cross-attention fusion layers on available paired data:
|
||||||
|
- 10X Multiome (scRNA + scATAC): Millions of cells — largest paired set
|
||||||
|
- CITE-seq (scRNA + surface proteins): Hundreds of thousands of cells
|
||||||
|
- TCGA (bulk RNA + methylation + some proteomics): ~11,000 patients, 33 cancer types
|
||||||
|
- GTEx (genotype + expression, 54 tissues): ~17,000 samples
|
||||||
|
- scNMT-seq (trimodal): Thousands of cells — small but uniquely valuable
|
||||||
|
- **Pseudo-pairing strategies** to expand effective paired data:
|
||||||
|
- Match bulk RNA-seq and methylation from the same tissue/individual
|
||||||
|
- Use DNA sequence as an anchor — every cell shares the same genome, so DNA context is always "paired" with whatever other modality is measured
|
||||||
|
- Use gene identity as a bridge — a gene's DNA locus, RNA expression, and protein product are biologically paired even when measured in different experiments
|
||||||
|
- **Contrastive alignment loss:** Same gene/protein/cell should have similar representations regardless of which modality they come from
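
The contrastive alignment objective in Phase 2 can be illustrated with a symmetric InfoNCE loss, the form popularised by CLIP. The following is a minimal pure-Python sketch rather than the proposed implementation: the names `info_nce`, `rna_emb`, and `prot_emb`, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (falls back to 1.0 for an all-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(rna_emb, prot_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    rna_emb[i] and prot_emb[i] are assumed to describe the same gene or cell;
    every other pairing in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in rna_emb]
    b = [l2_normalize(v) for v in prot_emb]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean negative log-probability of the matching (diagonal) entry.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Transpose the logits to score the protein -> RNA direction as well.
    cols = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))

# Toy batch: three genes embedded by two hypothetical encoders.
rna = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
shuffled = [aligned[1], aligned[2], aligned[0]]  # pairing deliberately broken

loss_aligned = info_nce(rna, aligned)
loss_shuffled = info_nce(rna, shuffled)
```

With correctly paired rows the loss is low; breaking the pairing raises it, which is the signal that pulls matching cross-modal representations together during alignment.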

**Phase 3: Holistic Fine-Tuning and Task-Specific Training**
- Fine-tune the entire model end-to-end on specific tasks:
  - Multi-omic aging clock training (using longitudinal cohort data: ROSMAP, Dunedin, UK Biobank)
  - Perturbation response prediction (using Perturb-seq / CROP-seq data: millions of cells with CRISPR perturbations)
  - Drug response prediction
  - Intervention effect simulation
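
One way to make the three phases concrete is a small schedule that records which modules are updated in each phase. This is a sketch under stated assumptions: the module names are hypothetical, and freezing the encoders during Phase 2 is an illustrative choice, since the text only says that the fusion layers are trained in that phase.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    trainable: frozenset  # module names that receive gradient updates
    data: tuple           # dataset families named in Section 6.2

# Hypothetical module names for the encoders, fusion layers, and task heads.
MODULES = frozenset({
    "dna_epi_encoder", "rna_encoder", "protein_encoder", "fusion", "task_heads",
})

SCHEDULE = (
    # Phase 1: each encoder is trained on its own modality
    # (the protein encoder starts from ESM3 weights).
    Phase("modality_pretraining",
          frozenset({"dna_epi_encoder", "rna_encoder", "protein_encoder"}),
          ("Evo 2 scale DNA", "ENCODE/Roadmap", "CELLxGENE")),
    # Phase 2: only the fusion layers are updated (assumption: encoders frozen).
    Phase("cross_modal_alignment",
          frozenset({"fusion"}),
          ("10X Multiome", "CITE-seq", "TCGA", "GTEx", "scNMT-seq")),
    # Phase 3: the entire model is fine-tuned end-to-end.
    Phase("end_to_end_finetuning",
          MODULES,
          ("ROSMAP", "Dunedin", "UK Biobank", "Perturb-seq / CROP-seq")),
)

def frozen_modules(phase):
    """Return the modules that stay fixed during a given phase, sorted by name."""
    return sorted(MODULES - phase.trainable)

for phase in SCHEDULE:
    print(phase.name, "frozen:", frozen_modules(phase))
```

Keeping the schedule as data makes it easy to test alternatives, for example unfreezing the top encoder layers during Phase 2.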

### 6.3 Key Training Datasets

**For Phase 1 (per-modality):**
- Evo 2 training data: 9.3T nucleotides, 128K genomes (for DNA encoder)
- CELLxGENE: 100M+ cells (for RNA encoder)
- UniRef + AlphaFold DB: billions of sequences + 214M structures (for protein encoder)
- GEO + Generation Scotland: 226K+ methylation profiles (for epigenome training)
- ENCODE + Roadmap: histone, chromatin, TF binding across 500+ cell types

**For Phase 2 (cross-modal):**
- 10X Multiome datasets (growing rapidly: 10X Genomics public data + CZI Billion Cells Project)
- CITE-seq (NeurIPS 2021 competition: ~90K cells; plus growing corpus)
- TCGA (11K patients, matched DNA + RNA + methylation + some protein)
- GTEx (17K samples, genotype + expression across 54 tissues)
- scMMO-atlas: 852 curated samples across modalities
- SingleCellMultiModal (Bioconductor): landmark paired datasets

**For Phase 3 (longevity-specific):**
- ROSMAP: ~3,000 individuals with longitudinal multi-omic profiling + cognitive data
- Dunedin Study: basis of DunedinPACE, longitudinal birth cohort
- UK Biobank: 500K participants with genotyping + clinical biomarkers + some multi-omic
- Calico Labs aging datasets (if accessible)
- Perturb-seq: millions of cells with CRISPR perturbations (for intervention prediction)
- DrugAge + CMap: known lifespan-extending compounds + drug-induced expression signatures

---

## 7. Applications to Longevity Research

### 7.1 Next-Generation Aging Clocks

Current aging clocks are powerful but limited to single modalities and shallow architectures (typically elastic net regression on a few hundred to ~1,000 features).

**What a multi-layer model enables:**

**Unified multi-omic aging clock:**
- Input: Individual's genome + methylome + transcriptome + proteome
- Output: Overall biological age + organ-specific ages + rate of aging + identification of fastest-aging molecular systems
- Trained on longitudinal multi-omic data from aging cohorts
- Recent work on "multi-organ, multi-omics aging clocks" (February 2025) suggests this direction is promising

**Causal aging clock:**
- Current clocks are correlative: they measure things associated with aging, which may be passengers rather than drivers
- A multi-layer model that understands the DNA → epigenome → transcriptome → proteome cascade could help identify which changes are upstream (candidate causes) vs. downstream (consequences)
- This directly addresses the "causal clock problem" identified in COMPUTATIONAL_BIOLOGY.md Section 14.1

**Intervention-sensitive clock:**
- CpGPT already predicts intervention effects on methylation age
- A multi-layer model could predict how an intervention propagates through all layers
- Example: "If this person takes compound X, how will their methylation change? What downstream transcriptomic shifts will that cause? What proteins will be affected? What is the net effect on biological age?"

**Metabolic-rate-aware clock:**
- Addresses the critical concern from PLAN.md Section 15.6: do current clocks confound metabolic suppression with rejuvenation?
- A multi-layer model that sees epigenetic marks, metabolic gene expression, and thyroid pathway proteins together could distinguish genuine rejuvenation from metabolic suppression
- Test: Does hypothyroidism score as "younger" on a multi-omic clock? If so, the clock is confounded.
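
The hypothyroidism test can be framed as a comparison of age acceleration (predicted age minus chronological age) between a hypothyroid group and matched controls. The sketch below uses a simple permutation test; all numbers are invented for illustration, and the function names are hypothetical. A real analysis would also need to adjust for covariates such as sex, tissue, and medication.

```python
import random

def age_acceleration(predicted, chronological):
    """Clock-predicted age minus chronological age, per individual."""
    return [p - c for p, c in zip(predicted, chronological)]

def mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, p-value).
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Invented example values (years); not real cohort data.
hypo_pred = [48.0, 51.5, 39.0, 60.2, 44.1, 55.3]
hypo_age = [52.0, 55.0, 43.0, 63.0, 49.0, 58.0]
ctrl_pred = [50.5, 47.0, 62.1, 41.3, 56.0, 44.8]
ctrl_age = [50.0, 47.5, 61.0, 42.0, 55.0, 45.0]

hypo_accel = age_acceleration(hypo_pred, hypo_age)
ctrl_accel = age_acceleration(ctrl_pred, ctrl_age)
observed_diff, p_value = permutation_test(hypo_accel, ctrl_accel)

# A clearly negative difference with a small p-value would mean the clock
# scores the hypothyroid group as "younger", i.e. it is likely confounded.
print(f"difference = {observed_diff:.2f} years, p = {p_value:.4f}")
```

The same comparison can be run on a single-layer clock and a multi-omic clock side by side to check whether the additional layers remove the bias.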

### 7.2 Drug Discovery and Repurposing

**Cross-layer target identification:**
- Input: Multi-omic aging signatures (what changes in the transcriptome, proteome, and epigenome with aging)
- Process: Model identifies which protein targets, if modulated, would reverse the aging signature across all layers
- Output: Ranked drug targets with predicted multi-omic effects
- Precious3GPT already demonstrates this for multi-species target discovery; a more capable model would extend it

**Geroprotector screening:**
- Use CMap drug-induced expression signatures as input
- Model predicts how each drug's transcriptomic effects propagate through proteome and epigenome
- Rank compounds by predicted biological age reversal across all molecular layers
- ClockBase Agent's screen of 43,529 interventions shows the scale at which this is possible
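
The ranking step in geroprotector screening can be sketched at the transcriptomic layer alone as a signature-reversal score: a compound scores highly when its expression signature points in the opposite direction to the aging signature. This is a simplified stand-in for connectivity-style scoring, not the CMap algorithm itself; the fold-change values and compound names are invented, and the cross-layer propagation described above is left to the model.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_by_reversal(aging_signature, drug_signatures):
    """Rank compounds by how strongly they oppose the aging signature.

    Signatures map gene symbol -> log fold change. Genes missing from a
    drug signature are treated as unchanged (0.0).
    """
    genes = sorted(aging_signature)
    age_vec = [aging_signature[g] for g in genes]
    scores = {}
    for drug, sig in drug_signatures.items():
        drug_vec = [sig.get(g, 0.0) for g in genes]
        scores[drug] = -cosine(age_vec, drug_vec)  # higher = stronger reversal
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented log fold changes for illustration only.
aging = {"CDKN2A": 2.0, "IL6": 1.5, "LMNB1": -1.8, "SIRT1": -0.7}
drugs = {
    "compound_A": {"CDKN2A": -1.5, "IL6": -1.0, "LMNB1": 1.2, "SIRT1": 0.5},
    "compound_B": {"CDKN2A": 1.0, "IL6": 0.8, "LMNB1": -0.9},
    "compound_C": {"CDKN2A": 0.1, "IL6": -0.2, "LMNB1": 0.0, "SIRT1": 0.1},
}

ranking = rank_by_reversal(aging, drugs)
for name, score in ranking:
    print(f"{name}: reversal score {score:+.3f}")
```

In a multi-layer setting the same ranking would be applied to the model's predicted changes in each layer, then combined, rather than to the measured transcriptomic signature only.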

**Senolytic design:**
- Senescent cells have unique multi-omic signatures: p16/p21 expression, SASP secretome, altered chromatin (SAHFs), metabolic changes
- A multi-layer model could identify molecular targets unique to the senescent multi-omic state, not just individual markers
- Design molecules (using protein design capabilities from ESM3/RFdiffusion) that selectively bind senescence-specific targets

### 7.3 Reprogramming Optimization

The GPT-4b micro / Retro Biosciences result (a reported >50-fold improvement in Yamanaka factor potency) suggests that AI can directly optimize reprogramming factors. A multi-layer model extends this:

**Full protocol optimization:**
- Not just the protein factors, but the entire reprogramming protocol: which factors, what expression levels, for how long, in which cell types, with what delivery method
- Model predicts the trajectory through multi-omic state space at each time point during reprogramming
- Identifies optimal dosing windows where epigenetic rejuvenation occurs without dedifferentiation

**Tissue-specific reprogramming:**
- Different tissues age differently and may require different reprogramming approaches
- A 2025 Cell paper showed partial reprogramming reverses "mesenchymal drift", a hallmark of aging at the transcriptional level
- Multi-layer model predicts which tissues/cell types benefit most and designs tissue-specific protocols

**Safety prediction:**
- The key challenge with reprogramming: how close to dedifferentiation (cancer risk) can you go while still achieving rejuvenation?
- A multi-layer model that sees the full epigenome + transcriptome + proteome trajectory could, in principle, locate the safety boundary more precisely than a single-layer model

### 7.4 Understanding Hallmarks of Aging at Molecular Resolution

Each of the 12 hallmarks involves multiple molecular layers. A multi-layer model could trace causality across layers for each hallmark:

| Hallmark | Cross-Layer Insight |
|----------|---------------------|
| **Genomic instability** | DNA mutations → expression changes → protein dysfunction → cellular consequences |
| **Telomere attrition** | Telomeric DNA changes → chromatin state changes → gene expression changes near telomeres (TPE) |
| **Epigenetic alterations** | Methylation drift → chromatin remodelling → transcriptomic shifts → proteomic changes |
| **Loss of proteostasis** | Chaperone expression changes → protein aggregation → stress response activation |
| **Deregulated nutrient sensing** | SNPs in mTOR/AMPK pathway genes → expression changes → signalling pathway activity |
| **Mitochondrial dysfunction** | mtDNA mutations → ETC gene expression changes → complex assembly defects → metabolic output |
| **Cellular senescence** | Unified senescence signature across DNA damage, chromatin (SAHFs), expression (p16/p21), and secretome (SASP) |
| **Stem cell exhaustion** | Niche signalling changes → stem cell transcriptomic shifts → functional decline |
| **Altered intercellular communication** | SASP factor genes → protein secretion → receiver cell transcriptomic response |
| **Disabled macroautophagy** | Autophagy gene regulation → ATG protein levels → lipidation status → degradation capacity |
| **Chronic inflammation** | Inflammatory gene regulation → NF-κB pathway activity → cytokine protein levels |
| **Dysbiosis** | Microbial gene content → host immune gene expression → barrier protein levels (currently unmodelled) |

### 7.5 Comparative Longevity Biology

Long-lived species offer natural experiments in aging resistance:

- **Evo 2** (trained on 128K genomes across all domains of life) already has cross-species genomic understanding
- **TranscriptFormer** (trained on 12 species spanning 1.5B years of evolution) has cross-species transcriptomic understanding
- A unified model could answer: "What genomic features → regulatory programs → protein functions are shared across independently evolved long-lived species (naked mole-rat, bowhead whale, Greenland shark, giant tortoise)?"
- This directly accelerates the "cross-species longevity gene discovery" project in COMPUTATIONAL_BIOLOGY.md Section 12

### 7.6 Personalized Longevity Medicine

The ultimate application:

- **Input:** Individual's genome (from WGS) + methylome (from blood draw) + transcriptome (from RNA-seq) + plasma proteome (from SomaScan/Olink)
- **Process:** Multi-layer model integrates all layers into a unified aging assessment
- **Output:**
  - Overall biological age and rate of aging
  - Organ-specific aging rates (liver, brain, cardiovascular, immune, musculoskeletal)
  - Identification of the fastest-aging molecular pathways in this specific individual
  - Ranked intervention recommendations with predicted effect on each aging pathway
  - Personalised supplement/diet/exercise protocol based on individual multi-omic profile

---

## 8. Technical Challenges

### 8.1 The Paired Data Bottleneck

Paired data is the single biggest obstacle. Strategies to mitigate it:

1. **DNA as universal anchor.** Every cell in an individual shares essentially the same genome. DNA context is always implicitly "paired" with any other measurement. Use DNA locus information to link measurements made in different experiments.

2. **Gene identity bridging.** A gene's DNA locus, RNA expression, and protein product are biologically linked even when measured separately. Train cross-modal alignment using gene identity as the pairing signal.

3. **Pseudo-pairing from bulk cohorts.** TCGA (~11K patients) has matched DNA, RNA, methylation, and some proteomics at bulk level. While lacking single-cell resolution, this provides large-scale cross-modal training signal.

4. **Imputation-assisted training.** Use trained single-modality models to impute missing modalities. For example, predict methylation from gene expression (MethylProphet, 2025, already does this), then use imputed + real data for cross-modal training.

5. **Generate more paired data.** The CZI Billion Cells Project with 10X Genomics and Ultima Genomics aims to dramatically increase multi-omic data at single-cell resolution.
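
Gene identity bridging (strategy 2) amounts to a join on gene identifier across measurements that were never taken together. A minimal sketch, assuming each experiment has already been summarised to one value per gene; the gene names, loci, and values below are placeholders, not real annotations.

```python
def pseudo_pair_by_gene(rna_by_gene, protein_by_gene, gene_to_locus):
    """Build pseudo-paired records for genes present in every input.

    Each record links a gene's DNA locus, RNA level, and protein level,
    even though the measurements may come from different experiments.
    """
    shared = sorted(set(rna_by_gene) & set(protein_by_gene) & set(gene_to_locus))
    return [
        {
            "gene": gene,
            "locus": gene_to_locus[gene],
            "rna": rna_by_gene[gene],
            "protein": protein_by_gene[gene],
        }
        for gene in shared
    ]

# Placeholder inputs (hypothetical genes, coordinates, and abundances).
rna = {"GENE_A": 5.2, "GENE_B": 0.3, "GENE_C": 2.1}
protein = {"GENE_A": 1.8, "GENE_C": 0.4, "GENE_D": 3.3}
loci = {
    "GENE_A": "chr1:1000-2000",
    "GENE_B": "chr2:5000-6500",
    "GENE_C": "chr3:300-900",
}

pairs = pseudo_pair_by_gene(rna, protein, loci)
for record in pairs:
    print(record)
```

Pseudo-pairs built this way share gene identity but not cellular context, so in training they would plausibly be down-weighted relative to true same-cell pairs.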

### 8.2 Scale vs. Inductive Bias

GREmLN (10.3M params) outperforms scFoundation (100M params) on some tasks by incorporating gene regulatory network structure. This challenges the "scale is all you need" assumption.

For a multi-layer model, the question is: **should we scale massively, or should we invest in architecture that encodes biological knowledge?**

The answer is probably both, but with emphasis on inductive biases:
- Encode the central dogma structure in the architecture (hierarchical cross-attention following DNA → RNA → Protein)
- Encode known regulatory relationships (GRN-aware attention, promoter-gene links)
- Use biological knowledge to define cross-modal alignment (codon-level DNA↔protein correspondence)
- Scale where data is abundant (protein sequences, DNA); conserve parameters where data is scarce (paired multi-omics)
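
One simple reading of "GRN-aware attention" is to mask attention so that each gene attends only to itself and to its known regulators. The sketch below shows that masking for a single attention head in pure Python. It is a generic illustration and is not claimed to be how GREmLN or any other published model implements it; the scores and the adjacency matrix are made up.

```python
import math

def grn_masked_attention(scores, adjacency):
    """Row-wise softmax over attention scores, restricted to GRN edges.

    scores[i][j]    : raw attention score from gene i (query) to gene j (key)
    adjacency[i][j] : True when gene j is a known regulator of gene i
    Each gene may always attend to itself, so every row has at least one
    allowed entry.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        allowed = [j for j in range(n) if adjacency[i][j] or i == j]
        top = max(scores[i][j] for j in allowed)  # subtract max for stability
        exps = {j: math.exp(scores[i][j] - top) for j in allowed}
        z = sum(exps.values())
        weights.append([exps[j] / z if j in exps else 0.0 for j in range(n)])
    return weights

# Toy example with three genes.
scores = [
    [0.2, 1.5, 3.0],
    [0.7, 0.1, 2.0],
    [1.0, 1.0, 1.0],
]
adjacency = [
    [False, True, False],   # gene 0 is regulated by gene 1
    [False, False, False],  # gene 1 has no known regulators
    [True, True, False],    # gene 2 is regulated by genes 0 and 1
]

attn = grn_masked_attention(scores, adjacency)
for row in attn:
    print([round(w, 3) for w in row])
```

Gene 0 has a high raw score toward gene 2, but the mask sets that weight to zero because no regulatory edge supports it. This is the sense in which prior knowledge substitutes for parameters.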

### 8.3 Temporal Modeling

Biology is dynamic. Nearly all current foundation models train on static snapshots. For aging research, temporal understanding is critical:
- Aging is a trajectory, not a state
- Interventions have time-dependent effects
- Longitudinal data exists (Dunedin, ROSMAP, UK Biobank repeat measurements) but is rarely used in foundation model training

**Proposed solution:** Add a temporal modality track to the architecture. Each data point gets a "biological time" token (chronological age, or time-since-intervention). Train on longitudinal data where available. The model learns to predict future multi-omic states from current states + time, enabling trajectory forecasting.
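
To train on "current state + time → future state", longitudinal records first have to be turned into forecasting examples. The sketch below builds every earlier-to-later visit pair per subject and attaches the elapsed time that would feed the time token. The record layout and field names are assumptions, and the profiles are placeholder vectors.

```python
from collections import defaultdict
from itertools import combinations

def build_forecast_pairs(visits):
    """Turn longitudinal visits into (input, delta_t, target) examples.

    visits: iterable of (subject_id, age_years, profile) tuples, where
    profile is any multi-omic feature vector. Subjects with a single
    visit contribute no examples.
    """
    by_subject = defaultdict(list)
    for subject, age, profile in visits:
        by_subject[subject].append((age, profile))

    pairs = []
    for subject, records in by_subject.items():
        records.sort(key=lambda r: r[0])  # order visits by age
        # combinations() on the sorted list yields only earlier -> later pairs.
        for (age0, x0), (age1, x1) in combinations(records, 2):
            pairs.append({
                "subject": subject,
                "start_age": age0,
                "delta_t": age1 - age0,  # value for the "biological time" token
                "input": x0,
                "target": x1,
            })
    return pairs

# Placeholder visits: subject s1 has three time points, s2 only one.
visits = [
    ("s1", 70.0, [0.2]),
    ("s2", 65.0, [0.1]),
    ("s1", 72.5, [0.3]),
    ("s1", 75.0, [0.5]),
]

examples = build_forecast_pairs(visits)
for ex in examples:
    print(ex["subject"], ex["start_age"], ex["delta_t"], ex["target"])
```

Using all visit pairs, not only consecutive ones, exposes the model to several horizons per subject, which matters when cohorts have irregular follow-up intervals.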

### 8.4 Evaluation and Validation

No standardised benchmark exists for multi-modal biological models. We need:

1. **Multi-omic aging benchmark:** A held-out set of longitudinal multi-omic profiles where the model must predict future aging states from current measurements
2. **Intervention prediction benchmark:** Given baseline multi-omics, predict the multi-omic response to a known intervention (exercise, senolytics, rapamycin, etc.)
3. **Cross-modal prediction benchmark:** Given one modality, predict the others (methylation → expression, expression → proteomics, DNA variant → expression change)
4. **Wet-lab validation:** Ultimately, the model's predictions must be validated experimentally. Partner with aging research labs for prospective validation.
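
For the cross-modal prediction benchmark, one plausible headline metric is the per-feature Pearson correlation between measured and predicted values across held-out samples. The sketch below is a minimal version of that metric; the feature names and values are invented, and a real benchmark would also need fixed splits and baselines.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (NaN if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)

def per_feature_correlation(measured, predicted):
    """Score each feature (gene, CpG, protein) across held-out samples.

    Both arguments map feature name -> list of values, one per sample,
    in the same sample order.
    """
    return {
        feature: pearson(measured[feature], predicted[feature])
        for feature in measured
        if feature in predicted
    }

# Invented held-out data: expression predicted from another modality.
measured = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [2.0, 1.0, 4.0, 3.0]}
predicted = {"GENE_A": [1.1, 1.9, 3.2, 3.8], "GENE_B": [3.0, 4.0, 1.0, 2.0]}

scores = per_feature_correlation(measured, predicted)
for feature, r in scores.items():
    print(f"{feature}: r = {r:+.3f}")
```

Reporting the full distribution of per-feature correlations, not just the mean, helps show whether a model predicts a few easy features well or the target modality broadly.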

### 8.5 Missing Modalities

Two critical modalities are absent from current foundation models:

**Metabolomics:** No metabolomic language model exists, despite metabolism being central to the aging framework in PLAN.md and METABOLISM_AND_AGING.md. Metabolites (ATP, NAD+, ROS, lactate, ketone bodies, fatty acid profiles) are the functional readout of cellular health. Incorporating metabolomics into the multi-layer model would directly connect molecular biology to the bioenergetic aging theory.

**Microbiome:** Dysbiosis of the gut microbiome is recognised as one of the 12 hallmarks of aging, and the microbiome produces metabolites (SCFAs, serotonin; see METABOLISM_AND_AGING.md Section 8.5) that directly affect host biology. No current multi-modal model incorporates host-microbiome interactions. Evo 2 was trained on prokaryotic genomes alongside eukaryotic ones, so the sequence-level understanding exists, but the host-interaction layer is missing.

---

## 9. Concrete Project Proposals

### 9.1 Tier 1: Achievable Now (6-12 months)

#### Project 1: Cross-Modal Aging Clock
**Goal:** Build the first aging clock that jointly uses methylation + transcriptomic + proteomic data.
**Approach:** Fine-tune pre-trained MethylGPT/CpGPT + Geneformer/TranscriptFormer with a shared output head on ROSMAP or similar longitudinal multi-omic aging data. Compare against best single-modality clocks.
**Why it matters:** Directly tests whether cross-modal integration improves aging measurement. If it does, this validates the entire multi-layer approach.
**Data:** ROSMAP (~3,000 individuals, multi-omic + longitudinal cognitive data)
**Compute:** Single A100, days to weeks

#### Project 2: DNA → Epigenome → Transcriptome Cascade Prediction
**Goal:** Given a DNA sequence and its epigenetic marks, predict gene expression, learning the full regulatory cascade.
**Approach:** Use AlphaGenome (DNA → regulatory predictions) as the DNA encoder, feed its outputs into TranscriptFormer (expression prediction) via cross-attention. Train on paired Multiome (ATAC + RNA) + genotype data.
**Why it matters:** Tests whether connecting existing models across layers produces better predictions than either alone.
**Data:** Paired 10X Multiome (ATAC + RNA) + GTEx (genotype + expression, 54 tissues) + ENCODE (epigenomics)
**Compute:** Multi-GPU, weeks

#### Project 3: The CR Confound Analysis (from COMPUTATIONAL_BIOLOGY.md), Multi-Omic Edition
**Goal:** Determine whether caloric restriction's multi-omic aging signatures are driven by reduced caloric intake or reduced PUFA intake.
**Approach:** Use multi-omic data from the CALERIE trial and Interventions Testing Program (ITP) studies. Train a model to separate CR's methylation, transcriptomic, and metabolomic effects into PUFA-reduction-attributable vs. calorie-reduction-attributable components.
**Why it matters:** Could reframe the entire CR field (see PLAN.md Section 15.6).
**Data:** CALERIE trial data, ITP archives
**Compute:** Single GPU, weeks

### 9.2 Tier 2: Requires Significant Infrastructure (12-24 months)

#### Project 4: Modality-Bridging Foundation Model
**Goal:** Pre-train a model that can translate between any pair of biological modalities.
**Approach:** Train modality-specific encoders on abundant unpaired data (Phase 1), then train cross-modal alignment using available paired data and pseudo-pairing strategies (Phase 2). CLIP-style contrastive loss + hierarchical cross-attention.
**Why it matters:** This is the core multi-layer model. Once it exists, every downstream application becomes possible.
**Data:** All of the datasets listed in Section 6.3
**Compute:** Multi-node GPU cluster, months

#### Project 5: Multi-Omic Intervention Simulator
**Goal:** Given a baseline multi-omic profile and a proposed intervention, predict the multi-omic response.
**Approach:** Fine-tune the modality-bridging model (Project 4) on Perturb-seq data (genetic perturbations) and CMap data (drug perturbations). Extend to predict aging intervention effects using DrugAge and aging trial data.
**Why it matters:** Enables in silico screening of anti-aging interventions across all molecular layers.
**Data:** Perturb-seq (millions of cells), CMap/LINCS (1.3M profiles), aging trial datasets
**Compute:** Multi-GPU, months

#### Project 6: Reprogramming Trajectory Optimizer
**Goal:** Predict and optimize the multi-omic trajectory during partial epigenetic reprogramming.
**Approach:** Train on time-series scMulti-omic data during reprogramming (scRNA + scATAC + methylation at multiple time points). Learn the Waddington landscape in multi-omic space. Use optimal control theory to find the rejuvenation trajectory that avoids dedifferentiation.
**Why it matters:** Directly accelerates the most promising aging intervention.
**Data:** Time-series reprogramming datasets (scRNA-seq + ATAC-seq + methylation time courses)
**Compute:** Multi-GPU, months

### 9.3 Tier 3: Moonshot (2-5 years)

#### Project 7: The Unified Biological Foundation Model
**Goal:** A single model pre-trained across DNA, epigenomics, transcriptomics, proteomics, and metabolomics simultaneously.
**Approach:** Full architecture from Section 5: modality-specific encoders → cross-modal fusion → shared latent space. Three-phase training at scale.
**Why it matters:** This is the endgame: a model that understands biology the way biology works, across all layers simultaneously.
**Compute:** Multi-node cluster, many months of training

#### Project 8: Personalised Longevity Digital Twin
**Goal:** A model that takes an individual's multi-omic profile and simulates their aging trajectory + response to interventions.
**Approach:** Build on the unified model (Project 7). Train on longitudinal multi-omic data from aging cohorts. Deploy as a clinical decision support tool for personalised longevity medicine.
**Why it matters:** The ultimate application of multi-layer biological intelligence to human longevity.

---

## 10. Infrastructure and Roadmap

### 10.1 Compute Requirements

| Project Tier | GPU Requirement | Estimated Cost | Timeline |
|-------------|----------------|----------------|----------|
| Tier 1 (achievable now) | 1-4 A100s | $1K-10K | 6-12 months |
| Tier 2 (significant) | 8-32 A100s/H100s | $50K-500K | 12-24 months |
| Tier 3 (moonshot) | 64+ H100s or equivalent | $1M+ | 2-5 years |

### 10.2 Software Stack

```
Core ML:
  PyTorch 2.x (primary deep learning framework)
  JAX (alternative for some architectures, especially Mamba/SSM)
  FlashAttention 2 (efficient attention)
  DeepSpeed / FSDP (distributed training)
  Triton (custom GPU kernels)

Biological:
  Scanpy / AnnData (single-cell data)
  Biopython / pysam (sequence handling)
  ESM (protein embeddings)
  Enformer / AlphaGenome (DNA regulatory predictions)

Data:
  Hugging Face Hub / Datasets (model hosting + data loading)
  AnnData / MuData (multi-modal single-cell data format)
  Zarr (large array storage)
  DVC (data version control)

Evaluation:
  scib (single-cell integration benchmarking)
  Custom aging clock evaluation suite
  Wet-lab collaboration pipeline
```

### 10.3 Development Roadmap

```
Year 1 (Foundation):
├── Q1-Q2: Implement and validate Tier 1 projects (3 parallel)
│   ├── Cross-modal aging clock
│   ├── DNA→Epi→Transcriptome cascade
│   └── CR confound multi-omic analysis
├── Q3: Evaluate results, identify which cross-modal strategies work best
├── Q4: Begin Tier 2 Project 4 (modality-bridging FM): architecture design and Phase 1 pre-training
└── Ongoing: Data curation, partnership development with aging research labs

Year 2 (Integration):
├── Q1-Q2: Phase 2 cross-modal alignment training for modality-bridging FM
├── Q3: Begin Projects 5 (intervention simulator) and 6 (reprogramming optimizer)
├── Q4: Evaluation and publication of results
└── Ongoing: Incorporate new paired multi-omic data as it becomes available

Year 3-5 (Unification):
├── Tier 3 projects: Unified FM and personalised digital twin
├── Clinical validation partnerships
├── Deployment for personalised longevity assessment
└── Open-source release for the aging research community
```

---

## Appendix: Relationship to Existing Documents

This document extends COMPUTATIONAL_BIOLOGY.md by focusing specifically on the multi-layer foundation model concept that was briefly mentioned as "Project 9: Foundation Model for Biological Aging" in that document's moonshot section.

**Key connections to the broader longevity framework:**

- **PLAN.md Sections 15.3-15.4 (Seed oils, metabolism):** A multi-layer model could trace how dietary PUFA → membrane composition changes → mitochondrial ETC dysfunction → transcriptomic shifts → metabolic phenotype. The Randle cycle and FADH2/NADH ratio modelling in COMPUTATIONAL_BIOLOGY.md Section 4.2.5 could be validated and extended using multi-omic data.

- **PLAN.md Section 15.6 (CR confound):** The CR confound analysis (Project 3) directly addresses whether caloric restriction trials are confounded by PUFA reduction, using multi-omic data to disentangle the mechanisms.

- **METABOLISM_AND_AGING.md Sections 6-8 (Thyroid, cortisol, serotonin, estrogen):** The hormonal aging cascade (pregnenolone decline → cortisol rise → serotonin elevation → estrogen dominance) involves gene expression changes (transcriptomics), protein level changes (proteomics), and epigenetic regulation. A multi-layer model could quantify these cascades and predict intervention effects.

- **COMPUTATIONAL_BIOLOGY.md Section 14.7 (Clock metabolic bias):** The concern that aging clocks confuse metabolic suppression with rejuvenation is directly addressable by a multi-omic clock that sees metabolic gene expression and thyroid pathway proteins alongside methylation marks.

---

*This document is part of the longevity research framework. See also: PLAN.md (biological framework), METABOLISM_AND_AGING.md (bioenergetic theory), COMPUTATIONAL_BIOLOGY.md (computational approaches), FAT_LOSS_GUIDE.md (applied metabolic restoration).*