Goal of the Model
The model predicts codon-level evolutionary selection intensity directly from a Multiple Sequence Alignment (MSA) and a phylogenetic tree. The prediction target is the Likelihood Ratio Test (LRT) statistic that the MEME method in HyPhy computes for each site. By learning the structural and evolutionary patterns of substitution, the model acts as a fast surrogate for MEME: it bypasses the computationally heavy maximum-likelihood optimization loops and outputs a selection scan in a fraction of a second.
Phylogenetic Axial Transformer
Standard transformers are designed for 1D sequences of tokens. An alignment (MSA), however, is a 2D grid of tokens where rows represent species (N) and columns represent codon sites (L). To model this grid efficiently, the PhyloAxialTransformer uses an Axial Attention architecture (alternating attention across columns and rows) augmented by a custom, learnable Phylogenetic Bias.
Axial Processing Workflow
1. MSA Window (N species × L sites)
2. Codon & Amino Acid Joint Embedding + Position Encodings
3. Column Attention (across sites, independently per species)
4. Row Attention (across species, with Phylogenetic Distance Bias)
5. Attention Pooling (CLS Query) & MLP Regression Head
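The alternation of the two attention passes can be sketched in plain Python. This is a minimal illustration under simplifying assumptions, not the PhyloAxialTransformer code: `attend` and `axial_block` are hypothetical names, attention is single-head with no learned projections, and the embedding, pooling, and regression stages are omitted.

```python
import math

def attend(vectors, bias=None):
    """Single-head self-attention over a list of vectors.

    Simplification: queries, keys and values are the vectors themselves
    (no learned projections). bias[i][j], if given, is added to logit (i, j).
    """
    d = len(vectors[0])
    out = []
    for i, q in enumerate(vectors):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        if bias is not None:
            logits = [x + bias[i][j] for j, x in enumerate(logits)]
        top = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(x - top) for x in logits]
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, vectors)) / total
            for c in range(d)
        ])
    return out

def axial_block(grid, species_bias=None):
    """grid[s][l] is the embedding of species s at site l."""
    n_species, n_sites = len(grid), len(grid[0])
    # "Column Attention": across sites, one species at a time.
    grid = [attend(row) for row in grid]
    # "Row Attention": across species, one site at a time, with optional bias.
    per_site = [
        attend([grid[s][l] for s in range(n_species)], species_bias)
        for l in range(n_sites)
    ]
    # Transpose back to [species][site].
    return [[per_site[l][s] for l in range(n_sites)] for s in range(n_species)]

# Toy grid: 2 species x 3 sites x embedding size 2 (values are arbitrary).
grid = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]],
]
out = axial_block(grid)
```

Splitting attention along the two axes needs on the order of N·L² + L·N² pairwise scores, compared with (N·L)² for full attention over the flattened grid.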
A. Joint Codon-AA Embedding
To capture both codon-level substitution rates and amino acid biochemical shifts, the model embeds codon sequences and translated amino acids jointly:
- Codon Tokenizer: 64 codons + 1 gap + 1 unknown = 66 tokens.
- AA Tokenizer: 20 residues + 1 stop + 1 gap + 1 unknown = 23 tokens.
- Embedding Concatenation: Codon (size 64) and Amino Acid (size 64) embeddings are concatenated to form a joint embedding vector of size 128.
- Species Permutation Invariance: Row attention contains no positional encodings along the species dimension, ensuring the model is invariant to species ordering.
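The two vocabularies can be sketched as follows. The token counts come from the list above; everything else is an assumption made for illustration: the standard genetic code (NCBI table 1), the gap and unknown symbols, the id ordering, and the `tokenize` helper itself.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO_ACIDS))

GAP, UNK = "---", "???"  # assumed symbols for gap and unknown codons

# 64 codons + 1 gap + 1 unknown = 66 tokens
CODON_VOCAB = {tok: i for i, tok in enumerate(CODONS + [GAP, UNK])}
# 20 residues + 1 stop + 1 gap + 1 unknown = 23 tokens
AA_SYMBOLS = sorted(set(AMINO_ACIDS) - {"*"}) + ["*", "-", "X"]
AA_VOCAB = {tok: i for i, tok in enumerate(AA_SYMBOLS)}

def tokenize(seq):
    """Return (codon_ids, aa_ids) for an in-frame nucleotide sequence."""
    codon_ids, aa_ids = [], []
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = seq[i:i + 3].upper()
        if codon == GAP:
            aa = "-"
        elif codon in CODON_TO_AA:
            aa = CODON_TO_AA[codon]
        else:  # ambiguous bases or partial gaps
            codon, aa = UNK, "X"
        codon_ids.append(CODON_VOCAB[codon])
        aa_ids.append(AA_VOCAB[aa])
    return codon_ids, aa_ids

codon_ids, aa_ids = tokenize("ATG---NNNTAA")
```

Each site thus carries a pair of ids, one per embedding table, whose two 64-dimensional vectors are concatenated into the 128-dimensional joint embedding.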
B. Learnable Phylogenetic Attention Bias
The model injects phylogenetic relationships directly into self-attention. The raw attention score between species j and k is penalized by the distance along the evolutionary tree:
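Score(j, k) = (Q_j · K_k^T) / √d_head - exp(w) · D_jk
where Q_j and K_k are the query and key vectors for species j and k, d_head is the per-head dimension, D_jk is the patristic distance between the two species (the sum of branch lengths along the path connecting them on the tree), and w is a learnable scale parameter optimized during training. Because exp(w) is always positive, the bias can only penalize: evolutionarily close species retain higher attention weights, whereas distant species are down-weighted.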
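A small worked example may help make the bias concrete. The sketch below computes patristic distances on a toy three-species tree and applies the penalty before the softmax. The tree, its branch lengths, the child-to-parent representation, and the function names are all illustrative assumptions; the raw scores are set to zero so that only the effect of the bias is visible.

```python
import math

def patristic_distances(parent, branch_length, leaves):
    """Pairwise leaf distances on a rooted tree given as a child -> parent map."""
    def dist_to_ancestors(node):
        dists, total = {node: 0.0}, 0.0
        while node in parent:
            total += branch_length[node]
            node = parent[node]
            dists[node] = total
        return dists

    up = {leaf: dist_to_ancestors(leaf) for leaf in leaves}
    # The path through the most recent common ancestor is the shortest one.
    return [
        [min(up[a][n] + up[b][n] for n in up[a] if n in up[b]) for b in leaves]
        for a in leaves
    ]

def biased_attention(scores, D, w):
    """Softmax over rows of (scores - exp(w) * D)."""
    scale = math.exp(w)
    weights = []
    for score_row, dist_row in zip(scores, D):
        logits = [s - scale * d for s, d in zip(score_row, dist_row)]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy tree: ((human:0.1, chimp:0.1)anc:0.3, mouse:0.5)root
parent = {"human": "anc", "chimp": "anc", "anc": "root", "mouse": "root"}
branch_length = {"human": 0.1, "chimp": 0.1, "anc": 0.3, "mouse": 0.5}
leaves = ["human", "chimp", "mouse"]

D = patristic_distances(parent, branch_length, leaves)
scores = [[0.0] * len(leaves) for _ in leaves]  # stand-in for Q.K^T / sqrt(d_head)
weights = biased_attention(scores, D, w=0.0)
```

With equal raw scores, the human row attends most to itself, then to chimp (distance 0.2), and least to mouse (distance 0.9). Increasing w sharpens this preference.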
Regression Training Pipeline
A. Continuous Selection Target Formulation
To stabilize the variance of the prediction target and handle the strong right skew of raw Likelihood Ratio Test (LRT) statistics, negative values are clipped to zero and a log transform is applied:
y_i = ln(max(0, LRT_i) + 1)
- LRT = 0.0 ⇒ y_i = 0.0 (neutral or conserved evolution)
- LRT = 100.0 ⇒ y_i = ln(101) ≈ 4.615 (decisive positive selection)
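The transform is a one-liner in Python. The function names below are hypothetical, and the inverse is included only as an assumed convenience for mapping a prediction back to the LRT scale.

```python
import math

def lrt_to_target(lrt):
    """y = ln(max(0, LRT) + 1); log1p is accurate for small arguments."""
    return math.log1p(max(0.0, lrt))

def target_to_lrt(y):
    """Inverse of the transform for non-negative LRT values (assumed helper)."""
    return math.expm1(y)

examples = {lrt: lrt_to_target(lrt) for lrt in (-2.5, 0.0, 100.0)}
```

Negative LRT values, which can arise from numerical noise in the likelihood optimization, map to the same target as LRT = 0.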
B. Weighted Huber Loss
Because positive selection is rare (≈ 3% of sites), the loss function weights each site's error by the magnitude of its ground-truth label:
Loss = (1/M) × Σ_i (y_i + 1) × Huber(y_i, ŷ_i)
where M is the number of sites in the sum, y_i is the target, and ŷ_i is the prediction. The multiplier (y_i + 1) acts as a per-site weight: it equals 1.0 for a neutral site (y_i = 0) and about 5.6 for a site with LRT = 100, so prediction errors on high-selection sites are penalized up to 6.3x more heavily than on neutral sites.
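The weighted loss can be written out directly. This is a sketch, not the training code: the Huber threshold `delta` is not stated here, so `delta=1.0` is an assumption, and the function names are hypothetical.

```python
def huber(y, y_hat, delta=1.0):
    """Standard Huber loss: quadratic near zero, linear beyond delta."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

def weighted_huber_loss(y_true, y_pred, delta=1.0):
    """Mean of (y + 1) * Huber(y, y_hat) over all sites."""
    terms = [(y + 1.0) * huber(y, p, delta) for y, p in zip(y_true, y_pred)]
    return sum(terms) / len(terms)

# The same absolute error of 0.5 costs five times more at y = 4 than at y = 0.
loss_neutral = weighted_huber_loss([0.0], [0.5])
loss_selected = weighted_huber_loss([4.0], [3.5])
```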
C. Data Splits & Metrics
To prevent data leakage between homologous sequences, the split is made strictly at the gene level, so all sites of a gene fall in the same set:
- Training Set: 4,352 genes (2,596,611 sites)
- Validation Set: 1,088 genes (627,575 sites)
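A gene-level split can be sketched as below. The helper name, the seed, and the shuffling strategy are assumptions; the 0.2 validation fraction simply mirrors the reported gene counts (1,088 of 5,440 genes).

```python
import random

def split_by_gene(site_genes, val_fraction=0.2, seed=0):
    """Assign whole genes to train or validation; return site indices for each."""
    genes = sorted(set(site_genes))
    random.Random(seed).shuffle(genes)  # deterministic shuffle for reproducibility
    n_val = round(len(genes) * val_fraction)
    val_genes = set(genes[:n_val])
    train_idx = [i for i, g in enumerate(site_genes) if g not in val_genes]
    val_idx = [i for i, g in enumerate(site_genes) if g in val_genes]
    return train_idx, val_idx

# One gene label per site (toy data).
site_genes = ["g1"] * 3 + ["g2"] * 2 + ["g3"] * 4 + ["g4"] * 1 + ["g5"] * 2
train_idx, val_idx = split_by_gene(site_genes)
```

Because genes differ in length, the site-level proportions only approximate the gene-level fraction, as the site counts above show.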
The primary metric to maximize is the Spearman rank correlation (ρ) between predicted and true targets, which measures how well the model ranks codon sites by selection intensity. (The training loss remains the weighted Huber loss described in section B.)
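For reference, Spearman's ρ is the Pearson correlation of the ranks, with tied values sharing their average rank. The stdlib-only sketch below uses hypothetical helper names and toy numbers; in practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    # Undefined (division by zero) if either input is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

y_true = [0.0, 0.5, 2.0, 4.6]
y_pred = [0.1, 0.3, 1.0, 9.0]  # wrong magnitudes, correct ordering
rho = spearman(y_true, y_pred)
```

The example illustrates why ρ suits a ranking task: predictions that are badly calibrated in magnitude still score perfectly as long as the ordering of sites is preserved.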
D. Selection Call Classification (Tiers)
To assign codon sites to confidence tiers of positive selection, a dual-threshold rule is applied to the model's predictions, considering variable sites only:
- Tier 1 (High Confidence): The site's local rank percentile among variable sites is ≥ 98%, OR its predicted LRT is ≥ 5.0 in absolute terms.
- Tier 2 (Medium Confidence): The site is not Tier 1, and either its local rank percentile is ≥ 97% OR its predicted LRT is ≥ 3.0.
- Neutral / Conserved: All other sites (including all invariable sites).
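The tier rule can be sketched as a short function. Several details are assumptions made for illustration: the inputs are predictions already expressed on the LRT scale, the "local" percentile is computed over the variable sites passed in, and a site's percentile is defined as the share of variable sites with a predicted value at or below its own.

```python
from bisect import bisect_right

def classify_sites(pred_lrt, is_variable):
    """Return 'tier1', 'tier2' or 'neutral' for each site."""
    variable_values = sorted(v for v, var in zip(pred_lrt, is_variable) if var)
    n = len(variable_values)
    tiers = []
    for value, var in zip(pred_lrt, is_variable):
        if not var:
            tiers.append("neutral")  # invariable sites are never called
            continue
        percentile = 100.0 * bisect_right(variable_values, value) / n
        if percentile >= 98.0 or value >= 5.0:
            tiers.append("tier1")
        elif percentile >= 97.0 or value >= 3.0:
            tiers.append("tier2")
        else:
            tiers.append("neutral")
    return tiers

# 100 variable sites with low predictions, plus one invariable site.
pred = [i / 100 for i in range(100)] + [10.0]
variable = [True] * 100 + [False]
tiers = classify_sites(pred, variable)
```

Note that the percentile criterion always promotes the top-ranked variable sites, even when every predicted LRT is small, while the absolute criterion can promote any number of sites in a strongly selected gene.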