Evolutionary shortcuts via multi-nucleotide substitutions and their impact on natural selection analyses.

Alexander G Lucaci, Jordan D Zehr, David Enard, Joseph W. Thornton, Sergei L. Kosakovsky Pond

A preprint is available: Evolutionary shortcuts via multi-nucleotide substitutions and their impact on natural selection analyses

This webpage contains empirical and synthetic data from our paper, along with scripts to run analysis, and links to useful visualizations for interpretation of results.

Information about our previous method development for MultiHit and BUSTED[S] is available

Description

Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into the statistical models used for such analyses. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Here, we performed a detailed characterization of how modeling instantaneous multi-nucleotide (or multi-hit, MH) substitutions impacts dN/dS-based inference of episodic diversifying selection at the level of the entire alignment. The inclusion of MH reduces the rate (1.37-fold or 26.8%) at which positive selection is called based on the analysis of N = 9,861 empirical data sets, while offering significantly better statistical fit to sequence data in 8.37% of cases. Through additional simulation studies, we show that this reduction is not simply due to loss of power because of additional model complexity. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we reveal that MH substitutions occurring along shorter branches in the tree are largely responsible for discrepant results in selection detection. Our results add to the growing body of literature that examines decades-old modeling assumptions and finds them to be problematic for biological data analysis. Because multi-nucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that routine selection analyses of this type consider their inclusion.
To facilitate this procedure, we developed a simple model-testing framework for selection detection that screens an alignment for positive selection while accounting for two biologically important confounding processes: synonymous rate variation and multi-nucleotide instantaneous substitutions.


Empirical Data

Synthetic Data

See below


Empirical Results

Synthetic Results

We combined the synthetic data and results files for convenience.


BUSTED[S]-MH HyPhy Batch Language (HBL) Implementation

An implementation of our method is available as part of the HyPhy (hyphy.org) software suite. The HyPhy Batch Language (HBL) code is available here: BUSTED Method
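As a rough illustration of how the HyPhy implementation might be driven from a script, the Python sketch below only assembles a command line; it does not run HyPhy. The flag names (`--alignment`, `--tree`, `--srv`, `--multiple-hits`, `--output`) and the `Double+Triple` option reflect our understanding of the HyPhy command-line interface and should be treated as assumptions to verify against `hyphy busted --help` for your installed version. The helper name `busted_command` and the file names are hypothetical.

```python
from pathlib import Path

# Values assumed to be accepted by the --multiple-hits option (verify with --help).
ALLOWED_MULTIPLE_HITS = {"None", "Double", "Double+Triple"}


def busted_command(alignment, tree, output, srv=True,
                   multiple_hits="Double+Triple", hyphy="hyphy"):
    """Return an argv list for a BUSTED run (hypothetical helper).

    srv=True requests synonymous rate variation (the [S] component);
    multiple_hits selects which multi-nucleotide substitutions are modeled.
    """
    if multiple_hits not in ALLOWED_MULTIPLE_HITS:
        raise ValueError(f"unsupported multiple_hits value: {multiple_hits!r}")
    return [
        hyphy, "busted",
        "--alignment", str(Path(alignment)),
        "--tree", str(Path(tree)),
        "--srv", "Yes" if srv else "No",
        "--multiple-hits", multiple_hits,
        "--output", str(Path(output)),
    ]


# Hypothetical file names, used only to show the shape of the command.
cmd = busted_command("gene.fasta", "gene.nwk", "gene.BUSTEDS-MH.json")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once the flags have been confirmed. Setting `srv=False` or `multiple_hits="None"` would, under the same assumptions, request the simpler model variants.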

How to generate synthetic data

We rely on the simulation code from the SimulateMG94 repository, which allows for MG94-based simulations in the BS-REL family of models. Additional details can be found in the README file on GitHub.

BUSTED ModelTesting

We provide a link to a dedicated repository that offers an easy way to profile a dataset (multiple sequence alignment and phylogenetic tree) for evidence of episodic diversifying selection using our methods. You can find our BUSTED ModelTesting Snakemake procedure here: BUSTED ModelTesting

Running empirical and synthetic data analysis

Our methods can be utilized in one of three ways

Visualizations / Interactive Notebooks

HyPhy JSON output files can be visualized with
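For a quick look at a result without a viewer, the JSON can also be read directly. The following is a minimal Python sketch, assuming the output file stores the alignment-wide test under a top-level `"test results"` object with `"LRT"` and `"p-value"` entries; those key names are our assumption about the BUSTED JSON layout and should be checked against an actual output file. The helper name `busted_summary` is hypothetical.

```python
import json


def busted_summary(path, alpha=0.05):
    """Summarize the alignment-wide selection test from a HyPhy JSON file.

    Assumes a top-level "test results" object with "LRT" and "p-value"
    keys; missing keys are reported as None rather than raising.
    """
    with open(path) as handle:
        results = json.load(handle)
    test = results.get("test results", {})
    p_value = test.get("p-value")
    return {
        "LRT": test.get("LRT"),
        "p_value": p_value,
        # Selection is called only when a p-value is present and <= alpha.
        "selection_detected": p_value is not None and p_value <= alpha,
    }
```

Calling `busted_summary("gene.BUSTEDS-MH.json")` (a hypothetical file name) returns a small dictionary that can be collected across many alignments, for example to compare how often selection is called with and without MH support.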