Alexander G Lucaci, Jordan D Zehr, David Enard, Joseph W. Thornton, Sergei L. Kosakovsky Pond
A preprint is available: Evolutionary shortcuts via multi-nucleotide substitutions and their impact on natural selection analyses
This webpage contains the empirical and synthetic data from our paper, along with scripts to run the analyses and links to visualizations that help interpret the results.
Information about our previous methods development for MultiHit and BUSTED[S] is available
Description
Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models for such analyses. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Here, we performed a detailed characterization of how modeling instantaneous multi-nucleotide (or multi-hit, MH) substitutions impacts dN/dS-based inference of episodic diversifying selection at the level of the entire alignment. The inclusion of MH reduces the rate (1.37-fold or 26.8%) at which positive selection is called based on the analysis of N = 9,861 empirical datasets, while offering significantly better statistical fit to sequence data in 8.37% of cases. Through additional simulation studies, we show that this reduction is not simply due to loss of power because of additional model complexity. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we reveal that MH substitutions occurring along shorter branches in the tree are largely responsible for discrepant results in selection detection. Our results add to the growing body of literature which examines decades-old modeling assumptions and finds them to be problematic for biological data analysis. Because multi-nucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that routine selection analysis of this type consider their inclusion.
To facilitate this procedure, we developed a simple model-testing framework for selection detection that can screen an alignment for positive selection while accounting for two biologically important confounding processes: synonymous rate variation and multi-nucleotide instantaneous substitutions.
The 21 empirical alignments analyzed in Benchmark datasets are available in NEXUS format from data/21-empirical.zip
The 9,861 empirical alignments from 24 mammalian species, drawn from the Enard and Petrov dataset, are available in FASTA format as a compressed file from data/Enard.zip.
See below
Results for the 21 empirical alignments analyzed in Benchmark datasets are available in JSON format from results/21-empirical-results.zip
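As a starting point for exploring the result files, the following minimal Python sketch loads every JSON file found in a zip archive. It assumes that the archive members end in `.json`; the layout of the archive and the structure of each JSON document are not described here, so the sketch only prints top-level keys rather than relying on any specific field. The function name `load_json_results` is hypothetical and not part of this repository.

```python
import json
import zipfile
from pathlib import Path


def load_json_results(archive_path):
    """Return {member name: parsed JSON} for every .json member of a zip archive."""
    results = {}
    with zipfile.ZipFile(archive_path) as archive:
        for name in archive.namelist():
            # Skip directories and anything that is not a JSON file
            # (assumption: result files carry a .json extension).
            if name.endswith("/") or not name.lower().endswith(".json"):
                continue
            # Skip macOS resource-fork entries such as "._file.json".
            if Path(name).name.startswith("._"):
                continue
            with archive.open(name) as handle:
                results[name] = json.load(handle)
    return results


if __name__ == "__main__":
    # Path taken from the text above; adjust to wherever the archive was downloaded.
    archive = Path("results/21-empirical-results.zip")
    if archive.exists():
        for name, result in load_json_results(archive).items():
            # Only report top-level keys; no particular schema is assumed.
            keys = sorted(result) if isinstance(result, dict) else type(result).__name__
            print(name, keys)
```

Reading directly from the archive avoids unpacking it to disk, which keeps the downloaded file as the single source of truth while browsing results.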
We combined Synthetic data and results files for convenience.
4-taxon simulations are available as a compressed file from synthetic/4-taxon-sims.zip
We rely on the simulation code from the SimulateMG94 repository, which allows for MG94-based simulations in the BS-REL family of models. Additional details can be found in the README file on GitHub.