BUSTED[S]-MH

Evolutionary shortcuts via multi-nucleotide substitutions and their impact on natural selection analyses.

Alexander G Lucaci, Jordan D Zehr, David Enard, Joseph W. Thornton, Sergei L. Kosakovsky Pond

A preprint link is available at Evolutionary shortcuts via multi-nucleotide substitutions and their impact on natural selection analyses

This webpage contains empirical and synthetic data from our paper, along with scripts to run the analyses and links to useful visualizations for interpreting the results.

Information about our previous method development for MultiHit and BUSTED[S] is available

Description

Inference and interpretation of evolutionary processes, in particular of the types and targets of natural selection affecting coding sequences, are critically influenced by the assumptions built into statistical models for such analyses. If certain aspects of the substitution process (even when they are not of direct interest) are presumed absent or are modeled with too crude a simplification, estimates of key model parameters can become biased, often systematically, and lead to poor statistical performance. Here, we performed a detailed characterization of how modeling instantaneous multi-nucleotide (or multi-hit, MH) substitutions impacts dN/dS-based inference of episodic diversifying selection at the level of the entire alignment. The inclusion of MH reduces the rate (1.37-fold or 26.8%) at which positive selection is called based on the analysis of N = 9,861 empirical datasets, while offering significantly better statistical fit to sequence data in 8.37% of cases. Through additional simulation studies, we show that this reduction is not simply due to loss of power because of additional model complexity. After a detailed examination of 21 benchmark alignments and a new high-resolution analysis showing which parts of the alignment provide support for positive selection, we reveal that MH substitutions occurring along shorter branches in the tree are largely responsible for discrepant results in selection detection. Our results add to the growing body of literature that examines decades-old modeling assumptions and finds them to be problematic for biological data analysis. Because multi-nucleotide substitutions have a significant impact on natural selection detection even at the level of an entire gene, we recommend that routine selection analyses of this type consider their inclusion.
To facilitate this procedure, we developed a simple model-testing framework for selection detection that screens an alignment for positive selection while accounting for two biologically important confounding processes: synonymous rate variation and instantaneous multi-nucleotide substitutions.
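The description quotes the drop in the positive-selection call rate both as a fold change (1.37-fold) and as a percentage (26.8%). The short sketch below shows how the two figures relate, assuming that "x-fold" means the call rate without MH divided by the call rate with MH. The function names are illustrative and are not part of HyPhy or of the paper's scripts.

```python
def percent_reduction_to_fold(percent: float) -> float:
    """Fold decrease implied by a percent reduction in a rate."""
    if not 0 <= percent < 100:
        raise ValueError("percent must be in [0, 100)")
    return 1.0 / (1.0 - percent / 100.0)


def fold_to_percent_reduction(fold: float) -> float:
    """Percent reduction implied by an x-fold decrease in a rate."""
    if fold <= 0:
        raise ValueError("fold must be positive")
    return 100.0 * (1.0 - 1.0 / fold)


# A 26.8% reduction corresponds to a 1.37-fold decrease (rounded to 2 decimals).
print(round(percent_reduction_to_fold(26.8), 2))  # 1.37
# Going the other way from the rounded fold value gives about 27.0%,
# so the two quoted figures agree up to rounding.
print(round(fold_to_percent_reduction(1.37), 1))  # 27.0
```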

Data

Empirical Data

21 empirical alignments analyzed in Benchmark datasets are available in NEXUS format from data/21-empirical.zip.
9,861 empirical alignments from 24 mammalian species, taken from the Enard and Petrov dataset, are available in FASTA format (compressed file) from data/Enard.zip.

Synthetic Data

See Synthetic Results below.

Results

Empirical Results

Results for the 21 empirical alignments analyzed in Benchmark datasets are available in JSON format from results/21-empirical-results.zip.
Results for the 9,861 empirical alignments from 24 mammalian species, taken from the Enard and Petrov dataset, are available in JSON format (compressed file) from results/Enard-results.zip.
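As a rough illustration of how the JSON result archives listed above could be summarized, the sketch below reads every JSON file in a zip archive and tallies the fraction of alignments with a significant test. It assumes each file stores the BUSTED likelihood ratio test under a "test results" object with a "p-value" field; check this against an actual result file before relying on it. The function names are hypothetical.

```python
import json
import zipfile


def busted_p_values(zip_path):
    """Yield (member_name, p_value) for each JSON file in a results archive."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".json"):
                continue  # skip directories and non-JSON members
            with archive.open(name) as handle:
                result = json.load(handle)
            # Assumed layout: {"test results": {"p-value": <float>, ...}, ...}
            p_value = result.get("test results", {}).get("p-value")
            if p_value is not None:
                yield name, float(p_value)


def fraction_significant(p_values, alpha=0.05):
    """Fraction of p-values at or below alpha (0.0 for an empty input)."""
    p_values = list(p_values)
    if not p_values:
        return 0.0
    return sum(p <= alpha for p in p_values) / len(p_values)
```

For example, `fraction_significant(p for _, p in busted_p_values("results/Enard-results.zip"))` would report the share of alignments called as positively selected at the 0.05 level, provided the archive follows the assumed layout.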

Synthetic Results

For convenience, synthetic data and results files are combined in the same archives.

4-taxon simulations are available (compressed file) from synthetic/4-taxon-sims.zip.
Empirical simulations are available (compressed file) from synthetic/empirical-sims.zip.
Simulations based on the Enard et al. dataset are available (compressed file) from synthetic/Enard-sims.zip.
4-taxon simulations with variable 2H and 3H rates are available (compressed file) from synthetic/4-taxon-sims-variable.zip.

Scripts

BUSTED[S]-MH HyPhy Batch Language (HBL) Implementation

An implementation of our method is available as part of the HyPhy (hyphy.org) software suite. The HyPhy Batch Language (HBL) code is available here: BUSTED Method

How to generate synthetic data

We rely on the simulation code from the SimulateMG94 repository, which allows MG94-based simulations in the BS-REL family of models. Additional details can be found in the README file on GitHub.

BUSTED ModelTesting

We provide a link to a dedicated repository that offers an easy way to profile a dataset (multiple sequence alignment and phylogenetic tree) for evidence of episodic diversifying selection using our methods. You can find our BUSTED ModelTesting Snakemake procedure here: BUSTED ModelTesting

Running empirical and synthetic data analysis

Our methods can be used in one of three ways:

Through DataMonkey (datamonkey.org), our user-friendly online application, which hosts "A Collection of State of the Art Statistical Models and Bioinformatics Tools": https://www.datamonkey.org/busted
Through the HyPhy command-line, see above for the HBL implementation
Through our BUSTED ModelTesting procedure, see above and our dedicated repository for details.
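For the command-line route, the following is a minimal sketch of how a BUSTED[S]-MH run could be assembled and launched from Python. It assumes a `hyphy` executable on the PATH and that the `busted` analysis accepts the `--alignment`, `--tree`, `--srv`, `--multiple-hits` and `--output` arguments with the values shown; confirm these with `hyphy busted --help` for your installed version. The file names are placeholders.

```python
import shlex
import shutil
import subprocess

# Assumed choices for --multiple-hits; verify with `hyphy busted --help`.
MULTIPLE_HITS_CHOICES = ("None", "Double", "Double+Triple")


def busted_command(alignment, tree, output, srv=True,
                   multiple_hits="Double+Triple"):
    """Build the argument list for one BUSTED run (nothing is executed)."""
    if multiple_hits not in MULTIPLE_HITS_CHOICES:
        raise ValueError(f"multiple_hits must be one of {MULTIPLE_HITS_CHOICES}")
    return [
        "hyphy", "busted",
        "--alignment", alignment,
        "--tree", tree,
        "--srv", "Yes" if srv else "No",   # synonymous rate variation, the [S] part
        "--multiple-hits", multiple_hits,  # multi-nucleotide substitutions, the MH part
        "--output", output,
    ]


def run_busted(*args, **kwargs):
    """Run the command built by busted_command; requires HyPhy to be installed."""
    if shutil.which("hyphy") is None:
        raise RuntimeError("hyphy executable not found on PATH")
    return subprocess.run(busted_command(*args, **kwargs), check=True)


# Print the shell command for a hypothetical alignment and tree.
print(shlex.join(busted_command("gene.nex", "gene.nwk", "gene.BUSTED.json")))
```

Building the argument list separately from running it makes it easy to inspect or log the exact command, and to loop over the four combinations of `srv` and `multiple_hits` when comparing models.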

Visualizations / Interactive Notebooks

HyPhy JSON output files can be visualized with