Browsing by Author "Scheffler, Konrad"
- Benchmarking multi-rate codon models (Public Library of Science, 2010-07-21) Delport, Wayne; Scheffler, Konrad; Gravenor, Mike B.; Muse, Spencer V.; Pond, Sergei Kosakovsky
The single rate codon model of non-synonymous substitution is ubiquitous in phylogenetic modeling. Indeed, the use of a non-synonymous to synonymous substitution rate ratio parameter has facilitated the interpretation of selection pressure on genomes. Although the single rate model has achieved wide acceptance, we argue that the assumption of a single rate of non-synonymous substitution is biologically unreasonable, given observed differences in substitution rates evident from empirical amino acid models. Some have attempted to incorporate amino acid substitution biases into models of codon evolution and have shown improved model performance versus the single rate model. Here, we show that the single rate model of non-synonymous substitution is easily outperformed by a model with multiple non-synonymous rate classes, yet in which amino acid substitution pairs are assigned randomly to these classes. We argue that, since the single rate model is so easy to improve upon, new codon models should not be validated entirely on the basis of improved model fit over this model. Rather, we should strive to both improve on the single rate model and to approximate the general time-reversible model of codon substitution, with as few parameters as possible, so as to reduce model over-fitting. We hint at how this can be achieved with a Genetic Algorithm approach in which rate classes are assigned on the basis of sequence information content. © 2010 Delport et al.
- Codon test: modeling amino acid substitution preferences in coding sequences (PLOS Computational Biology, 2010-08) Delport, Wayne; Scheffler, Konrad; Botha, Gordon; Gravenor, Mike B.; Muse, Spencer V.; Pond, Sergei L. Kosakovsky
Codon models of evolution have facilitated the interpretation of selective forces operating on genomes. These models, however, assume a single rate of non-synonymous substitution irrespective of the nature of amino acids being exchanged. Recent developments have shown that models which allow for amino acid pairs to have independent rates of substitution offer improved fit over single rate models. However, these approaches have been limited by the necessity for large alignments in their estimation. An alternative approach is to assume that substitution rates between amino acid pairs can be subdivided into K rate classes, dependent on the information content of the alignment. However, given the combinatorially large number of such models, an efficient model search strategy is needed. Here we develop a Genetic Algorithm (GA) method for the estimation of such models. A GA is used to assign amino acid substitution pairs to a series of K rate classes, where K is estimated from the alignment. Other parameters of the phylogenetic Markov model, including substitution rates, character frequencies and branch lengths are estimated using standard maximum likelihood optimization procedures. We apply the GA to empirical alignments and show improved model fit over existing models of codon evolution. Our results suggest that current models are poor approximations of protein evolution and thus gene and organism specific multi-rate models that incorporate amino acid substitution biases are preferred.
We further anticipate that the clustering of amino acid substitution rates into classes will be biologically informative, such that genes with similar functions exhibit similar clustering, and hence this clustering will be useful for the evolutionary fingerprinting of genes.
- Correcting the bias of empirical frequency parameter estimators in codon models (Public Library of Science -- PLOS, 2010-07) Kosakovsky Pond, Sergei; Delport, Wayne; Muse, Spencer V.; Scheffler, Konrad
Markov models of codon substitution are powerful inferential tools for studying biological processes such as natural selection and preferences in amino acid substitution. The equilibrium character distributions of these models are almost always estimated using nucleotide frequencies observed in a sequence alignment, primarily as a matter of historical convention. In this note, we demonstrate that a popular class of such estimators are biased, and that this bias has an adverse effect on goodness of fit and estimates of substitution rates. We propose a “corrected” empirical estimator that begins with observed nucleotide counts, but accounts for the nucleotide composition of stop codons. We show via simulation that the corrected estimates outperform the de facto standard F3×4 estimates not just by providing better estimates of the frequencies themselves, but also by leading to improved estimation of other parameters in the evolutionary models. On a curated collection of 856 sequence alignments, our estimators show a significant improvement in goodness of fit compared to the F3×4 approach. Maximum likelihood estimation of the frequency parameters appears to be warranted in many cases, albeit at a greater computational cost. Our results demonstrate that there is little justification, either statistical or computational, for continued use of the F3×4-style estimators.
- Detecting individual sites subject to episodic diversifying selection (Public Library of Science, 2012-07-02) Murrell, Ben; Wertheim, Joel O.; Moola, Sasha; Weighill, Thomas; Scheffler, Konrad; Pond, Sergei L. Kosakovsky
The imprint of natural selection on protein coding genes is often difficult to identify because selection is frequently transient or episodic, i.e. it affects only a subset of lineages. Existing computational techniques, which are designed to identify sites subject to pervasive selection, may fail to recognize sites where selection is episodic: a large proportion of positively selected sites. We present a mixed effects model of evolution (MEME) that is capable of identifying instances of both episodic and pervasive positive selection at the level of an individual site. Using empirical and simulated data, we demonstrate the superior performance of MEME over older models under a broad range of scenarios. We find that episodic selection is widespread and conclude that the number of sites experiencing positive selection may have been vastly underestimated.
- Frequent toggling between alternative amino acids is driven by selection in HIV-1 (Public Library of Science, 2008) Delport, Wayne; Scheffler, Konrad; Seoighe, Cathal
Host immune responses against infectious pathogens exert strong selective pressures favouring the emergence of escape mutations that prevent immune recognition. Escape mutations within or flanking functionally conserved epitopes can occur at a significant cost to the pathogen in terms of its ability to replicate effectively. Such mutations come under selective pressure to revert to the wild type in hosts that do not mount an immune response against the epitope. Amino acid positions exhibiting this pattern of escape and reversion are of interest because they tend to coincide with immune responses that control pathogen replication effectively. We have used a probabilistic model of protein coding sequence evolution to detect sites in HIV-1 exhibiting a pattern of rapid escape and reversion. Our model is designed to detect sites that toggle between a wild type amino acid, which is susceptible to a specific immune response, and amino acids with lower replicative fitness that evade immune recognition. Through simulation, we show that this model has significantly greater power to detect selection involving immune escape and reversion than standard models of diversifying selection, which are sensitive to an overall increased rate of non-synonymous substitution. Applied to alignments of HIV-1 protein coding sequences, the model of immune escape and reversion detects a significantly greater number of adaptively evolving sites in env and nef. In all genes tested, the model provides a significantly better description of adaptively evolving sites than standard models of diversifying selection. Several of the sites detected are corroborated by association between Human Leukocyte Antigen (HLA) and viral sequence polymorphisms.
Overall, there is evidence for a large number of sites in HIV-1 evolving under strong selective pressure, but exhibiting low sequence diversity. A phylogenetic model designed to detect rapid toggling between wild type and escape amino acids identifies a larger number of adaptively evolving sites in HIV-1, and can in some cases correctly identify the amino acid that is susceptible to the immune response.
- Modeling HIV-1 drug resistance as episodic directional selection (PLOS Computational Biology, 2011-05) Murrell, Ben; De Oliveira, Tulio; Seebregts, Chris; Pond, Sergei L. Kosakovsky; Scheffler, Konrad
The evolution of substitutions conferring drug resistance to HIV-1 is both episodic, occurring when patients are on antiretroviral therapy, and strongly directional, with site-specific resistant residues increasing in frequency over time. While methods exist to detect episodic diversifying selection and continuous directional selection, no evolutionary model combining these two properties has been proposed. We present two models of episodic directional selection (MEDS and EDEPS) which allow the a priori specification of lineages expected to have undergone directional selection. The models infer the sites and target residues that were likely subject to directional selection, using either codon or protein sequences. Compared to its null model of episodic diversifying selection, MEDS provides a superior fit to most sites known to be involved in drug resistance, and neither one test for episodic diversifying selection nor another for constant directional selection are able to detect as many true positives as MEDS and EDEPS while maintaining acceptable levels of false positives. This suggests that episodic directional selection is a better description of the process driving the evolution of drug resistance.
- Non-Negative Matrix Factorization for Learning Alignment-Specific Models of Protein Evolution (PLOS, 2011-12-22) Murrell, Ben; Weighill, Thomas; Buys, Jan; Ketteringham, Robert; Moola, Sasha; Benade, Gerdus; du Buisson, Lise; Kaliski, Daniel; Hands, Tristan; Scheffler, Konrad
Models of protein evolution currently come in two flavors: generalist and specialist. Generalist models (e.g. PAM, JTT, WAG) adopt a one-size-fits-all approach, where a single model is estimated from a number of different protein alignments. Specialist models (e.g. mtREV, rtREV, HIVbetween) can be estimated when a large quantity of data are available for a single organism or gene, and are intended for use on that organism or gene only. Unsurprisingly, specialist models outperform generalist models, but in most instances there simply are not enough data available to estimate them. We propose a method for estimating alignment-specific models of protein evolution in which the complexity of the model is adapted to suit the richness of the data. Our method uses non-negative matrix factorization (NNMF) to learn a set of basis matrices from a general dataset containing a large number of alignments of different proteins, thus capturing the dimensions of important variation. It then learns a set of weights that are specific to the organism or gene of interest and for which only a smaller dataset is available. Thus the alignment-specific model is obtained as a weighted sum of the basis matrices. Having been constrained to vary along only as many dimensions as the data justify, the model has far fewer parameters than would be required to estimate a specialist model. We show that our NNMF procedure produces models that outperform existing methods on all but one of 50 test alignments. The basis matrices we obtain confirm the expectation that amino acid properties tend to be conserved, and allow us to quantify, on specific alignments, how the strength of conservation varies across different properties.
We also apply our new models to phylogeny inference and show that the resulting phylogenies are different from, and have improved likelihood over, those inferred under standard models.
- On the validity of evolutionary models with site-specific parameters (PLoS, 2014-04-10) Scheffler, Konrad; Murrell, Ben; Pond, Sergei L. Kosakovsky
Evolutionary models that make use of site-specific parameters have recently been criticized on the grounds that parameter estimates obtained under such models can be unreliable and lack theoretical guarantees of convergence. We present a simulation study providing empirical evidence that a simple version of the models in question does exhibit sensible convergence behavior and that additional taxa, despite not being independent of each other, lead to improved parameter estimates. Although it would be desirable to have theoretical guarantees of this, we argue that such guarantees would not be sufficient to justify the use of these models in practice. Instead, we emphasize the importance of taking the variance of parameter estimates into account rather than blindly trusting point estimates – this is standardly done by using the models to construct statistical hypothesis tests, which are then validated empirically via simulation studies.
- Social and genetic networks of HIV-1 transmission in New York City (PLoS, 2017-01-09) Wertheim, Joel O.; Kosakovsky Pond, Sergei L.; Forgione, Lisa A.; Mehta, Sanjay R.; Murrell, Ben; Shah, Sharmila; Smith, Davey M.; Scheffler, Konrad; Torian, Lucia V.
Background: Sexually transmitted infections spread across contact networks. Partner elicitation and notification are commonly used public health tools to identify, notify, and offer testing to persons linked in these contact networks. For HIV-1, a rapidly evolving pathogen with low per-contact transmission rates, viral genetic sequences are an additional source of data that can be used to infer or refine transmission networks.
Methods and Findings: The New York City Department of Health and Mental Hygiene interviews individuals newly diagnosed with HIV and elicits names of sexual and injection drug using partners. By law, the Department of Health also receives HIV sequences when these individuals enter healthcare and their physicians order resistance testing. Our study used both HIV sequence and partner naming data from 1342 HIV-infected persons in New York City between 2006 and 2012 to infer and compare sexual/drug-use named partner and genetic transmission networks. Using these networks, we determined a range of genetic distance thresholds suitable for identifying potential transmission partners. In 48% of cases, named partners were infected with genetically closely related viruses, compatible with but not necessarily representing or implying, direct transmission. Partner pairs linked through the genetic similarity of their HIV sequences were also linked by naming in 53% of cases. Persons who reported high-risk heterosexual contact were more likely to name at least one partner with a genetically similar virus than those reporting their risk as injection drug use or men who have sex with men.
Conclusions: We analyzed an unprecedentedly large and detailed partner tracing and HIV sequence dataset and determined an empirically justified range of genetic distance thresholds for identifying potential transmission partners. We conclude that genetic linkage provides more reliable evidence for identifying potential transmission partners than partner naming, highlighting the importance and complementarity of both epidemiological and molecular genetic surveillance for characterizing regional HIV-1 epidemics.