A new statistical tool to identify the determinant of parallel evolution
Identifying drivers of parallel evolution: A regression model approach
In experimental evolution followed by whole-genome resequencing, parallel evolution, defined as the increase in frequency of identical changes in independent populations adapting to the same environment, is often considered the product of similar selection pressures, and the parallel changes are interpreted as adaptive.
However, theory predicts that heterogeneity in both mutation rate and selection intensity across the genome can generate patterns of parallel evolution. It is thus important to evaluate and quantify the contributions of both mutation and selection to parallel evolution, in order to interpret experimental evolution genomic data more accurately and, potentially, to improve our capacity to predict the genes that will respond to selection.
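To make the theoretical point concrete, here is a minimal sketch, not taken from the manuscript, of how heterogeneity alone raises the chance of parallel hits. It assumes that each new mutation lands in gene i with probability proportional to a per-gene weight (a hypothetical stand-in for mutation rate, target size or selection intensity), so that two independent mutations hit the same gene with probability equal to the sum of the squared per-gene probabilities. The function name and the toy weights are illustrative assumptions.

```python
def same_gene_probability(weights):
    """Probability that two independent mutations land in the same gene,
    when gene i is hit with probability weights[i] / sum(weights)."""
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)

# Hypothetical toy genome of five genes.
uniform_weights = [1, 1, 1, 1, 1]   # homogeneous: every gene equally likely
skewed_weights = [6, 1, 1, 1, 1]    # heterogeneous: one gene is a hotspot

p_uniform = same_gene_probability(uniform_weights)  # 5 * (1/5)^2 = 0.2
p_skewed = same_gene_probability(skewed_weights)    # 0.6^2 + 4 * 0.1^2 = 0.4

print(f"homogeneous:   {p_uniform:.2f}")
print(f"heterogeneous: {p_skewed:.2f}")
```

In this toy case the probability of a shared hit doubles without any change in the number of genes, which is why parallelism by itself does not say whether mutation or selection is responsible.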
In their manuscript, Bailey, Guo and Bataillon derive a framework of statistical models to partition the roles of mutation and selection in determining patterns of parallel evolution at the gene level. The rationale is to use the synonymous mutations dataset as a baseline to characterize mutation rate heterogeneity, assuming a negligible impact of selection on synonymous mutations, and then to analyse the non-synonymous dataset to identify additional source(s) of heterogeneity by examining the proportion of the variation explained by a number of genomic variables.
This framework is applied to a published data set from the resequencing of 40 Saccharomyces cerevisiae populations adapting to a laboratory environment. The model that best explains the synonymous mutations dataset is one of homogeneous mutation rate along the genome with a significant positive effect of gene length, likely reflecting variation in the size of the mutational target. For the non-synonymous mutations dataset, introducing heterogeneity between sites in the probability that a change increases in frequency improves the model fit, and this heterogeneity can be partially explained by differences in gene length, recombination rate and number of functional protein domains.
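As an illustration of the kind of count regression described above, the following sketch fits a Poisson log-linear model of per-gene mutation counts on log gene length. It is a simplified stand-in under stated assumptions, not the authors' implementation: the data are invented, the single covariate and the function name fit_poisson_loglinear are hypothetical, and the fit uses a plain Newton-Raphson iteration rather than any particular statistics package.

```python
import math

def fit_poisson_loglinear(x, y, n_iter=50, tol=1e-10):
    """Fit y_i ~ Poisson(exp(b0 + b1 * x_i)) by Newton-Raphson.

    Returns the maximum-likelihood estimates (b0, b1).
    """
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Invented toy data: gene lengths in kb and mutation counts summed over populations.
gene_length_kb = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
mutation_counts = [0, 1, 1, 2, 3, 5]

log_length = [math.log(length) for length in gene_length_kb]
intercept, slope = fit_poisson_loglinear(log_length, mutation_counts)
fitted = [math.exp(intercept + slope * xi) for xi in log_length]

print(f"intercept = {intercept:.3f}, slope on log(length) = {slope:.3f}")
```

A positive slope corresponds to longer genes accumulating more mutations, consistent with a larger mutational target. In the spirit of the framework, one would fit such a model to synonymous counts first and then ask whether non-synonymous counts require extra heterogeneity or additional covariates.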
The application of the framework to an experimental data set illustrates its capacity to disentangle the roles of mutation and selection and to identify genomic variables explaining heterogeneity in the probability of parallel evolution. It also points to potential limits, which are cautiously discussed by the authors. First, the number of mutations in the dataset analysed needs to be sufficient, in particular to establish the baseline on the synonymous dataset. Here, despite high replication (40 populations evolved in the exact same conditions), the total number of synonymous mutations that could be analysed was not very high, and there was only one case of a gene with a synonymous mutation in two independent populations. Second, although the models are able to identify factors affecting the mutation counts, the proportion of the variation explained is quite low. As a consequence, the models correctly predict the distribution of mutation counts, but the objective of predicting which genes will respond to selection still seems quite far away.
The framework developed in this manuscript clearly represents a very useful tool for the analysis of large “evolve and resequence” data sets and for gaining a better understanding of the determinants of parallel evolution in general. Extending its application to mutations other than SNPs would make it possible to obtain a more complete picture of how the contributions of heterogeneity in mutation and in selection intensity differ between mutation types.
Bailey SF, Guo Q and Bataillon T (2018) Identifying drivers of parallel evolution: A regression model approach. bioRxiv 118695, ver. 4 peer-reviewed by Peer Community In Evolutionary Biology. doi: 10.1101/118695
Lang GI, Rice DP, Hickman MJ, Sodergren E, Weinstock GM, Botstein D and Desai MM (2013) Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500: 571–574. doi: 10.1038/nature12344
Stephanie Bedhomme (2018) A new statistical tool to identify the determinant of parallel evolution. Peer Community in Evolutionary Biology, 100045. 10.24072/pci.evolbiol.100045
Evaluation round #2
DOI or URL of the preprint: 10.1101/118695
Version of the preprint: 2
Author's Reply, None
Decision by Stephanie Bedhomme, 23 Jan 2018
The two reviewers and I have now read the revised version of your manuscript. Most of the comments on the previous version have been addressed and the clarity of the manuscript has improved.
There are still some points that should be addressed before I can recommend this manuscript, in particular the two raised by the anonymous reviewer. The first, on MNM, is likely to have very little impact on the results of the analysis, but I agree that the contradiction between your answer to the reviewer's previous comment and what you wrote in the manuscript is puzzling and should be clarified. Giving access to the list of mutations considered would probably also help readers to follow and understand the subset of mutations used for this manuscript.
One additional comment: I find that “mutations per gene” is an ambiguous wording which should be changed. Indeed, it can mean “the number of populations in which a particular gene is mutated”, “the number of mutations within this particular gene in a population” or “the number of different mutations found in a particular gene”.
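The three readings can give different numbers for the same gene, as the following sketch shows. The record layout, gene names and mutation labels are invented for the illustration and do not come from the data set analysed in the manuscript.

```python
# Hypothetical records: (population, gene, mutation identifier).
mutations = [
    ("pop1", "GENE_A", "m1"),
    ("pop1", "GENE_A", "m2"),
    ("pop2", "GENE_A", "m3"),
    ("pop3", "GENE_A", "m1"),
    ("pop4", "GENE_A", "m1"),
    ("pop3", "GENE_B", "m4"),
]

def populations_hit(records, gene):
    """Reading 1: number of populations in which the gene is mutated."""
    return len({pop for pop, g, _ in records if g == gene})

def mutations_in_population(records, gene, population):
    """Reading 2: number of mutations within the gene in one population."""
    return sum(1 for pop, g, _ in records if g == gene and pop == population)

def distinct_mutations(records, gene):
    """Reading 3: number of different mutations found in the gene."""
    return len({mut for _, g, mut in records if g == gene})

n_populations = populations_hit(mutations, "GENE_A")               # 4
n_in_pop1 = mutations_in_population(mutations, "GENE_A", "pop1")   # 2
n_distinct = distinct_mutations(mutations, "GENE_A")               # 3

print(n_populations, n_in_pop1, n_distinct)
```

For the toy gene GENE_A the three definitions give 4, 2 and 3, so the manuscript should state which one is meant.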
Reviewed by anonymous reviewer, 28 Nov 2017
Reviewed by Bastien Boussau, 28 Nov 2017
Evaluation round #1
DOI or URL of the preprint: https://doi.org/10.1101/118695
Version of the preprint: 1
Author's Reply, None
Decision by Stephanie Bedhomme, 17 Aug 2017
The preprint “Identifying drivers of parallel evolution: A regression model approach” has now been read by two reviewers and myself. We all agree that the central topic of the paper and the methods derived are of interest but that the paper in its current form cannot be recommended. Various problems have been pointed out by the reviewers; I summarize and supplement them below:
There seems to be a disconnect between the title and the introduction, which focus on parallel evolution, and the results and discussion, which focus on the factors affecting the probability that a gene carries a mutation by the end of the experimental evolution. The introduction makes the reader expect that the method developed will be able to determine to what extent parallel evolution is due to the probability of the mutation occurring and to what extent it is due to selection. In other words, from the introduction, I expected the method to be able to discriminate between cases where parallel evolution can truly be taken as a strong signal of adaptive mutations and cases where parallel evolution is due to neutral processes. The method developed does not reach this goal, at least not explicitly, and its added value is not clear to the reader.
More details should be given on the experimental design, in particular on the ploidy of the yeast and the reproduction mode they had during experimental evolution (see point 4 of MA).
The authors rely on the hypothesis that synonymous mutations are selectively neutral, which is a classical one, but they write in their discussion (l 403): “Relying on the assumption that synonymous mutations are selectively neutral (which does appear to be the case for these data)”, and I do not see where this neutrality is tested in the study. More importantly, for non-synonymous mutations, they fit a number of models to try to detect the effect of different genomic variables on the heterogeneity in the probability that a mutation rises in frequency. Among these genomic variables, some are likely to affect synonymous as well as non-synonymous mutations, and their link to selection is neither obvious nor straightforward. The comment on GC content by MA goes in this direction, and a similar argument could be developed for CAI and recombination rate. As far as I understand, the effect of these variables on the synonymous mutations has not been tested, so it cannot be claimed that they have an effect on NS mutations that they do not have on S mutations.
The whole manuscript is focused on SNPs, whereas high levels of parallelism have been found for IS and for large duplications and deletions (see for example Tenaillon et al. 2012). I recognize that it is more difficult to derive a modelling framework for them and that the present one cannot be easily adapted, but I think that these mutations have a strong impact on adaptation and I would like to see some comments on them, at least in the discussion.