Detecting loci under natural selection from temporal genomic data of selfing populations
Power and limits of selection genome scans on temporal data from a selfing population
The observed levels of genomic diversity in contemporary populations are the result of changes imposed by several evolutionary processes. Among them, natural selection is known to dramatically shape the genetic diversity of loci associated with phenotypes that affect the fitness of carriers. As such, many efforts have been dedicated to developing methods to detect signatures of natural selection from genomes of contemporary samples.
Recent technological advances have made the generation of large-scale genomic data from temporal samples, either from experimental populations or from historical or ancient samples, accessible to a wide scientific community. Notably, temporal population genomic data allow for direct observation and study of how, for instance, allele frequencies change through time in response to evolutionary stimuli. Such information can be exploited to detect loci under natural selection, either via mathematical modelling or by investigating empirical distributions.
However, most current methods to detect selection from temporal genomic data have largely ignored selfing populations, despite the latter comprising a significant proportion of species with social and economic importance. Selfing changes genomic patterns by reducing the effective recombination rate, which makes distinguishing between neutral evolution and natural selection even more challenging than in outcrossing populations. Nevertheless, an outlier approach based on temporal genomic data for the selfing Arabidopsis thaliana population revealed loci under selection.
This study suggested the promise of detecting selection for selfing populations and encouraged further investigations to test the power of selection scans under different mating systems.
To address this question, Navascués et al. extended a previously proposed approach for temporal genome scans to incorporate partial self-fertilization. In the original implementation, it is assumed that, under neutrality, all loci provide levels of genetic differentiation drawn from the same distribution. If some of the loci are under selection, this distribution should show heterogeneity. Navascués et al. proposed a test for the homogeneity between locus-specific and genome-wide differentiation by deriving a null distribution of FST via simulations using SLiM. After filtering out low-frequency variants and correcting for multiple tests, the authors derived a statistical test for selection and assessed its power under a wide range of scenarios of selfing rate, selection coefficient, duration and type of selection.
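The logic of such a scan can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Nei-style two-sample FST, the pure-drift binomial simulator (used here in place of SLiM), the Benjamini-Hochberg correction, and all function names are assumptions chosen for illustration. The selfing rate enters only through the effective size Ne = (2 - sigma)N/2 mentioned in the decision letter.

```python
import random


def temporal_fst(p0, p1):
    """Nei-style FST between two temporal samples at one biallelic locus.

    Assumption: a simple (H_T - H_S) / H_T ratio, not necessarily the
    estimator used in the preprint.
    """
    pbar = (p0 + p1) / 2.0
    denom = 4.0 * pbar * (1.0 - pbar)
    if denom == 0.0:  # locus fixed in both samples: no differentiation
        return 0.0
    return (p0 - p1) ** 2 / denom


def passes_maf(p0, p1, threshold=0.05):
    """Keep a locus only if its global minor allele frequency >= threshold."""
    pbar = (p0 + p1) / 2.0
    return min(pbar, 1.0 - pbar) >= threshold


def simulate_drift(p0, n_diploid, sigma, generations, rng):
    """Hypothetical neutral simulator: binomial drift with Ne = (2 - sigma) * N / 2."""
    n_genes = max(1, int(round((2.0 - sigma) * n_diploid / 2.0))) * 2
    p = p0
    for _ in range(generations):
        p = sum(rng.random() < p for _ in range(n_genes)) / n_genes
    return p


def empirical_p_value(observed, null_values):
    """Fraction of null FST values at least as large as the observed one."""
    exceed = sum(1 for v in null_values if v >= observed)
    return (1 + exceed) / (1 + len(null_values))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    rng = random.Random(1)
    # Toy data: (frequency at first sample, frequency at second sample).
    loci = [(0.50, 0.55), (0.30, 0.80), (0.02, 0.03), (0.40, 0.35)]
    kept = [(a, b) for a, b in loci if passes_maf(a, b)]
    pvals = []
    for a, b in kept:
        null = [temporal_fst(a, simulate_drift(a, 50, 0.95, 10, rng))
                for _ in range(200)]
        pvals.append(empirical_p_value(temporal_fst(a, b), null))
    print(list(zip(kept, benjamini_hochberg(pvals))))
```

In this sketch the null distribution is conditioned on the starting frequency of each locus and ignores linkage, sampling noise and selection at linked sites, all of which a forward simulator such as SLiM can capture.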
The newly proposed test achieved good performance to distinguish between neutral and selected loci in most tested scenarios.
As expected, the test's performance drops significantly for scenarios of high selfing rates and selection from standing variation. Additionally, the probability of correctly detecting selection decreases with increasing distance from the causal variant. Intriguingly, the test showed high power when the selected ancestral allele had an initial low frequency, and when the selected derived allele had a high initial frequency. When applied to a data set of around 1,000 SNPs from a population of the highly selfing Medicago truncatula, an annual plant of the legume family, the test did not provide any candidate loci under selection.
In summary, the detection of loci under selection in selfing populations remains a challenging task even when explicitly accounting for the different mating system. However, recombination events that occurred before the selective pressure allow ancestral beneficial alleles to exhibit a detectable pattern of non-neutrality. As such, in partially selfing populations, the strength of the footprint of selection depends on several factors, mostly on the selfing rate, the time of onset and the type of selection.
One major assumption of this study is that the model assumes an unstructured population and continuity between samples obtained from the same geographical location over time. As such assumptions are typically violated in real populations, further research into the effect of more complex demographic scenarios is needed to fully understand the power to detect selection in selfing populations. Furthermore, more power could be gained by including additional genomic information at each time point. In this context, recent approaches based on deep learning that make full use of genomic data may contribute significantly towards this goal. Similarly, the effect of data filtering on the power to detect selection should be further explored, especially in the context of DNA resequencing experiments. These analyses will help elucidate the power offered by selection scans from temporal genomic data in selfing populations.
 Stern AJ, Nielsen R (2019) Detecting Natural Selection. In: Handbook of Statistical Genomics, pp. 397–40. John Wiley and Sons, Ltd. https://doi.org/10.1002/9781119487845.ch14
 Leonardi M, Librado P, Der Sarkissian C, Schubert M, Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Gamba C, Willerslev E, Orlando L (2017) Evolutionary Patterns and Processes: Lessons from Ancient DNA. Systematic Biology, 66, e1–e29. https://doi.org/10.1093/sysbio/syw059
 Dehasque M, Ávila‐Arcos MC, Díez‐del‐Molino D, Fumagalli M, Guschanski K, Lorenzen ED, Malaspinas A-S, Marques‐Bonet T, Martin MD, Murray GGR, Papadopulos AST, Therkildsen NO, Wegmann D, Dalén L, Foote AD (2020) Inference of natural selection from ancient DNA. Evolution Letters, 4, 94–108. https://doi.org/10.1002/evl3.165
 Vitalis R, Couvet D (2001) Two-locus identity probabilities and identity disequilibrium in a partially selfing subdivided population. Genetics Research, 77, 67–81. https://doi.org/10.1017/S0016672300004833
 Frachon L, Libourel C, Villoutreix R, Carrère S, Glorieux C, Huard-Chauveau C, Navascués M, Gay L, Vitalis R, Baron E, Amsellem L, Bouchez O, Vidal M, Le Corre V, Roby D, Bergelson J, Roux F (2017) Intermediate degrees of synergistic pleiotropy drive adaptive evolution in ecological time. Nature Ecology and Evolution, 1, 1551–1561. https://doi.org/10.1038/s41559-017-0297-1
 Navascués M, Becheler A, Gay L, Ronfort J, Loridon K, Vitalis R (2020) Power and limits of selection genome scans on temporal data from a selfing population. bioRxiv, 2020.05.06.080895, ver. 4 peer-reviewed and recommended by PCI Evol Biol. https://doi.org/10.1101/2020.05.06.080895
 Goldringer I, Bataillon T (2004) On the Distribution of Temporal Variations in Allele Frequency: Consequences for the Estimation of Effective Population Size and the Detection of Loci Undergoing Selection. Genetics, 168, 563–568. https://doi.org/10.1534/genetics.103.025908
 Messer PW (2013) SLiM: Simulating Evolution with Selection and Linkage. Genetics, 194, 1037–1039. https://doi.org/10.1534/genetics.113.152181
 Siol M, Prosperi JM, Bonnin I, Ronfort J (2008) How multilocus genotypic pattern helps to understand the history of selfing populations: a case study in Medicago truncatula. Heredity, 100, 517–525. https://doi.org/10.1038/hdy.2008.5
 Sanchez T, Cury J, Charpiat G, Jay F Deep learning for population size history inference: Design, comparison and combination with approximate Bayesian computation. Molecular Ecology Resources, n/a. https://doi.org/10.1111/1755-0998.13224
Matteo Fumagalli (2020) Detecting loci under natural selection from temporal genomic data of selfing populations. Peer Community in Evolutionary Biology, 100110. 10.24072/pci.evolbiol.100110
Evaluation round #1
08 Jul 2020
DOI or URL of the preprint: 10.1101/2020.05.06.080895
Version of the preprint: 2
Decision by Matteo Fumagalli
Many thanks for your submission, and please accept my apologies for the delay in processing your study, which has now been reviewed by three experts in the field.
All reviewers and I agree that this manuscript is well written and presents a solid piece of work. The study’s scope and implications are of wide interest as detecting adaptation in selfing species is an important but neglected topic in evolutionary genomics. The method presented herein is an extension of previous work on detecting selection from temporal data (sampled allele frequencies) for inbred/selfing species. The finding that selection from standing variation leaves a more evident genetic pattern than selection from de novo mutation in selfing species is a potentially novel aspect which could be further tested empirically.
Before recommending this study, I encourage Authors to address the main points raised by reviewers which I summarise below.
The main point raised by all reviewers concerns the assumption of no population structure. While Authors acknowledge and discuss this issue, I believe the study will be greatly improved if Authors provide more intuition on how much population structure / discontinuity / metapopulations would affect their results (e.g. estimation of parameters and power to detect selection). I am not advocating for additional large-scale simulations but for a more specific discussion of the potential limitations of not including more complex but realistic scenarios (e.g. change of Ne, variation in recombination rate, linked selection on deleterious alleles).
The text on the methodology should be clarified, as pointed out by Reviewers. This is an important aspect to avoid readers having to extensively look at cited papers to understand the methodology. I also found it difficult to understand, for instance, when Authors used unlinked or linked SNPs in different analyses.
Another point raised by one Reviewer is how to evaluate the statistical uncertainty of parameters’ estimates. Likewise, there are some concerns on the use of the arbitrary threshold of 0.05 for minimum global MAF. The text should be either clarified or additional results varying this threshold should be presented.
I have an additional comment. I appreciate the discussion on how to design sequencing experiments in light of these results. I’d like to see future directions and ideas for improving the detection of selection in selfing populations elaborated a bit more carefully. For instance, Authors briefly mentioned ABC and as such I wonder whether Authors have more precise thoughts on which aspect of the methodology could be improved to achieve higher power (e.g. use of more features than single allele frequencies? Different inferential framework such as ABC or ML?). This doesn’t have to be too extensive though.
I have some personal minor comments. Should the estimates of Ne on real data be presented in the results in the first instance? As beneficial alleles are randomly assigned to a position, is there any border effect if such sites are too close to one of the extremities of the simulated region (e.g. for Fig. 3)? I appreciate that all scripts are provided but the documentation is rather thin and as such it is possible but unnecessarily laborious to replicate all analyses reported herein. Were the parameters of simulations chosen to match any organism of interest? I believe a reference is needed to show that these values (mutation and recombination rates, Ne) are what is expected in nature. Please provide a citation when introducing the equation Ne=(2-sigma)N/2.
Finally, please also address all minor issues raised and check your text carefully for typos as I was able to spot a few (e.g. “?.” on page 8, line 9).
Please do not hesitate to contact me if you need further clarification on any of these comments.
Additional requirements of the managing board:
As indicated in the 'How does it work?' section and in the code of conduct, please make sure that:
-Data are available to readers, either in the text or through an open data repository such as Zenodo (free), Dryad or some other institutional repository. Data must be reusable, thus metadata or accompanying text must carefully describe the data.
-Details on quantitative analyses (e.g., data treatment and statistical scripts in R, bioinformatic pipeline scripts, etc.) and details concerning simulations (scripts, codes) are available to readers in the text, as appendices, or through an open data repository, such as Zenodo, Dryad or some other institutional repository. The scripts or codes must be carefully described so that they can be reused.
-Details on experimental procedures are available to readers in the text or as appendices.
-Authors have no financial conflict of interest relating to the article. The article must contain a "Conflict of interest disclosure" paragraph before the reference section containing this sentence: "The authors of this preprint declare that they have no financial conflict of interest with the content of this article." If appropriate, this disclosure may be completed by a sentence indicating that some of the authors are PCI recommenders: “XXX is one of the PCI XXX recommenders.”