Review and Assessment of Performance of Genomic Inference Methods based on the Sequentially Markovian Coalescent
Limits and Convergence properties of the Sequentially Markovian Coalescent
The human genome not only encodes for biological functions and for what makes us human, it also encodes the population history of our ancestors. Changes in past population sizes, for example, affect the distribution of times to the most recent common ancestor (tMRCA) of genomic segments, which in turn can be inferred by sophisticated modelling along the genome.
A key framework for such modelling of local tMRCA tracts along genomes is the Sequentially Markovian Coalescent (SMC) (McVean and Cardin 2005, Marjoram and Wall 2006) . The problem that the SMC solves is that the mosaic of local tMRCAs along the genome is unknown, both in their actual ages and in their positions along the genome. The SMC allows to effectively sum across all possibilities and handle the uncertainty probabilistically. Several important tools for inferring the demographic history of a population have been developed built on top of the SMC, including PSMC (Li and Durbin 2011), diCal (Sheehan et al 2013), MSMC (Schiffels and Durbin 2014), SMC++ (Terhorst et al 2017), eSMC (Sellinger et al. 2020) and others.
In this paper, Sellinger, Abu Awad and Tellier (2020) review these SMC-based methods and provide a coherent simulation design to comparatively assess their strengths and weaknesses in a variety of demographic scenarios (Sellinger, Abu Awad and Tellier 2020). In addition, they used these simulations to test how breaking various key assumptions in SMC methods affects estimates, such as constant recombination rates, or absence of false positive SNP calls.
As a result of this assessment, the authors not only provide practical guidance for researchers who want to use these methods, but also insights into how these methods work. For example, the paper carefully separates sources of error in these methods by observing what they call “Best-case convergence” of each method if the data behaves perfectly and separating that from how the method applies with actual data. This approach provides a deeper insight into the methods than what we could learn from application to genomic data alone.
In the age of genomics, computational tools and their development are key for researchers in this field. All the more important is it to provide the community with overviews, reviews and independent assessments of such tools. This is particularly important as sometimes the development of new methods lacks primary visibility due to relevant testing material being pushed to Supplementary Sections in papers due to space constraints. As SMC-based methods have become so widely used tools in genomics, I think the detailed assessment by Sellinger et al. (2020) is timely and relevant.
In conclusion, I recommend this paper because it bridges from a mere review of the different methods to an in-depth assessment of performance, thereby addressing both beginners in the field who just seek an initial overview, as well as experienced researchers who are interested in theoretical boundaries and assumptions of the different methods.
 Li, H., and Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357), 493-496. doi: https://doi.org/10.1038/nature10231
 Marjoram, P., and Wall, J. D. (2006). Fast"" coalescent"" simulation. BMC genetics, 7(1), 16. doi: https://doi.org/10.1186/1471-2156-7-16
 McVean, G. A., and Cardin, N. J. (2005). Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1459), 1387-1393. doi: https://doi.org/10.1098/rstb.2005.1673
 Schiffels, S., and Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature genetics, 46(8), 919-925. doi: https://doi.org/10.1038/ng.3015
 Sellinger, T. P. P., Awad, D. A., Moest, M., and Tellier, A. (2020). Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data. PLoS Genetics, 16(4), e1008698. doi: https://doi.org/10.1371/journal.pgen.1008698
 Sellinger, T. P. P., Awad, D. A. and Tellier, A. (2020) Limits and Convergence properties of the Sequentially Markovian Coalescent. bioRxiv, 2020.07.23.217091, ver. 3 peer-reviewed and recommended by PCI Evolutionary Biology. doi: https://doi.org/10.1101/2020.07.23.217091
 Sheehan, S., Harris, K., and Song, Y. S. (2013). Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics, 194(3), 647-662. doi: https://doi.org/10.1534/genetics.112.149096
 Terhorst, J., Kamm, J. A., and Song, Y. S. (2017). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature genetics, 49(2), 303-309. doi: https://doi.org/10.1038/ng.3748
Stephan Schiffels (2020) Review and Assessment of Performance of Genomic Inference Methods based on the Sequentially Markovian Coalescent. Peer Community in Evolutionary Biology, 100115. 10.24072/pci.evolbiol.100115
Reviewed by anonymous reviewer, 2020-11-02 01:14
Revision round #12020-08-25
Decision round #1
This preprint by Sellinger et al. describes several analyses around the Sequentially Markovian Coalescent, a methodological framework used heavily in the field of demographic inference from genomic data.
The preprint has now been read by three anonymous reviewers. I have also read the paper carefully, and I agree with the reviewers' generally positive assessment. As reviewer #3 noted, while some of these results are probably already scattered around in the literature (also in Supplements), a systematically conducted and concisely summarised analysis of these various important caveats for SMC methods is still missing. So I definitely think this will be a useful and relevant contribution.
As you can see, all three reviewers have some comments for improving clarity, and possibly expanding the study a bit. I personally find two suggestions for adding analysis to be particularly worth considering: First, reviewer #1 proposed to add a constant population size scenario as a “basic” model to supplement the more complex demographic scenarios you currently have. Second, reviewer #3 suggests to add error quantification in small tables in all analyses using the mean square error.
I’m in principle happy to recommend this paper after a revision addressing the raised points by the reviewers. Please give good reasons if you believe some suggestions should not be followed.
Thanks again for submitting this interesting paper and I look forward to receiving the revised version.
Additional requirements of the managing board:
As indicated in the 'How does it work?’ section and in the code of conduct, please make sure (if appropriate) that:
-Data are available to readers, either in the text or through an open data repository such as Zenodo (free), Dryad or some other institutional repository. Data must be reusable, thus metadata or accompanying text must carefully describe the data.
-Details on quantitative analyses (e.g., data treatment and statistical scripts in R, bioinformatic pipeline scripts, etc.) and details concerning simulations (scripts, codes) are available to readers in the text, as appendices, or through an open data repository, such as Zenodo, Dryad or some other institutional repository. The scripts or codes must be carefully described so that they can be reused.
-Details on experimental procedures are available to readers in the text or as appendices.
-Authors have no financial conflict of interest relating to the article. The article must contain a "Conflict of interest disclosure" paragraph before the reference section containing this sentence: "The authors of this preprint declare that they have no financial conflict of interest with the content of this article." If appropriate, this disclosure may be completed by a sentence indicating that some of the authors are PCI recommenders: “XXX is one of the PCI XXX recommenders.”