Recommendation

Review and Assessment of Performance of Genomic Inference Methods based on the Sequentially Markovian Coalescent

Stephan Schiffels based on reviews by 3 anonymous reviewers

A recommendation of:

Limits and Convergence properties of the Sequentially Markovian Coalescent

Thibaut Sellinger, Diala Abu Awad, Aurélien Tellier (2020), bioRxiv, 2020.07.23.217091, ver. 3 peer-reviewed and recommended by Peer Community in Evolutionary Biology https://doi.org/10.1101/2020.07.23.217091

Read preprint in preprint server Now published in a journal

Abstract

Limits and Convergence properties of the Sequentially Markovian Coalescent

Many methods based on the Sequentially Markovian Coalescent (SMC) have been and are being developed. These methods make use of genome sequence data to uncover population demographic history. More recently, new methods have extended the original theoretical framework, allowing the simultaneous estimation of the demographic history and other biological variables. These methods can be applied to many different species, under different model assumptions, in hopes of unlocking the population/species evolutionary history. Although convergence proofs in particular cases have been given using simulated data, a clear outline of the performance limits of these methods is lacking. We here explore the limits of this methodology, as well as present a tool that can be used to help users quantify what information can be confidently retrieved from given datasets. In addition, we study the consequences for inference accuracy of violating the hypotheses and assumptions of SMC approaches, such as the presence of transposable elements, variable recombination and mutation rates along the sequence and SNP call errors. We also provide a new interpretation of the SMC through the use of the estimated transition matrix and offer recommendations for the most efficient use of these methods under budget constraints, notably through the building of data sets that would be better adapted for the biological question at hand.

Hidden Markov Model, Ancestral Recombination Graph, Population Genetics

Submission: posted 25 July 2020
Recommendation: posted 03 November 2020, validated 12 November 2020

Cite this recommendation as:
Schiffels, S. (2020) Review and Assessment of Performance of Genomic Inference Methods based on the Sequentially Markovian Coalescent. Peer Community in Evolutionary Biology, 100115. https://doi.org/10.24072/pci.evolbiol.100115

Recommendation

The human genome not only encodes for biological functions and for what makes us human, it also encodes the population history of our ancestors. Changes in past population sizes, for example, affect the distribution of times to the most recent common ancestor (tMRCA) of genomic segments, which in turn can be inferred by sophisticated modelling along the genome.
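As a minimal illustration of this dependence (not taken from the paper), the sketch below draws pairwise coalescence times under a piecewise-constant diploid population size, assuming the standard result that two lineages coalesce at rate 1/(2N) per generation. The function name `sample_tmrca`, the epoch layout and the population sizes are hypothetical choices made only for this example.

```python
import random

def sample_tmrca(epochs, rng):
    """Draw one pairwise coalescence time in generations.

    epochs: list of (start_generation, diploid_size) sorted by start,
            with the first start equal to 0. The last epoch is unbounded.
    """
    for i, (start, size) in enumerate(epochs):
        end = epochs[i + 1][0] if i + 1 < len(epochs) else float("inf")
        # Two lineages coalesce at rate 1 / (2N) per generation. The
        # exponential is memoryless, so we can redraw in each epoch.
        wait = rng.expovariate(1.0 / (2.0 * size))
        if start + wait < end:
            return start + wait
    raise AssertionError("unreachable: the last epoch never ends")

rng = random.Random(1)
n_draws = 20_000

# Hypothetical histories: constant size versus a tenfold bottleneck
# between 1,000 and 2,000 generations ago.
constant = [(0, 10_000)]
bottleneck = [(0, 10_000), (1_000, 1_000), (2_000, 10_000)]

mean_constant = sum(sample_tmrca(constant, rng) for _ in range(n_draws)) / n_draws
mean_bottleneck = sum(sample_tmrca(bottleneck, rng) for _ in range(n_draws)) / n_draws
print(mean_constant, mean_bottleneck)
```

Under these assumptions the bottleneck history yields a shorter average tMRCA than the constant history, because many lineage pairs coalesce during the period of small population size. SMC-based methods exploit this kind of signal in the opposite direction, going from local tMRCAs to population size.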
A key framework for such modelling of local tMRCA tracts along genomes is the Sequentially Markovian Coalescent (SMC) (McVean and Cardin 2005, Marjoram and Wall 2006). The problem that the SMC solves is that the mosaic of local tMRCAs along the genome is unknown, both in their actual ages and in their positions along the genome. The SMC makes it possible to effectively sum across all possibilities and to handle this uncertainty probabilistically. Several important tools for inferring the demographic history of a population have been built on top of the SMC, including PSMC (Li and Durbin 2011), diCal (Sheehan et al. 2013), MSMC (Schiffels and Durbin 2014), SMC++ (Terhorst et al. 2017), eSMC (Sellinger et al. 2020) and others.
In this paper, Sellinger, Abu Awad and Tellier (2020) review these SMC-based methods and provide a coherent simulation design to comparatively assess their strengths and weaknesses in a variety of demographic scenarios. In addition, they use these simulations to test how violating key assumptions of SMC methods, such as constant recombination rates or the absence of false-positive SNP calls, affects the estimates.
As a result of this assessment, the authors not only provide practical guidance for researchers who want to use these methods, but also insights into how these methods work. For example, the paper carefully separates sources of error in these methods by observing what they call “Best-case convergence” of each method if the data behaves perfectly and separating that from how the method applies with actual data. This approach provides a deeper insight into the methods than what we could learn from application to genomic data alone.
In the age of genomics, computational tools and their development are key for researchers in this field. It is all the more important to provide the community with overviews, reviews and independent assessments of such tools. This is particularly important as the development of new methods sometimes lacks visibility because relevant testing material is pushed to the Supplementary Sections of papers owing to space constraints. As SMC-based methods have become such widely used tools in genomics, I think the detailed assessment by Sellinger et al. (2020) is timely and relevant.
In conclusion, I recommend this paper because it goes beyond a mere review of the different methods to an in-depth assessment of their performance, thereby addressing both beginners in the field who just seek an initial overview and experienced researchers who are interested in the theoretical boundaries and assumptions of the different methods.

References

[1] Li, H., and Durbin, R. (2011). Inference of human population history from individual whole-genome sequences. Nature, 475(7357), 493-496. doi: https://doi.org/10.1038/nature10231
[2] Marjoram, P., and Wall, J. D. (2006). Fast "coalescent" simulation. BMC Genetics, 7(1), 16. doi: https://doi.org/10.1186/1471-2156-7-16
[3] McVean, G. A., and Cardin, N. J. (2005). Approximating the coalescent with recombination. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1459), 1387-1393. doi: https://doi.org/10.1098/rstb.2005.1673
[4] Schiffels, S., and Durbin, R. (2014). Inferring human population size and separation history from multiple genome sequences. Nature Genetics, 46(8), 919-925. doi: https://doi.org/10.1038/ng.3015
[5] Sellinger, T. P. P., Awad, D. A., Moest, M., and Tellier, A. (2020). Inference of past demography, dormancy and self-fertilization rates from whole genome sequence data. PLoS Genetics, 16(4), e1008698. doi: https://doi.org/10.1371/journal.pgen.1008698
[6] Sellinger, T. P. P., Awad, D. A. and Tellier, A. (2020) Limits and Convergence properties of the Sequentially Markovian Coalescent. bioRxiv, 2020.07.23.217091, ver. 3 peer-reviewed and recommended by PCI Evolutionary Biology. doi: https://doi.org/10.1101/2020.07.23.217091
[7] Sheehan, S., Harris, K., and Song, Y. S. (2013). Estimating variable effective population sizes from multiple genomes: a sequentially Markov conditional sampling distribution approach. Genetics, 194(3), 647-662. doi: https://doi.org/10.1534/genetics.112.149096
[8] Terhorst, J., Kamm, J. A., and Song, Y. S. (2017). Robust and scalable inference of population history from hundreds of unphased whole genomes. Nature Genetics, 49(2), 303-309. doi: https://doi.org/10.1038/ng.3748

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Reviewed by anonymous reviewer 3, 02 Nov 2020

I am satisfied with the authors' revisions.

https://doi.org/10.24072/pci.evolbiol.100115.rev21

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2020.07.23.217091

Author's Reply, 22 Sep 2020

Dear Recommender,

Please find attached our revised manuscript entitled “Limits and Convergence properties of the Sequentially Markovian Coalescent” by Thibaut Sellinger, Diala Abu Awad and Aurélien Tellier, which we would like to be considered for recommendation in PCI Evolutionary Biology.

First, we would like to thank you for giving us the opportunity to resubmit this manuscript and your positive comments. We would also like to thank all reviewers for appreciating the importance of our work and for their useful comments.

We paid close attention to answering all the reviewers’ comments and have modified the manuscript accordingly. We believe that we have improved its readability, as we rewrote some sections of the manuscript that the reviewers felt were unclear. We have also included new Supplementary Figures (seven in total) corresponding to the requested analyses and six Supplementary Tables containing the mean square error (MSE) of the past demographic inferences for all the figures in the manuscript. These measures have helped us, and hopefully will help the readers, to better understand our results. However, we found that the MSE alone cannot precisely measure the performance of the methods (see the reply to the reviewers' comments for more detail).

In addition to what the reviewers requested, we made some further corrections. First, we realised that the time window of the theoretical convergence analyses (now called best-case convergence) was ill-defined by a factor of 2. All analyses were therefore run again to fix this. Second, there were minor errors in the msprime command lines when simulating data for SMC++, which required that we re-simulate all data using msprime and re-run all SMC++ analyses. Slightly different results are observed for Figure 4 and for Supplementary Figure 14 compared to the first version of the manuscript, but all other SMC++ results are identical. We also noticed that the section concerning transposable elements was confusing, and thus rewrote it while adding two Supplementary Figures (35 and 36). We hope our motivations and our results are now clearer. Lastly, we fixed a plotting issue in Supplementary Figures 15 and 22.

We hope that this revised version fulfils the criteria for recommendation in PCI,

Many thanks in advance,

Yours sincerely,

On behalf of the authors, Thibaut Sellinger.

https://doi.org/10.24072/pci.evolbiol.100251.ar1

Decision by Stephan Schiffels, posted 25 Aug 2020

This preprint by Sellinger et al. describes several analyses around the Sequentially Markovian Coalescent, a methodological framework used heavily in the field of demographic inference from genomic data.

The preprint has now been read by three anonymous reviewers. I have also read the paper carefully, and I agree with the reviewers' generally positive assessment. As reviewer #3 noted, while some of these results are probably already scattered around in the literature (also in Supplements), a systematically conducted and concisely summarised analysis of these various important caveats for SMC methods is still missing. So I definitely think this will be a useful and relevant contribution.

As you can see, all three reviewers have some comments for improving clarity, and possibly expanding the study a bit. I personally find two suggestions for adding analysis to be particularly worth considering. First, reviewer #1 proposed adding a constant population size scenario as a “basic” model to supplement the more complex demographic scenarios you currently have. Second, reviewer #3 suggests adding error quantification, using the mean square error, in small tables for all analyses.

I’m in principle happy to recommend this paper after a revision addressing the points raised by the reviewers. Please give good reasons if you believe some suggestions should not be followed.

Thanks again for submitting this interesting paper and I look forward to receiving the revised version.

Additional requirements of the managing board:
As indicated in the 'How does it work?' section and in the code of conduct, please make sure (if appropriate) that:
- Data are available to readers, either in the text or through an open data repository such as Zenodo (free), Dryad or some other institutional repository. Data must be reusable, thus metadata or accompanying text must carefully describe the data.
- Details on quantitative analyses (e.g., data treatment and statistical scripts in R, bioinformatic pipeline scripts, etc.) and details concerning simulations (scripts, codes) are available to readers in the text, as appendices, or through an open data repository, such as Zenodo, Dryad or some other institutional repository. The scripts or codes must be carefully described so that they can be reused.
- Details on experimental procedures are available to readers in the text or as appendices.
- Authors have no financial conflict of interest relating to the article. The article must contain a "Conflict of interest disclosure" paragraph before the reference section containing this sentence: "The authors of this preprint declare that they have no financial conflict of interest with the content of this article." If appropriate, this disclosure may be completed by a sentence indicating that some of the authors are PCI recommenders: "XXX is one of the PCI XXX recommenders."

https://doi.org/10.24072/pci.evolbiol.100251.d1

Reviewed by anonymous reviewer 1, 18 Aug 2020

Download the review https://doi.org/10.24072/pci.evolbiol.100251.rev11

Reviewed by anonymous reviewer 2, 17 Aug 2020

Download the review https://doi.org/10.24072/pci.evolbiol.100251.rev12

Reviewed by anonymous reviewer 3, 25 Aug 2020

The authors conducted a simulation study of the strengths and weaknesses of some demographic inference packages based on the sequentially Markov coalescent, under various data and parameter regimes. SMC methods are now widely used, in an increasingly diverse array of settings, and it is important to understand what causes them to succeed and fail. Although several of the conclusions reached here are scattered about in the literature, this is a more systematic attempt to organize them into a coherent set of recommendations for practitioners. So, it seems like a useful contribution.

I don't have any major concerns or objections, but I think the paper could be improved a bit, and perhaps expanded in a few related directions.

Major comments

Error quantification: The performance of a statistical estimator is generally measured in terms of mean-squared error. The results shown in Figures 1-7 are qualitatively useful for building intuition about how each of the scenarios affects inference, but it is impossible to quantify the difference in performance between (or even within) different figures. Consequently, the discussion is entirely qualitative. Each figure should have an accompanying table with the MSE for the corresponding methods and scenarios, and those could be used to argue more rigorously about the strengths and weaknesses of various methods.
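A minimal sketch of such an error measure follows, assuming piecewise-constant size histories and an MSE taken on log10 population size over a log-spaced time grid. These choices, and the names `size_at` and `log_mse`, are illustrative assumptions, not the authors' actual procedure.

```python
import math
from bisect import bisect_right

def size_at(times, sizes, t):
    """Piecewise-constant N(t): sizes[i] holds from times[i] to times[i + 1].

    times must be sorted and start at 0 so that every t > 0 is covered.
    """
    return sizes[bisect_right(times, t) - 1]

def log_mse(true_hist, est_hist, t_min, t_max, n_points=200):
    """Mean squared error of log10 N(t) on a log-spaced grid in [t_min, t_max]."""
    total = 0.0
    for k in range(n_points):
        t = t_min * (t_max / t_min) ** (k / (n_points - 1))
        diff = math.log10(size_at(*est_hist, t)) - math.log10(size_at(*true_hist, t))
        total += diff * diff
    return total / n_points

# Hypothetical example: the truth has a tenfold decline 1,000 generations
# ago, while the estimate misses it and stays flat.
truth = ([0, 1_000], [10_000, 1_000])
flat_estimate = ([0, 1_000], [10_000, 10_000])

error_perfect = log_mse(truth, truth, t_min=10, t_max=100_000)
error_flat = log_mse(truth, flat_estimate, t_min=10, t_max=100_000)
print(error_perfect, error_flat)
```

Working on a log scale for both time and size mirrors how demographic histories are usually plotted, so that errors in recent and ancient epochs contribute comparably. A linear grid would weight ancient times far more heavily.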
Regularization: in several of the scenarios analyzed, the results seem like they could be improved by adding a penalty term. SMC++ supports regularization natively, and it could be easily added to the authors' eSMC package, but regularization is not really explored in the paper except briefly in Table 2. A thorough study of how regularization affects demographic inference, both in terms of what form of regularization to use as well as how to tune the hyperparameters, is currently missing from the literature to the best of my knowledge (but see the recent preprint from Kelley Harris' lab on their method mushi). I realize that one could easily write a whole other paper on this, and am not advocating for major additions along these lines. Still, another subsection or two on this topic would be very useful in applications.
There are various other confounders that could be taken into consideration. I think ascertainment bias in particular would be interesting to look at. The ASMC paper delved into this a bit, but there is more that could be done. How badly does ascertainment bias (potentially in a related population) and SNP sparsity affect PSMC? This could have important practical consequences since a lot of fields still rely on microarrays. It could be incorporated into the present paper by running msprime on a large sample size and only keeping SNPs above a certain MAF threshold.
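The filtering step of that proposal could look roughly like the sketch below, which keeps only sites whose minor-allele frequency (MAF) in a panel reaches a threshold. The simulation itself is omitted. The small 0/1 haplotype matrix stands in for simulated data, and the function name `maf_filter` and the threshold are hypothetical.

```python
def maf_filter(genotypes, min_maf):
    """Return indices of sites whose minor-allele frequency is >= min_maf.

    genotypes: list of sites, each a list of 0/1 alleles, one per haplotype.
    """
    kept = []
    for idx, site in enumerate(genotypes):
        n = len(site)
        derived = sum(site)
        # The minor allele is whichever of the two alleles is rarer.
        minor = min(derived, n - derived)
        if minor / n >= min_maf:
            kept.append(idx)
    return kept

# Hypothetical panel of 10 haplotypes at 4 sites.
panel = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # singleton, MAF 0.1
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # MAF 0.3
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],  # derived frequency 0.7, MAF 0.3
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # fixed, MAF 0.0
]
kept = maf_filter(panel, min_maf=0.25)
print(kept)
```

Sites dropped by the filter would then be treated as non-segregating when the retained data are passed to an SMC method, which is how an ascertainment scheme of this kind thins out rare variants.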
I don't quite get the focus on estimating rho/theta (though I understand its effects on inference). I tend to think of this as a nuisance parameter when running SMC methods. It is not reasonable to assume that rho is constant over the whole chromosome anyways, nor should we expect this to be a good estimate of the chromosome-wide average rho when the true underlying rates are heterogeneous.

Minor comments

This sawtooth demography is slightly different from one in the original MSMC paper. The final nadir at 10^1 generations occurs too recently, and the population size should be constant with Ne=14312 from 33 generations ago to present. This isn't such a big deal since the model was pulled from thin air in the first place, but since the community has coalesced around this model as a benchmark, it’s better if the paper used the same version of it as everyone else. The stdpopsim package can simulate directly from this demography in a few lines of code.
Sections 2.1.4 and 3.1 / Fig. 1: The titles give a somewhat misleading impression. A theoretical convergence result would be very nice, but that's not what is offered. I would prefer to call this something like "Best-case convergence".
381: "SMC++ seems especially sensitive" -- I don't see this reflected in Figure 4. If anything, it looks less sensitive than the other three methods. This makes sense to me since for high values of rho, the frequency spectrum is a better estimator of demography than methods which use only linkage information.
565f: "Could be used in more complex scenarios". Recent theoretical work (see the article 'How Many Subpopulations Is Too Many? Exponential Lower Bounds for Inferring Population Histories' by Kim et al, JCB 2020) strongly suggests that this is not possible. The IICR is not useful for recovering complex demographic histories.
Various spelling or grammar errors:
- 35: ecologist
- 44: estimations/interpretations
- 47: state-of-the-art
- 50: well-known, Pairewise
- 101: simulates
- 149: discretized
- 287: I think it should be "whose dynamics"

https://doi.org/10.24072/pci.evolbiol.100251.rev13