Recommendation

Beyond the standard coalescent: demographic inference with complete genomes and graph neural networks under the beta coalescent

Julien Yann Dutheil based on reviews by 2 anonymous reviewers

A recommendation of:

Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent

Kevin Korfmann, Thibaut Sellinger, Fabian Freund, Matteo Fumagalli, Aurélien Tellier (2024), bioRxiv, ver.5, peer-reviewed and recommended by PCI Evolutionary Biology https://doi.org/10.1101/2022.09.28.508873

Now published in Peer Community Journal.

Abstract

The reproductive mechanism of a species is a key driver of genome evolution. The standard Wright-Fisher model for the reproduction of individuals in a population assumes that each individual produces a number of offspring negligible compared to the total population size. Yet many species of plants, invertebrates, prokaryotes or fish exhibit neutrally skewed offspring distribution or strong selection events yielding few individuals to produce a number of offspring of up to the same magnitude as the population size. As a result, the genealogy of a sample is characterized by multiple individuals (more than two) coalescing simultaneously to the same common ancestor. The current methods developed to detect such multiple merger events do not account for complex demographic scenarios or recombination, and require large sample sizes. We tackle these limitations by developing two novel and different approaches to infer multiple merger events from sequence data or the ancestral recombination graph (ARG): a sequentially Markovian coalescent (SMβC) and a graph neural network (GNNcoal). We first give proof of the accuracy of our methods to estimate the multiple merger parameter and past demographic history using simulated data under the β-coalescent model. Secondly, we show that our approaches can also recover the effect of positive selective sweeps along the genome. Finally, we are able to distinguish skewed offspring distribution from selection while simultaneously inferring the past variation of population size. Our findings stress the aptitude of neural networks to leverage information from the ARG for inference but also the urgent need for more accurate ARG inference approaches.

Keywords: Kingman coalescent, beta coalescent, selective sweep, deep learning, graph neural networks, population genetics, multiple merger coalescent, sequentially Markovian coalescent, ancestral recombination graph


Submission: posted 31 July 2023, validated 02 August 2023
Recommendation: posted 04 March 2024, validated 04 March 2024

Cite this recommendation as:
Dutheil, J. (2024) Beyond the standard coalescent: demographic inference with complete genomes and graph neural networks under the beta coalescent. Peer Community in Evolutionary Biology, 100699. https://doi.org/10.24072/pci.evolbiol.100699

Recommendation

Modelling the evolution of complete genome sequences in populations requires accounting for the recombination process, as a single tree can no longer describe the underlying genealogy. The sequentially Markov coalescent (SMC, McVean and Cardin 2005; Marjoram and Wall 2006) approximates the standard coalescent with recombination process and permits estimating population genetic parameters (e.g., population sizes, recombination rates) using population genomic datasets. As such datasets become available for an increasing number of species, more fine-tuned models are needed to encompass the diversity of life cycles of organisms beyond the model species on which most methods have been benchmarked.

The work by Korfmann et al. (2024) represents a significant step forward as it accounts for multiple mergers in SMC models. Multiple merger models allow for simultaneous coalescence events, in which more than two lineages find a common ancestor in a given generation. This feature is not allowed in standard coalescent models and may result from selection or skewed offspring distributions, conditions likely met by a broad range of species, particularly microbial ones.
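To make the notion of a multiple merger concrete, the sketch below evaluates the merger rates of a Beta(2 − α, α)-coalescent, using the standard Λ-coalescent expression λ(b, k) = B(k − α, b − k + α) / B(2 − α, α) for the rate at which one particular set of k out of b lineages merges. It is an illustration only and is not taken from the recommended preprint; the function names are hypothetical, and time is measured in abstract coalescent units with no reference to generations.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Natural log of the Beta function B(a, b), valid for a, b > 0."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_merger_rate(b, k, alpha):
    """Rate at which one specific set of k out of b lineages merges.

    Assumes a Beta(2 - alpha, alpha)-coalescent with 0 < alpha < 2
    (hypothetical helper written for illustration).
    """
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    if not 0 < alpha < 2:
        raise ValueError("need 0 < alpha < 2")
    return exp(log_beta(k - alpha, b - k + alpha) - log_beta(2 - alpha, alpha))


def total_k_merger_rate(b, k, alpha):
    """Total rate of any k-merger among b lineages (all comb(b, k) subsets)."""
    return comb(b, k) * beta_merger_rate(b, k, alpha)


if __name__ == "__main__":
    # Share of each merger size (k = 2..5) among 5 lineages for several alpha.
    for alpha in (1.2, 1.5, 1.9, 1.999):
        rates = [total_k_merger_rate(5, k, alpha) for k in range(2, 6)]
        total = sum(rates)
        print(alpha, [round(r / total, 4) for r in rates])
```

As α approaches 2, the share of mergers involving more than two lineages shrinks towards zero and only pairwise mergers remain, which is the Kingman behaviour; smaller α values put increasing weight on large mergers.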

Yet, this work goes beyond extending the SMC, as it introduces several methodological innovations. The "classical" SMC-based inference approaches rely on hidden Markov models to compute the likelihood of the data while efficiently integrating over the possible ancestral recombination graphs (ARG). Following other recent works (e.g. Gattepaille et al. 2016), Korfmann et al. propose to separate the ARG inference from model parameter estimation under maximum likelihood (ML). They introduce a procedure where the ARG is first reconstructed from the data and then taken as input in the model fitting step. While this approach does not permit accounting for the uncertainty in the ARG reconstruction (which is typically large), it potentially allows for the extraction of more information from the ARG, such as the occurrence of multiple merger events. Moving away from maximum likelihood inference, the authors also trained a graph neural network (GNN) on simulated ARGs, introducing a new, flexible way to estimate population genomic parameters.

The authors used simulations under a beta-coalescent model with diverse demographic scenarios and showed that the ML and GNN approaches introduced can reliably recover the simulated parameter values. They further show that when the true ARG is given as input, the GNN outperforms the ML approach, demonstrating its promising power as ARG reconstruction methods improve. In particular, they showed that trained GNNs can disentangle the effects of selective sweeps and skewed offspring distributions while inferring past population size changes.

This work paves the way for new, exciting applications, though many questions must be answered. How frequent are multiple mergers? As the authors showed that these events "erase" the record of past demographic events, how many genomes are needed to conduct reliable inference, and can the methods computationally cope with the resulting (potentially large) amounts of required data? This is particularly intriguing as micro-organisms, prone to strong selection and skewed offspring distributions, also tend to carry smaller genomes.

References

Gattepaille L, Günther T, Jakobsson M. 2016. Inferring Past Effective Population Size from Distributions of Coalescent Times. Genetics 204:1191-1206.
https://doi.org/10.1534/genetics.115.185058

Korfmann K, Sellinger T, Freund F, Fumagalli M, Tellier A. 2024. Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent. bioRxiv, 2022.09.28.508873. ver. 5 peer-reviewed and recommended by Peer Community in Evolutionary Biology. https://doi.org/10.1101/2022.09.28.508873

Marjoram P, Wall JD. 2006. Fast "coalescent" simulation. BMC Genet. 7:16.
https://doi.org/10.1186/1471-2156-7-16

McVean GAT, Cardin NJ. 2005. Approximating the coalescent with recombination. Philos. Trans. R. Soc. Lond. B. Biol. Sci. 360:1387-1393.
https://doi.org/10.1098/rstb.2005.1673

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Funding:
This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A532B, 031A533A, 031A533B, 031A534A, 031A535A, 031A537A, 031A537B, 031A537C, 031A537D, 031A538A). KK is supported by a grant from the Deutsche Forschungsgemeinschaft (DFG) through the TUM International Graduate School of Science and Engineering (IGSSE), GSC 81, within the project GENOMIE QADOP. TS is supported by the Austrian Science Fund (project no. TAI 151-B). AT acknowledges funding from the DFG grant TE809/1-4 (project 254587930) and TE809/7-1 (project 317616126). FF and AT acknowledge funding from the DFG Priority Program SPP1590 on "Probabilistic Structures in Evolution". MF and AT acknowledge the support from the Imperial College - TUM Partnership award.

Reviews

Evaluation round #2

DOI or URL of the preprint: https://doi.org/10.1101/2022.09.28.508873

Version of the preprint: 4

Author's Reply, 27 Feb 2024

Dear Recommender,

Please find attached our revised manuscript entitled “Simultaneous Inference of Past Demography and Selection from the Ancestral Recombination Graph under the Beta Coalescent” by Kevin Korfmann, Thibaut Sellinger, Fabian Freund, Matteo Fumagalli and Aurélien Tellier.

First, we would like to thank both reviewers for reading and appreciating the revised version of the manuscript. We understood that the main and last remaining problem was the scaling discrepancy between the Kingman coalescent and the implementation of the Beta coalescent in msprime.

We found ourselves puzzled at first, as the implementation of msprime was beyond the scope of our study. However, we understood the concern and modified the manuscript to clarify the implementation of the Beta coalescent in msprime.

Many thanks in advance,

On behalf of the authors,

Kevin Korfmann & Thibaut Sellinger

Revision round #2

Decision for round #2 : Revision needed

Dear authors,

I have received feedback from the two reviewers, and you will see that they are generally satisfied with your revision. There is one remaining point from reviewer 2 (why a beta-coalescent with alpha=2 does not exactly converge to a standard coalescent) that needs further clarification. As also pointed out by reviewer 1, the beta-coalescent might not be as widespread knowledge as more classical models; therefore, I believe it is important to make its presentation as clear as possible. If you would be able to address this last point, I would then recommend the manuscript for PCI Evol Biol.

Best regards,

Julien Dutheil.

Reviewer 1 :

The preprint has improved significantly from the previous version in the presentation and communication. I would like to acknowledge the authors for addressing all major and most specific comments, evident in the main text and the inclusion of new figures. Regarding one of the major comments, it was exciting to see the newly added results of GNNcoal trained with true or inferred genealogies. While I maintain my concern about the reader-friendliness of presenting some data in tables instead of in figures, I recognize that this is not a critique of the scientific content. This matter, therefore, can be appropriately discussed in the correspondence between the authors and the recommender. Finally, I would like to congratulate the authors on the revised version of the preprint and I thank the recommender for inviting me to review this exciting manuscript.

Reply: We thank Reviewer 1 for their comments and their appreciation of the revised version of the manuscript.

Reviewer 2 :

“Most importantly, I do not understand, why the Beta-coalescent is not exactly transitioning to the Kingman coalescent for α = 2.”

Reply: We would like to thank the reviewer for the detailed inspection of the underlying model. The first point we would like to make is that the scaling indeed plays an important part, as is evident in Figure 2 and recognized by the reviewer. This figure describes the evaluation of PSMC/MSMC on the msprime implementation of the Beta coalescent. Likewise, both SMBC and GNNcoal have been evaluated on the msprime version of the Beta coalescent, and any scalings introduced by msprime are either directly mathematically transferred in SMBC or learned implicitly through training on simulations by GNNcoal.

To find the justification for implementing this scaling, we reached out to Dr Jere Koskela, Reader in Statistics at Newcastle, who is involved with the implementation of the respective parts of msprime. In his reply, he confirms that taking the limit of the Beta coalescent and its coalescent rates as alpha goes to 2 (and plugging in the Delta distribution) yields the Kingman coalescent. However, this view lacks any notion of time scales, which is where the issue lies. msprime implements the Galton-Watson process of Schweinsberg (2003), which adds a notion of a time scale, whose scaling can be found in the msprime documentation and in the literature mentioned below. Furthermore, Dr Koskela also highlights a discontinuous jump in the timescale between the limit alpha -> 2 and alpha actually being equal to 2.

For completeness, we attach the relevant part of Dr Koskela's kind reply in the following: “[...] If I just take a Beta-coalescent as an abstract mathematical object and send α to 2, I get the Kingman coalescent with no further caveats or complications (indeed, there is no notion of a timescale). If I specify that I'm working with the pre-limiting sequence of supercritical Galton-Watson-type population models in Schweinsberg's paper, then there is a notion of timescale and sending α to 2 affects it. Setting α >= 2 gives a timescale of C(α)N generations for a constant C(α) > 0 which depends on α but not N (Schweinsberg's Lemma 6), while 1 < α < 2 yields the timescale in the msprime BetaCoalescent documentation (Schweinsberg's Lemma 13). In fact, the 1 < α < 2 timescale collapses to zero as α -> 2 [...], so there is a discontinuous jump in the timescale from α -> 2 to α = 2, i.e. the limit and the timescale do not commute. [...]”
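The collapse of the timescale described in the quote can be illustrated numerically. The sketch below assumes a haploid timescale of the form m^α N^(α−1) / (α B(2 − α, α)) generations with m = 1 + 1 / (2^(α−1) (α − 1)); this is our recollection of the expression given in the msprime BetaCoalescent documentation and should be checked against that documentation before being relied upon. The function names and the population size used are hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Natural log of the Beta function B(a, b), valid for a, b > 0."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_timescale_generations(alpha, pop_size):
    """Generations per coalescent time unit for 1 < alpha < 2.

    ASSUMPTION: haploid formula m**alpha * N**(alpha - 1) / (alpha * B(2 - alpha, alpha))
    with m = 1 + 1 / (2**(alpha - 1) * (alpha - 1)), as recalled from the
    msprime documentation; verify there before use.
    """
    if not 1 < alpha < 2:
        raise ValueError("need 1 < alpha < 2")
    if pop_size <= 0:
        raise ValueError("need pop_size > 0")
    m = 1.0 + 1.0 / (2.0 ** (alpha - 1.0) * (alpha - 1.0))
    log_t = (
        alpha * log(m)
        + (alpha - 1.0) * log(pop_size)
        - log(alpha)
        - log_beta(2.0 - alpha, alpha)
    )
    return exp(log_t)


if __name__ == "__main__":
    pop_size = 10_000  # hypothetical population size
    for alpha in (1.5, 1.9, 1.99, 1.999):
        t = beta_timescale_generations(alpha, pop_size)
        print(f"alpha={alpha}: {t:.1f} generations per unit (Kingman scale: {pop_size})")
```

Under this assumed formula, the number of generations per coalescent time unit shrinks as α approaches 2, because B(2 − α, α) diverges, whereas a Wright-Fisher (Kingman) scaling would use a timescale proportional to N. This is one way to see why the limit and the timescale do not commute.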

What does that mean for SMBC and GNNcoal?

As stated earlier, any time scales implemented in msprime are also inherited by our models (e.g. α is upper bounded by 1.99 in SMBC). Due to the discontinuity when moving from the Beta coalescent to the Kingman coalescent, studies are required to carefully evaluate the expected strength of the underlying sweepstakes of the model organism and to choose the appropriate neutral model. This is especially crucial in the range from α > 1.9 up to 1.99, where msprime suffers from numerical instability issues, which are currently being addressed by Dr Koskela and the authors of msprime.

Second, the “standard” SMC-based methods do indeed assume a Wright-Fisher model scaling (N generations, as the coalescence probability =1/N exactly), so they will derive a different timescale even when alpha is very close to 2 (see below reply to question 3).

Schweinsberg paper: https://www.sciencedirect.com/science/article/pii/S0304414903000280

Doesn’t this show that there is some scaling problem of the mutation rates in your simulation?

Reply: We checked our simulation scripts, and our mutation rates are in line with how msprime was designed. As explained above, the discrepancy originates from the simulator implementation and not from our use of it. We have now introduced one sentence in the introduction (Line 61-67) and one in the method section (Line 256-261) to clarify the issue with msprime.

Minor points:

1.) The explicit formulas for the scaling-factor are incomplete: In the formula for the so-called ”scaling constant” on Line 64, there appears a β, which has not been introduced or defined as a parameter.

Reply: We apologize for this confusion; this beta stands for the Beta function. We corrected the manuscript.

2.) The quotations after these formulas are unhelpful, at least to me. I took a look at all three papers (refs. 8, 55 and 56), and while I admit I didn’t read them in all detail, I could not really find these formulas. Perhaps these formulas could be derived for the reader (with references) in a short Supplementary Chapter or a methods paragraph. They can then be taken out of the text in lines 62-64, actually, where they are a bit overwhelming I think.

Reply: Once again, we apologize for the confusion. The formulas can be explicitly found in the msprime documentation, and we have therefore added the reference to the msprime manuscript where the Beta coalescent was introduced (2022 in Genetics) as well as the article from Schweinsberg in 2003 where the event rates are derived. We also added a short description in our methods as well as the documentation of msprime in the data availability section to make it easier for the reader to find information specific to msprime.

3.) The authors’ response about my critique of their figure 2 is partly convincing. I get that you want to make the point that indeed the population size inference gets wrong if the assumptions break down. But, coming back to my main point above, this point only comes across if you actually show that the discrepancy between expectation and fit actually vanishes for α → 2. I find it hard to believe that for α = 1.9, the violation of the Kingman-coalescent assumption is already so stark that the population size is mis-estimated by a factor 100, which is what I see in Figure 2a. To repeat myself: I think there is something wrong with that. What I would have expected from that figure is a fit which looks very good for, say, α = 1.99, perhaps marginally worse for α = 1.9, and then perhaps increasingly bad for lower values. Instead, what I see in your Figure 2 is a terrible fit in all four cases, with a discrepancy ranging from a factor 100 to 1000.

Reply: We understand your point and hope to have addressed the scaling issue above.

However, concerning the underlying point about a scaling discrepancy due to biological factors (and not implementation), we agree with you. If the inferred alpha is greater than 1.9 (or even 1.8), we would simply assume the underlying model to be a Kingman coalescent and use eSMC2 (or msmc2). That is why the user is free to choose the scaling with SMBC (the Kingman coalescent one or the beta coalescent one resulting from the implementation of msprime): the shape of the output will not change, only its position on the y and x axes. The output from SMBC can also be scaled according to the user's preference if they wish to introduce knowledge that SMBC does not have.

We now clarify the issue in the Methods section of the manuscript, using the msprime manual as a reference, and at the beginning of the Results section (Lines 328-355).

Minor point: In Line 62 there is a typo, I think. It says Beta(2α,α), but I think it should be Beta(2 −α,α)

Reply: Thank you for spotting this issue. The minus was in the .tex but was not displayed in the pdf. We have now fixed it.
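
For readers less familiar with the model, the Beta(2 − α, α) term discussed above can be illustrated numerically. The sketch below is not taken from the manuscript or from msprime. It assumes the standard Beta(2 − α, α) coalescent parametrisation (Schweinsberg 2003), in which each particular group of k out of b lineages merges at rate B(k − α, b − k + α) / B(2 − α, α), and all function names are hypothetical. It only covers merger rates in coalescent time units and says nothing about the conversion to generations, which is the scaling issue debated in this exchange.

```python
import math


def beta_fn(a: float, b: float) -> float:
    """Euler Beta function B(a, b), computed through log-gamma for stability."""
    return math.exp(math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b))


def merger_rate(b: int, k: int, alpha: float) -> float:
    """Rate at which one particular group of k out of b lineages merges.

    Assumption: standard Beta(2 - alpha, alpha) coalescent with 1 < alpha < 2.
    """
    if not 2 <= k <= b:
        raise ValueError("need 2 <= k <= b")
    if not 1.0 < alpha < 2.0:
        raise ValueError("need 1 < alpha < 2")
    return beta_fn(k - alpha, b - k + alpha) / beta_fn(2.0 - alpha, alpha)


def multiple_merger_fraction(b: int, alpha: float) -> float:
    """Share of the total merger rate due to events joining more than two lineages."""
    total = sum(math.comb(b, k) * merger_rate(b, k, alpha) for k in range(2, b + 1))
    pairwise = math.comb(b, 2) * merger_rate(b, 2, alpha)
    return 1.0 - pairwise / total


if __name__ == "__main__":
    # Illustrative values of alpha only; b = 10 lineages is an arbitrary choice.
    for alpha in (1.3, 1.7, 1.9, 1.999):
        print(alpha, merger_rate(10, 2, alpha), multiple_merger_fraction(10, alpha))
```

Under these assumptions, the pairwise rate approaches 1 and the share of multiple mergers approaches 0 as α approaches 2, which is the Kingman limit the reviewer expects to see.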

https://doi.org/10.24072/pci.evolbiol.100699.ar2

Decision by Julien Yann Dutheil, posted 07 Feb 2024, validated 07 Feb 2024

Dear authors,

I have received feedback from the two reviewers, and you will see that they are generally satisfied with your revision. There is one remaining point from reviewer 2 (why a beta-coalescent with alpha=2 does not exactly converge to a standard coalescent) that needs further clarification. As also pointed out by reviewer 1, the beta-coalescent might not be as widespread knowledge as more classical models; therefore, I believe it is important to make its presentation as clear as possible. If you would be able to address this last point, I would then recommend the manuscript for PCI Evol Biol.

Best regards,

Julien Dutheil.

https://doi.org/10.24072/pci.evolbiol.100699.d2

Reviewed by anonymous reviewer 1, 30 Jan 2024

The preprint has improved significantly from the previous version in the presentation and communication.
I would like to acknowledge the authors for addressing all major and most specific comments, evident in the main text and the inclusion of new figures.
Regarding one of the major comments, it was exciting to see the newly added results of GNNcoal trained with true or inferred genealogies.
While I maintain my concern about the reader-friendliness of presenting some data in tables instead of in figures, I recognize that this is not a critique of the scientific content.
This matter, therefore, can be appropriately discussed in the correspondence between the authors and the recommender.
Finally, I would like to congratulate the authors on the revised version of the preprint and I thank the recommender for inviting me to review this exciting manuscript.

https://doi.org/10.24072/pci.evolbiol.100699.rev21

Reviewed by anonymous reviewer 2, 07 Feb 2024

Download the review https://doi.org/10.24072/pci.evolbiol.100699.rev22

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/2022.09.28.508873

Version of the preprint: 3

Author's Reply, 04 Jan 2024

Download author's reply Download tracked changes file https://doi.org/10.24072/pci.evolbiol.100699.ar1

Decision by Julien Yann Dutheil, posted 03 Oct 2023, validated 04 Oct 2023

This manuscript by Korfmann and collaborators reports extensive developments of new genomic inference methods based on the beta-coalescent. This work extends classic models based on the Kingman coalescent, possibly bringing such approaches to a broader range of organisms, notably microbes. The manuscript represents a significant methodological advance, which comes in three ways:

A new inference model, extending the multiple sequentially Markov coalescent approach (MSMC) to account for multiple mergers.
A new graph neural network approach that can learn coalescence parameters from ancestral recombination graphs.
Approaches based on the newly introduced models to infer regions under selection along the genome.

The two reviewers highlight the innovative aspects of the work and its great application potential. However, both indicate that the presentation of the model and results should be improved. They provide detailed comments and suggestions that, I believe, will be useful to the authors to improve their manuscript. I further highlight below some points that I think the authors should address:

1) The extensive mathematical developments require that their detailed exposure be provided as supplementary material. This exposure is, however, incomplete in several places, and some critical information is missing in the main text:

It is indicated that SMbetaC can be run on ARGs instead of sequences (e.g. l211). How does the inference work in such a case? That is, what are the hidden states? I could not find a description of this approach in the supplementary text, which only describes the standard model where genealogies are the hidden states.
How are selection scans performed? Is the alpha parameter allowed to vary along the genome? How can it be inferred in a "local" manner?
I agree with reviewer 2 that a detailed assessment of the math in the supplementary material would require much time. However, a relatively simple and efficient check can be made for such complex models: simulating data under the exact inference model (that is, under the SMbetaC model). The maximum likelihood theorem stipulates that the parameter inference should be unbiased under such conditions. Simulating data under the "real" process, as the authors perform, is of greater practical importance. Still, simulations under the inference model offer insurance that the model is correctly implemented, and I encourage the authors to verify this.

2) I did not understand why the authors looked at the "classic" LD, and as pointed out by reviewer 2, the discussion on the Markovian hypothesis is unclear, if not inaccurate. First, the Markovian assumption is also violated under the Kingman coalescent; this is not specific to the beta-coalescent. Furthermore, while the SMC captures some kind of LD (so-called topological LD), how it relates to the more classic notion of LD based on haplotype frequencies is not straightforward. As the manuscript is already dense, I suggest removing this part and focusing on the topological LD (the transition matrix).

Minor:

l29: I am not sure how common knowledge the "survivorship types" are. Maybe a reference could be added?

l50: Haploid organisms: could a few sentences be added to indicate the main differences with a diploid model? It is discussed in the "Discussion" part, but I feel some information for the non-expert would be helpful here.

l162 "All SMC approaches used in this manuscript are found in the R package eSMC2.": as I understand this sentence, the authors have reimplemented the MSMC model. Is that so? l263 (also l283), the authors say that they used MSMC and MSMC2. If the authors do not mean the original software, they should state it clearly. In such a case, they should also indicate how the implementation differs from the original in terms of parametrisation, estimation procedure, etc.

Fig1: I agree with reviewer 1 that Figure 1 is not informative. It isn't easy to guess what the various graphs, dots, squares and curves represent.

l171: number OF coalescence trees. As THE batch size is fixed.

l212: can msprime simulate selection? How exactly?

l291: I agree with reviewer 2's comment and suggestion on the scaling. Furthermore, some references should be added.

Fig2: I did not get why PSMC is mentioned here (and, unless I am mistaken, only here).

l328: It does not seem to me that the GNNcoal approach exhibits "high accuracy" in the case where alpha = 1.3

FigS4-S7: the figure titles should state the demographic scenario (currently, all figures share the same title). Furthermore, in the case of population expansion/collapse, the population size change falls out of the resolution of the inference model so that it only infers constant population sizes in several cases. For alpha = 1.7 and 1.9, a more ancient size shift should be considered (Figures S5-7).

Fig4: what are the light grey lines?

Fig5: I think this figure might be easier to read (notably to compare the panels) if the y-axis represented (relative) errors ((estimated value - true value)/(true value)).

l363: it seems that the wrong figures are mentioned here.

l463: In practice, we will never get the true ARG, so this does not constitute an advantage of the GNNcoal. Maybe this should be rephrased as a perspective, like "as ARG inference methods improve, GNN models will offer a promising alternative to..."
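
The relative error suggested for Fig5 in the minor points above can be written as a one-line helper. This is only a sketch of the formula given in the comment; the function name and the example values are hypothetical and do not come from the manuscript.

```python
def relative_error(estimated: float, true: float) -> float:
    """Signed relative error: (estimated value - true value) / (true value)."""
    if true == 0:
        raise ValueError("relative error is undefined for a true value of 0")
    return (estimated - true) / true


if __name__ == "__main__":
    # Hypothetical example: a true alpha of 1.5 estimated as 1.2.
    print(relative_error(1.2, 1.5))
```

A signed error keeps the direction of the bias visible, so panels with different true values can share one y-axis.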

https://doi.org/10.24072/pci.evolbiol.100699.d1

Reviewed by anonymous reviewer 1, 30 Aug 2023

Download the review https://doi.org/10.24072/pci.evolbiol.100699.rev11

Reviewed by anonymous reviewer 2, 28 Sep 2023

This preprint describes two new methods to estimate evolutionary parameters (coalescence rates and a parameter alpha describing multiple merger rates) from sequence data. The methods address the impressively hard problem of demographic inference in the presence of multiple-merger coalescent dynamics which is certainly novel. While I must admit that I could not go through the two supplementary texts in the necessary detail to fully review it (they are too extensive for me and a review of them is simply beyond my time budget), I see no reason to doubt the authors' expertise and suggest to then rely on community review after publication.

I have a number of comments on the main article and supplementary Figures which hopefully help improving the clarity of the paper or possibly point to some gaps in the story that need to be filled before recommendation:

1) L 248ff: With the GNN method, I did not understand why the smoothing of the inferred demography from the GNN happens after the inference. It appears to me as if regularization should be built in right into the inference method. For example, why not infer B-splines, or if that is too hard, put penalties on large jumps between the piecewise constant rates in the model?

2) Figure 2: I am quite confused about the "scaling discrepancy between the Kingman and beta-coalescent" (L 289f), as seen in the figure. In the figure, it looks like the notion of "population size" in the beta-coalescent is something that is between 2 and 3 orders of magnitude below what is called a "population size" in the Kingman-coalescent. Surely this cannot then mean the same concept? I don't know beta-coalescent theory well, but I suppose whatever is described there cannot be interpreted as a "population size" in the same sense as in the Kingman coalescent.

Maybe I am overlooking something, but I think if this is really just some artifact in the definitions of rates, they should simply always be shown in their "corrected version" in the main text. At the very minimum, I suggest to replace Figure 2 by Supplementary Figure S1. But even better would be a good explanation, or perhaps general synchronisation, of the 100-1000-fold difference in the concept of "population size" between the two models.

3) Figure 3 and text describing it: I think the authors made a confusing choice for Figure 3 to show different x-axis scales. The three plots all look the same, but have different scales, so the difference is hard to see. I suggest to use the same scale, so the reader can appreciate the difference.

4) Related to point above in Figure 3: I don't quite understand whether the shown LD decay for lower values of alpha is really qualitatively different from the Kingman-coalescent. I believe the authors when they say that multiple mergers lead to long-range effects, but on the Figures, it doesn't look qualitatively different, it just looks quantitatively different. Where can I see the "qualitative"? Why does a longer LD decay necessarily demonstrate "violation of the Markovian hypothesis"? I think this both needs to be explained better, and it needs to be shown more convincingly.

5) Fig S2 and S3: The authors show these residual matrices of the observed vs. theoretical transition matrices. This is in principle nice, but after all leaves me a bit puzzled about what I'm supposed to see. The authors point out the fact that S2 looks more random, while S3 looks more structured, but I don't get why the seeming randomness in S2 should be interpreted such that the matrix is "well approximated" (L 313), nor do I get why the patterns in S3 should be interpreted such that there are "significant differences" between observed and predicted (L 316f). It seems to me that whether or not the residuals are structured is somewhat orthogonal to the question of whether the differences are significant. In particular, they live on the same color scales.

6) By the way, the tick marks in the color legends of Figures S2 and S3 have an error as far as I can see. The topmost tick marks should be "[10^{-7}, 1[", and not "[10^{-10}, 1[", right?

7) Fig S6 and S7: Why did you choose the timing of the expansion or contraction of the population size to be so recent? It seems that for most chosen alpha values, the inference is far away from the "interesting" time period.

8) L 347ff: I was confused by the text here, which I understood in such a way that the GNN was run on a downsampled dataset to sample size of three, then pointing at Figures S4-S7. But in those figures, the figure legend indicates that the full 10 haplotypes were used. Is this just a typo, or did I misunderstand something?

10) L 384f: "In contrast, SMbC produces better inferences of alpha ..." -> better than what?

11) Figure S1, caption: The math seems a bit garbled to me, with single brackets as superscripts and such.
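
Comment 1) above suggests penalising large jumps between the piecewise constant rates instead of smoothing after inference. A minimal sketch of such a penalty is given below. It is not the authors' GNNcoal implementation: the squared difference of log population sizes, the `strength` parameter, and all function names are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def jump_penalty(sizes: Sequence[float], strength: float) -> float:
    """Penalty on jumps between adjacent time windows.

    Assumption: squared differences of log sizes, scaled by `strength`.
    """
    logs = [math.log(n) for n in sizes]
    return strength * sum((b - a) ** 2 for a, b in zip(logs, logs[1:]))


def penalised_loss(data_loss: float, sizes: Sequence[float], strength: float) -> float:
    """Add the jump penalty to a data-fit loss computed elsewhere."""
    return data_loss + jump_penalty(sizes, strength)


if __name__ == "__main__":
    # Hypothetical piecewise constant population sizes over four time windows.
    flat = [1000.0, 1000.0, 1000.0, 1000.0]
    jumpy = [1000.0, 10000.0, 500.0, 1000.0]
    print(jump_penalty(flat, 1.0), jump_penalty(jumpy, 1.0))
```

Working on the log scale makes a tenfold increase and a tenfold decrease cost the same, which seems a natural choice when sizes span orders of magnitude.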

https://doi.org/10.24072/pci.evolbiol.100699.rev12
