Recommendation

Dating nodes in a phylogeny using inferred horizontal gene transfers

Tatiana Giraud and Toni Gabaldon based on reviews by Alexandros Stamatakis, Mukul Bansal and 2 anonymous reviewers

A recommendation of:

MaxTiC: Fast ranking of a phylogenetic tree by Maximum Time Consistency with lateral gene transfers

Cédric Chauve, Akbar Rafiey, Adrian A. Davin, Celine Scornavacca, Philippe Veber, Bastien Boussau, Gergely J Szöllősi, Vincent Daubin, and Eric Tannier (2017), bioRxiv, 127548, ver. 6 peer-reviewed and recommended by Peer Community in Evolutionary Biology. doi: 10.1101/127548

Abstract

MaxTiC: Fast ranking of a phylogenetic tree by Maximum Time Consistency with lateral gene transfers

Lateral gene transfers (LGTs) between ancient species contain information about the relative timing of species diversification. Specifically, the ancestors of a donor species must have existed before the descendants of the recipient species. Hence, the detection of an LGT event can be translated into a time constraint between nodes of a phylogeny if donors and recipients can be identified. When a set of LGTs is detected by interpreting the phylogenetic discordance between gene trees and a species tree, the set of all deduced time constraints can be used to totally order the internal nodes and thus produce a ranked tree. Unfortunately, LGT detection is still very challenging and current methods produce a significant proportion of false positives. As a result, the set of time constraints is not always compatible with a ranked species tree. We propose an optimization method, which we call MaxTiC (Maximum Time Consistency), for obtaining a ranked species tree compatible with a maximum number of time constraints. The problem in general inherits NP-completeness from feedback arc sets. However, we give an exact polynomial time method based on dynamic programming to compute an optimal ranked binary tree supposing that its two children are ranked. We turn this principle into a heuristic to solve the general problem and test it on simulated datasets. Under a wide range of conditions, which we compare to biological datasets, the obtained ranked tree is very close to the real one, confirming the theoretical possibility of dating in the history of life with transfers by maximizing time consistency. MaxTiC is available within the ALE package: https://github.com/ssolo/ALE/tree/master/misc.

dating; simulations; phylogeny; optimization; transfer; reconciliation; ultrametric tree; algorithmics;

Submission: posted 28 June 2017
Recommendation: posted 07 November 2017, validated 07 November 2017

Cite this recommendation as:
Giraud, T. and Gabaldon, T. (2017) Dating nodes in a phylogeny using inferred horizontal gene transfers. Peer Community in Evolutionary Biology, 100037. https://doi.org/10.24072/pci.evolbiol.100037

Recommendation

Dating nodes in a phylogeny is an important problem in evolution and is typically performed using molecular clocks and fossil age estimates [1]. The manuscript by Chauve et al. [2] reports a novel method that uses lateral gene transfers to help order nodes in a species tree. The idea is that a lateral gene transfer can only occur between two species living at the same time, which indirectly informs the relative ages of nodes in a phylogeny: the donor species cannot be more recent than the recipient species. Horizontal gene transfers are increasingly recognized as frequent, even in eukaryotes, and especially in micro-organisms that have a poor fossil record [3-7]. Yet such an important source of information has so far very rarely been used to infer relative node ages in phylogenies. In this context, the method by Chauve et al. [2] represents an innovative and original approach to a difficult problem. An obvious limitation of the approach is that it relies on inferences of horizontal transfers, whose detection is in itself a difficult problem. Incomplete taxon sampling or the extinction of the true donor lineage may render patterns difficult to interpret in temporal terms. Yet, for clades with no fossils, this may be the only piece of information we have at hand, and the growing amount of sequence data is likely to minimize issues derived from incomplete sampling.

The developed method, MaxTiC (for Maximum Time Consistency) [2], represents a very nice application of theoretical developments on the well-known "Feedback Arc Set" computer science problem to the evolutionary question of ordering nodes in a phylogeny. MaxTiC takes as input a species tree and a set of time constraints based on lateral gene transfers inferred using other software, and minimizes conflicts between the node ordering and these time constraints. The application of MaxTiC to simulated datasets indicated that node ordering was fairly accurate [2]. MaxTiC is implemented in freely available software and represents an original and relevant contribution to the field of evolutionary biology.
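
To make the objective concrete, the following minimal Python sketch scores a candidate ranking against a set of weighted time constraints. It is an illustration only, not the MaxTiC implementation: the function names, the `(older, younger, weight)` tuple layout and the `parent` mapping are assumptions introduced here.

```python
def is_valid_ranking(ranking, parent):
    """Check that every internal node appears after its parent.

    ranking: list of internal node names, oldest first (hypothetical layout).
    parent:  dict mapping each non-root internal node to its parent node.
    """
    position = {node: i for i, node in enumerate(ranking)}
    return all(position[parent[n]] < position[n] for n in ranking if n in parent)


def conflict_weight(ranking, constraints):
    """Total weight of time constraints contradicted by the ranking.

    constraints: iterable of (older, younger, weight) triples, each meaning
    "node `older` must precede node `younger`", as deduced from one transfer.
    """
    position = {node: i for i, node in enumerate(ranking)}
    return sum(w for older, younger, w in constraints
               if position[older] > position[younger])


# Two contradictory constraints: any ranking must violate one of them.
ranking = ["root", "A", "B"]
constraints = [("A", "B", 1.0), ("B", "A", 0.5)]
print(conflict_weight(ranking, constraints))
```

Maximizing the weight of compatible constraints is the same as minimizing this conflict weight, since the two always sum to the total weight of the constraints.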

References

[1] Donoghue P and Smith M, editors. 2003. Telling the evolutionary time. CRC Press.

[2] Chauve C, Rafiey A, Davin AA, Scornavacca C, Veber P, Boussau B, Szöllősi GJ, Daubin V and Tannier E. 2017. MaxTiC: Fast ranking of a phylogenetic tree by Maximum Time Consistency with lateral gene transfers. bioRxiv 127548, ver. 6 of 6th November 2017. doi: 10.1101/127548

[3] Ropars J, Rodríguez de la Vega RC, Lopez-Villavicencio M, Gouzy J, Sallet E, Debuchy R, Dupont J, Branca A and Giraud T. 2015. Adaptive horizontal gene transfers between multiple cheese-associated fungi. Current Biology 25, 2562–2569. doi: 10.1016/j.cub.2015.08.025

[4] Novo M, Bigey F, Beyne E, Galeote V, Gavory F, Mallet S, Cambon B, Legras JL, Wincker P, Casaregola S and Dequin S. 2009. Eukaryote-to-eukaryote gene transfer events revealed by the genome sequence of the wine yeast Saccharomyces cerevisiae EC1118. Proceedings of the National Academy of Sciences USA 106, 16333–16338. doi: 10.1073/pnas.0904673106

[5] Naranjo-Ortíz MA, Brock M, Brunke S, Hube B, Marcet-Houben M, Gabaldón T. 2016. Widespread inter- and intra-domain horizontal gene transfer of d-amino acid metabolism enzymes in Eukaryotes. Frontiers in Microbiology 7, 2001. doi: 10.3389/fmicb.2016.02001

[6] Alexander WG, Wisecaver JH, Rokas A, Hittinger CT. 2016. Horizontally acquired genes in early-diverging pathogenic fungi enable the use of host nucleosides and nucleotides. Proceedings of the National Academy of Sciences USA 113, 4116–4121. doi: 10.1073/pnas.1517242113

[7] Marcet-Houben M, Gabaldón T. 2010. Acquisition of prokaryotic genes by fungal genomes. Trends in Genetics. 26, 5–8. doi: 10.1016/j.tig.2009.11.007

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Evaluation round #2

DOI or URL of the preprint: 10.1101/127548

Version of the preprint: 3

Author's Reply, 02 Nov 2017

Download author's reply https://doi.org/10.24072/pci.evolbiol.100083.ar2

Decision by Tatiana Giraud, posted 27 Oct 2017

I was pleased to see that this manuscript has been further carefully revised. Only a few minor additional suggestions remain to be addressed before the manuscript can be recommended by PCI.

https://doi.org/10.24072/pci.evolbiol.100083.d2

Reviewed by Mukul Bansal, 18 Oct 2017

The authors have addressed my major concerns and the updated manuscript is a clear improvement over the initial submission. The manuscript now provides an improved description of the heuristic algorithm and of the experimental analysis. The work is definitely interesting, and the proposed method has the potential to be quite useful for species tree dating in prokaryotes. I only have a single minor comment, which the authors can address as they see fit: The new text added to the manuscript has more grammatical errors than the original text from the initial submission. Carefully proofreading the newly added text would help.

https://doi.org/10.24072/pci.evolbiol.100083.rev21

Reviewed by anonymous reviewer 2, 07 Oct 2017

This paper now seems to be in mostly good shape, following the previous reviews and the revisions the authors have made in response to those earlier comments (I was not one of the previous reviewers). The idea of developing algorithms to rank nodes in trees using transfer events is a timely one, and the two algorithms described for minimizing conflicts (one heuristic and one exact) appear to be both new and sound. Accordingly, I am happy to recommend publication; however, I have a few suggestions that will be easy for the authors to address.

The authors should cite and briefly discuss this paper: "A Method for Investigating Relative Timing Information on Phylogenetic Trees", Daniel Ford, Frederick A. Matsen and Tanja Stadler, Systematic Biology 58 (2): 167-183, 2009. While it doesn't directly deal with transfers, nevertheless the ideas in it are very relevant to this paper.
[optional] In the proof of Theorem 1, the authors could point out that the choice of a comb tree (line 3) is entirely arbitrary. Also, with slightly more work (and more "dummy" leaves o_i) one could also even ensure that each node is associated with just one arrow (unlike fig 3(b) where some nodes are associated with 2 and 3 arrows).
page 11 "Figure 6" - in my version I see the caption but no figure (?!)
Proof of Theorem 2. I'd suggest starting it with {\em Proof}
Maybe also flag at the start that the algorithm is based on dynamic programming techniques.
Then replace "Indeed, call, ...., the sequence" -> "Let.... denote the sequence"
line 4 of proof: "note $CN$ the set" -> "let $CN$ denote the set"
Figure 4 - make the arrows on the end of the transfer arrows bigger
next para: "Note $N{ij}=... Let then" -> "Let $N{ij}= .., and let"
page 9 - may put the usual square box for \end{proof} just before the para "Applying the mixing..."
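
The dynamic programming idea referred to in these comments can be illustrated with a small Python sketch. It merges two already ranked node sequences while preserving the order within each, and minimizes the weight of violated constraints between them. This is a sketch under assumptions, not the authors' code: the function name `mix` and the `(older, younger, weight)` constraint layout are hypothetical, and this naive version recomputes sums where the preprint may use precomputed tables.

```python
def mix(xs, ys, constraints):
    """Merge two ranked lists (oldest first), keeping each list's own order,
    so that the total weight of violated cross constraints is minimal.

    constraints: iterable of (older, younger, weight) triples.
    Returns (minimal conflict weight, merged ranking).
    """
    weight = {}
    for older, younger, w in constraints:
        weight[(older, younger)] = weight.get((older, younger), 0.0) + w

    p, q = len(xs), len(ys)
    inf = float("inf")
    # cost[i][j]: best conflict weight for merging xs[:i] with ys[:j]
    cost = [[inf] * (q + 1) for _ in range(p + 1)]
    choice = [[None] * (q + 1) for _ in range(p + 1)]
    cost[0][0] = 0.0
    for i in range(p + 1):
        for j in range(q + 1):
            if i > 0:
                # xs[i-1] is placed after ys[:j]: every constraint saying
                # "xs[i-1] is older than y" for y in ys[:j] is violated.
                c = cost[i - 1][j] + sum(weight.get((xs[i - 1], y), 0.0) for y in ys[:j])
                if c < cost[i][j]:
                    cost[i][j], choice[i][j] = c, "x"
            if j > 0:
                # Symmetric case: ys[j-1] is placed after xs[:i].
                c = cost[i][j - 1] + sum(weight.get((ys[j - 1], x), 0.0) for x in xs[:i])
                if c < cost[i][j]:
                    cost[i][j], choice[i][j] = c, "y"

    # Trace back the choices to recover the merged ranking.
    merged, i, j = [], p, q
    while i > 0 or j > 0:
        if choice[i][j] == "x":
            merged.append(xs[i - 1])
            i -= 1
        else:
            merged.append(ys[j - 1])
            j -= 1
    merged.reverse()
    return cost[p][q], merged


print(mix(["A", "C"], ["B"], [("B", "A", 2.0), ("C", "B", 1.0)]))
```

Only constraints between the two sequences are scored, because the status of constraints inside each sequence is already fixed by the given orders. The result is therefore optimal only among rankings that preserve those two orders.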

https://doi.org/10.24072/pci.evolbiol.100083.rev22

Evaluation round #1

DOI or URL of the preprint: 10.1101/127548

Version of the preprint: 2

Author's Reply, 28 Sep 2017

Download author's reply https://doi.org/10.24072/pci.evolbiol.100083.ar1

Decision by Tatiana Giraud, posted 09 Aug 2017

The manuscript has been evaluated by two referees, who agree that this method, which uses lateral gene transfers to help find the temporal ordering (or ranking) of nodes in a given species tree, is sound and should be of interest to scientists in evolutionary biology. The referees nevertheless raise concerns about the possible target journal and about a lack of detail and clarity, and they suggest some improvements. I have to agree that, as it stands, the manuscript may not be readable for most biologists who could use this interesting method, which could prevent wide use of the software. I would therefore recommend writing the abstract and introduction for a broader audience and explaining the method more intuitively there. The conclusion does a better job in this regard than the abstract, but could still be improved. To sum up, there is potential for an interesting and relevant contribution to the field of evolutionary biology. However, the paper needs careful revision along the lines above. If you are able to accommodate these points, I would encourage resubmission to PCI Evol Biol.

https://doi.org/10.24072/pci.evolbiol.100083.d1

Reviewed by Alexandros Stamatakis, 14 Jul 2017

The authors present a very nice application of theoretical computer science results (the feedback arc problem) to a real biological problem. They develop a heuristic for minimizing the number of conflicts between a ranked order of nodes in the species tree and corresponding time constraints as obtained by programs for detecting lateral gene transfer.

The paper is overall nicely written and in general I would recommend acceptance as a reviewer. However, it is unclear for which journal this would be appropriate. The algorithms and theory are not described in sufficient detail (see some comments below) to merit publication in a more theoretical CS-style journal (like Journal of Theoretical Biology or BMC Algorithms for Molecular Biology). In addition, there is too much algorithmic content and not enough biology for a journal like Syst Bio or MBE. So, I believe, the options here are either to make it more biological by moving most of the algorithmic material to an on-line supplement and analyzing some recently published high-profile biological datasets, or to describe the algorithms in more detail and opt for a more theoretical journal.

Detailed comments:

The link to the github repo with the python scripts is insufficient for reproducing the results. The authors should describe in detail how APE etc. needs to be installed, how the python scripts were executed, where the simulated datasets can be downloaded etc. etc., i.e. a full transcript that allows for easily reproducing the results must be put together.

While I did not do this here, I usually also check the software that was developed with various tools (e.g., for C/C++ compiling with clang and all warnings enabled, checking with valgrind, checking for cyclomatic complexity etc etc.) to obtain a feeling for the respective code quality.

page 3: The authors should provide a more extended rationale regarding the simulation settings with SimPhy (why 1000 gene trees, why pop size between 2 and 10^6, why a transfer rate from 10^-9 to 10^-6, etc.).

page 5: the proof and algorithm description need at least 2-3 additional figures that would make everything much easier to follow, e.g., Theorem 1 needs a figure, the mixing principle needs a figure, the dynamic programming algorithm needs a figure.

page 5: the log n approximation should be mentioned earlier in the sentence where you mention that there is no constant factor approximation.

page 6: For the sake of completeness: provide (i) time and space complexities (ii) pseudocode of the algorithm

page 6: The description of the local search is a bit fuzzy and incomplete, e.g., I don't understand when it terminates and how exactly it works, apart from the fact that it apparently does some sort of randomized search.

page 7: would it be possible to design a program that solves the problem exhaustively on small instances and use it on some empirical dataset, e.g., the small yeast genome dataset from Antonis Rokas?

As already stated above, I believe that this manuscript could become more interesting to the user community if you showed that the method produces "interesting" results on some recently published phylogenomic studies.

page 10: Why did you fix the transfer rate to 10^-6 for assessing uncertainties in the species tree?
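
As an illustration of the exhaustive check suggested in the page 7 comment above, a brute-force solver can enumerate every ranking consistent with the species tree and keep the one with the least conflict. The Python sketch below is only practical for very small trees, since the number of rankings grows very quickly with tree size. The `children` mapping (internal nodes only, leaves omitted) and all function names are assumptions made for this example, not part of the reviewed software.

```python
def rankings(children, root):
    """Yield every ranking (oldest first) of the internal nodes in which
    each node appears after its parent.

    children: dict mapping an internal node to its internal-node children.
    """
    def extend(prefix, frontier):
        if not frontier:
            yield list(prefix)
            return
        for node in sorted(frontier):
            # Once a node is ranked, its children become available.
            new_frontier = (frontier - {node}) | set(children.get(node, ()))
            prefix.append(node)
            yield from extend(prefix, new_frontier)
            prefix.pop()

    yield from extend([], {root})


def conflict_weight(ranking, constraints):
    """Total weight of (older, younger, weight) constraints the ranking violates."""
    position = {node: i for i, node in enumerate(ranking)}
    return sum(w for older, younger, w in constraints
               if position[older] > position[younger])


def exhaustive_best(children, root, constraints):
    """Return (minimal conflict weight, ranking) by trying every valid ranking."""
    return min((conflict_weight(r, constraints), r) for r in rankings(children, root))


# Toy tree: root has internal children A and B; A has internal child C.
children = {"root": ["A", "B"], "A": ["C"]}
constraints = [("B", "A", 2.0), ("C", "B", 1.0)]
print(exhaustive_best(children, "root", constraints))
```

On instances this small, the output of a heuristic can be compared directly with the exhaustive optimum.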

https://doi.org/10.24072/pci.evolbiol.100083.rev11

Reviewed by anonymous reviewer 1, 08 Aug 2017

This paper introduces a technique for using lateral gene transfers (LGTs) to estimate the temporal ordering (or ranking) of nodes in a given species tree. The technique is based on the idea that any correctly inferred LGT must be compatible with the true ranking of the species tree (i.e. donor species could not have lived more recently than the recipient species). The paper proposes a heuristic algorithm that takes as input an unranked species tree and a weighted list of LGTs, inferred using existing methods, and computes a ranking of the species nodes that is compatible with a maximum weight subset of the LGTs. An experimental study using simulated data suggests that the objective of seeking a ranking of the species nodes that is compatible with a maximum weight subset of the LGTs is generally reasonable, even though the true ranking often does not maximize the weight of compatible LGTs. The experiments also show that the heuristic algorithm generally produces fairly accurate rankings.

Some aspects of the algorithm description and experimental setup can be improved as follows.

a. The paper vaguely suggests, but does not prove, that the proposed heuristic algorithm is a log n-approximation algorithm for the maximum compatibility problem. Since the species tree can be unbalanced, it is not clear if this is the case. This should be clarified in the text.

b. The description of the “mixing” problem in the abstract and in section 3 is confusing. It should be clarified that the mixing step only solves the constrained problem where the given orders for the two subproblems are preserved. The current description suggests that an optimal ranking is computed, which is not the case.

c. The experimental study is interesting and informative but uses an overly simplified model of evolution. The paper also claims that the data was generated “under conditions comparable to published biological datasets”, but this is not correct. In simulating the gene trees, no gene duplications or gene losses are allowed. This makes the simulation study a bit unrealistic. There should at least be a reasonable loss rate used (approximately equal to the LGT rate), even if gene duplications are not allowed.

d. To properly understand normalized Kendall similarity, it would help to include the average normalized Kendall similarity for a random ranking of the nodes in the species tree. It looks like Figure 8 might include this information, but the description is confusing. This information should be included in the main text and the description of Figure 8 should also be clarified.

e. The authors investigate the relationship between the number of input LGTs and the accuracy of the ranking. However, from the perspective of an end user, it would still be difficult to determine if the input set of LGTs is sufficient to confidently rank the entire species tree. Is it possible to extend the heuristic algorithm to only output the portions of the ranking that are well-supported by the input LGTs?
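
The random baseline requested in comment d can be sketched in Python as follows. The normalization used here, one minus the fraction of discordant node pairs, is an assumption and may differ from the definition in the preprint. The random rankings are drawn by repeatedly picking an available node at random, which respects the tree but is not a uniform sample over valid rankings. All names are hypothetical.

```python
import random
from itertools import combinations


def kendall_similarity(rank_a, rank_b):
    """1 minus the fraction of node pairs ordered differently in the two rankings."""
    pos_a = {n: i for i, n in enumerate(rank_a)}
    pos_b = {n: i for i, n in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    if not pairs:
        return 1.0
    discordant = sum(1 for u, v in pairs
                     if (pos_a[u] - pos_a[v]) * (pos_b[u] - pos_b[v]) < 0)
    return 1.0 - discordant / len(pairs)


def random_ranking(children, root, rng):
    """Draw a ranking consistent with the tree (parents before children)."""
    frontier, ranking = [root], []
    while frontier:
        node = frontier.pop(rng.randrange(len(frontier)))
        ranking.append(node)
        frontier.extend(children.get(node, ()))
    return ranking


def random_baseline(reference, children, root, samples=1000, seed=0):
    """Average similarity between the reference and random valid rankings."""
    rng = random.Random(seed)
    total = sum(kendall_similarity(reference, random_ranking(children, root, rng))
                for _ in range(samples))
    return total / samples


children = {"root": ["A", "B"], "A": ["C"]}
reference = ["root", "B", "A", "C"]
print(kendall_similarity(reference, ["root", "A", "B", "C"]))
print(random_baseline(reference, children, "root"))
```

Because every valid ranking shares the parent-before-child pairs imposed by the tree, such a baseline generally sits above the 0.5 expected for unconstrained random permutations, which is why reporting it helps interpret the similarity values.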

https://doi.org/10.24072/pci.evolbiol.100083.rev12
