Fitting diversification models on undated or partially dated trees
Probabilities of tree topologies with temporal constraints and diversification shifts
Recommendation: posted 09 January 2020, validated 10 January 2020
Lartillot, N. (2020) Fitting diversification models on undated or partially dated trees. Peer Community in Evolutionary Biology, 100088. 10.24072/pci.evolbiol.100088
Phylogenetic trees can be used to extract information about the process of diversification that has generated them. The most common approach to conduct this inference is to rely on a likelihood, defined here as the probability of generating a dated tree T given a diversification model (e.g. a birth-death model), and then use standard maximum likelihood. This idea has been explored extensively in the context of the so-called diversification studies, with many variants for the models and for the questions being asked (diversification rates shifting at certain time points or in the ancestors of particular subclades, trait-dependent diversification rates, etc).
However, all this assumes that the dated tree T is known without error. In practice, trees (that is, both the tree topology and the divergence times) are inferred based on DNA sequences, possibly combined with fossil information for calibrating and informing the divergence times. Molecular dating is a delicate exercise, however, and much more so in fact than reconstructing the tree topology. In particular, a mis-specificied model for the relaxed molecular clock, or a mis-specifiied prior, can have a substantial impact on the estimation of divergence dates - which in turn could severely mislead the inference about the underlying diversification process. This thus raises the following question: would that be possible to conduct inference and testing of diversification models without having to go through the dangerous step of molecular dating?
In his article ""Probabilities of tree topologies with temporal constraints and diversification shifts"" , Gilles Didier introduces a recursive method for computing the probability of a tree topology under some diversification model of interest, without knowledge of the exact dates, but only interval constraints on the dates of some of the nodes of the tree. Such interval constraints, which are derived from fossil knowledge, are typically used for molecular dating: they provide the calibrations for the relaxed clock analysis. Thus, what is essentially proposed by Gilles Didier is to use them in combination with the tree topology only, thus bypassing the need to estimates divergence times first, before fitting a diversification model to a phylogenetic tree.
This article, which is primarily a mathematical and algorithmic contribution, is then complemented with several applications: testing for a diversification shift in a given subclade of the phylogeny, just based on the (undated) tree topology, with interval constraints on some of its internal nodes; but also, computing the age distribution of each node and sampling on the joint distribution on node ages, conditional on the interval constraints. The test for the presence of a diversification shift is particularly interesting: an application to simulated data (and without any interval constraint in that case) suggests that the method based on the undated tree performs about as well as the classical method based on a dated tree, and this, even granting the classical approach a perfect knowledge of the dates - given that, in practice, one in fact relies on potentially biased estimates. Finally, an application to a well-known example (rate shifts in cetacean phylogeny) is presented.
This article thus represents a particularly meaningful contribution to the methodology for diversification studies; but also, for molecular dating itself: it is a well known problem in molecular dating that computing and sampling from the conditional distributions on node ages, given fossil constraints, and more generally understanding and visualizing how interval constraints on some nodes of the tree impact the distribution at other nodes, is a particularly difficult exercise. For that reason, the algorithmic routines presented in the present article will be useful in this context as well.
 Didier, G. (2020) Probabilities of tree topologies with temporal constraints and diversification shifts. bioRxiv, 376756, ver. 4 peer-reviewed and recommended by PCI Evolutionary Biology. doi: 10.1101/376756
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.
Evaluation round #2
DOI or URL of the preprint: https://doi.org/10.1101/376756
Author's Reply, 19 Dec 2019
Decision by Nicolas Lartillot, posted 23 Nov 2019
Your revised manuscript has been reviewed by Pr. Amaury Lambert. As you will see, only minor points remain to be fixed: if you could just have a look at them. In particular, I agree that it would be important to refer to the alternative method proposed by Amaury Lambert directly in the main text, referring to his review. The reviewing process is public, and thus it is probably a good thing to refer the Readers to it directly from the manuscript, so as to invite them to read it and compare the two algorithms.
We are very close to final acceptance. Once you have submitted your final version, I will proceed with the recommendation.
witth best regards,
Reviewed by Amaury Lambert, 19 Nov 2019
Evaluation round #1
DOI or URL of the preprint: https://doi.org/10.1101/376756
Author's Reply, 04 Sep 2019
Decision by Nicolas Lartillot, posted 24 Apr 2019
Dear Gilles Didier,
There is a general consensus among the reviewers that this manuscript represents an important contribution in the field of diversification studies. The algorithmic and computational results are potentially useful, and their derivation is tight and rigorous.
On the other hand, there is also a general feeling that, as it stands, the manuscript is very technical and does not sufficiently emphasize the intuitions behind the mathematical developments or the potential applications to specific research questions in diversification studies. In the end, there is a legitimate concern that this highly technical presentation will make the manuscript not accessible to most readers of the targeted audience and will not do justice to the practical significance of the work.
The reviewers have made several suggestions to improve the overall presentation and make it less arduous, among which:
- getting rid of the combinatorial factors related to the labelling of the tree, by labelling it from the start;
- doing the recursion only in terms of the constraints on node ages, leaving the piece-wise constant aspect of the model hidden in the details -- in fact, the whole derivation could even be conducted under a homogeneous birth-death, then just suggesting that the calculation could be generalized to arbitrary piecewise constant. or even other time-varying, versions of the process, without major modifications.
- using both simpler and more explicit notations;
- relying a graphical example for explaining the intuition behind the quadratic recursive algorithm (e.g. continuing on the example given in figure 3).
I agree with those suggestions. I would even go further, and suggest a different way to organize the manuscript: in the main text, a more general and more intuitive description of the main algorithmic ideas could be given, relying more heavily on a graphical example such as the one given in figure 3, and leaving all technical aspects of the derivation (much of the current main text) in an appendix. Then, as suggested by one of the reviewers, more emphasis could be put on the applications. This would give the reader with two options: either a fast track (to get the general idea and appreciate the significance of the work in terms of its potential applications), or the complete story, for the more theoretically inclined readers.
The english also needs improvement.
Of note: one of the reviewers point out an alternative integration method, which might have a better complexity as a function of tree size. This should probably be examined and discussed.
Concerning the application to testing for diversification shifts, I would have some additional comments:
(1) in practice, the shift time is not known, but one may have good fossil data giving an upper and/or lower bound for the age of the last common ancestor of the subclade. Similarly, the time of origin of the entire clade is not known either, but some interval constraint derived from fossil information might be available concerning the age of the root. I was wondering if the test could be designed so as to rely on this practically more relevant fossil information instead of relying on the knowledge of the shift time (and of the time of origin, which is fixed and assumed known, right?).
(2) comparing LambdaN with LambdaP is theoretically interesting, but not so useful in practice (since exact knowledge of divergence times, such as assumed by LambdaP, is lacking). In real-world applications, one would instead want to compare LambdaN with a plug-in version of LambdaP relying on an explicit dating of the tree obtained using relaxed clock approaches. In this context, a key question is whether LambdaN shows more robustness, without loosing so much in sensitivity. This point could be discussed.
(3) ideally, a empirical example could be presented based on a previously published case (this relates to the suggestion of one of the reviewers, to put more emphasis on the applications).
Additional requirements of the managing board:
As indicated in the 'How does it work?’ section and in the code of conduct, please make sure that (if adequate):
-Data are available to readers, either in the text or through an open data repository such as Zenodo (free), Dryad (to pay) or some other institutional repository. Data must be reusable, thus metadata or accompanying text must carefully describe the data.
-Details on quantitative analyses (e.g., data treatment and statistical scripts in R, bioinformatic pipeline scripts, etc.) and details concerning simulations (scripts, codes) are available to readers in the text, as appendices, or through an open data repository, such as Zenodo, Dryad or some other institutional repository. The scripts or codes must be carefully described so that they can be reused.
-Details on experimental procedures are available to readers in the text or as appendices.
-Authors have no financial conflict of interest relating to the article. The article must contain a "Conflict of interest disclosure" paragraph before the reference section containing this sentence: "The authors of this preprint declare that they have no financial conflict of interest with the content of this article." If appropriate, this disclosure may be completed by a sentence indicating that some of the authors are PCI recommenders: “XXX is one of the PCI XXX recommenders.”
All the best. The Managing Board of PCI Evol Biol.