In this manuscript by Opazzo et al., the authors use homology searches to identify genes from the DAN gene family (Differential screening-selected gene Aberrant in Neuroblastoma) across chordate lineages. The phylogenetic relationships of these genes were inferred, and the topology of the resulting tree was used to describe the evolutionary history of the gene family.
Interestingly, the authors identify a new family member related to the Gremlin genes, which they dub Grem3. Next, in the gnathostome lineage, the authors show evidence that five genes were present in its MRCA and are widely retained across its descendants (e.g., the major gnathostome lineages listed in Figure 4). These genes are Grem1, Grem2, SOST, SOSTDC1, and NBL1. The authors also identify three gene family members that they conclude were likely present in the gnathostome ancestor but have been lost in some descendant lineages: Grem3, Cer1, and DAND5.
Overall, the manuscript is well-written and lays out its case fairly well, and for the most part I find the major arguments to be reasonable. However, there are several areas that I feel would benefit from the feedback described below.
The methods are insufficiently detailed to permit the work to be repeated
The authors do not define the pool of sequences from which query and subject sequences are drawn. The specific implementation of BLAST and its version are not cited. The filtering criteria used to determine whether hits are retained or discarded are not documented. The nature of the multiple alignment isn't described. How much of each gene was alignable at the greatest divergences? In the introduction, the authors claim that there is "low inter-parallog conservation", which suggests that the alignment may not be reliable in many regions. What was aligned: nucleotides or amino acids (I assume amino acids)?
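To illustrate the level of detail I have in mind for the hit-filtering step, here is a minimal sketch. Everything in it is hypothetical: the `Hit` fields, the function names, and the two thresholds are my own placeholders, not the authors' criteria or any particular BLAST output format. The point is only that each retained or discarded hit should follow from rules stated this explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homology-search hit (hypothetical fields, for illustration only)."""
    query: str
    subject: str
    evalue: float
    query_coverage: float  # fraction of the query covered by the alignment, 0..1


# Hypothetical thresholds; the manuscript does not report the values used.
MAX_EVALUE = 1e-5
MIN_QUERY_COVERAGE = 0.5


def keep_hit(hit, max_evalue=MAX_EVALUE, min_coverage=MIN_QUERY_COVERAGE):
    """Return True if the hit passes both stated criteria."""
    return hit.evalue <= max_evalue and hit.query_coverage >= min_coverage


def filter_hits(hits, max_evalue=MAX_EVALUE, min_coverage=MIN_QUERY_COVERAGE):
    """Keep only hits that satisfy the documented criteria, preserving order."""
    return [h for h in hits if keep_hit(h, max_evalue, min_coverage)]


if __name__ == "__main__":
    example_hits = [
        Hit("GREM1_query", "subject_a", 1e-30, 0.90),  # passes both criteria
        Hit("GREM1_query", "subject_b", 1e-3, 0.90),   # fails the e-value cutoff
        Hit("GREM1_query", "subject_c", 1e-20, 0.20),  # fails the coverage cutoff
    ]
    for hit in filter_hits(example_hits):
        print(hit.subject)
```

Reporting the equivalent of `MAX_EVALUE` and `MIN_QUERY_COVERAGE`, along with the database and software versions, would be enough for a reader to reproduce the candidate gene set.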
The results are fairly sparse on details
Display items aren't thoroughly described, and the captions are very terse. For example, there appears to be a convention in the synteny plots where the absence of a bar indicates the absence of the gene (e.g., CER1 in Spotted Gar in Figure 2B). However, in Figures 5 and 6, dotted lines apparently indicate missing DAN genes, whereas a missing bar for a flanking gene means that the gene isn't in the syntenic region. What is the scale in Figure 1? A bar with the number "0.7" is included, but the caption doesn't elaborate. I'm accustomed to bootstrap support being reported as numerator/denominator or explicitly in %. The numbers corresponding to bootstrap support in Figure 1 are just bare integers.
The authors often point out disagreements with the literature, which is commendable. However, little effort is made to reconcile these observed disagreements. I'd feel better if the authors would discuss the discrepancies they point out.
"Although the study of Walsh et al. (2010) supports..., two other studies report alternative topologies."
"Nolan et al. (2014) recovered NBL1 as sister... However, in support of our study Avsian-Kretchmer et al. (2004) recovered NBL1 as sister to the GREM lineages."
"However, in contrast to Petillon et al. (2013), we did not find..."
The claim of "recovering monophyly" is confusing to me.
"Our results recovered monophyly of all DAN gene family members"
My parsing of this statement in the abstract (and others like it throughout the manuscript) is probably not what the authors intended. To me, this sounds like "we confirmed that, as a group, all DAN genes are monophyletic". This doesn't make sense in an analysis where the recovery of a gene from Ensembl is viewed as conferring DAN membership on that gene. So, by definition, every gene in the analysis is DAN, and with no non-DAN genes for contrast, no determination about monophyly can be made.
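The point can be made concrete with a small sketch. It assumes a rooted tree written as nested tuples with string tips; the tip labels, the toy topology, and the function names are hypothetical and are not taken from the authors' Figure 1. A group is monophyletic if some clade contains exactly its members, so the set of all tips always passes trivially, and the test only becomes informative once a non-member sequence is in the tree.

```python
def leaves(tree):
    """Return the set of tip labels under a node (a tip is a plain string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the tip set of every clade, from the whole tree down to single tips."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some clade of the rooted tree contains exactly the tips in group."""
    target = frozenset(group)
    return any(clade == target for clade in clades(tree))


# Toy topology with hypothetical tip labels (not the authors' tree).
dan_only = (
    (("GREM1_sp1", "GREM1_sp2"), ("GREM2_sp1", "GREM2_sp2")),
    ("NBL1_sp1", "NBL1_sp2"),
)
all_dan = leaves(dan_only)

# With only DAN sequences in the tree, "all DAN genes are monophyletic"
# is true by construction: the group is simply the whole tree.
print(is_monophyletic(dan_only, all_dan))  # True

# The per-gene statement is a real test, because other tips provide contrast.
print(is_monophyletic(dan_only, {"GREM1_sp1", "GREM1_sp2"}))  # True
print(is_monophyletic(dan_only, {"GREM1_sp1", "NBL1_sp1"}))   # False

# Family-level monophyly only becomes testable with a non-DAN sequence present.
with_outgroup = (dan_only, "non_DAN_outgroup")
print(is_monophyletic(with_outgroup, all_dan))  # True
```

In the first call the result carries no information, which is my concern with the statement as written; the last call shows the kind of contrast that would be needed to support a family-level claim.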
While I can't confidently infer what the authors actually meant, perhaps the following is closer to their meaning:
"For each member of the gene family (e.g. CER1, SOST, SOSTDC1, DAND5, NBL1, GREM1, GREM2, and a new member, GREM3), the group of species sequences corresponding to each gene is monophyletic."
Even this formulation is a bit confusing to me, as monophyly seems to be how the authors would assign a particular sequence in a particular species to a particular family member. And in any event, this gets a bit muddied when there is gene duplication. What does monophyly mean when some taxa have duplicates and others don't? Is "recovery of monophyly" a result, as implied by the authors? Or is it rather part of how the authors classify the sequences into family members like CER1, etc.?
Perhaps this "recovery of monophyly" could be reconciled if the authors inferred the full duplication history with synteny for every species they examined and then layered the phylogenetic analysis of the gene family on top of that. But, as far as I can tell, this was not the strategy the authors followed in most cases.
Finally, DAND5 doesn't appear to offer strong support for monophyly, given the lack of support for placing the Coelacanth as sister to the other DAND5 genes. The strong synteny argument doesn't change this assessment: it could simply be the case that the Coelacanth sequence is the DAND5 ortholog and yet there is no strong phylogenetic evidence of its monophyly with the remaining DAND5 orthologs.
One comment relating to paralogy confused me.
"The fourth clade corresponds to the NBL1 gene, the founding member of the DAN gene family, and was recovered as monophyletic with strong support (pink clade; Fig. 1)."
This way of discussing paralogy (i.e., "founding member") seems clumsy to me. Barring clear mechanistic reasons to assign one paralog the label "founder" or "parent" (e.g. the template for the RNA in retrogenes or the copy maintaining the ancestral structure in a chimeric duplicate), the copies are provisionally assumed to be redundant immediately after duplication. As such, it would only be confusing to label one member the "founding member". The authors even discuss this in relation to the putative redundancy between DAND5 and CER1.
The discussion of cancer on pages 14 and 15 isn't well-integrated into the rest of the manuscript. The reference to RPRM and p53 in particular seems like it could be better incorporated into the narrative of the manuscript. Personally, I'd recommend dropping it, but a smoother integration could also work.
In a manuscript like this one, I would like to see a more in-depth discussion of sources of error. The task the authors set before themselves is quite ambitious and requires marshaling a lot of data from many genes across many different taxa. These taxa were sequenced by different groups, at different times, with different technologies, and their assemblies exhibit different levels of contiguity and, likely, of accuracy and completeness. Sources of error can include errors in multiple alignment, misannotation of the genes, and evolution in gene structure, all of which can lead to aligned non-homologous residues. Moreover, low assembly or annotation completeness can lead to missing genes.