Papillomaviruses (PVs) infect almost all mammals and possibly amniotes and bony fishes. While most of them have no significant effects on the hosts, some induce physical lesions. Phylogeny of PVs consists of a few crown groups , among which AlphaPVs that infect primates including human have been well studied. They are associated to largely different clinical manifestations: non-oncogenic PVs causing anogenital warts, oncogenic and non-oncogenic PVs causing mucosal lesions, and non-oncogenic PVs causing cutaneous warts.
The PV genome consists of a double stranded circular DNA genome, roughly organized into three parts: an early region coding for six open reading frames (ORFs: E1, E2, E4, E5, E6 and E7) involved in multiple functions including viral replication and cell transformation; a late region coding for structural proteins (L1 and L2); and a non-coding regulatory region (URR) that contains the cis-elements necessary for replication and transcription of the viral genome.
The E5, E6, and E7 are known to act as oncogenes. The E6 protein binds to the cellular p53 protein . The E7 protein binds to the retinoblastoma tumor suppressor gene product, pRB . However, the E5 has been poorly studied, even though a high correlation between the type of E5 protein and the infection phenotype is observed. E5s, being present on the E2/L2 intergenic region in the genomes of a few polyphyletic PV lineages, are so diverged and can only be characterized by high hydrophobicity. No similar sequences have been found in the sequence database.
Willemsen et al.  provide valuable evidence on the origin and evolutionary history of E5 genes and their genomic environments. First, they tested common ancestry vs independent origins . Because alignment can lead to biased testing toward the hypothesis of common ancestry , they took full account of alignment uncertainty  and conducted random permutation test . Although the strong chemical similarity hampered decisive conclusion on the test, they could confirm that E5 may do code proteins, and have unique evolutionary history with far different topology from the neighboring genes.
Still, there is mysteries with the origin and evolution of E5 genes. One of the largest interest may be the evolution of hydrophobicity, because it may be the main cause of variable infection phenotype. The inference has some similarity in nature with the inference of evolutionary history of G+C contents in bacterial genomes . The inference may take account of possible opportunity of convergent or parallel evolution by setting an anchor to the topologies of neighboring genes.
 Bravo, I. G., & Alonso, Á. (2004). Mucosal human papillomaviruses encode four different E5 proteins whose chemistry and phylogeny correlate with malignant or benign growth. Journal of virology, 78, 13613-13626. doi: 10.1128/JVI.78.24.13613-13626.2004
 Werness, B. A., Levine, A. J., & Howley, P. M. (1990). Association of human papillomavirus types 16 and 18 E6 proteins with p53. Science, 248, 76-79. doi: 10.1126/science.2157286
 Dyson, N., Howley, P. M., Munger, K., & Harlow, E. D. (1989). The human papilloma virus-16 E7 oncoprotein is able to bind to the retinoblastoma gene product. Science, 243, 934-937. doi: 10.1126/science.2537532
 Willemsen, A., Félez-Sánchez, M., & Bravo, I. G. (2019). Genome plasticity in Papillomaviruses and de novo emergence of E5 oncogenes. bioRxiv, 337477, ver. 3 peer-reviewed and recommended by PCI Evol Biol. doi: 10.1101/337477
 Theobald, D. L. (2010). A formal test of the theory of universal common ancestry. Nature, 465, 219–222. doi: 10.1038/nature09014
 Yonezawa, T., & Hasegawa, M. (2010). Was the universal common ancestry proved?. Nature, 468, E9. doi: 10.1038/nature09482
 Redelings, B. D., & Suchard, M. A. (2005). Joint Bayesian estimation of alignment and phylogeny. Systematic biology, 54(3), 401-418. doi: 10.1080/10635150590947041
 de Oliveira Martins, L., & Posada, D. (2014). Testing for universal common ancestry. Systematic biology, 63(5), 838-842. doi: 10.1093/sysbio/syu041
 Galtier, N., & Gouy, M. (1998). Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Molecular biology and evolution, 15(7), 871-879. doi: 10.1093/oxfordjournals.molbev.a025991
MDS obtains a map based on the distance matrix. Correspondence analysis obtains a map that corresponds samples and categories. Based on these properties of the methods, I am afraid that Figures 2 and 6 were obtained not by correspondence analysis and MDS respectively and but by MDS and correspondence analysis respectively. Please confirm quickly whether the explanations of these figures are correct.
Dear Hirohisa Kishino,
Thank you for your revision. We have verified the legends of Figures 2 and 6, and we confirm the explanations are correct. The correspondence analysis in Figure 2 has been performed on a symmetrical matrix containing the unweighted and weighted Robinson-Foulds tree distances. The MDS in Figure 6 was performed on a matrix containing the distances in codon usage preferences for the different AlphaPV ORFS. I hope that this explanation clarified any doubts.
Dear Anouk Willemsen,
Thank you for submitting the manuscript to PCI Evolutionary Biology. Now, we have received comments from two reviewers. each of the reviewers' comments. Both of the reviewers appreciate the work. However, Leonardo de Oliveira Martins raised methodological concerns on the UCA test, which is a core part of the manuscript, and made a constructive suggestion. Please read the comments carefully and revise the manuscript, responding to each of them.
Sincerely yours, Hiro
The authors describe independent insertions of a non-coding stretch of DNA in the intergenic E2–L2 region of Papillomavirus (PV) genomes, with subsequent acquisition of coding capacity leading to clinically important novel proteins. The manuscript is well written, and describes concisely an important problem ― de novo oncogene emergence ― using an ingenious solution. At different phylogenetic scales, the authors tested explicitly if the inter–E2–L2 region (a highly variable and clinically relevant region of the circular PV genome) has a single common ancestry or appeared independently, given its low similarity and diverse composition. Within a particularly important clade (AlphaPV) they furthermore explored the genetic characteristics of the so-called E5 ORFs. Although the common ancestry hypothesis is properly addressed, I am afraid that the particular model used may not be very convincing without a few modifications. I will describe this problem in more detail below, together with a few other suggestions.
◾ In  we give a hint on potential complications when using Bali-phy (and Bayesian models, in general) for model selection: the difficulty in achieving convergence, and the poor quality of the marginal likelihood estimation:
Therefore I would like to suggest a few options that may corroborate your conclusions and help convince readers of the independent ancestry of the inter-E2–L2 regions, at your discretion:
Notice that you don’t need to add all the suggested analyses, but some further evidence for the independent origins hypothesis will be welcome.
◾ I am bit confused about the section “DNA Sequences in The inter-E2–L2 Region in AlphaPVs are Monophyletic but The E5 ORFs Therein Encoded are Not” (page 5): If E5β has an independent origin, then Cut should also be inferred as independently originated, unless they represent non-overlapping regions. Or maybe the inter-E2-L2 regions described on Table 2 exclude the E5 ORFs (the “non-coding regions” described in the discussion)? A diagram showing which regions are being included in each test, or at least a bit more info (e.g. if some sequences miss the E5 ORF, or about the non-coding regions) would help, even for the previous analysis (Table 1). Notice that this confusion may be a product of my limited knowledge of these genomes, but hopefully you can make these points clearer to other reader like me.
◾ Furthermore I have a few minor suggestions, that nonetheless can be easily addressed:
 de Oliveira Martins, L. & Posada, D. Testing for Universal Common Ancestry. Systematic Biology 63, 838–842 (2014). http://www.ncbi.nlm.nih.gov/pubmed/24958930
 de Oliveira Martins, L. & Posada, D. Infinitely Long Branches and an Informal Test of Common Ancestry. Biology Direct 11 (1): 19. (2016) http://dx.doi.org/10.1186/s13062-016-0120-y
This paper uses computational analysis to examine the evolution of the papillomavirus (PV) E5 ORF, which is located between the early and late region (the inter-E2-L2 region) of the PV genome. First, it provides evidence that the nucleotide sequence of the inter-E2-L2 region among the various PV types is not derived from a common ancestor. Instead, at least five independent events, one occurring for each PV clade, resulted in the insertion of this region. This implies that the E5 ORFs in the AlphaPVs (e.g., HPV16) and those of the DeltaPVs (e.g., BPV-1) are evolutionarily unrelated, consistent with the fact that the E5 proteins of HPV16 and BPV-1 share little amino acid sequence similarity except for their hydrophobicity. The authors next focused on evolution of the E5 ORFs from the AlphaPVs, which includes the HPVs. They show that while the nucleotide sequence of the inter-E2-L2 region of these PVs arose from a common ancestor, their E5 ORFs did not. Specifically, the E5 ORFs from HPVs with mucosal tropism arose separately from those with cutaneous tropism. Since the oncogenic HPVs are mucosal and not cutaneous, the independent evolution of the E5 ORF in these HPV types suggests a role for E5 in the oncogenic potential of HPVs. Finally, this paper shows the that E5 ORFs in AlphaPVs display characteristics of actual coding sequences. The authors propose that the PV E5 genes evolved by the de novo emergence of new protein-coding sequences from non-coding regions. They speculate that the independent emergence of the E5 ORFs in different HPV types occurred by random nucleotide addition and/or recombination during viral DNA synthesis to insert a noncoding sequence, followed by mutation to generate a new protein coding sequence. But, although the PV E5 genes arose independently, they all encode a small hydrophobic protein. The occurrence of multiple independent selection events for a small hydrophobic protein suggests that modulating cellular membrane proteins or the membrane environment by such a protein is important for PV fitness.
Overall, this paper provides an interesting scenario for the evolution of a diverse class of small viral transmembrane proteins and should be accepted for publication with minimal revision.
Page 10, 4th paragraph, 5th and 6th sentences should read: "Experimentally, protein structures that have not been observed in nature have been isolated and shown to have biological activity. More specifically, Chacon et al., 2014, used genetic selection to isolate small artificial transmembrane proteins modeled after the BPV-1 E5 protein but lacking any preexisting sequences."
Page 10, 5th paragraph, 4th sentence: replace "rise" with "raise"