Improving the reliability of genotyping of multigene families in non-model organisms
A novel workflow to improve multi-locus genotyping of wildlife species: an experimental set-up with a known model system
The reliability of published scientific papers has been the topic of much recent discussion, notably in the biomedical sciences (Ioannidis et al. 2014). Although small sample size is regularly pointed to as one of the culprits, big data can also be a concern. The advent of high-throughput sequencing, and the processing of sequence data by opaque bioinformatics workflows, mean that sequences with often high error rates are produced, and that exact but slow analyses are not feasible.
The troubles with bioinformatics arise from the increased complexity of the tools used by scientists, and from the lack of incentives and/or skills among authors (but also reviewers and editors) to verify the quality of those tools. As a much discussed example, a bug in the widely used PLINK software (Chang et al. 2015) has been identified as the explanation (Robinson and Visscher 2018) for the incorrect inference of selection for increased height in European human populations (Field et al. 2016).
High-throughput sequencing often generates high rates of genotyping errors, so that the development of bioinformatics tools to assess the quality of data and correct them is a major issue. The work of Gillingham et al. (2020) contributes to the latter goal. In this work, the authors propose a new bioinformatics workflow (ACACIA) for performing genotyping analysis of multigene complexes, such as self-incompatibility genes in plants, major histocompatibility complex (MHC) genes in vertebrates, and homeobox genes in animals, which are particularly challenging to genotype in non-model organisms. PCR and sequencing of multigene families generate artefacts, and hence spurious alleles. A key to Gillingham et al.'s method is to call candidate genes based on Oligotyping, a software pipeline originally conceived for identifying variants from microbiome 16S rRNA amplicons (Eren et al. 2013). This reduces the number of false positives and the number of dropout alleles, compared to previous workflows.
This method is not based on an explicit probability model, and thus it is not designed to control the error rate as, say, a valid confidence interval does (a confidence interval with coverage c for a parameter should contain the parameter with probability c, so the error rate 1 - c is known and controlled by the user who selects the value of c). However, the authors suggest a method to adapt the settings of ACACIA to each application.
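The notion of controlled coverage mentioned above can be illustrated with a small simulation. This is only a sketch under stated assumptions: it uses a normal-approximation interval for the mean of simulated Gaussian data, and the function names, sample size and number of replicates are hypothetical choices made for illustration. It is unrelated to ACACIA itself.

```python
import random
import statistics


def covers(true_mean, sample, z=1.96):
    """Return True if the nominal 95% normal-approximation interval
    for the mean, built from `sample`, contains `true_mean`."""
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * std_err <= true_mean <= mean + z * std_err


def empirical_coverage(n_reps=2000, n=50, true_mean=0.0, seed=1):
    """Fraction of simulated data sets whose interval contains the truth."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(
        covers(true_mean, [rng.gauss(true_mean, 1.0) for _ in range(n)])
        for _ in range(n_reps)
    )
    return hits / n_reps


coverage = empirical_coverage()
print(f"empirical coverage: {coverage:.3f} (nominal c = 0.95)")
```

The empirical coverage should fall close to the nominal value c, which is what it means for the error rate 1 - c to be under the user's control. A heuristic genotyping workflow offers no such guarantee, hence the need to tune its settings on data with known genotypes.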
To compare and validate the new workflow, the authors have constructed new sets of genotypes representing different extents of copy number variation, using already known genotypes from chicken MHC. In such conditions, it was possible to assess how many alleles are not detected and to estimate the rate of false positives. Gillingham et al. additionally investigated the effect of using non-optimal primers. They found better performance of ACACIA compared to a preexisting pipeline, AmpliSAS (Sebastian et al. 2016), for optimal settings of both methods. However, they do not claim that ACACIA will always be better than AmpliSAS. Rather, they warn against the common practice of using the default settings of the latter pipeline. Altogether, this work and the ACACIA workflow should allow for better ascertainment of genotypes from multigene families.
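The kind of validation described above, comparing called alleles against known genotypes, can be sketched as follows. This is a minimal illustration and not the authors' code: the function `score_calls`, the allele labels and the definitions of the two rates (false positives as a fraction of called alleles, dropouts as a fraction of known alleles) are assumptions made for the example.

```python
def score_calls(known, called):
    """Compare the alleles called by a workflow with the known genotype.

    Assumed definitions (hypothetical, for illustration only):
    - false positive: an allele that was called but is not in the known set
    - dropout: a known allele that was not called
    """
    known, called = set(known), set(called)
    false_positives = called - known
    dropouts = known - called
    return {
        "false_positives": sorted(false_positives),
        "dropouts": sorted(dropouts),
        "false_positive_rate": len(false_positives) / len(called) if called else 0.0,
        "dropout_rate": len(dropouts) / len(known) if known else 0.0,
    }


# Hypothetical allele labels for one individual with a known genotype.
known_alleles = {"A01", "A02", "A03", "A04"}
called_alleles = {"A01", "A02", "A04", "X99"}

result = score_calls(known_alleles, called_alleles)
print(result)
```

Repeating such a comparison over a grid of workflow settings is one way to see how settings can be adapted to an application, rather than relying on defaults.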
Ioannidis, J. P. A., Greenland, S., Hlatky, M. A., Khoury, M. J., Macleod, M. R., Moher, D., Schulz, K. F. and Tibshirani, R. (2014) Increasing value and reducing waste in research design, conduct, and analysis. The Lancet, 383, 166-175. doi: 10.1016/S0140-6736(13)62227-8
Chang, C. C., Chow, C. C., Tellier, L. C. A. M., Vattikuti, S., Purcell, S. M. and Lee, J. J. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4, 7, s13742-015-0047-8. doi: 10.1186/s13742-015-0047-8
Robinson, M. R. and Visscher, P. (2018) Corrected sibling GWAS data release from Robinson et al. http://cnsgenomics.com/data.html
Field, Y., Boyle, E. A., Telis, N., Gao, Z., Gaulton, K. J., Golan, D., Yengo, L., Rocheleau, G., Froguel, P., McCarthy, M. I. and Pritchard, J. K. (2016) Detection of human adaptation during the past 2000 years. Science, 354(6313), 760-764. doi: 10.1126/science.aag0776
Gillingham, M. A. F., Montero, B. K., Wilhelm, K., Grudzus, K., Sommer, S. and Santos, P. S. C. (2020) A novel workflow to improve multi-locus genotyping of wildlife species: an experimental set-up with a known model system. bioRxiv 638288, ver. 3 peer-reviewed and recommended by Peer Community In Evolutionary Biology. doi: 10.1101/638288
Eren, A. M., Maignien, L., Sul, W. J., Murphy, L. G., Grim, S. L., Morrison, H. G. and Sogin, M. L. (2013) Oligotyping: differentiating between closely related microbial taxa using 16S rRNA gene data. Methods in Ecology and Evolution, 4(12), 1111-1119. doi: 10.1111/2041-210X.12114
Sebastian, A., Herdegen, M., Migalska, M. and Radwan, J. (2016) AMPLISAS: a web server for multilocus genotyping using next‐generation amplicon sequencing data. Mol Ecol Resour, 16, 498-510. doi: 10.1111/1755-0998.12453
François Rousset (2020) Improving the reliability of genotyping of multigene families in non-model organisms. Peer Community in Evolutionary Biology, 100092. doi: 10.24072/pci.evolbiol.100092
Evaluation round #1
DOI or URL of the preprint: https://doi.org/10.1101/638288
Author's Reply, None
Decision by François Rousset, 16 Jul 2019
I managed to obtain two reviews. One of the reviews highlights why this ms may eventually be worth recommending by PCI. Nevertheless, it also notes two important weaknesses, and the other review points out additional important issues. I summarize these criticisms below to make clear the main revisions that appear to be required for the ms to be eventually recommended.
From the first review:
"ACACIA might be advantageous to the existing programs / workflows, [but] this is not really fully tested in the manuscript": comparisons should be provided.
"The authors should either have run all settings in one study data-set or one setting in all data sets (or all combinations for all data sets)." Here the issue is: what can be concluded from the different analyses? I guess that the authors will be able to partially rebut this criticism, but it is not clear what is meant by "test" on l. 183 ("test ACACIA in wildlife species with unknown genotypes of varying CNV").
The second review highlights that ACACIA is not yet really a "pipeline" but rather an interactive script. Most importantly, it expresses concerns about the repeatability of the analyses. I concur with this review that one or more reproducible examples should be provided. This review also implies that the version described in the ms should be made permanently accessible. I see the point, but I am not sure it is the best way to address the issue of reproducibility. An alternative view is that future versions should be tested against the results of the current version, which brings us back to the issue of providing reproducible examples.
I hope the authors will be able to submit a revised version addressing all these points.