Recommendation

Improving the reliability of genotyping of multigene families in non-model organisms

François Rousset based on reviews by Sebastian Ernesto Ramos-Onsins, Helena Westerdahl and Thomas Bigot

A recommendation of:

A novel workflow to improve multi-locus genotyping of wildlife species: an experimental set-up with a known model system

Gillingham, Mark A. F., Montero, B. Karina, Wilhelm, Kerstin, Grudzus, Kara, Sommer, Simone and Santos, Pablo S. C. (2020), bioRxiv, 376756, ver. 3 peer-reviewed by Peer Community in Evolutionary Biology https://doi.org/10.1101/638288

Abstract

Genotyping novel complex multigene systems is particularly challenging in non-model organisms. Target primers frequently amplify simultaneously multiple loci leading to high PCR and sequencing artefacts such as chimeras and allele amplification bias. Most next-generation sequencing genotyping pipelines have been validated in non-model systems whereby the real genotype is unknown and the generation of artefacts may be highly repeatable. Further hindering accurate genotyping, the relationship between artefacts and copy number variation (CNV) within a PCR remains poorly described. Here we investigate the latter by experimentally combining multiple known major histocompatibility complex (MHC) haplotypes of a model organism (chicken, Gallus gallus, 43 artificial genotypes with 2-13 alleles per amplicon). In addition to well defined “optimal” primers, we simulated a non-model species situation by designing “naive” primers, with sequence data from closely related Galliform species. We applied a novel open-source genotyping pipeline (ACACIA) to the data, and compared its performance with another, previously published, pipeline. ACACIA yielded very high allele calling accuracy (>98%). Non-chimeric artefacts increased linearly with increasing CNV but chimeric artefacts leveled when amplifying more than 4-6 alleles. As expected, we found heterogeneous amplification efficiency of allelic variants when co-amplifying multiple loci. Using our validated ACACIA pipeline and the example data of this study, we discuss in detail the pitfalls researchers should avoid in order to reliably genotype complex multigene systems. ACACIA and the datasets used in this study are publicly available at GitLab and FigShare (https://gitlab.com/psc_santos/ACACIA and https://figshare.com/projects/ACACIA/66485).

open-source genotyping pipeline, ACACIA, next generation sequencing, amplicon genotyping, allele dropout, PCR amplification bias, sequencing bias, multigene family, MHC

Submission: posted 15 May 2019
Recommendation: posted 22 January 2020, validated 23 January 2020

Cite this recommendation as:
Rousset, F. (2020) Improving the reliability of genotyping of multigene families in non-model organisms. Peer Community in Evolutionary Biology, 100092. https://doi.org/10.24072/pci.evolbiol.100092

Recommendation

The reliability of published scientific papers has been the topic of much recent discussion, notably in the biomedical sciences [1]. Although small sample size is regularly pointed to as one of the culprits, big data can also be a concern. The advent of high-throughput sequencing, and the processing of sequence data by opaque bioinformatics workflows, mean that sequences with often high error rates are produced, and that exact but slow analyses are not feasible.
The troubles with bioinformatics arise from the increased complexity of the tools used by scientists, and from the lack of incentives and/or skills among authors (but also reviewers and editors) to ensure the quality of those tools. As a much discussed example, a bug in the widely used PLINK software [2] has been pointed to as the explanation [3] for the incorrect inference of selection for increased height in European human populations [4].
High-throughput sequencing often generates high rates of genotyping errors, so that the development of bioinformatics tools to assess the quality of data and correct them is a major issue. The work of Gillingham et al. [5] contributes to the latter goal. In this work, the authors propose a new bioinformatics workflow (ACACIA) for performing genotyping analysis of multigene complexes, such as self-incompatibility genes in plants, major histocompatibility genes (MHC) in vertebrates, and homeobox genes in animals, which are particularly challenging to genotype in non-model organisms. PCR and sequencing of multigene families generate artefacts, hence spurious alleles. A key to Gillingham et al.'s method is to call candidate genes based on Oligotyping, a software pipeline originally conceived for identifying variants from microbiome 16S rRNA amplicons [6]. This reduces the number of false positives and the number of dropout alleles, compared to previous workflows.
This method is not based on an explicit probability model, and thus it is not designed to control the error rate as, say, a valid confidence interval should (a confidence interval with coverage c for a parameter should contain the parameter with probability c, so the error rate 1 - c is known and controlled by the user who selects the value of c). However, the authors suggest a method to adapt the settings of ACACIA to each application.
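The coverage property invoked here can be checked by simulation. The following is a minimal Python sketch, not part of ACACIA or of the recommended preprint; the normal model with known standard deviation, the sample size and the function name are all assumptions chosen for illustration.

```python
import random
import statistics


def coverage_of_normal_ci(true_mean=0.0, sigma=1.0, n=30, z=1.96,
                          n_rep=2000, seed=1):
    """Return the fraction of simulated intervals that contain true_mean.

    Each interval is sample_mean +/- z * sigma / sqrt(n), i.e. a nominal
    95% interval for the mean of a normal sample with known sigma.
    """
    rng = random.Random(seed)
    half_width = z * sigma / n ** 0.5
    hits = 0
    for _ in range(n_rep):
        sample_mean = statistics.fmean(
            rng.gauss(true_mean, sigma) for _ in range(n)
        )
        # The interval covers the parameter when the sample mean lies
        # within half_width of the true mean.
        if abs(sample_mean - true_mean) <= half_width:
            hits += 1
    return hits / n_rep


coverage = coverage_of_normal_ci()
print(f"empirical coverage: {coverage:.3f} (nominal c = 0.95)")
```

The empirical coverage should be close to the nominal c, which is the sense in which the error rate 1 - c is controlled. A heuristic allele-calling workflow offers no such guarantee, hence the need to tune its settings against known genotypes.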
To compare and validate the new workflow, the authors have constructed new sets of genotypes representing different extents of copy number variation, using already known genotypes from chicken MHC. In such conditions, it was possible to assess how many alleles are not detected and what the rate of false positives is. Gillingham et al. additionally investigated the effect of using non-optimal primers. They found better performance of ACACIA compared to a preexisting pipeline, AmpliSAS [7], for optimal settings of both methods. However, they do not claim that ACACIA will always be better than AmpliSAS. Rather, they warn against the common practice of using the default settings of the latter pipeline. Altogether, this work and the ACACIA workflow should allow for better ascertainment of genotypes from multigene families.
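As an illustration of this kind of validation, the comparison between a known genotype and a called genotype can be sketched as below. The function name, the allele labels and the choice of summary statistics are hypothetical; this is not the authors' code, nor necessarily the accuracy measure used in the preprint.

```python
def allele_call_metrics(expected, called):
    """Compare the alleles called for one amplicon with the known genotype."""
    expected, called = set(expected), set(called)
    true_pos = expected & called    # real alleles that were called
    false_pos = called - expected   # artefacts called as alleles
    dropouts = expected - called    # real alleles that were missed
    return {
        "true_positives": len(true_pos),
        "false_positives": len(false_pos),
        "dropouts": len(dropouts),
        # Share of called alleles that are real.
        "precision": len(true_pos) / len(called) if called else 0.0,
        # Share of real alleles that were recovered.
        "recall": len(true_pos) / len(expected) if expected else 0.0,
    }


# Hypothetical artificial genotype with four known alleles; the pipeline
# output misses one of them (A4) and reports one artefact (X9).
metrics = allele_call_metrics(
    expected={"A1", "A2", "A3", "A4"},
    called={"A1", "A2", "A3", "X9"},
)
print(metrics)
```

Such counts can only be computed when the true genotype is known, which is why the artificial chicken MHC genotypes are central to the validation.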

References

[1] Ioannidis, J. P. A., Greenland, S., Hlatky, M. A., Khoury, M. J., Macleod, M. R., Moher, D., Schulz, K. F. and Tibshirani, R. (2014) Increasing value and reducing waste in research design, conduct, and analysis. The Lancet, 383, 166-175. doi: 10.1016/S0140-6736(13)62227-8
[2] Chang, C. C., Chow, C. C., Tellier, L. C. A. M., Vattikuti, S., Purcell, S. M. and Lee, J. J. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4, 7, s13742-015-0047-8. doi: 10.1186/s13742-015-0047-8
[3] Robinson, M. R. and Visscher, P. (2018) Corrected sibling GWAS data release from Robinson et al. http://cnsgenomics.com/data.html
[4] Field, Y., Boyle, E. A., Telis, N., Gao, Z., Gaulton, K. J., Golan, D., Yengo, L., Rocheleau, G., Froguel, P., McCarthy, M. I. and Pritchard, J. K. (2016) Detection of human adaptation during the past 2000 years. Science, 354(6313), 760-764. doi: 10.1126/science.aag0776
[5] Gillingham, M. A. F., Montero, B. K., Wilhelm, K., Grudzus, K., Sommer, S. and Santos, P. S. C. (2020) A novel workflow to improve multi-locus genotyping of wildlife species: an experimental set-up with a known model system. bioRxiv 638288, ver. 3 peer-reviewed and recommended by Peer Community In Evolutionary Biology. doi: 10.1101/638288
[6] Eren, A. M., Maignien, L., Sul, W. J., Murphy, L. G., Grim, S. L., Morrison, H. G. and Sogin, M. L. (2013) Oligotyping: differentiating between closely related microbial taxa using 16S rRNA gene data. Methods in Ecology and Evolution, 4(12), 1111-1119. doi: 10.1111/2041-210X.12114
[7] Sebastian, A., Herdegen, M., Migalska, M. and Radwan, J. (2016) AMPLISAS: a web server for multilocus genotyping using next‐generation amplicon sequencing data. Mol Ecol Resour, 16, 498-510. doi: 10.1111/1755-0998.12453

Conflict of interest:
The recommender in charge of the evaluation of the article and the reviewers declared that they have no conflict of interest (as defined in the code of conduct of PCI) with the authors or with the content of the article. The authors declared that they comply with the PCI rule of having no financial conflicts of interest in relation to the content of the article.

Reviews

Reviewed by Thomas Bigot, 16 Dec 2019

The authors have done remarkable work on their manuscript.

It is now clear which software version is presented in the article, with which dataset. Some missing points were explained.
The instructions were made more complete, and the code was stored on a permanent repository with a DOI. It is now possible to fully test it.

My suggestion to use a pipeline manager was discussed in the response, and I agree that the solution the authors chose (a homemade manager) is suitable for this kind of pipeline. I find that storing the answers to the interactive pipeline questions in the configuration file for subsequent executions is a smart way to make it more user-friendly.

This pipeline now fully meets the current standards for a bioinformatics tool and, to my mind, is suitable for publication.

Some very minor remarks:

L37, L203, L392, L553, L558, L566, L609: naive still does not take an umlaut in English

Enumerations are made using different styles: 1) 2) L73 L77 ; 1.) 2.) L152 L153 and 1. 2. L175 L177. This may be standardized.

https://doi.org/10.24072/pci.evolbiol.100092.rev21

Reviewed by Sebastian Ernesto Ramos-Onsins, 16 Dec 2019

In this work, the authors propose a new workflow (ACACIA) for genotyping relatively complex multi-locus systems, aimed especially at non-model species. The authors identified a number of problems in the genotyping of multi-locus systems such as MHC (problems also detected and reviewed by other authors, as referenced in the manuscript), and constructed a workflow whose key step is a method that calls candidate genes by clustering redundant alleles apart from other, divergent alleles, given the information contained at each position (the Oligotyping tool). This workflow reduces the number of false positives and the number of dropout alleles relative to other available workflows. Although this key step avoids a threshold decision, this kind of methodology is not fully probabilistic, and a subsequent decision therefore still makes some errors by discarding possible true alleles. Nevertheless, I find it a good and practical solution that improves on existing methods.

The authors construct a new set of genotypes with different CNV in order to compare and validate the new workflow, using already known genotypes from chicken. Thus, it is possible to test, for example, how many alleles are not detected and what the rate of false positives is. I find it correct and very informative about the possibilities of this methodology.

Finally, the authors have considered all the suggestions given by previous reviewers and have included most of them. In my opinion, the manuscript and the software have greatly improved. I have no more suggestions.

https://doi.org/10.24072/pci.evolbiol.100092.rev22

Evaluation round #1

DOI or URL of the preprint: https://doi.org/10.1101/638288

Author's Reply, 28 Oct 2019

Download author's reply https://doi.org/10.24072/pci.evolbiol.100204.ar1

Decision by François Rousset, posted 16 Jul 2019

I managed to obtain two reviews. One of the reviews highlights why this ms may be eventually worth recommending by PCI. Nevertheless, it also notes two important weaknesses, and the other review points out additional important issues. I summarize these criticisms below to make clear the main revisions that appear required for the ms to be eventually recommended.

From the first review:

"ACACIA might be advantageous to the existing programs / workflows, [but] this is not really fully tested in the manuscript": comparisons should be provided.

"The authors should either have run all settings in one study data-set or one setting in all data sets (or all combinations for all data sets)." Here the issue is: what can be concluded from the different analyses? I guess that the authors will be able to partially rebut this question, but it is not clear what is meant by "test" on l. 183 ("test ACACIA in wildlife species with unknown genotypes of varying CNV").

The second review highlights that ACACIA is not yet really a "pipeline" but rather an interactive script. Most importantly, it expresses concerns about the repeatability of the analyses. I concur with this review that reproducible example(s) should be provided. This review also implies that the version described in the ms should be made permanently accessible. I see the point but I am not sure it is the best way to address the issue of reproducibility. An alternative view is that future versions should be tested against the results of the current version, which brings us back to the issue of providing reproducible examples.

I hope the authors will be able to submit a revised version addressing all these points.

https://doi.org/10.24072/pci.evolbiol.100204.d1

Reviewed by Helena Westerdahl, 26 Jun 2019

Download the review https://doi.org/10.24072/pci.evolbiol.100204.rev11

Reviewed by Thomas Bigot, 10 Jul 2019

This article presents a workflow to improve multi-locus genotyping. They propose an experimental set-up and a pipeline named Acacia to perform the genotyping itself. They chose chicken as a model organism, and try to characterize sequences of MHC B Complex with their tool.

The manuscript is well-written.

Given my expertise, I will focus this review on the pipeline and its bioinformatics aspects.

The ACACIA pipeline

Description in the article

The introduction (L 273) should mention Biopython as a dependency;
Some steps were coded ad hoc, even non-trivial ones (e.g. trimming low-quality ends). The reason why well-known methods were not used should be briefly explained.
Input data is not explained. In the documentation, three input files are listed. One of them is "A fasta file with 100+ sequences related to those that you expect to have sequenced. This file will be used to setup a local BLAST database." This description is not clear and BLAST is not mentioned in the manuscript.
FLASH and Pandas are used in the script but not mentioned in the manuscript.

The pipeline itself

Reproducibility of the code

I have a major concern about reproducibility: the only code available is the master branch of the git repository. If a user downloads the pipeline in the future, nothing indicates that the code available at that time corresponds to the one described in this article, and nothing guarantees that the code will still be available on GitLab. Hence, I strongly suggest that the authors:

create a release number of the code (eg v1.0) and indicate this number in the article;
create an archive of this release and upload it to zenodo or figshare (or any repository of this kind);
get a DOI from them, and indicate it in the pipeline description.

Dataset: reproducibility of the analysis

I wish I could test the pipeline, but no example dataset is provided. Moreover, the article describes the analysis of a particular one (chicken MHC), so it should be included. I suggest uploading it (fastq data, primers, “well known sequences”) to zenodo or figshare as explained just above, and indicating the DOI in the article and, as a testing procedure, in the documentation files.

Pipeline manager

The program is an interactive script that asks the user questions, so the user has to wait for the whole duration of the analysis. No argument can be provided to the pipeline at launch time. The files must be at certain places with certain names. Moreover, all the steps are performed in one run: if one step fails, it has to be restarted from the beginning.

This script should be transformed into a real pipeline, using dedicated software. As the authors seem proficient in Python, I suggest that they choose Snakemake (https://snakemake.readthedocs.io/en/stable/). It is a Python tool: each step's code chunk could simply be copied and pasted into the Snakemake recipe.
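A lighter-weight step in the same direction, short of adopting Snakemake, would be to accept arguments at launch time. The sketch below uses Python's standard argparse module; the option names, step names and file names are hypothetical and do not correspond to ACACIA's actual interface.

```python
import argparse

# Hypothetical step names, in execution order.
STEPS = ["merge", "filter", "call"]


def build_parser():
    """Build a command-line parser for a non-interactive launcher."""
    parser = argparse.ArgumentParser(
        description="Hypothetical non-interactive pipeline launcher"
    )
    parser.add_argument("--reads", required=True,
                        help="path to the FASTQ input (hypothetical option)")
    parser.add_argument("--primers", required=True,
                        help="path to the primer table (hypothetical option)")
    parser.add_argument("--start-at", choices=STEPS, default=STEPS[0],
                        help="step to resume from after a failure")
    return parser


def steps_to_run(start_at):
    """Return the steps from start_at onwards, skipping completed ones."""
    return STEPS[STEPS.index(start_at):]


# Parse an explicit argument list so the example runs anywhere; a real
# launcher would call parse_args() with no argument to read sys.argv.
args = build_parser().parse_args(
    ["--reads", "reads.fastq", "--primers", "primers.csv",
     "--start-at", "filter"]
)
print(steps_to_run(args.start_at))
```

With options of this kind, a run can be scripted and resumed from a chosen step instead of being restarted from the beginning.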

Other remarks

L 216, L 217: “Naive” and “naively” do not have an umlaut in English.

https://doi.org/10.24072/pci.evolbiol.100204.rev12
