Computational protocol: Toward Universal Forward Genetics: Using a Draft Genome Sequence of the Nematode Oscheius tipulae To Identify Mutations Affecting Vulva Development

Similar protocols

Protocol publication

[…] All software tools used (including versioning and command line main options) are summarized in Table S4. Raw reads were trimmed for adaptors using Cutadapt () and low-quality bases, then corrected for sequencing errors based on k-mer content using Quake () and JELLYFISH (). Raw data were checked with FastQC () and a preliminary assembly generated with CLC Assembly Cell (CLC bio 2017) (Table S5). The CLC assembly was screened for contaminants using taxon-annotated, GC-coverage (TAGC) plots (). Only data deriving from E. coli, the food source, was identified as contaminant and the corresponding reads were removed. The optimal k-mer size for assembly of the cleaned read set was estimated using KmerGenie (). Nine different assemblers () were used to generate preliminary assemblies and these were assessed using basic metrics, correctness of read alignment using ALE () and REAPR (), and biological completeness using Core Eukaryotic Genes Mapping Appoach (CEGMA) () and direct identification of ribosomal RNA genes and mitochondrial genome sequences. SPAdes () outperformed the other assembly tools in almost all aspects and was chosen as draft assembly nOt.1.0. An improved assembly limited to the nuclear genome (nOt.2.0) was generated by removing mitochondrial contigs and contigs of abnormally low coverage and by breaking all scaffolds where mis-assembly had been indicated from analysis of mapping plots and REAPR fragment coverage distribution (FCD) scores. [...] Genes were predicted using a two-pass pipeline () (see Figure S3) based on MAKER2 () and Augustus (), and using the transcriptome data as evidence. Repeats were identified in the assembly using RepeatModeler (). MAKER2 was run in an SGE cluster using the SNAP () ab initio gene finder trained by CEGMA () output models, the GeneMark-ES ab initio finder, SwissProt proteins, and O. tipulae transcripts. The transcriptome data were filtered so that only reads >300 bases that had significant basic-local-alignment-search-tool (BLAST) similarity to C. elegans protein databases were kept. The MAKER2 predictions were used to train Augustus to generate a custom gene-finding profile for O. tipulae. Finally, Augustus was used with the gene-finder profile and O. tipulae transcripts to predict the final gene set. Not enough transcript evidence was available to train a model of untranslated regions (UTRs), and therefore no UTRs were annotated. Protein sets for C. elegans (C. elegans Sequencing Consortium 1998), Dictyocaulus viviparus (), Meloidogyne hapla (), and P. pacificus (), downloaded from WormBase (http://www.wormbase.org/), were clustered with Orthofinder () using an inflation value of three. [...] JU170 whole-genome sequencing data were analyzed to identify SNPs compared to the CEW1 reference genome. These variants were then used for genetic mapping of the mutants (listed in Table S3). Reads were mapped with bwa () to the CEW1 assembly, mappings processed with the GATK tool suite () version 3.3-0 and variants called with HaplotypeCaller using default parameters. We followed HaplotypeCaller’s authors’ recommendations of best practice (; ), realigning reads around indels and performing BQSR by bootstrapping a first call made with HaplotypeCaller. We analyzed the 300-bp CEW1 MiSeq data used for genome assembly with the same pipeline, after E. coli decontamination, as a control for variant calling. We then hard filtered a list of high-confidence SNPs of JU170 with conservative criteria, retaining homozygous positions covered by at least three reads in each strain, with a sequencing and mapping quality higher than 100 and 55, respectively, and a position noted as reference in CEW1 and variant in JU170. Sequencing data from pooled F2 mutants were analyzed with the same pipeline, except that variant calling was restricted to a list of JU170 SNPs previously established for faster computing (using the HaplotypeCaller option genotyping_mode GENOTYPE_GIVEN_ALLELES). Output VCF files were used to compute allele frequencies for each SNP on the JU170 list as the ratio of the number of reads with the JU170 allele over the total number of reads. These frequencies were plotted along each scaffold using custom R scripts. Scaffolds displaying a mean JU170 allele frequency <10% were selected as possibly linked to the candidate locus and retained for a second, unrestricted variant call. JU170 variants were filtered out from the output at this stage. We also systematically added for analysis the 47 scaffolds that do not carry SNPs between JU170 and CEW1 (0.1% of the genome). The functional impact of identified variants was assessed using snpEff () and used to prioritize candidate genes. Where two alleles of the same gene were analyzed, candidate gene lists were compared to exclude identical variations (likely initial background variations) and were inspected for independent hits to the same gene with a different noncomplementing mutation. When necessary, visual inspections of variations in aligned reads was performed with Tablet (). Scripts used to automate this pipeline are available at: https://github.com/fabfabBesnard/Andalusian_Mapping. [...] For each mutant strain (see Table S3) and each scaffold, the mean frequency of alleles in the F2 mapping population whole-genome sequencing was extracted from the previous pipeline. All data sets included frequencies for the 144 scaffolds (over a total of 191 in nOt.2.0) that contained polymorphic positions between the strains CEW1 and JU170. Scaffolds and F2 mapping populations were clustered using the “heatmap.2” function of the “gplots” package in R. […]

Pipeline specifications