Computational protocol: MaGuS: a tool for quality assessment and scaffolding of genome assemblies with Whole Genome Profiling™ Data

Similar protocols

Protocol publication

[…] One 350-bp paired-end (PE) (ERX372154) and two 5.35-kb mate-pair (MP) (ERX372148, ERX372150) Illumina sequence libraries from A. thaliana were downloaded from the European Nucleotide Archive (ENA). A total of 47.6 Gb of data were obtained representing a coverage depth of 306X of PE and 91X of MP reads.Adapters and primers were removed from the reads, and low quality nucleotides were trimmed from both ends (quality values lower than 20). Reads were also trimmed from their second N to the end and reads longer than 30 nucleotides were kept. Reads that mapped onto run quality control sequences (i.e., the PhiX genome that is used in Illumina sequencing as quality control) were removed. To decrease the number of sequencing errors present in the paired-end (PE) reads, we applied Musket v1.1 [] with a k-mer size of 26 ‘-k 26’. We ran Kmergenie v1.5692 [] on the PE reads to find the best k-mer size for the contig construction step and obtained an optimal k-mer size of 91 bp. SOAPdenovo2 [] was used to perform the genome assembly, a de Bruijn graph was constructed with parameters ‘-K 91 –R’. As SOAPdenovo2 produces contigs over k + 1 bp, we selected informative contigs longer than 500 bp for further processing.We used the PE and MP reads in five different scaffolding programs: SOAPdenovo2, SSPACE [], SGA [], BESST [], and OPERA-LG []. We considered that the two main scaffolding parameters were the k-mer size used at the mapping step and the minimum number of link that validates a contig junction. To perform a scaffolding with the five scaffolders in a fair way, we chose the same parameters for the five scaffolders. We set the k-mer size to 31 bp which is more stringent than the bowtie and bwa mem default parameter, k = 28 and k = 19 respectively. We set the minimum number of link to five which corresponded to the default parameter of BESST and SSPACE. For SOAPdenovo2, we ran the map command with parameter ‘-k 31’, the scaf command with parameter ‘–L 500’, and set the minimum number of links in the configuration file as ‘pair_num_cutoff = 5’. For SSPACE, we manually set the bowtie k-mer size ‘-l 31’ and ran the program with parameter ‘-k 5’. For SGA and BESST, we first aligned the MP reads onto the contigs using BWA aln [] with parameter ‘-l 31’. For SGA, the links file was created using the sga-bam2de command with parameters ‘-n 5 -m 500 --mina 31 –k 31’. The astat file was generated setting ‘–m 500’. The scaf file and the corresponding FASTA file were both created with parameters ‘–m 500’. For BESST, we chose the optimal k-mer size used for the contig assembly as ‘-K 91’ and ran the program with parameter ‘-e 5’. For each program, we selected the scaffolds that were over 2 kb in length. For OPERA-LG, we set the k-mer size for scaffolding with option ‘kmer = 91’. The minimum contig size required for the scaffolding step was fixed as 500 bp with the parameter ‘contig_size_threshold = 500’. Finally, the number of links to validate a connection between two contigs was assigned with the parameter ‘cluster_threshold = 5’.To evaluate the quality of each assembly, we used QUAST v2.3, a popular program based on Nucmer. In the presence of a trusted reference, QUAST aligns with Nucmer the assembly to the provided reference and generates quality metrics. We observed several inconsistancies in the QUAST output. After discussions with the QUAST authors, the source code of QUAST was modified to avoid, as much as possible, the detection of misassemblies (relocation, translocation, and inversion) that correspond to false positives. Because Nucmer generated spurious alignments lower than 5 kb in highly repetitive regions, the minimum alignment length in both parts of a misassembly was set to 5 kb. Moreover, the gap or overlap size threshold length was increased to 5 kb to detect relocations. By default, QUAST reports misassemblies found within a scaffold only if at least 50 % of the scaffold is aligned. We modified this parameter to report all misassemblies regardless of the aligned fraction of a scaffold. [...] We generated new quality assembly metrics from the anchoring based on the commonly used N50 metric (used to evaluate assembly contiguity) and the NA50 introduced by the quality assessment tool QUAST (used to evaluate both contiguity and quality of assembly using a genome reference []). For each scaffold, we defined collinear segments as the fraction of a given scaffold that was correctly organized, i.e., segments anchored with tags that have the same order in the genome map and in the scaffolds (Fig. ). For a given assembly, the lengths of all these segments were used to calculate the following metrics: An50 (50 % of the anchored assembly contains collinear segments with length over An50 bp), AnA50 (50 % of the total assembly contains collinear segments with length over AnA50 bp), and AnG50 (50 % of the estimated genome size contains collinear anchored segments with length over AnG50 bp). MaGuS also generates Anx, AnAx, and AnGx graphs (based on the Nx graph []) that is a plot of the metrics for x values ranging from 1 to 100 %. […]

Pipeline specifications

Software tools Musket, KmerGenie, SOAPdenovo, SSPACE, BESST, Opera, Bowtie, BWA, QUAST, MUMmer, BioGraph, MaGuS
Databases ENA
Applications De novo sequencing analysis, Nucleotide sequence alignment
Organisms Arabidopsis thaliana