Computational protocol: SNP Discovery with EST and NextGen Sequencing in Switchgrass (Panicum virgatum L.)

Protocol publication

[…] ESTs from all libraries were processed through the JGI EST pipeline. ESTs were generated in pairs, using a 5′ and a 3′ end-read from each cDNA clone. Common patterns at the ends of ESTs, such as vector and adaptor sequences, were identified and removed using a custom software tool developed internally at JGI. Clones were identified as “insertless” if more than 200 bases of vector sequence were found at the 5′ end or if fewer than 100 bases of non-vector sequence remained in the sequence. Next, ESTs were trimmed for quality using a sliding window trimmer (window size = 11 bases). Once the average quality score in the window fell below the quality threshold (phred quality score of 15), the EST was split and the longest remaining sequence was retained as the trimmed EST sequence, unless fewer than 100 bases of high-quality sequence remained, in which case the sequence was removed from further processing. In the next step, ESTs that contained poly-A or poly-T tails were trimmed and retained unless the remaining sequence was shorter than 100 base pairs, in which case they were discarded. In the following step, ESTs consisting of more than 50% low-complexity sequence (even if of good quality) were also removed from the final set of processed ESTs. Where more than one read from the same clone and in the same direction existed, the longest high-quality read was retained.
Sister ESTs (paired-end reads) were categorized as follows: if one EST was insertless or a contaminant, then, by default, the second sister was categorized as the same and was discarded. However, when retained, each sister EST was treated separately for complexity and quality scores. Lastly, an annotational quality check was conducted by comparing the EST sequences with those in the GenBank nucleotide database to identify contaminants, i.e., undesirable sequences such as those matching non-cellular and rRNA sequences. Once identified, those sequences were removed from the final set of processed ESTs.
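The sliding-window quality trimming step can be sketched as follows. This is an illustrative reading of the description, not the JGI tool itself: the function names are hypothetical, and the sketch assumes that every base covered by a window whose mean phred score is below 15 counts as low quality, with the longest uncovered run kept. Only the window size (11), the threshold (15), and the 100-base minimum come from the protocol.

```python
from typing import List, Optional, Tuple

WINDOW = 11    # window size stated in the protocol
MIN_QUAL = 15  # phred threshold stated in the protocol
MIN_LEN = 100  # minimum retained length stated in the protocol


def longest_high_quality_span(
    quals: List[int], window: int = WINDOW, min_qual: int = MIN_QUAL
) -> Tuple[int, int]:
    """Return (start, end) of the longest run of bases that is not covered
    by any window whose mean quality is below min_qual (end is exclusive)."""
    n = len(quals)
    if n < window:
        # Assumption: a read shorter than one window has no usable span.
        return (0, 0)

    # Mark every base that falls inside at least one low-quality window.
    low = [False] * n
    window_sum = sum(quals[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one base to the right.
            window_sum += quals[start + window - 1] - quals[start - 1]
        if window_sum < min_qual * window:
            for i in range(start, start + window):
                low[i] = True

    # Find the longest run of unmarked bases (first one wins on ties).
    best = (0, 0)
    run_start = None
    for i in range(n + 1):
        is_good = i < n and not low[i]
        if is_good and run_start is None:
            run_start = i
        elif not is_good and run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


def trim_est(seq: str, quals: List[int], min_len: int = MIN_LEN) -> Optional[str]:
    """Return the trimmed EST, or None if fewer than min_len bases remain."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    start, end = longest_high_quality_span(quals)
    if end - start < min_len:
        return None
    return seq[start:end]
```

Under these assumptions, a read with a short low-quality stretch in the middle is split at that stretch and only the longer flank is returned. How the actual pipeline places the split point relative to the failing window is not specified in the text.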
For clustering, ESTs were evaluated with malign, a k-mer-based alignment tool, which clusters ESTs based on sequence overlap (k-mer = 16, seed length requirement = 32, alignment identity ≥ 98%). Clusters of ESTs were further merged based on sister ESTs using double linkage. Double linkage requires that two or more matching sister ESTs exist in both clusters in order for the clusters to be merged. EST clusters were then assembled using CAP3 to form consensus sequences.
Clusters may have more than one consensus sequence for various reasons, including alternative splicing, long-insert sequences, or errors in assembly. Cluster singlets are clusters of multiple reads from the same EST, whereas CAP3 singlets are single ESTs that had joined a cluster but, during cluster assembly, were isolated into a separate singlet consensus sequence. ESTs from each cDNA library were clustered and assembled separately and, subsequently, all of the ESTs from all cDNA libraries were clustered and assembled together. For cluster consensus sequence annotation, the consensus sequences were compared to the Swiss-Prot protein database using BLASTX, and the annotations of the hits were reported. […]
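The double-linkage merging rule can be illustrated with a small union-find sketch. It is not the malign or JGI implementation: the function name, the input shapes (a mapping from EST identifier to cluster identifier, and a list of sister pairs), and the single-pass counting of links are all assumptions made for the example. Only the requirement of two or more sister pairs spanning both clusters comes from the text.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple


def merge_by_double_linkage(
    est_cluster: Dict[str, Hashable],
    sister_pairs: Iterable[Tuple[str, str]],
    min_links: int = 2,  # "double" linkage: at least two sister pairs
) -> Dict[str, Hashable]:
    """Merge clusters that share at least min_links sister-EST pairs.

    est_cluster maps each EST id to its initial cluster id.
    sister_pairs lists (5' read id, 3' read id) for each clone.
    Returns a new mapping from EST id to a merged cluster representative.
    """
    # Union-find forest over the initial cluster ids.
    parent = {c: c for c in set(est_cluster.values())}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path compression
            c = parent[c]
        return c

    # Count clones whose two sisters landed in two different clusters.
    links = Counter()
    for five_prime, three_prime in sister_pairs:
        a = est_cluster.get(five_prime)
        b = est_cluster.get(three_prime)
        if a is None or b is None or a == b:
            continue  # sister was discarded, or both reads already co-cluster
        links[frozenset((a, b))] += 1

    # Merge every cluster pair supported by enough sister pairs.
    for pair, count in links.items():
        if count >= min_links:
            a, b = pair
            parent[find(a)] = find(b)

    return {est: find(c) for est, c in est_cluster.items()}
```

In this sketch, link counts are taken once against the initial clusters, so two clusters joined by a single sister pair stay separate unless they are connected through other merges. Whether the original pipeline recounts links after each merge is not stated.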

Pipeline specifications

Software tools: MALIGN, CAP3, BLASTX
Application: Nucleotide sequence alignment
Organisms: Panicum virgatum