Computational protocol: From next-generation sequencing alignments to accurate comparison and validation of single-nucleotide variants: the pibase software

Similar protocols

Protocol publication

[…] For our example data download on the project homepage (, 23 August 2012, date last accessed), we used the publicly available BAM files for chromosome 22, (, 23 August 2012, date last accessed), for the high-coverage trio samples from Utah residents with Northern and Western European ancestry (CEU): NA12878 (daughter), NA12891 (father) and NA12892 (mother). The daughter’s whole genome BAM files were available as Illumina, SOLiD and 454/FLX reads, and the father’s and mother’s files were only available as Illumina reads. In the 1000 Genomes Project, the Illumina reads were mapped using BWA, the SOLiD reads using BFAST, and the 454/FLX reads using SSAHA.The reference sequence used for mapping by the 1000 Genomes Project is available at (, 23 August 2012, date last accessed). This genome is largely the same as hg19, except that, e.g. the chromosome names are changed to 1, 2, 3, etc., and the mitochondrial reference sequence is rCRS not chrM. We supply the file hg19.1000 G.quick.fasta in our example data download for use with chromosomes 1–22, X and Y.We also downloaded the file of exonic variant-calls (CEU.exon.2010_03.genotypes.vcf.gz) from (, 23 August 2012, date last accessed). The VCF file lists only 55 SNVs for chromosome 22, and their genomic coordinates are counted with respect to hg18. We transformed these SNV coordinates to hg19 coordinates using the online tool at (, 23 August 2012, date last accessed) () and used them for the pibase analysis examples.For further comparisons with established results and established tools, we used pibase to recall the HapMap single nucleotide polymorphisms (SNPs, a class of SNVs that occur in at least 1% of individuals in a specific population) defined in (, 23 August 2012, date last accessed) after coordinate transformation from hg18/b36 to hg19. For the above whole genome sequencing files, we recalled SNVs in chromosome 22 using both GATK and SAMtools (mpileup, bcftools, vcfutils), merged the GATK and SAMtools SNV-lists (i.e. the union of both SNV-lists, not the overlap) and recalled these SNVs using pibase. As a targeted sequencing example, we used the fastq files available from the EBI/NCBI SRA with the SRA run ID SRR098401 (the high-coverage HapMap CEU trio daughter NA12878,, 23 August 2012, date last accessed). These whole exome reads were generated with an Illumina HiSeq 2000 at the Broad Institute, Cambridge, MA, USA. We mapped the reads using BWA (BAM file version 1). We removed duplicates using Picard (BAM file version 2). We removed non-uniquely aligned reads (BAM file version 3). We called SNVs for each BAM file version using SAMtools and GATK with the Variant Quality Score Recalibrator software (VQSR) (). We used pibase to genotype exonic HapMap SNPs (the genomic HapMap SNPs were filtered to lie within the regions defined by CCDS.20110907.txt, which we downloaded from (, 23 August 2012, date last accessed), and then filtered to exclude reference sequence genotypes). We compared all recalled SNVs with the HapMap SNPs in hapmap3_r1_b36_fwd.CEU.qc.poly.recode.ped to assess the levels of specificity. Finally, using PLINK (), we performed Mendelian error checks for the SNV-lists from the Illumina trio data, to further compare the levels of specificity. We documented the settings for each tool in the shell scripts that are included in our example data download (subfolders chr22_snpcalling, chr22_scripts, and exome). […]

Pipeline specifications

Software tools BWA, BFAST, SSAHA, GATK, SAMtools, bcftools, Picard, PLINK
Databases PIBASE
Applications WGS analysis, WES analysis, GWAS
Diseases Neoplasms
Chemicals Nucleotides