Computational protocol: Exon-Enriched Libraries Reveal Large Genic Differences Between Aedes aegypti from Senegal, West Africa, and Populations Outside Africa

Similar protocols

Protocol publication

[…] A custom concatenated reference file “All3U” was built based on the Vectorbase AaegL3.3 gene build, and SNPs were called with respect to this build. The reference was used for the detailed analysis of 18,840 transcripts among 15,735 genes. A FORTRAN program used the coordinates of the features of each gene appearing in Aedes-aegypti Liverpool_ BASEFEATURES_Aaegl3.3.GTF to extract: (1) 600 bp of the cis region 5′ to each annotated transcript (5′ nontranscribed = 5′NTR), (2) all 5′UTRs, (3) exons, (4) introns, (5) 3′UTRs, and (6) a 600 bp cis region in the 3′ direction (3′ nontranscribed = 3′NTR). Each SNP was given an ordered SNPID number according to its position in the gene, the location of the gene in the supercontig, and where the supercontig mapped physically (). There were 1948 mapped genes containing 41,119,004 SNPs on chromosome 1, 3429 genes containing 70,643,307 SNPs on chromosome 2, and 2382 genes with 44,883,168 SNPs on chromosome 3. There were 7862 unmapped genes containing 144,703,188 SNPs. In total there were 15,735 genes containing 301,348,667 SNPs.Fastq files from each library were aligned to the All3U reference file using GSNAP (v. 2013-10-28), allowing 10% divergence from the reference (). SAMtools () converted GSNAP outputs to *.mpileup files. The “readcounts” command in Varscan2 (v2.3.5) () used the *.mpileup file to report SNPs, with default values of a minimum of 15× coverage in each library, and a minimum base quality of 30 (http://varscan.sourceforge.net/using-varscan.html). A series of FORTRAN programs then read the “readcount” files to produce a flat file containing the SNPID number, the Aaegl3.3 reference nucleotide, its coverage at each sequenced position, and the coverages of insertions or deletions at that position.The program “Trim” removed all SNPs that had <15 counts, and capped all SNP coverage at 2000 to avoid regions of repetitive DNA. The program “2 × 2” read two files, one from each replicate at a collection site, and produced a single “2 × 2.out” file that contained SNPs that occurred in both replicate datasets for each collection (Mexico, Thailand male, Thailand female, PK10 male, PK10 female, and Kaolack). A program “2 × 2 × 2” then read two 2 × 2.out files to produce a list of SNPs common to a pair of collections. A program “Tables” then read the two “2 × 2.out” files and the list to produce a contingency table for each SNP common to both collections. The contingency file contained: (1) the SNPID coordinate followed by four lines containing (2) the name of the library containing that SNP, and (3) the coverage for each nucleotide at that SNP. To ensure high confidence SNP calls, replicate SNP frequencies were compared using a heterogeneity χ2 test with nid-1 degrees of freedom, where nid is the number of nucleotides and indels segregating at a SNP. Any SNP in which nucleotide frequencies were significantly different between either replicate were discarded. A program read the contingency table file to produce a file that contains for each SNP: (1) the SNPID, (2) the VectorBase Gene ID (e.g., AAEL001234), (3) coverages of all nucleotide/indels in the first collection, (4) coverages of all nucleotide/indels in the second collection, (5) numbers of alternate nucleotides at that SNP, (6) a list of polymorphic nucleotides/indels, (7) the Aaegl3.3 reference nucleotide, (8) the alternate nucleotide (second most common nucleotide, also referred to as the minor allele), (9) mutation type (i.e., “transition,” “transversion,” “insertion,” or “deletion”), (10) whether the mutation is a synonymous or a replacement substitution for SNPs in codons, and (11) whether the SNP resides in a first, second, or third codon position, or occurs within a 5′UTR, 5′NTR, 3′UTR, 3′NTR, or an intron. Transfer RNAs were annotated using the anticodon tool at http://lowelab.ucsc.edu/tRNAscan-SE/ ().A program was written to calculate for each SNP, a between-collection component (as), a within-collection component (bs), and FST calculated from as and bs following ), where:as=4ni(p^(i,s)−p^s)2+4nj(p^(j,s)−p^s)2−bs2[2ninj/(ni+nj)]andbs=niα(i,s)+njα(j,s)ni+nj−1whereα(i,s)=2p^(i,s)(1−p^(i,s)) and α(j,s)=2p^(j,s)(1−p^(j,s))α(i,s) is also known as the expected heterozygosity (Hexp). p^(i,s) is the coverage of a nucleotide at SNP(s) divided by the total coverage of s in collection (i). ni and nj are the number of mosquitoes sampled in collections i and j, and p^s is the coverage of a nucleotide at s in both i and j collections, divided by the total coverage of s in both i and j collections. The estimate of FST for s is:FST(s)=asas+bsand, for an entire gene (g) with m SNPs, is:FST(g)=∑s=1mas∑s=1m(as+bs)Graphing and statistical metrics were carried out in R-Bioconductor (). Venn diagrams (Figure S2 and Figure S3) were prepared using the webtool Venny (). […]

Pipeline specifications

Software tools GSNAP, SAMtools, VarScan, tRNAscan-SE, VENNY
Databases VectorBase
Applications Genome annotation, Population genetic analysis
Organisms Aedes aegypti, Homo sapiens