Protocol publication

[…] DNA was extracted from whole blood with a typical phenol:chloroform method and stored at 4°C in 10 mM TrisCl, 1 mM EDTA (pH 8.0) as previously described. Library preparation for DNA sequencing was also accomplished as previously described. Briefly, 2 μg of ovine genomic DNA was fragmented and used to make indexed, 500 bp, paired-end libraries. Pooled libraries were sequenced with a massively parallel sequencing machine and high-output kits (NextSeq500, 2 × 150 bp paired-end reads, Illumina Inc.). Pooled libraries with compatible indexes were repeatedly sequenced until 40 GB of data with greater than Q20 quality was collected for each ram, thereby producing at least 10-fold mapped read coverage for each index. This level of coverage provides scoring rates and accuracies that exceed 99%.
The DNA sequence alignment process was similar to that previously reported. FASTQ files were aggregated for each animal, and the DNA sequences were aligned individually to Oar_v3.1 with the Burrows-Wheeler Alignment tool (BWA) aln algorithm version 0.7.12, merged, and collated with the bwa sampe command. The resulting sequence alignment map (SAM) files were converted to binary alignment map (BAM) files and subsequently sorted with SAMtools version 1.3.1. Potential PCR duplicates were marked in the BAM files using the Genome Analysis Toolkit (GATK) version 3.6. Regions in the mapped dataset that would benefit from realignment due to small indels were identified with the GATK module RealignerTargetCreator and realigned using the module IndelRealigner. The BAM files produced at each of these steps were indexed using SAMtools. The resulting indexed BAM files were made immediately available via the Intrepid Bioinformatics genome browser, with groups of animals linked at the USDA, ARS, USMARC internet site. The raw reads were deposited at NCBI BioProject PRJNA324837.
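The alignment and realignment steps described above can be sketched as a list of command lines. This is a minimal illustration and not the authors' actual script: the file-naming scheme, the reference FASTA name, and the GATK jar path are all hypothetical. The duplicate-marking step is left out because the text does not give its invocation, so here realignment runs directly on the sorted BAM file.

```python
def alignment_commands(sample, ref="Oar_v3.1.fa", gatk_jar="GenomeAnalysisTK.jar"):
    """Build per-animal command lines as argument lists (nothing is executed).

    All file names are hypothetical placeholders derived from `sample`.
    """
    fq1, fq2 = f"{sample}_R1.fastq", f"{sample}_R2.fastq"
    sai1, sai2 = f"{sample}_R1.sai", f"{sample}_R2.sai"
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    intervals = f"{sample}.intervals"
    realigned = f"{sample}.realigned.bam"
    gatk = ["java", "-jar", gatk_jar]
    return [
        # Align each read file separately with the aln algorithm.
        ["bwa", "aln", "-f", sai1, ref, fq1],
        ["bwa", "aln", "-f", sai2, ref, fq2],
        # Pair the two alignments into one SAM file.
        ["bwa", "sampe", "-f", sam, ref, sai1, sai2, fq1, fq2],
        # Convert SAM to BAM, then sort and index.
        ["samtools", "view", "-b", "-o", bam, sam],
        ["samtools", "sort", "-o", sorted_bam, bam],
        ["samtools", "index", sorted_bam],
        # Find intervals near small indels, then realign them (GATK 3.x syntax).
        gatk + ["-T", "RealignerTargetCreator", "-R", ref,
                "-I", sorted_bam, "-o", intervals],
        gatk + ["-T", "IndelRealigner", "-R", ref, "-I", sorted_bam,
                "-targetIntervals", intervals, "-o", realigned],
        ["samtools", "index", realigned],
    ]


if __name__ == "__main__":
    for cmd in alignment_commands("ram01"):
        print(" ".join(cmd))
```

Returning argument lists rather than running the tools keeps the sketch inspectable; each list could be passed to `subprocess.run(cmd, check=True)` on a machine where the tools are installed.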
Mapped datasets for each animal were individually genotyped with the GATK UnifiedGenotyper with arguments “--alleles” set to the VCF file, “--genotyping_mode” set to “GENOTYPE_GIVEN_ALLELES”, and “--output_mode” set to “EMIT_ALL_SITES”. Lastly, some SNP variants were identified manually by inspecting the target sequence with IGV software version 2.1.28 (described below in the Methods section entitled ‘Identifying protein variants encoded by GDF9, BMP15, and BMPR1B genes’). In these cases, read depth, allele count, allele position in the read, and quality score were taken into account when the manual genotype determination was made. [...]
Genotypes from a set of 163 reference SNPs were used as an initial verification of the WGS datasets. These DNA markers have been used for parentage determination, animal identification, and disease traceback. The 163 reference SNPs were previously genotyped across the MSDPv2.4 by multiple overlapping PCR-Sanger sequencing reactions, multiplexed matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) genotyping assays, and 50 k bead array platforms. The genotype call rate was defined as the number of SNP sites with three or more mapped reads, divided by the total number of sites tested. The error rate in the WGS data was estimated by comparing the independently derived consensus genotypes for these SNPs to the WGS genotypes. An animal’s WGS dataset passed initial verification when the accuracy of the WGS genotypes exceeded 97% and the average mapped read depth was proportional to the amount of WGS data collected. Datasets that failed this initial verification were inspected for contaminating and/or missing files. Once the problem was identified, the dataset was corrected and reprocessed. Linear regression analysis was accomplished in Excel version 2016. Access to the sequence was made available via the USDA, ARS, USMARC internet site.
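The call-rate definition and the 97% accuracy threshold above can be illustrated with a short sketch. The function names, the genotype encoding (two-letter strings such as "AG", with `None` for a missing call), and the example data are assumptions made for illustration. The second criterion in the text, read depth proportional to the amount of data collected, is not modelled here.

```python
def call_rate(read_depths, min_depth=3):
    """Fraction of tested SNP sites covered by at least `min_depth` mapped reads."""
    called = sum(1 for depth in read_depths if depth >= min_depth)
    return called / len(read_depths)


def accuracy(wgs_genotypes, consensus_genotypes):
    """Fraction of comparable sites where the WGS genotype matches the consensus.

    Sites with a missing call (None) in either dataset are skipped.
    Allele order is ignored, so "AG" and "GA" count as the same genotype.
    """
    compared = [
        (wgs_genotypes[site], consensus)
        for site, consensus in consensus_genotypes.items()
        if consensus is not None and wgs_genotypes.get(site) is not None
    ]
    if not compared:
        raise ValueError("no sites available for comparison")
    matches = sum(1 for wgs, ref in compared if sorted(wgs) == sorted(ref))
    return matches / len(compared)


def passes_accuracy_check(acc, threshold=0.97):
    """True when accuracy exceeds the threshold stated in the text."""
    return acc > threshold


if __name__ == "__main__":
    depths = [12, 0, 5, 2, 9]  # hypothetical read depths at five sites
    wgs = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": None}
    consensus = {"snp1": "GA", "snp2": "CC", "snp3": "CT", "snp4": "AA"}
    print(call_rate(depths))
    print(accuracy(wgs, consensus))
```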
Because the raw datasets were available online as they were produced, the raw FASTQ files were deposited in the NCBI SRA only after they were validated as described above. These 96 sets of files may be accessed through BioProject PRJNA324837 in the Project Data table under the Resource Name: SRA Experiments.
SNPs from the OvineSNP50 BeadChip (Illumina Inc.) were selected for comparison because they were numerous, uniformly distributed across the ovine genome, and available. Based on the nucleotide sequence of the 54,242 probes obtained from the manufacturer, the positions of 51,796 SNPs were verified via a BLAT process, as previously described. Of these, 50,357 mapped uniquely to autosomes and were used for analysis. The genotypes from the WGS data were compared to those from the 50 k bead array with a custom program written specifically for this operation. [...]
The nucleotide variation in the exon regions of GDF9, BMP15, and BMPR1B was visualized through the public access portal at ARS USMARC with open source software installed on a laptop computer. Variants were recorded manually in a spreadsheet as previously described. Briefly, a Java Runtime Environment version 8, update 131 (Oracle Corporation, Redwood Shores, CA) was first installed on the computer. When links to the data were selected from the appropriate web page, IGV software version 2.1.28 automatically loaded from a third-party site (University of Louisville, Louisville, KY) and the mapped reads were loaded in the context of the ovine Oar_v3.1 reference genome assembly. Gene variants were viewed by loading WGS data from a set of eight animals of different breeds, and the IGV browser was directed to the appropriate genome region by entering the gene abbreviation in the search field (e.g., GDF9). The IGV zoom function was used to view the first exon at nucleotide resolution with the “Show translation” option selected in IGV.
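The selection of bead-array SNPs that map uniquely to autosomes, described earlier in this section, might look like the following sketch. It assumes the BLAT output has already been reduced to acceptable hits and parsed into (SNP ID, chromosome, position) tuples; the match criteria are not given in the text. Chromosome labels "1" through "26" for the sheep autosomes and all names in the code are illustrative assumptions.

```python
from collections import defaultdict

# Sheep have 26 pairs of autosomes; the string labels are an assumed naming scheme.
AUTOSOMES = {str(n) for n in range(1, 27)}


def unique_autosomal(hits):
    """Keep SNPs with exactly one mapped location, and only if it is autosomal.

    `hits` is an iterable of (snp_id, chromosome, position) tuples.
    Returns {snp_id: (chromosome, position)}.
    """
    locations = defaultdict(list)
    for snp_id, chrom, pos in hits:
        locations[snp_id].append((chrom, pos))
    return {
        snp_id: locs[0]
        for snp_id, locs in locations.items()
        if len(locs) == 1 and locs[0][0] in AUTOSOMES
    }


if __name__ == "__main__":
    example_hits = [
        ("snpA", "1", 100),   # unique, autosomal: kept
        ("snpB", "X", 200),   # unique, but on X: dropped
        ("snpC", "3", 300),   # two locations: dropped
        ("snpC", "7", 400),
        ("snpD", "26", 500),  # unique, autosomal: kept
    ]
    print(unique_autosomal(example_hits))
```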
Since GDF9 was in the reverse orientation with regard to the Oar_v3.1 assembly, the reference sequence was reversed so the translation was correctly viewed from right to left. The exon sequences were visually scanned for polymorphisms that would alter amino acid sequences, such as missense, nonsense, frameshift, and splice site variants. Once identified, the nucleotide position corresponding to a protein variant was viewed and recorded for all 96 animals. Using IGV, codon tables, and knowledge of the ovine GDF9, BMP15, and BMPR1B protein sequences (NP_001136360.2, NP_001108239.1, and NP_001009431.1, respectively), the codons affected by nucleotide alleles were translated into their corresponding amino acids and their Oar_v3.1 positions noted.
Haplotype-phased protein variants were unambiguously assigned in individuals that either 1) were homozygous for all variant sites or 2) had exactly one heterozygous variant site. Maximum parsimony phylogenetic trees were manually constructed from the unambiguously phased protein variants. The phylogenetic trees were used, together with simple maximum parsimony assumptions, to infer haplotype phase in seven rams where two heterozygous variant sites occurred in GDF9. The protein phylogenetic trees were rooted by comparing the variable residues in sheep to those from related species. Ovine peptide sequences for GDF9, BMP15, and BMPR1B were used to search NCBI's refseq_protein database with BLASTP 2.6.1. Aligned protein sequences from a representative subset of 29 vertebrate species were used for the comparison. […]
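The rule for unambiguous haplotype assignment stated above (all sites homozygous, or exactly one heterozygous site) can be expressed as a small function. This is a sketch with a hypothetical data layout: each variant site is a pair of amino acid residues, one per chromosome copy. It returns no answer for animals with two or more heterozygous sites, which the text resolves separately using the parsimony trees.

```python
def phase_unambiguous(genotypes):
    """Return the two haplotypes when phase is unambiguous, else None.

    `genotypes` is a list of (residue_1, residue_2) pairs, one per variant
    site, in site order. With zero or one heterozygous site, the pairing of
    alleles across sites is fully determined.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    if len(het_sites) > 1:
        return None  # phase cannot be read directly from the genotypes
    hap1 = tuple(a for a, _ in genotypes)
    hap2 = tuple(b for _, b in genotypes)
    return hap1, hap2


if __name__ == "__main__":
    # One heterozygous site (E/K): the phase is unambiguous.
    print(phase_unambiguous([("V", "V"), ("E", "K"), ("R", "R")]))
    # Two heterozygous sites: the function declines to phase.
    print(phase_unambiguous([("V", "I"), ("E", "K"), ("R", "R")]))
```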

Pipeline specifications

Software tools: BWA, SAMtools, GATK, IGV, BLAT, BLASTP
Databases: Oar
Applications: Phylogenetics, WGS analysis
Organisms: Ovis aries
Chemicals: Amino Acids