Computational protocol: Genome Wide Estimates of Mutation Rates and Spectrum in Schizosaccharomyces pombe Indicate CpG Sites are Highly Mutagenic Despite the Absence of DNA Methylation

Protocol publication

[…] Sequence reads from each library were quality controlled with ea-utils and the FASTX-Toolkit to remove low-quality reads and residual adaptor sequence (workflow deposited at https://github.com/behrimg/Scripts/blob/master/Hall_Projects/Pombe/MA_Pipeline.txt). Following a previously outlined workflow, adjusted for a haploid dataset, quality-controlled (QCed) reads were then mapped to the Sc. pombe reference genome ASM294v2.24 with BWA v1.1.2, sorted and indexed with SAMtools v1.0, and assigned line identification numbers with Picard Tools v1.87. Duplicated reads were marked with Picard Tools and removed, and the remaining sequence reads were locally realigned with GATK v3.2.2. SNM and indel variants for each line and the ancestor were identified simultaneously using GATK’s UnifiedGenotyper tool with parameter settings for haploid organisms. The resulting VCF files were converted to tab-delimited text using the vcf-to-tab function of VCFtools v0.1.12a. All sequence differences between the MA ancestor, which was sequenced twice, and the reference were identified to determine the sequence of the ancestor. The differences between each MA line and the reference were then determined, and those that were also present in the ancestor were ignored. Calling a variant required a minimum of four reads, with ≥75% of the reads favoring the variant allele. Regions of the genome corresponding to centromeres, telomeres, and mating type loci (approximately 472 kbp) were excluded from the analysis to avoid inaccurate mapping. This was in addition to the two tandem rDNA repeat arrays on chromosome III, accounting for 1,465 kbp, which are excluded from the reference genome.
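The calling rule and the ancestor subtraction described above can be illustrated with a short sketch. This is not the deposited pipeline: the `Site` record, the function names, and the reading of the rule as "total depth of at least four reads and at least 75% of those reads supporting the variant" are assumptions made for illustration only.

```python
from dataclasses import dataclass

MIN_DEPTH = 4            # minimum number of reads at the site (from the text)
MIN_ALT_FRACTION = 0.75  # minimum fraction of reads favoring the variant (from the text)


@dataclass(frozen=True)
class Site:
    """Hypothetical record for one candidate variant in one sample."""
    chrom: str
    pos: int
    alt: str
    depth: int      # total reads covering the site
    alt_reads: int  # reads supporting the variant allele


def passes_filter(site):
    """Apply the depth and allele-fraction thresholds (assumed interpretation)."""
    if site.depth < MIN_DEPTH:
        return False
    return site.alt_reads / site.depth >= MIN_ALT_FRACTION


def new_mutations(line_sites, ancestor_sites):
    """Keep well-supported variants of an MA line that are absent from the ancestor."""
    ancestral = {(s.chrom, s.pos, s.alt) for s in ancestor_sites}
    return [
        s for s in line_sites
        if passes_filter(s) and (s.chrom, s.pos, s.alt) not in ancestral
    ]
```

In this sketch a variant already present in the ancestor is dropped regardless of its read support in the MA line, which mirrors the statement that ancestral differences were ignored.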
Identified SNMs and small indels were annotated using Ensembl’s Variant Effect Predictor (VEP), while flanking regions were determined using the fill-fs program from the VCFtools package. The presence of medium and large structural variants was investigated using the Delly software package, and variants that passed Delly’s QC were investigated further using the Integrative Genomics Viewer (IGV) v2.1.23. When IGV supported a structural variant call, the variant was tested with PCR.
Sequencing also allowed the detection of across-line and other microbial contamination. Across-line contamination was deemed to have occurred if any two lines shared an identical new mutation. When this happened, one of the lines (chosen by coin flip) was discarded from the remainder of the analysis. [...]
To estimate mRNA concentrations, as a surrogate for gene expression levels in our ancestor strain, we sequenced mRNA from 10 biological replicates. We selected 10 colonies, inoculated each into 3 ml liquid YPD medium, and incubated them on a rotator at 30° for 48 hr. After 48 hr, mRNA was extracted using the MasterPure Yeast RNA Purification kit (Epicentre). mRNA libraries were constructed using the Illumina TruSeq mRNA Stranded Kit, amplified using 13 cycles of PCR, and sequenced on an Illumina HiSeq 2500. Libraries were sequenced as 100 bp single-end reads (NCBI SRA BioProject: SRP065886). Sequenced reads were QCed in the same manner as genomic sequencing reads, reference-mapped with TopHat v2.0.13, and assembled with Cufflinks. The log-median fragments per kilobase of exon per million fragments mapped (FPKM) for each site across ancestor replicates was chosen to represent the level of expression at that site. [...]
Five lines were randomly selected for verification, by Sanger sequencing, of the mutations that were identified bioinformatically (Supporting Information, Table S1).
Primers were designed using Primer3, and PCR products destined for sequencing were cleaned using a standard Exo-SAP protocol and sequenced with an ABI BigDye Terminator Cycle Sequencing Kit (Applied Biosystems, Foster City, CA). Completed sequencing reactions were submitted to the Georgia Genomics Facility and analyzed using an Applied Biosystems 3730xl 96-capillary DNA Analyzer. […]
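The across-line contamination rule described earlier (any two lines sharing an identical new mutation, with one of them discarded by coin flip) can be sketched as follows. The data layout (a dictionary mapping line names to sets of mutation tuples), the function names, and the choice to skip a pair once one of its members has already been discarded are assumptions for illustration, not details taken from the source.

```python
import random
from itertools import combinations


def contaminated_pairs(mutations_by_line):
    """Return pairs of line names that share at least one identical mutation."""
    pairs = []
    for a, b in combinations(sorted(mutations_by_line), 2):
        if mutations_by_line[a] & mutations_by_line[b]:
            pairs.append((a, b))
    return pairs


def lines_to_discard(mutations_by_line, rng):
    """Pick one line per contaminated pair by a fair coin flip.

    Assumption: a pair is skipped if one of its lines was already discarded.
    """
    discarded = set()
    for a, b in contaminated_pairs(mutations_by_line):
        if a in discarded or b in discarded:
            continue
        discarded.add(rng.choice((a, b)))
    return discarded


# Hypothetical example: mutations keyed as (chromosome, position, ref, alt).
example = {
    "MA01": {("I", 100, "A", "T")},
    "MA02": {("I", 100, "A", "T"), ("II", 5, "G", "C")},
    "MA03": {("III", 9, "C", "G")},
}
dropped = lines_to_discard(example, random.Random())
```

Passing a seeded `random.Random` instance instead makes the coin flips reproducible, which may be useful when the analysis has to be rerun.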

Pipeline specifications

Software tools: TopHat, Cufflinks, Primer3
Databases: SRA
Applications: WES analysis, qPCR
Organisms: Schizosaccharomyces pombe, Saccharomyces cerevisiae
Chemicals: Cytosine Nucleotides