Computational protocol: Colorectal Cancer with Residual Polyp of Origin: A Model of Malignant Transformation

Protocol publication

[…] For library construction, total DNA was quantified in triplicate using the Quant-iT PicoGreen DNA Assay Kit and normalized to a minimum concentration of 2 ng/μl. A 100-ng aliquot of each sample was transferred into library preparation using the Broad Institute–developed one-well protocol. All biochemistry occurs in a single well without the need for sample transfer (the sample is reversibly immobilized to and released from magnetic beads, allowing washes and reagent addition). The one-well protocol streamlines the process and greatly reduces sample input requirements. The product is one library per sample (typical median insert size of 330 bp).

Samples were sequenced on Illumina HiSeq X instruments, producing 150-bp paired-end reads to meet a goal of 30× mean coverage. Using the Picard Informatics Pipeline, all data from a particular sample were aggregated into a single Binary Alignment/Map (BAM) file, which included all reads, all bases from all reads, and original/vendor-assigned quality scores. A pooled Variant Call Format (VCF) file was generated using the latest version of the Picard GATK software and provided for each sample batch. All whole genome sequencing data analyzed in this manuscript will be uploaded to SRA through dbGaP (accession numbers to follow). [...]

To detect somatic single nucleotide variants (SNVs) between the tumor and matched normal tissue for the 10 cases of CRC RPO+, we used four different somatic variant callers: MuTect, SomaticSniper, Strelka, and VarScan. The callers were run with default options on the normal and tumor samples from each patient. We retained the SNVs detected by at least two different callers. Variant allele frequencies for those SNVs were calculated from the sample BAM files for each patient using an in-house script.
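
The two-caller consensus rule and the allele-frequency calculation can be sketched as follows. This is a minimal illustration, not the in-house script mentioned above: the function names, the tuple representation of a variant, and the example call sets are all assumptions, and in practice the read counts would come from a pileup of each BAM file.

```python
from collections import Counter

# A variant is keyed by (chromosome, 1-based position, reference allele, alternate allele).
# The call sets below are made-up illustrations, not data from the study.


def consensus_snvs(calls_by_caller, min_callers=2):
    """Return the variants reported by at least `min_callers` distinct callers."""
    support = Counter()
    for variants in calls_by_caller.values():
        # set() guarantees that each caller votes at most once per variant
        support.update(set(variants))
    return {variant for variant, n in support.items() if n >= min_callers}


def variant_allele_frequency(alt_reads, total_reads):
    """Fraction of reads covering a site that carry the alternate allele."""
    if total_reads == 0:
        return None  # no coverage, so the frequency is undefined
    return alt_reads / total_reads


calls = {
    "MuTect": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "SomaticSniper": {("chr1", 100, "A", "T")},
    "Strelka": {("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")},
    "VarScan": {("chr4", 400, "T", "G")},
}
kept = consensus_snvs(calls)
print(sorted(kept))
```

In this toy input, only the chr1 and chr2 variants are supported by two callers, so the chr3 and chr4 calls are dropped. Matching on the full (chromosome, position, ref, alt) key is one possible choice; matching on position alone would be more permissive.
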
To annotate these SNVs, we used the Variant Effect Predictor (http://www.ensembl.org/Tools/VEP).

Tumor somatic mutation profiles for CRC RPO− were obtained from The Cancer Genome Atlas (TCGA) for 32 pMMR CRCs that were matched by stage, site, gender, and age to the 10 pMMR CRC RPO+ cases described above. The somatic mutation profiles were downloaded in Mutation Annotation Format (MAF), which classifies somatic mutations into 1 of 13 categories depending on the type and sequence position of the corresponding mutation. Somatic mutations that caused a frameshift insertion/deletion, an in-frame insertion/deletion, a missense mutation, or a nonsense mutation, or that involved a splice site, were classified as likely to impact a gene’s function. A gene was considered mutated when it had at least one somatic mutation in at least one of these categories. The somatic mutation frequencies from the WGS studies of the CRC RPO+ cases were compared with the mutation profiles reported for the matched TCGA cases, which were presumed to be mainly CRC RPO−. [...]

Total RNA was quantified using the Quant-iT RiboGreen RNA Assay Kit and normalized to 5 ng/μl. A 200-ng aliquot of each sample was transferred into library preparation, which used an automated variant of the Illumina TruSeq Stranded mRNA Sample Preparation Kit. This method preserves strand orientation of the RNA transcript. Oligo dT beads were used to select mRNA from the total RNA sample, followed by heat fragmentation and cDNA synthesis from the RNA template. The resultant cDNA went through library preparation (end repair, base “A” addition, adapter ligation, and enrichment) using Broad-designed indexed adapters substituted in for multiplexing. After enrichment, the libraries were quantified by quantitative polymerase chain reaction using the KAPA Library Quantification Kit for Illumina Sequencing Platforms and then pooled equimolarly.
The entire process is in 96-well format, and all pipetting is done by either Agilent Bravo or Hamilton Starlet.

Pooled libraries were normalized to 2 nM and denatured using 0.1 N NaOH prior to sequencing. Flowcell cluster amplification and sequencing were performed according to the manufacturer’s protocols using either the HiSeq 2000 or HiSeq 2500. Each run was a 101-bp paired-end run with an 8-base index barcode read. Data were analyzed using the Broad Picard Pipeline, which includes demultiplexing and data aggregation. [...]

The paired-end RNA-Seq FASTQ files were then analyzed using Mayo Clinic’s standard RNA-Seq application, MAP-RSeq v.2.0.0 (http://bioinformaticstools.mayo.edu/research/maprseq/). MAP-RSeq integrates a suite of open-source bioinformatics tools along with in-house–developed methods to analyze paired-end RNA-Seq data. Read alignment was performed with TopHat, which uses Bowtie, a fast, memory-efficient short-sequence aligner. The reads were aligned to the transcriptome (Ensembl GTF) and to the genome (hg19) to report both existing and novel expressed regions. The BAM file produced by TopHat was processed using featureCounts to summarize expression at the gene and exon levels. Reads per kilobase per million mapped reads (RPKM) values were calculated from the raw gene counts produced by featureCounts, incorporating the total number of aligned reads and the coding length of each gene.
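
For reference, the standard RPKM definition divides a gene's raw read count by its length in kilobases and by the library size in millions of aligned reads. The sketch below applies that definition; the function name and the example numbers are hypothetical, and the exact inputs used by MAP-RSeq (for example, how coding length is derived) are not specified in the text.

```python
def rpkm(raw_count, gene_length_bp, total_aligned_reads):
    """RPKM = raw_count * 1e9 / (total_aligned_reads * gene_length_bp).

    The factor 1e9 combines the per-kilobase (1e3) and per-million-reads (1e6) scalings.
    """
    if gene_length_bp <= 0 or total_aligned_reads <= 0:
        raise ValueError("gene length and total aligned reads must be positive")
    return raw_count * 1e9 / (total_aligned_reads * gene_length_bp)


# Hypothetical values: 500 reads on a 2,000-bp gene in a library of 25 million aligned reads.
value = rpkm(500, 2000, 25_000_000)
print(value)  # 10.0
```

Because gene length sits in the denominator, the same raw count yields a different RPKM whenever two annotations disagree on a gene's length, which is why annotation differences matter when comparing RPKM values across pipelines.
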
To identify possible quality control issues, RSeQC software was used to detect abnormalities within each sample, such as asymmetrical gene body coverage, high levels of read duplication, and low saturation levels of known exon junctions.

The TCGA colorectal adenocarcinoma and rectal adenocarcinoma RNA-Seq expression data were obtained from https://tcga-data.nci.nih.gov/docs/publications/coadread_2012/, and their annotation file was obtained from https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/other/GAF/GAF_bundle/outputs/TCGA.Sept2010.09202010.gaf. RPKM values are directly affected by gene length, so only genes with a similar length in the TCGA annotation file and the Mayo Clinic annotation file (ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz) were used in this analysis. Specifically, genes whose lengths in the two references were within 10% of each other were kept. For these genes, the RPKM values were extracted from the TCGA-analyzed expression data and from the Mayo-analyzed expression data. Note that the bioinformatics tools used to align reads and calculate expression for the TCGA-analyzed samples were different from those used for the Mayo-analyzed samples. All genes that had no expression in the TCGA-analyzed or the Mayo-analyzed samples were removed. Mean and standard deviation values were calculated from the RPKM values across the TCGA-analyzed group and across the Mayo-analyzed group. Genes that had a standard deviation greater than the mean were also removed from this analysis to avoid evaluating highly variable genes, such as circadian rhythm genes. The sample-specific RPKM values from the Mayo-analyzed samples and from the TCGA colorectal adenocarcinoma and rectal adenocarcinoma samples with similar clinical characteristics were then extracted and visualized in a heatmap.
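
The three gene filters described above (similar length, nonzero expression, and standard deviation not exceeding the mean) can be sketched as below. Several details are assumptions, since the text does not specify them: the 10% tolerance is taken relative to the longer of the two lengths, the sample standard deviation is used, each filter is applied to the TCGA and Mayo groups separately, and all gene names and values are invented for illustration.

```python
from statistics import mean, stdev


def similar_length(len_a, len_b, tolerance=0.10):
    """True when two gene lengths differ by at most `tolerance` of the longer one (assumed rule)."""
    return abs(len_a - len_b) <= tolerance * max(len_a, len_b)


def passes_expression_filters(values):
    """Keep a group only if the gene is expressed and its SD does not exceed its mean."""
    if all(v == 0 for v in values):
        return False  # no expression in this group
    return stdev(values) <= mean(values)  # sample SD; an assumption


def filter_genes(lengths_tcga, lengths_mayo, rpkm_tcga, rpkm_mayo):
    """Return the sorted gene names that pass the length and expression filters in both groups."""
    kept = []
    for gene in sorted(set(lengths_tcga) & set(lengths_mayo)):
        if not similar_length(lengths_tcga[gene], lengths_mayo[gene]):
            continue
        groups = (rpkm_tcga.get(gene, []), rpkm_mayo.get(gene, []))
        # stdev needs at least two samples per group
        if all(len(g) >= 2 and passes_expression_filters(g) for g in groups):
            kept.append(gene)
    return kept


# Hypothetical lengths (bp) and per-sample RPKM values
lengths_tcga = {"GENE_A": 1000, "GENE_B": 1000, "GENE_C": 2000, "GENE_D": 1500}
lengths_mayo = {"GENE_A": 1050, "GENE_B": 1300, "GENE_C": 2000, "GENE_D": 1500}
rpkm_tcga = {"GENE_A": [10, 12, 11], "GENE_B": [5, 6, 7],
             "GENE_C": [0, 0, 0], "GENE_D": [0.1, 20, 0.2]}
rpkm_mayo = {"GENE_A": [9, 10, 11], "GENE_B": [5, 5, 6],
             "GENE_C": [1, 2, 1], "GENE_D": [3, 4, 5]}

kept = filter_genes(lengths_tcga, lengths_mayo, rpkm_tcga, rpkm_mayo)
print(kept)  # ['GENE_A']
```

Here GENE_B fails the length check, GENE_C has no expression in one group, and GENE_D is too variable in one group, so only GENE_A would be carried forward to the heatmap.
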
The heatmap was constructed using the ComplexHeatmap package in R, available through Bioconductor. All RNA-seq data analyzed in this manuscript will be uploaded to SRA through dbGaP (accession numbers to follow). […]

Pipeline specifications

Software tools: Picard, TopHat, Bowtie, Subread, RSeQC
Databases: TCGA Data Portal
Application: RNA-seq analysis
Diseases: Neoplasms, Colorectal Neoplasms, Adenomatous Polyps