Computational protocol: Whole Transcriptome RNA-Seq Analysis of Breast Cancer Recurrence Risk Using Formalin-Fixed Paraffin-Embedded Tumor Tissue

Protocol publication

[…] With the exception noted below, all primary analysis of sequence data was performed in CASAVA 1.7, the standard data processing package from Illumina. De-multiplexing of sample indices allowed 1 mismatch, to separate the two samples within each lane. To address 3′ end adapter contamination, random RT primer artifacts, and 5′ end terminal-tagging oligonucleotide artifacts, raw FASTQ sequences were trimmed (6 bases from the 5′ side and 8 bases from the 3′ side) before mapping to the human genome (UCSC release, version 19). Mapping started from a single seed of 32 base pairs with two mismatches allowed; gap penalties were allowed using ELAND2 provided by Illumina. The libraries as prepared contain strand-of-origin (directional) sequence information. Annotated RNA counts (defined by refFlat.txt from UCSC) were calculated by CASAVA 1.7 both with and without consideration of strand-of-origin information. Although strand information is retained in the mapping process, CASAVA does not provide directional counts by default. These counts were obtained by splitting the mapped (export.txt) file into two parts, one with sense strand counts and the other with antisense strand counts, and processing them independently. Raw FASTQ sequence was mapped with Bowtie in parallel with CASAVA to count ribosomal RNA transcripts.

Data were analyzed in 3 categories: first, RefSeq RNAs, about 80% of which are exon sequences, consolidated for each gene; second, intronic RNA sequences, consolidated for each gene; third, intergenic sequences, operationally defined as non-RefSeq, non-intronic sequences (data for this study have been deposited in the Dryad Repository). RNAs for which none of the 136 specimens yielded 5 or more counts were excluded from analysis. Of 21,283 total RefSeq transcripts counted by CASAVA, 821 had a maximum count less than 5, leaving 20,462 RefSeq transcripts for analysis. Similar to a recently published procedure described by Bullard et al., log2 raw RNA counts (setting the log2 for a 0 count to 0) were normalized by subtracting the 3rd quartile of the log2 RefSeq RNA counts and adding the cohort mean 3rd quartile (“3rd quartile normalization”). For normalization of RefSeq and intergenic RNA data, RefSeq transcript data were used. For normalization of intronic RNA data, intronic transcript data were used. Third quartile normalization effectively mitigated trends in overall coverage related to sample age and produced stable estimates of expression. Relative log expression values (RLE, individual gene log2 count minus within-patient median log2 count) were centered on 0 and relatively tightly distributed around it, an indicator of effective normalization.

Standardized hazard ratios for breast cancer recurrence for each RNA, that is, the proportional change in the hazard with a 1-standard deviation increase in the normalized expression of the RNA, were calculated using univariate Cox proportional hazard regression analyses. The robust standard error estimate of Lin and Wei was used to accommodate possible departures from the assumptions of Cox regression, including nonlinearity of the relationship of gene expression with log hazard and non-proportional hazards. False discovery rates (FDR, q-values) were assessed using the method of Storey with a “tuning parameter” of λ = 0.5. Analyses were conducted to identify true discovery degree of association (TDRDA) sets of RNAs with absolute standardized hazard ratio greater than a specified lower bound while controlling the FDR at 10%. Taking individual RNAs identified at this FDR, the analysis finds the maximum lower bound for which the RNA is included in a TDRDA set. Also computed was an estimate of each RNA’s actual standardized hazard ratio corrected for regression to the mean.

Expression of 192 transcripts in the same tumor RNAs was measured using previously described RT-PCR methods.
Standardized hazard ratios associating the expression of each gene (normalized by subtracting each gene’s crossing threshold (CT) from the cohort median CT) with cancer recurrence were computed using the same methods used for evaluation of the RNA-Seq data.

Intergenic regions were identified by a novel program that accommodates intergenic regions of widely varying length and uses data from a population of subjects rather than an individual subject. Briefly, overlapping reads from all 136 patients were combined based on their human genome mapping coordinates, creating clusters of individual islands. Nearby islands were grouped by a merging tolerance criterion into regions of interest. Putative novel intergenic transcripts were then defined by filtering out transcripts with known refFlat.txt annotations. […]
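The count filter, third quartile normalization, and RLE calculation described in the protocol can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the nested-dictionary input layout, the choice to compute the 3rd quartile after low-count RNAs are removed, and the "inclusive" quantile interpolation are all assumptions made for illustration; the protocol does not specify them.

```python
import math
from statistics import median, quantiles


def log2_count(count):
    """log2 of a raw count, with the log2 of a 0 count set to 0 (as in the protocol)."""
    return math.log2(count) if count > 0 else 0.0


def third_quartile(values):
    # quantiles(n=4) returns the three quartile cut points; index 2 is Q3.
    # The interpolation method is an assumption, not taken from the protocol.
    return quantiles(values, n=4, method="inclusive")[2]


def normalize(counts, min_count=5):
    """counts: {sample: {rna: raw count}}, with the same RNAs in every sample.

    Drops RNAs for which no sample reaches min_count, then subtracts each
    sample's Q3 of log2 counts and adds back the cohort mean Q3.
    """
    samples = list(counts)
    rnas = list(counts[samples[0]])
    kept = [r for r in rnas if max(counts[s][r] for s in samples) >= min_count]
    logs = {s: {r: log2_count(counts[s][r]) for r in kept} for s in samples}
    q3 = {s: third_quartile(list(logs[s].values())) for s in samples}
    cohort_mean_q3 = sum(q3.values()) / len(q3)
    return {
        s: {r: x - q3[s] + cohort_mean_q3 for r, x in logs[s].items()}
        for s in samples
    }


def relative_log_expression(sample_values):
    """RLE: each gene's log2 value minus the within-sample median log2 value."""
    center = median(sample_values.values())
    return {r: x - center for r, x in sample_values.items()}
```

In this sketch the same reference set of RNAs is used for both filtering and scaling. To follow the protocol's rule that intergenic data are normalized with RefSeq quartiles, the per-sample Q3 values would be computed from RefSeq counts and then applied to the intergenic counts.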
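The protocol states that q-values were assessed by the method of Storey with λ = 0.5 but does not say which implementation was used. The sketch below follows the standard fixed-λ q-value procedure in plain Python, under the assumption that the input is one p-value per RNA (for example, from the univariate Cox models). The function name `storey_qvalues` is hypothetical.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Return (pi0, q-values) using Storey's estimator with a fixed tuning parameter.

    pi0 estimates the proportion of true null hypotheses from the share of
    p-values above lam; q-values are then made monotone from the largest
    p-value downward.
    """
    m = len(pvalues)
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))

    # Indices of the p-values in ascending order of p.
    order = sorted(range(m), key=lambda i: pvalues[i])

    qvalues = [0.0] * m
    running_min = 1.0  # q-values are capped at 1
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return pi0, qvalues
```

RNAs with a q-value at or below 0.10 would correspond to the 10% FDR threshold mentioned in the protocol; the TDRDA analysis and the regression-to-the-mean correction are separate steps that this sketch does not attempt.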
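The island-and-region procedure for intergenic sequences can be sketched as two passes of interval merging followed by an annotation filter. This is an illustrative reconstruction, not the authors' program: the function names, the half-open `(start, end)` coordinate convention, the restriction to a single chromosome with strand ignored, and the tolerance value used in any call are assumptions.

```python
def merge_intervals(intervals, tolerance=0):
    """Merge (start, end) intervals whose gap is at most `tolerance` bases.

    With tolerance=0, overlapping or directly adjacent intervals are merged.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + tolerance:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def overlaps_any(region, annotated):
    """True if the half-open region overlaps any annotated interval."""
    start, end = region
    return any(start < a_end and a_start < end for a_start, a_end in annotated)


def candidate_intergenic_regions(reads, annotated, tolerance):
    """reads: mapped read intervals pooled across all subjects (one chromosome).

    annotated: known transcript intervals (for example, parsed from refFlat.txt).
    """
    islands = merge_intervals(reads)                # overlapping reads -> islands
    regions = merge_intervals(islands, tolerance)   # nearby islands -> regions
    return [r for r in regions if not overlaps_any(r, annotated)]
```

Pooling reads from every subject before merging is what lets a region be defined from the population rather than from one individual, as the protocol describes.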

Pipeline specifications

Software tools: BaseSpace, ELAND, Bowtie
Application: RNA-seq analysis
Organisms: Homo sapiens
Diseases: Breast Neoplasms, Neoplasms
Chemicals: Formaldehyde