Computational protocol: Systematic functional perturbations uncover a prognostic genetic network driving human breast cancer

Similar protocols

Protocol publication

[…] Sequencing read quality was examined using FastQC (http://www.bioinformatics.babraham.ac.uk) at three stages in the analysis pipeline: on the raw data, after trimming, and after duplicate reads were removed. Trimming of low quality reads and clipping of sequencing adapters was done using the program Trimmomatic [] and all reads shorter than 35bp after trimmer were dropped. Reads were aligned to a masked genome (hg19) using Bowtie, only keeping uniquely mapping reads, with no mismatches in the first 45bp (M = 1, N = 0, L = 45) []. Bam to Sam file conversion was done with SamTools [], and duplicate reads were removed using Picard-tools (http://picard.sourceforge.net). ChIP-Seq heatmap plots of Figure were generated through the use of NGSPlot []. Peaks were called using MACS2 [] with the False Discovery Rates (FDR) q < 0.01 with a distribution across the factors as shown in Figure and Figure shows the distribution across genomic regions. The MACS2 algorithm utilizes a dynamic Poisson distribution to capture local biases in the genomic sequence, which allows for a sensitive and robust prediction of peaks. The IGV browser [] was used to visually check called peaks and produce the ChIP-Seq traces in Figure and . Peaks were assigned to genes using PeakAnnotator in the PeakAnalyzer package where positive genes were determined by the presence of a peak in a -2kb and +2.5Kb window around the transcription start site (TSS) []. The Venn diagrams of Figure and were made with the R package VennDiagram. [...] In parallel with the ChIP sample preparation, ∼5 × 106 cells were harvested and RNA isolation was done in TRIzol reagent. Libraries were prepared for sequencing using standard Illumina TrueSeq protocols. Libraries were pooled and sequenced 51 bp on a HiSeq2000. After sequencing, basecalling and demultiplexing have been performed using the standard casava pipeline. [...] Gene expression values were derived from RNA-Seq data for MDA-MB-231 derivative LM2 cells treated with shRNA for Fra-1, E2F1, TP53, MYC, or scrambled (two biological replicates for each gene and controls). FastQC was used to evaluate read quality on raw RNA-Seq reads and trimmed reads. Trimming of low quality reads and clipping of sequencing adapters was done using the program Trimmomatic [] and all reads shorter than 35bp after trimming were dropped. Reads were aligned to the hg19 reference genome with TopHat [] version 2.0.8. Bam to Sam file conversion, sorting, indexing, and file merging was done with SamTools []. FPKM values (Fragments per Kilobase of transcript Per Million mapped reads) were calculated by Cufflinks [] version 2.1.1. FPKM data was loaded into a matrix in R and a variation filter was applied to remove genes with less than 1.5 fold minimum variation and 1 minimum absolute variation (leaving 12239 out of 23615 genes). A t-test was then performed to find genes significantly varying between scrambled and TF knockdown samples and corrected for multiple hypothesis testing using the Benjamini-Hochberg step-up FDR-controlling procedure []. Genes with a Benjamini-Hochberg FDR value < 0.1 were selected leaving 295 genes (157 down in knockdown and 138 up in knockdown). The red-blue heat map in Figure was produced after scaling each row of data to a zero to one range. A parallel heat map with the same genes was produced for ChIP-Seq data using green to indicate genes that had peaks for each ChIP-Seq factor with FDR < 0.01 within a window of -2kb / +2.5kb of the gene's TSS. The red-blue heat-map of Figure was produced for the genes of the nine-gene signature after scaling each row of data to a zero to one range. A parallel heat map with the same nine genes from the nine-gene signature was produced with the ChIP-Seq data using green to indicate genes that had peaks for each ChIP-Seq factor with FDR < 0.01 within a window of -2kb / +2.5kb of the gene's TSS. The box plot of Figure was produced by calculating meta-gene values for the control and shRNA knockdown samples where the meta-gene value for the nine-gene signature is found by taking the mean of the RNA-Seq values for all nine genes in the signature in each sample.In order to test the correlation of the nine-gene signature with proliferation, TCGA RNASeqV2 data for breast cancer (BRCA) was downloaded from TCGA data matrix access portal (http://cancergenome.nih.gov/). Proliferation score came from the TCGA BRCA paper by Ciriello et al. [], comprising data for 817 of the 1093 TCGA BRCA RNASeqV2 samples. The meta-gene value for the nine-gene signature is found by taking the mean of the log2 RNASeqV2 values for all nine genes in the signature in each sample. for all subtypes of breast cancer samples (817) and 3c for the TNBC subset (116 samples) show Pearson correlation between the meta-gene and Ciriello et al. []. Proliferation Score from a correlation test (cor.test() in R). [...] Gene Set Enrichment Analysis (GSEA) [,] was used to evaluate the association of genes bound by the 4 transcription factors with regulation and produce the enrichment plot of Figure . A gene set was made out of the 579 genes that had peaks with FDR < 0.01 associated with the TSS for all four TFs (Fra-1, E2F1, TP53, and MYC) and tested for enrichment in the RNA-Seq data of control vs. knockdown for Fra-1, E2F1, TP53, and MYC (8 vs. 8 samples). GSEA was run with 1000 permutations of the phenotype using signal-to-noise to rank genes. […]

Pipeline specifications

Software tools BaseSpace, FastQC, Trimmomatic, TopHat, SAMtools, Cufflinks, GSEA
Databases TCGA Data Portal
Application RNA-seq analysis
Organisms Homo sapiens
Diseases Breast Neoplasms, Neoplasms
Chemicals Estrogens, Purines