Computational protocol: Deep Sequencing of MYC DNA-Binding Sites in Burkitt Lymphoma

Similar protocols

Protocol publication

[…] Approximately 200 ng of ChIP-DNA was used as template for generating an Illumina sequence library (Illumina, San Diego, CA, USA). The DNA was not further size fractionated and directly taken for adaptor ligation, using a standard Illumina genomic library preparation kit. Briefly, DNA was end-repaired using a mix of T4 DNA polymerase, E. coli DNA Pol I large fragment (Klenow polymerase) and T4 polynucleotide kinase. The blunt, phosphorylated ends were treated with Klenow fragment and dATP to yield a protruding 3′-‘A’ base for ligation of Illumina adapters which have a single ‘T’ base overhang at the 3′- end. After adapter ligation fragments of ∼250–350 bp (insert plus adaptor sequences) were isolated from an agarose gel and were PCR amplified with Illumina primers for 15 cycles. The purified DNA was captured on an Illumina flow cell for cluster generation. These libraries were submitted to high-throughput sequencing on the Illumina Genome Analyzer II (GAII).The resulting sequence reads were mapped to the human reference genome (hg19, GRCh37) using Bowtie . Only reads that mapped uniquely with the Bowtie default setting (http://bowtie-bio.sourceforge.net/manual.shtml#the-n-alignment-mode) for mismatches were considered for further analysis (Bowtie option –m 1).The detection of genomic regions enriched by ChIP versus input control was conducted with HOMER (v2.6) for each experiment individually . Unique reads were directionally extended in the 3′-direction to a length of 300 base pairs. HOMER fits a local Poisson distribution to the input tags and tests the sequence depth corrected tag counts for being differentially expressed. This effectively removes peaks with low tag counts for which there is a chance that differential enrichment is found simply due to sampling error. Only ChIP regions with a p-value of less than 1e-6 under this local Poisson distribution were considered as putative peaks.All discovered putative peaks were merged into one list of putative peak regions that were detected in at least one experiment. A matrix containing the number of reads for every experiment in each putative peak region was assembled. DESeq (v1.4.0) was employed to test the number of reads for being differential over all ChIP versus input samples . The 5 different cell lines were considered as biological replicates in order to find common transcription factor binding sites. A negative binomial distribution was fitted to the inputs and tested for being differential in ChIP samples for every peak. The normalization of the number of reads, i.e. the estimation of the size factors for DESeq, was carried out for input controls and ChIP samples separately. Only peak regions with a FDR below 1e-4 were kept for further analysis.The remaining peaks were ranked by their FDR and annotated with ChIPpeakAnno using Ensembl Biomart employing the Ensembl Genes 59 database and the GCRh37 (hg19) dataset (http://aug2010.archive.ensembl.org/index.html) . The ChIP-Seq files of all experiments are available via the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/geo/) under the accession number GSE30726.In order to compare our ChIP-Seq data to previously published MYC-binding sites, we additionally analyzed the data provided by Zeller et al. . For this purpose we converted the genomic hg17 genome coordinates of the published data to hg19 using LiftOver (http://genome.ucsc.edu/cgi-bin/hgLiftOver).The overlap of genomic intervals determined by ChIP-Seq with genomic features, such as transcriptional start sites or exons, was calculated utilizing RegionMiner (Genomatix Software, Munich, Germany, Version: ElDorado 02-2010, Matrix Library 8.2) . Furthermore, RegionMiner was applied using default settings (i) to calculate the distance correlations with the genomic intervals and the genome-wide experimentally verified transcriptional start regions (TSRs), which were derived from the mapping of selected full length cDNAs (source: GenBank) and CAGE tags (source: FANTOM3, http://fantom3.gsc.riken.jp/) using the oligo capping method, (ii) to determine the occurrence of E-boxes, (iii) to perform an orthologous region search in 6 placental mammalian (Eutheria) species (Pan troglodytes, Macaca mulatta, Bos tarus, Canis familaris, Mus musculus and Rattus norvegicus) and (iv) to determine the overrepresentation (in comparison to the genome and to promoter regions defined as −500 to +100 bp around the start sites of the 159,075 annotated transcripts in ElDorado comprising Refseq, GenBank full-length transcripts and Ensembl records) of the MYC motive as discovered by CoreSearch.CoreSearch (Genomatix) with default settings was employed for a de novo calculation of a MYC-binding motive using the 100 highest ranked genomic intervals . CoreSearch starts with a search for a highly conserved core sequence (called “tuple” in the original publication) which occurs in almost all of the input sequences. In most cases this initial search defines more than one core. Consecutive selection steps are employed in order to reduce the number of core candidates. The final selection is based on maximization of the information content (consensus index), first of the core and then of regions around the core. [...] SiRNA-mediated knock-down of MYC was performed employing the BL cell lines Raji, BL41 and Blue1 in order to detect changes of MYC-driven gene expression. For this purpose, the cells were Amaxa-transfected using MYC smart pool siRNA and control siRNA (Thermo Scientific/Dharmacon, Erembodegem, Belgium), respectively. Resulting down-regulation of MYC protein expression was monitored by immunoblot analysis using a monoclonal rabbit MYC antibody (clone Y69, Epitomics, Burlingame, USA).Total RNA of MYC siRNA-treated and control siRNA-treated cells was extracted employing the RNeasy Midi kit according to standard protocols (Qiagen, Hilden, Germany). Affymetrix GeneChip analysis (HG-U133A) was performed according to the manufacturer's recommendations, starting with 5 µg total RNA. CEL files were generated with the help of the GCOS 1.3 software (Affymetrix). The CEL files of all experiments are available via the Gene Expression Omnibus (GEO) of the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov/geo/) under the accession number GSE30726.For analysis of Affymetrix data the expression levels were normalized using VSN . Limma was used to fit a model for the effect of treatment, cell type and experiment batch . Probe sets were tested for differential expression in cell lines with and without MYC siRNA treatment using an empirical Bayesian method. P-values were corrected for multiple testing by the method proposed by Benjamini and Hochberg . A corrected p-value of <0.05 and a fold-change >1.2 were used as thresholds.The differential expression of selected genes was confirmed by Western blotting as previously described with antibodies against CD20 (clone L26, Dako, Glostrup, Denmark), CD79a (clone JCB117, Dako), A20 (own-production), BLNK (sc-8003, Santa Cruz, Santa Cruz, CA USA), CIAP2 (ab32059, Abcam, Cambridge, UK), PAX5 (M7303, Dako), and Actin (ab6276, Abcam) as a control. [...] Ensembl annotations (hg19, GRCh37, Ensembl Genes 59) closest to or overlapping with the 7,054 genomic intervals were determined employing the Bioconductor package ChiPpeakAnno (http://www.bioconductor.org/packages/2.5/bioc/html/ChIPpeakAnno.html). The nearest transcription start region (TSR) was used as reference point. By default, the distance is calculated as the distance between the start of the binding site and the TSR that is the gene start for genes located on the forward strand and the gene end for genes located on the reverse strand.Ensembl annotations next to significant MYC-binding sites () and the genes differentially expressed as indicated by our MYC siRNA experiments () were uploaded to DAVID (The database for annotation visualization and integrated discovery) bioinformatics resources (http://david.abcc.ncifcrf.gov/) , . We calculated the most overrepresented (enriched) biological annotations using DAVID default conditions. The total GRCh37 Ensembl gene annotations () and the total U133A Affymetrix gene Ids were used as background distribution for the analysis of ChIP-Seq and U133A data, respectively. […]

Pipeline specifications

Software tools Bowtie, HOMER, DESeq, ChIPpeakAnno, liftoveR, CoreSearch, limma, DAVID
Databases ElDorado
Application ChIP-seq analysis
Organisms Homo sapiens
Diseases Burkitt Lymphoma, Neoplasms, Lymphoma, B-Cell