Computational protocol: Unravelling the hidden DNA structural/physical code provides novel insights on promoter location

Protocol publication

[…] To be consistent with the ProStar training, we applied our predictor using ENSEMBL (v47) as the reference annotation to select TSSs located at least 1200 bp away from any other annotated TSS. As a result, we obtained a set of putative ‘false positives’, i.e. regions predicted as promoters because of their unusual physical properties but not experimentally known. We then filtered out regions that contained >70% repetitive elements according to the RepeatMasker algorithm, or that did not allow unique polymerase chain reaction (PCR) primer localization in the human genome assembly by in silico PCR BLAT search. This process yielded 119 genomic regions (1200 bp long) located around 72 putative TSSs (note that it was not always technically possible to study promoters located in both directions).

As a negative prediction set, we randomly selected 100 positions where ProStar suggested no TSS in a 1200 bp window and for which unique PCR primers could be located. To keep the test unbiased, we did not perform any filtering based on the presence of 2006 known promoters in these ProStar negative predictions. Both ProStar-positive and ProStar-negative predicted promoters were subjected to experimental validation.

The positive set was further compared against the latest gene and transcript reference annotations, GENCODE (v7) and ENSEMBL (v56), to determine the true positives. [...] We designed hybridization primers suitable for high-GC content regions. The presence of a unique hybridization site was subsequently verified by a BLAT genome alignment. Primers were ordered in 96-well plates from Sigma-Aldrich. PCR was performed in a 96-well format using AccuPrime GC-rich DNA polymerase (Invitrogen) for the amplification of selected regions. PCR products were analyzed in a 1% agarose gel.
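The two computational filters described above (distance to annotated TSSs and repeat content) can be sketched in Python. This is a minimal illustration, not the authors' pipeline. It assumes that predicted positions are compared against annotated TSS positions on a single chromosome, and that repeats are available as a soft-masked sequence in which repetitive bases are lowercase. The function names and the input layout are hypothetical.

```python
from bisect import bisect_left

MIN_DISTANCE = 1200          # bp, distance cutoff stated in the text
MAX_REPEAT_FRACTION = 0.70   # regions with >70% repeats are discarded


def far_from_annotated(predicted, annotated, min_distance=MIN_DISTANCE):
    """Keep predicted positions at least min_distance from every annotated TSS.

    Both inputs are integer coordinates on the same chromosome (assumption).
    """
    ordered = sorted(annotated)
    keep = []
    for pos in predicted:
        i = bisect_left(ordered, pos)
        # Only the neighbours on either side of the insertion point can be nearest.
        nearest = min(
            (abs(pos - ordered[j]) for j in (i - 1, i) if 0 <= j < len(ordered)),
            default=None,
        )
        if nearest is None or nearest >= min_distance:
            keep.append(pos)
    return keep


def repeat_fraction(masked_sequence):
    """Fraction of lowercase bases, assuming soft-masked repeats (assumption)."""
    if not masked_sequence:
        return 0.0
    return sum(1 for base in masked_sequence if base.islower()) / len(masked_sequence)


def passes_repeat_filter(masked_sequence, max_fraction=MAX_REPEAT_FRACTION):
    """True when the region is not dominated by repetitive elements."""
    return repeat_fraction(masked_sequence) <= max_fraction
```

The check for unique primer localization depends on an external in silico PCR/BLAT search and is not modelled here.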
Successfully amplified regions were inserted into the promoterless pGL4.21 (luc2P/Puro) vector and ligated through SfiI restriction sites (Rapid DNA Ligation Kit, Roche), which enable directional cloning. Escherichia coli competent cells (DH5α, Invitrogen) were transformed with the ligation products. Two independent colonies were selected from each transformation and verified by sequencing from both the 5′ and 3′ ends. The high-throughput experimental approach for the luciferase activity assays is outlined in Supplementary Figure S1.

Cos-7, Hek293, U2OS, MIA PACA and MDA231 cells were cultured in Dulbecco's Modified Eagle's Medium (DMEM) supplemented with 10% fetal bovine serum (FBS). All cultures were grown as a monolayer in a humidified incubator at 37°C in an atmosphere of 5% CO2. One day before co-transfection, 2–6 × 10⁴ cells per well were plated in 96-well plates with 100 µl of DMEM without antibiotics. Confluence of 90–95% was reached by the second day. Transient DNA co-transfections were performed with 0.1 µg of the corresponding pGL4.21/construct plasmid and 0.02 µg of the pGL4.74 (hRluc/TK) vector (Promega) using TransFact reagent (Promega) according to the manufacturer's instructions. DMEM supplemented with 10% FBS was added to the cells 1 h after co-transfection to allow correct growth and protein expression.

The Dual Luciferase Reporter Assay (Promega) was performed 36 h after co-transfection using a GloMax Multidetection Luminometer (Promega) with a dual injector system allowing rapid reagent addition. Light emission was measured 2 s after addition of each substrate and integrated over a 10-s interval. The firefly luciferase activity was normalized to the Renilla luciferase activity from the pGL4.74 (hRluc/TK) plasmid to account for differences in transfection efficiency.
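The normalization step amounts to dividing each firefly reading by the Renilla reading from the same well. The sketch below illustrates that arithmetic only. Averaging the ratios across replicate wells is an assumption made for illustration, and the function names are hypothetical.

```python
from statistics import mean


def normalized_activity(firefly, renilla):
    """Firefly signal divided by the co-transfected Renilla signal of the same well."""
    if renilla <= 0:
        raise ValueError("Renilla signal must be positive to normalize")
    return firefly / renilla


def construct_activity(replicates):
    """Mean normalized activity over (firefly, renilla) pairs from replicate wells.

    Averaging per-well ratios is an illustrative choice, not taken from the text.
    """
    return mean(normalized_activity(f, r) for f, r in replicates)
```

Because the ratio is taken per well, a well that received less DNA lowers both signals together and leaves the normalized value largely unchanged.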
The previously characterized SPG4 gene promoter was used to generate positive (S−621/−1) and negative (S−1290/−424) promoter region controls. Promoter activity was assessed in duplicate, and a region was considered active if its normalized activity exceeded a threshold of 3-fold the score of the negative control sequences.

After the luciferase assays, 80 regions from the positive and negative promoter sets were divided into four subsets for further analysis: subset 1 contains 20 high-confidence ProStar sequences with high luciferase activity (PS+L+); subset 2 contains 20 high-confidence ProStar sequences with low luciferase activity (PS+L−); subset 3 contains 20 low-scored ProStar sequences with luciferase activity (PS−L+); and subset 4 contains 20 low-scored ProStar sequences with no luciferase activity (PS−L−). [...]

We investigated whether different subsets, including the PS+ predictions (17 909 in total), the experimentally tested PS+ predictions (119 sequences) and PS− predictions (100 sequences), or the luciferase-positive (49 sequences) and luciferase-negative (23 sequences) regions, were enriched within their 1200 bp for any of the 885 currently annotated TFBSs. To this end, we systematically compared them with the full list of transcripts described in the BioMart database (76 905 transcripts) as a background control. To determine significant enrichment, we used Fisher's exact test and expressed the magnitude of enrichment for a given TFBS as an odds ratio. The significance threshold after Bonferroni correction for all tests was P = 0.05/885 = 5.65 × 10⁻⁵. The analyses were performed using the R statistical environment. […]
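The enrichment test can be illustrated with a small, dependency-free Python sketch. The original analyses were run in R, so this is only a rough equivalent. It assumes one 2×2 table per TFBS, laid out as subset regions with and without the TFBS versus background transcripts with and without it. That layout and the function names are assumptions, and the two-sided P-value sums all tables no more probable than the observed one.

```python
from math import comb

N_TFBS = 885                          # number of tests, from the text
ALPHA = 0.05
BONFERRONI_THRESHOLD = ALPHA / N_TFBS  # about 5.65e-5


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table [[a, b], [c, d]]."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-7))
    return min(p_value, 1.0)


def is_enriched(a, b, c, d, threshold=BONFERRONI_THRESHOLD):
    """True when the table passes the Bonferroni-corrected threshold with OR > 1."""
    return fisher_exact_two_sided(a, b, c, d) < threshold and odds_ratio(a, b, c, d) > 1
```

With a background of tens of thousands of transcripts the exact binomial coefficients become large integers, so a library routine such as R's `fisher.test` is the more practical choice for the full set of 885 tests.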

Pipeline specifications

Software tools: BLAT, BioMart
Applications: Genome annotation, Nucleotide sequence alignment
Organisms: Homo sapiens