Computational protocol: Integrative model of genomic factors for determining binding site selection by estrogen receptor-α

Protocol publication

[…] We initially analyzed the presence of the ERE motif in the ERα-binding sites identified in this study, according to a previously described method. We then sought to leverage our ChIP-seq data to construct the ERE motif in greater detail using the TherMoS algorithm (Sun et al, manuscript in preparation). Instead of the traditional position-specific scoring matrix, the algorithm fits an explicit thermodynamic model of TF–DNA binding, a position-specific energy matrix (PSEM), to ChIP-seq data. Each PSEM entry ΔΔGij represents the free energy contribution of nucleotide i at position j in the binding site. The total binding energy (G-score) of any particular n-mer is obtained by summing the free energy contributions of its nucleotides. In the case of palindromic motifs for homodimer binding, it is convenient to split the G-score into the contributions from the left (GL) and right (GR) half sites. The probability that a given DNA sequence is bound, i.e., the ‘occupancy’ of the sequence, is given by […], where τ is a scale factor proportional to the intranuclear TF concentration. We used this thermodynamic model to quantify ER-binding affinity at 16 043 binding regions identified by ChIP-seq in the non-amplified MCF-7 genome. We also systematically analyzed the set of TFs that modulate estrogen receptor function by using MDscan to examine co-occupant proteins that might be enriched at the ERE half sites or no-ERE sites defined by TherMoS analysis. Genome coordinates of ER-binding sites in MCF-7 and T47D cells and the corresponding background ERE sets, together with binding affinity scores, are available at the website http://www.gis.a-star.edu.sg/~liue/sup/ and as […].

[…] We used logistic regression to assess how well various chromatin features discriminated between ER-bound and non-bound regions in three different scenarios, or ‘classification tasks’.
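As a minimal illustration of the G-score summation described in this excerpt, the Python sketch below adds up per-position free energy contributions for a toy site and splits the total into left and right half-site scores. The matrix values, the 4-bp site length, and the names TOY_PSEM, g_score, and half_site_scores are all hypothetical. A real ERE model would be fitted by TherMoS, whose implementation is not shown in the source, and the occupancy formula is deliberately not reproduced here.

```python
# Hypothetical PSEM for a 4-bp toy site: one dict per position j, mapping
# nucleotide i to its free energy contribution. All numbers are invented
# for illustration and are NOT taken from the fitted ERE model.
TOY_PSEM = [
    {"A": 0.0, "C": 1.2, "G": 0.8, "T": 1.5},
    {"A": 1.1, "C": 0.9, "G": 0.0, "T": 1.4},
    {"A": 1.3, "C": 0.0, "G": 1.0, "T": 0.7},
    {"A": 0.6, "C": 1.0, "G": 1.2, "T": 0.0},
]


def g_score(site, psem):
    """Sum the per-position free energy contributions of an n-mer."""
    if len(site) != len(psem):
        raise ValueError("site length must match the number of PSEM positions")
    return sum(column[base] for base, column in zip(site.upper(), psem))


def half_site_scores(site, psem, left_len):
    """Split the G-score into left (GL) and right (GR) half-site parts.

    Assumption: the two half sites are adjacent in this toy example;
    any spacer between half sites in a real motif is ignored here.
    """
    g_left = g_score(site[:left_len], psem[:left_len])
    g_right = g_score(site[left_len:], psem[left_len:])
    return g_left, g_right


if __name__ == "__main__":
    g_left, g_right = half_site_scores("CGCA", TOY_PSEM, left_len=2)
    print(g_left, g_right, g_score("CGCA", TOY_PSEM))
```

Because the G-score is a plain sum over positions, the left and right half-site scores in this sketch always add up to the full-site score.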
The features were either an ER affinity score (see main text) or the tag count from a ChIP-seq library downsampled to the minimal library size, 7 million tags for MCF-7 libraries or 12.5 million tags for T47D libraries, for unbiased comparison of predictive chromatin marks. Logistic regression was performed using the ‘lrm’ command in R. The predictive performance of the resulting models was summarized using precision/recall and receiver operating characteristic (ROC) curves generated using the ROCR package for R. In classification task 1, 70% of the data was used for model construction and the remaining 30% was used as the test set. For the TherMoS ER affinity scores, we needed to make sure that data that had been used to fit the thermodynamic model was not in any way involved in the fitting or evaluation of the predictive models. Therefore, we fitted a PSEM five times, each time using 80% of the data set, and used the resulting PSEM to score the remaining 20%. Each of these five non-overlapping sets of 20% was then divided into a training set (70%) for fitting the logistic regression model and a test set (30%) for evaluating the accuracy of the model. The resulting ROC and precision-recall curves are thus averages over five rounds of this procedure. […]
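The nested partitioning described above can be sketched as follows. This is a Python illustration under stated assumptions, not the authors' R code: it only produces index sets, and the PSEM fitting, the ‘lrm’ logistic regression, and the ROCR curves are not shown. The function name nested_splits, the random shuffling, and the seed are hypothetical choices, since the source does not say how the partitions were drawn.

```python
import random


def nested_splits(n_items, n_folds=5, train_frac=0.7, seed=0):
    """Build index sets for the two-level split described in the text.

    For each of n_folds rounds:
      psem_fit - the ~80% of items used to fit the PSEM
      lr_train - 70% of the held-out ~20%, for fitting the classifier
      lr_test  - the other 30% of the held-out set, for evaluation
    Assumption: items are assigned to folds by a seeded random shuffle.
    """
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)

    # Non-overlapping folds that together cover every item exactly once.
    folds = [indices[k::n_folds] for k in range(n_folds)]

    splits = []
    for k, held_out in enumerate(folds):
        # Everything outside fold k is available for fitting the PSEM.
        psem_fit = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # held_out is already in shuffled order, so slicing is a random split.
        cut = int(round(train_frac * len(held_out)))
        splits.append({
            "psem_fit": psem_fit,
            "lr_train": held_out[:cut],
            "lr_test": held_out[cut:],
        })
    return splits


if __name__ == "__main__":
    for round_no, split in enumerate(nested_splits(100), start=1):
        # No item used to fit the PSEM may appear in classifier train/test.
        scored = set(split["lr_train"]) | set(split["lr_test"])
        assert not scored & set(split["psem_fit"])
        print(round_no, len(split["psem_fit"]),
              len(split["lr_train"]), len(split["lr_test"]))
```

The property that matters is that, within each round, the regions scored by a PSEM are disjoint from the regions used to fit it, so the affinity feature is never evaluated on its own training data.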

Pipeline specifications

Software tools: MDscan, ROCR
Applications: Miscellaneous, ChIP-seq analysis
Chemicals: Estrogens