Computational protocol: Multi-scale chromatin state annotation using a hierarchical hidden Markov model

Protocol publication

[…] diHMM differs from existing methods in that it uses a hierarchical hidden Markov model framework, in which each level of hidden states corresponds to a distinct length scale. It can be used to analyse any number of levels of chromatin states (Methods). diHMM takes multiple ChIP-seq (chromatin immunoprecipitation with sequencing) data sets as input and outputs a genome-wide segmentation into functionally annotated, multilevel chromatin states, each corresponding to a specific length scale.

For simplicity, we focus on a two-level model (see Methods for a discussion of the extension to additional levels), where the lower level corresponds to nucleosome-level states and the upper level corresponds to broader domain-level states. Following the approach taken by ChromHMM, we first binarize each data track at 200-base pair (bp) resolution, approximately the size of a nucleosome. The combinatorial patterns of chromatin marks in the 200 bp bins are classified into a discrete set of nucleosome-level states. Domain-level states annotate the transition patterns between nucleosome-level states over regions covered by 20 consecutive 200 bp bins and thus have a 4 kb resolution. At each genomic locus, the assignments of domain-level and nucleosome-level states are interdependent: domain-level states inform the overall frequency of the different nucleosome-level states, whereas nucleosome-level states over multiple 200 bp bins provide the transitional grammar for domain-level state classification. The two levels of chromatin states are identified simultaneously using an iterative algorithm (see Methods for details). For functional analysis, we consider the combination of both levels of chromatin states.
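The relationship between the two resolutions can be illustrated with a minimal sketch. Only the 200 bp bin size and the 20-bin block length come from the text; the function names and the fixed read-count cutoff used for binarization are assumptions for illustration, not the criterion used by ChromHMM or diHMM.

```python
BIN_BP = 200                           # nucleosome-level bin size (from the text)
BINS_PER_DOMAIN = 20                   # consecutive bins per domain-level block (from the text)
DOMAIN_BP = BIN_BP * BINS_PER_DOMAIN   # 4000 bp, i.e. the 4 kb domain resolution


def binarize(counts, threshold):
    """Hypothetical binarization: mark a bin as 1 when its read count
    reaches a fixed cutoff. The real pipeline may use a different rule."""
    return [1 if c >= threshold else 0 for c in counts]


def bin_index(position_bp):
    """Index of the 200 bp bin containing a 0-based genomic position."""
    return position_bp // BIN_BP


def domain_index(position_bp):
    """Index of the 4 kb domain-level block containing a 0-based position."""
    return position_bp // DOMAIN_BP


def domain_blocks(bins):
    """Split a per-bin sequence into consecutive blocks of 20 bins,
    dropping an incomplete trailing block."""
    n_blocks = len(bins) // BINS_PER_DOMAIN
    return [
        bins[i * BINS_PER_DOMAIN:(i + 1) * BINS_PER_DOMAIN]
        for i in range(n_blocks)
    ]
```

In this sketch a position such as 4,199 falls in bin 20 and in domain block 1, which is the sense in which each domain-level state spans 20 nucleosome-level bins.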
By using a relatively small number of states in each level, diHMM can effectively capture a large number of combinatorial patterns.

We applied diHMM to annotate multi-scale chromatin states in the three ENCODE tier 1 cell lines, H1 (human embryonic stem cells), GM12878 (B cell-derived lymphoblastoid cells) and K562 (erythroleukemia cells), using a public ChIP-seq data set containing nine marks: CTCF, H3K4me3, H3K4me2, H3K4me1, H3K9ac, H3K27ac, H3K36me3, H4K20me1 and H3K27me3. Following previous studies, we chose the number of chromatin states by balancing biological complexity, model interpretability and speed. As a result, we constructed a model containing 30 nucleosome-level and 30 domain-level states. As discussed later, the results are not significantly affected by the number of chromatin states. diHMM provides genome-wide annotations of chromatin states. However, because the algorithm is not computationally efficient enough, it is infeasible to train a diHMM model on genome-wide data. Therefore, we selected a short chromosome (chromosome 17) as the training set, combining information from all three cell lines. The model was then applied to annotate the entire genome. To test the robustness of diHMM, we retrained a model on data from chromosome 20. The results are in good agreement. Compared with the nucleosome-level states, the domain-level states are less robust, likely reflecting the smaller sample size in the training data. In addition, we varied the number of nucleosome-level states (20, 25 and 35) and of domain-level states (20, 25 and 35). The resulting states are also similar.

After segmentation, consecutive identical states were stitched together, forming regions of variable size. Although the median size of a nucleosome-level state was ∼600 bp, a domain-level state may extend over more than 100 kb, as is the case for the HOXB cluster.
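The stitching step amounts to run-length encoding of the per-bin state sequence. The following is a minimal sketch, assuming states are given as one label per 200 bp bin; the function name `stitch` and the toy labels are hypothetical and not part of diHMM.

```python
from itertools import groupby
from statistics import median


def stitch(states, bin_bp=200):
    """Merge runs of identical consecutive states into
    (start_bp, end_bp, state) regions with half-open coordinates."""
    regions = []
    start = 0
    for state, run in groupby(states):
        length = sum(1 for _ in run)   # number of bins in this run
        end = start + length
        regions.append((start * bin_bp, end * bin_bp, state))
        start = end
    return regions


# Toy per-bin annotation (hypothetical labels).
toy_states = ["N1", "N1", "N1", "N2", "N1", "N1"]
toy_regions = stitch(toy_states)
toy_sizes = [end - start for start, end, _ in toy_regions]
toy_median_bp = median(toy_sizes)
```

The same function applies to domain-level states by passing `bin_bp=4000`, which is how a single stitched domain can cover far more than one 4 kb block.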
Importantly, these small- and large-scale structures were identified from a single model that decomposes the input signals into components of different spatial resolutions. [...] Existing chromatin-state annotation methods usually focus on a specific length scale. To see whether diHMM provides new insights, we selected a few representative methods and compared their results with those of diHMM. First, we compared the nucleosome-level annotations with those of ChromHMM and Segway, two widely used methods for nucleosome-level chromatin-state annotation. We applied a 30-state ChromHMM model to the same data and found that the nucleosome-level states agreed very well between diHMM and ChromHMM. Segway is a dynamic Bayesian network-based chromatin-state segmentation method. It also has a higher spatial resolution (10 bp) than ChromHMM. We compared the chromatin-state annotations identified by diHMM and Segway. As expected, the agreement between the nucleosome-level chromatin states is significantly weaker, but the overall functional annotations are quite similar.

We wondered whether similar results regarding chromatin domains could be obtained by applying traditional models with different parameter settings. To this end, we adapted ChromHMM to identify domain-level states, using two alternative approaches: (1) we divided the genome into 4 kb bins and applied a 30-state ChromHMM model to segment the genome; and (2) we first applied ChromHMM to identify nucleosome-level states (at 200 bp resolution), stitched each set of 20 consecutive bins into a block, and applied k-centre clustering to the block-wide nucleosome-state patterns. We chose k=30 so that the results were comparable.

We found significant discrepancies at the domain level between diHMM and the results of both (1) and (2). For both approaches, the domain-level segmentations were more fragmented than those of diHMM and had lower enrichment in regulatory elements.
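Approach (2) can be sketched as follows. The text does not specify how block-wide patterns were represented or which k-centre algorithm was used, so this example makes two assumptions: each block is summarized by the frequency of every nucleosome-level state within it, and clustering uses the greedy farthest-first heuristic for k-centre. All names are hypothetical.

```python
import math
from collections import Counter


def block_profiles(states, n_states, block_size=20):
    """Summarize each block of consecutive bins by the frequency of each
    nucleosome-level state (states are integers in range(n_states))."""
    profiles = []
    for i in range(0, len(states) - block_size + 1, block_size):
        counts = Counter(states[i:i + block_size])
        profiles.append([counts.get(s, 0) / block_size for s in range(n_states)])
    return profiles


def k_centre(points, k):
    """Greedy farthest-first traversal: repeatedly pick the point farthest
    from the centres chosen so far, then assign each point to its nearest
    centre. Returns (centre_indices, labels)."""
    centres = [0]  # arbitrary but deterministic first centre
    dist = [math.dist(p, points[0]) for p in points]
    while len(centres) < min(k, len(points)):
        nxt = max(range(len(points)), key=dist.__getitem__)
        centres.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    labels = [
        min(range(len(centres)), key=lambda c: math.dist(p, points[centres[c]]))
        for p in points
    ]
    return centres, labels


# Toy example: four blocks, three nucleosome-level states.
toy_states = [0] * 20 + [1] * 20 + [0] * 20 + [2] * 20
toy_profiles = block_profiles(toy_states, n_states=3)
toy_centres, toy_labels = k_centre(toy_profiles, k=3)
```

Because this representation discards the order of states inside a block, it captures composition but not the transition patterns that the diHMM domain level is described as modelling.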
In addition, although gene expression was still significantly biased among the different ChromHMM-derived domains in (1) and (2), the trend was much weaker than for diHMM. Taken together, these results suggest that the domain-level states identified by diHMM are more biologically meaningful.

Recently, a BCP model was developed to identify local domains (called BLOCKS) with similar histone modification patterns. BCP is computationally less efficient than diHMM, and we therefore trained a BCP model only on 20 kb-resolution signal on chromosome 17. This resulted in 25 BLOCKS with an average size of 3.2 Mb, about two orders of magnitude wider than those of diHMM. For comparison, we examined the diHMM domain-level state distribution near BLOCKS boundaries but were unable to find a significant association between the two, suggesting that these two methods may identify complementary chromatin structures. […]

Pipeline specifications

Software tools: ChromHMM, Segway
Application: ChIP-seq analysis