Computational protocol: Topologically associating domains are ancient features that coincide with Metazoan clusters of extreme noncoding conservation

Protocol publication

[…] CNEs were identified by scanning pairwise BLASTZ net whole-genome alignments for regions with a high percentage identity over a defined number of base pairs. For each comparison, both of the relevant nets (one from the perspective of each species) were scanned. Elements overlapping exonic or repetitive regions were removed. The remaining elements were then aligned against the genome using BLAT to remove elements that mapped to more than four locations in vertebrates. The resulting set of CNEs was smoothed using a sliding window (300 kb for vertebrates and 50 kb for Drosophila) to generate CNE densities, as originally done for the ANCORA browser. As the evolutionary distance between two species increases, CNEs become more difficult to identify, so less stringent thresholds are required.

[...] The set of TADs generated from an experiment depends on both the experimental protocol (e.g., restriction enzyme, sequencing depth) and the processing techniques used (e.g., bin size, algorithm). Hi-C interaction data sets for human were obtained from Gene Expression Omnibus (GEO; GSE52457) for H1-ESC (H1), MS, ME, NP and trophoblast-like cells. Reads were iteratively aligned against hg19 using bowtie. Reads mapping to chrM and chrY were removed from the analysis. The aligned reads were binned using a variety of bin and window sizes; a bin size of 20 kb and a window size of 40 kb appeared to generate a robust set of TADs. TADs were identified using both HOMER and the TAD calling pipeline (HMM_calls) proposed by Dixon et al. Mouse ESC Hi-C data were downloaded from GEO (GSE35156) and processed using the same pipeline as for human. Drosophila whole-embryo Hi-C data were obtained from GEO (GSM849422) and processed using the same pipeline. Directionality matrices for Drosophila Hi-C were generated using HOMER with a bin size of 10 kb and a window size of 20 kb.
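As a rough illustration of what a directionality index measures, the sketch below computes it per bin from a symmetric contact matrix, using the commonly stated chi-square-like form: with A the contacts to upstream bins, B the contacts to downstream bins and E = (A + B) / 2, DI = sign(B - A) * ((A - E)^2 / E + (B - E)^2 / E). This is a minimal sketch under assumptions, not the HOMER or HMM_calls implementation: the function name, the `window_bins` parameter and the toy matrix are hypothetical, and the "window size" quoted in the text is not assumed to map onto `window_bins`.

```python
def directionality_index(contacts, window_bins):
    """Per-bin directionality index for a square, symmetric contact matrix.

    contacts    : list of lists of contact counts (hypothetical input format)
    window_bins : number of bins to look upstream and downstream (assumption)
    """
    n = len(contacts)
    di = [0.0] * n
    for i in range(n):
        lo = max(0, i - window_bins)
        hi = min(n, i + window_bins + 1)
        a = sum(contacts[i][lo:i])        # contacts with upstream bins
        b = sum(contacts[i][i + 1:hi])    # contacts with downstream bins
        e = (a + b) / 2.0                 # expected count under no bias
        if e == 0 or a == b:
            continue                      # no signal or no bias: DI stays 0
        sign = 1.0 if b > a else -1.0     # positive = downstream bias
        di[i] = sign * ((a - e) ** 2 / e + (b - e) ** 2 / e)
    return di


# Toy matrix (made up): two 3-bin domains, strong contacts within, weak between.
toy = [[10 if (r < 3) == (c < 3) else 1 for c in range(6)] for r in range(6)]
toy_di = directionality_index(toy, window_bins=2)
```

In the toy matrix the index is positive at the start of each domain and negative at its end, which is the pattern a TAD caller looks for.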
Due to the low number of reads mapping uniquely to heterochromatic chromosomes (e.g., chr3RHet, chr2LHet), these chromosomes were discarded from further analyses. TADs predicted to span centromeric regions were removed.

The strength of a TAD was defined as the sum of the absolute directionality indexes within the TAD, normalised to the length of the TAD in kilobases.

High-resolution GM12878 Hi-C data were obtained from GEO (GSE63525). Contact domains that overlapped by at least 60% were collapsed to generate a set of outer-most domains.

Compartments were identified by performing principal component analysis on the Hi-C interaction matrix and examining the first principal component. TADs were classified as A or B if at least 60% of locations within them were positive or negative, respectively. A single gene was classified by examining a 5 kb window around its promoter and assigning it to the A or B compartment using the same criteria.

[...] To visualise the relationship between the identified GRBs and TADs, we produced heatmaps of genomic regions centred on the GRB and ordered by GRB size, in which the GRBs and any features that correlate with them show a characteristic funnel shape. To show the TAD data for the GRB regions on the heatmap, we used the Hi-C directionality index (positive/red when a region preferentially interacts with regions downstream, and negative/blue when it preferentially interacts with regions upstream; one TAD is typically a red region followed by a similar-sized blue region).

[...] For this analysis, we used GRBs called using mm9-galGal4 CNEs at a threshold of 70% over 50 bp. CTCF chromatin immunoprecipitation sequencing data were obtained from mouse ENCODE for 17 cell lines and tissues. Reads were aligned to the mm9 genome using bowtie and peaks were called using MACS2, with the first input replicate for each sample used as the control.
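The TAD strength definition and the 60% compartment rule described above can be sketched as follows. This is only an illustration under assumptions: the function names and input formats are hypothetical, and the first principal component is assumed to be already oriented so that positive values correspond to the A compartment, as in the text.

```python
def tad_strength(di_values, tad_length_bp):
    """Sum of absolute directionality indexes in a TAD, per kilobase of TAD."""
    return sum(abs(v) for v in di_values) / (tad_length_bp / 1000.0)


def classify_compartment(pc1_values, min_fraction=0.6):
    """Return 'A', 'B' or None from first-principal-component values.

    'A' if at least `min_fraction` of locations are positive,
    'B' if at least `min_fraction` are negative, otherwise unclassified.
    """
    n = len(pc1_values)
    if n == 0:
        return None
    positive = sum(1 for v in pc1_values if v > 0) / n
    negative = sum(1 for v in pc1_values if v < 0) / n
    if positive >= min_fraction:
        return "A"
    if negative >= min_fraction:
        return "B"
    return None


# Made-up values for illustration only.
example_strength = tad_strength([20.0, -10.0, 5.0], tad_length_bp=70_000)
example_label = classify_compartment([0.4, 0.2, 0.1, 0.3, -0.2])
```

The same `classify_compartment` call could be applied to the locations in a 5 kb window around a promoter to label a single gene.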
Where replicates were available, the intersection of peaks called on different replicates was used for the final peak set.

A consensus set of CTCF peaks was calculated by resizing all peaks to a width of 400 bp and taking the union of peaks across all 17 samples (the average CTCF peak size across all data sets investigated was 404 bp). Peaks were scored by the number of samples in which they occur. "CTCF peaks per 10 kb" tracks were calculated from the consensus peak set by counting the number of peaks occurring in overlapping 10 kb windows, with a step size of 1 kb, across the mouse genome.

CTCF peaks within 10 kb of GRB and TAD boundaries were assigned to the boundary, and classified as ‘specific’ if they were present in 1–2 samples, ‘constitutive’ if they were present in 16–17 samples and ‘intermediate’ otherwise. Enrichment was calculated relative to the proportion of these categories in the consensus peak set, and p-values were calculated for each category using a two-sided binomial test. Additionally, we investigated the effect of using different distance thresholds for assigning CTCF sites to GRB boundaries. Regardless of the threshold used (10–120 kb in steps of 10 kb), the reported enrichments remained stable. […]
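The peak classification and enrichment test described above might look like the sketch below. It rests on several assumptions not stated in the source: enrichment is taken to be the ratio of the boundary proportion to the consensus proportion, the two-sided p-value is computed exactly by summing the probabilities of all outcomes no more likely than the observed one, and all function names are hypothetical.

```python
import math


def classify_ctcf_peak(n_samples_with_peak, n_samples=17):
    """'specific' for 1-2 samples, 'constitutive' for 16-17, else 'intermediate'."""
    if n_samples_with_peak <= 2:
        return "specific"
    if n_samples_with_peak >= n_samples - 1:
        return "constitutive"
    return "intermediate"


def enrichment(k_boundary, n_boundary, k_consensus, n_consensus):
    """Category proportion at boundaries relative to the consensus set (assumed ratio)."""
    return (k_boundary / n_boundary) / (k_consensus / n_consensus)


def binom_pmf(k, n, p):
    """Binomial probability mass, computed in log space; requires 0 < p < 1."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(log_pmf)


def binom_test_two_sided(k, n, p):
    """Exact two-sided binomial p-value for k successes in n trials."""
    observed = binom_pmf(k, n, p)
    # Small relative tolerance so that ties with the observed outcome are counted.
    total = sum(pr for pr in (binom_pmf(i, n, p) for i in range(n + 1))
                if pr <= observed * (1 + 1e-7))
    return min(1.0, total)
```

For a category, `k` would be the number of boundary-assigned peaks in that category, `n` the total number of boundary-assigned peaks, and `p` the proportion of that category in the consensus peak set.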

Pipeline specifications

Software tools: Bowtie, HOMER, DI
Application: Hi-C analysis