Computational protocol: Long range control of gene expression via RNA directed DNA methylation

Protocol publication

[…] Each end of the paired-end reads was mapped to the Arabidopsis genome (TAIR10) using Bowtie, and uniquely aligned ends were then paired. Reads were assigned to 1 kb bins, and significant intra-chromosomal interactions more than 3 kb apart were called by Fit-Hi-C with ICE. These Fit-Hi-C calls were further filtered by estimating a false discovery rate (FDR). The FDR was calculated by permuting p-values such that, in a random set of interactions of equal size, only 0.05 (5%) were present at or below the p-value cutoff. In essence, all called peaks had p-values stringent enough that they would be unlikely to appear in random sets.
Alternatively, to obtain a looser view of interactions, the genome was divided into 250 bp windows (bin-mapping) and interactions between windows were counted, keeping only those more than three DpnII sites apart. This was done not only to account for non-uniform cleavage distances but also to allow overlap with uniformly binned datasets such as H3K4me2/H3K9me2 bins or DNA methylation bins. The top 5% of interactions (by read count) at the shortest bin distance, 4 bins apart, were kept, and that maximum number was applied to all subsequent bins. This approach kept only the highest-confidence short-range interactions while maintaining long-range interaction events with a minimum of two reads supporting independent ligation events. Interaction scores were calculated as the number of reads supporting the interaction between two 250 bp bins multiplied by the ratio of the total number of reads to the total number of loops called in that sample.
Pearson correlation between replicates was calculated in 25 kb bins using contact counts between bins normalized by sequencing depth. The eigenvector was calculated at 1 Mb resolution using Juicebox. Telomere association was plotted using scores from inter-chromosomal interactions.
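The bin-mapping and scoring arithmetic described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, a plain bin-distance threshold stands in for the "more than three DpnII sites apart" filter (which would need restriction-site coordinates), "loops" is assumed to mean the distinct bin pairs retained after the minimum-read filter, and the top-5% read-count cutoff is not modeled.

```python
from collections import Counter

BIN_SIZE = 250  # bp window size used in the bin-mapping approach


def bin_pair(pos_a, pos_b, bin_size=BIN_SIZE):
    """Return the (smaller, larger) bin indices for the two ends of a read pair."""
    a, b = pos_a // bin_size, pos_b // bin_size
    return (a, b) if a <= b else (b, a)


def count_interactions(read_pairs, min_bin_distance=4):
    """Count read pairs per bin pair.

    Assumption: pairs fewer than `min_bin_distance` bins apart are dropped,
    as a simplified stand-in for the DpnII-site distance filter.
    """
    counts = Counter()
    for pos_a, pos_b in read_pairs:
        a, b = bin_pair(pos_a, pos_b)
        if b - a >= min_bin_distance:
            counts[(a, b)] += 1
    return counts


def interaction_scores(counts, total_reads, min_reads=2):
    """Score = supporting reads * (total reads / total loops called).

    Assumption: a "loop" is any bin pair with at least `min_reads` reads.
    """
    loops = {pair: n for pair, n in counts.items() if n >= min_reads}
    if not loops:
        return {}
    scale = total_reads / len(loops)
    return {pair: n * scale for pair, n in loops.items()}


if __name__ == "__main__":
    # Made-up coordinates on one chromosome, for illustration only.
    pairs = [(100, 1300), (120, 1260), (10, 2600), (0, 300)]
    counts = count_interactions(pairs)
    print(interaction_scores(counts, total_reads=len(pairs)))
```

Because the score scales read support by the reads-per-loop ratio of each sample, the thresholds quoted in the text (for example, a score of at least 10) are not raw read counts.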
High-scoring interactions (value ≥ 10) were used for overlaps between features to compare between replicates unless otherwise indicated. Overlaps with interaction counts were normalized to total interactions for each where applicable, and the raw overlap counts and normalization scheme are provided for each. Significance scores for each were calculated by a two-sided paired t-test among biological replicates. Genomic regions tested for overlap with Hi-C data were first filtered based on their mappability in Hi-C, by keeping only those features with reads in the decrosslinked control from Wang et al. Filtered inactive and active regions were further checked for similar digestion efficiency by comparing their read counts in the decrosslinked control. This ensured that lower mappability was not affecting the results and that the restriction digestion used to obtain mapped Hi-C reads was able to penetrate heterochromatic regions.
Interaction enrichment plots between promoters and DMRs were plotted by: 1) filtering out interactions <10 kb apart; 2) keeping reads that overlap bins surrounding the TSS on one side and DMRs on the other; 3) taking the average or sum of each interaction bin in the matrix of bins according to the distance from the TSS and from the DMR; 4) repeating steps 1–3 with TSSs connected to bins surrounding random regions to calculate a randomized average, and subtracting these average values from the matrix. Z-axis values and color scores were then plotted from the resulting values as a relative score between the minimum and maximum. Interaction overlaps between promoters of differentially expressed genes and distal nrpe1 DMRs were taken using an interaction score ≥ 5 to obtain as many overlapping high-quality interactions as possible.
[...] RNA from ago4 seedlings was isolated and rRNA-depleted in three biological repeats as described, and libraries were prepared by the University of Michigan Sequencing Core.
Reads were mapped to the TAIR10 genome assembly using TopHat, and differential expression was called using edgeR. Previously published RNA-seq datasets from seedlings (Col-0 wild-type and the nrpe1 mutant) (GSE38464) were obtained from plants grown, harvested, isolated, rRNA-depleted, and sequenced in parallel to the ago4 dataset. Overlaps in differential expression were calculated from the edgeR results and plotted as a weighted Venn diagram using the venneuler package in R. […]
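The overlap step can be illustrated with a short sketch that derives the three region sizes a weighted two-set Venn diagram needs. This is an assumption-laden example rather than the published analysis: the protocol performs this in R, the FDR cutoff of 0.05 is a placeholder (the excerpt does not state the threshold), and the gene identifiers and values are invented.

```python
def de_genes(results, fdr_cutoff=0.05):
    """Return the set of gene IDs whose FDR is below the cutoff.

    `results` maps gene ID -> (log fold change, FDR), mimicking a
    differential-expression results table. The cutoff is an assumed value.
    """
    return {gene for gene, (_log_fc, fdr) in results.items() if fdr < fdr_cutoff}


def venn_counts(set_a, set_b):
    """Sizes of the three regions of a two-set Venn diagram."""
    return {
        "a_only": len(set_a - set_b),
        "b_only": len(set_b - set_a),
        "both": len(set_a & set_b),
    }


if __name__ == "__main__":
    # Invented placeholder results for two mutants compared with wild type.
    ago4 = {"g1": (2.0, 0.01), "g2": (-1.5, 0.2), "g3": (1.1, 0.03)}
    nrpe1 = {"g1": (1.8, 0.001), "g3": (0.2, 0.6), "g4": (-2.2, 0.02)}
    print(venn_counts(de_genes(ago4), de_genes(nrpe1)))
```

Counts of this shape (each set alone plus the intersection) are the kind of input an area-proportional Venn plotting tool works from.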

Pipeline specifications

Software tools: TopHat, edgeR, venneuler
Databases: TAIR
Organisms: Arabidopsis thaliana
Diseases: RNA Virus Infections