Computational protocol: Comparative genomics using teleost fish helps to systematically identify target gene bodies of functionally defined human enhancers

Protocol publication

[…] To associate each of the selected human CNE-enhancers with its bona fide target gene, we analyzed the neighboring genomic context using the UCSC [] and Ensembl [] genome browsers and drafted a locus map depicting the flanking genes across an interval of at least 2 Mb on either side of each CNE-enhancer (Additional file : Figure S1). The sequence similarity of the selected CNEs between the human and teleost fish genomes suggests that they are functional in both lineages, so it is reasonable to assume that their target genes are also the same in both. Under this assumption, the human synteny maps bearing these CNE-enhancers were compared with the currently available teleost genomes (zebrafish, Tetraodon, stickleback, medaka and Fugu) using the Multi-species view option of the Ensembl genome browser []. This allowed us to map carefully the genomic context of the evolutionarily conserved human enhancers in the corresponding zebrafish, Tetraodon, medaka, stickleback and Fugu loci (Additional file : Figure S1). Across these anciently diverged genomes (human and teleost fish, >450 Mya), uninterrupted physical linkage between a CNE-enhancer and one or more neighboring genes was taken as evidence of functional association (Additional file : Table S1 and Additional file : Figure S1). To further confirm these associations, for each gene showing evolutionarily conserved physical linkage with a CNE-enhancer, the endogenous expression pattern of the mouse ortholog was obtained from MGI [], with preference given to expression data obtained by RNA in situ hybridization. Reporter gene expression driven by each selected CNE-enhancer was obtained from the VISTA enhancer browser database [].
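The conserved-linkage criterion described above (a gene is a candidate target only if it remains a neighbor of the CNE-enhancer in human and in every teleost examined) can be sketched as a set intersection. This is a minimal illustration, not the procedure used in the study, which relied on manual inspection of browser views. The function name, the gene labels and the example locus are hypothetical, and the sketch assumes that orthologs have already been mapped to shared identifiers and that each set lists only genes within the chosen flanking interval. It does not test whether the linkage is uninterrupted.

```python
from typing import Dict, Iterable, Set


def conserved_neighbors(neighbors_by_species: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the genes that flank the CNE-enhancer in every listed species.

    Keys are species names; values are the ortholog identifiers of the genes
    found in the flanking interval of the enhancer in that species.
    """
    gene_sets = [set(genes) for genes in neighbors_by_species.values()]
    if not gene_sets:
        return set()
    # A gene is kept only if it is present in the neighborhood in all species.
    return gene_sets[0].intersection(*gene_sets[1:])


# Hypothetical locus: gene labels are placeholders, not data from the study.
example_locus = {
    "human": {"geneA", "geneB", "geneC", "geneD"},
    "zebrafish": {"geneB", "geneC"},
    "medaka": {"geneB", "geneE"},
    "stickleback": {"geneB", "geneC"},
}
candidates = conserved_neighbors(example_locus)
print(sorted(candidates))
```

Genes surviving the intersection would then be the ones whose mouse expression patterns are compared with the reporter expression of the enhancer.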
We manually compared the image data of transgenic mouse embryos expressing the LacZ reporter gene under the control of each CNE-enhancer with the RNA in situ hybridization-based endogenous expression data of the genes residing in the neighborhood of the enhancer (Additional file : Table S2).

Duplicated copies of the selected CNE-enhancers (dCNEs) were identified through BLAST-based similarity searches at the Ensembl and UCSC genome browsers [,]. We grouped the duplicated enhancers into three categories: those with duplicated copies only in the fish lineage (a single counterpart in human), those with duplicated copies only in human (a single counterpart in fish), and those with duplicated copies in both the fish and human lineages (Figure A). Duplicated CNE-enhancers further allowed us to link them explicitly to their target genes through paralogy mapping, i.e. by identifying genes that have paralogs in the genomic regions harboring at least two dCNEs from the same family. Paralogy relationships among the target genes of duplicated enhancers were obtained from the paralogy prediction pipeline of the Ensembl genome browser, in which maximum likelihood phylogenetic gene trees (generated by TreeBeST) play a central role [].

[...] To establish the central nervous system (CNS)-specific transcription factor (TF) code, we selected the subset of human CNE-enhancers (159/192) shown to drive expression in various domains of the mouse CNS (Additional file : Table S4). For this purpose, phylogenetic footprinting was applied to orthologous human and mouse enhancer regions to track the occurrence of evolutionarily conserved groupings of transcription factor binding sites (TFBSs) in this experimentally verified subset of brain-specific enhancers (Additional file : Table S4).

Mouse orthologs of the human enhancers were obtained through BLAST-based similarity searches.
Human-mouse conserved transcription factor binding sites in each CNE-enhancer were detected with the program ConSite []. The ConSite screen for conserved TFBSs was performed against the JASPAR database with an 85% conservation cutoff, a 60 bp window size and a 75% transcription factor score threshold.

To track cooperative heterotypic interactions among distinct sets of TFs within brain-specific enhancers, statistical methods were employed for their identification and verification. We formulated a multivariate data matrix with n rows (the sample of enhancers) and p columns (the TFs) for each of the training and control data sets (for the control data set see Additional file : Table S5). To reflect the known biological background that occurrences of TFs in a sample of enhancers are not mutually exclusive, the repeated occurrence of each TF was quantified by its individual probability of occurrence in the sample, $P(TF)_i$. To look for patterns and structure among the TFs, the training data matrix of 159 enhancers across 14 TFs, $\bar{X}_{159 \times 14}$, and the control data matrix of non-conserved/non-coding elements, $\bar{Y}_{100 \times 14}$, were first subjected to a two-step exploratory data analysis. The two steps were the computation of the TF probabilities (Table ) and of the correlation matrices $R_X = [r_{ij}^2]$ (lower diagonal in Additional file : Table S6) and $R_Y = [r_{ij}^2]$ (upper diagonal in Additional file : Table S6) for the training and control data sets, respectively. The probability table (Table ) classifies the TFs by $P(TF)_i$: those with $P(TF)_i < 0.5$ form group-1 and those with $P(TF)_i \geq 0.5$ form group-2, in both the training and control data sets.

The correlation matrices are used to define clusters of TFs that covary among all possible pairs of TFs.
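The two exploratory steps above (per-TF occurrence probabilities with the 0.5 grouping, and a matrix of squared pairwise correlations) can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' code. The text does not give the estimator for P(TF)_i, so here it is assumed to be the fraction of enhancers containing at least one site for that TF. The function names, the TF labels and the toy 4 × 3 count matrix are hypothetical stand-ins for the 159 × 14 and 100 × 14 matrices.

```python
from math import sqrt


def tf_probabilities(matrix):
    """Assumed estimator: fraction of enhancers (rows) with >= 1 site for each TF (column)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(1 for row in matrix if row[j] > 0) / n_rows for j in range(n_cols)]


def split_by_probability(names, probs, cutoff=0.5):
    """Group-1 holds TFs with P < cutoff, group-2 holds TFs with P >= cutoff."""
    group1 = [name for name, p in zip(names, probs) if p < cutoff]
    group2 = [name for name, p in zip(names, probs) if p >= cutoff]
    return group1, group2


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def squared_correlation_matrix(matrix):
    """Matrix of r_ij^2 between all pairs of TF columns."""
    columns = list(zip(*matrix))
    return [[pearson(a, b) ** 2 for b in columns] for a in columns]


# Toy data: rows are enhancers, columns are TFBS counts for three placeholder TFs.
tf_names = ["TF_a", "TF_b", "TF_c"]
counts = [
    [1, 0, 2],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
probs = tf_probabilities(counts)
group1, group2 = split_by_probability(tf_names, probs)
r_squared = squared_correlation_matrix(counts)
print(probs, group1, group2)
```

Running the same functions on the training and control matrices separately would give the two probability columns and the two triangles of a combined table like the one the text describes.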
For this purpose, the squared correlation coefficient ($R = [r_{ij}^2]$) is interpreted, because it indicates meaningful and practical covariation among the variables (Additional file : Table S6) [,].

Principal Component Analysis (PCA), a powerful multivariate exploratory tool, is used to identify patterns in high-dimensional (p), interrelated data sets and to express the data in a way that highlights their similarities and differences. For interrelated multivariate data sets, the appropriate application of PCA is an eigen analysis of the correlation matrix $R = [r_{ij}]$. PCA was therefore used to construct an informative graphical representation of the data by projecting them onto a lower-dimensional space: the control and training data are presented in the three-dimensional (3D) subspace of the first three PCs [,]. Each PC derived from the eigen analysis of the correlation matrix ($R = [r_{ij}]$) is a linear combination of the original p variables (the TFs) and is uncorrelated with every other PC, so the PCs are the data re-expressed in terms of the patterns present in the original data set. The number of PCs equals the number of original variables. The p PCs are ordered by decreasing share of the total variance in the data set. Thus the first three PCs, which capture most of the variation in the data set, are visualized in a 3D representation. The coefficients of the variables in each linear combination (i.e. in each PC) are called loadings. The magnitude of a loading represents the importance of the corresponding variable in that PC.
Thus, a 3D representation of the loadings of the first three PCs identifies any cluster structure present among the variables (the TFs), revealing the co-occurrence pattern of TFs in the control and training data sets.

The comparison of control and training data is of major significance for validating clusters of known TFs that are highly represented in human brain-specific enhancers. […]
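The PCA step described above (eigen analysis of the correlation matrix, PCs ordered by decreasing variance, and the loadings of the first three PCs used as 3D coordinates for the TFs) can be sketched with NumPy. This is a minimal sketch, not the software used in the study, which the text does not name. The function name is hypothetical, and the random Poisson matrix is only a stand-in with the same shape as the training data (159 enhancers × 14 TFs). Loadings are taken here as the raw eigenvector coefficients, matching the definition in the text; some packages instead scale them by the square root of the eigenvalue.

```python
import numpy as np


def pca_from_correlation(data):
    """PCA by eigen-decomposition of the correlation matrix R = [r_ij].

    Returns eigenvalues (descending), loadings (columns are PCs) and the
    PC scores of the standardized data.
    """
    data = np.asarray(data, dtype=float)
    corr = np.corrcoef(data, rowvar=False)      # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]           # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Standardize columns so the scores correspond to correlation-based PCs.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    scores = z @ eigvecs
    return eigvals, eigvecs, scores


# Stand-in data only: random counts with the shape of the training matrix.
rng = np.random.default_rng(0)
toy = rng.poisson(lam=1.5, size=(159, 14))

eigvals, loadings, scores = pca_from_correlation(toy)
explained = eigvals / eigvals.sum()   # share of total variance per PC
loadings_3d = loadings[:, :3]         # one 3D point per TF
scores_3d = scores[:, :3]             # one 3D point per enhancer
print(explained[:3].round(3))
```

Plotting `loadings_3d` for the training and control matrices side by side would be one way to look for the TF clusters the text refers to; TFs with similar loadings on the first three PCs fall close together.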

Pipeline specifications

Software tools: TreeBeST, ConSite
Applications: Phylogenetics, Genome data visualization
Organisms: Mus musculus, Homo sapiens