Computational protocol: Meta-Analysis of Gene Expression Signatures Reveals Hidden Links among Diverse Biological Processes in Arabidopsis

Protocol publication

[…] Data in this study were extracted from AraPath, a database of Arabidopsis gene lists that we created (available at http://bioinformatics.sdstate.edu/arapath/). These data, which form part of the database, comprise a total of 1,065 co-expression gene lists that were manually retrieved from published papers linked to GEO before February 2011.
The analysis consists of four steps. Step 1 evaluates overlapping genes among the 1,065 gene lists. A Perl program was written to evaluate the overlapping genes between all 566,580 pairs of lists (1,065 × 1,064 / 2). An overlap is a pair of gene lists that share at least two genes. Overlaps between lists from the same paper were considered trivial and were removed. Because there were too many overlaps and microarray experiments tend to produce noisy data, we selected significant overlaps using a stringent threshold. Step 2 computes p-values and q-values to identify significant overlaps. Based on the hypergeometric distribution, we first calculated, using an R program we wrote, the probability (p-value) of observing the number of overlapping genes if the two gene lists were randomly drawn without replacement from a collection of 28,024 unique genes. The p-values were then converted into q-values based on the false discovery rate (FDR) to correct for multiple testing. Overlaps with very small q-values were considered significant; here, significant overlaps were identified using a q-value cutoff of 5.0E-9. In step 3, a network of the significant overlaps was constructed in Cytoscape from the output of step 2. Because this network includes too many nodes and edges, its big clusters had to be broken into smaller subclusters. Step 4 performs this decomposition. Many algorithms can decompose large networks into small, densely connected subnetworks. We chose a simple algorithm that is available as a Cytoscape plug-in.
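Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Perl and R programs used in the study. The function names and the input layout (a dictionary of gene sets plus a dictionary mapping each list to its paper) are hypothetical. The upper-tail form of the hypergeometric test, the Benjamini-Hochberg procedure for q-values, and correcting only over the candidate pairs are also assumptions, since the text does not name the exact tail, FDR method, or test set.

```python
from itertools import combinations
from math import comb

POPULATION = 28_024  # unique genes, as stated in the text
MIN_SHARED = 2       # an overlap needs at least two common genes
Q_CUTOFF = 5.0e-9    # q-value cutoff, as stated in the text


def hypergeom_tail(shared, size_a, size_b, population=POPULATION):
    """P(X >= shared) when size_b genes are drawn without replacement
    from `population` genes, size_a of which belong to the other list.
    Assumption: the upper tail is used as the p-value."""
    total = comb(population, size_b)
    hits = sum(
        comb(size_a, i) * comb(population - size_a, size_b - i)
        for i in range(shared, min(size_a, size_b) + 1)
    )
    return hits / total  # exact integer ratio, rounded to a float


def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order.
    Assumption: the text does not say which FDR method was used."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


def significant_overlaps(gene_lists, paper_of,
                         population=POPULATION, q_cutoff=Q_CUTOFF):
    """gene_lists: {list_id: set of gene ids}; paper_of: {list_id: paper id}.
    Returns (list_a, list_b, shared_count, q_value) for significant pairs."""
    candidates = []
    for a, b in combinations(sorted(gene_lists), 2):
        if paper_of[a] == paper_of[b]:
            continue  # overlaps from the same paper are treated as trivial
        shared = len(gene_lists[a] & gene_lists[b])
        if shared >= MIN_SHARED:
            p = hypergeom_tail(shared, len(gene_lists[a]),
                               len(gene_lists[b]), population)
            candidates.append((a, b, shared, p))
    # Assumption: the correction runs over the candidate pairs only.
    qvalues = bh_qvalues([c[3] for c in candidates])
    return [(a, b, shared, q)
            for (a, b, shared, _), q in zip(candidates, qvalues)
            if q <= q_cutoff]
```

The pairs returned by `significant_overlaps` correspond to the edges of the step 3 network, with gene lists as nodes. Exact integer arithmetic via `math.comb` keeps the sketch dependency-free; for all 566,580 pairs a statistics library would likely be faster.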
MCODE was used to identify interconnected sub-networks and their clusters within the network from step 3. In general, locally dense regions (clusters) of a graph can be found using the clustering coefficient, Ci, which measures the “cliquishness” of the neighborhood of a vertex: Ci = 2n / (ki (ki - 1)), where ki is the number of vertices in the neighborhood of vertex i and n is the number of edges in that neighborhood. The MCODE algorithm, however, clusters the main network into sub-networks by vertex weighting: all vertices are weighted by their local network density using the highest k-core of the vertex neighborhood rather than the clustering coefficient Ci. A k-core is a graph of minimal degree k, and the highest k-core of a graph is its central, most densely connected sub-graph. A highly connected vertex v in a dense region of a graph may be connected to many vertices of degree one. These low-degree vertices do not interconnect within the neighborhood of v and thus would reduce the clustering coefficient, but not the core-clustering coefficient (for details of the MCODE algorithm, see the original MCODE paper). Here we created the sub-networks and found the modules and clusters using MCODE with the following parameters: Node Score Cutoff = 0.15; k-core = 2; Degree Cutoff = 2; Max. Depth = 100. The DAVID web site was used to analyze the most significant functions of the most frequently shared genes in each sub-network. […]
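The contrast between the two coefficients can be illustrated with a small Python sketch. It is not the MCODE plug-in's code: the function names and the toy graph are hypothetical, and it assumes that the core-clustering coefficient is the density of the highest k-core of the neighborhood of v including v itself, with density taken as edges divided by the maximum number of edges in a loop-free graph.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, b) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def edge_count(adj, nodes):
    """Number of edges with both endpoints in `nodes`."""
    return sum(len(adj[u] & nodes) for u in nodes) // 2


def clustering_coefficient(adj, v):
    """Ci = 2n / (ki (ki - 1)) over the neighbors of v (v excluded)."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    return 2 * edge_count(adj, adj[v]) / (k * (k - 1))


def highest_k_core(adj, nodes):
    """Return (k, vertices) of the highest k-core of the induced subgraph."""
    sub = {u: adj[u] & nodes for u in nodes}
    best_k, best_core = 0, set(nodes)
    k = 1
    while sub:
        # Peel vertices of degree < k until every remaining degree is >= k.
        low = [u for u in sub if len(sub[u]) < k]
        while low:
            for u in low:
                for w in sub.pop(u):
                    if w in sub:
                        sub[w].discard(u)
            low = [u for u in sub if len(sub[u]) < k]
        if sub:
            best_k, best_core = k, set(sub)
        k += 1
    return best_k, best_core


def core_clustering_coefficient(adj, v):
    """Density of the highest k-core of v's neighborhood, v included
    (assumed definition; loop-free density)."""
    _, core = highest_k_core(adj, adj[v] | {v})
    size = len(core)
    if size < 2:
        return 0.0
    return edge_count(adj, core) / (size * (size - 1) / 2)


# Toy graph: v sits on a triangle a-b-c and also has four degree-one neighbors.
EDGES = [("v", "a"), ("v", "b"), ("v", "c"),
         ("a", "b"), ("b", "c"), ("a", "c"),
         ("v", "x1"), ("v", "x2"), ("v", "x3"), ("v", "x4")]
ADJ = build_adjacency(EDGES)
```

In this toy graph the four degree-one neighbors pull `clustering_coefficient(ADJ, "v")` down to 6/42 (about 0.14), while `core_clustering_coefficient(ADJ, "v")` stays at 1.0 because the highest k-core of the neighborhood is the four-vertex clique {v, a, b, c}.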

Pipeline specifications

Software tools: MCODE, DAVID
Databases: AraPath
Application: Protein interaction analysis
Organisms: Arabidopsis thaliana