Computational protocol: Examining non-syndromic autosomal recessive intellectual disability (NS-ARID) genes for an enriched association with intelligence differences

Protocol publication

[…] Secondly, a gene-based statistic was derived by combining the effects of all SNPs within a gene and its 50 kb boundary. Combining the effects of multiple SNPs has the potential to capture a greater proportion of variance, which can increase power. Gene-based statistics were derived using VEGAS, in which a test statistic is calculated as the sum of the SNP test statistics within a gene region, with linkage disequilibrium (LD) taken into account using the HapMap phase II CEU (NCBI build 36 release 22) reference panel for each gene and its 50 kb boundary. The statistical significance of this statistic is calculated using simulations. With 40 genes in the NS-ARID set, the alpha level was 0.00125 (i.e., 0.05/40). [...] Thirdly, a gene-set analysis was performed. In gene-set analysis, genetic data are aggregated from multiple genes that share certain biological, functional, or statistical characteristics. This aggregation reduces the multiple testing burden, because the whole gene set forms the statistical unit of association, making it possible to detect small but consistent deviations from the chance level of association. Gene-set analysis has also been shown to increase statistical power because, as in gene-level analysis and in contrast to single-marker analysis, the effects of multiple SNPs are summed. Gene-set analysis can be subdivided into self-contained testing and competitive testing, which differ in the null hypothesis being tested. Self-contained tests examine whether the a priori gene-set shows an association with the trait of interest, whereas competitive tests examine whether the a priori gene-set shows a greater level of association than other gene sets. 
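As an illustration of the gene-based statistic described earlier in this passage (a sum of SNP test statistics whose significance is obtained by LD-aware simulation), the following is a minimal sketch only, not the VEGAS implementation. The function name `gene_based_p`, the use of 1-df chi-square statistics as input, the add-one adjustment to the empirical p-value and the toy LD matrix are all assumptions made for illustration.

```python
import numpy as np

# Bonferroni-corrected alpha for the 40 genes of the NS-ARID set.
ALPHA = 0.05 / 40


def gene_based_p(chi2_stats, ld_corr, n_sim=10_000, seed=0):
    """Sum SNP statistics in a gene and estimate significance by simulation.

    chi2_stats : 1-df chi-square statistics for the SNPs in the gene region
                 (assumed input format).
    ld_corr    : SNP-by-SNP LD correlation matrix from a reference panel.
    """
    rng = np.random.default_rng(seed)
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    observed = float(chi2_stats.sum())

    # Draw correlated z-scores under the null; their squares have the same
    # LD-induced dependence as the observed SNP statistics.
    z = rng.multivariate_normal(np.zeros(len(chi2_stats)), ld_corr, size=n_sim)
    simulated = (z ** 2).sum(axis=1)

    # Add-one adjustment (a convention of this sketch) avoids reporting p = 0.
    p_value = (np.sum(simulated >= observed) + 1) / (n_sim + 1)
    return observed, float(p_value)


# Toy example: three SNPs, two of them in strong LD (hypothetical values).
ld = np.array([[1.0, 0.8, 0.1],
               [0.8, 1.0, 0.1],
               [0.1, 0.1, 1.0]])
stat, p = gene_based_p([6.5, 5.9, 1.2], ld)
print(stat, p, p < ALPHA)
```

Because the null draws share the LD structure of the gene region, correlated SNPs do not count as independent evidence, which is the reason LD is accounted for in the simulation.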
As there are more low p-values in a GWAS than would be expected under the null hypothesis, self-contained tests will inflate the type 1 error rate; for this reason competitive testing is recommended.
In order to examine the NS-ARID gene set, INRICH was used. A gene-set analysis using INRICH proceeds through a number of steps. Firstly, regions of the genome showing evidence of association are identified using the clump function in PLINK. These regions were derived by selecting SNPs, termed index SNPs, with a p-value below 0.0005. Each region is then extended around its index SNP by adding SNPs that are nominally significant, lie within 250 kb of the index SNP, and are correlated with it (LD of r² > 0.5 using the HapMap2 CEU reference panel). This creates regions of the genome (genomic intervals) that show evidence of association independent of the associations found in the other regions. Genomic intervals were excluded from subsequent analysis if they did not lie within 20 kb (5′ or 3′) of any known gene according to the UCSC human genome browser hg18 assembly. Secondly, a test statistic describing the level of association between the a priori gene-set and the phenotype is derived: the number of independent genomic intervals that overlap with the a priori gene-set. Thirdly, the statistical significance of this test statistic is determined using a competitive test. This is carried out by creating permuted genomic intervals that match the independent, association-derived intervals in number of genes, SNP density and LD structure. A total of 10,000 permutations were used to derive an empirical p-value for the gene set, defined as the proportion of permuted statistics that are equal to or exceed the observed gene set statistic. [...] 
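The overlap statistic and its empirical p-value can be sketched as follows. This is a simplified illustration under stated assumptions, not the INRICH algorithm: each interval is represented as the list of genes it covers, and the permuted intervals are matched on gene count only (SNP density and LD structure are not matched here). The names `overlap_count`, `competitive_p` and `genome_genes` are hypothetical.

```python
import random


def overlap_count(intervals, gene_set):
    """Number of intervals that cover at least one gene of the target set."""
    return sum(1 for genes in intervals if gene_set & set(genes))


def competitive_p(intervals, gene_set, genome_genes, n_perm=10_000, seed=0):
    """Empirical p-value for the overlap statistic.

    intervals    : list of gene lists, one per associated genomic interval.
    gene_set     : set of gene names in the a priori gene set.
    genome_genes : all genes in genomic order (assumed single ordered list).
    """
    rng = random.Random(seed)
    observed = overlap_count(intervals, gene_set)
    hits = 0
    for _ in range(n_perm):
        permuted = []
        for genes in intervals:
            k = len(genes)
            # Place a random run of k consecutive genes: matches the gene
            # count of the observed interval (the only matching done here).
            start = rng.randrange(len(genome_genes) - k + 1)
            permuted.append(genome_genes[start:start + k])
        if overlap_count(permuted, gene_set) >= observed:
            hits += 1
    # Proportion of permuted statistics equal to or exceeding the observed one.
    return observed, hits / n_perm


# Toy genome of 1,000 genes and three associated intervals (made-up data).
genome = [f"G{i}" for i in range(1000)]
target = {"G1", "G2", "G3"}
assoc_intervals = [["G1"], ["G2", "G3"], ["G500"]]
print(competitive_p(assoc_intervals, target, genome, n_perm=2000))
```

The null here is competitive in the sense used in the text: the target set is compared with what randomly placed intervals of similar gene content would hit, rather than with a hypothesis of no association at all.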
The fourth analysis aimed to quantify the biological relationship between the genes of the NS-ARID set and to use this knowledge to test the systems and pathways that reflect these processes for an association with intelligence. Here, Gene Relationships Across Implicated Loci (GRAIL) was used to examine the 40 genes of the NS-ARID gene set and identify common cellular processes or pathways. This was carried out using a text mining algorithm to derive a set of statistically significant keywords describing the relationship between the 40 NS-ARID genes. Given the a priori gene-set, GRAIL can identify a subset of genes that are more related than expected by chance and assign statistically significant keywords suggesting a pathway or system that unites the members of the gene-set. Importantly, this metric is derived without the use of the phenotype, meaning that potentially biased ideas about which pathways or biological functions influence the phenotype cannot dominate the analysis. Additionally, undocumented or distant relationships between the members of the gene-set can be indicated.
These keywords were derived using a database of 259,638 abstracts taken from PubMed before December 2006. This date was selected because it precedes the mainstream application of GWAS; abstracts detailing the regions identified by GWAS would be expected to confound the analysis by describing the NS-ARID gene set as being associated with NS-ARID. The GRAIL parameters applied were as follows: release 22/hg18; HapMap population: CEU; functional data source: PubMed Text (December 2006); gene size correction: on; gene list: all human genes within the database.
Each of these abstracts was converted into a vector of word counts and, for each gene, a vector of averaged word counts was derived. The relationship between any pair of genes is defined as the correlation between the two vectors of averaged word counts. 
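The similarity measure just described (word-count vectors per abstract, averaged per gene, then correlated between genes) can be illustrated with a small sketch. It follows only the description given here and is not GRAIL's code: the whitespace tokenisation, the vocabulary built from just the two genes being compared, and the function names `gene_vector` and `gene_similarity` are assumptions, and the abstracts are invented.

```python
from collections import Counter

import numpy as np


def gene_vector(abstracts, vocab):
    """Average the word-count vectors of all abstracts that mention a gene."""
    rows = []
    for text in abstracts:
        counts = Counter(text.lower().split())
        rows.append([counts[word] for word in vocab])
    return np.mean(np.array(rows, dtype=float), axis=0)


def gene_similarity(abstracts_a, abstracts_b):
    """Pearson correlation between two genes' averaged word-count vectors."""
    # Simplification: a real analysis would use one vocabulary shared by all
    # genes in the database rather than a pairwise vocabulary.
    vocab = sorted({w for t in abstracts_a + abstracts_b for w in t.lower().split()})
    va = gene_vector(abstracts_a, vocab)
    vb = gene_vector(abstracts_b, vocab)
    return float(np.corrcoef(va, vb)[0, 1])


# Invented abstracts for three hypothetical genes.
gene_a = ["synaptic vesicle release", "synaptic signalling neuron"]
gene_b = ["neuron synaptic vesicle"]
gene_c = ["kidney tubule transport"]

print(gene_similarity(gene_a, gene_b))  # shared vocabulary: positive
print(gene_similarity(gene_a, gene_c))  # no shared words: negative
```

Note that the two genes never need to appear in the same abstract: the score depends only on the words used to describe each of them.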
This means that if two genes are described using the same words they will receive a high similarity score; however, they do not need to be mentioned in the same abstract in order to be classed as similar. After the relationship between the members of the gene-set has been examined, keywords are derived. These keywords are defined as those that have the greatest weight across all of the text vectors for the genes of the gene-set. Keywords are restricted to those that appear in > 500 abstracts, contain > 3 letters and include no numbers.
Following the generation of the keywords, Gene Ontology (GO) was mined. Here, the keywords derived by GRAIL to suggest pathways or systems common to the NS-ARID gene-set were used as search terms in GO. All gene-sets with at least five human genes were extracted and examined using INRICH to discover whether these showed significant overlap with the intervals generated from the GWAS data. As multiple gene-sets are tested in this section of the study, the p-value generated for each gene-set must be corrected for the number of tests made. As the gene-sets are not independent, corrections such as the Bonferroni or false discovery rate will yield an overly conservative estimate of significance, and so a bootstrap approach was used. Firstly, one of the 10,000 permuted interval sets was selected at random to serve as the observed interval set. Secondly, the statistical significance for the interval set serving as the observed data was derived as before, by generating intervals across the genome and comparing the overlap with the gene-sets. Finally, the corrected p-value is the proportion of bootstrapped samples in which the minimum p-value over all the gene-sets is at least as significant as the p-value of the gene-set being corrected. 
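The final correction step can be sketched as a minimum-p adjustment. The sketch assumes that the bootstrap has already produced a matrix of null p-values, one row per bootstrapped sample and one column per gene-set; how that matrix is generated is not reproduced here, and the function name `min_p_correction` and the numbers used are hypothetical.

```python
import numpy as np


def min_p_correction(observed_p, bootstrap_p):
    """Corrected p-values from the distribution of the minimum null p-value.

    observed_p  : uncorrected p-values, one per gene-set.
    bootstrap_p : array of shape (n_bootstrap, n_gene_sets) holding the
                  p-values obtained when a permuted interval set is treated
                  as the observed data (assumed to be precomputed).
    """
    observed_p = np.asarray(observed_p, dtype=float)
    # Smallest p-value across all gene-sets within each bootstrapped sample.
    min_null = np.asarray(bootstrap_p, dtype=float).min(axis=1)
    # Proportion of samples whose best gene-set is at least as significant
    # as the gene-set being corrected.
    return np.array([(min_null <= p).mean() for p in observed_p])


# Four bootstrapped samples, two gene-sets (made-up numbers).
null_p = [[0.50, 0.20],
          [0.04, 0.90],
          [0.30, 0.60],
          [0.01, 0.70]]
print(min_p_correction([0.03, 0.25], null_p))
```

Because each bootstrapped sample is scored against all gene-sets at once, the overlap between gene-sets is carried into the null distribution, which is why this approach is less conservative than Bonferroni when the gene-sets are not independent.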
By using GRAIL to examine the functional relationship between the genes of the NS-ARID gene-set followed by GO to construct gene sets based on these shared functions, this series of analyses tests the hypothesis that the genes responsible for NS-ARID are functionally related to systems where common SNP variation can explain variation in intelligence. […]

Pipeline specifications

Applications: GWAS, Genome data visualization
Diseases: Movement Disorders
Chemicals: Nucleotides