Computational protocol: Analysis pipeline for the epistasis search – statistical versus biological filtering

Similar protocols

Protocol publication

[…] Regression-based approaches are mostly used to model and test interactions. The regression approach has been implemented in the epistasis module of PLINK () to test pair-wise diallelic by diallelic epistasis for both quantitative and binary traits. An extension of the PLINK epistasis module, FastEpistasis, uses an efficient parallel computation algorithm to test pair-wise interactions. FastEpistasis is 15 times faster than PLINK using a single core computer (). proposed an approach for joint association analyses allowing for pair-wise interactions based on logistic models; their approach uses an exhaustive search among single-nucleotide polymorphisms (SNPs) meeting some low marginal significance threshold. The software package PLATO can perform linear or logistic regression interaction analysis, calculating the full model, the reduced model, and the likelihood ratio test comparing the two ().The advantages of regression-based approaches are the clear interpretation of the model and the parameters that relate genotypes to phenotype. However, regression-based approaches have many technical and computational disadvantages for testing higher-order interactions and require many more tests: the number of parameters to be tested increases exponentially with the number of SNPs in the model.Model-free approaches, such as machine learning and pattern recognition, afford an alternative strategy, and are capable of detecting high-dimensional non-linear interactions. This approach generally does not estimate parameters. It finds combinations of SNPs that can best separate cases and controls associated with the disease by epistatic interactions or joint effects. Some model-free approaches collapse high dimensional data into two dimensions, such as the combinatorial partitioning method (CPM; ), restricted partition method (RPM; ), set association (), and multifactor dimensionality reduction (MDR;).Unsupervised pattern recognition has also been used to detect interactions. proposed a method for family based studies to detect differentially inherited SNP modules by hierarchically clustering SNPs that could be interactively associated with a disease. They first construct a genomic context-based SNP network based on adjacency on the chromosome. The association between each SNP and disease is evaluated on the basis of mutual information between SNP identity by descent sharing and affection status sharing of pairs of siblings. Then they use a hierarchical clustering algorithm to find risk SNP modules (clusters) for which discriminative scores are locally maximal. In each module, the SNPs are within a certain network distance (defined as the number of edges separating connected SNPs), and the discriminative score of a module is the maximum mutual information of the SNPs in the module, reflecting the risk associated with the module.A likelihood ratio-based Mann–Whitney approach () and its extension () are other non-parametric methods for detecting interaction. They use a multi-locus Mann–Whitney statistic to evaluate the joint association of a SNP combination. Using a computationally efficient forward selection algorithm makes these methods feasible for genome-wide gene–gene interaction analyses. Nevertheless, they require at least one SNP in the combination to have a significant marginal association. The non-parametric approaches do not suffer from the issue of an increasing number of parameters when modeling high-order interactions, but it is difficult to determine how the detected SNP combinations affect the disease, either via the single marker associations or via their interactions.Some studies test marker–marker interactions by testing linkage disequilibrium (LD) in the diseased population (), or test the contrast of LD or Pearson correlation in cases and controls (; ). These methods are based on the idea that, if two unlinked markers are interactively associated with a disease, the two markers will have LD patterns in the disease population. If controls are not studied, these methods assume that the controls do not exhibit similar patterns. [...] Interaction models that are constructed based on specific biological knowledge are more likely to make sense. Research over the last several decades has accumulated vast amounts of biological information that is stored in public databases. These include gene ontology annotation, gene–gene interaction databases, pathways, disease related gene networks and systems, as shown in Table . This information can greatly assist GWAS to find epistatic interactions. Many recent studies have used such biological knowledge and databases for filtering in their interaction studies. The databases have helped identify biological pair-wise interactions among SNPs in pathways, and hence new associations and potential drug targets. For example, generated genome-wide SNP pairs based on multiple biological pathways such as KEGG, STRING, T2DGADB, etc.Biofilter is an analysis pipeline that catalogs biological information by integrating data from the Reactome, KEGG, GO, DIP, Pfam, Ensembl, and NetPath (; ). It can build SNP–SNP models based on known interactions between genes and proteins in curated pathways and networks. utilized the Biofilter software to look for epistasis contributing to the risk of virologic failure. Approximately two million SNP–SNP interaction models were produced by Biofilter, and tested these models by using logistic regression via the software package PLATO. They identified interactions between SNPs in the TAP1 and ABCC9 genes. identified five significant GxG interactions associated with cataract using Biofilter. studied multiple sclerosis susceptibility with Biofilter, identifying gene–gene interactions of susceptibility loci involved in the central nervous system and neuron function. used Biofilter to detect associations with low density lipoprotein cholesterol level, identifying 11 significant GxG interactions, eight of which were replicated in a second cohort. In each of these examples, Biofilter generated biologically plausible gene–gene and SNP–SNP interaction models that were replicated in an independent study.Some studies reduce the number of tests by performing a gene-based, as opposed to a SNP-based, interaction test. combined the SNP-wise P-values to form a gene-wise P-value for each gene (such as using the minimum P-value for the gene), and superimposed the gene-wise P-values on a human protein interaction network to identify sub-networks containing a higher proportion of genes associated with multiple sclerosis than expected by chance. tested interactions of SNP pairs that are separately located in two different genes as marker-based tests. To test the interaction between each pair of genes, they combined these marker-based interactions and the LD between markers into a gene-based statistic.Knowledge-driven filtering approaches can test models of genes that participate in the same biological pathway or network, and the interpretation of the interactions is then more straightforward. But their precision and power are hard to validate by simulation. Because such approaches depend on prior knowledge, which may not be accurate or may not be applicable to a particular dataset, they may miss what could be important findings among the genes for which we have little knowledge. [...] Filtering based on statistical tests is data-driven. Statistical data-driven filtering includes, apart from SNP quality control, single marker associations, feature selection to keep only the most informative markers, and statistical tests to screen for potential interactions. Using data-driven filtering in GWAS can dramatically decrease the search space used to find interactions, so that subsequent statistical tests and machine learning methods can be applied as an exhaustive search among a smaller number of SNPs. The performance of data-driven filtering depends on the assumptions that the statistical tests or filtering algorithms make. Single marker association filtering can only screen interactions among SNPs showing at least a moderate effect on the trait of interest, while feature selection filtering and variance heterogeneity filtering can be used to detect SNP interactions with very weak marginal SNP effects.Filtering according to single marker association. Filtering SNPs based on their marginal effects is frequently used for a high-dimensional gene–gene interaction search. It is often combined with biological filtering to identify interactions among SNPs that are marginally associated with a phenotype (; ; ; ; ). This approach follows the principles of hierarchical model building in the general linear model, where the interaction terms are tested only after all main-effect terms are deemed statistically significant. Typically the significance threshold used is less stringent than the usual genome-wide threshold of 5 × 10-8. The advantage of this filtering is that it is easy to implement; its disadvantage is that it has low power for detecting interactions among low-marginal-effect SNPs.Filtering by feature selection algorithms. Feature selection algorithms such as Relief (), ReliefF (), Tuned ReliefF (TuRF; ), and Spatially Uniform ReliefF (SURF; ) can also be used. They screen pairs of diallelic SNPs that can cluster individuals with similar phenotypes, on the basis of the nine two-SNP genotypes, into two distinct classes (e.g., cases versus controls). For each individual only a small subset of neighboring individuals, i.e., individuals most similar to that individual over all the SNPs, is examined. Iterating over each individual and its chosen subset of neighboring individuals, SNPs are up-weighted for selection on the basis of belonging to the SNP pairs most frequently found in all such sets. Simulation results have indicated this is able to identify SNP pairs with purely non-additive effects in genome-wide datasets. Evaporative cooling () is another feature selection approach which couples mutual information and thermodynamics theory. It filters SNPs by removing those with least information for epistatic interactions. Such feature selection filtering is able to retain pure epistatic (i.e., essential) interaction between markers with low-marginal effects, offering a powerful alternative to single-marker filtering.Filtering by testing variance heterogeneity of phenotype among SNP genotypes. For a quantitative trait, the presence of gene–gene interactions will result in heterogeneity of the phenotype variances among the genotypes of a single SNP, and this heterogeneity of phenotype variance has been proposed as a screen to prioritize SNPs for interaction testing (; ). SNPs selected on the basis of variance heterogeneity would then be used for later gene–gene or gene–environment interaction analyses. However, unless the phenotypic means are the same for all the SNP genotypes, a transformation corresponding to a non-linear change in the scale of measurement may equalize the variances (). This transformation, if it can be found, would eliminate any interactions detected this way. [...] Exhaustive search of interactions among millions of SNPs in GWAS data is computationally time-consuming. However, heuristic stochastic searching algorithms and efficient computational technology, such as parallel computing and bit operation, can boost the computational speed and, if maximization is involved, speed the convergence required to calculate test statistics. Some interaction studies use optimal searching and computational technology to search the whole space for potential interactions. An ultrafast genome-wide scan approach for SNP–SNP interactions, SIXPAC, employs a randomization searching algorithm – probability approximate complete (PAC) testing – to drastically trim the universe of SNP combinations. The approach samples small groups of cases and highlights combinations of alleles carried by all individuals in the group. By further incorporating bit operation technology, SIXPAC can scan genome-wide pair-wise interactions in a few hours, compared to PLINK in weeks (). developed a likelihood ratio-based Mann–Whitney approach that can test high-order interactions. It is computationally efficient and only conducts one test for all the identified interaction, so that no adjustment is necessary for multiple testing. A further extension of the approach introduces a randomizing algorithm into the scan, using ensemble tree models (), to increase the computational efficiency and convergence precision. developed an efficient extension of the PLINK epistasis module by using a parallel computing algorithm running on multiple processors to increase the speed of an exhaustive scan of all SNP pairs.Heuristic or randomized search is much more efficient than exhaustive search, so it can perform a genome-wide scan of interactions among millions of SNPs without any filtering in reasonable time. However, it cannot guarantee reaching the optimal solution, which means it may not find all the biologically relevant interactions. […]

Pipeline specifications