## Protocol publication

[…] The three standard approaches considered here (haplo.score, haplo.glm, and **hapassoc**) are based on the generalized linear model (GLM). In haplo.score, a global test of association and individual haplotype-specific tests are carried out using a score function. It estimates haplotype frequencies independently of trait and covariates, under the null hypothesis of no association. Haplo.score does not estimate the magnitude of individual haplotype effects. Haplo.glm is an extension of haplo.score for testing haplotype–environment interactions (it can also fit a main-effects-only model). Unlike haplo.score, it iteratively estimates haplotype frequencies conditional on all observed data and the current estimates of the regression parameters. It uses Wald tests for a global haplotype–environment interaction effect and for individual haplotype-specific effects, and it also estimates the magnitude of individual haplotype effects []. Hapassoc was proposed as an extension of haplo.glm to accommodate missing genotype data at individual SNPs (although haplo.glm can now accommodate missing genotypes) and uses an improved approximation for standard error estimation []. All of these methods can handle binary as well as continuous responses.

As these three approaches are not specifically designed for rare haplotypes, they may or may not perform well in the presence of rare haplotypes. Indeed, in previous studies [–], hapassoc has shown high non-convergence rates when rare haplotypes are modeled individually rather than pooled together; pooling is the typical approach for handling rare haplotypes, but it does not allow the study of individual rare haplotypes. Thus, we also apply **LBL**, which is described in detail in Biswas and Lin [] and Biswas et al [], and briefly here.

LBL is based on a retrospective likelihood; that is, it models the probability of haplotypes given disease status.
The unobserved (phased) haplotypes of subjects are treated as missing data, and the frequency of each person’s haplotype pair is modeled using haplotype frequencies (treated as unknown parameters) while allowing for Hardy-Weinberg disequilibrium. The odds of disease are expressed through a logistic regression model whose coefficients are regularized by a double-exponential prior centered at zero with a variance parameter that is itself assigned a hyperprior. This regularization corresponds to the Bayesian LASSO. Markov chain Monte Carlo methods are used to estimate the posterior distributions of all parameters, which include the regression coefficients and haplotype frequencies. Testing for association for each main and interaction effect is carried out by calculating the Bayes factor (BF). A BF exceeding 2 is considered significant evidence of association. The posterior mean and confidence intervals of parameters can be obtained, if desired. LBL is available as an R package at http://www.utdallas.edu/~swati.biswas/. Currently, LBL can only handle binary (case-control) responses.

[...]

We consider 2 genes, ULK4 and MAP4. We exclude SNPs with more than 25% of genotypes missing and include SNPs with a MAF of at least 0.001. We use sliding, overlapping windows of 5 SNPs to create haplotype blocks (eg, SNPs 1 to 5, 2 to 6, and so on) that cover the whole gene.

For selection of SNPs and calculation of MAF, we used the genotypes listed under NALTT (the number of alternate alleles thresholded), coded as 0/1/2; these are high-quality genotypes. An alternate allele is usually, but not always, the minor allele; for simplicity, we coded the major allele as 0 and the minor allele as 1. For the phenotype, we defined a binary hypertension trait as follows. If a person has systolic blood pressure (SBP) greater than 140 or diastolic blood pressure (DBP) greater than 90, or is taking antihypertensive medication, we labeled that person as affected by hypertension (case).
Otherwise, the individual is labeled as unaffected (control). A person whose SBP and DBP values are below the thresholds and whose medication field is missing is also treated as a control.

We apply all four methods to the haplotype blocks described above without using any covariates. For LBL, we use a threshold of BF greater than 2, whereas for the other methods we use a p value of less than 0.05 to declare significance. We analyze the blocks in each gene twice: once using the provided phenotypes and once after randomly permuting the phenotype status among all subjects. The latter destroys any association that may exist and so allows us to gauge the false-positive rates. Finally, we also ran LBL after including in the model the covariate age (dichotomized at 55) and its interaction with haplotypes.

To allow rare haplotypes to be analyzed individually rather than pooled together, we set the pooling tolerance of hapassoc to zero. The pooling tolerance is a user-defined haplotype frequency below which the corresponding haplotypes are pooled into a single category, called "pooled", in the design matrix for the risk model. The hapassoc package includes a pre-processing function, pre.hapassoc, which returns a list of compatible haplotypes for each person’s genotypes and the frequencies of all haplotypes in the population. These are provided as input to hapassoc and LBL. In LBL, the estimated haplotype frequencies are used as starting values of the frequency parameters. Haplo.glm does not allow the pooling tolerance to go below 0.001. For a fair comparison of haplo.glm and hapassoc, we also ran hapassoc with a pooling tolerance of 0.001. Haplo.score does not pool any haplotypes. Finally, for comparison, we also analyzed each gene (all SNPs within a gene together) using the popular collapsing approaches of the sequence kernel association test (**SKAT**), SKAT-Optimal (SKAT-O), and SKAT-Combined (SKAT-C) [–]. […]
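The data-preparation rules described above (SNP filtering, 5-SNP sliding windows, the hypertension definition, and phenotype permutation) can be illustrated with a minimal Python sketch. It is not the code used in the study, whose analyses relied on R packages. All function names are hypothetical, and the sketch assumes that genotypes are 0/1/2 alternate-allele counts with `None` marking a missing value, that SBP and DBP are always observed, and that a missing medication field is passed as `None`.

```python
import random

def snp_passes_filters(genotypes, max_missing=0.25, min_maf=0.001):
    """Keep a SNP if at most 25% of genotypes are missing and MAF >= 0.001.

    genotypes: alternate-allele counts (0/1/2), None for missing (assumption).
    """
    if not genotypes:
        return False
    observed = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(observed) / len(genotypes)
    if missing_rate > max_missing or not observed:
        return False
    alt_freq = sum(observed) / (2 * len(observed))
    # The alternate allele is not always the minor one, so take the smaller frequency.
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf

def sliding_windows(n_snps, width=5):
    """Return 1-based inclusive (start, end) windows: (1, 5), (2, 6), ..."""
    return [(start, start + width - 1) for start in range(1, n_snps - width + 2)]

def hypertension_status(sbp, dbp, on_medication):
    """Return 1 (case) or 0 (control).

    A missing medication field (None) counts as untreated, so a person with
    SBP and DBP below the thresholds and no medication record is a control.
    """
    return int(sbp > 140 or dbp > 90 or bool(on_medication))

def permute_phenotypes(status, seed=0):
    """Shuffle case/control labels among subjects to break any association."""
    rng = random.Random(seed)
    shuffled = list(status)
    rng.shuffle(shuffled)
    return shuffled

if __name__ == "__main__":
    print(sliding_windows(7))                  # windows over a 7-SNP gene
    print(hypertension_status(130, 85, None))  # below thresholds, medication missing
    print(permute_phenotypes([1, 0, 0, 1, 1]))
```

Permuting the labels keeps the numbers of cases and controls fixed, so any block declared significant on the permuted data can be counted as a false positive under the chosen threshold (BF greater than 2 for LBL, p value less than 0.05 for the other methods).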