Computational protocol: Selection on a Variant Associated with Improved Viral Clearance Drives Local, Adaptive Pseudogenization of Interferon Lambda 4 (IFNL4)

Protocol publication

[…] To explore the level of functional constraint on IFNL4, we estimated the level of protein conservation in primate and non-primate mammals. Specifically, we assessed the ratio (dN/dS) of non-synonymous substitutions per non-synonymous site (dN) to synonymous substitutions per synonymous site (dS) across gene orthologs. Because purifying selection eliminates deleterious protein-coding changes, dN/dS decreases under negative selection and increases under relaxed constraint or positive selection.
We used the human IFNL4 reference sequence NM_001276254.2 as a BLAT query against the genomes of other species to generate a multiple-species sequence alignment of IFNL4 coding exons 1 through 5. The predicted panda IFNL4 ortholog was subsequently used as a BLAT query to extract coding exons for additional non-primate species. In addition, we sequenced IFNL4 (exons and introns) in genomic DNA and reconstructed complete IFNL4 cDNA sequences of chimpanzee (GenBank accession JX867772), baboon (GenBank accession KC525947) and crab-eating macaque (GenBank accession KC525948). The whole IFNL4 genomic region is absent in mouse and rat. All discovered functional IFNL4 sequences were used for a multiple-sequence alignment, which was created with ClustalW and annotated with Jalview.
The alignment was analyzed with codeml (part of PAML4) to test various models of selection. We estimated the overall dN/dS for the complete tree and compared likelihoods for models that allowed: i) a free dN/dS for each branch (i.e., lineage heterogeneity); ii) a primate-specific dN/dS; and iii) a human-specific dN/dS. Additionally, we performed tests to detect site-specific signatures of positive selection across the phylogeny (site models): i) model 1a (neutral) vs. model 2 (positive selection); ii) model 7 (neutral) vs. model 8 (with dN/dS>1); and iii) model 8a (with dN/dS = 1) vs. model 8 (with dN/dS>1). [...]
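The nested codeml comparisons described above are usually evaluated with a likelihood ratio test. The following is a minimal sketch of that test, not the authors' script: the log-likelihood values are hypothetical placeholders, the function names are introduced here for illustration, and the degrees of freedom (2 for model 1a vs. 2 and model 7 vs. 8, 1 for model 8a vs. 8) are the commonly used values rather than ones stated in the protocol.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square distribution (df = 1 or 2 only)."""
    if stat <= 0:
        return 1.0
    if df == 1:
        # P(Z^2 > x) for a standard normal Z
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Chi-square with 2 df is an exponential distribution with mean 2
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch only supports df = 1 or 2")


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) for a pair of nested models."""
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, chi2_sf(stat, df)


# Hypothetical log-likelihoods, as would be read from codeml output files.
# Format: name -> (lnL of null model, lnL of alternative model, assumed df)
comparisons = {
    "M1a vs M2": (-2510.4, -2509.9, 2),
    "M7 vs M8": (-2508.7, -2507.1, 2),
    "M8a vs M8": (-2507.6, -2507.1, 1),
}

for name, (lnl_null, lnl_alt, df) in comparisons.items():
    stat, p_value = likelihood_ratio_test(lnl_null, lnl_alt, df)
    print(f"{name}: 2*dlnL = {stat:.2f}, df = {df}, p = {p_value:.3f}")
```

A small p-value would favor the alternative model that allows a class of sites with dN/dS>1. The closed-form tail probabilities avoid an external dependency; a statistics library would be the natural choice for other degrees of freedom.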
To infer the model of selection that best fits the IFNL4 data and to estimate the timing and strength of selection on the TT allele, we used an Approximate Bayesian Computation (ABC) approach. In particular, we followed a published approach that has previously been shown to discriminate well between selection on a de novo mutation (SDN), selection on standing variation (SSV) and neutrality (NTR). In brief, this approach is based on performing a large number of simulations under different selection models, with random parameters drawn from a probability distribution (the prior distribution). Real data and simulations are compared using summary statistics and, through a rejection scheme, the simulations that most closely resemble the real data are retained to infer the best-fitting model. The parameter values that generated these simulations are then used to obtain the posterior distribution of each parameter, whose mean and standard deviation serve as the parameter estimates. We extended the method to consider more than one population, since two-population statistics are most informative in our case.
Specifically, the approach uses msms to simulate data, custom Python scripts to calculate all summary statistics, and ABCtoolbox for all ABC inferences. We started with uniform priors with the following ranges:
SDN model: selection strength in Africa sA ∼ U(0, 1.5%); selection strength in non-Africa sNA ∼ U(0.5, 5%); time when selection started tmut ∼ U(40, 70 kya)
SSV model: selection strength in non-Africa sNA ∼ U(>0, 5%); frequency of the allele when selection started f0 ∼ U(0, 20%); time when selection started tmut ∼ U(21, 51 kya)
NTR model: time when the mutation appears t ∼ U(40, 70 kya)
Because simulations in which the selected allele is fixed are likely to be very different from the observed data, we conditioned on the selected allele segregating in both populations. This resulted in non-uniform prior distributions.
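The prior sampling and rejection scheme described above can be sketched as follows. This is an illustration under stated assumptions and not the published pipeline: `toy_simulator` is a hypothetical stand-in for an msms run followed by the summary-statistic scripts, the observed statistics are invented, the lower bound of sNA under SDN is read as 0.5%, the small positive lower bound for sNA under SSV is arbitrary, and plain rejection is simpler than the inference that ABCtoolbox provides.

```python
import math
import random


def sample_prior(model, rng):
    """Draw one parameter set from the uniform priors listed in the protocol."""
    if model == "SDN":
        return {"sA": rng.uniform(0.0, 0.015),     # 0 to 1.5%
                "sNA": rng.uniform(0.005, 0.05),   # assumed 0.5% to 5%
                "tmut": rng.uniform(40.0, 70.0)}   # kya
    if model == "SSV":
        return {"sNA": rng.uniform(1e-6, 0.05),    # ">0": arbitrary small bound
                "f0": rng.uniform(0.0, 0.20),
                "tmut": rng.uniform(21.0, 51.0)}   # kya
    if model == "NTR":
        return {"t": rng.uniform(40.0, 70.0)}      # kya
    raise ValueError(f"unknown model: {model}")


def toy_simulator(model, params, rng):
    """Hypothetical stand-in for msms plus summary statistics (two toy values)."""
    strength = params.get("sNA", 0.0)
    age = params.get("tmut", params.get("t"))
    return [100.0 * strength + rng.gauss(0.0, 0.5),
            age / 10.0 + rng.gauss(0.0, 0.5)]


def abc_reject(observed, simulations, tolerance):
    """Keep the fraction `tolerance` of simulations closest to the observed data.

    `simulations` is a list of (model, params, stats) tuples. Each statistic is
    scaled by its standard deviation across simulations before the Euclidean
    distance is computed.
    """
    scales = []
    for j in range(len(observed)):
        values = [stats[j] for _, _, stats in simulations]
        mean = sum(values) / len(values)
        sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        scales.append(sd if sd > 0 else 1.0)

    def distance(stats):
        return math.sqrt(sum(((s - o) / sc) ** 2
                             for s, o, sc in zip(stats, observed, scales)))

    ranked = sorted(simulations, key=lambda sim: distance(sim[2]))
    n_keep = max(1, int(tolerance * len(ranked)))
    return ranked[:n_keep]


def model_posteriors(accepted):
    """Approximate posterior model probabilities as accepted proportions."""
    counts = {}
    for model, _, _ in accepted:
        counts[model] = counts.get(model, 0) + 1
    return {model: n / len(accepted) for model, n in counts.items()}


rng = random.Random(42)
simulations = []
for model in ("NTR", "SDN", "SSV"):
    for _ in range(1000):  # equal number of simulations per model
        params = sample_prior(model, rng)
        simulations.append((model, params, toy_simulator(model, params, rng)))

observed = [2.0, 4.0]  # invented "observed" statistics
accepted = abc_reject(observed, simulations, tolerance=0.05)
print(model_posteriors(accepted))
```

With equal numbers of simulations per model, the ratio of two accepted proportions approximates the Bayes factor between those models. Conditioning on the allele still segregating would correspond to discarding simulations before the rejection step, which is why the effective priors become non-uniform.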
We used 10^4 simulations to distinguish between the neutral model and the two selection models, and a larger set of 8×10^5 simulations for the more subtle distinction between the two selection models and for parameter estimation. For the simulations, we used the population history model estimated by Gravel et al. and assumed a constant recombination rate of 1.76 cM/Mb throughout the region (the average recombination rate in the IFNL locus) and a perfectly additive model of dominance (h = 0.5). The lack of an appropriate demographic model for American and non-Yoruba African populations precludes analysis of those populations. The following single-population statistics were calculated: the average number of pairwise differences π, Watterson's θ, Fay and Wu's H and Tajima's D, each for both a 4 kb interval around the site and an 8 kb interval (6 kb upstream and 2 kb downstream of the site) around the TT allele. The between-population statistics employed were: FST for the selected site, FST in 4 kb around the site, FST for the whole region, and XP-EHH on the selected site. In addition, we included the frequency of the selected allele in both populations. This resulted in a set of 16 summary statistics, which, following Wegmann et al. and Peter et al., was reduced to seven summary statistics using PLS-DA for model choice and regular PLS for parameter inference. The performance of the ABC model choice and of the parameter estimation for the SDN model was assessed for each particular model. Confidence in the choice of selection models was supported with Bayes factors.
In addition, we investigated the influence of the dominance model on our inferences. We analyzed a recessive model for TT (h = 0), the perfectly additive model above (h = 0.5), and a supra-additive model (h = 0.38), using 500,000 simulations for each model. We ran an ABC analysis for model selection with all simulations (from all three dominance models and the three selection models NTR, SDN, and SSV). We then assessed the posterior probability of each dominance model regardless of selection model, and the posterior probability (and parameter estimates) of each selection model for the additive and supra-additive dominance models.
[...] 1000 Genomes; GENCODE; HapMap; XP-EHH and iHS executables; VCFtools; ABCtoolbox; msms […]
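Three of the single-population statistics named in the protocol (π, Watterson's θ and Tajima's D) can be computed from a set of haplotypes with the standard formulas. The sketch below is not the authors' custom script: the function names and the four example haplotypes are made up for illustration, haplotypes are assumed to be equal-length strings of 0 (ancestral) and 1 (derived) over segregating and non-segregating sites, and Fay and Wu's H, FST and XP-EHH are not covered.

```python
import math
from itertools import combinations


def segregating_sites(haplotypes):
    """Number of alignment columns with more than one allele."""
    return sum(1 for column in zip(*haplotypes) if len(set(column)) > 1)


def pairwise_diversity(haplotypes):
    """Average number of pairwise differences (pi) over all haplotype pairs."""
    n = len(haplotypes)
    total = sum(sum(a != b for a, b in zip(h1, h2))
                for h1, h2 in combinations(haplotypes, 2))
    return total / (n * (n - 1) / 2)


def watterson_theta(haplotypes):
    """Watterson's estimator: S divided by the harmonic number a1."""
    n = len(haplotypes)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(haplotypes) / a1


def tajimas_d(haplotypes):
    """Tajima's D with the standard variance constants."""
    n = len(haplotypes)
    s = segregating_sites(haplotypes)
    if s == 0:
        return 0.0  # D is undefined without segregating sites; 0 by convention here
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    difference = pairwise_diversity(haplotypes) - s / a1
    return difference / math.sqrt(e1 * s + e2 * s * (s - 1))


# Made-up example: four haplotypes, four sites
example = ["0000", "0011", "0101", "1001"]
print("S       =", segregating_sites(example))
print("pi      =", pairwise_diversity(example))
print("theta_W =", watterson_theta(example))
print("D       =", tajimas_d(example))
```

In a pipeline like the one described, such functions would be applied separately to each window (for example the 4 kb and 8 kb intervals) and to each population, for both the observed data and every simulation.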

Pipeline specifications

Software tools: BLAT, Clustal W, Jalview, PAML, MSMS, ABCtoolbox, VCFtools
Databases: GENCODE, HGDP, psiDR
Applications: Phylogenetics, Population genetic analysis, GWAS
Organisms: Homo sapiens
Diseases: Hepatitis C