## Similar protocols

## Protocol publication

[…] Several safeguards were in place to eliminate cross-contamination with DNA from other sources in the laboratory and across timepoints within each participant: PCR setup was performed in a dedicated clean room absent of any PCR-amplified or plasmid DNA, and all gag and env-gp120 sequence assemblies were regularly compared against the continually updated Mullins laboratory HIV sequence database using **ViroBLAST** [] [http://indra.mullins.microbiol.washington.edu/viroblast/viroblast.php]. As it is particularly difficult to detect specimen mix-up or contamination in longitudinal studies, researchers also never worked with more than one specimen at a time from a given participant.Sequence chromatograms were trimmed based on quality and to remove primer sequences, and then assembled into contigs within Geneious® 7.1.7 (Biomatters, Auckland). Contigs were assessed for quality and complete forward and reverse coverage and then used to generate a consensus sequence. Up to 2 base ambiguities were permitted per assembled consensus sequence (≤ 0.13%).Sequence alignments were generated using **MUSCLE** [] within Geneious followed by manual editing. Hypermut 2.0 [] was used to identify APOBEC3G/F hypermutated sequences, which were then eliminated from downstream analyses. The hypervariable regions within V1, V2, V4 and V5 were excluded from env-gp120 alignments when used to assess positional homology for potential N-linked glycosylation site analyses; these included positions (HXB2 amino acid locations) 132–152, 185–190, 396–410, and 460–465, respectively. A founder sequence was inferred within each participant as the consensus sequence of the first available timepoint.**BEAST** version 1.8.2 was used to estimate substitution rates []. Each taxa was dated according to the known time of sample collection. All sequence alignments were confirmed to possess sufficient temporal signal using **TempEst** []. The GTR +I +G substitution and site heterogeneity models were set for the nucleotide substitution model []. A lognormal relaxed molecular clock [], which estimates the level of rate variation among lineages, was used instead of a strict molecular clock because formal model comparison testing revealed the relaxed clock a better fit of the data (data not shown). Prior distributions for the uncorrelated lognormal distributional model (for a relaxed clock) were specified as follows: the ucld.mean was gamma distributed with a mean of 0.075 and a shape of 1000, and the ucld.stdev was exponentially distributed with a mean of 1/3, initial value of 1/3, and offset of 0. This model yields an estimate of each branch-specific rate, as well as the coefficient of variation and covariance of rates across the tree. The coefficient of variation provides a measure of rate heterogeneity across branches and gives information about how clock-like the data is. The covariance measures autocorrelation between adjacent branches in the tree. Model testing was also performed for the tree prior; the exponential growth population coalescent model fit the data better than a constant population coalescent model (except for M1) and was used for all nucleotide substitution rate estimates [, ]. A comparison of nucleotide substitution rate estimates for each viral region is shown for constant and exponential growth population coalescent models (). All other priors were left as default parameters. The operators were set to auto-optimize during the simulation. Chain lengths were set to 108 with sampling every 104 steps, which ensured a decent sampling from the posterior and effective sample size (ESS) > 100 for all meanRate estimates (ESS > 200 in > 90% of cases). Additionally, each BEAST simulation was run three times (each with a different random starting number) to verify similar convergence from run to run. A separate analysis with the CTMC reference prior on the mean clock rate showed no difference in meanRate estimates (data not shown). Output log files were analyzed using Tracer [http://tree.bio.ed.ac.uk/software/tracer/] and the mean of the posterior density was reported along with the 95% highest posterior density (HPD) interval. Hierarchical estimates of substitution rates were also employed in BEAST using the hierarchical phylogenetic model described by Edo-Matas et al []. This framework allows for different evolutionary histories of the within-host variants from individual-to-individual while providing an overall between-host summary estimate of nucleotide substitution rates. A fixed-effects HPM was used for hypothesis testing of the between group env-C2V5 substitution rates from the 8 WIHS and 11 MACS participants.**jModelTest** 2.1.7 [] was used to determine the appropriate nucleotide substitution model for each sequence dataset, with GTR +I +G [] chosen as it resulted in the highest likelihood scores for most alignments. Equations 1 and 2 from Deng et. al. [] were used to compute average (± standard error) pairwise divergence and diversity, respectively, using the DIVEIN webtool (https://indra.mullins.microbiol.washington.edu/DIVEIN/).Phylogenetic tree reconstruction by maximum likelihood was performed using **PhyML** v3.0 [] within DIVEIN [] for intra-participant alignments, and because of its fast maximum tree search algorithm, **RAxML** (Randomized Axelerated Maximum Likelihood) [] was used to infer inter-participant phylograms. RAxML was also used to perform intra-participant a posteriori bootstrap convergence tests (). Phylogenetic inference was performed using a GTR substitution model, optimized equilibrium frequency, estimated proportion of invariable sites (I), and among site rate heterogeneity was captured using a discretized Gamma distribution. Tree searching optimization used the better of nearest-neighbor interchange (NNI) and subtree pruning and regrafting (SPR).**CodeML** from the PAML v4.8a software package [] was used to estimate nonsynonymous (dN) and synonymous site (dS) substitutions rates. CodeML estimated a single dN/dS (ω) over the coding region using the M0 (one ratio) model. The hierarchical M7(β) and M8(β and ω) site-based models were used for hypothesis testing of positive selection at individual codon sites. Codon alignment and ML-based tree files (from PhyML inference) were input from the viral population sequences sampled for each participant at: i) each timepoint individually (model M0), or ii) for all timepoints collectively (model M0, M7, and M8). The following parameters for CodeML were changed from otherwise default values (CodonFreq = 2: F3x4; kappa = 1.6; fix_alpha = 1: fixed; ncatG = 3; fix_rho = 1: fixed; method = 1: one branch at a time; blength_fix = initial). A likelihood ratio test (LRT) was used to test for positive selection under the M7 and M8 models. Significance of the LRT was determined assuming a Chi square with 2 degrees of freedom test statistic distribution. A mixed effects model of evolution (MEME) [] was used to infer codons undergoing episodic diversifying selection using the **Datamonkey** webserver []. This model combines fixed effects at the level of a site with random effects at the level of branches. Positively selected sites are reported for p-values < 0.05 using a LRT that tested the null vs. alternative models of nonsynonymous rate variation. A false discovery rate (q-value < 0.2, derived from the corresponding P-value using Simes’ procedure []) analysis was used to refine these inferences.Co-receptor usage was predicted using the nucleotide position-specific scoring matrix (ntPSSM) of translated V3 loop amino acid sequences []. PSSM scores of -6 or greater were taken as indicative of X4-tropism []. Predicted N-linked glycosylation sites (PNLGS) were obtained using N-GlycoSite [] [http://www.hiv.lanl.gov/content/sequence/GLYCOSITE/glycosite.html]. [...] Graphics were generated using either GraphPad Prism v6.0f (GraphPad Software, Inc.) or R v3.1.1 []. Phylogenetic trees were generated in **FigTree** v1.4.1 (http://tree.bio.ed.ac.uk/software/figtree/) and the **phytools** R package []. The nonparametric Mann-Whitney U test was used to test for differences between unmatched groups. Most associations were assessed using a nonparametric Spearman’s rank correlation coefficient (ρ) test because the relatively small sample sizes obtained were not assumed to be normally distributed. A Pearson’s correlation test was used to assess the relationship of larger sample sizes and the accompanying r coefficient and P-value was reported. A linear mixed effects model [] was fit to the data and used to estimate the mean upward slope for differences in dS and dN rates over time. A likelihood-ratio test was used to compare the estimated rates within each model and to calculate P-values. Models were corrected for autocorrelation within the repeated measure datasets. LRTs revealed that the slope parameters (dS / year and dN / year) were best estimated with a random effects parameter, while the y-intercepts were best estimated with a fixed effects parameter (data not shown). […]

## Pipeline specifications

Software tools | ViroBLAST, Geneious, MUSCLE, BEAST, TempEst, jModelTest, PhyML, RAxML, PAML, Datamonkey, FigTree, Phytools |
---|---|

Applications | Phylogenetics, Population genetic analysis, Nucleotide sequence alignment |

Organisms | Human immunodeficiency virus 1, Homo sapiens, Human immunodeficiency virus 2 |

Diseases | Acquired Immunodeficiency Syndrome, HIV Infections |

Chemicals | Nucleotides |