Computational protocol: Bridging epidemiology with population genetics in a low incidence MSM-driven HIV-1 subtype B epidemic in Central Europe

Protocol publication

[…] For the purpose of this study, data and sequences were gathered from three previous studies conducted in Slovenia examining the prevalence of transmitted drug resistance among therapy-naive HIV-1-positive patients diagnosed in the years 2000–2004, 2005–2010 and 2011–2012, approved by the Medical Ethics Committee at the Ministry of Health of Slovenia (Approval Ref. No.: 126/12/03). The following epidemiological and laboratory data were available for statistical analysis: gender, age at time of diagnosis, year of diagnosis, country of origin, acute retroviral syndrome (ARS), Centers for Disease Control and Prevention (CDC) class, AIDS-defining illnesses, other sexually transmitted diseases, most probable route of HIV infection, relationship with the source of HIV infection (sex with an anonymous person or stable relationship), country where the infection most probably occurred, viral load and CD4 cell count at the time of diagnosis, and presence of surveillance drug resistance mutations (SDRMs).

For the purpose of the incidence estimate, patients were classified as recently infected (RI) when diagnosed and sampled within 155 days after infection, or as having a long-standing infection (LSI) when diagnosed more than 155 days after infection. In brief, patients with a baseline CD4+ cell count below 200 cells/mm³ and/or a baseline viral load (VL) below 400 copies/ml were first automatically classified as having an LSI. For the remainder, the Aware™ BED™ EIA HIV-1 Incidence Test (BED test) (Calypte Biomedical Corporation, Portland, Oregon) was performed on a sample taken within 3 months of diagnosis, according to the manufacturer's instructions. Briefly, the principle of the BED test is as follows: patients' plasma samples were first diluted 1:101 and added to a goat anti-human IgG-coated microplate, which captures both anti-HIV IgG and non-anti-HIV IgG from the samples.
The amount of specific anti-HIV-1 antibodies was proportional to the optical density (OD) values obtained by spectrometric analysis. The normalized optical density was calculated (ODn = plasma sample OD / calibrator OD) and, on the basis of the defined OD cut-off values of the assay, the patients were classified as having an LSI or RI, with the suggested window period of infection being 155 days. This determination was possible for 213/223 Slovenian patients included in the study (for the remaining 10 patients, a plasma sample taken within 3 months after HIV-1 diagnosis was not available).

Sequences were re-analyzed for subtype determination using the REGA HIV-1 Automated Subtyping Tool, version 2. Only subtype B sequences were selected for this study: a total of 223 partial pol gene sequences, corresponding to the inclusion of 53% ± 15% of newly diagnosed patients per year.

Sequences were aligned using ClustalW, as available in the BioEdit package, and the alignment was edited and trimmed to 953 base pairs. A quick neighbor-joining (NJ) tree was created using SeaView with 100 bootstrap replicates. A subset of Slovenian sequences was selected from all parts of the NJ tree by visual inspection, to represent the complete diversity of the epidemic. Depending on the size of the cluster, at least three to five sequences per transmission cluster were included. These sequences were then used to search the GenBank public database for other closely related sequences (controls), using the BLAST search tool. The 10 most similar sequences per Slovenian sequence were retrieved from GenBank. Since population genetics analysis is only possible with sequences that carry sampling time information, only sequences with an available sampling time were selected as controls. The majority of Slovenian sequences gave similar BLAST results, so many control sequences were retrieved more than once.
After removing these duplicate sequences, only 84 control sequences remained; together with the 223 Slovenian sequences, they formed a dataset of 307 subtype B sequences for further analysis.

jModelTest was used to select the best-fitting evolutionary model for the selected dataset, with three additional subtype A1 sequences (accession numbers AB098332, AB253421, AB285785) selected to root the resulting maximum likelihood (ML) tree. jModelTest identified TVM + I + G as the best-fitting evolutionary model for the data. Since this model is not available in most phylogenetic tree construction software, the closest model, based on the hierarchy of evolutionary models provided by the jModelTest developers, was used instead. Thus, all additional analyses (maximum likelihood and Bayesian) were run using the next simpler model, HKY + I + G. This model, when combined with two codon partitions (codon positions 1 + 2, and position 3), has previously been determined to perform best for most viral protein-coding datasets. The ML phylogenetic tree was constructed using a combined subtree pruning and regrafting and nearest neighbor interchange (SPR + NNI) tree search, as implemented in PhyML 3.0. The resulting phylogeny was visualized using FigTree v1.3.1. At this point, transmission clusters were identified according to the approximate likelihood ratio test (aLRT) branch support values (>0.90) obtained by the ML method.

The temporal signal of the dataset was assessed using Path-O-Gen, which showed an r-squared value of 0.53 with the best-fitting root (date range 29 years).

The Bayesian analysis was performed on all Slovenian subtype B sequences and control sequences (full analysis).
Additionally, separate analyses were run on all major Slovenian clusters (≥10 Slovenian sequences) for two reasons: 1) to compare the time to the most recent common ancestor (tMRCA) values obtained in the independent cluster analyses with the tMRCA values obtained from the full dataset analysis and 2) to obtain the population growth rate of each cluster. The tMRCAs of the full dataset and of the major clusters were determined by the Markov chain Monte Carlo (MCMC) method available in the BEAST package v1.7.1, using a relaxed clock model with an uncorrelated lognormal distribution and the Bayesian skyline coalescent model. The analysis was run using the HKY + I + G substitution model with two codon partitions (codon positions 1 + 2, and position 3). The BEAST output was examined in Tracer v1.5, and the MCMC chain was run until the effective sample size (ESS) values for all parameters exceeded 200. Convergence was achieved after 500,000,000 generations for the full analysis and after 100,000,000 generations for the analyses based only on cluster strains. All analyses were repeated and the duplicate results were then combined using LogCombiner v1.6.1, available in the BEAST package. TreeAnnotator v1.7.1 (BEAST package) was used to remove a burn-in of 10% of all sampled trees. A maximum clade credibility tree was finally visualized and annotated using FigTree v1.3.1.

The final clusters were defined by reviewing the clusters previously identified in the ML analysis (aLRT > 0.90) against the posterior probability values obtained in the Bayesian analysis. Only clusters with a posterior probability >0.990 were selected. Thus, cluster definition relied on both aLRT (>0.90) and posterior probability (>0.990) values. Moreover, a sensitivity analysis was conducted by altering the aLRT and posterior probability cut-offs. When higher cut-offs were applied, namely aLRT >0.95 and posterior probability >0.999, all the defined clusters remained.
On the other hand, when the aLRT cut-off was lowered to >0.85 and >0.80, one of the clusters gained 3 additional Slovenian sequences and one additional Slovenian cluster was identified. Regarding posterior probability, no differences were observed when the cut-off was lowered, since the values on the outer nodes were much lower than 0.9, which supports the selected cluster definition.

The obtained tMRCA values were compared to epidemiological data (e.g., time of infection) gathered from questionnaires and to confirmed epidemiological links, and also to the results of an HIV-1 incidence algorithm.

Statistical analyses were conducted using the statistical package Epi Info™ Version 3.5.3, and P values ≤ 0.05 were considered significant. […]
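
The recency classification described earlier in this protocol (a CD4/viral load pre-screen followed by the BED test ODn cut-off) can be illustrated with a minimal Python sketch. The CD4 and viral load thresholds and the ODn formula come from the text. The ODn cut-off value, the rule that an ODn at or below the cut-off indicates recent infection, and all function and constant names are assumptions made for illustration, since the protocol refers only to "the defined OD cut-off values of the assay".

```python
from typing import Optional

# Thresholds stated in the protocol text.
CD4_LSI_BELOW = 200  # cells/mm3
VL_LSI_BELOW = 400   # copies/ml

# ASSUMPTION: the protocol does not state the ODn cut-off. 0.8 is a
# placeholder and should be replaced by the value in the assay instructions.
ASSUMED_ODN_CUTOFF = 0.8


def normalized_od(sample_od: float, calibrator_od: float) -> float:
    """ODn = plasma sample OD / calibrator OD, as defined in the protocol."""
    if calibrator_od <= 0:
        raise ValueError("calibrator OD must be positive")
    return sample_od / calibrator_od


def classify_patient(
    cd4: float,
    viral_load: float,
    sample_od: Optional[float] = None,
    calibrator_od: Optional[float] = None,
    odn_cutoff: float = ASSUMED_ODN_CUTOFF,
) -> str:
    """Return "LSI", "RI" or "undetermined" for one patient."""
    # Step 1: low CD4 and/or low viral load is classified as LSI directly.
    if cd4 < CD4_LSI_BELOW or viral_load < VL_LSI_BELOW:
        return "LSI"
    # Step 2: without a plasma sample there is no BED result to use.
    if sample_od is None or calibrator_od is None:
        return "undetermined"
    # Step 3: ASSUMPTION: ODn at or below the cut-off means recent infection.
    odn = normalized_od(sample_od, calibrator_od)
    return "RI" if odn <= odn_cutoff else "LSI"
```

For example, `classify_patient(500, 50000, sample_od=0.4, calibrator_od=1.0)` gives an ODn of 0.4 and returns "RI" under the assumed cut-off, while a patient with a CD4 count of 150 is classified as "LSI" without any BED result.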

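The two-threshold cluster definition and its sensitivity analysis, as described in the protocol above, amount to filtering candidate clades on two support values. The sketch below illustrates this under stated assumptions: the `Clade` record, the function names and the example support values are hypothetical, and in practice the support values would be read from the PhyML and BEAST output trees rather than typed in by hand.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Clade(NamedTuple):
    name: str         # hypothetical label for a candidate cluster
    alrt: float       # aLRT branch support from the ML tree
    posterior: float  # posterior probability from the Bayesian tree


def select_clusters(
    clades: Sequence[Clade],
    alrt_min: float = 0.90,
    posterior_min: float = 0.990,
) -> List[str]:
    """Keep clades whose support exceeds both cut-offs (strict '>')."""
    return [
        c.name
        for c in clades
        if c.alrt > alrt_min and c.posterior > posterior_min
    ]


def sensitivity(
    clades: Sequence[Clade],
    cutoffs: Sequence[Tuple[float, float]],
) -> Dict[Tuple[float, float], List[str]]:
    """Re-run the selection for each (aLRT, posterior) cut-off pair."""
    return {pair: select_clusters(clades, *pair) for pair in cutoffs}


if __name__ == "__main__":
    # Made-up support values, for illustration only.
    example = [
        Clade("cluster_1", 0.99, 1.0),
        Clade("cluster_2", 0.87, 0.995),
        Clade("cluster_3", 0.93, 0.95),
    ]
    for pair, names in sensitivity(
        example, [(0.90, 0.990), (0.95, 0.999), (0.85, 0.990)]
    ).items():
        print(pair, names)
```

Requiring both criteria at once mirrors the protocol's approach of reviewing ML-defined clusters against the Bayesian posterior values, and passing several cut-off pairs to `sensitivity` reproduces the kind of comparison reported for the stricter and more lenient thresholds.
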
Pipeline specifications

Software tools: Clustal W, BioEdit, SeaView, jModelTest, PhyML, FigTree, TempEst, BEAST
Applications: Phylogenetics, Population genetic analysis
Organisms: Human immunodeficiency virus 1, Homo sapiens
Diseases: Encephalitis, Arbovirus, HIV Infections