Computational protocol: High prevalence and diversity of HIV-1 non-B genetic forms due to immigration in southern Spain: A phylogeographic approach


Protocol publication

[…] All the sequences were trimmed to 883 nucleotides (nt) and aligned using ClustalW. The viral subtype was determined with the REGA v3.0 subtyping tool and confirmed by maximum likelihood (ML) phylogenetic analysis using the Randomized Axelerated Maximum Likelihood (RAxML) program, accessible on the CIPRES Science Gateway. The general time-reversible (GTR) model with gamma-distributed rate heterogeneity across sites was employed, with 1000 bootstrap iterations. A representative dataset of HIV-1 group M sequences, including non-recombinant subtypes (A-K) and recombinant forms (at least four representative sequences of each non-recombinant subtype and of each CRF available at the time of the analysis), was downloaded from the Los Alamos HIV sequence database and used as a reference dataset. The assignment to a subtype/CRF was considered definitive if the query sequence grouped with the reference sequences of that viral variant in a monophyletic cluster supported by a high bootstrap value (>70%). Any genetic form not associated with reference subtypes/CRFs was classified as a unique recombinant form (URF), whose recombination pattern was further studied by bootscan analysis using SimPlot v3.5.1. The bootscanning method in SimPlot consists of a sliding-window phylogenetic bootstrap analysis of the query sequence aligned against a set of reference strains to reveal breakpoints. The Neighbor-Joining algorithm was selected, with the Kimura 2-parameter substitution model. We employed a window size of 200 nt moving in 10 nt increments. We used a minimum bootstrap cutoff of 70% to reliably assign each segment between breakpoints to a parental variant. We have submitted to GenBank the major groups of HIV-1 non-B variants under accession numbers MF628109 to MF628250. These were defined as those found in at least five patients.
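The sliding-window settings described for the bootscan (a 200 nt window moved in 10 nt steps over the 883 nt alignment, with a 70% bootstrap cutoff) can be illustrated with a minimal Python sketch. This is not SimPlot's implementation: the function names, the 0-based half-open coordinates, the choice to keep only full-length windows, and the example support values are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple


def sliding_windows(alignment_length: int, window: int = 200,
                    step: int = 10) -> List[Tuple[int, int]]:
    """Return 0-based, half-open (start, end) coordinates of every full window.

    Assumption: windows that would run past the end of the alignment are
    dropped; SimPlot may treat alignment edges differently.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [(start, start + window)
            for start in range(0, alignment_length - window + 1, step)]


def assign_parent(support: Dict[str, float],
                  cutoff: float = 70.0) -> Optional[str]:
    """Return the best-supported reference variant for one window, or None
    when no variant reaches the minimum bootstrap cutoff (in percent)."""
    if not support:
        return None
    best = max(support, key=lambda name: support[name])
    return best if support[best] >= cutoff else None


windows = sliding_windows(883)
print(len(windows), windows[0], windows[-1])  # 69 (0, 200) (680, 880)

# Hypothetical bootstrap support (percent) for two windows of one query.
print(assign_parent({"G": 85.0, "B": 10.0}))  # G
print(assign_parent({"G": 55.0, "B": 45.0}))  # None
```

Under these assumptions, the 883 nt alignment yields 69 windows, and a window is left unassigned whenever no reference variant reaches the 70% cutoff.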
With the aim of protecting the identity of patients infected with rare genetic forms of HIV-1, and for similar scientific and ethical reasons as explained in other HIV cohorts, we decided not to submit to GenBank the sequences corresponding to the less frequent variants. [...] To further characterize the relationships among the major groups of HIV-1 non-B variants, we searched GenBank for sequences genetically related to our major subtypes/recombinant forms using HIV-BLAST. The 10 GenBank sequences most closely related to each of our study sequences were downloaded and included in each dataset. We also included all the pol sequences (start: 2293 and end: 3290, HXB2 coordinates) available in the Los Alamos HIV database sampled in Spain for each dataset: subtype A1 (n = 60), subtype C (n = 52), subtype F (n = 143), subtype G (n = 64), CRF14_BG (n = 25), and CRF02_AG (n = 265). Since very few sequences for CRF06_cpx were available in public databases, we included them all (n = 110). All these individual sequence datasets were combined (n = 970) and a global phylogenetic analysis was performed using RAxML (GTR + Gamma model) with 1000 bootstrap iterations. The phylogenetic relatedness between the sequences was studied, and a bootstrap value of 70% was taken as the threshold for reliable support. Thresholds for low genetic distance, which are commonly used as a proxy for divergence time, were not applied to the cluster definition in the ML trees since these clusters were further confirmed and analyzed using a time-stamped Bayesian phylogenetic analysis with BEAST, as described below.
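The step of keeping the 10 most closely related GenBank sequences for each study sequence can be sketched as follows. This is a minimal illustration, not the HIV-BLAST tool itself: the `(query_id, subject_id, score)` tuple layout, the function names, and the assumption of one row per query-subject pair are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical layout of one similarity-search hit:
# (query_id, subject_id, similarity score; higher means more similar).
Hit = Tuple[str, str, float]


def closest_hits(hits: Iterable[Hit], n: int = 10) -> Dict[str, List[str]]:
    """Keep the n best-scoring subject sequences for every query."""
    by_query = defaultdict(list)
    for query, subject, score in hits:
        by_query[query].append((score, subject))
    selected = {}
    for query, scored in by_query.items():
        # Descending score; ties broken by subject ID for reproducibility.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected[query] = [subject for _, subject in scored[:n]]
    return selected


def reference_set(selected: Dict[str, List[str]]) -> Set[str]:
    """Union of all retained subjects, so shared hits are added only once."""
    return {subject for subjects in selected.values() for subject in subjects}


# Made-up example: q1 has 12 hits, q2 has 2 (one shared with q1).
example = [("q1", f"s{i}", 100.0 - i) for i in range(12)]
example += [("q2", "s0", 99.0), ("q2", "x1", 98.0)]
chosen = closest_hits(example)
print(chosen["q1"])                 # the ten best hits, s0 to s9
print(len(reference_set(chosen)))   # 11
```

Taking the union before building a dataset avoids duplicating a GenBank sequence that is among the closest hits of several study sequences.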
International non-B lineages (defined as phylogenetic associations of at least one sequence from our cohort clustered with sequences from different countries) and ‘Andalusian clusters’ (monophyletic associations of sequences from our cohort alone) were identified in the global ML tree. A Bayesian Markov Chain Monte Carlo (MCMC) approach, as implemented in BEAST v1.7.5, was applied to each of the individual HIV-1 non-B subtype/CRF datasets described above, which included the most genetically similar sequences found with HIV-BLAST. The Shapiro-Rambaut-Drummond-2006 (SRD06) substitution model was used, together with an uncorrelated lognormal relaxed clock (UCLN) and a non-parametric demographic model, the Bayesian Skyline Plot (BSP). This model combination was chosen because it is the one that has best fitted HIV-1 pol data in the majority of studies. The MCMC was run for 250 million states, sampling every 50,000 states. The evolutionary rate (μ, nucleotide substitutions per site per year, subst./site/year) for the different HIV-1 non-B subtypes/CRFs and the most recent common ancestors (MRCA) of the different HIV-1 non-B clusters were estimated. Only traces with an effective sample size (ESS) > 200 for all the parameters, after excluding an initial 10% burn-in, were accepted, as visualised in Tracer v1.6. Maximum Clade Credibility (MCC) trees were constructed in each case to summarise the posterior tree distributions. In these MCC trees, the more epidemiologically relevant clusters and lineages, previously identified in the global ML tree, were studied, and a node support cutoff (posterior probability (pp) above 0.9) was applied for their confirmation. Trees were viewed and edited in FigTree v1.4.0. […]
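The acceptance criteria stated for the Bayesian runs (chain length, sampling interval, 10% burn-in, ESS > 200, posterior probability above 0.9) reduce to simple bookkeeping, sketched below in Python. This is not part of BEAST or Tracer: the function names and the example parameter names are hypothetical, and the initial state 0 is assumed not to be counted as a sample.

```python
from typing import Dict


def retained_samples(chain_length: int, sample_every: int,
                     burnin_percent: int = 10) -> int:
    """Number of logged samples left after discarding the burn-in.

    Assumption: the initial state (state 0) is not counted as a sample.
    """
    total = chain_length // sample_every
    discarded = total * burnin_percent // 100
    return total - discarded


def accept_run(ess: Dict[str, float], threshold: float = 200.0) -> bool:
    """Accept a run only if every parameter has ESS strictly above threshold."""
    return all(value > threshold for value in ess.values())


def node_confirmed(posterior_probability: float, cutoff: float = 0.9) -> bool:
    """A cluster is confirmed when its node support exceeds the cutoff."""
    return posterior_probability > cutoff


print(retained_samples(250_000_000, 50_000))  # 4500

# Hypothetical ESS values as they might be read off a trace summary.
print(accept_run({"posterior": 350.2, "meanRate": 201.0}))  # True
print(accept_run({"posterior": 350.2, "meanRate": 180.0}))  # False
print(node_confirmed(0.95), node_confirmed(0.90))  # True False
```

With 250 million states sampled every 50,000 states, 5,000 samples are logged, and 4,500 remain after the 10% burn-in under these assumptions.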

Pipeline specifications

Software tools: Clustal W, RAxML, SimPlot, BEAST, FigTree
Application: Phylogenetics
Organisms: Human immunodeficiency virus 1, Homo sapiens, Human immunodeficiency virus 2