Computational protocol: Molecular Dating of HIV-1 Subtype C from Bangladesh

[…] The analyzed viral sequences originated from blood samples obtained from serological surveillance surveys conducted by the International Centre for Diarrhoeal Disease Research, Bangladesh (icddr,b) on behalf of the Government of Bangladesh during 1999-2007, from research studies, and from voluntary counseling and testing (VCT) units from 2002 to 2007. The sampling procedures, PCR and sequencing methodology have been described previously [], and the majority of sequences had already been submitted to GenBank (accession numbers EF999759-EF999824, EU167547-EU167548, JQ668327-JQ668332, AF470702, AF470704, AF470707, AF470709, AF470710, AF448213-AF448217 and AF448219; new submissions for this paper: KC859271-KC859303). Informed and signed consent was obtained from all study participants prior to drawing blood, and in the case of children (N=5) consent was obtained from parents/guardians. A summary of the consent form was read out to those who could not read, and a left thumb impression was obtained from those who could not sign. This study, which included secondary analysis of already retrieved samples, was approved by the Ethical Review Committee of icddr,b. The age of the study population ranged from 2 to 59 years, 22% were female and 99/118 (83.8%) were from the capital, Dhaka. The population included 52 people who inject drugs (PWID), 2 heroin smokers, 10 female sex workers (FSW), 3 men who have sex with men (MSM), 2 transgender people (TG) and 49 people who were either VCT clients or were diagnosed with HIV while seeking treatment for tuberculosis (TB, N=2) or other sexually transmitted infections (STIs, N=3). The VCT clients included five vertically infected children (ages 2-7), while the risk factor in the other cases was mostly unknown. However, 23 of the VCT clients had a history of travel to Saudi Arabia, Kuwait, United Arab Emirates, India, Nepal or Malaysia. A total of 118 Bangladeshi subtype C sequences were available for the gag gene.
This gene is well suited for phylogenetic analysis because it is less variable than env and, unlike pol, is not under selective pressure from antiretroviral drugs. For the phylogenetic analysis all sequences were truncated to start in the correct reading frame and cover exactly the same genetic region (corresponding to positions 934-1244 in the HXB2 reference strain). Reference sequences were retrieved from the LANL HIV sequence database [] through BLAST searches for all included Bangladeshi strains (the five best hits were kept for each strain, duplicates were eliminated and only strains with known year and country of sampling were included). In addition, the geographic search interface was used to retrieve subtype C sequences from nearby countries, and 16 additional sequences from Myanmar were included. China and India were already represented from the BLAST search, and subtype C sequences from other nearby countries, including Nepal and Malaysia, were not available. A search for sequences from the most popular destination countries for migrant workers (Saudi Arabia, UAE, Kuwait, Oman, Qatar, Bahrain and Libya) was also performed, but no subtype C sequences were available from this geographic region for the part of the gag gene used in this analysis. Finally, subtype references for B, C, 07_BC and 08_BC were retrieved and included in the data set. The final set included sequences from Bangladesh (118), China (31), Myanmar (19), India (16), South Africa (14), Zambia (12), Malawi (9), Ethiopia (7), Israel (4), Botswana (4), Great Britain (2), Somalia (2), Zimbabwe (1), Denmark (1), Kenya (1), Tanzania (1), USA (1), Brazil (1), Thailand (1) and France (1). Alignments were performed in MEGA 5 [], and manual editing was done to ensure that all strains were in the correct reading frame throughout the alignment. The final gag alignment included 246 taxa covering 374 sites. The phylogenetic analysis was performed using BEAST v.1.7.3 [].
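The reference-selection rule described above (five best BLAST hits per Bangladeshi strain, duplicates removed, only strains with known year and country of sampling kept) can be sketched in Python as follows. This is an illustrative sketch only, not the authors' script: the function name, the tuple layout of the hits, the use of a bit score for ranking and the metadata dictionary are all assumptions, as is the order of the steps (top five first, then de-duplication and metadata filtering).

```python
from collections import defaultdict


def select_references(hits, metadata, per_query=5):
    """Pick reference sequences from BLAST hits (hypothetical data layout).

    hits     : iterable of (query_id, subject_id, score) tuples; a higher
               score is assumed to mean a better hit.
    metadata : dict mapping subject_id -> {"year": ..., "country": ...};
               a missing entry or a None value means "unknown".
    """
    # Group the hits by the query strain they came from.
    by_query = defaultdict(list)
    for query_id, subject_id, score in hits:
        by_query[query_id].append((score, subject_id))

    selected = []
    seen = set()
    for query_id, scored in by_query.items():
        # Keep the best `per_query` hits for this query strain.
        best = sorted(scored, key=lambda item: item[0], reverse=True)[:per_query]
        for _score, subject_id in best:
            # Eliminate duplicates shared between query strains.
            if subject_id in seen:
                continue
            # Keep only strains with known year and country of sampling.
            info = metadata.get(subject_id, {})
            if info.get("year") is None or info.get("country") is None:
                continue
            seen.add(subject_id)
            selected.append(subject_id)
    return selected
```

Applying the metadata filter after the top-five cut means a query strain can contribute fewer than five references; filtering first would be an equally plausible reading of the text.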
The GTR+Γ+I substitution model was used in all runs, since ModelTest [] identified it as the most appropriate model. Preliminary analysis included combinations of lognormal and exponential relaxed clocks with logistic growth and Bayesian Skyline tree priors. Bayes factor analysis performed in Tracer v.1.5 [] showed that the lognormal relaxed clock with the Bayesian Skyline was the best model, and this was used for the subsequent analyses. The final analysis consisted of two parallel runs of 20 million generations, with parameters logged every 1,000 generations. Tip dates (year of sampling) were included for all sequences and four different taxon sets were defined: one with all subtype C strains, and three Bangladesh-specific clades. An evolutionary rate of 0.002 mutations/site/year was used for gag. Adjusted priors were the tree root height (mean 100 years, stdev 20) and a previously calculated tMRCA for subtype C (mean 58, stdev 6 []). The resulting log files were analyzed in Tracer, while the final tree was compiled in TreeAnnotator using a burn-in of 1,000 trees and edited in FigTree []. […]
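The run settings above imply some simple bookkeeping, which the following Python sketch works through. It is not part of the original protocol. It assumes that the burn-in of 1,000 trees applies to a single run, that trees were logged at the same interval as the parameters, and that the two adjusted priors are normal distributions on a time scale in years (the text gives only a mean and a standard deviation). All function names are hypothetical.

```python
from statistics import NormalDist

CHAIN_LENGTH = 20_000_000  # generations per run (from the text)
LOG_EVERY = 1_000          # logging interval in generations (from the text)
BURN_IN_TREES = 1_000      # TreeAnnotator burn-in in trees (from the text)


def logged_samples(chain_length, log_every):
    """Number of logged states per run, not counting the initial state."""
    return chain_length // log_every


def burn_in_fraction(burn_in, n_samples):
    """Share of one run's samples discarded as burn-in."""
    return burn_in / n_samples


def normal_interval(mean, stdev, mass=0.95):
    """Central interval of a normal prior (normality is an assumption)."""
    dist = NormalDist(mean, stdev)
    tail = (1.0 - mass) / 2.0
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


samples = logged_samples(CHAIN_LENGTH, LOG_EVERY)
fraction = burn_in_fraction(BURN_IN_TREES, samples)
root_low, root_high = normal_interval(100, 20)  # tree root height prior
subc_low, subc_high = normal_interval(58, 6)    # subtype C tMRCA prior

print(f"samples per run: {samples}")
print(f"burn-in fraction per run: {fraction:.1%}")
print(f"root height prior, central 95%: {root_low:.1f} to {root_high:.1f}")
print(f"subtype C tMRCA prior, central 95%: {subc_low:.1f} to {subc_high:.1f}")
```

Under these assumptions each run yields 20,000 logged states, so a burn-in of 1,000 trees discards 5% of one run. If the two runs were combined before annotation, the discarded share would differ.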

Pipeline specifications

Software tools: MEGA, BEAST, ModelTest-NG, FigTree
Application: Phylogenetics
Organisms: Human immunodeficiency virus 1, Human immunodeficiency virus 2
Diseases: HIV Infections