Computational protocol: HIV-1 envelope sequence-based diversity measures for identifying recent infections

Protocol publication

[…] The quality of the NGS runs was evaluated using the Illumina Sequencing Analysis Viewer v1.10.2 software and the FastQC application. Sequencing depth and coverage were available in Coverage.txt and in the OneDrive HIV_A_kafando project. Species with coverage below 100x were excluded from the final statistical analyses.

Sequences were assembled de novo using the Iterative Virus Assembler (IVA) to generate a consensus. The HIV-1 env subdomains gp120 V1 to C5 and part of the gp41 ectodomain (the first 158 bp) were analyzed separately. The gp120 C2 and C3 subdomains were subdivided into 3 and 2 segments, respectively, so that subsequent analyses compared DNA sequences of sizes similar to those of the other regions.

To map the subdomains, consensus sequences were aligned with the HXB2 env reference sequence (GenBank accession number K03455.1, HIVHXB2CG, env nucleotide positions 6225–8795) using Clustal W in MEGA 7.0. The env subdomain boundaries, in HXB2 complete genome numbering, were as follows: gp120 V1 (6615–6692, ≈78 bp), V2 (6696–6812, ≈116 bp), C2_segment 1 (6813–6913, ≈100 bp), C2_segment 2 (6914–7014, ≈100 bp), C2_segment 3 (7015–7109, ≈94 bp), V3 (7110–7217, ≈108 bp), C3_segment 1 (7218–7320, ≈102 bp), C3_segment 2 (7321–7376, ≈56 bp), V4 (7377–7478, ≈102 bp), C4 (7479–7556, ≈78 bp), V5 (7557–7637, ≈80 bp), C5 (7638–7757, ≈120 bp) and gp41 ectodomain (7758–7915, ≈158 bp).

Intra-patient genetic diversity was evaluated for each subdomain/segment using an in-house Python pipeline. SMALT was used to map the reads against their respective consensus sequence, and SAMtools (Sequence Alignment/Map) was used to analyze the mapping file generated by SMALT. Bioconductor packages were used for the genetic diversity calculation.
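
The HXB2 coordinates listed above can be turned into a small lookup table for cutting segments out of an env sequence. The Python sketch below is only an illustration and is not the authors' in-house pipeline: the names SUBDOMAINS, ENV_START, segment_length and extract_segment are hypothetical, and the code assumes the input sequence is already in gap-free HXB2 env coordinates (HXB2 position 6225 at index 0), which in practice depends on the alignment step described above.

```python
# HXB2 complete-genome coordinates (1-based, inclusive), copied from the protocol text.
SUBDOMAINS = {
    "V1": (6615, 6692),
    "V2": (6696, 6812),
    "C2_segment_1": (6813, 6913),
    "C2_segment_2": (6914, 7014),
    "C2_segment_3": (7015, 7109),
    "V3": (7110, 7217),
    "C3_segment_1": (7218, 7320),
    "C3_segment_2": (7321, 7376),
    "V4": (7377, 7478),
    "C4": (7479, 7556),
    "V5": (7557, 7637),
    "C5": (7638, 7757),
    "gp41_ectodomain": (7758, 7915),
}

# First HXB2 nucleotide of env, as given in the protocol (6225-8795).
ENV_START = 6225


def segment_length(start, end):
    """Length of an inclusive, 1-based coordinate range."""
    return end - start + 1


def extract_segment(env_seq, name):
    """Slice one subdomain out of an env sequence.

    Assumption: env_seq is gap-free and index 0 corresponds to HXB2
    position ENV_START. Real consensus sequences would first have to be
    aligned to HXB2 to establish this correspondence.
    """
    start, end = SUBDOMAINS[name]
    return env_seq[start - ENV_START : end - ENV_START + 1]


if __name__ == "__main__":
    for name, (start, end) in SUBDOMAINS.items():
        print(name, start, end, segment_length(start, end))
```

Lengths computed from inclusive coordinates can differ by one base from the approximate sizes quoted in the text, which is consistent with the "≈" used there.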
More details about the specific packages and the Python code used for the diversity estimates are available at the following DOIs: […] and […]!Ao82p2mOrppwgl8eApq05btfNl8P

The four sequence-based diversity measures were calculated as previously described. Briefly, the percent diversity was evaluated as the average pairwise genetic distance between sequences, the percent complexity was expressed as the number of distinct variants divided by the total number of reads, multiplied by 100, and the Shannon entropy index (S) was calculated using a formula that accounts for both the number of distinct reads and their proportional representation in the dataset. The number of haplotypes counted only the distinct quasispecies or variants present at a frequency of at least 1% in the viral population. Frequency distribution curves of the percent diversity, percent complexity, Shannon entropy and number of haplotypes for recent versus chronic sequences were generated in R with ggplot2.

[...] We used two HIV subtyping tools to determine a consensus HIV subtype: the Rega HIV Subtyping Tool V3, with confirmation by the NCBI HIV subtyping tool.

[...] Summary statistics (mean, median and interquartile range) were used to estimate the intra- and inter-patient envelope genetic diversity. Student's t-test was used to compare the diversity measures between sequences from recent and chronic infections. Analyses were performed using Epi Info™ 7 and IBM SPSS Statistics software. P-values below 0.05 were considered statistically significant. […]
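
The four diversity measures described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published pipeline (which used Bioconductor packages). It assumes the reads for one subdomain are already aligned to equal length, takes the pairwise genetic distance to be the simple proportion of differing sites (p-distance), and uses the natural-log Shannon entropy without normalization. The source does not state which distance model or entropy formula was used, and the function names p_distance and diversity_measures are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log


def p_distance(a, b):
    """Proportion of differing sites between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def diversity_measures(reads, min_haplotype_freq=0.01):
    """Return the four measures for a list of aligned reads.

    Assumptions: p-distance as the genetic distance, natural-log Shannon
    entropy, and a haplotype threshold of 1% as stated in the protocol.
    """
    n = len(reads)
    if n < 2:
        raise ValueError("need at least two reads")
    counts = Counter(reads)  # distinct variant -> number of reads

    # Average distance over all read pairs. Pairs of identical reads have
    # distance 0, so only pairs of distinct variants contribute, each
    # weighted by the number of read pairs it represents.
    total = sum(
        ci * cj * p_distance(vi, vj)
        for (vi, ci), (vj, cj) in combinations(counts.items(), 2)
    )
    n_pairs = n * (n - 1) / 2
    freqs = [c / n for c in counts.values()]

    return {
        "percent_diversity": 100 * total / n_pairs,
        "percent_complexity": 100 * len(counts) / n,
        "shannon_entropy": -sum(p * log(p) for p in freqs),
        "n_haplotypes": sum(1 for p in freqs if p >= min_haplotype_freq),
    }


if __name__ == "__main__":
    # Toy data: 8 reads of one variant and 2 reads of a second variant.
    toy_reads = ["ACGT"] * 8 + ["ACGA"] * 2
    print(diversity_measures(toy_reads))
```

Grouping identical reads with Counter before comparing them keeps the pairwise step proportional to the number of distinct variants, not the number of reads, which matters at NGS read depths.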

Pipeline specifications

Software tools: FastQC, IVA, Clustal W, MEGA, SMALT, ggplot2, REGA HIV Subtyping Tool, Epi Info, SPSS
Applications: Miscellaneous, Phylogenetics, Population genetic analysis
Diseases: Infection, HIV Infections