Computational protocol: HPV Population Profiling in Healthy Men by Next-Generation Deep Sequencing Coupled with HPV-QUEST

Protocol publication

[…] The bioinformatic pipeline for analysis of L1 deep sequences is outlined in . Raw reads generated from pooled libraries were de-multiplexed in Geneious Pro 5.6.5 (Biomatters Ltd., Auckland, New Zealand) using the “Separate Reads by Barcode” function, with 454 MIDs selected as the Barcode Set, to assign sequences to each individual. Quality control was performed first in Geneious Pro 5.6.5 to exclude reads that were unassigned, or inaccurately assigned to MIDs not used in multiplexing, because of an incomplete MID, no MID, or sequencing errors in the MID; reads with more than one mismatch in any of the primers; and reads with a sequence length outside the mean ± 2 SD range. Subsequent analysis in BioEdit (Ibis Biosciences, Carlsbad, CA, USA) identified and excluded reads with ambiguous nucleotide(s) or out-of-reading-frame indels. After quality control, a median (quartile range, QR) of 1,441 (684–1,670) and 25,928 (16,444–36,038) quality sequences were obtained for the first 14 samples and the second 11 samples, respectively, with corresponding median (QR) removal rates of 9.7% (8.0%–14.5%) and 1.4% (0.7%–3.4%).

To assess deep sequencing errors, 25,928 quality L1 deep sequences from a molecular clone of HPV16 were aligned to an HPV16 reference sequence (GI|333031) using MUSCLE in the Geneious package, followed by BioEdit (Ibis Biosciences, Carlsbad, CA, USA). Errors were evaluated using in-house code. The overall sequencing error rate was 0.46% (46 errors per 10,000 nucleotides), comprising 0.09% transitions, 0.03% transversions, 0.08% insertions, and 0.26% deletions. After correction of indels, the rate of misincorporation was 0.12% (12 errors per 10,000 nucleotides), which is 18 times lower than 2%, the maximum dissimilarity within an HPV genotype.

Quality sequences were clustered at 3% pairwise distance (representing 2% dissimilarity within variants plus 1% to more than compensate for sequencing errors) using ESPRIT.
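To make the 3% clustering criterion concrete, the following is a minimal Python sketch of a pairwise distance check between two sequences that have already been aligned to equal length. It is not ESPRIT's implementation, which performs its own pairwise alignment and clustering. The function names, the treatment of a gap opposite a base as one difference, and the use of a simple proportion of differing sites are all assumptions made for illustration.

```python
# Threshold taken from the text: 2% within-genotype dissimilarity
# plus a 1% allowance for sequencing errors.
CLUSTER_THRESHOLD = 0.03


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two pre-aligned sequences.

    Assumptions (not from the source): both inputs have equal length,
    '-' marks an alignment gap, columns that are gaps in both sequences
    are ignored, and a gap opposite a base counts as one difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def within_cluster_distance(seq_a: str, seq_b: str,
                            threshold: float = CLUSTER_THRESHOLD) -> bool:
    """True if the two sequences differ by no more than the threshold."""
    return p_distance(seq_a, seq_b) <= threshold
```

For a query of roughly 106 nt, a 3% threshold corresponds to about three differing positions, so two reads that differ at two sites would pass this check while reads differing at five sites would not.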
ESPRIT generated a consensus sequence for each cluster, representing an HPV genotype variant, and reported the size of the cluster (number of sequences per cluster), reflecting the abundance of each variant. Cluster size serves only as a reference rather than a quantitation because the PGMY/GP+ consensus primers may be biased towards certain HPV genotypes over others. The number of consensus sequences for each genotype is detailed in .

Consensus sequences were queried in HPV-QUEST, a custom HPV-genotyping server that holds 150 cutaneous and mucosal HPV reference sequences containing the L1 region, obtained from the National Center for Biotechnology Information (NCBI) GenBank, the Los Alamos HPV Sequence Database, and the Virus Sequence Database, with nomenclature updated. HPV-QUEST accommodates large sequence data sets for automated and expedited HPV genotyping and classification of HPV genus, species, type, and oncogenicity (high risk = oncogenic; low risk = nononcogenic). To ensure reliability of the HPV genotyping, highly stringent cutoff values were applied: e-value ≤ 1.0E-38 and ≥ 80 nucleotide identities between query (~106 nt) and reference sequences. Because of the stability and conservation of HPV genomes over evolutionary time, genome segments as short as 20 to 30 nt can provide reliable genotyping. Validation analysis of genetically unrelated sequences, including human immunoglobulin heavy chain variable region or human immunodeficiency virus type 1 envelope hypervariable region 3, returned results as “not identified (ND)”. Singleton HPV genotypes that were concordant with LA typing, unique among individuals, or carrying sample-specific substitution(s) were included in the study because contamination among the samples was unlikely; all other singletons were excluded from further analysis. Cervical epidermoid carcinoma cell lines CaSki and C-4 II, reported to contain HPV16 and HPV18, respectively, were used to verify the sequence analysis pipeline.
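The stringent genotyping cutoff described above can be sketched as a simple filter over alignment hits. This is an illustrative Python sketch, not HPV-QUEST's actual code. The `Hit` record, its field names, and the rule of reporting the passing hit with the lowest e-value are assumptions. The cutoff values are from the text, with "≥ 80 nucleotide identities" read here as a count of identical positions.

```python
from dataclasses import dataclass
from typing import Iterable

E_VALUE_MAX = 1.0e-38   # cutoff stated in the text
MIN_IDENTITIES = 80     # identical nucleotides out of a ~106 nt query


@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one query-versus-reference alignment."""
    query_id: str
    genotype: str
    e_value: float
    identities: int


def call_genotype(hits: Iterable[Hit]) -> str:
    """Return the genotype of the best hit passing both cutoffs, else 'ND'.

    Choosing the lowest e-value (ties broken by more identities) is an
    assumption made for this sketch.
    """
    passing = [
        h for h in hits
        if h.e_value <= E_VALUE_MAX and h.identities >= MIN_IDENTITIES
    ]
    if not passing:
        return "ND"  # "not identified", as reported for unrelated sequences
    best = min(passing, key=lambda h: (h.e_value, -h.identities))
    return best.genotype
```

Under these assumptions, a query whose only hits fall short of either cutoff is reported as "ND", which mirrors the behaviour described for the genetically unrelated validation sequences.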
A total of 16,444 quality sequences generated from C-4 II DNA were genotyped as a single variant identical to reference HPV18, while a total of 22,763 quality sequences obtained from CaSki DNA formed a single cluster that was identical to reference HPV16. Papillomavirus Episteme (PaVE) PV Specific Blastn was used to confirm HPV-QUEST genotyping results, and NCBI Blastn was applied to evaluate discrepancies between genotyping by HPV-QUEST and by PaVE PV Specific Blastn. Genotype variants were studied by aligning them to the corresponding reference sequences to identify synonymous and/or nonsynonymous substitution(s). […]
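The final step, identifying synonymous and nonsynonymous substitutions, can be illustrated with a short Python sketch that compares a variant with its reference codon by codon using the standard genetic code. This is a simplified illustration, not the authors' procedure. It assumes both sequences are already aligned without gaps, contain only unambiguous bases, and have been trimmed to start in frame and to a length divisible by three. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitutions(reference: str, variant: str):
    """List codon-level differences between two in-frame sequences.

    Returns tuples of (codon number, reference codon, variant codon, kind),
    where kind is 'synonymous' or 'nonsynonymous'. Codon numbers are
    1-based and relative to the start of the supplied fragment.
    """
    reference, variant = reference.upper(), variant.upper()
    if len(reference) != len(variant) or len(reference) % 3 != 0:
        raise ValueError("sequences must be equal length and in frame")
    changes = []
    for start in range(0, len(reference), 3):
        ref_codon = reference[start:start + 3]
        var_codon = variant[start:start + 3]
        if ref_codon == var_codon:
            continue
        same_aa = CODON_TABLE[ref_codon] == CODON_TABLE[var_codon]
        kind = "synonymous" if same_aa else "nonsynonymous"
        changes.append((start // 3 + 1, ref_codon, var_codon, kind))
    return changes
```

For example, comparing `ATGGCTAAA` with `ATGGCCAGA` reports codon 2 (GCT to GCC, both alanine) as synonymous and codon 3 (AAA to AGA, lysine to arginine) as nonsynonymous.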

Pipeline specifications

Software tools: Geneious, BioEdit, ESPRIT, BLASTN
Databases: HPV Sequence Database
Application: 16S rRNA-seq analysis
Organisms: Homo sapiens, Human papillomavirus type 16
Diseases: Infection, Neoplasms