Computational protocol: Genome Wide Association Study of HIV Whole Genome Sequences Validated using Drug Resistance

Similar protocols

Protocol publication

[…] Fastq quality control was performed using FASTQC(0.11.3) and QUASR(3.1) software applications. Reads of less than 100bp in length and a quality score lower than 30 were excluded. In addition, the reads were trimmed up to 10bp from 5’ and 30bp at the 3’ to exclude poor quality sequence at the beginning and end of reads. We noticed that the second pair read of the Illumina Nextera XT was of lower quality and that excluding the last 30bp increased quality score to > 33. We imposed these exclusion criteria in order to decrease the probability of ambiguous read mapping, which occurs when shorter reads of lower accuracy are included in assemblies []. Following these quality control steps, we mapped reads against a subtype C reference sequence (AF411967) with five assembly iterations using Geneious 8 (http://www.geneious.com) []. After assembly, we exported the data as BAM files and exported contigs as FASTA files.In order to determine if there was clustering of sequences (i.e. sequences that were very similar with low genetic diversity), we aligned all of the whole genomes with a reference dataset for HIV-1 subtype C. The tree was constructed with HKY+Gamma site rate variation in a MPI version of RaxML. Reliability of internal nodes was evaluated by 100 bootstrap replicates. Phylogenies were analysed using Phylotype software application [] in order to detect any clustering of sequences with high bootstrap values (>90%) and low sequence diversity (<3%). This was performed to identify pairs of closely related HIV sequences that might confound the analysis and test the sensitivity of the results to their inclusion. [...] The processing of WGS data to the performing of GWAS is outlined in , with comparison to human GWAS steps. BAM files were converted to VCF format variant calls individually for each sequence in PILON []. A threshold of a depth of 50 reads per base was used for a variant to be called.As GWAS software was originally designed for diploid organisms (i.e. those with two chromosomes and so two copies of any given loci), each sample can be called either as homozygous for an allele (e.g. AA or TT) or as heterozygous (e.g. AT). While heterozygosity is incorrect in the sense that HIV is haploid, it captures an important reality of viral infection: genetic differences within the host’s viral population. We wanted to retain the feature of diploidy to account for samples with diversity at a given DR loci. We expected heterozygous samples to have an intermediate effect size compared to samples where the DR variant was either entirely non-existent or fixed. The downside of this approach was that given numerous sequence reads for each loci, some variation is expected due to sequencing errors. To account for this, we allowed for diploid calling in the following manner. If the reference allele frequency was present in >85% of reads at a loci, the loci was called as homozygous for the reference allele. A heterozygous call with one copy of the reference variant and one of the non-reference variant was made if the reference allele frequency was between 85% and 15% of reads. Finally, a homozygous non-reference call was made if the reference allele frequency was found in less than 15% of reads. While these cut-offs are simply defaults of the software, this worked as a crude calling approach for whether an individual sample’s HIV population was fixed or mixed for any given loci.VCFs were then merged in GATK [], then the combined VCF read into PLINK1.09 [] for GWAS analysis. Prior to analysis, several QC steps were performed. First, where multiple alleles occurred at the same loci, the reference variant and the most common non-reference variant were used to make the loci bi-allelic. Second, a minor allele frequency of greater than 1% was required for all variants. Lastly, we did not implement a restriction on missingness of data. In human GWAS, high missingness for a SNP or individual may reflect poor quality genotyping. However, in HIV WGS sequencing quality is not homogenous across the genome (Fig B in ). As we had restricted analysis to calling variants at loci with a depth of 50 or greater, higher missingness was expected. Missingness for SNPs significantly associated with DR is reported in . [...] A logistic regression was performed in PLINK1.09 [] with drug exposure as the binary outcome and each SNP as a predictor with an additive effect. All samples exposed to a given drug were compared to all that were not. To determine genome-wide significance we performed 10,000,000 permutations within PLINK1.09 both on a single SNP and genome-wide level using the—mperm command. This was performed to account for correlation between nearby SNPs which would have made Bonferroni correction for the raw number of statistical tests overly conservative. Given the smaller number of variants compared to a human GWAS, permutation using 10,000,000 for the empirical p-values was computationally feasible. As the negative correlations in the prescribing of these drugs existed, associations with the same SNP were seen in multiple analyses. However it was possible to identify when exposure was associated with the non-reference sequence (i.e. odds ratio>1) and so, presumably, which association identified the true drug resistant variant. Principal components were generated in GCTA []. […]

Pipeline specifications

Software tools FastQC, QUASR, Geneious, RAxML, PhyloType, Pilon, GATK, PLINK, GCTA
Applications Phylogenetics, WGS analysis, GWAS
Diseases HIV Infections
Chemicals Amino Acids