Computational protocol: Genetic factors affecting EBV copy number in lymphoblastoid cell lines derived from the 1000 Genome Project samples

[…] To estimate the number of EBV genome copies per cell (EBV copy number) in a given LCL, we compared the coverage of mapped reads between human genomic regions and the EBV reference genome. To determine EBV coverage, reads that did not map to the human reference genome (labelled "unmapped") or that were already mapped to the EBV reference genome (labelled NC_007605 in the "mapped" 1KGP alignment files) were retrieved from the 1KGP website. For each LCL sample, we remapped these reads to the EBV reference genome (NC_007605), which consists of the B95-8 strain plus 12 kb of the Raji strain to correct the non-natural B95-8-specific deletion. Duplicate paired mappings were removed to avoid PCR duplicates, and paired reads that did not map together were filtered out using SAMtools. Only uniquely mapping reads were retained. A total of 2,215 LCL-derived genome samples from 4 continents (Europe, Asia, Africa and America), comprising 19 populations, were selected as the final data set for in silico EBV copy number estimation. Lastly, we used GATK's DepthOfCoverage tool to quantify the average EBV coverage per genome sample in a masked version of the EBV reference genome, in which all repetitive and low-complexity regions and the B95-8-specific deletion were excluded (127,219 bp in total). Particular attention was paid to reads mapping within the B95-8-specific deletion at a median coverage of ≥ 1 and an EBV coverage of ≤ 1, since they could indicate cell lines co-infected with natural EBV strains or the use of blood as the genome source. All such reads were identified and excluded from further analysis. Next, the hg19 human reference genome was masked to properly estimate the average human genome coverage, excluding regions of copy number variation (CNV), segmental duplications, tandem repeats and the RepeatMasker UCSC tracks.
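The read filtering and masked-coverage steps described above can be illustrated with a minimal Python sketch. It is not the SAMtools or GATK implementation used in the study. The function names (`keep_read`, `mean_unmasked_depth`), the use of the "proper pair" flag for reads "mapping together", and the mapping-quality threshold used as a stand-in for "uniquely mapping" are all assumptions made for illustration.

```python
# SAM flag bits, as defined in the SAM format specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_read(flag, mapq, min_mapq=1):
    """Return True if a read passes the (assumed) filtering criteria.

    Assumptions: duplicates are already marked with the duplicate flag,
    "mapping together" is approximated by the proper-pair flag, and
    "uniquely mapping" is approximated by MAPQ >= min_mapq.
    """
    if flag & (UNMAPPED | DUPLICATE | SECONDARY | SUPPLEMENTARY):
        return False
    if (flag & PAIRED) and not (flag & PROPER_PAIR):
        return False
    return mapq >= min_mapq


def mean_unmasked_depth(depth, masked_intervals):
    """Average per-base depth over positions outside the masked intervals.

    depth: list of per-base depths, where index 0 is reference position 1.
    masked_intervals: list of (start, end) tuples, 1-based and inclusive.
    """
    masked = [False] * len(depth)
    for start, end in masked_intervals:
        for pos in range(max(start, 1), min(end, len(depth)) + 1):
            masked[pos - 1] = True
    kept = [d for d, is_masked in zip(depth, masked) if not is_masked]
    if not kept:
        raise ValueError("all positions are masked")
    return sum(kept) / len(kept)
```

The same averaging applies to either genome: only the list of masked intervals changes (repeats, low-complexity regions and the B95-8-specific deletion for EBV; CNV, segmental duplication and repeat tracks for hg19).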
Five random 1 kbp windows representing "callable" loci were selected on each chromosome, yielding a sequence of 110 kbp (1 kbp × 5 windows × 22 chromosomes = 110 kbp). Reads overlapping these segments were retrieved from the 1KGP website and filtered with the same criteria described above for the EBV mappings, and the median coverage value was calculated for these regions with the DepthOfCoverage tool. Finally, EBV copy number was estimated on the basis that the human genome coverage corresponds to 2 DNA copies per cell: the number of EBV copies per cell was calculated by dividing the EBV genome coverage by half of the human genome coverage. Prior to GWAS analysis, because the range of EBV copy number is very wide and varies among populations, copy number values were normalized by inverse rank transformation using GenABEL. […]
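The copy number arithmetic and the rank-based normalization can be sketched in Python as follows. This is an illustration only: the study used GenABEL (an R package), and the exact rank offset GenABEL applies is not stated in the text. The `(rank - 0.5) / n` form below is one common convention and is assumed here, as are the function names.

```python
from statistics import NormalDist


def ebv_copies_per_cell(ebv_coverage, human_coverage):
    """EBV coverage divided by half of the human (diploid) coverage."""
    if human_coverage <= 0:
        raise ValueError("human coverage must be positive")
    return ebv_coverage / (human_coverage / 2.0)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def inverse_rank_normalize(values):
    """Map values to standard normal quantiles of (rank - 0.5) / n.

    The 0.5 offset is an assumed convention, not taken from the source.
    """
    n = len(values)
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.5) / n) for r in average_ranks(values)]
```

For example, with a hypothetical EBV coverage of 30× and a human coverage of 6×, `ebv_copies_per_cell(30.0, 6.0)` gives 10 copies per cell. The transformation keeps only the ordering of samples, which is why it suits a trait whose raw range is very wide.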

Pipeline specifications

Software tools: GATK, GenABEL
Application: GWAS
Organisms: Human gammaherpesvirus 4 (Epstein-Barr virus), Homo sapiens
Diseases: Infectious Mononucleosis, Multiple Sclerosis, Neoplasms, Epstein-Barr Virus Infections