Computational protocol: Genome-wide target profiling of piggyBac and Tol2 in HEK 293: pros and cons for gene discovery and gene therapy

[…] Target sites were identified in build hg18 of the human genome using BLAT [], with a sequence identity cutoff of 95%. Human genes were obtained from RefSeq [], and 2,075 cancer-related genes were taken from the CancerGenes database []. Before counting the number of genes within n-base intervals, all overlapping genes were merged to avoid over-counting. CpG islands were taken from the UCSC Genome Browser "CpG Island" track, which identifies CpG islands based on the methods of Gardiner-Garden and Frommer []. Repeat element predictions were obtained from RepeatMasker []. Only insertions whose first 100 bases were contained within a repeat element were considered to overlap a repeat element. To estimate the significance of the tendency of insertions to be located proximal to CpG islands, we compared the number of insertions located within 2,000 bases of a CpG island to the number expected by chance. The expected number was calculated for each transposon type by picking N random regions in the genome of the same size (in bases) as the given transposon, where N is the total number of insertions for the given transposon. This procedure was repeated 1,000 times, and the mean and standard deviation of the number of random insertion points within 2,000 bases of a CpG island across the 1,000 random trials were used to obtain a Z-score (and associated P-value) for the actual number of insertions located within 2,000 bases of a CpG island. […]
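The interval merging and the random-sampling test described above can be illustrated with a minimal Python sketch. It is not the authors' code, and several points are assumptions made only for illustration: the genome is treated as a single linear coordinate space rather than separate chromosomes, intervals are half-open (start, end) pairs, the sample standard deviation is used, and the P-value is a one-sided upper-tail value from the normal distribution. The function names (merge_intervals, near_any, cpg_proximity_zscore) and their parameters are hypothetical.

```python
import bisect
import math
import random
import statistics


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals, e.g. overlapping genes."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def near_any(start, end, merged, window):
    """Return True if [start, end) lies within `window` bases of a merged interval."""
    # Count the merged intervals that begin no later than end + window.
    i = bisect.bisect_right(merged, (end + window, float("inf")))
    # Merged intervals are disjoint and sorted, so the last such interval
    # has the largest end coordinate; it is the only one that needs checking.
    return i > 0 and merged[i - 1][1] >= start - window


def cpg_proximity_zscore(insertions, islands, genome_length, size,
                         window=2000, trials=1000, seed=0):
    """Compare observed insertions near CpG islands with random placements.

    `insertions` holds start coordinates; each region spans `size` bases.
    Assumption: one linear coordinate space instead of per-chromosome data.
    """
    rng = random.Random(seed)
    merged = merge_intervals(islands)
    observed = sum(near_any(s, s + size, merged, window) for s in insertions)

    counts = []
    for _ in range(trials):
        hits = 0
        for _ in range(len(insertions)):
            s = rng.randrange(genome_length - size + 1)
            hits += near_any(s, s + size, merged, window)
        counts.append(hits)

    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("random trials show no variation; Z-score is undefined")
    z = (observed - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal tail (assumption)
    return observed, mean, sd, z, p
```

A call such as cpg_proximity_zscore(starts, islands, genome_length, size) returns the observed count, the mean and standard deviation over the random trials, the Z-score, and the P-value. With real data, the sampling would need to respect chromosome boundaries and, depending on the analysis, exclude assembly gaps.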

Pipeline specifications

Software tools BLAT, RepeatMasker
Databases CancerGenes, UCSC Genome Browser
Application Genome data visualization
Organisms Homo sapiens