Computational protocol: Sequence variation in the human transcription factor gene POU5F1

Protocol publication

[…] Chromatograms of fluorescence-based sequences were compared across all samples using PolyPhred to identify polymorphic loci and to determine which individuals were heterozygous or homozygous for the common or variant alleles. PolyPhred uses the base calls and peak information provided by Phred and the sequence alignments provided by Phrap. PolyPhred detects a heterozygous locus by a significant drop in the height of a fluorescence peak together with the presence of a second fluorescence peak, but ignores regions of poor sequence quality, such as the ends of amplicons, to reduce the number of miscalled sites. In addition, all sequences were visually inspected in Consed by a data analyst for confirmation. The polymorphisms were named according to their distance in bases from the first translational start in our reference sequence [See Additional file ], which lies 3,055 bases downstream of our re-sequencing start site.
We accessed the National Center for Biotechnology Information (NCBI) SNP Database on November 13, 2006, and compared the polymorphisms identified from our re-sequencing to those contained in the database. [...]
We analyzed the polymorphism data using the University of California Santa Cruz (UCSC) Genome Browser human assembly from March 2006 to identify the location of SNPs in relation to conserved elements within and flanking POU5F1. We used the 'most conserved' track setting, which shows conserved regions predicted by PhastCons software from the sequences of 17 vertebrates. PhastCons uses a phylogenetic hidden Markov model and a maximum-likelihood approach to produce a conservation score for each nucleotide of the alignment, which can be interpreted as the probability that the nucleotide is in a conserved element. Each predicted conserved element was assigned a log-odds score indicating how much more likely it is under the conserved phylogenetic model than under the nonconserved model. [...] 
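The naming rule described above (distance in bases from the first translational start) can be illustrated with a minimal Python sketch. It rests on assumptions that the protocol does not state: positions in the re-sequenced region are taken to be 1-based, the A of the first ATG is taken to sit at position 3,056 (that is, 3,055 bases precede it), the A itself is numbered +1, and there is no position 0. The function and constant names are hypothetical.

```python
# Assumption: 1-based positions in the re-sequenced region, with 3,055 bases
# preceding the A of the first ATG, so that A sits at position 3,056.
ATG_POSITION = 3055 + 1


def relative_position(reseq_pos: int, atg_pos: int = ATG_POSITION) -> int:
    """Position relative to the translational start.

    Assumed convention: the A of the ATG is +1, the base immediately
    upstream is -1, and position 0 does not exist.
    """
    offset = reseq_pos - atg_pos
    return offset + 1 if offset >= 0 else offset


def polymorphism_name(reseq_pos: int, atg_pos: int = ATG_POSITION) -> str:
    """Signed label for a polymorphism, e.g. '+5' or '-3055'."""
    return f"{relative_position(reseq_pos, atg_pos):+d}"
```

Under these assumptions, the first re-sequenced base is labeled `-3055` and the A of the start codon `+1`. If the reference sequence uses a different origin or allows a position 0, only `ATG_POSITION` and the sign handling need to change.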
Haplotypes were inferred separately for the AD and ED populations using PHASE (version 2.1) software, which uses a Bayesian statistical method. Posterior recombination parameters, obtained from the general model for varying recombination rate in PHASE, were used to determine the presence and location of any potential recombination hotspots, defined as a recombination rate between two neighboring loci that differs substantially from the background recombination rate. Ancestral relationships between inferred haplotypes from both populations were examined using the neighbor-joining method in MEGA (version 3.1) software.
The square of Pearson's correlation coefficient (r²) for pairwise comparisons of biallelic polymorphisms was used to determine the extent to which polymorphic loci provide redundant genotype information. Groups of correlated polymorphisms were identified, separately for the AD and ED populations, with the binning algorithm of the LDSelect version 1.0 program, which groups around each selected polymorphism the maximum number of other polymorphisms that reach a threshold of r² ≥ 0.80 with it. LDSelect also identifies desirable tagging polymorphisms for each group based on the r² statistic between polymorphisms within the group. In addition, the combined population of AD and ED individuals was analyzed using the multiPopTagSelect version 1.1 program, which identifies the smallest number of maximally informative tagging polymorphisms that capture the correlation patterns in both populations.
Within the region we resequenced, the International HapMap Project (data release 21, phase II) has genotyped 31 polymorphisms in several populations (including 90 Caucasian individuals from Utah, USA, in the CEPH collection). We compared the tagging polymorphisms identified from our genotype data to those of two populations in the HapMap project. […]
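The r² statistic and the threshold-based binning described above can be sketched in a few lines of Python. This is a simplified illustration of greedy binning at r² ≥ 0.80, not the LDSelect code itself. It assumes phased haplotypes with each biallelic site coded as a 0/1 vector, and the site names and data are made up for the example.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length 0/1 allele vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a monomorphic site carries no correlation information
    return cov * cov / (vx * vy)


def greedy_bins(alleles, threshold=0.80):
    """Group sites into bins; alleles maps site name -> 0/1 vector."""
    r2 = {}
    for a, b in combinations(list(alleles), 2):
        r2[a, b] = r2[b, a] = r_squared(alleles[a], alleles[b])

    remaining = set(alleles)
    bins = []
    while remaining:
        def partners(site):
            # Unbinned sites that reach the threshold with `site`.
            return {t for t in remaining if t != site and r2[site, t] >= threshold}

        # Seed the bin with the site that has the most partners
        # (ties broken alphabetically, an arbitrary choice for this sketch).
        seed = max(sorted(remaining), key=lambda s: len(partners(s)))
        members = partners(seed) | {seed}
        # A tagging site reaches the threshold with every other bin member.
        tags = sorted(
            s for s in members
            if all(t == s or r2[s, t] >= threshold for t in members)
        )
        bins.append({"members": sorted(members), "tags": tags})
        remaining -= members
    return bins


# Hypothetical data: four sites typed on eight haplotypes.
example = {
    "siteA": [0, 0, 1, 1, 0, 1, 0, 1],
    "siteB": [0, 0, 1, 1, 0, 1, 0, 1],  # identical to siteA
    "siteC": [0, 0, 1, 1, 0, 1, 0, 0],  # differs from siteA on one haplotype
    "siteD": [1, 0, 0, 1, 1, 0, 0, 1],
}
bins = greedy_bins(example)
```

In this toy data set, siteA and siteB are perfectly correlated and share one bin in which either can serve as the tag, while siteC (r² = 0.6 with siteA) and siteD each fall below the threshold and form their own single-site bins.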

Pipeline specifications

Software tools: PolyPhred, Phrap, Consed, PHAST, MEGA, ldSelect, multiPopTagSelect
Databases: International HapMap Project, UCSC Genome Browser
Applications: Phylogenetics, Sanger sequencing, GWAS
Organisms: Homo sapiens
Diseases: Burkitt Lymphoma; Carcinoma, Renal Cell; Encephalitis, Tick-Borne