Computational protocol: A Selection Index for Gene Expression Evolution and Its Application to the Divergence between Humans and Chimpanzees

Similar protocols

Protocol publication

[…] To estimate Vw and Vb from experimental data, we used a previously published expression dataset from human and chimpanzee lymphoblastoid cell lines, measured on the human-specific Affymetrix U133A microarray . We masked the data by removing all probes that did not have a unique perfect match in the chimpanzee genome. Probe sets with less than four remaining probes were discarded, as smaller probe sets tend to give unreliable results . Expression values were calculated with the robust microchip average (RMA) method as implemented in Bioconductor –. For genes with multiple probe sets on the array, we chose a single probe set at random to represent that gene.The dataset from Choy et al. included cell lines derived from 5 chimpanzees and 46 humans, of which 13 were of European descent (CEU), 19 of Han Chinese or Japanese descent (CHB/JPT) and 14 of Yoruba descent (YRI). For each human sample, two replicates were available, whereas three or four replicates were available for the chimpanzee samples. To achieve a balanced experimental design, five individuals were randomly chosen from each of the human populations, and two replicates were randomly chosen for each chimpanzee individual, so that for each analysis we had five humans and five chimpanzees with two replicates each. The between-species, within-species and error variance components were then estimated by nested ANOVA of the log-transformed expression values, with the modification that we calculated separate estimates for the human and chimpanzee within-species and error variances.To verify that our variance estimates were unbiased even in cases with unequal variances, we used the same method to analyse simulated expression datasets that were based on the model(10)where yijk is the log2 expression value for species i, individual j and replicate k, μi is the true mean, Iij represents individual variation and εijk is the measurement error. The values for μI, Iij and εijk were drawn from normal distributions with variance corresponding to the between-species, within-species and error variances displayed in .Estimates of π and d for each gene were obtained as follows: We extracted the intron coordinates of all human autosomal protein-coding genes in Ensembl release 56 . To further ensure that we were working with purely neutral sequences, we removed any sequences that were within 50 bp of a splice junction or that overlapped with exons from other genes. We also removed conserved elements identified by the phastCons program by excluding all sequences that featured in the ‘Primate El’ table of the Conservation track for the human genome release hg18 in the UCSC Genome Browser . The SNP frequency spectra for these neutral sequences in the CEU, CHB/JPT and YRI populations were taken from low coverage pilot data from the 1000 Genomes Project . To correct for the limited power to detect very rare variants, we divided the number of observed SNPs at different frequencies by the power to detect SNPs at that frequency (estimates of detection power were kindly provided by Adam Auton). To estimate the degree of sequence divergence, we downloaded blastz alignments of the human and chimpanzee genomes (releases hg18 and panTro2, respectively) from the UCSC genome browser , , . We excluded sites where the human sequence was unknown (‘N’) or where the chimpanzee sequence had a quality score of 40 or below, as judged from the Quality Scores track in the UCSC Genome Browser.In equations 5 and 9 we need to subtract the average nucleotide diversity, across our two species, from d. Unfortunately we do not have data from chimpanzee and so we assumed that the nucleotide diversity for each gene was the same in humans and chimpanzees. The true chimpanzee value is likely to be larger , , which means that our estimate of d is slightly inflated and will cause our test to be somewhat conservative. To test whether this had a major influence on our results, we repeated the analysis, assuming that the chimpanzee average heterozygosity was 10-fold larger than the one found in humans.To gauge the accuracy of selection index estimates for individual genes, we generated datasets of 5000 genes where all genes had a true selection index of −5, −2, 0, 2 or 5. In our simulations, we drew Vw from a uniform distribution ranging from 10−4 to 1 and used this value and the true selection index to set the true Vb for that gene. Note that the results of this analysis are independent of the magnitude of Vw. We then estimated Vb based on two species means drawn from a normal distribution with a mean of 0 and variance corresponding to the true Vb, and used this to calculate the estimated selection index. […]

Pipeline specifications

Software tools PHAST, LASTZ
Databases UCSC Genome Browser
Applications Nucleotide sequence alignment, Genome data visualization
Organisms Pan troglodytes, Homo sapiens