Computational protocol: Mutation Rate Variation is a Primary Determinant of the Distribution of Allele Frequencies in Humans

Similar protocols

Protocol publication

[…] For polymorphism data, we downloaded single nucleotide polymorphism (SNP) data from version 0.2 of the Exome Aggregation Consortium database []. This database is a standardized aggregation of several exome sequencing studies amounting to a sample size of over 60,700 individuals and approximately 8 million SNPs. For each SNP we extracted upstream and downstream 30 nucleotides in the coding sequence of the human reference genome hg19 build. For simplicity, we excluded sites that are tri-allelic (6.5% of all SNPs) or quad-allelic (0.2% of all SNPs).For divergence data, we used the following reference genome builds downloaded from the UCSC genome browser []: chimpanzee (panTro4), gorilla (gorGor3), orangutan (ponAbe2), gibbon (nomLeu1), macaque (rheMac3), and baboon (papHam1). We used the UCSC genome browser’s liftOver program to align each ExAC SNP along with its 60bp sequence context to the six aforementioned reference genomes. We used the baboon reference genome solely for the ascertainment of all other substituted-species categories (rather than including a substituted-baboon category in the analysis).For gene annotations, we downloaded the refGene table of the RefSeq Genes track from the UCSC genome browser. For each SNP in our data, we extracted all gene isoforms in which the position was included. We kept all ExAC SNPs that fell in a coding exon, intron or untranslated region. We excluded from the analysis non-autosomal SNPs, SNPs that had multiple annotations corresponding to different transcript models, and SNPs with a sample size of less than 100,000 chromosomes. After applying the filters, we were left with 6,002,065 SNPs.For recombination rates, we downloaded the sex-averaged recombination rate map constructed by Kong et al. [], which estimates rates at a resolution of 10kbp bins. [...] To get a theoretical expectation for the fraction of rare variants under different mutational models, we used various software for computing the expected sample SFS of 33,750 diploid individuals, corresponding to the size of the non-Finnish European subsample in the ExAC dataset. For all mutational models, which we describe below, we generated predictions under various demographic models from recent literature: Gazave et al. [] (model 2 in their work), Tennessen et al. [] and Nelson et al. [].For the infinite-sites model, we computed the expected sample SFS analytically using fastNeutrino []. The infinite-sites model corresponds to an upper bound for the fraction of rare variants, but nonetheless predicted a fraction of rare variants much lower than that observed in data (75%-78%) for all non-CpG mutations under the Gazave et al. (59%) and Tennessen et al. (60%) demographies. The Nelson et al. model, which was inferred using a larger sample size of 11,000 people predicted 75% of biallelic polymorphisms would be rare under the infinite-sites model. In order to fit the highest observed fraction of rare variants for non-CpG sites in the ExAC data, we modified the parameters of the most recent epoch of exponential growth in Nelson et al. We estimated these parameters using fastNeutrino [] on all A->C intronic mutations from ExAC. The inferred parameters were: current effective population size of 4,009,877 diploids, and an exponential growth onset time of 119.47 generations in the past with a growth rate of 5.38% per generation. The more ancient demographic parameters were fixed to the same values as in the model of Nelson et al.We assume multiple mergers (non-Kingman merger events) have negligible effect on the SFS since the sample size is significantly smaller than the current effective population size. A similar demographic model [] with a four-fold smaller current effective population size exhibited a relative difference of only about 1.3% and 0.3% in the proportion of singletons and doubletons respectively for a comparable sample size of 50,000 people. Hence, we felt confident in using the Kingman coalescent for drawing genealogies.For the finite sites model, we first simulated independent coalescent trees using ms [] and then generated 1kb non-recombining sequences for each coalescent tree using the desired recurrent mutation rate with the 4 x 4 Jukes-Cantor model of mutation []. We used the program Seq-Gen [] to drop recurrent mutations on coalescent trees drawn from ms. We used mutation rates in a uniform logarithmically spaced grid of 40 points ranging from 10−9 to 5.3x10−5 mutations per basepair per generation per haploid. For each value of the mutation rate, we simulated enough sequence data so that at least 100,000 biallelic polymorphic sites were available to reliably estimate the expected fraction of rare variants. If we indicate whether a variant is rare by Y, then for each mutation rate μ, the expected fraction of rare variants is E[Y|S=1;μ], where S is an indicator variable indicating whether a site is polymorphic and, specifically, biallelic. Finally, we considered a model with additional, within-mutation-type heterogeneity in mutation rate. Specifically, we considered a model in which sites of a particular mutation type (e.g., C->A sites) have a mean mutation rate μ as before, but the mutation rate itself, M, is no longer fixed (and equal to μ), but rather a random variable with mean μ. Let f(M|S = 1;μ) be the probability density function of M in a site with mean mutation rate μ conditional on it being biallelic. Then, by the law of total expectation we have: E[Y|S=1;μ]=∫E[Y|M,S=1]f(M|S=1;μ)dM. By Bayes’ rule, f(M|S = 1;μ) is determined by both the within-mutation-type distribution of mutation rates, g(M;μ), and the probability of a site with mutation rate M being biallelic, as follows: f(M|S=1;μ)=P(S=1|M)g(M;μ)P(S=1;μ). Therefore, E[Y|S=1;μ] =∫E[Y|S=1,M]P(S=1|M)g(M;μ)∫P(S=1|M′)g(M′;μ)dM′dM. For a large range of M, we have already estimated E[Y|M] as described above. From the same simulations we have estimated the probability of a site with mutation rate M being a biallelic polymorphism, P(S = 1|M). Lastly, the distribution of mutation rates due to within-mutation-type variance was modeled using a lognormal distribution: log10M;μ ~ N(log10μ−σ22ln(10),σ2). The mean parameter in the lognormal distribution above ensures that E[M] = μ. σ was arbitrarily chosen to be 0.57 (red line in ). Notably, Hodgkinson et al. also fit a lognormal distribution of mutation rates to their dataset of co-occurrence of SNPs in chimpanzees and humans, and estimate a similar value of σ = 0.83 for non-CpG mutations [] and σ = 0.8 CpG transitions (personal correspondence). […]

Pipeline specifications

Software tools UCSC LiftOver, Seq-Gen
Databases RefGene UCSC Genome Browser
Applications Phylogenetics, WES analysis, Genome data visualization
Organisms Homo sapiens
Chemicals Nucleotides