Computational protocol: Systems Genomics of Metabolic Phenotypes in Wild-Type Drosophila melanogaster

Similar protocols

Protocol publication

[…] Experimental analysis was performed on 20 lines representing the diversity of dietary reaction norms for pupal weight and larval lipid storage phenotypes identified from an initial screen of 146 inbred lines sampled from North Carolina and Maine populations (). Lines were raised on four dietary treatments used in the analysis following a rationale described in detail in . All diets were cornmeal based but varied in their sugar and fat content. The base for all diets is by weight 0.7% agar, 6.5% cornmeal, and 1.3% inactive yeast into water. In the normal diet (maintenance diet for fly stocks in many labs in addition to our own) the major source of calories is 4% (0.117 M) sucrose and is made from the standard cornmeal-based lab food with the addition of 58 ml of molasses. reports that the maximum rate of development in Drosophila larvae is achieved at a sugar concentration of 0.75% by weight, while a 4% sugar diet produced a decreased developmental rate. We found that the type of sugar (e.g., sucrose vs. glucose), in addition to the sugar concentration, affects the metabolic health of the larvae (); maximum weight was achieved at a concentration of <1% of glucose or sucrose, and survival decreased dramatically at higher (>8%) glucose. To both reflect the past research on dietary variation in flies and specifically target the insulin pathway through glucose metabolism we added glucose to the standard base at a concentration of 0.75% glucose by weight (0.042 M) to make our low-sugar diet and a 4% glucose by weight (0.222 M) diet to make the high-sugar diet. Note that the total calories are approximately the same between the normal and high-sugar diets but differ in which sugar is providing those calories. The fat content of the diet base is <0.2% fat, so we supplemented the low-sugar diet with 3% coconut oil by weight to make the high-fat diet; coconut oil is nearly 100% fat (85% saturated, 15% mono- and polyunsaturated). The specific names of the diets are not intended to have an absolute significance in comparison to other studies and signify only the relative sugar and fat concentrations used within this study.Fifty first-instar larvae were seeded into each food vial and six gross phenotypes were measured for each line on each of the four diets. Samples of third-instar larvae were pooled across a minimum of three food vials for each treatment to be measured for their triglyceride and trehalose content and preserved for expression and metabolomics analysis. Homogenates of six randomly selected larvae were characterized using a 96-well spectrophotometer to determine total triglyceride content using the Sigma triglyceride determination kit and to determine trehalose content (the primary circulating sugar) by treating with trehalase to produce glucose was determined by Sigma glucose determination kit (; ; ; ). Up to 15 male pupae from each of two food vials per treatment were weighed individually to determine the weight phenotype. Larval and pupal survival and the time to pupation (development time) were also scored. Samples were generated in randomized blocks of four synchronized lines per week on all four diet treatments; each genetic line was replicated at three time points.Whole-genome expression profiles for high-quality RNA samples were determined using Nimblegen 12-plex microarrays using the manufacturers protocols and software (). The gene expression data from this publication are deposited in the GEO database under accession no. GSE50745 (http://www.ncbi.nlm.nih.gov/geo/). Metabolomic profiling was performed by gas chromatography–mass spectrometry (GC–MS) on samples of exactly six larvae. Samples were homogenized in 60:40 methanol:water, dried down, and then TMS derivitized in acetonitrile. Samples were processed in daily randomized blocks of 15–22, along with pooled standards. Samples were run in daily sets in a randomized order into a Thermo Scientific DSQ II Series single quadrupole GC-MS with an electron impact source and an Agilent DB-5 column run in splitless mode with a 30-min temperature ramp. Over the course of the experiment, a minimum of five distinct biological samples were analyzed for each genotype and diet. The Kovats retention index (RI) was calculated for each chromatographic peak, and chromatograms were then aligned to a consensus list of internally determined candidate target profiles. Chromatographic peaks were initially cataloged using AnalyzerPro (http://www.spectralworks.com/analyzerpro.html) and were then hand curated to develop a list of potential analytes. Relative concentrations for each chromatographic peak were determined from the standard curve produced by a pooled standard. After quality control filtering, 187 putative metabolites were proposed as being reliably detectable. Chemical category of those 187 putative metabolites was determined by the searching profiles against the publically available National Institute of Standards and Technology (NIST) database, identifying candidate compounds, and then running candidate compounds on the GC–MS system to confirm profile matches. Using this approach, we were able to identify with confidence the exact molecule for 60 of the putative metabolites; another 124 were matched with confidence to chemical class (e.g., amino acid or monosaccharide), while the remaining 3 are presently unknown.All statistical analyses were performed using JMP Genomics (SAS Institute, Cary NC). Individual metabolites were first normalized to the mean value for the metabolite in one pooled standard for the day the sample was run. They were then log2 transformed and the median centered standardized residuals of a day-of-run ANOVA model on the individual metabolites composed the final data set. Of a potential 15595 genes, 11650 were expressed at detectable levels in this data set. The log2 transformed expressed gene values were median centered, and then the standardized residuals of the hybridization block ANOVA model were themselves median centered and used for all subsequent gene expression analysis. In this analysis, means for each line and diet combination (n = 80) were determined and those values used for all subsequent analyses. Principal components analysis on the correlation matrix among metabolites and expression profiles was performed using the “correlation and principal components” function in JMP Genomics and the first 10 principal components were estimated. Our sensitivity to detection of rare metabolites was lower than to detection of rare expressed gene products but we have no a priori reason to anticipate the variance structure in the rare metabolites to be fundamentally different from the detectable metabolites. In addition, the numerous metabolite and gene expression variables greatly outnumber the principal components calculated; thus the principal components should be robust estimates of the variance in both data sets. Very similar results were obtained after normalization using the supervised normalization of microarrays algorithm (http://www.bioconductor.org/packages/2.12/bioc/html/snm.html). The mean values for all phenotypes, including gene expression and metabolites, calculated for each genotype-by-diet combination are available at the authors’ website as data sets S1–S3 (http://www.gibsongroup.biology.gatech.edu/supplemental-data-reed-et-al). [...] To determine the allele frequencies of the SNPs that are polymorphic in our Georgia Tech population, we mapped paired end reads from whole-genome sequencing to the D. melanogaster reference genome (build 5.33) using Bowtie2, and subsequently Samtools mpileup was used to generate the pileup files. A minimum mapping quality score of 30 (Phred scale) and base quality score of 15 were set as lower thresholds for variant calling. Calls were subsequently performed using the default parameters in Varscan downloaded from http://varscan.sourceforge.net/. This is BioProject (http://www.ncbi.nlm.nih.gov/bioproject/) accession no. PRJNA194129 and the raw resequencing data are available under accession no. SRA143721 at the Sequencing Read Archive (http://www.ncbi.nlm.nih.gov/sra) and the variant calls used in this analysis are available as data set S4 at the authors’ website (http://www.gibsongroup.biology.gatech.edu/supplemental-data-reed-et-al).Single nucleotide variant calls were then imported into JMP Genomics and further filtered for quality control. Only biallelic loci were considered in subsequent analyses, and these represented >99% of all SNV calls. Read depth ranged from 150 to 463 (with the exception of two outliers corresponding to the chorion complexes) and had a standard deviation of 47. Given our high-mapping-quality threshold (MAPQ ≥ 30), it is unlikely that many SNV positions in the genome would have high depth because of being in repetitive regions. We also excluded all variants with minor allele frequency (MAF) < 0.05 in baseline; this was done to remove likely sequencing errors, but likely also removed many low-MAF variants since our focus is on selection acting on common variants in the natural population. Since SNVs with both low estimated allele frequency (<0.1) and low read depth in the baseline tended to change the most in the evolving populations, low-MAF SNPs with low coverage seem to have poor allele frequency estimates, as also seen in contrasting the Georgia Tech and Raleigh Drosophila Genetic Reference Panel (DGRP) estimates. This is theoretically supported by the notion that the variance of the estimate of allele frequency of a given SNP is linearly dependent on the number of the alleles sampled. Consequently, positions with depth <150 in the baseline population were excluded in the final analyses reported in the manuscript.Since one replicate of each of the evolved populations was sequenced using a different technology, we confirmed that there was no platform effect by performing a paired two-sample t-test of all frequencies between replicates of the evolved populations. Each SNP is represented by six measures, namely the one replicate of each diet measured on the GAiix platform and the second replicate on the HiSeq2000 platform. The values for each diet were paired and a t-test for the difference between platforms was assessed. The results shown in the Q–Q plot in Supporting Information, Figure S1 confirm that there is no enrichment of variants that tend to show a higher or lower frequency between the GAiix and HiSeq platforms.After quality-control steps, we identified the SNPs that likely evolved in frequency under laboratory adaptation, by finding alleles in the evolved populations that differed more than expected under the assumption of drift compared to the baseline population. We did so by deriving the effective population size from the expectation that the variance in each allele frequency is proportional to the baseline frequency and the number of generations at a given effective population sizeVt≈pq(1−e−t/2Ne),(1)which upon rearrangement, recognizing that it is measured from the sum of squared deviations between each observed yi after evolution and its expected frequency y^,Vt≈1n−1∑i=1n(pi−p^i)(2)yieldsNe=−t2×ln(1−[1/n])×∑i=1n[(pi−p^i)2/piqi],(3)where n is the number of SNPs (∼16,000 per bin), and t is the number of generations of laboratory adaptation (14 for replicate 1, 17 for replicate 2). The expected value under the null hypothesis of no sequence evolution, p^ is just the baseline frequency. The effective population size of each evolved population was initially estimated assuming that most SNPs in our data set were not evolving.The value (pi−p^i)2/piqi is calculated independently for every SNP because each SNP has a different starting allele frequency. The average estimated effective population size was calculated for bins of varying initial allele frequencies (Figure S2) and across all bins is ∼60 flies for all six populations. The estimates for effective population size both support the consistency in the fly cultivation techniques employed in the course of the experiments and corroborate the effective population size estimated from the experimental procedures.Since it is likely that at least some alleles are under selective pressure in our evolving populations, this procedure certainly underestimates Ne. We examined the effect of removing the alleles that changed the most in frequency on our estimate of Ne. Removing 20% of the most extreme SNPs increased our estimate of effective population size to between 80 and 100 flies per generation. Another source of variance inflation is the contribution of technical variance. Technical variance is the component of the total variance, which is introduced by the fact that each allele-frequency estimate is derived from high-throughput sequencing read depths and has a defined variance. Since we had technical replicates for the baseline population (alternate lanes of the GAiix), we estimated the technical variance introduced by random sampling of alleles. We then calculated a more accurate estimate for biological variance due to drift, which increased our estimates of Ne to consistently ∼100–120 flies. Consequently, we note that while our estimates of Ne may be >100, we use this value in our subsequent analysis as a conservative estimated parameter.The SNPs on the X chromosome suggest a higher effective population size than for those of the autosomes. This effect may partially be explained by the fact that we used a scaling factor of 3/2 instead of 2 in the estimation of Ne for the X chromosome because there are only 1.5 chromosomes per effective individual of the population. However, scaling by 2 for the X chromosome led to an underestimate of the Ne relative to the autosomes.After estimating Ne for each evolved population, we then used this value to calculate the variance expected under drift using Equation 1. The expected variance is also dependent upon the initial allele frequency. Since our analysis suggested that the accuracy of the estimates of allele frequency decreased with decreasing minor allele frequency, we set a minimum value of p × q of 0.85 × 0.15 = 0.1275, which would prevent alleles with low MAF from benefiting from both a poor estimate of p and a decreased level of expected variance due to drift. We then divided the allele frequency change by the standard deviation (the square root of the variance), which yielded a z-score that was subsequently used to calculate a P-value. To correct for multiple testing, we used a Bonferroni correction of six tests for 143,975 SNPs at P < 0.05, which yields a genome-wide significant threshold of P < 4.00 × 10−8, which we reiterate is a conservative threshold.Since the median baseline major allele frequency is 0.85, the response to lab adaptation is highly asymmetric as many sites go to fixation under drift. Our test for selection requires changes of frequency at least 0.3, so effectively only analyzes reduction in the major allele frequency. illustrates, however, that similar (but more variable, as expected) results are observed for alleles increasing in frequency. […]

Pipeline specifications

Software tools Bowtie2, SAMtools, VarScan, JMP Genomics
Application WGS analysis
Organisms Drosophila melanogaster
Diseases Metabolic Diseases, Machado-Joseph Disease