Computational protocol: Primers for fourteen protein-coding genes and the deep phylogeny of the true yeasts

Protocol publication

[…] To identify promising gene fragments for phylogenetic analysis, we first downloaded published genomes for four ascomycete species (Saccharomyces cerevisiae, Candida albicans, Neurospora crassa and Schizosaccharomyces pombe) and performed a series of informatics searches for reciprocal best matches to identify putatively orthologous genes. First, each gene in S. pombe was aligned with every gene in C. albicans, using the Fasta3 local alignment algorithm with default parameters (Pearson, ), and the best match was recorded (i.e. the gene giving the highest Z score). This C. albicans gene was then aligned to every gene in the S. cerevisiae genome, and again, the best match was found. This S. cerevisiae sequence was then aligned to every gene in the S. pombe genome, and if the best match was the same as the starting S. pombe gene, then we classified the three genes as orthologs. For those genes passing this ‘triple boomerang test’, we also tested whether the S. cerevisiae gene had a reciprocal best match with an N. crassa gene. Together, these analyses allowed us to identify 2332 quartets of putatively orthologous proteins. To reduce the likelihood of including paralogs, we also recorded the second highest Z score for each of these searches and then excluded any gene for which this score was greater than 40% of the highest score or was itself higher than 156. This procedure gave 500 quartets for which the probability of including paralogs was relatively low.

Multiple alignment of the putative orthologs was performed with ClustalW, using default parameters (Thompson et al., ). Each alignment was then searched for candidate PCR primer binding sites, as follows. First, we recorded the maximum number of bases identical in all four organisms in a sliding window 20 nucleotides long. Then, we recorded the maximum separation in nucleotides of two potential primer sites in both of which at least n/20 bases were identical in all four organisms, for 12 ≤ n ≤ 18.
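As an illustration only, the reciprocal best-match screen, the second-best-score filter and the conserved-window count described above could be sketched in Python as follows. This is not the authors' code: the dictionary layout for the Z scores, the function names and the treatment of alignment gaps as non-identical columns are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout: scores[query_gene] = {subject_gene: z_score}
Scores = Dict[str, Dict[str, float]]


def best_hit(hits: Dict[str, float]) -> Optional[str]:
    """Return the subject gene with the highest Z score, or None if no hits."""
    return max(hits, key=hits.get) if hits else None


def triple_boomerang(sp_gene: str, sp_vs_ca: Scores, ca_vs_sc: Scores,
                     sc_vs_sp: Scores) -> Optional[Tuple[str, str, str]]:
    """S. pombe -> C. albicans -> S. cerevisiae -> back to S. pombe.

    Returns the three gene names if the chain of best matches ends on the
    starting S. pombe gene, otherwise None.
    """
    ca_gene = best_hit(sp_vs_ca.get(sp_gene, {}))
    if ca_gene is None:
        return None
    sc_gene = best_hit(ca_vs_sc.get(ca_gene, {}))
    if sc_gene is None:
        return None
    back = best_hit(sc_vs_sp.get(sc_gene, {}))
    return (sp_gene, ca_gene, sc_gene) if back == sp_gene else None


def passes_paralog_filter(hits: Dict[str, float], max_ratio: float = 0.40,
                          max_second: float = 156.0) -> bool:
    """Keep a search only if the second-best Z score is at most 40% of the
    best score and is itself no higher than 156."""
    ranked = sorted(hits.values(), reverse=True)
    if len(ranked) < 2:
        return True  # assumption: no competing hit means nothing to exclude
    best, second = ranked[0], ranked[1]
    return second <= max_ratio * best and second <= max_second


def max_conserved_in_window(aligned: List[str], window: int = 20) -> int:
    """Maximum number of columns identical in all sequences within any
    window of the given width (gap columns count as non-identical)."""
    first = aligned[0]
    identical = [
        1 if first[i] != "-" and all(s[i] == first[i] for s in aligned) else 0
        for i in range(len(first))
    ]
    if len(identical) <= window:
        return sum(identical)
    current = best = sum(identical[:window])
    for i in range(window, len(identical)):
        current += identical[i] - identical[i - window]  # slide by one column
        best = max(best, current)
    return best
```

In this sketch a quartet would be retained only if every search in the chain passes `passes_paralog_filter`; the N. crassa reciprocal check could reuse `best_hit` in both directions.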
The resulting lists of candidate gene fragments were then sifted manually to find those of an appropriate length (350–850 bp). The 28 most promising coding regions were then identified as being from the following genes: GLN1, DRS2, CDC60, CDC19, FAS2, RPA135, SAH1, ATP2, PGK1, GLT1 (2 regions), GCD11, FBA1, PGI1, CRM1, RPO21, PDA1, ADE6, ATP1, VMA2, POL2, MET6, FAS1, ARG5,6, ECM17, HTS1, NHP6A and GCV2 (gene names as used in S. cerevisiae; http://www.yeastgenome.org/). Degenerate PCR primers were designed for each of the 28 coding regions and tested on a set of 25 yeasts, chosen to represent a broad spectrum of yeast diversity (i.e. deep nodes in the 26S rRNA gene study of Kurtzman & Robnett, ; Table ). Primers had between two and four degenerate sites to allow for differences between the four sequences, usually including no more than two inosine bases.

Yeast strains were obtained from the CBS culture collection (Table ). The species identity of all strains was confirmed by sequencing the D1/D2 region of the LSU rDNA (Kurtzman & Robnett, ). Yeast strains were grown on YPD plates (1% yeast extract, 2% peptone, 2% glucose) and DNA was extracted following the protocol of Sherman (). Gene fragments were PCR-amplified in 50-μL reaction mixtures (Sigma-Aldrich REDTaq ReadyMix and 5 μL of a 10× dilution of template DNA), using 35 cycles of 95/50–60/72 °C for 30/60/120 s, respectively. All primer pairs were tested first with an annealing temperature of 57 °C and then with a higher temperature (60 °C) if multiple bands were obtained at 57 °C, or with a lower temperature (50 or 55 °C) if no band was obtained at 57 °C. PCR products were visualised on agarose gels. From these initial PCRs, we chose 14 gene fragments for sequencing and phylogenetic analysis, based upon there being a single amplicon of approximately equal size from a substantial number of the test species (Table ).
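The annealing-temperature rule and the cycling programme described above can be written down as a small decision helper. This is a sketch under stated assumptions, not part of the published protocol: the function names are hypothetical, the order in which the two lower temperatures are tried (55 °C before 50 °C) is our choice, and the time estimate covers only the three listed steps, ignoring ramping and any initial or final holds.

```python
from typing import List

FIRST_ANNEALING_C = 57.0          # every primer pair is tested here first
CYCLES = 35
STEP_SECONDS = (30, 60, 120)      # denaturation / annealing / extension


def follow_up_annealing_temps(bands_at_57: int) -> List[float]:
    """Suggest annealing temperatures (degrees C) to try after the 57 C test.

    bands_at_57 is the number of bands seen on the agarose gel.
    """
    if bands_at_57 < 0:
        raise ValueError("band count cannot be negative")
    if bands_at_57 == 1:
        return []               # single amplicon: no re-test needed
    if bands_at_57 > 1:
        return [60.0]           # multiple bands: raise the temperature
    return [55.0, 50.0]         # no band: lower it (order is an assumption)


def cycling_seconds(cycles: int = CYCLES, steps=STEP_SECONDS) -> int:
    """Time spent in the listed steps only (no ramping or hold steps)."""
    return cycles * sum(steps)
```

With the values given in the text, `cycling_seconds()` is 35 × 210 s = 7350 s, or about 2 h of cycling per run.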
Gene fragments were still accepted if some species had a single larger band, because the extra length was likely to represent an intron. Primer sequences and alternate annealing temperatures for the 14 gene fragments are given in Table .

For sequencing, bands were cut out from the gel to remove the PCR primers, and the DNA was purified using Qiagen QIAquick columns and then sequenced using an AB 3700 capillary sequencer, following the manufacturer's instructions. DNA sequences were translated to amino acids to facilitate alignment and aligned using Clustal (Thompson et al., ). For phylogenetic analyses, we also included data from the orthologous regions of 13 yeast species with published genome sequences (Candida albicans, C. glabrata, C. tropicalis, Clavispora lusitaniae, Debaryomyces hansenii, Eremothecium (Ashbya) gossypii, Kluyveromyces lactis, Lachancea (Kluyveromyces) waltii, Meyerozyma (Pichia) guilliermondii, Naumovozyma (Saccharomyces) castellii, Saccharomyces cerevisiae, Lachancea (Saccharomyces) kluyveri and Yarrowia lipolytica) and five other ascomycete fungi as outgroups (Schizosaccharomyces pombe, Aspergillus fumigatus, A. nidulans, Neurospora crassa and Gibberella zeae). Alignments were manually adjusted using MacClade (version 4.08; Maddison & Maddison, ), with regions of uncertain alignment excluded from subsequent analyses.

Phylogenetic analyses were performed on amino acid sequences rather than DNA sequences because yeasts show codon usage bias (Lloyd & Sharp, ), and differences between species in the strength or direction of this bias would make changes at different silent sites nonindependent. Phylogenetic analyses were conducted using PAUP* (version 4.0a122; Swofford, ), with alignment gaps (including missing genes) treated as missing data. First, a maximum parsimony search was performed (100 random additions, TBR branch swapping).
The most parsimonious tree was then used in a likelihood analysis, which determined that the LG amino acid substitution rate matrix (Le & Gascuel, ) performed best, with amongst-site heterogeneity in substitution rates characterised by a discretised gamma distribution with a shape parameter equal to 0.75 (based on four categories) and a proportion of invariant sites equal to 0.31. A heuristic search was then performed using the LG matrix, empirical amino acid frequencies and these two parameter values, with 200 random additions, without branch swapping. The best 10 of the resulting trees were then used as starting points for TBR branch swapping, with a reconnection limit of 8. Bootstrap support values were derived from 100 replicates, each involving 20 random additions, with NNI swapping. For parsimony analysis, the bootstrap supports are based on 100 replicates, each involving 100 random additions with TBR branch swapping.

To examine polymorphism of the 14 gene fragments amongst S. paradoxus strains, we used data from the Saccharomyces Genome Resequencing Project (Liti et al., ). Low-coverage genome sequences are available for 27 S. paradoxus strains (http://www.sanger.ac.uk/research/projects/genomeinformatics/sgrp.html). We extracted the part of the genomic alignment corresponding to each of our 14 gene fragments for analysis. As most of the genomes were sequenced with relatively low coverage, data were not available for all strains for all gene fragments. Bases identified in the database as having accuracy less than Q40 (i.e. expected error rate >10⁻⁴) were treated as missing data. Data were analysed at the DNA level as most of the variation is not expected to change the amino acid sequence. […]
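The Q40 cut-off follows from the standard Phred relationship, in which a quality score Q corresponds to an expected error probability of 10^(−Q/10), so Q40 is an error rate of 10⁻⁴. A minimal sketch of the masking step is given below. It assumes that per-base qualities are available as a list of integers alongside each sequence and that missing data are coded as `N`; the function names are hypothetical and the resequencing project's own file format is not reproduced here.

```python
from typing import Sequence


def phred_to_error(q: float) -> float:
    """Expected per-base error probability for a Phred quality score."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: Sequence[int], min_q: int = 40,
                     missing: str = "N") -> str:
    """Replace every base whose quality is below min_q with the
    missing-data symbol, leaving bases at or above the threshold as-is."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    return "".join(
        base if q >= min_q else missing for base, q in zip(seq, quals)
    )
```

For example, `mask_low_quality("ACGT", [40, 39, 50, 10])` keeps the first and third bases and masks the other two.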

Pipeline specifications

Software tools: FASTA, Clustal W, MacClade
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Saccharomyces cerevisiae, Saccharomyces paradoxus