Computational protocol: Whole-Genome Sequencing of Sake Yeast Saccharomyces cerevisiae Kyokai no. 7

Protocol publication

[…] An overall comparison of the K7 chromosomes with S288C (NC_001133–NC_001148), EC1118 (FN393058–FN393060, FN393062–FN393087, FN394216 and FN394217) and YJM789 (AAFW2000000) strain chromosomes was performed using MUMmer 3.0 software. Similarity-based searches of individual genes were performed using BLAST and BLAST2. Phylogenetic analyses were carried out using CLUSTALW 1.83. […] For predicting protein-encoding genes, ORFs larger than 90 bp were comprehensively included as candidates. ORF prediction was then carried out based on a direct comparison of S288C ORFs with the K7 genome supercontigs. When direct comparison was difficult, ORFs were predicted using the software programs CRITICA, Glimmer2, GlimmerHMM and SIM4. Finally, all K7 ORFs were manually validated by expert annotators. When one or more incomplete ORFs, such as those truncated by a sequence gap and lacking a start or a stop codon, were mapped to a single S288C ORF, each incomplete K7 ORF was annotated as a single ORF. Functional annotation was based primarily on the Saccharomyces Genome Database (SGD; http://www.yeastgenome.org/), secondarily on the Saccharomyces species database (yeast comparative genomics: http://www.broadinstitute.org/annotation/fungi/comp_yeasts/) and also on COG/KOG (http://www.ncbi.nlm.nih.gov/COG/) and the DDBJ/EMBL/GenBank non-redundant databases. Orthology with the S288C ORF was evaluated using BLASTP similarity, calculated as the percentage of matched amino acid residues over the total covered region between a K7 ORF and the best-hit S288C ORF (Supplementary Table S4). For K7 ORFs truncated by a sequence gap, similarity was calculated from the number of matching residues in only the corresponding regions of the S288C ORF. Dubious ORFs, ORFs in Ty elements and ORFs in telomeric regions were excluded as possible protein-coding genes and were not annotated.
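The similarity measure described above can be illustrated with a minimal Python sketch. It assumes that the BLASTP result has already been reduced to non-overlapping aligned blocks on the S288C protein; the `AlignedBlock` class and `percent_similarity` function are hypothetical names introduced for illustration and are not part of BLAST or of the published pipeline. Whether "matched residues" means identical or positively scoring residues is not stated in the excerpt, so the sketch simply takes a match count as input.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AlignedBlock:
    """Hypothetical container for one aligned block (not a BLAST API type)."""
    subject_start: int  # 1-based inclusive start on the S288C protein
    subject_end: int    # 1-based inclusive end on the S288C protein
    matches: int        # matched residues inside the block


def percent_similarity(blocks: Iterable[AlignedBlock]) -> float:
    """Percent of matched residues over the covered S288C region only.

    Assumes the blocks do not overlap. Regions of the S288C ORF that no
    block covers (for example, the part lost to a sequence gap in K7)
    do not enter the denominator.
    """
    blocks = list(blocks)
    covered = sum(b.subject_end - b.subject_start + 1 for b in blocks)
    matched = sum(b.matches for b in blocks)
    if covered == 0:
        return 0.0
    return 100.0 * matched / covered


# A complete ORF aligned over 200 residues with 190 matches.
full = percent_similarity([AlignedBlock(1, 200, 190)])

# A gap-truncated ORF covering only residues 1-80 of the same S288C protein.
truncated = percent_similarity([AlignedBlock(1, 80, 76)])

print(full, truncated)
```

Because only the covered region is counted, a truncated ORF is not penalized for the residues missing behind the sequence gap, which appears to be the intent of the rule for incomplete ORFs.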
Prediction and annotation of RNA genes, Ty elements including solo long terminal repeats (LTRs) and telomeric elements were manually performed based on the results of BLASTN searches of the K7 genome with the S288C sequences of these genes and elements as queries. All annotated ORFs and genetic elements were given individual numbers (Supplementary Table S4). Nomenclature of the K7 genes was based on the following rules: (i) each protein-encoding or RNA gene was named according to the orthologous S288C gene using the format ‘K7_’ plus the S288C standard gene name (with >80% similarity) and the systematic name (with >50% similarity) given in SGD; (ii) K7 identification numbers or K7 original gene names, such as AWA1, were given to genes that were non-orthologous or of low similarity to S288C genes (with ≤50% similarity); (iii) each name of a gene truncated by a sequence gap or segmented by point mutations was followed by a lowercase ‘a’, ‘b’ or ‘c’, such as ‘XXX1a’ and ‘XXX1b’, to show its correspondence to a partial region of the ortholog; and (iv) Ty elements and LTRs were independently termed according to the identical nomenclature used for S288C. […]
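Naming rules (i) to (iii) can be sketched as a small Python function. This is one possible reading of the rules, not the authors' implementation: it assumes that the standard gene name is used above 80% similarity, that the systematic name is used between 50% and 80% (or when no standard name exists), and that the fragment suffix is appended to whichever name results. The function name `k7_gene_name`, its parameters and the placeholder identifier `K7-ID-PLACEHOLDER` are hypothetical.

```python
import string
from typing import Optional


def k7_gene_name(
    similarity: float,
    standard_name: Optional[str],
    systematic_name: Optional[str],
    k7_id: str,
    fragment_index: Optional[int] = None,
) -> str:
    """Derive a K7 gene name from the similarity to the best-hit S288C ORF.

    similarity      percent similarity to the S288C ortholog (0-100)
    standard_name   S288C standard gene name from SGD, if any
    systematic_name S288C systematic name from SGD, if any
    k7_id           K7 identification number or original name (rule ii)
    fragment_index  0, 1, 2, ... for genes split by a gap or point
                    mutations (rule iii); None for intact genes
    """
    if similarity > 80 and standard_name:
        name = f"K7_{standard_name}"        # rule (i), standard name
    elif similarity > 50 and systematic_name:
        name = f"K7_{systematic_name}"      # rule (i), systematic name
    else:
        name = k7_id                        # rule (ii), low or no similarity
    if fragment_index is not None:
        # rule (iii): lowercase suffix marking the partial region
        name += string.ascii_lowercase[fragment_index]
    return name


print(k7_gene_name(95.0, "ADH1", "YOL086C", "K7-ID-PLACEHOLDER"))
print(k7_gene_name(60.0, "ADH1", "YOL086C", "K7-ID-PLACEHOLDER"))
print(k7_gene_name(40.0, None, None, "K7-ID-PLACEHOLDER"))
print(k7_gene_name(95.0, "ADH1", "YOL086C", "K7-ID-PLACEHOLDER", fragment_index=1))
```

Rule (iv), covering Ty elements and LTRs, is left out because those elements keep the S288C nomenclature unchanged.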

Pipeline specifications

Software tools: MUMmer, Clustal W, CRITICA, Glimmer, GlimmerHMM, Sim4, BLASTP, BLASTN
Databases: DDBJ, SGD
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Saccharomyces cerevisiae
Diseases: Immunologic Deficiency Syndromes