Computational protocol: Complete Plastome Sequences from Glycine syndetika and Six Additional Perennial Wild Relatives of Soybean

Protocol publication

[…] Five plants of a single accession each from seven perennial Glycine species, representing five of the nine genome groups, were grown in the greenhouse at Cornell. Plants were placed in the dark for 48 hr before leaf tissue was collected. Tissue was frozen and shipped to the Arizona Genomics Institute (AGI), where bacterial artificial chromosome (BAC) libraries were prepared from HindIII digests of genomic DNA. BAC ends were sequenced for all BACs. Two BACs per species were selected from the BAC libraries based on BLAST matches of perennial BAC-end sequence data to the G. max chloroplast sequence (Supporting Information, Table S1). BAC DNA was sheared to 3 kb to 4 kb using Adaptive Focused Acoustics technology (Covaris, Woburn, MA), cloned into the plasmid vector pIK96, sequenced with Sanger technology to an average depth of 9×, and then assembled (using Phrap version 0.990319) and finished as previously described. Additional BACs were selected as needed to close the chloroplast circles based on BLAST to the finished sequences.
DOGMA was used with default parameters for preliminary annotation of each sequence. The complete chloroplast sequences were arranged so that each sequence started with trnH. The SSC regions were arranged so that the complete ycf1 gene followed IRb, as in the G. max orientation. Sequences were then aligned using default parameters in Mulan, with minor manual adjustments to the alignment. Glycine max gene features were downloaded from NCBI (DQ317523) and aligned against the eight-sequence alignment using Sequencher (Gene Codes). BioEdit was used for manual alignment and confirmation of annotations. Analysis of the total chloroplast alignment for GC content and codon bias was performed using DnaSP.
Previously published primers were used to amplify trnL-trnF from 11 G. falcata accessions. The resulting sequences were aligned with MUSCLE and edited in BioEdit.
This alignment was used to determine pairwise nucleotide diversity (π) between individuals. Levels of nucleotide polymorphism were calculated using DnaSP.
Direct and palindromic repeated sequences in each chloroplast sequence were identified using REPuter. The number of repeats identified was limited by searching only for repeats at least 30 bp long with a sequence identity of 90% or better (a Hamming distance of 3, as used previously). To further investigate dispersed repeat sequences, we extracted the sequence of each repeat whose copies were separated by more than 1000 bp. Dispersed repeat sequences were aligned using SeqMan (version 2.2.0.56; DNASTAR, Madison, WI). In many cases, repeats listed separately in the REPuter output came from the same location and had the same sequence, differing only in length; for example, a 36-bp repeat could be listed at the same position as a 30-bp repeat. Similarly, repeats were listed as pairs, so if a repeat was found in a third or fourth location, the original location was listed again with the third location and again with the fourth; hence, the overall number of repeats was inflated. Repeat analysis was also performed using RepeatScout and RepeatMasker with default parameters in the MAKER 2.10 genome annotation software.
A Bayesian approach was used to generate a phylogenetic tree for the chloroplast genome sequences. The tree was generated with BEAST under the prior assumptions that the perennial Glycine species form a monophyletic group relative to the annual Glycine species (G. max) and that Phaseolus vulgaris is an outgroup to the genus Glycine. The node date of 19.2 million years ago (mya) for P. vulgaris and Glycine in the plastome tree was used for calibration; it is based on the mean age estimated from matK sequence divergence in a comprehensive legume phylogeny. The substitution model was general time reversible (GTR).
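
The REPuter post-processing described earlier in this section (length and Hamming-distance thresholds, nested hits at one position, and pairs re-listed for a third or fourth location) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the tuple layout `(length, pos1, pos2, hamming)` and the function name `collapse_repeats` are hypothetical, nested hits are assumed to share exact start coordinates, and the 1000-bp spacing filter for dispersed repeats is left out.

```python
from collections import defaultdict


def collapse_repeats(pairs, min_len=30, max_hamming=3):
    """Collapse inflated repeat-pair listings into repeat families.

    pairs: iterable of (length, pos1, pos2, hamming) tuples.
    This layout is a hypothetical stand-in for parsed REPuter output.
    Returns a sorted list of (sorted positions, longest length) tuples.
    """
    # Step 1: apply the length and Hamming-distance thresholds.
    kept = [p for p in pairs if p[0] >= min_len and p[3] <= max_hamming]

    # Step 2: for hits at the same pair of positions, keep only the longest
    # (e.g. a 36-bp and a 30-bp hit listed for the same location).
    longest = {}
    for length, pos1, pos2, _ in kept:
        key = (min(pos1, pos2), max(pos1, pos2))
        longest[key] = max(length, longest.get(key, 0))

    # Step 3: merge pairs that share a location into one family, so a repeat
    # present at three or four locations is counted once (union-find).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in longest:
        parent[find(a)] = find(b)

    families = defaultdict(lambda: {"positions": set(), "length": 0})
    for (a, b), length in longest.items():
        family = families[find(a)]
        family["positions"].update((a, b))
        family["length"] = max(family["length"], length)

    return sorted((sorted(f["positions"]), f["length"]) for f in families.values())


# Made-up coordinates for illustration only.
example_pairs = [
    (36, 100, 5000, 0),   # repeat between locations 100 and 5000
    (30, 100, 5000, 0),   # same location, shorter listing
    (36, 100, 9000, 1),   # original location re-listed with a third copy
    (36, 100, 12000, 2),  # and again with a fourth copy
    (25, 40, 700, 0),     # below the length threshold
    (31, 2000, 3000, 5),  # too many mismatches
]
repeat_families = collapse_repeats(example_pairs)
```

With the made-up input above, the four qualifying listings collapse into a single family spanning four locations, which is the kind of inflation the text describes.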
The analysis was run for an MCMC length of 10,000,000 iterations, sampling every 1000 iterations. The molecular clock was an uncorrelated relaxed clock with a lognormal distribution model. FigTree was used to plot the tree (http://tree.bio.ed.ac.uk/software/figtree/).
Percent identity across the chloroplast genomes, excluding the second IR, was visualized using the VISTA tools server. The LAGAN shuffle option was used to facilitate inclusion of chloroplast sequences from GenBank for species with inversions not present in Glycine: Phaseolus vulgaris var. Negro Jamapa (DQ886273), Millettia pinnata (JN673818), and Vigna radiata (GQ893027). Pairwise differences in single nucleotides were recorded for each species in an alignment of Glycine species and Phaseolus vulgaris plastomes. The nucleotide substitution rate was calculated by dividing the pairwise differences by the length of the alignment and by the divergence dates calculated with histone H3D alignments. […]
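
The substitution-rate arithmetic in the last step can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, the toy sequences and divergence time are invented, and skipping columns that contain a gap or an ambiguous base is an assumption the text does not state.

```python
def count_pairwise_differences(seq_a, seq_b):
    """Count single-nucleotide differences between two aligned sequences.

    Assumption: columns with a gap ('-') or ambiguous base ('N') in either
    sequence are not counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    skipped = {"-", "N"}
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a in skipped or base_b in skipped:
            continue
        if base_a != base_b:
            differences += 1
    return differences


def substitution_rate(differences, alignment_length, divergence_time):
    """Divide pairwise differences by alignment length and divergence time.

    The result is in substitutions per site per unit of divergence_time
    (for example, per million years if the date is given in mya).
    """
    if alignment_length <= 0 or divergence_time <= 0:
        raise ValueError("alignment length and divergence time must be positive")
    return differences / alignment_length / divergence_time


# Toy data for illustration only; not taken from the study.
toy_a = "ACGTACGTAC"
toy_b = "ACGTTCGTAA"
toy_differences = count_pairwise_differences(toy_a, toy_b)
toy_rate = substitution_rate(toy_differences, len(toy_a), 5.0)
```

The denominator here is the full alignment length, following the wording of the text; a variant that divides only by the compared (ungapped) columns would give slightly higher rates.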

Pipeline specifications

Software tools: DOGMA, Mulan, Sequencher, BioEdit, DnaSP, MUSCLE, REPuter, RepeatScout, BEAST, FigTree, LAGAN
Applications: Genome annotation, Phylogenetics, Nucleotide sequence alignment
Organisms: Glycine max, Glycine syndetika