Computational protocol: Comparative Mapping of the Wild Perennial Glycine latifolia and Soybean (G. max) Reveals Extensive Chromosome Rearrangements in the Genus Glycine

Protocol publication

[…] DNA was extracted from leaf tissue of PI 559298, PI 559300, 146 F2 plants and 89 F5 plants using a DNeasy Plant Mini Kit (Qiagen, Valencia, CA). DNA samples were digested with BfaI and PstI-HF restriction enzymes (New England Biolabs, Ipswich, MA) as described by Thurber et al. For these experiments, BfaI was selected because it did not produce strong banding patterns in preliminary restriction enzyme digestions of G. latifolia DNA, and PstI was selected because G. latifolia sequence data contained an intermediate number of PstI recognition sites. For example, the previously determined G. latifolia sequence data were predicted to contain 2.1×10⁴ MluI sites, 8.9×10⁴ PstI sites and 3.1×10⁵ HindIII sites. Up to 96 samples were sequenced per lane of a HiSeq 2000 (Illumina Inc., San Diego, CA) at the W. M. Keck Center at the University of Illinois, Urbana, IL, USA to produce 100-nt single-end reads. In both experiments, DNA from each of the parental lines was independently processed twice to serve as a control for SNP identification. The barcode splitter from TASSEL was used to assign sequence reads to individual lines and remove barcode sequences, which produced 90-nt sequence reads that were analyzed for SNPs. The parsed sequence data for the F2 and F5 populations have been deposited in the NCBI Short Read Archive as part of project SRP013346. Next, three Perl scripts were used to analyze the sequence reads for the bi-parental populations. First, sequence reads for each individual/line in the F2 and F5 populations and from the parental lines, PI 559298 and PI 559300, were aligned using Bowtie to a G. latifolia pseudo-reference sequence. The pseudo-reference was generated from 180-bp and 500-bp paired-end libraries and 3-kb, 8-kb, and 15-kb mate-pair libraries prepared from G. latifolia PI 559298 DNA, sequenced on an Illumina HiSeq 2000, and assembled de novo using ALLPATHS-LG (Chang et al., manuscript in preparation).
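The restriction-site tally that motivated the choice of PstI can be illustrated with a short Python sketch. This is not the authors' code (the protocol does not say how the counts were predicted); it simply counts occurrences of each enzyme's recognition sequence in a set of sequences. The function names `count_sites` and `site_counts` and the toy input are hypothetical. The recognition sequences are the standard ones for these enzymes, and because all four are palindromic, scanning one strand is sufficient.

```python
import re

# Standard recognition sequences; all four are palindromic,
# so counting on the forward strand covers both strands.
RECOGNITION_SITES = {
    "BfaI": "CTAG",
    "PstI": "CTGCAG",
    "MluI": "ACGCGT",
    "HindIII": "AAGCTT",
}


def count_sites(sequences, site):
    """Count (possibly overlapping) occurrences of `site` across sequences."""
    # A zero-width lookahead lets findall report overlapping matches too.
    pattern = re.compile("(?=" + re.escape(site) + ")")
    return sum(len(pattern.findall(seq.upper())) for seq in sequences)


def site_counts(sequences):
    """Return {enzyme name: number of recognition sites} for all enzymes."""
    seqs = list(sequences)
    return {name: count_sites(seqs, site)
            for name, site in RECOGNITION_SITES.items()}


if __name__ == "__main__":
    # Toy sequences for illustration only, not G. latifolia data.
    toy = ["AACTGCAGTTCTAG", "aagcttCTGCAG"]
    for enzyme, n in site_counts(toy).items():
        print(enzyme, n)
```

Run on real scaffolds or reads, a tally like this would show which enzyme cuts at an intermediate frequency, which is the criterion the text gives for selecting PstI.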
The resulting assembly contained 16,423 scaffolds representing 1,069 Mbp, with an N50 of 235 kb. Bowtie2, which allows for insertions and deletions (indels), was also evaluated for read mapping, but at the high stringencies for matching employed, few indels were detected, and the output from Bowtie was parsed more directly to SNP calls than the output from Bowtie2. Second, SNPs were called when at least three reads from both replications of PI 559298 differed from both replications of PI 559300. Finally, Bowtie output files for each individual/line were used to assess allelic frequencies for each SNP using a custom Perl script, which ignored SNPs in low quality sequence reads (average quality scores of 40 or less). Based on allelic frequencies at each locus for each line, the Perl script then created a genotype matrix file for linkage analysis. Markers with less than 30% missing data and whose segregation did not differ significantly (P>0.05) from expected segregation ratios were selected for de novo linkage map construction. Linkage maps were constructed using MSTMap with a P-value = 1.0×10⁻⁹, and visualized using MapDraw. Consensus linkage maps were constructed for G. latifolia from the F2 and F5 data using MergeMap. A weight of 5.0 was assigned to the F5 linkage maps and a weight of 1.0 to the F2 linkage maps to reflect the higher confidence in the quality of the F5 maps, because the potential for errors in calling heterozygous genotypes is lower in the F5 population than in the F2 population. To assess the synteny between G. latifolia linkage groups (LGs) and G. max chromosomes, SNP-containing sequences from G. latifolia were aligned to the G. max genome sequence using BLAST. For comparisons with P. vulgaris chromosomes, G. latifolia SNP-containing sequences and G. max gene models were aligned to P. vulgaris chromosomes using BLAST and visualized with MizBee. When G. latifolia sequences aligned at more than one location, the most syntenic location was chosen for these analyses. […]
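The SNP-calling rule and the marker filter described above can be sketched in Python. The authors used custom Perl scripts that are not reproduced in this excerpt, so everything below is an illustrative assumption: the function names, the requirement that each parental replicate show a single allele, the heterozygote band of 0.2 to 0.8, the reuse of the three-read minimum for progeny genotypes, and the expected segregation ratios (1:2:1 in the F2, with heterozygosity halving each selfing generation). The P-value uses the exact chi-square survival function for 2 degrees of freedom, exp(-x/2), which avoids external dependencies.

```python
import math


def parental_snp(p1_reps, p2_reps, min_reads=3):
    """Return (allele1, allele2) if every replicate of each parent has at
    least `min_reads` reads of one allele and the parents differ, else None.
    Each replicate is a dict mapping base -> read count (assumed layout)."""
    def consensus(rep):
        seen = [base for base, n in rep.items() if n > 0]
        if sum(rep.values()) < min_reads or len(seen) != 1:
            return None
        return seen[0]

    a1 = {consensus(r) for r in p1_reps}
    a2 = {consensus(r) for r in p2_reps}
    if None in a1 | a2 or len(a1) != 1 or len(a2) != 1:
        return None
    (x,), (y,) = a1, a2
    return (x, y) if x != y else None


def call_genotype(a_reads, b_reads, min_reads=3, het_band=(0.2, 0.8)):
    """Call 'A', 'B', 'H' (heterozygous) or '-' (missing) from read counts.
    The thresholds are hypothetical, not taken from the protocol."""
    total = a_reads + b_reads
    if total < min_reads:
        return "-"
    frac_a = a_reads / total
    if frac_a >= het_band[1]:
        return "A"
    if frac_a <= het_band[0]:
        return "B"
    return "H"


def expected_ratios(generation):
    """Expected A:H:B proportions after selfing to F<generation> (assumed)."""
    het = 0.5 ** (generation - 1)
    hom = (1.0 - het) / 2.0
    return {"A": hom, "H": het, "B": hom}


def segregation_p_value(genotypes, generation):
    """Chi-square goodness-of-fit P-value (2 df) against expected ratios."""
    counts = {g: genotypes.count(g) for g in "AHB"}
    n = sum(counts.values())
    exp = expected_ratios(generation)
    chi2 = sum((counts[g] - n * exp[g]) ** 2 / (n * exp[g]) for g in "AHB")
    return math.exp(-chi2 / 2.0)  # exact survival function for 2 df


def keep_marker(genotypes, generation, max_missing=0.30, alpha=0.05):
    """Keep markers with <30% missing data and no significant distortion."""
    if not genotypes:
        return False
    if genotypes.count("-") / len(genotypes) >= max_missing:
        return False
    return segregation_p_value(genotypes, generation) > alpha
```

For example, `keep_marker(["A"] * 25 + ["H"] * 50 + ["B"] * 25, generation=2)` retains a marker that fits 1:2:1 exactly, while a marker scored 60:30:10 in an F2 would be rejected as distorted. The retained markers would then form the genotype matrix passed to the linkage mapping step.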

Pipeline specifications

Software tools: FASTX-Toolkit, TASSEL, Bowtie, ALLPATHS-LG, Bowtie2, MSTMap, MizBee
Databases: Phytozome
Application: Genome data visualization
Organisms: Glycine max, Alfalfa mosaic virus, Sclerotinia sclerotiorum, Phaseolus vulgaris
Chemicals: Glycine