Computational protocol: Phylogenomics and barcoding of Panax: toward the identification of ginseng species

Protocol publication

[…] Sequencing reads were demultiplexed into FASTQ files using Flexbar version 3.0.3. Trimmomatic version 0.36 was used for adapter trimming and quality filtering of reads, using a sliding window of 15 bp and an average Phred threshold of 20. Low-quality bases at read ends (Phred score below 20) were removed, and only reads longer than 100 bp were retained. MITObim version 1.7 was used to assemble the single-end Ion Torrent reads by iterative mapping with in silico baiting, using the reference plastomes of P. vietnamensis (KP036470) and P. stipuleanatus (KX247147).
Inverted repeats and ambiguous portions of the assembly were resequenced using Sanger sequencing. Specific primers were designed and used to amplify the regions of interest. PCR was performed on a Mastercycler® Pro (Eppendorf, USA) in a 20 μl final volume containing 2.5 μM of each primer, 1 mM of each dNTP, 10X DreamTaq Buffer, 0.75 U DreamTaq DNA polymerase (ThermoFisher Scientific, USA) and deionized water. The PCR cycling conditions were an initial denaturation step at 94 °C for 2 min, followed by 35 cycles of denaturation at 94 °C for 30 s, primer annealing at 50–55 °C for 30 s and primer extension at 72 °C for 1 min, and a final extension step at 72 °C for 5 min. PCR products were then purified using the GeneJET PCR Purification Kit (ThermoFisher Scientific, USA). Sanger sequencing was performed on an ABI 3500 Genetic Analyzer system using the BigDye Terminator v3.1 Cycle Sequencing Kit. Cycle sequencing was performed on a Veriti Thermal Cycler (Applied Biosystems, USA) using 3.2 μM of each primer, 200 ng purified PCR product, 5X BigDye Sequencing Buffer, 2.5X Ready Reaction Premix and deionized water in a 20 μl final volume. The thermocycling conditions were 1 min at 96 °C, followed by 25 cycles of denaturation at 96 °C for 1 min, primer annealing at 50 °C for 5 s and primer extension at 60 °C for 4 min, and a holding step at 4 °C.
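The read filtering described at the start of this section (15 bp sliding window, mean Phred threshold of 20, removal of low-quality end bases, minimum length of 100 bp) can be illustrated with a minimal Python sketch. This is a simplified approximation and not Trimmomatic's implementation: the function names are hypothetical, Phred+33 encoding is assumed, the exact cut position may differ from Trimmomatic's, and treating the 100 bp limit as inclusive is an assumption.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def sliding_window_cut(quals, window=15, min_avg=20):
    """Return how many leading bases to keep.

    Scans 5'->3' and cuts at the start of the first window whose mean
    quality is below min_avg. Reads shorter than the window are kept whole
    (an assumption of this sketch).
    """
    if len(quals) < window:
        return len(quals)
    window_sum = sum(quals[:window])
    for start in range(len(quals) - window + 1):
        if start > 0:
            # Slide the window by one base: add the new base, drop the old one.
            window_sum += quals[start + window - 1] - quals[start - 1]
        if window_sum < min_avg * window:
            return start
    return len(quals)


def trim_trailing(quals, keep, min_qual=20):
    """Drop 3' bases below min_qual from the first `keep` bases."""
    while keep > 0 and quals[keep - 1] < min_qual:
        keep -= 1
    return keep


def filter_read(seq, quality_string, window=15, min_avg=20, min_qual=20, min_len=100):
    """Trim one read; return (sequence, qualities) or None if it is too short."""
    quals = phred_scores(quality_string)
    keep = sliding_window_cut(quals, window, min_avg)
    keep = trim_trailing(quals, keep, min_qual)
    if keep < min_len:  # inclusive limit is an assumption
        return None
    return seq[:keep], quality_string[:keep]
```

For example, `filter_read("A" * 150, "I" * 120 + "#" * 30)` trims the low-quality 3' end and keeps the read, whereas a 90 bp read is discarded regardless of quality.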
Extension products were purified by ethanol/EDTA precipitation using 5 μl of 125 mM EDTA and 60 μl of absolute ethanol. Purified products were denatured at 95 °C for 5 min in 10 μl Hi-Di Formamide. DNA electrophoresis was performed in an 80 cm × 50 μm capillary with POP-4 polymer (Applied Biosystems, USA).
To test the efficacy of the NEBNext Microbiome DNA Enrichment Kit, the proportion of reads belonging to the plastome was estimated for both the methylated and the non-methylated fractions. The P. ginseng whole genome sequencing SRR19873 experiment was used to estimate the starting proportion of plastome reads by mapping the reads against the plastid genome of P. ginseng (NC_006290) using Bowtie 2. Reads were assigned to taxa and organelles using a tailored database of Panax plastome data, representing the same data as that downloaded from public repositories for the phylogenetic analyses. For the mitochondrial data, all angiosperm mitochondrial genomes available on NCBI were used, and for the microbiome all remaining reads were searched with BLAST against the full NCBI database. Taxonomic identifications were retrieved using the lowest common ancestor (LCA) algorithm in MEGAN version 5.11.3, with a minimum read length of 150 bp and at least 10 reads for each taxon identified with an e-value of 1e-20 or less. The proportion of plastid DNA in the gDNA was estimated using Bowtie 2 as the proportion of reads mapping to the plastid genome of P. ginseng (following SRR experiment SRR1181600).
The plastid genomes were annotated using Geneious version 6.1, and annotations of exons and introns were manually checked by alignment with their respective genes in the same annotated species genome. Representative maps of the chloroplast genomes were created using OGDraw (Organellar Genome Draw). [...]
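The estimate of the plastome read proportion from a Bowtie 2 mapping, as described above, amounts to counting mapped reads among all reads in the alignment output. The Python sketch below shows one way to do this from SAM-format text. The function name is hypothetical, and skipping secondary and supplementary records so that each read is counted once is an assumption of this sketch rather than a detail given in the protocol.

```python
def plastome_read_fraction(sam_lines):
    """Return (mapped_reads, total_reads, fraction) from SAM-format lines."""
    UNMAPPED = 0x4          # SAM FLAG bit: read is unmapped
    SECONDARY = 0x100       # SAM FLAG bit: secondary alignment
    SUPPLEMENTARY = 0x800   # SAM FLAG bit: supplementary alignment
    mapped = 0
    total = 0
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.split("\t")[1])
        # Count each read once: ignore extra alignment records.
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    fraction = mapped / total if total else 0.0
    return mapped, total, fraction


# Toy records (made up for illustration): two mapped reads, one unmapped
# read, and one secondary alignment that is not counted.
toy_sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tNC_006290\t100\t42\t50M\t*\t0\t0\t*\t*",
    "read2\t16\tNC_006290\t900\t42\t50M\t*\t0\t0\t*\t*",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "read1\t256\tNC_006290\t5000\t1\t50M\t*\t0\t0\t*\t*",
]
mapped, total, fraction = plastome_read_fraction(toy_sam)
```

In practice the lines would come from an open SAM file produced by mapping against the plastid reference, and the same count can be compared between the enriched fractions and the whole genome sequencing baseline.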
The matrix for phylogenomic analyses consisted of complete aligned plastid genomes. The global alignment was done using MAFFT version 7.3, with local re-alignment using MUSCLE version 3.8.31 and manual adjustments where necessary. Aligned DNA sequences have been deposited in the Open Science Framework (OSF) directory (https://osf.io/ryuz6). The final matrix has a total length of 163,499 bp for a total of 61 individuals with no missing data. Single nucleotide polymorphisms (SNPs) were visualized using Circos version 0.69. Relationships from the nucleotide matrix were inferred using Maximum Likelihood (ML) and Bayesian inference. First, an unpartitioned phylogenetic analysis was performed to estimate a single nucleotide substitution model and branch length parameters for all characters. Next, the data were partitioned into coding regions, introns and intergenic spacers, and a best-fit partitioning scheme for the combined dataset was determined using PartitionFinder version 2.1.1 with the Bayesian Information Criterion (Additional file : Table S3). Branch lengths were linked across partitions.
The dataset was analyzed using RAxML version 8.2.10 and MrBayes version 3.2.6. RAxML and Bayesian searches used the partition model determined by PartitionFinder. For the ML analyses, tree searches and bootstrapping were conducted simultaneously with 1000 bootstrap replicates. Bayesian analyses were started from a random starting tree and were run for a total of ten million generations, sampling every 1000 generations. Four Markov runs were conducted with eight chains per run. We used AWTY to assess the convergence of the analyses. Conflicting data within ML and Bayesian analyses were visualized and explored using the consensusNet function of the R package phangorn. [...]
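The comparison of partitioning schemes by the Bayesian Information Criterion can be made concrete with the standard formula BIC = -2 ln L + k ln n, where the scheme with the lowest BIC is preferred. The Python sketch below applies it to made-up values. The log-likelihoods and parameter counts are hypothetical and not results from this study, and taking n as the alignment length (163,499 sites) is an assumption about how the sample size is defined.

```python
import math


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: -2 ln L + k ln n."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)


def best_scheme(schemes, n_sites):
    """Score each scheme and return (name of lowest-BIC scheme, all scores).

    `schemes` maps a scheme name to (log-likelihood, number of free parameters).
    """
    scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in schemes.items()}
    return min(scores, key=scores.get), scores


# Hypothetical values for illustration only.
candidate_schemes = {
    "unpartitioned": (-250000.0, 130),
    "three_partitions": (-249500.0, 150),
    "one_partition_per_region": (-249490.0, 400),
}
best, scores = best_scheme(candidate_schemes, n_sites=163499)
```

With these toy numbers the three-partition scheme wins: the most heavily partitioned scheme fits slightly better but is penalized for its extra parameters, which is the trade-off the criterion is meant to capture.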
Suitable barcoding markers were selected by extracting the SNP density over the plastid genome alignment of all Panax species and individuals included in this study (matrix available as supplementary data on OSF). We used SNP-sites version 2.3.2 to extract the SNP positions from the alignment of a matrix containing only the Panax species, created bins every 800 bp using Bedtools version 2.26.0 (script available on OSF), and plotted the SNP density using Circos (Fig. ). The coordinates of each annotation on the aligned Panax species matrix were found using a reference consisting of the four annotated genomes produced in this study, and subsequently exported to Circos. We selected the most variable regions and designed suitable primers for these regions (Fig. , Additional file : Table S4). From the matrix used for the Aralioideae, we extracted 15 plastid markers (Fig. ) and downloaded ITS sequences for the Aralia-Panax group (Figs.  and ) (Additional file : Table S2). We performed maximum likelihood analyses on individual and concatenated matrices using RAxML. mPTP analyses were performed using the ML trees from the individual and concatenated markers, using the Markov chain Monte Carlo (MCMC) algorithm with two chains and the Likelihood Ratio Test set to 0.01. Fig. 1 […]
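The SNP density calculation described above (variable positions extracted from the alignment, then counted in 800 bp bins) can be sketched in plain Python. This is an illustration of the idea and not the authors' OSF script: the function names are hypothetical, and ignoring gaps and ambiguity codes when calling a site variable is an assumption that may differ from how SNP-sites treats them.

```python
def variable_sites(sequences):
    """Return 0-based alignment columns where at least two of A/C/G/T occur."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    sites = []
    for col in range(length):
        # Gaps and ambiguity codes are ignored (an assumption of this sketch).
        bases = {s[col].upper() for s in sequences} & set("ACGT")
        if len(bases) > 1:
            sites.append(col)
    return sites


def snp_density(sites, alignment_length, bin_size=800):
    """Count SNPs per fixed-width bin.

    Returns (start, end, count) tuples with 0-based, half-open coordinates;
    the last bin is truncated at the alignment end.
    """
    n_bins = (alignment_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for pos in sites:
        counts[pos // bin_size] += 1
    return [
        (i * bin_size, min((i + 1) * bin_size, alignment_length), count)
        for i, count in enumerate(counts)
    ]


# Toy alignment (made up), binned at 4 bp so the result is easy to inspect.
toy_alignment = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGTAN"]
toy_sites = variable_sites(toy_alignment)
toy_bins = snp_density(toy_sites, alignment_length=10, bin_size=4)
```

The resulting (start, end, count) rows have the same shape as the interval data a genome plotting tool would take for a histogram track, and bins with the highest counts point to candidate regions for primer design.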

Pipeline specifications

Software tools: Flexbar, Trimmomatic, MITObim, Bowtie, Bowtie2, Geneious, OGDRAW, MAFFT, MUSCLE, Circos, PartitionFinder, RAxML, MrBayes, AWTY, Phangorn, BEDTools
Applications: Phylogenetics, WGS analysis, Nucleotide sequence alignment, Genome data visualization
Organisms: Panax ginseng, Panax quinquefolius, Panax vietnamensis, Physiculus japonicus