Computational protocol: Sequencing and Analysis of Strobilanthes cusia (Nees) Kuntze Chloroplast Genome Revealed the Rare Simultaneous Contraction and Expansion of the Inverted Repeat Region in Angiosperm

[…] Further steps of library preparation were performed using a TruSeq DNA Sample Prep Kit (Illumina, Inc., United States) according to the manufacturer’s instructions. The DNA was sheared to yield approximately 500 bp-long fragments for paired-end library construction. The library was sequenced on an Illumina HiSeq 3000 (Illumina, Inc.). In total, 9,912,889 paired-end reads (2 × 150 bp) were obtained.

We first downloaded 1,006 plastid genomes from GenBank in February 2016. These plastid genome sequences were searched against the Illumina paired-end reads using BLASTN with an E-value cutoff of 1e-5. The genome sequence of A. paniculata (accession number: NC_022451) had the highest overall sequence similarity to the reads and was used as the reference for the downstream genome assembly.

ABySS (v1.5.2) and CLC Genomics Workbench (v7) were used for the de novo genome assembly. Using Gepard, we identified seven contigs from each of the ABySS and CLC Genomics Workbench assemblies that nearly spanned the entire chloroplast (cp) genome. All 14 contigs were then assembled using the SeqMan module of DNASTAR (v11.0), which yielded a single sequence corresponding to the large single-copy (LSC), inverted repeat (IR), and small single-copy (SSC) regions of S. cusia. Regions corresponding to the IR/SSC and IR/LSC boundaries were confirmed via direct PCR amplification.

PCR amplifications were performed using sequence-specific primers (Supplementary Table) under the following conditions: pre-denaturation at 94°C for 2 min; 35 cycles of 94°C for 30 s, 55°C for 30 s, and 72°C for 30 s; and a final extension at 72°C for 2 min. The PCR reaction mixture contained 25 μL of Taq MasterMix (2×), 2 μL of forward primer (10 μM), 2 μL of reverse primer (10 μM), and purified cp DNA (<1 μg). RNase-free water was added to a final reaction volume of 50 μL.

The CpGAVAS web service was used to annotate the S. cusia cp genome.
The E-value cutoffs for BLASTN and BLASTX were 1e-10. The number of top hits included in the reference gene sets for annotation after the pre-filtering step was 10. Meanwhile, tRNA genes were identified using tRNAscan-SE and ARAGORN. The positions of the start and stop codons and the intron/exon boundaries were corrected manually, based on the entries in the cp genome database, using the Apollo program. The circular cp genome map of S. cusia was drawn using OrganellarGenomeDRAW. Codon usage and GC content (that is, the percentage of guanines and cytosines) were analyzed using the cusp and compseq programs provided by EMBOSS. The final genome assembly and annotation were deposited in GenBank (accession number: MG874806). […]

Simple sequence repeats (SSRs) were detected using the MISA Perl script with the following thresholds: 8 repeat units for mononucleotide SSRs, 4 repeat units for di- and trinucleotide SSRs, and 3 repeat units for tetra-, penta-, and hexanucleotide SSRs. Tandem repeats were analyzed using Tandem Repeats Finder with parameter settings of two for matches and seven for mismatches and indels. The minimum alignment score and maximum period size were set at 50 and 500, respectively. All identified repeats were manually verified, and nested or redundant results were removed. REPuter was employed to identify the IRs in S. cusia by forward versus reverse complement (palindromic) alignment. The minimal repeat size was set at 30 bp, and the cutoff for similarity among the repeat units was set at 90%. […]

Conserved sequences were identified between the cp genomes of Astragalus membranaceus and those of A. paniculata (NC_022451.2), R. breedlovei (KP300014.1), Tanaecium tetragonolobum (NC_027955.1), Dorcoceras hygrometricum (NC_016468.1), Salvia miltiorrhiza (NC_020431.1), Olea europaea (NC_013707.2), Sesamum indicum (NC_016433.2), and Scrophularia takesimensis (NC_026202.1) by using BLASTN with an E-value cutoff of 1e-10. The homologous regions and gene annotations were visualized using the web-based genome synteny viewer GSV. […]

A total of 28 complete cp DNA sequences belonging to the order Lamiales were obtained from the RefSeq database. For the phylogenetic analysis, the 65 protein sequences shared among all 31 species, including S. cusia, were aligned using the CLUSTALW2 (v2.0.12) program. The 65 proteins included ATPA, ATPB, ATPE, ATPF, ATPH, ATPI, CCSA, CEMA, MATK, NDHA, NDHB, NDHC, NDHE, NDHF, NDHG, NDHH, NDHI, NDHJ, NDHK, PETA, PETD, PETG, PETL, PETN, PSAA, PSAB, PSAC, PSAI, PSAJ, PSBA, PSBC, PSBD, PSBE, PSBF, PSBH, PSBJ, PSBK, PSBL, PSBM, PSBN, PSBT, PSBZ, RBCL, RPL14, RPL2, RPL20, RPL22, RPL23, RPL32, RPL33, RPL36, RPOB, RPOC1, RPOC2, RPS11, RPS14, RPS15, RPS18, RPS2, RPS3, RPS7, RPS8, YCF2, YCF3, and YCF4 (Supplementary File). The alignment was manually examined and adjusted. The evolutionary history was then inferred using the Maximum Likelihood method implemented in RAxML (v8.2.4). The detailed parameters were “raxmlHPC-PTHREADS-SSE3 -f a -N 1000 -m PROTGAMMACPREV -x 551314260 -p 551314260 -o A_thaliana,N_tabacum -T 20”. The tree with the highest log likelihood (-126826.029496) was shown. Support for the phylogenetic tree was assessed by bootstrap testing with 1,000 replicates. Bootstrap values exceeding 50% were shown next to the corresponding nodes. […]
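
The SSR thresholds stated above (8 repeat units for mononucleotide motifs, 4 for di- and trinucleotide motifs, and 3 for tetra-, penta-, and hexanucleotide motifs) can be illustrated with a minimal Python sketch. This is not the MISA script itself: it is a simplified regular-expression scan written for illustration, the names `MIN_REPEATS`, `is_primitive`, and `find_ssrs` are hypothetical, and it does not reproduce MISA features such as compound SSR detection.

```python
import re

# Minimum repeat counts per motif length, taken from the thresholds in the text:
# 8 for mono-, 4 for di- and tri-, 3 for tetra-, penta-, and hexanucleotide SSRs.
MIN_REPEATS = {1: 8, 2: 4, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """Return True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples.

    Coordinates are 1-based and inclusive. Motifs such as "AA" or "ATAT"
    are skipped so that each run is reported only under its shortest unit.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if not is_primitive(motif):
                continue
            count = len(m.group(0)) // unit_len
            hits.append((m.start() + 1, m.end(), motif, count))
    return sorted(hits)
```

Calling `find_ssrs` on a sequence string returns one tuple per detected repeat, which can then be checked manually for nested or redundant entries as described in the text.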

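The reference-selection step described earlier (searching the downloaded plastid genomes against the reads with BLASTN at an E-value cutoff of 1e-5 and picking the genome with the highest overall similarity) could be tallied along the following lines. The text does not say how "overall sequence similarity" was scored, so this sketch assumes, purely for illustration, that BLASTN results are available in the standard 12-column tabular format and that genomes are ranked by their summed bit score. The function name `rank_references` is hypothetical.

```python
from collections import defaultdict

EVALUE_CUTOFF = 1e-5  # cutoff stated in the text


def rank_references(blast_lines, evalue_cutoff=EVALUE_CUTOFF):
    """Rank query genomes by summed bit score over hits passing the cutoff.

    Assumes BLAST tabular output with the default 12 columns, where
    column 1 is the plastid genome used as query, column 11 is the
    E-value, and column 12 is the bit score. The ranking criterion
    (summed bit score) is an assumption, not taken from the source.
    """
    totals = defaultdict(float)
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        genome = fields[0]
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= evalue_cutoff:
            totals[genome] += bitscore
    # Highest total first; the top entry is the candidate reference.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

Passing an open file handle of tabular BLASTN output to `rank_references` returns the genomes in descending order of summed bit score. Other criteria, such as the number of distinct reads hit, would be equally defensible under the same assumptions.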
