Computational protocol: Complete Genome Sequence of Steroid-Transforming Nocardioides simplex VKM Ac-2033D

Protocol publication

[…] Nocardioides simplex VKM Ac-2033D (synonyms: Arthrobacter simplex [basonym] and Pimelobacter simplex [senior homotypic synonym]) efficiently introduces a 1(2)-double bond into various 1(2)-saturated 3-ketosteroids, thus enabling the production of valuable pharmaceuticals and immediate precursors for the steroid industry. The strain is also capable of efficient hydrolysis of acetylated steroids, utilization of natural sterols, and reduction of the carbonyl groups at C-17 of androstanes and C-20 of pregnanes. This soil bacterium was first classified as Arthrobacter globiformis 193 and later reclassified as N. simplex VKM Ac-2033D on the basis of a polyphasic taxonomic analysis.

The short-read library, containing DNA fragments with an insert length of 226 ± 33 bp, was prepared with a TruSeq DNA sample preparation kit (Illumina) after fragmentation of the genomic DNA with NEBNext double-stranded DNA (dsDNA) Fragmentase. The library was sequenced on a HiSeq 2000 with paired-end 100-nucleotide reads. The mate-pair libraries, with fragment lengths ranging from 3,222 ± 251 bp to 9,992 ± 2,172 bp, were prepared with the Nextera mate-pair sample preparation kit (Illumina) and sequenced on a MiSeq. NextClip 0.8 was used to remove possible paired-end contamination. Both the paired-end and mate-pair reads were adapter- and quality-trimmed with Trimmomatic 0.32. The mean coverage of the genome by the three libraries was 1,989×. De novo genome assembly was performed with Velvet 1.2 and SPAdes 2.5 using the paired-end reads, and with SPAdes 3.1.0, CLC Genomics Workbench 6.0, and MaSuRCA 2.3.2 using both the paired-end and mate-pair reads. The resulting contigs were manually combined into a single circular contig in BioEdit. The quality of this contig was assessed with REAPR 1.0.17.
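
As a rough check on the sequencing depth reported here, mean coverage is the total number of sequenced bases divided by the genome length. The sketch below is not part of the published protocol; it inverts that relation to estimate how many bases a 1,989× coverage implies for the 5,637,355-nt genome reported in this protocol. The function names are hypothetical, and the estimate assumes the coverage figure counts all bases from the three libraries, which the text does not specify.

```python
GENOME_LENGTH_NT = 5_637_355  # genome length reported in the protocol
REPORTED_COVERAGE = 1_989     # mean coverage reported for the three libraries


def mean_coverage(total_bases: int, genome_length: int) -> float:
    """Mean coverage = total sequenced bases / genome length."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return total_bases / genome_length


def bases_for_coverage(coverage: float, genome_length: int) -> float:
    """Total sequenced bases implied by a given mean coverage."""
    return coverage * genome_length


# Hypothetical back-of-the-envelope estimate, not a figure from the paper.
implied_bases = bases_for_coverage(REPORTED_COVERAGE, GENOME_LENGTH_NT)
print(f"Implied sequenced bases: {implied_bases / 1e9:.1f} Gb")
```

Under these assumptions the three libraries together would amount to roughly 11.2 Gb of sequence.
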
The contig was also checked by mapping the reads in CLC Genomics Workbench and by visual inspection of potentially ambiguous regions.

The genome is 5,637,355 nucleotides (nt) long, with a G+C content of 72.66%. The genome was annotated with the RAST service and with GenBank tools. The RAST annotation revealed 5,421 protein-coding sequences, and the GenBank annotation revealed 4,633 coding sequences (CDS) and 816 pseudogenes; both annotations show 46 tRNAs (44 of which are unique), one pseudo-tRNA, and 6 rRNAs. A preliminary analysis of the sequences revealed several clusters of genes involved in cholesterol metabolism (side chain degradation, steroid core degradation, and transport).

The reported complete genome sequence will help to elucidate the range of steroid substrates that this organism can metabolize and the scope of its potential application in pharmaceutical steroid production. […]
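
The G+C content quoted above is the percentage of G and C bases among the unambiguous bases of the assembled sequence. The following is a minimal sketch of that calculation, assuming the contig is available as a plain string; the function name is hypothetical, the toy sequence is illustrative only, and this is not the tool the authors used.

```python
from collections import Counter


def gc_content_percent(sequence: str) -> float:
    """Return G+C content as a percentage of unambiguous (A/C/G/T) bases."""
    counts = Counter(sequence.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / acgt


# Toy sequence for illustration only (not taken from the genome).
toy_sequence = "GCGCCGATGC"
print(f"G+C content: {gc_content_percent(toy_sequence):.2f}%")
```

Ambiguous bases such as N are excluded from the denominator here; whether the reported 72.66% treats them the same way is not stated in the text.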

Pipeline specifications