Computational protocol: SBSPKSv2: structure-based sequence analysis of polyketide synthases and non-ribosomal peptide synthetases

Protocol publication

[…] To understand the biosynthetic pathway of an orphan PK or NRP, the user can search for chemically similar molecules using the ‘Reaction Search’ module (Figure ). The search for chemically similar PKs and NRPs accepts the chemical structure of the query molecule in SMILES format. Chemical structures in SMILES format can be obtained from PubChem for a large number of metabolites. If a structure is not available in PubChem or on other websites, the user can generate it using PubChem Sketcher. The ‘Reaction Search’ module allows users to restrict their search by defining the number of matches, the Tanimoto score or sub-structural patterns in SMARTS format. The algorithm then compares the given molecule to ∼2000 biosynthetic intermediates and final products of experimentally characterized PKs and NRPs using the similarity search option of Open Babel, which is based on substructure-based fingerprints. Links to the biosynthetic pathway pages of the hits returned by the tool can help in deciphering putative biosynthetic pathways of the query compound. [...]
The similarity search and the search for potential tailoring reactions use an elaborate database of biosynthetic pathways in chemical space at the backend. The database contains the biosynthetic pathways of >200 experimentally characterized PKs and NRPs. Based on extensive manual curation of published literature, chemical structures of metabolites and sequences of biosynthetic enzymes, each step involved in the biosynthesis of these PKs and NRPs has been cataloged in the database along with the reactions, enzyme names, accession numbers and monomers added. Approximately 2000 chemical structures of biosynthetic intermediates are stored in SMILES format, together with >1000 sequences of enzymes involved in the characterized PK/NRP pathways. The PK and NRP pathways are represented as interactive graphs (Figure ). The pathway pages use embedded JavaScript-based Cytoscape.js.
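As a rough illustration of the fingerprint-based similarity search described above, the following minimal sketch ranks a small library of SMILES strings against a query by Tanimoto score and applies the two restrictions mentioned in the text (a minimum score and a maximum number of matches). It is not the SBSPKSv2 implementation: the `toy_fingerprint` function is a deliberately naive stand-in for Open Babel's substructure fingerprints, the library entries and their names are hypothetical, and SMARTS filtering is left out.

```python
from typing import Dict, FrozenSet, List, Tuple


def toy_fingerprint(smiles: str, max_len: int = 3) -> FrozenSet[str]:
    """Naive stand-in for a substructure fingerprint (ASSUMPTION).

    Collects every character n-gram of length 1..max_len from the SMILES
    string. This is NOT how Open Babel computes fingerprints; it only
    makes the sketch self-contained and runnable.
    """
    return frozenset(
        smiles[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(smiles) - n + 1)
    )


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto coefficient: |A and B| / |A or B| for two feature sets."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def search(query: str, library: Dict[str, str],
           min_score: float = 0.5, max_hits: int = 5) -> List[Tuple[str, float]]:
    """Return up to max_hits library entries scoring >= min_score, best first."""
    query_fp = toy_fingerprint(query)
    scored = [(name, tanimoto(query_fp, toy_fingerprint(smiles)))
              for name, smiles in library.items()]
    hits = [hit for hit in scored if hit[1] >= min_score]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:max_hits]


# Hypothetical library; names are placeholders, not SBSPKSv2 records.
LIBRARY = {
    "intermediate_A": "CCC(=O)O",
    "intermediate_B": "CC(C(=O)O)C(=O)O",
    "intermediate_C": "c1ccccc1",
}

if __name__ == "__main__":
    for name, score in search("CCC(=O)O", LIBRARY, min_score=0.3, max_hits=2):
        print(f"{name}\t{score:.2f}")
```

In a real setting the toy fingerprint would be replaced by a chemically meaningful one, since character n-grams of a SMILES string depend on how the string happens to be written rather than on the molecule itself.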
Each graph starts with the starter moiety and catalogs the intermediate steps, terminating at the complete metabolite. The nodes of the graph represent the biosynthetic intermediates, and the edges represent the reactions that convert one intermediate into the next. Images of the chemical structures of the intermediates are used to depict the nodes. All nodes and edges in the graph-based viewer can be dragged by the user to any desired position and can be clicked to show additional details. Clicking an individual node shows a larger image of the chemical structure, its representation in SMILES format and a link to structurally similar metabolites. Each edge label shows the monomer being added (if applicable), the gene name corresponding to the enzyme involved and the reaction name. The web server also allows the user to download the pathway map of each metabolite as a flat file. The text part of the database can be searched using the keyword search functionality. For example, it can help find all PKS/NRPS pathways in which the monomer alanine or methylmalonate is added, or all pathways in which a particular reaction such as methyl transfer or epoxidation occurs. The identified pathways can then be visualized as interactive graphs. [...]
The genomic and chemical spaces of SBSPKSv2 have been interlinked by cross-references between related features/records. Clicking on an edge of a reaction graph in chemical space allows the user to visualize the corresponding biosynthetic enzyme in the genomic space of SBSPKS and carry out further analysis of its sequence or structural features. The link displays the complete biosynthetic gene cluster, with the selected enzyme highlighted (Figure ). Similarly, in the HTML pages that depict the domain organization of each biosynthetic gene cluster in genomic space, each domain is linked to the chemical transformation it catalyzes in chemical space.
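The pathway graphs described above (intermediates as nodes, reactions as edges labeled with monomer, gene and reaction name) can be sketched as a small data structure. The example below builds a list in the elements JSON layout that Cytoscape.js accepts (`group`, and `data` with `id`, `source`, `target`) and runs a simple keyword search over the edge annotations. The pathway steps, gene names and the `build_elements` and `keyword_search` helpers are hypothetical placeholders; how SBSPKSv2 stores and queries its pathways internally is not stated in the text.

```python
import json
from typing import Dict, List

# Hypothetical pathway steps; names are placeholders, not SBSPKSv2 records.
STEPS = [
    {"substrate": "starter", "product": "intermediate_1",
     "monomer": "methylmalonate", "gene": "geneA", "reaction": "condensation"},
    {"substrate": "intermediate_1", "product": "intermediate_2",
     "monomer": None, "gene": "geneB", "reaction": "epoxidation"},
    {"substrate": "intermediate_2", "product": "final_product",
     "monomer": "alanine", "gene": "geneC", "reaction": "condensation"},
]


def build_elements(steps: List[dict]) -> List[dict]:
    """Convert pathway steps into Cytoscape.js-style node and edge elements."""
    nodes: Dict[str, dict] = {}
    edges: List[dict] = []
    for i, step in enumerate(steps):
        # Each intermediate becomes one node, created on first mention.
        for key in ("substrate", "product"):
            name = step[key]
            nodes.setdefault(name, {"group": "nodes", "data": {"id": name}})
        # Edge label: monomer (if any), gene name, reaction name.
        parts = [step["monomer"], step["gene"], step["reaction"]]
        edges.append({
            "group": "edges",
            "data": {
                "id": f"e{i}",
                "source": step["substrate"],
                "target": step["product"],
                "label": " | ".join(p for p in parts if p),
                "monomer": step["monomer"],
                "gene": step["gene"],
                "reaction": step["reaction"],
            },
        })
    return list(nodes.values()) + edges


def keyword_search(elements: List[dict], keyword: str) -> List[dict]:
    """Return edges whose monomer or reaction contains the keyword."""
    kw = keyword.lower()
    return [
        el for el in elements
        if el["group"] == "edges"
        and any(kw in (el["data"][field] or "").lower()
                for field in ("monomer", "reaction"))
    ]


if __name__ == "__main__":
    elements = build_elements(STEPS)
    print(json.dumps(elements, indent=2))
    for edge in keyword_search(elements, "epoxidation"):
        print(edge["data"]["label"])
```

The serialized list could be passed as the `elements` option when creating a Cytoscape.js instance in the browser; the same structure also works as a simple flat-file export.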
Clicking on a domain leads to a page that not only provides interfaces for a variety of sequence and structural analyses, but also provides a link to the biosynthetic pathway database in chemical space. The reaction catalyzed by the selected domain is highlighted in red. Thus, SBSPKSv2 provides interfaces for seamless transitions between genomic and chemical space and for carrying out various types of analysis. [...]
In the past decade, a large number of PKS and NRPS gene clusters have been identified and characterized. Resources like MIBiG, IMG-ABC and the antiSMASH database contain a large number of predicted secondary metabolite gene clusters. These databases are excellent resources containing a catalog of all predicted gene clusters and their domain annotations, but it is often difficult to distinguish information about experimentally characterized biosynthetic gene clusters (BGCs) from information predicted for uncharacterized BGCs. Therefore, there is a need to comprehensively annotate and store information on experimentally characterized BGCs and make it easily accessible for analysis. The few databases that contain manually curated PKS and NRPS gene clusters are DoBISCUIT and ClusterMine360. They contain 135 and 245 gene clusters, respectively, corresponding to unique compound families. However, the number of characterized gene clusters has increased since their last updates. Therefore, to catalog the growing information comprehensively, NRPS_PKS, the genomic database of SBSPKS, has been updated. NRPS_PKS now contains >300 gene clusters belonging to unique compound families. The database catalogs information about the genes involved in the biosynthesis of PKs and NRPs, their modules and domains, the specificity of acyltransferase (AT) and adenylation (A) domains and their active sites.
Each domain is linked to the respective domain organization page, which allows various analyses such as pairwise alignment with other characterized domains, search for the nearest structural homolog, threading alignments and comparison of the active site with other characterized sequences. As a number of new 3D structures of PKS and NRPS domains have been elucidated since the last NRPS_PKS update, we have incorporated them into SBSPKSv2.
The earlier version of SBSPKS identified PKS/NRPS domains by pairwise alignment of the query sequence to template sequences of various domains, and multiple template sequences were used for domains like ACP that have highly diverged sequences. Since profile-based methods are more efficient for domain identification, other software such as antiSMASH, NRPSsp and NRPSpredictor use Hidden Markov Models (HMMs) not only for domain identification, but also for prediction of the substrate specificity of adenylation (A) domains of NRPS. We have now implemented an HMM-based method in SBSPKSv2 for quick and efficient domain identification. In the last few years, not only has the number of characterized gene clusters increased, but a number of new domains such as product template (PT), starter unit:acyl-carrier protein transacylase (SAT) and formyl transferase (FT) have also been identified in these megasynthases. To detect these new domains and the canonical PKS/NRPS domains, we have either developed HMMs or used HMMs from Pfam. A cut-off was determined for each domain after extensive analysis of the characterized sequences with profile HMMs. The sensitivity, specificity and precision of all our HMM-based models are >0.9. As the Condensation (C), Epimerization (E) and Cyclization (Cy) domains of NRPS share high sequence similarity, we have used motif-based methods to distinguish these domains. Though a number of tools exist for genome mining of PKS/NRPS gene clusters, detection of several unusual domains is exclusive to SBSPKSv2.
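To make the reported metrics concrete, the sketch below evaluates one score cut-off for a single domain model against labeled hits and computes sensitivity, specificity and precision from the resulting confusion counts. The scores and the cut-off value are invented for illustration, and the text does not describe how the actual per-domain cut-offs were chosen; the sketch only shows how the three metrics follow from a cut-off.

```python
from typing import Dict, Sequence


def evaluate_cutoff(positive_scores: Sequence[float],
                    negative_scores: Sequence[float],
                    cutoff: float) -> Dict[str, float]:
    """Score a cut-off: hits with score >= cutoff are called as the domain.

    positive_scores: HMM scores of sequences known to contain the domain.
    negative_scores: HMM scores of sequences known to lack the domain.
    """
    tp = sum(score >= cutoff for score in positive_scores)  # true positives
    fn = len(positive_scores) - tp                          # false negatives
    fp = sum(score >= cutoff for score in negative_scores)  # false positives
    tn = len(negative_scores) - fp                          # true negatives
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Invented scores for one hypothetical domain model.
POSITIVES = [120.5, 98.2, 87.0, 45.1, 110.3, 92.7, 101.9, 88.8, 95.0, 130.2]
NEGATIVES = [12.0, 30.5, 48.0, 22.1, 5.3, 18.9, 27.4, 9.9, 33.3, 15.0]

if __name__ == "__main__":
    for cutoff in (40.0, 50.0):
        metrics = evaluate_cutoff(POSITIVES, NEGATIVES, cutoff)
        print(cutoff, {name: round(value, 3) for name, value in metrics.items()})
```

Lowering the cut-off recovers more true domains at the cost of more false calls, which is why a separate cut-off per domain model is a natural design when score distributions differ between domains.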
In addition to domain detection, the genome mining tool of SBSPKSv2 also predicts substrate specificity and active sites, and identifies the closest structural homolog and experimentally characterized domain sequences (Figure ). The updated SBSPKSv2 now uses specificity-determining active site profiles from 160 different A domain monomers and 15 AT domain substrates. This significantly enhances the performance of SBSPKS in predicting the starter/extender substrates selected by PKS/NRPS modules in a newly identified sequence.
Since the last SBSPKS release, 3D structures of three NRPS modules have been elucidated. Given an NRPS module sequence, the ‘Model 3D-PKS/NRPS’ interface of SBSPKSv2 builds its homology model using these structures as templates. The SCWRL program is used to build the side chain coordinates of these homology models. [...]
Open Babel was used to build the database of biosynthetic intermediates. Chemaxon was used for chemical structure drawing. The interactive pathway graphs are visualized using Cytoscape.js. HMM profiles were built using the HMMER3 software. Pairwise alignments are performed using the latest version of BLAST+. […]
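One simple way to picture the active-site-based specificity prediction mentioned above is a nearest-signature lookup: the active site residues extracted from a query domain are compared position by position with reference signatures of known specificity, and the best match is reported. The sketch below assumes this scheme purely for illustration. The reference signatures and substrate names are made up, and the actual SBSPKSv2 profiles and scoring may differ.

```python
from typing import Dict, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length signatures."""
    if not a or len(a) != len(b):
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def predict_specificity(signature: str,
                        references: Dict[str, str]) -> Tuple[str, float]:
    """Return (substrate, identity) of the best-matching reference signature."""
    return max(
        ((substrate, identity(signature, ref))
         for substrate, ref in references.items()),
        key=lambda pair: pair[1],
    )


# Made-up signatures for hypothetical substrates; not real active site codes.
REFERENCES = {
    "substrate_X": "ACDEFGHIK",
    "substrate_Y": "ACDEFGHLM",
    "substrate_Z": "WYVTSRQPN",
}

if __name__ == "__main__":
    substrate, score = predict_specificity("ACDEFGHIR", REFERENCES)
    print(f"{substrate}\t{score:.2f}")
```

A low best-match identity would suggest that the query active site is unlike any characterized one, in which case a prediction should be treated with caution.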

Pipeline specifications