Computational protocol: Thermodynamic matchers for the construction of the cuckoo RNA family

Similar protocols

Protocol publication

[…] Let x be an RNA sequence, and F(x) its folding space, i.e., the set of secondary structures it can fold into. The thermodynamic model assigns a free energy state E(s) to a secondary structure s. A structure prediction algorithm, such as RNAfold, uses this energy model to predict the structure of minimum free energy (MFE),mfe(x)=argmins{E(s)|s∈F(x)}.A thermodynamic matcher (see footnote 1The name “thermodynamic matcher” goes back to Reeder et al., but no formal definition was given there.) (TDM) solves the same minimization problem on a specific subset of F(x). Let M be a function that, given sequence x, yields a subset of the folding space, M(x)⊂F(x). Our intention is that M somehow captures a class of structures that contains a consensus structure for a family of sequences. Then, a thermodynamic matcher for M computestdmM(x)=argmins{E(s)|s∈M(x)}.The name “thermodynamic matcher” goes back to Reeder et al., but no formal definition was given there.This is mathematically strict, with no heuristics involved. The matcher folds the given RNA sequence into the prescribed structure class as good as it – the RNA – can. Should mfe(x)∈M(x), we find tdmM(x)=mfe(x). Therefore, it makes sense to compare E(tdmM(x)) to E(mfe(x)). This allows to evaluate whether the matching structure comes close enough to minimal free energy to be plausible from the thermodynamic point of view.Although notations are quite different, a TDM is similar to a descriptive motif matcher as provided e.g. with RNAMotif. However, its implementation is different. It uses dynamic programming during the matching phase, rather than constructing combinatorial matches and evaluating them afterwards. Thus, only the best possible match (if any) for a sequence window is returned. Even with a loosely defined motif, the output is bounded by O(N), where N is the sequence size.Heretofore, the idea of TDMs has found a variety of applications, but not in RFM construction. A TDM, tuned to a subset of the folding space, is a nontrivial program to construct, similar but with specialized recurrences compared to a standard structure prediction algorithm. In fact, a TDM typically requires more recurrences and dynamic programming tables than a standard RNA folding algorithm. Hand-programming such a matcher, possibly modifying its design several times and debugging the implemented dynamic programming algorithm, would be prohibitive in practice. Fortunately, there is some automated support. The program RapidShapes generates TDMs from abstract shape specifications, such as "[[][][]]" for the cloverleaf shape. This allows one to directly compute this shape's contribution to the partition function of x in O(n3), which otherwise requires computing all shapes and has exponential runtime.The tool Locomotif generates TDMs from structure graphics, which a non-programmer can compile from pre-defined building blocks (stems, bulges, multiloops, simple pseudoknots, …). These building blocks can be specialized and decorated with sequence motifs in various ways. But in principle, Locomotif's fixed set of blocks is a restriction of the TDM concept.The recent Bellman's GAP system provides the declarative language GAP-L, which allows to describe TDMs by (tree) grammars and evaluation algebras, and comes with a repository of re-usable components for RNA folding algorithms. In particular, there is an algebra for MFE calculation, which can be used off-the-shelf, and the TDM designer does not have to worry about the intricacies of the energy model. It implements the TDM grammar with the best asymptotic efficiency and a minimal number of dynamic programming tables.The program RapidShapes generates TDMs from abstract shape specifications, such as "[[][][]]" for the cloverleaf shape. This allows one to directly compute this shape's contribution to the partition function of x in O(n3), which otherwise requires computing all shapes and has exponential runtime.The tool Locomotif generates TDMs from structure graphics, which a non-programmer can compile from pre-defined building blocks (stems, bulges, multiloops, simple pseudoknots, …). These building blocks can be specialized and decorated with sequence motifs in various ways. But in principle, Locomotif's fixed set of blocks is a restriction of the TDM concept.The recent Bellman's GAP system provides the declarative language GAP-L, which allows to describe TDMs by (tree) grammars and evaluation algebras, and comes with a repository of re-usable components for RNA folding algorithms. In particular, there is an algebra for MFE calculation, which can be used off-the-shelf, and the TDM designer does not have to worry about the intricacies of the energy model. It implements the TDM grammar with the best asymptotic efficiency and a minimal number of dynamic programming tables.In this study, we chose to first produce a shape matcher with Locomotif, obtain it's GAP-L source code, and then to further refine the GAP-L program according to our design decisions. [...] The search for homologs of the cuckoo family in a single genome sequence can be divided into three main stages. In the first stage the skeleton TDM screens the genome in a sliding-window mode for initial candidate homologs that meet the primary sequence constraints. Here, a window of 120 nt size and a step size of 50 nt is used. Overlapping matches, provided that the discovered cuckoo motifs in the overlapping part are at identical genomic positions, are assembled to a single candidate. The assembly step enables the generation of candidates with variable numbers of cuckoo motifs, corresponding to an equal number of stem-loops. The screen in 2,689 bacterial genomes resulted in 42,312 candidates which represented 39,801 HP2, 2,233 HP3, and 278 HP4 cuckoo candidates (). Figure 4. In the next stage, the cuckoo TDM tdmC is used to structurally assess each candidate sequence by attempting to fold it into the cuckoo family structure. Candidates that cannot fold into the cuckoo family structure are discarded. The TDM screen reduced the number of candidate sequences by more than 60% to 16,582, with 15,098 most of them belonging to the HP2 structural variant, followed by 1,311 HP3 and 176 HP4 cuckoo candidates. The purpose of the third stage is to verify that a cuckoo candidate x obtained by tdmC truly resembles a homolog of the cuckoo family by assessing its structure tdmC(x). For this, the structure must comply to the following two criteria.The first criterion is the energy filter. It compares E(tdmC(x)) to E(mfe(x))in order to ensures that the candidate folds into the family structure with a free energy similar that of its MFE structure. We apply RNAfold to calculate E(mfe(x)); Option –d2 was employed to make sure both programs use the same energy model. Candidate x passes the energy filter if the ratio of E(tdmC(x))/E(mfe(x)) is equal or greater than 0.85.We observed that, once the TDM had indicated the correct sequence boundaries, the MFE structure often resembles the cuckoo family structure. This led to the implementation of a second filter, which tests if the candidate folds also without structural constraints into the cuckoo RNAs characteristic structure(s). Structural similarity is assessed by using pointed shapes. A pointed shape refers to the abstract shape representation of an RNA structure, which is annotated by hairpin centers, e.g., „[35][66][82][110]." A hairpin center depicts the central position of a hairpin loop and is calculated as (i+j)/2, where i and j are the positions of the hairpin closing base pair. Abstract shape analysis is performed by applying RNAshapes. Two pointed shapes p1 and p2 are regarded similar if they share the same shape and if hairpin centers at the same relative order positions do not differ by more than 2.5 nt. Only candidates with similar shapes pass the filter. In some cases, the MFE structure of an cuckoo RNA exhibits a small extra stem-loop between 2 cuckoo modules, which is ignored by the structure filter as this stem-loop typically consists of not more than 3 base pairs. The filtering process returned 150 HP3 and 66 HP4 cuckoo RNAs. Table S3 lists the structure properties of the final cuckoo RNAs.The HP2 motif is statistically the least significant, and the inspection of the remaining 973 HP2 sequences revealed a high number of false positives, in the form of outliers or repeats. Therefore, we built a new TDM, tdmHP2, that is specific to HP2 cuckoo RNAs, only. The parameters used for adapting tdmHP2 were derived from already known HP2 cuckoo RNAs and manually selected candidates that showed a high plausibility based on synteny, distribution pattern, and structural properties. See Figure S2 for details on the grammar of tdmHP2. The additional application of tdmHP2 on the 973 HP2 candidates narrowed down the number of cuckoo RNAs to 105. […]

Pipeline specifications

Software tools RNAfold, RNAMotif, RapidShapes, Locomotif, RNAshapes
Application RNA structure analysis
Organisms Homo sapiens
Diseases RNA Virus Infections