Computational protocol: Bioinformatic Analysis Reveals Conservation of Intrinsic Disorder in the Linker Sequences of Prokaryotic Dual-family Immunophilin Chaperones

Protocol publication

[…] All CFBP sequences were retrieved from NCBI GenBank as described before. Various known CYN and FKBP sequences from human, Drosophila, yeast and A. thaliana, and from known FCBPs and CFBPs, were joined in silico in the order CYN-FKBP and used as queries in BLAST searches of the NCBI protein database. Only hits that contained both CYN and FKBP in the same polypeptide were selected, by visual inspection, because most CFBP organisms also possess individual CYN and FKBP genes, which the search returned as well. The searches were iterated exhaustively, each round retrieving new sequences, until no new sequences were found.

Multiple sequence alignments were performed with Clustal Omega at the EMBL-EBI web server, as described. The output, saved in Newick format, was drawn using Dendroscope 3, an open-source, interactive program for phylogenetic display; the rectangular cladogram format was preferred, as many CFBP organisms are highly related but placed in separate clades. [...]

Since 2010, several dozen programs have been developed for prediction of intrinsic disorder (ID) in protein sequences, highlighting the flurry of research in this area. As disorder prediction is the major focus of this paper, a brief description of the technique is in order. Fundamentally, all predictors are based on the premise that disordered regions should have a higher frequency of hydrophilic and charged residues, and lower sequence complexity. Technically, the current predictors rely on physicochemical properties, on machine learning classifiers, or on a combination of the two. Several methods use a meta-approach that combines the output of multiple predictors, but this often makes them slow to compute. Here, I have chosen PrDOS because it is relatively fast, offers a simple graphical user interface, allows batch analysis of up to 50 sequences, and is a hybrid that uses both template-based and machine-learning-based predictions.
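The premise stated above (more charged and hydrophilic residues, lower sequence complexity) can be illustrated with a small sliding-window calculation. This is only a toy sketch of that premise, not the algorithm used by PrDOS or any other predictor; the residue classes, the window size, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter

# Illustrative residue classes (an assumption, not taken from any predictor).
CHARGED = set("DEKR")
HYDROPHILIC = set("DEKRHNQST")


def window_features(seq, start, size):
    """Return charged fraction, hydrophilic fraction and Shannon entropy
    (in bits) for the window seq[start:start + size]."""
    window = seq[start:start + size].upper()
    n = len(window)
    if n == 0:
        raise ValueError("empty window")
    counts = Counter(window)
    charged = sum(counts[a] for a in CHARGED) / n
    hydrophilic = sum(counts[a] for a in HYDROPHILIC) / n
    # Low entropy means few distinct residues, i.e. low sequence complexity.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {"charged": charged, "hydrophilic": hydrophilic, "entropy": entropy}


def scan(seq, size=15):
    """Compute the features for every full-length window along the sequence."""
    return [window_features(seq, i, size) for i in range(len(seq) - size + 1)]
```

A window with a high charged fraction and a low entropy would be a candidate disordered region under this premise; real predictors combine many more features and are trained on experimentally annotated data.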
PrDOS is also relatively unbiased, neither favoring nor disfavoring any features or motifs such as disulfide bonds or metal-binding regions. For each sequence, a scoring matrix is generated after two rounds of PSI-BLAST search of sequence databases. The profiles are then used for a template-based search for a homolog with known disorder status in the PDB. For sequences that do not have a homolog, a support vector machine (SVM) algorithm is used to obtain the position-specific scoring matrix. PrDOS allows interactive, user-selected false-positive rates (FPR) that range from 1 to 25%. The FPR determines the threshold above which a disorder prediction is considered increasingly reliable. The FPR was optimized against several sequences with known ID regions (e.g., in p53) and known 3D structures (e.g., CYN and FKBP proteins), confirming that PrDOS correctly reported both the presence and the absence of ID regions. An FPR of 8% was then selected as optimal for routine analysis, which translated into a “disorder probability” (DP) threshold of 0.43. In analyzing a CFBP, this threshold was used as the baseline. Nevertheless, to rule out any computational bias, a single CFBP sequence was subjected to disorder prediction by multiple programs, and the PrDOS results agreed with essentially all of them, including several meta-predictors (data not shown), such as MetaDisorder, which compares nearly two dozen different methods, and PONDR-FIT, which combines five. For routine analysis, the PrDOS results were downloaded as CSV (comma-separated values) files and then imported into Excel for further analysis and graphing.

For ab initio structure prediction, the Rosetta software was used via the Robetta server. [...]

Amino acid composition of the linkers was determined with Composition Profiler, a web-based tool that automates detection of enrichment or depletion patterns of amino acids classified by user-chosen properties.
For the linker analysis, “disorder propensity” was chosen as the property, and all 277 CFBP linkers were analyzed collectively as the “query sample”, with the rest of each CFBP sequence (i.e., CYN + FKBP) as the “background sample”. […]
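As an alternative to the spreadsheet step, the DP threshold of 0.43 described above could be applied to per-residue scores programmatically. The following is a minimal sketch under stated assumptions: the exact layout of the PrDOS CSV file is not given in the text, so the column name `disorder_probability` is hypothetical, as are the function names and the `min_len` parameter.

```python
import csv

DP_THRESHOLD = 0.43  # threshold reported in the text for an FPR of 8%


def read_scores(handle, column="disorder_probability"):
    """Read per-residue scores from a CSV file object.
    The column name is a hypothetical placeholder; adjust it to the real file."""
    return [float(row[column]) for row in csv.DictReader(handle)]


def disordered_segments(scores, threshold=DP_THRESHOLD, min_len=1):
    """Return 1-based, inclusive (start, end) runs of residues whose score
    is above the threshold, keeping only runs of at least min_len residues."""
    segments = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = pos          # a new disordered run begins here
        elif start is not None:
            if pos - start >= min_len:
                segments.append((start, pos - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_len:
        segments.append((start, len(scores)))
    return segments
```

For example, `disordered_segments([0.1, 0.5, 0.6, 0.2, 0.9])` returns `[(2, 3), (5, 5)]`; raising `min_len` to 2 would drop the single-residue run at position 5.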
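The query-versus-background comparison can be pictured as a per-residue relative difference in composition. The sketch below is an illustration only: it does not reproduce Composition Profiler's property-based grouping or its statistical testing, the formula (query minus background, divided by background) is an assumption about how enrichment is commonly expressed, and the function names and example sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences):
    """Pooled fractional composition of the 20 standard amino acids."""
    counts = Counter()
    for seq in sequences:
        counts.update(a for a in seq.upper() if a in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {a: counts[a] / total for a in AMINO_ACIDS}


def relative_enrichment(query_seqs, background_seqs):
    """(query - background) / background for each amino acid.
    Positive values indicate enrichment in the query, negative values
    depletion; None marks residues absent from the background."""
    q = composition(query_seqs)
    b = composition(background_seqs)
    return {
        a: (q[a] - b[a]) / b[a] if b[a] > 0 else None
        for a in AMINO_ACIDS
    }
```

Here the linker sequences would be passed as `query_seqs` and the CYN + FKBP portions as `background_seqs`, mirroring the "query sample" and "background sample" roles described above.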

Pipeline specifications