Computational protocol: Intrinsic Structural Disorder Confers Cellular Viability on Oncogenic Fusion Proteins

Protocol publication

[…] Human chromosomal translocations and the relevant genes/proteins were collected from Swissprot, NCBI's GenBank and TICdb by searching the annotations for key expressions such as “chromosomal translocation”, “chromosome translocation” or “fusion protein”. Breakpoints and gene names were recorded when this information was available in the databank entries. We focused on protein-coding entries: whenever the annotation of a GenBank fusion entry contained a partial peptide sequence, we used NCBI's Tblastn to compare the nucleotide sequence of the GenBank entry to the non-redundant set of all human proteins. The two most closely matching proteins (with percentage identity >95%), matching either end of the GenBank entry, were picked. This procedure resulted in 739 highly redundant protein identifiers. After comparing these proteins to one another with Blastp and replacing 42 gene names with their synonyms, the 739 proteins were found to belong to 305 different genes. We culled 101 more genes from the latest version of TICdb, resulting in a total of 406 translocation-related genes (Supplementary material).

We next recorded the breakpoints given in the protein annotations, or assigned them whenever the translocation proteins could be reconstructed from chimeric nucleic acid sequences in GenBank; at least one breakpoint within the coding region could be identified in 146 genes. With the transcript information in TICdb and the corresponding proteins in the Ensembl database, the final number of translocation partner proteins increased to 255. Fusion proteins were next reconstructed from the chimeric nucleic acid sequences in GenBank by running Blastx against NCBI's non-redundant protein database, and also from the coordinates provided by TICdb, which resulted in the reconstruction of 187 fusion proteins.
These correspond to 171 non-redundant fusion sequences at a sequence identity threshold of 90%.

As controls, the sequences of experimentally verified IDPs were downloaded from the DisProt database (http://www.disprot.org/), and fully ordered proteins were obtained from the PDB (http://www.rcsb.org/). A set of all human protein-coding genes and their transcript variants was obtained from the Ensembl website (http://www.ensembl.org). As of March 5, 2008, there were 22,297 protein-coding genes in the human dataset. As reference proteins, we used the longest transcript of each human protein-coding gene. A set of human proteins (15,945 in total) was also obtained from the Swissprot databank (http://expasy.org/sprot/). The Pfam domain database was downloaded from the Pfam website (http://pfam.sanger.ac.uk/). [...]

Intrinsic disorder was predicted with the IUPred algorithm, which can predict disorder with a sensitivity of 76% at a 5% false positive rate. Average percentage disorder was defined as the percentage of amino acids with a disorder score ≥0.5.

Fusion proteins were analyzed for the occurrence of Pfam domains by running Blastp against the entire set of Pfam-A domain sequences and also against the human subset of Pfam domains derived from Swissprot proteins. We set the thresholds for a domain match at an e-value <1e-06 and a sequence similarity >60%. We found that at this similarity level there was less than 1% difference between the two sets of domain matches, so we analyzed all 18,609 human Swissprot proteins for the occurrence of Pfam domains using only the Swissprot-derived human subset of Pfam. (Ideally, when choosing the best non-overlapping domain matches, we would find only identity matches. Of the 29,848 non-overlapping Pfam domains in the 14,541 human Swissprot proteins with at least one domain match, only 1275 had less than 100% sequence similarity.)
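
The disorder measure defined above (the percentage of residues with an IUPred score ≥0.5) can be sketched in a few lines. This is a minimal Python illustration, not the authors' in-house Perl code. The function names are hypothetical, and the parser assumes that comment lines start with `#` and that the per-residue score is the last whitespace-separated column of each data line. Averaging the per-protein percentages over a dataset is also an assumption; the source does not say whether residues were pooled instead.

```python
from statistics import mean


def parse_iupred_scores(text):
    """Extract per-residue disorder scores from IUPred-style text output.

    Assumption: comment lines start with '#', and every data line is
    whitespace-separated with the disorder score in the last column.
    """
    scores = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        scores.append(float(line.split()[-1]))
    return scores


def percent_disorder(scores, threshold=0.5):
    """Percentage of residues whose score is >= threshold (0.5 in the text)."""
    if not scores:
        raise ValueError("no disorder scores given")
    disordered = sum(1 for s in scores if s >= threshold)
    return 100.0 * disordered / len(scores)


def average_percent_disorder(per_protein_scores, threshold=0.5):
    """Mean of the per-protein percentages (assumed averaging scheme)."""
    return mean([percent_disorder(s, threshold) for s in per_protein_scores])
```

For example, `percent_disorder([0.1, 0.5, 0.9, 0.4])` counts the two residues at or above 0.5 out of four. The threshold is kept as a parameter so that the sensitivity of a result to the 0.5 cutoff can be checked.
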
We further analyzed the translocation proteins with in-house Perl scripts: (i) the statistical significance of the difference between any two distributions was evaluated with the chi-square test; (ii) p-values corresponding to the calculated chi-square values and degrees of freedom were calculated with a computer program provided courtesy of Zsuzsa Dosztanyi; (iii) percentage values for the disorder, length, etc. distributions were also calculated with our own Perl scripts. [...]

Actual values of the accessible nonpolar surface area (Anp) of truncated domains were determined as follows: the C-terminus of the protein structure was gradually truncated, and the actual Anp of each truncated fragment was determined with the CHASA program as suggested by [...]. The theoretical values were calculated using the formula of Chothia and Janin. The truncated domain structures were drawn and annotated using NCBI's Cn3D program. The high-resolution images in [...] and [...] were created using the Polyview-3D server. […]
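
Steps (i) and (ii) above can be illustrated with a small, self-contained sketch. The source does not describe how the distributions were binned or which form of the chi-square test was used, so the code below assumes a 2 x k test of homogeneity on binned counts. It is written in Python rather than the authors' Perl, and all function names are hypothetical. The p-value is obtained from the standard recurrence for the regularized upper incomplete gamma function, using only the standard library.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for a chi-square variable.

    Uses Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1),
    starting from Q(1, x) = exp(-x) or Q(1/2, x) = erfc(sqrt(x)).
    """
    if stat < 0 or df < 1:
        raise ValueError("stat must be >= 0 and df must be >= 1")
    x = stat / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    while a < df / 2.0:
        if x > 0.0:
            q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


def chi_square_two_distributions(counts_a, counts_b):
    """Chi-square test of homogeneity for two binned distributions.

    Returns (statistic, degrees_of_freedom, p_value). Bins that are empty
    in both samples are dropped (an assumption of this sketch).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("both samples must use the same bins")
    pairs = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    total_a = sum(a for a, _ in pairs)
    total_b = sum(b for _, b in pairs)
    if len(pairs) < 2 or total_a == 0 or total_b == 0:
        raise ValueError("need two non-empty samples and at least two bins")
    grand = total_a + total_b
    stat = 0.0
    for a, b in pairs:
        column = a + b
        expected_a = total_a * column / grand
        expected_b = total_b * column / grand
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    df = len(pairs) - 1
    return stat, df, chi2_sf(stat, df)
```

A call such as `chi_square_two_distributions(fusion_bins, control_bins)`, where both arguments are hypothetical lists of counts over the same disorder or length bins, returns the statistic, the degrees of freedom and the p-value in one step. Bins with small expected counts make the chi-square approximation unreliable, so merging sparse bins before testing is advisable.
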

Pipeline specifications

Software tools: TBLASTN, BLASTP, BLASTX, IUPred, Cn3D, POLYVIEW-3D
Databases: Pfam, TICdb, DisProt, ExPASy
Applications: Drug design, Protein structure analysis, Amino acid sequence alignment
Organisms: Homo sapiens
Diseases: Abetalipoproteinemia, Neoplasms, Mitochondrial Diseases