Computational protocol: Intrinsic protein disorder in histone lysine methylation

Protocol publication

[…] The human histone lysine methyltransferase (HKMT) dataset was taken from a 2013 article on SET domain-containing histone methyltransferases []. DOT1L was added manually as the only HKMT lacking a SET domain. UniProt accession numbers and protein sequence lengths were collected from the UniProt database.

Histone-modifying enzymes were collected from the UniProt database by searching with the enzyme names and the appropriate GO annotations: ‘histone-lysine N-methylase activity’, ‘histone-arginine N-methylase activity’, ‘histone acetyltransferase activity’, ‘histone demethylase activity’ and ‘histone deacetylase activity’. Entries with ‘uncertain’ protein existence and fragment sequences were left out. Of the different variants of the same protein, only the longest was used for analysis. The human dataset consisted of 34 HKMTs, 8 histone arginine methyltransferases (HRMTs), 29 histone acetyltransferases (HATs), 22 histone demethylases and 18 histone deacetylases. The datasets used for evolutionary analysis contained 2230 HKMTs, 374 HRMTs, 4444 HATs, 539 histone demethylases and 2038 histone deacetylases. Evolutionary groups were formed and named as follows: Bacteria, Archaea, Eukaryotic unicellular (eukaryotic species that are not plants, metazoans or fungi), Fungi, Metazoa1 (metazoans except for protostomes and deuterostomes, e.g. sponges, cnidarians), Metazoa2 (protostomes except for Ecdysozoa, e.g. flatworms, annelids, molluscs), Metazoa3 (Ecdysozoa), Metazoa4 (deuterostomes except for vertebrates) and Metazoa5 (vertebrates). Only reference proteomes were used for this analysis.

Structural disorder was predicted with the IUPred long disorder predictor []. The overall disorder rate was computed as the fraction of residues with an IUPred score of at least 0.5. To evaluate the IUPred long disorder prediction, we compared its scores to the results of nine other disorder predictors available in MobiDB [].
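The overall disorder rate defined above can be sketched in a few lines of Python. This is only an illustration: it assumes the per-residue IUPred long scores have already been computed and loaded as a list of floats, and the function name `disorder_rate` and the toy scores are hypothetical, not part of the original pipeline.

```python
from typing import Sequence

DISORDER_CUTOFF = 0.5  # score threshold stated in the protocol


def disorder_rate(scores: Sequence[float], cutoff: float = DISORDER_CUTOFF) -> float:
    """Return the fraction of residues whose score is at least `cutoff`."""
    if not scores:
        raise ValueError("score list is empty")
    disordered = sum(1 for score in scores if score >= cutoff)
    return disordered / len(scores)


# Hypothetical scores for a 10-residue toy sequence (not real IUPred output).
toy_scores = [0.12, 0.48, 0.50, 0.73, 0.91, 0.66, 0.35, 0.52, 0.80, 0.49]
print(disorder_rate(toy_scores))  # 6 of 10 residues reach the cutoff -> 0.6
```

Note that a score of exactly 0.5 counts as disordered here, following the "at least 0.5" wording.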
IUPred gives the same per-residue classification as the consensus in >90 % of the sequences (average IUPred disagreement: 9.4 ± 4.7 %, standard error).

We searched for human linear motifs in the disordered HKMTs (disorder rate higher than 50 %) in the ELM database [], and collected only the motifs annotated from the literature and the hits with an e-value < 0.0001, in both cases requiring nuclear localization. To check the significance of the frequency of ELM hits in HKMTs, scrambled sequences were constructed with a Perl script by shuffling the amino acid residues of the above-mentioned HKMTs that had an IUPred score of at least 0.5. Twenty constructs of 10,000 residues each were generated for a 10x sequence coverage.

PDB structures were searched manually, while SCOP domains were assigned with the help of annotations in the D2P2 database []. Known binding regions of HKMTs were mined from the literature by reading the evidence references of the interaction hits found in the BioGRID database []. Cancer-related single nucleotide polymorphisms in the long conserved intrinsically disordered regions (IDRs) were collected from the BioMuta v2.0 [] and COSMIC [] databases. Long conserved disordered binding regions were calculated in two steps. First, we predicted longer (min. 8 residues) disordered binding regions with ANCHOR []. Next, we took the intersection of these regions with the Scorecons [] conservation output (with default valdar01 scoring), defining “constrained” regions as those with a value of at least 0.9, based on a multiple alignment of 22–24 vertebrate orthologs. The multiple sequence alignment was generated in UniProt by selecting the “canonical” sequences from the vertebrate organisms, ignoring fragments, using BLAST with default parameters and Clustal Omega for the alignment (Gonnet transition matrix, gap opening penalty 6 bits, gap extension 1 bit).
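The two-step calculation of conserved disordered binding regions described above can be sketched as follows. This is a hedged illustration rather than the authors' script: it assumes the ANCHOR regions are already available as 1-based inclusive intervals and that the Scorecons scores have been mapped onto the residues of the reference sequence. The names `conserved_binding_regions`, `toy_regions` and `toy_conservation` are hypothetical. The protocol does not say whether a minimum length was required after the intersection, so none is applied here.

```python
from typing import List, Sequence, Tuple

Region = Tuple[int, int]  # 1-based, inclusive (start, end)

MIN_ANCHOR_LENGTH = 8      # "min. 8 residues" in the protocol
CONSERVATION_CUTOFF = 0.9  # Scorecons value defining "constrained" positions


def conserved_binding_regions(
    anchor_regions: Sequence[Region],
    conservation: Sequence[float],
    min_length: int = MIN_ANCHOR_LENGTH,
    cutoff: float = CONSERVATION_CUTOFF,
) -> List[Region]:
    """Intersect long ANCHOR regions with conserved positions.

    conservation[i] is the score of residue i + 1 of the reference sequence.
    Returns the maximal runs of conserved residues inside each kept region.
    """
    result: List[Region] = []
    for start, end in anchor_regions:
        if end - start + 1 < min_length:
            continue  # step 1: keep only the longer binding regions
        run_start = None
        for pos in range(start, end + 1):  # step 2: intersect with conservation
            conserved = conservation[pos - 1] >= cutoff
            if conserved and run_start is None:
                run_start = pos
            elif not conserved and run_start is not None:
                result.append((run_start, pos - 1))
                run_start = None
        if run_start is not None:
            result.append((run_start, end))
    return result


# Toy input: 20 residues, one long and one short hypothetical ANCHOR region.
toy_conservation = [0.95, 0.95, 0.92, 0.91, 0.40, 0.93, 0.97, 0.99, 0.90, 0.30] + [0.95] * 10
toy_regions = [(2, 10), (15, 18)]
print(conserved_binding_regions(toy_regions, toy_conservation))  # [(2, 4), (6, 9)]
```

The short region (15, 18) is discarded before the intersection, even though all of its residues are conserved, because it does not reach the 8-residue minimum.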
Each vertebrate multiple alignment file contained a broad range of species, from primates to the earliest diverged fishes.

Sequence conservation and disorder conservation were calculated with DisCons [] from the alignments, using default parameters (IUPred long, Jensen-Shannon divergence, window size of 3). The input alignment was the same vertebrate alignment that was used for Scorecons [].

To determine whether two normally distributed sets of data were significantly different from each other, or whether observed values were significantly different from a given mean, we performed two-sample and one-sample t-tests, respectively, using a statistical significance threshold of 0.05 to reject the null hypothesis.

For the Discrete Molecular Dynamics (DMD) simulations the following input sequences were generated: i) for the free MLL1 N-terminus, amino acids 1–200 of MLL1 (UniProt: Q03164) were used to generate an extended structure in PyMOL (The PyMOL Molecular Graphics System, Schrödinger, LLC); ii) for menin and LEDGF/p75, the sequences provided in PDB entry 3U88 were used, while for MLL1, the disordered regions removed from the construct were reinserted into the sequence and the purification tag was removed; iii) for the ternary complex supplemented with the disordered binding loop, PDB entry 2MSR was used. The structures were energy minimized by the DMD [] protocol of Chiron []. Briefly, a short simulation (1,000 time units/steps) using a high heat exchange factor (HEX = 10) at a high temperature (0.7 temperature units) was performed, followed by a short simulation with a low heat exchange factor (HEX = 0.1) at a low temperature (0.5 temperature units). Cα and Cβ atoms were restrained. In all DMD simulations, including those combined with replica exchange, a united-atom representation is used to model proteins, in which all heavy atoms and polar hydrogen atoms of each amino acid are included [, ].
The solvent is implicitly modeled employing the Lazaridis-Karplus solvation model []. Long-range electrostatic interactions are also implemented []. The πDMD software employed for the simulations was kindly provided by Molecules in Action, LLC. Replica exchange DMD (RX-DMD) simulations [] were performed with 8 replicas at temperatures of 0.5497, 0.5624, 0.5753, 0.5886, 0.6022, 0.6161, 0.6303 and 0.6448 temperature units, for 4,000,000 time units. One frame (conformation) was generated every 200 time units. Andersen’s thermostat was used and the heat exchange factor was set to 0.1. At the end of a simulation, the frames from every trajectory were grouped by temperature for analysis. These simulations were run on the HPC of the Institute of Enzymology (RCNS, HAS, Hungary, supported by the Momentum Program of HAS).

Ψ and Φ torsion angles were determined by DSSP [] for every structure at every temperature. The occurrence of torsion angles characteristic of α-helices was counted for every amino acid position and divided by the total number of structures (10,000). To see whether the α-helical torsion angles arise at the level of individual amino acids or continuous helices are formed, the helical content of each frame was plotted along the amino acid sequence. All calculations and plotting were done in R []. […]
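The per-residue helical occupancy described above can be sketched as follows. The original calculations were done in R; Python is used here purely for illustration. The sketch assumes the Φ and Ψ angles have already been extracted from the DSSP output for every frame. The angular window taken as "characteristic of α-helices" is not given in the protocol, so the limits in `PHI_RANGE` and `PSI_RANGE` are an assumption, and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed α-helical window in degrees; the protocol does not state the limits used.
PHI_RANGE = (-100.0, -30.0)
PSI_RANGE = (-80.0, -5.0)

Frame = Sequence[Tuple[float, float]]  # one (phi, psi) pair per residue


def is_helical(phi: float, psi: float) -> bool:
    """True if both torsion angles fall inside the assumed helical window."""
    return PHI_RANGE[0] <= phi <= PHI_RANGE[1] and PSI_RANGE[0] <= psi <= PSI_RANGE[1]


def helical_fraction(frames: Sequence[Frame]) -> List[float]:
    """Per-residue count of helical frames divided by the number of frames."""
    if not frames:
        raise ValueError("no frames given")
    n_residues = len(frames[0])
    counts = [0] * n_residues
    for frame in frames:
        if len(frame) != n_residues:
            raise ValueError("all frames must have the same number of residues")
        for i, (phi, psi) in enumerate(frame):
            if is_helical(phi, psi):
                counts[i] += 1
    return [count / len(frames) for count in counts]


# Toy trajectory: 4 frames of a 3-residue peptide (made-up angles).
toy_frames = [
    [(-60.0, -45.0), (-120.0, 130.0), (-65.0, -40.0)],
    [(-62.0, -41.0), (-58.0, -47.0), (60.0, 45.0)],
    [(-150.0, 150.0), (-70.0, -35.0), (-63.0, -43.0)],
    [(-57.0, -47.0), (-140.0, 160.0), (-90.0, 0.0)],
]
print(helical_fraction(toy_frames))  # [0.75, 0.5, 0.5]
```

Keeping the per-frame boolean values from `is_helical` instead of only the summed counts gives the frame-by-frame helical content along the sequence, which is what distinguishes isolated helical residues from continuous helices.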

Pipeline specifications

Software tools: IUPred, Clustal Omega, DisCons, PyMOL
Databases: BioGRID, BioMuta, D2P2
Application: Protein structure analysis
Organisms: Homo sapiens
Diseases: Neoplasms; Sleep Disorders, Intrinsic