Computational protocol: Benchmarks for flexible and rigid transcription factor-DNA docking

[…] Structure alignment is carried out with TM-align []. The TM-align algorithm uses the TM-score instead of the commonly used RMSD (root mean square deviation) for alignment optimization. The TM-score is more sensitive to global structure topology than to local structure changes [,]. The RMSD between two TF chains (RMSDc) or two TF units (RMSDu) is calculated over the alpha carbons of the amino acids that are aligned by the global sequence alignment program NEEDLE in the EMBOSS package []. The TF-DNA interface, or buried surface area (BSA), of a TF-DNA binding unit is calculated as half the difference between the summed solvent accessible surface areas (ASA) of the separate TF and DNA structures and the ASA of the TF-DNA complex, i.e. $\mathrm{BSA} = 0.5 \times (\mathrm{ASA}_{\mathrm{TF}} + \mathrm{ASA}_{\mathrm{DNA}} - \mathrm{ASA}_{\mathrm{TF\text{-}DNA}})$. The solvent accessible surface areas are measured with POPS using default parameters []. The number of residue-base contacts (NRBC) is defined as the number of residues whose sidechains are in contact with a DNA base, using a heavy atom-heavy atom distance cutoff of 4.5 Å.

To investigate the interaction characteristics of different types of DNA binding proteins, we compiled three non-redundant datasets, TF, RE, and NS, for transcription factors, type II restriction endonucleases, and non-specific DNA binding proteins, respectively. All the complex structures were solved by X-ray crystallography at resolutions of 3 Å or better. The assignment of each complex to one of the three groups is based on the classifications in PDB [] and on a literature search. Redundant entries in each set were removed using PISCES with a sequence identity cutoff of 30% []. The protein chains in each set (RE: 24, TF: 84, NS: 43) are shown in Additional file , Table S1.

We compared the distributions of NRBC and protein-DNA contact area among the RE, TF, and NS groups.
Figure shows that restriction endonucleases have more residue-base contacts (Figure ) and larger protein-DNA interfaces (Figure ) than transcription factors. While the median of the NS interface distribution falls between the medians of TF and RE (Figure ), the median of the NRBC distribution in NS is the lowest among the three groups (Figure ), suggesting a small ratio of base contacts to backbone contacts for proteins in the NS group. Figure shows, for each residue type except glycine (no sidechain contact), the percentage of interactions made with bases or with the backbone only in the three datasets. Not surprisingly, the NS group has a significantly lower percentage of base contacts than the RE and TF groups. Large differences between the RE and TF protein groups are also observed for about half of the residue types, namely alanine (A), aspartate (D), cysteine (C), glutamate (E), leucine (L), methionine (M), serine (S), tryptophan (W) and valine (V) (Figure ). These data provide further justification for the construction of TF-specific docking benchmarks. […]
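
The two interface measures defined above, BSA and NRBC, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the atom tuple layout, the atom-name sets (modern PDB naming is assumed), and the choice of an inclusive cutoff are all assumptions made for illustration. The ASA values are assumed to have been computed beforehand by an external tool such as POPS.

```python
from math import dist

# Assumed atom-name sets (PDB v3 naming); the source does not list them.
PROTEIN_BACKBONE = {"N", "CA", "C", "O", "OXT"}
DNA_BACKBONE = {"P", "OP1", "OP2", "OP3", "O5'", "C5'", "C4'",
                "O4'", "C3'", "O3'", "C2'", "C1'"}


def buried_surface_area(asa_tf, asa_dna, asa_complex):
    """BSA = 0.5 * (ASA_TF + ASA_DNA - ASA_TF-DNA), all in the same units."""
    return 0.5 * (asa_tf + asa_dna - asa_complex)


def count_residue_base_contacts(protein_atoms, dna_atoms, cutoff=4.5):
    """Count residues with a sidechain heavy atom within `cutoff` of a base heavy atom.

    Each atom is a hypothetical tuple: (residue_id, atom_name, element, (x, y, z)).
    """
    # Keep only heavy atoms that belong to the base, not the sugar-phosphate backbone.
    base_coords = [xyz for _, name, element, xyz in dna_atoms
                   if element != "H" and name not in DNA_BACKBONE]
    contacting = set()
    for res_id, name, element, xyz in protein_atoms:
        # Skip hydrogens, backbone atoms, and residues already counted.
        if element == "H" or name in PROTEIN_BACKBONE or res_id in contacting:
            continue
        if any(dist(xyz, base_xyz) <= cutoff for base_xyz in base_coords):
            contacting.add(res_id)
    return len(contacting)


# Toy input with made-up coordinates and ASA values, for illustration only.
toy_protein = [
    (("A", 10), "CA", "C", (0.0, 0.0, 0.0)),   # backbone atom: ignored
    (("A", 10), "NH1", "N", (3.0, 0.0, 0.0)),  # sidechain atom near the base
    (("A", 11), "O", "O", (6.0, 0.0, 0.0)),    # backbone atom: ignored
]
toy_dna = [
    (("B", 5), "N7", "N", (6.0, 0.0, 0.0)),    # base atom
    (("B", 5), "OP1", "O", (0.5, 0.0, 0.0)),   # phosphate oxygen: not a base atom
]
print(count_residue_base_contacts(toy_protein, toy_dna))
print(buried_surface_area(5000.0, 3000.0, 6800.0))
```

Counting residues rather than atom pairs follows the NRBC definition in the text: a residue that touches several bases still contributes one contact. Glycine never contributes, since it has no sidechain heavy atoms.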

Pipeline specifications