Computational protocol: Validation of Structures in the Protein Data Bank

Protocol publication

[…] Official wwPDB validation reports provide both overall quality scores for a PDB submission and detailed lists of specific issues. Above-average global scores can sometimes mask local issues; hence it is important to review the entire report, especially during structure refinement.

The reports are provided as human-readable PDF files and as machine-readable XML files, and are made available with the public release of the corresponding PDB entry. The machine-readable files contain all of the detailed validation information and statistics. For example, the validation XML file specifies for each protein residue any outlying bond length or bond angle, the residue's rotameric state, its region in a Ramachandran plot, any atoms involved in too-close contacts, and (for X-ray structures) the fit to electron density. These XML files can be read and interpreted by popular visualization software packages, such as Coot, to display validation information for any publicly available PDB entry.

Herein, we describe the format and content of the PDF files, which are the more commonly accessed validation report files. A full description of the report content is available online from the wwPDB. The PDF validation reports are available in two formats: a summary, in which a maximum of five outliers are presented for each metric, and a complete report, in which all outliers are enumerated. […] The PDF reports are organized as follows.
The title page displays the wwPDB logo (and also the EMDataBank logo for EM entries), specifies the type of the report (whether it is preliminary, confidential, or produced for a publicly available PDB entry), shows basic administrative information about the uploaded data or the PDB entry, lists the software packages and versions that were used to produce the report, and provides a URL to access help text.

The executive summary (“Overall quality at a glance”) shows key information about the entry, such as the experimental technique employed to determine the structure, a proxy measure of information content of the analyzed data (resolution for crystal and 3DEM structures and completeness of resonance assignments for NMR), and a number of percentile scores (“sliders”) comparing the validated structure to the entire PDB archive. The key criteria reported in this section include knowledge-based geometric validation scores. For crystal structures, the fit to experimental data is summarized by an overall measure (Rfree factor) and by the fraction of residues that locally do not fit the electron density well (normalized real-space R value, RSRZ). These criteria were selected because they are not typically optimized directly during structure refinement (unlike, e.g., the conventional R value, bond lengths, and bond angles). Ideally, a high-quality structure will score well across the board. Good values for only one of the metrics (e.g., a perfect fit to electron density) with poor scores for others (e.g., many Ramachandran outliers) could be a sign of a biased model building/refinement protocol (e.g., overfitting to experimental data). For each metric, two percentile ranks are calculated: an absolute rank with respect to the entire PDB archive and a relative rank.
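The idea of ranking one score against two different reference sets can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions: the text does not give the exact percentile formula used in the reports, so the definition below (the percentage of reference entries that score no better than the entry) is an assumption, as are the function name `percentile_rank`, the `lower_is_better` flag, and the toy score lists.

```python
from typing import Sequence


def percentile_rank(score: float, reference: Sequence[float],
                    lower_is_better: bool = True) -> float:
    """Percentage of reference entries that are no better than `score`.

    Assumed definition for illustration only; the reports may use a
    different convention (e.g., for ties).
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    if lower_is_better:
        no_better = sum(1 for r in reference if r >= score)
    else:
        no_better = sum(1 for r in reference if r <= score)
    return 100.0 * no_better / len(reference)


# Hypothetical scores for a metric where lower values are better.
whole_archive = [2, 5, 8, 12, 20, 35, 50, 60, 80, 100]
comparison_subset = [10, 12, 15, 30]  # e.g., entries judged comparable

absolute = percentile_rank(12, whole_archive)
relative = percentile_rank(12, comparison_subset)
print(f"absolute: {absolute:.1f}, relative: {relative:.1f}")
```

The same score can therefore receive different absolute and relative ranks, depending on how the comparison subset is chosen.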
For crystallographic structures, the relative rank is calculated with respect to structures of similar resolution (at least 1,000 structures), while structures derived from NMR or 3DEM are compared against all other NMR or 3DEM structures, respectively. Absolute percentile scores are useful to general users of the PDB to evaluate whether a given PDB entry is suitable for their purposes, while the relative percentiles provide depositors, editors, reviewers, and expert users with a means to assess structure quality relative to other structures derived in a similar manner.

[Figure 1]

The percentile ranks are followed by a graphical summary of chain quality. Each standard polypeptide and polynucleotide residue is checked against ideal bond and angle geometry, torsion-angle statistics, and contact distances. Residues are then color coded based on the results: green if no issues are detected, yellow if there are outliers for one criterion (e.g., unusual bond lengths), orange if there are outliers for two criteria (e.g., unusual bond lengths and too-close contacts), and red for three or more criteria with outliers reported. A horizontal stacked bar plot presents the fraction of residues with each color code for each polypeptide or polynucleotide chain. The fraction of residues present in the experimental sample but not included in the refined atomic model is represented by a gray segment, and the fraction of residues “ill-defined” by the NMR ensemble (see below) is represented by a cyan segment. For X-ray crystal structures, an upper red bar indicates the fraction of residues with a poor fit to the electron density. This is followed by a table listing ligand molecules that show unusual geometry, chirality, and/or fit to the electron density.

The section on overall quality is followed by one on entry composition, which describes each unique molecule present in the entry. For NMR entries, a separate section on ensemble composition is also included.
As most NMR structures are deposited as ensembles of conformers, this section reports on what parts of the entry are deemed to be well-defined or ill-defined and also identifies a medoid representative conformer from the ensemble, i.e., the conformer most similar to all the others.

The section on residue quality highlights residues that exhibit at least one kind of issue, i.e., color coded yellow, orange, or red, as described above. While unusual features (e.g., a residue falling into a disallowed region of the Ramachandran plot) are not unexpected even in high-resolution structures, typically occurring with a frequency of 0.5%, they nevertheless should be inspected, and the sequence plots are intended to help users more easily find residues with validation issues.

The section that presents an overview of the experimental data is specific to each experimental technique. For X-ray crystal structures, the structure factors are analyzed using the Phenix tool Xtriage to identify outliers, assess whether the crystalline sample was twinned, and analyze the level of anisotropy in the data. The R and Rfree values are presented as provided by the depositor and as recalculated by the wwPDB from structure-factor amplitudes and the model. The Rfree value measures how well the atomic model predicts the structure factors for a small subset of the reflections (typically 5%–10%) that were not included in the refinement protocol. It is a useful validation metric showing whether there are sufficient experimental data and restraints compared with the number of adjustable parameters in the model: Rfree values much higher than R could indicate an overfitting to experimental data during refinement. R values provided by the depositor are displayed along with R values recalculated by the DCC tool from the atomic model and structure factors with the same refinement program as was used to refine the atomic model.
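As a rough illustration of the R and Rfree comparison described above, the following Python sketch computes the conventional crystallographic R value, sum(||Fobs| - |Fcalc||) / sum(|Fobs|), separately for the working and the free reflections. It assumes the calculated amplitudes are already on the same scale as the observed ones (real programs fit a scale factor and usually a bulk-solvent model first). The function names and the toy amplitudes are hypothetical and are not taken from DCC or any refinement program.

```python
from typing import Sequence, Tuple


def r_factor(f_obs: Sequence[float], f_calc: Sequence[float]) -> float:
    """Conventional R value for amplitudes assumed to be on a common scale."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty amplitude lists of equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def r_work_and_free(f_obs: Sequence[float], f_calc: Sequence[float],
                    free_flags: Sequence[bool]) -> Tuple[float, float]:
    """Split reflections by the free flag and return (R_work, R_free)."""
    if not (len(f_obs) == len(f_calc) == len(free_flags)):
        raise ValueError("all inputs must have the same length")
    work = [(o, c) for o, c, f in zip(f_obs, f_calc, free_flags) if not f]
    free = [(o, c) for o, c, f in zip(f_obs, f_calc, free_flags) if f]
    r_work = r_factor([o for o, _ in work], [c for _, c in work])
    r_free = r_factor([o for o, _ in free], [c for _, c in free])
    return r_work, r_free


# Toy data: four reflections, one of them set aside as "free".
f_obs = [100.0, 200.0, 300.0, 400.0]
f_calc = [90.0, 210.0, 330.0, 380.0]
free_flags = [False, False, True, False]

r_work, r_free = r_work_and_free(f_obs, f_calc, free_flags)
print(f"R_work = {r_work:.3f}, R_free = {r_free:.3f}")
```

A large gap between the two numbers on real data is the kind of signal the text associates with possible overfitting.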
Good agreement between the depositor R values and those recalculated serves to check whether the data have been uploaded and interpreted correctly within the OneDep system.

For NMR structures, the report contains an overview of the structure determination process and the overall completeness of the resonance assignments. For 3DEM structures, if a volume map is available, basic information describing the experimental setup and the map is included.

The section on model validation provides further details for each criterion covering polypeptides, ribonucleic acids, small molecules, and non-standard polymer residues. The bond lengths and bond angles of amino acid and nucleotide residues are checked by MolProbity's Dangle module against standard reference dictionaries. Close contacts between non-bonded atoms are analyzed using MolProbity. As MolProbity does not deal with close contacts between symmetry-related molecules in the case of crystallographic experiments, these are checked by the in-house software “MAXIT” (Z.F., […]). MolProbity also performs protein-backbone and side-chain torsion-angle analysis (Ramachandran plot and rotameric state) and RNA-backbone and ribose-pucker analysis. For X-ray crystal structures of proteins, cases where 180° flips of histidine rings and glutamine or asparagine side chains improve the hydrogen-bonding network without detriment to the electron density fit are also reported. The MAXIT software is also used to identify and report cis-peptides and stereochemistry issues, such as chirality errors and polymer linkage artifacts.

The geometry of all non-standard or modified residues of a polymer, small-molecule ligands, and carbohydrate molecules is analyzed with the Mogul software. For each bond length, bond angle, dihedral angle, and ring pucker, Mogul searches through high-quality, small-molecule crystal structures in the Cambridge Structural Database (CSD) to identify similar fragments.
Each bond length, angle, and so forth in the compound is compared against the distribution of values found in comparable fragments in the CSD, and outliers are highlighted. Chirality problems are diagnosed by checking against the wwPDB Chemical Component Dictionary definitions.

The fit of the atomic model to experimental data (currently only available for X-ray crystal structures) is analyzed by the procedure developed for the Uppsala Electron Density Server. Electron density maps are calculated with the REFMAC program using the atomic model and the structure factors. The fit is assessed between an electron density map calculated directly from the model (DFcalc map) and one calculated based on model and experimental data (2mFobs-DFcalc map). The fit is analyzed on a per-residue basis for proteins and polynucleotides, and reported as the real-space R value (RSR). These RSR values are normalized by residue type and resolution band to yield RSRZ. Residues with RSRZ > 2 are reported as outliers. At present, this analysis is not possible for non-standard amino acids/nucleotides or ligands, as these compounds are not present in sufficient numbers in the PDB to generate reliable Z scores. For these, therefore, only the RSR value, real-space correlation coefficient, and the so-called Local Ligand Density Fit score (LLDF) are reported. LLDF for a ligand or non-standard residue is calculated as follows: all standard amino acid or nucleotide residues within 5.0 Å distance of any atom of the ligand or non-standard residue are identified by the CCP4 NCONT program, taking crystallographic symmetry into account. The mean and SD of the RSR values for these neighboring residues are then calculated, and these are used with the RSR value of the ligand or the non-standard residue itself to provide a local, internal Z score.
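The LLDF calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the wwPDB implementation: it assumes the neighboring residues have already been found (the role played by NCONT in the text) and that their RSR values are passed in directly. The use of the sample standard deviation, the handling of a zero SD, and the function name `lldf` are assumptions made for this sketch.

```python
from statistics import mean, stdev
from typing import Optional, Sequence


def lldf(ligand_rsr: float, neighbor_rsr: Sequence[float]) -> Optional[float]:
    """Local Z score of a ligand's RSR relative to its neighboring residues.

    Returns None when no score can be formed: fewer than two neighbors,
    or neighbors with identical RSR values (zero SD; an assumed edge case).
    """
    if len(neighbor_rsr) < 2:
        return None
    spread = stdev(neighbor_rsr)  # sample SD; the text does not specify which SD
    if spread == 0:
        return None
    return (ligand_rsr - mean(neighbor_rsr)) / spread


# Toy values: the ligand fits the density much worse than its surroundings.
score = lldf(0.30, [0.10, 0.12, 0.14])
if score is not None:
    print(f"LLDF = {score:.2f}")
```

Because the score is normalized against the immediate environment rather than the whole archive, the same ligand RSR gives a lower LLDF in a poorly ordered region than in a well-ordered one.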
If fewer than two neighboring residues are within 5.0 Å of the entity, then LLDF cannot be calculated (this occurs for ∼20% of ligands in PDB entries released before 31 December 2016). LLDF values greater than 2 are highlighted in the reports (this occurs for 34% of ligands in PDB entries released before 31 December 2016 for which an LLDF value could be calculated) (O.S.S. et al., unpublished data). The wwPDB partners and the crystallography community are evaluating this and other metrics to reliably assess the fit to electron density for bound ligands, following the recommendations of the wwPDB/CCDC/D3R Ligand Validation Workshop.

For NMR structures, the report contains a section on validation of assigned chemical shifts. Each structure can potentially be linked to more than one list of chemical shifts (e.g., from samples with different experimental conditions or isotope labeling pattern). Therefore, each chemical-shift list is treated independently. For each list, a table summarizing any parsing and mapping issues between the chemical shifts and the model coordinates helps depositors detect and correct data entry errors. For entries containing proteins, the PANAV package is invoked to suggest corrections to chemical-shift referencing. Completeness of resonance assignments per chemical-shift list is calculated for each type of nucleus and location (e.g., backbone, aliphatic or aromatic side chain). Unusual chemical-shift assignments are identified according to the statistics compiled by BMRB. Severe chemical-shift outliers (e.g., >30 SDs from the average value) are frequently the result of spectral “aliasing,” and these need to be corrected to achieve valid data deposition. Finally, for entries containing polypeptides, the amino acid sequence and chemical shift information is used by the RCI software to calculate a random coil index (RCI) for each residue, which estimates how likely the residue is to be disordered.
In a bar-graph representation of RCI for each polypeptide chain, each residue considered to be ill-defined from the analysis of the NMR ensemble of conformers (see above) is colored cyan; this result from analysis of coordinates alone can then be compared with experimental evidence for potential disorder from the RCI. […]
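The screening for severe chemical-shift outliers mentioned above can be illustrated with a small Python sketch. It expresses each assigned shift as a number of SDs from a reference mean and flags values beyond the 30 SD example given in the text. The reference statistics in the code are illustrative placeholders, not actual BMRB values, and the names `shift_z`, `severe_outliers`, and `EXAMPLE_STATS` are hypothetical.

```python
from typing import Dict, List, Tuple

AtomKey = Tuple[str, str]  # (residue name, atom name)

# Illustrative placeholder statistics (mean, SD in ppm); NOT actual BMRB values.
EXAMPLE_STATS: Dict[AtomKey, Tuple[float, float]] = {
    ("ALA", "CA"): (53.0, 2.0),
    ("GLY", "N"): (109.0, 4.0),
}


def shift_z(observed: float, ref_mean: float, ref_sd: float) -> float:
    """Deviation of an observed shift from the reference mean, in SD units."""
    if ref_sd <= 0:
        raise ValueError("reference SD must be positive")
    return (observed - ref_mean) / ref_sd


def severe_outliers(
    assignments: List[Tuple[int, str, str, float]],
    stats: Dict[AtomKey, Tuple[float, float]],
    threshold_sd: float = 30.0,
) -> List[Tuple[int, str, str, float]]:
    """Return (seq_id, residue, atom, z) for shifts beyond threshold_sd SDs.

    Assignments without reference statistics are skipped in this sketch.
    """
    flagged = []
    for seq_id, residue, atom, shift in assignments:
        if (residue, atom) not in stats:
            continue
        ref_mean, ref_sd = stats[(residue, atom)]
        z = shift_z(shift, ref_mean, ref_sd)
        if abs(z) > threshold_sd:
            flagged.append((seq_id, residue, atom, z))
    return flagged


# Toy shift list: the second entry mimics a grossly displaced (e.g., aliased) value.
shifts = [(1, "ALA", "CA", 52.0), (2, "GLY", "N", 309.0)]
for seq_id, residue, atom, z in severe_outliers(shifts, EXAMPLE_STATS):
    print(f"{residue}{seq_id} {atom}: {z:+.1f} SD")
```

In practice the reference means and SDs would come from the BMRB statistics named in the text, keyed by residue and atom type.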

Pipeline specifications

Software tools: Coot, PHENIX, MolProbity, Mogul, CCP4, RCI
Databases: CSD, BMRB, CCD, wwPDB, EMDataBank
Application: Protein structure analysis
Diseases: Osteitis Deformans