Computational protocol: Improving protein structure similarity searches using domain boundaries based on conserved sequence information

Protocol publication

[…] The April 2005 Conserved Domain Database (CDD) and the medium-redundancy Molecular Modeling Database (MMDB) were used for the sequence-to-structure domain comparisons; current versions of both are available online. CDD entries derived from the Clusters of Orthologous Groups database were excluded, as they tend to be full gene products containing multiple sequence and structure domains. Each domain profile from the CDD was compared to the sequences of the protein structures in our subset of MMDB: the sequences of MMDB entries were used as queries in RPS-BLAST against our subset of the CDD, with a 'hit' expectation value threshold of 0.01 and a requirement that at least 90% of a Conserved Domain (CD) sequence be aligned to the query for the hit to be considered for comparison. We also tested the effect of reducing the percentage of CD sequence alignment required for boundary difference comparisons. Because these tests yielded few additional domain identifications at the cost of reduced sequence alignment length, we focused on our most stringent requirement of 90% alignment coverage.
Because the CDD is collected from several source databases, some of its domains are very similar, and sequence domains are therefore curated into domain families. When collecting the set of sequence domains, if multiple sequence domains from the same family aligned to a protein structure, a family representative was chosen by the following criteria: 1) the domain family member with the greatest percentage alignment was chosen, and 2) if more than one domain family member had the same percentage alignment, the member with the shorter overall length was chosen.
The secondary structure element (SSE) compositions of the sequence domains were then compared to the composition of the entire chain on which the sequence domain was identified, as well as to the domains of that chain identified on the basis of compactness.
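
The hit filtering and family-representative rules described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Hit` record, its field names, and `select_representatives` are hypothetical, the hits are assumed to have been parsed from RPS-BLAST output beforehand, and an expectation value exactly equal to the threshold is assumed to pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One RPS-BLAST hit (hypothetical record; field names are assumptions)."""
    query_id: str           # MMDB chain whose sequence was the query
    cd_id: str              # Conserved Domain identifier
    family: str             # curated domain family the CD belongs to
    evalue: float           # expectation value of the hit
    cd_length: int          # full length of the CD sequence
    aligned_cd_length: int  # number of CD positions aligned to the query

    @property
    def coverage(self) -> float:
        """Fraction of the CD sequence aligned to the query."""
        return self.aligned_cd_length / self.cd_length


def select_representatives(hits, max_evalue=0.01, min_coverage=0.90):
    """Keep hits passing both thresholds, one per (query chain, family).

    Within a family, the hit with the greatest CD coverage wins; ties go
    to the CD with the shorter overall length.
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue or hit.coverage < min_coverage:
            continue
        key = (hit.query_id, hit.family)
        current = best.get(key)
        rank = (hit.coverage, -hit.cd_length)
        if current is None or rank > (current.coverage, -current.cd_length):
            best[key] = hit
    return list(best.values())
```

Grouping by `(query_id, family)` is a simplification: it does not handle a family that aligns to several distinct regions of the same chain.
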
An SSE composition metric was used, rather than an amino acid sequence difference requirement, because the VAST algorithm applied later in the study uses SSE alignment to detect protein similarity. By requiring a different SSE composition, we avoided identifying domain differences that result only from the inclusion or removal of unstructured protein regions, which would have no effect on the structural similarity searches employed later in the study. The footprint of a sequence domain was considered different from the structure domains if it met the following criteria: 1) the sequence domain must contain at least 4 secondary structure elements (SSEs), 2) the sequence domain must be at least 4 SSEs shorter than the whole chain, and 3) the sequence domain must be at least 4 SSEs different from the closest structure-based domain. SSEs and structure domains for a given structure were those identified by the MMDB structure domain parser. Criterion 1) simply means that we do not consider very small domains with 3 or fewer SSEs, which corresponds fairly well to having 50 or fewer residues. Criteria 2) and 3) define when we consider a sequence domain to be "different" from a structure-based domain.
Other SSE difference thresholds were also tested. As expected, lowering the number of SSEs required resulted in many more domain differences being identified, while raising it quickly reduced the number of differences found. This testing led us to select the 4 SSE difference requirement as a "middle ground", allowing us to identify a large number of domain differences without selecting all sequence-based domains in the structure database. The choice of 4 SSEs is natural, because this is about the size of a small domain, and it should allow us to clearly see the effect of using different domain boundaries in the structure similarity searches.
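
The three footprint criteria above can be expressed as a small predicate. This is a sketch under stated assumptions: the function name and arguments are hypothetical, each domain is represented as a collection of SSE identifiers, and "at least 4 SSEs different" is interpreted here as the size of the symmetric difference between the two SSE sets, which the text does not specify.

```python
def footprint_differs(seq_sses, chain_sses, closest_struct_sses, min_sse=4):
    """Return True if a sequence domain counts as 'different'.

    seq_sses            -- SSE identifiers inside the sequence domain
    chain_sses          -- SSE identifiers of the whole chain
    closest_struct_sses -- SSE identifiers of the closest structure domain
    min_sse             -- threshold used for all three criteria (4 in the text)
    """
    seq = set(seq_sses)
    chain = set(chain_sses)
    struct = set(closest_struct_sses)

    # 1) The sequence domain contains at least min_sse SSEs.
    large_enough = len(seq) >= min_sse
    # 2) It is at least min_sse SSEs shorter than the whole chain.
    shorter_than_chain = len(chain) - len(seq) >= min_sse
    # 3) It differs from the closest structure domain by at least min_sse
    #    SSEs (assumption: counted as the symmetric difference).
    differs_from_struct = len(seq ^ struct) >= min_sse

    return large_enough and shorter_than_chain and differs_from_struct
```

Exposing `min_sse` as a parameter mirrors the threshold testing described in the text, where requirements other than 4 SSEs were tried.
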
The closest structure domain to a sequence domain was determined by applying the following rules in order: a) the structure domain completely covered by the sequence domain, b) the longest structure domain completely covered by the sequence domain, or c) the structure domain with the longest length covered by the sequence domain. To test the possibility that domain differences were due to unaligned ends of the sequence domains not being included in the regions used as queries, we repeated the method using the July 2005 databases, with the sequence domain boundaries extended to the ends of the complete sequence domains, and applied the same SSE difference requirements described above. The footprint and extended-footprint sets were then compared by size and composition, revealing minimal differences in the domains identified as different. A flowchart of the order of operations can be seen in the corresponding figure of the publication. [...]
The domain entries from MMDB and the sequence domains identified as different were used as queries for structure similarity searches against the medium-redundancy set of MMDB using the Vector Alignment Search Tool (VAST), which is available on the web. VAST is essentially a two-phase process. The first phase aligns vectors of secondary structure and performs preliminary scoring. Initial alignments whose scores exceed an empirically derived threshold are then refined in the second phase of structural alignment, which uses the Cα coordinates. Only refined alignments with a statistical significance of P < 10^-5 are reported as structurally similar. Although VAST is available to the public on the web, our study used an in-house version of the VAST executable to allow the submission of multiple queries and more efficient use of computational resources.
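
The rule for choosing the closest structure domain, given earlier in this passage, can be sketched as below. This is one reading of rules a) to c), not a confirmed implementation: fully covered structure domains are preferred and the longest of them is taken, and only if none is fully covered is the domain with the largest covered length chosen. The function name is hypothetical, and domains are assumed to be given as collections of residue positions.

```python
def closest_structure_domain(seq_residues, structure_domains):
    """Pick the structure domain closest to a sequence domain.

    seq_residues      -- residue positions in the sequence domain footprint
    structure_domains -- mapping of domain name -> residue positions
    Returns the chosen domain name, or None if nothing overlaps.
    """
    seq = set(seq_residues)
    domains = {name: set(pos) for name, pos in structure_domains.items()}

    # Rules a) and b): structure domains completely covered by the
    # sequence domain; if several qualify, take the longest.
    covered = {name: pos for name, pos in domains.items() if pos and pos <= seq}
    if covered:
        return max(covered, key=lambda name: len(covered[name]))

    # Rule c): otherwise take the domain with the greatest covered length.
    overlap = {name: len(pos & seq) for name, pos in domains.items()}
    best = max(overlap, key=overlap.get, default=None)
    if best is None or overlap[best] == 0:
        return None
    return best
```

Ties are resolved by dictionary order here, since the text gives no tie-breaking rule.
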
To evaluate the change in structure similarity search results when using the new sequence-based domains, we considered structurally similar domains classified in the same SCOP 1.69 superfamily as the query domain to be homologs. Since the study explicitly looked for differences in domain boundaries, it was not possible to map both structure and sequence domains directly to corresponding entries in the SCOP database. For example, if a structure domain from MMDB has domain boundaries very similar to those of a SCOP domain, then a sequence domain found to be different from the MMDB domain would also be different from the SCOP domain definition. Thus, in order to measure the ability to identify similar domains, the homolog set for a query domain was defined as the SCOP superfamily members for all SCOP domains identified on the query chain. Although this 'collapsing' of superfamilies on a chain could introduce some false homolog mappings or unrealistically large homolog sets, it allowed for sensitivity and specificity analysis of individual domains in the test set, as well as an overall assessment of the domain-based structure similarity search result sets. In addition, to avoid missing-data issues due to the smaller size of the SCOP database, all domains used as VAST queries and all resulting similar structures were reduced to only those structures included in the 1.69 release of SCOP. Individual search results were also evaluated using SCOP fold classification members to test the possibility that structures previously identified as non-homologs were in fact distant homologs not included in the superfamily classification.
The structure similarity search results for each domain query and each domain type set were then compared based on the homologous and non-homologous structures found, as well as on search result overlap, i.e. hits common to both the sequence and the structure domain similarity search results, regardless of the significance scores of the alignments beyond the statistical significance of P < 10^-5 required for being reported as similar by the VAST algorithm. Individual search results of the new domains were then compared to the results of the original structure domains and visualized using PyMOL and Cn3D. […]
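
The evaluation described above (collapsed homolog sets, restriction to SCOP 1.69 structures, and search result overlap) can be sketched as follows. The function names and input mappings are hypothetical, and sensitivity and specificity use the standard definitions TP/(TP+FN) and TN/(TN+FP), which the text does not spell out.

```python
def homolog_set(query_chain_superfamilies, superfamily_members):
    """Union of the members of every SCOP superfamily found on the query chain."""
    homologs = set()
    for superfamily in query_chain_superfamilies:
        homologs |= set(superfamily_members.get(superfamily, ()))
    return homologs


def evaluate(hits, homologs, scop_structures):
    """Score one search result against its homolog set.

    Hits and homologs are first reduced to structures present in SCOP,
    mirroring the restriction to the 1.69 release described in the text.
    """
    universe = set(scop_structures)
    hits = set(hits) & universe
    homologs = set(homologs) & universe

    tp = len(hits & homologs)             # homologs found
    fp = len(hits - homologs)             # non-homologs found
    fn = len(homologs - hits)             # homologs missed
    tn = len(universe - hits - homologs)  # non-homologs correctly not found

    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "sensitivity": sensitivity, "specificity": specificity}


def result_overlap(sequence_domain_hits, structure_domain_hits):
    """Hits common to the sequence-domain and structure-domain searches."""
    return set(sequence_domain_hits) & set(structure_domain_hits)
```

Running `evaluate` once for the sequence-domain query and once for the matching structure-domain query gives directly comparable counts, and `result_overlap` gives the shared hits.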

Pipeline specifications

Software tools RPS-BLAST, VAST, PyMOL, Cn3D
Databases MMDB
Applications Drug design, Protein structure analysis