Computational protocol: Long-term trends in evolution of indels in protein sequences

Protocol publication

[…] The analysis was performed on a set of protein families with curated alignments from the NCBI Conserved Domain Database (CDD). CDD comprises diverse non-redundant sequences, and its alignments are refined using three-dimensional structures and structure-structure alignments []. CD alignments are block-wise multiple alignments in which block regions are defined as those aligned among all family members. They are constructed to ensure sufficient sequence diversity and taxonomic span while avoiding bias towards sequences that are highly represented in the database, which is important for our analysis. Redundancy is removed by single-linkage clustering, which groups domain sequences with greater than 67% sequence identity, followed by choosing one representative from each preferred taxonomy node within each sequence cluster (the list of preferred taxonomy nodes can be downloaded from CDTree []). We started our analysis with a set of 362 manually curated parent node alignments from CDD version 2.00 [,]. Parent alignments correspond to the top node alignments in the hierarchy of CD families. We excluded CD families consisting of short sequence repeats (e.g., SUSHI repeats) and those containing fewer than 10 sequences. Redundancy between protein domain families was checked using the procedure implemented in the CDART algorithm [], and no more than one domain family from the same domain cluster was retained in the final test set, which yielded 278 domain families. A table listing the 278 test domains with taxonomy assignments and computed regression coefficients is available [].
The domain families in the test set encompass a broad spectrum of functional and taxonomic groups. Protein function was categorized by Gene Ontology (GO) terms []. GO annotations were obtained from GenBank for individual family members and pooled for the whole family.
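The redundancy-removal step described above (single-linkage clustering at greater than 67% identity, then one representative per preferred taxonomy node within each cluster) can be illustrated with a minimal sketch. This is not the CDD or CDTree implementation. The function names, the `identity` dictionary of precomputed pairwise identities, and the `taxon_node` mapping are hypothetical, and the choice of the first sequence encountered as the representative is an assumption, since the text does not say how a representative is picked within a node.

```python
from collections import defaultdict


def single_linkage_clusters(ids, identity, threshold=0.67):
    """Group sequences by single linkage: any pair whose identity exceeds
    the threshold ends up in the same cluster, directly or through a chain.

    ids      -- list of sequence identifiers
    identity -- dict mapping (id_a, id_b) to fractional identity (assumed
                to be precomputed; hypothetical input format)
    """
    parent = {s: s for s in ids}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), ident in identity.items():
        if ident > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for s in ids:
        clusters[find(s)].append(s)
    return list(clusters.values())


def pick_representatives(clusters, taxon_node):
    """Keep one sequence per taxonomy node within each cluster.

    taxon_node -- dict mapping sequence id to its preferred taxonomy node
                  (hypothetical mapping). The first sequence seen for a
                  node is kept; this tie-break is an assumption.
    """
    representatives = []
    for cluster in clusters:
        seen_nodes = set()
        for s in cluster:
            node = taxon_node[s]
            if node not in seen_nodes:
                seen_nodes.add(node)
                representatives.append(s)
    return representatives
```

Because the linkage is single, two sequences that share only 50% identity can still fall into the same cluster if a third sequence is more than 67% identical to each of them, or if they are connected through a longer chain of such pairs.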
The taxonomic information for each CD family was assigned according to the range of organisms in which the family members were represented []. We used a simplified classification of the families into the following three categories: "R" ("Root"; family members are present in at least two kingdoms among eukaryotes, bacteria and archaea and are thus thought to be of ancient origin, dating back at least to the Last Universal Common Ancestor; 182 families); "E" (eukaryote-specific protein families; 85 families); and "B" (bacteria-specific protein families; 11 families). There were no archaea-specific families in our dataset.
Phylogenetic trees were constructed from the aligned block regions (for sequence repeats, only one instance was kept) by the neighbor-joining method [] with the PHYLIP package []. Blocks represent regions where all CDD sequences are aligned, so the resulting trees do not depend on differences in spacer lengths. The neighbor-joining trees were rooted manually using the taxonomy of the represented organisms. If multiple subfamilies were present within a protein domain family, the root was placed on the deepest inter-subfamily branch so as to balance the average length between the root and every external node of each subtree. For about 30% of the trees, an alternative root placement was checked, and the overall results did not change when the alternatively rooted trees were used. The phylogenetic trees are available at the FTP site []. […]
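To make the tree-building step more concrete, the following is a minimal sketch of the standard neighbor-joining procedure. The protocol itself used the PHYLIP package, so this code is illustrative only and is not PHYLIP's implementation. The function name `neighbor_joining`, the nested-dictionary distance input, and the `node0`, `node1`, ... labels for internal nodes are hypothetical, and the pairwise distances are assumed to have been computed beforehand from the aligned block regions.

```python
def neighbor_joining(names, dist):
    """Build an unrooted tree by neighbor joining.

    names -- list of leaf labels
    dist  -- nested dict, dist[a][b] = distance between leaves a and b
             (assumed symmetric and precomputed)
    Returns a list of edges (parent_or_neighbor, child, branch_length).
    """
    nodes = list(names)
    d = {a: dict(dist[a]) for a in names}  # working copy of the matrix
    edges = []
    next_id = 0

    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other active nodes.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}

        # Pick the pair minimising Q(a, b) = (n - 2) d(a, b) - r(a) - r(b).
        best = None
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best

        # Create the internal node joining a and b, with branch lengths.
        u = f"node{next_id}"
        next_id += 1
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        edges.append((u, a, len_a))
        edges.append((u, b, len_b))

        # Distances from the new node to every remaining node.
        d[u] = {}
        for c in nodes:
            if c in (a, b):
                continue
            d_uc = 0.5 * (d[a][c] + d[b][c] - d[a][b])
            d[u][c] = d_uc
            d[c][u] = d_uc

        nodes = [c for c in nodes if c not in (a, b)] + [u]

    # Connect the last two nodes with the remaining distance.
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges
```

The result is unrooted, which is consistent with the text: root placement was a separate, manual step guided by taxonomy and is not attempted here.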

Pipeline specifications

Software tools: CDTree, CDART, PHYLIP
Databases: CDD
Applications: Phylogenetics, Protein structure analysis