Computational protocol: HMMER Cut-off Threshold Tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold

Protocol publication

[…] outlines the HMMERCTTER training procedure and target sequence analysis, which are described in brief below and in detail in . The training sequences are clustered using a user-provided phylogeny. All possible monophyletic clusters are determined, sorted by size and tested as follows. The cluster's sequences are aligned and used to generate a HMMER profile that is subsequently used to screen the cluster's sequences as well as all training sequences. The obtained HMMER scores are compared: 100% precision and recall (P&R) self detection is obtained when the lowest-scoring cluster sequence has a higher score than the highest-scoring non-cluster sequence. Clusters with 100% P&R self detection are provisionally accepted, whereas all other clusters are automatically rejected.

An interface showing score plots of cluster and training sequences as well as a tree with the provisional clustering is presented as shown in , at which point the user can reject or accept the cluster. Upon rejection, the program proceeds with the next cluster on the size-ordered list. Upon acceptance of a cluster, all its nested and overlapping clusters are removed from the list, and the program proceeds with the next cluster in the sorted list until no more clusters are encountered. This yields a number of clusters that show 100% P&R self detection in HMMER profiling as well as, possibly, a number of unclustered orphan sequences.

The HMMER profiles and corresponding cut-off scores form the initial classifiers that are used for screening the target dataset. To distinguish the clustering phase from the classification phase of the pipeline, we use cluster for the result of the clustering phase and group for the result of the classification phase. Target sequences with scores equal to or above the cluster threshold are automatically accepted and added to the cluster, forming a group. We refer to these sequences as prior positives since they were not yet included in the cluster or group when tested.
The sequences of each group are then realigned to construct a new HMMER profile with a new cut-off score in order to obtain higher sensitivity in subsequent HMMER profiling. As such, groups remain 100% P&R provided classification overlap is prevented. When a target sequence becomes classified by more than one group, all involved groups are excluded from subsequent iterations. Conflicting training sequences are removed from all but the original group, whereas conflicting target sequences are removed from all groups and from the target dataset.

This automated classification step terminates upon data convergence, when no novel sequences with a score above the threshold are identified. Up to this point, all accepted sequences were accepted based on a prior inclusion HMMER cut-off threshold, i.e. by a HMMER profile that did not include the sequence(s) to be accepted. However, certain sequences might only be accepted once their information has been included in the profile, i.e. according to a posterior inclusion HMMER cut-off threshold. Hence, in the subsequent classification step, sequences with a score below the threshold are considered. Candidates are included in the group and tested with a novel HMMER profile that includes the candidate. An interactive interface () allows the user to guide this process while 100% P&R self detection remains imposed and classification conflicts remain prohibited as described for the automated phase. The process is terminated by the user, resulting in updated groups and a file that indicates which sequences generated conflicts.

[...] HMMERCTTER performance was compared with PANTHER[], a major phylogenomics platform. PANTHER classification was performed with the complete datasets, i.e. both training and target sequences, since PANTHER uses its own HMMER profile database as training set.
PANTHER reports the subfamily or family HMMER profile against which a sequence scores best.

In the PG case, all except five sequences correspond with four major families (PTHR31736, PTHR31339, PTHR31375 and PTHR31884) with a large number of subfamilies. We report the analyses at the family as well as the subfamily level. For the family analysis we collected all subfamily hits that correspond to the same family. Results are summarized in .

Interestingly, PTHR31736 contains all exoPGs as well as the RhamnoPGs and XyloPGs, even though these do not form a monophyletic cluster. PTHR31339 corresponds to bacterial PGs, PTHR31375 to plant PGs and PTHR31884 to fungal endoPGs. Only four false positives, and as such also false negatives, were identified. In addition, five sequences were assigned to rather different PANTHER families. Hence, 99% classification recall was obtained when subfamilies were grouped into their respective families, a result slightly better than what was obtained by HMMERCTTER in the C7 classification (98%, see ). The latter shows a better correspondence with functional classification.

The PANTHER classification into subfamilies shows a lower performance (compare with ). Both the bacterial and the plant families show a high-quality classification with 94% and 89% classification recall, respectively, and 100% precision. However, sequences were assigned to 15 and 35 subfamilies for the bacterial and plant families, respectively. This seems disproportionately high given the number of known functional subfamilies. Classification of the two fungal families was, however, poor. The endoPG classification () shows only a single subfamily, 31884:SF9, with 100% P&R self detection; all other subfamilies show many errors. ExoPG 31736:SF7 () shows perfect correspondence with PGXC, 31736:SF9 corresponds to the endo-xylogalacturonases and 31736:SF5 corresponds to part of the exo-rhamnogalacturonases.
31736:SF8 contains most but not all sequences from PGXA and PGXB, whereas subfamily 10 and the not further classified sequences contain a mixture of sequences. Hence, although a number of functional subfamilies are well classified, others show poor classification.

In the ACD case PANTHER identified six families (PTHR11527; PTHR15348; PTHR33879; PTHR33981; PTHR34661; PTHR43670), of which three show 100% P&R self detection. PANTHER, however, failed to correctly identify the sHSPs as a family (see ). The family containing all sHSP sequences also contained UAPXII and a number of other non-HSP sequences (see the analysis output in the GitHub repository). At the subfamily level, many sHSP classes were identified correctly, but the C1 HSP consisted of four non-polyphyletic subfamilies and a single false negative. Given the complexity of the classification pattern when plotted on the phylogeny, it is not possible to determine classification recall objectively. However, comparison of the classification patterns ( and ) clearly shows that HMMERCTTER yields a much better classification.

In the PLC case, many subfamilies of a single family (PTHR10336) were identified, which hampers analysis. We first analyzed all subfamilies that contain at least one training sequence, plus the two subfamilies that contain more sequences than the smallest subfamily that contains a training sequence. Then we combined the PTHR subfamilies according to the initial functional classification of B, D, E, G, H, L, Z, Plant and Yeast PLC. Performance is shown in . The PLC-B, PLC-D, PLC-G and PLC-Plant subfamilies appear to be represented by more than one PANTHER subfamily and, clearly, combining these improves classification recall. However, classification recall is still significantly below the classification recall obtained by HMMERCTTER (). In addition, a total of 47 false positives were identified. […]
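The family-level PANTHER evaluation described above (collecting subfamily hits under their family, then scoring against a reference class) can be sketched as follows. This is an illustrative sketch, not the analysis code used for the reported results. It assumes that a best hit is written as a family ID optionally followed by a subfamily suffix (for example `PTHR31736:SF7`), and that precision and recall are computed per class in the usual way. The sequence IDs, the subfamily numbers in `BEST_HITS` and the reference set are invented toy values.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


def family_of(panther_id: str) -> str:
    """Strip an optional subfamily suffix: 'PTHR31736:SF7' -> 'PTHR31736'."""
    return panther_id.split(":", 1)[0]


def group_by_family(best_hits: Dict[str, str]) -> Dict[str, Set[str]]:
    """Collect all sequences whose best hit belongs to the same family."""
    families: Dict[str, Set[str]] = defaultdict(set)
    for seq_id, hit in best_hits.items():
        families[family_of(hit)].add(seq_id)
    return dict(families)


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Per-class precision and recall (assumed standard definitions)."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Toy best hits (sequence ID -> best-scoring PANTHER profile); illustrative only.
BEST_HITS = {
    "s1": "PTHR31339:SF2",
    "s2": "PTHR31339:SF5",
    "s3": "PTHR31375:SF1",
    "s4": "PTHR31339",  # hit at family level, no subfamily assigned
}
# Hypothetical reference class; s5 is a member that was not recovered.
REFERENCE_BACTERIAL = {"s1", "s2", "s4", "s5"}

if __name__ == "__main__":
    families = group_by_family(BEST_HITS)
    print(precision_recall(families["PTHR31339"], REFERENCE_BACTERIAL))
```

Grouping before scoring is what lets several subfamilies count as one class: in the toy data the three PTHR31339 hits are pooled, giving a precision of 1.0 and a recall of 0.75 against the four-member reference set.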

Pipeline specifications

Software tools: HMMER, PANTHER
Applications: Phylogenetics, Protein sequence analysis