Computational protocol: A Differentiation-Based Phylogeny of Cancer Subtypes

Protocol publication

[…] We compare the methodologies implemented in our algorithm for each step of the analysis in order to identify the methods and parameters that perform well on our datasets. We apply our algorithm to all datasets using all combinations of the following methods and parameters: three methods for finding differentially expressed genes: ANOVA, Kruskal-Wallis (KW), and the Welch approximation (Welch); two methods for p-value correction: Benjamini-Hochberg (BH) and Holm; two p-value cutoffs: 0.01 and 0.05; five tree reconstruction and clustering algorithms: Weighted Least Squares (WLS), Minimum Evolution (ME), Neighbor-Joining (NJ), FastME, and Average Linkage (UPGMA); and two distance measures: Pearson correlation and Euclidean distance. The results of these analyses are shown in , , , . The topologies found among the different combinations of parameters show that WLS, Pearson correlation, and BH with a cutoff value of 0.01 perform accurately on the AML (), breast cancer (), and liposarcoma () datasets.

Note that two main assumptions of the UPGMA algorithm are not fulfilled by cancer subtype data, namely that all species originate from a common ancestor and that they have all evolved at the same pace. This explains why the method fails to reconstruct the correct tree topologies; for example, in all sarcoma UPGMA topologies (trees 1 and 4 of ), some liposarcoma subtypes branch together with leiomyosarcoma, which is thought to arise from smooth muscle tissue.

Previous studies have shown that, in general, WLS performs better than NJ when trees have long external or internal branches (e.g. ). Note also that Euclidean distance leads to less robust results than Pearson correlation when trees with long branches are considered.
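One possible intuition for the difference between the two distance measures can be sketched with toy numbers. The sketch below is illustrative only and is not the authors' implementation: the function names, the three expression profiles, and the conversion of a Pearson correlation r into a distance as 1 - r are all assumptions made for the example. It shows that Euclidean distance reacts to the overall magnitude of a profile, whereas a correlation-based distance depends only on its shape.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_distance(x, y):
    """Correlation-based distance, assumed here to be 1 - r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


# Hypothetical mean expression profiles over four genes.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [2.0, 4.0, 6.0, 8.0]  # same pattern as a, twice the amplitude
profile_c = [1.5, 1.0, 3.5, 3.0]  # different pattern, magnitude close to a

for name, other in (("b", profile_b), ("c", profile_c)):
    print(
        f"a vs {name}: "
        f"euclidean={euclidean_distance(profile_a, other):.3f}, "
        f"pearson={pearson_distance(profile_a, other):.3f}"
    )
```

With these toy profiles, the Euclidean distance places a closer to c, while the correlation-based distance places a closer to b, so the two measures can disagree about which subtypes are nearest neighbors.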
For example, when the Euclidean distance method is applied to the liposarcoma data, the dedifferentiated and pleomorphic subtypes cluster together with the well-differentiated subtype and normal fat (Topology 3 of ). The effect of long branches on the Euclidean distance method becomes even more pronounced when analyzing the sarcoma data (); in this case, the least common topologies are observed only when the Euclidean distance method is used. If distant subgroups (i.e. hMSC and hMSC MPC) are removed from the analysis, then most parameter combinations that include the Euclidean distance method favor topology 5. This topology was previously observed only with the Pearson correlation distance (see Table in , left).

We do not observe a significant influence of the choice of method for identifying differentially expressed genes. More important for our data is the choice of the p-value cutoff. For the sarcoma data, conservative p-value cutoffs favor topology 3, while parameter combinations with Benjamini-Hochberg adjusted p-values seem to favor topology 5 (). The results of our study suggest that BH with a cutoff of 0.01 is a good compromise, but we recommend investigating the effects of using different cutoff values.

In general, all tree reconstruction methods are very fast, especially since the number of different tumor subtypes in our analysis is typically limited. It is therefore possible to test many parameters in a reasonable time, and we recommend doing so. […]
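The parameter sweep described above can be sketched as follows. This is a minimal illustration, not the authors' code: the constant and function names, the toy p-values, and the use of `p <= cutoff` as the selection rule are assumptions. The method labels are plain strings taken from the text; no statistical test or tree reconstruction is actually run. The BH and Holm adjustments follow their standard textbook definitions.

```python
from itertools import product

# Option labels as listed in the protocol text.
TESTS = ["ANOVA", "KW", "Welch"]
CORRECTIONS = ["BH", "Holm"]
CUTOFFS = [0.01, 0.05]
TREE_METHODS = ["WLS", "ME", "NJ", "FastME", "UPGMA"]
DISTANCES = ["Pearson", "Euclidean"]


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        value = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, value)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from largest to smallest p
        i = order[rank]
        value = min(1.0, pvalues[i] * m / (rank + 1))
        running_min = min(running_min, value)  # enforce monotonicity
        adjusted[i] = running_min
    return adjusted


ADJUSTERS = {"BH": bh_adjust, "Holm": holm_adjust}


def select_genes(pvalues, correction, cutoff):
    """Indices of genes whose adjusted p-value is at most the cutoff."""
    adjusted = ADJUSTERS[correction](pvalues)
    return [i for i, p in enumerate(adjusted) if p <= cutoff]


# Every combination of the five option lists: 3 * 2 * 2 * 5 * 2 = 120.
combos = list(product(TESTS, CORRECTIONS, CUTOFFS, TREE_METHODS, DISTANCES))
print(len(combos), "parameter combinations")

# Hypothetical raw p-values for six genes.
toy_pvalues = [0.0001, 0.004, 0.009, 0.02, 0.03, 0.2]
for correction, cutoff in product(CORRECTIONS, CUTOFFS):
    selected = select_genes(toy_pvalues, correction, cutoff)
    print(correction, cutoff, "->", len(selected), "genes selected")
```

On the toy p-values, Holm never selects more genes than BH at the same cutoff, which reflects the more conservative behavior of Holm mentioned in the text. In a real sweep, each of the 120 combinations would feed its selected genes into the chosen distance measure and tree method.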

Pipeline specifications

Software tools: FastME, MUSCLE
Applications: Phylogenetics, Nucleotide sequence alignment
Organisms: Homo sapiens
Diseases: Breast Neoplasms, Leukemia, Liposarcoma, Neoplasms, Sarcoma