[…] To compare the clustering accuracy of CLUSTOM, NW-DOTUR-AL, ESPRIT-Tree, NAST-mothur-AL and NW-mothur-AL, we prepared test datasets from three different sources. We first retrieved bacterial 16S sequences from the SILVA database, release 108, which are reliably curated with respect to alignment quality and phylogenetic relationships. Sequences that had duplicated accession numbers or were shorter than 1,200 nt were removed. Referring to the LPSN database, we then extracted the sequences to which valid scientific names are assigned. Consequently, only 27,213 16S sequences (1,451 bp on average) remained; this dataset is referred to as 16S–SILVA in this study. The second dataset, referred to as 454–HMP, was curated from 16S sequences of microbial communities isolated from various human body sites and sequenced using the Roche-454 FLX Titanium platform (NCBI accession: SRP002395). These data were retrieved from the data archive of the Human Microbiome Project. The dataset contains over 7×10^7 reads that were already trimmed and processed. The third dataset, referred to as 454–SPONGE, was prepared from public 16S pyrosequencing sequences of the V1–3 region. The 454–SPONGE dataset consists of complex, simple, and intermediate bacterial communities associated with the marine sponges Raspailia ramosa (24,433 reads) and Stelligera stuposa (26,918 reads) and with seawater (18,271 reads) collected from the sponge-sampling site, respectively. Since the 454–SPONGE dataset was not fully processed, we removed sequencing errors using AmpliconNoise and trimmed tags (barcodes, linkers and primers) using an in-house script. As a result, three processed datasets were prepared: R. ramosa (12,898 reads, 456 bp on average), S. stuposa (10,898 reads, 471 bp on average), and seawater (9,944 reads, 397 bp on average). […]
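The filtering applied to the SILVA sequences (removal of entries with duplicated accession numbers or a length below 1,200 nt) can be illustrated with a minimal sketch. This is not the script used in this study. The function names `parse_fasta`, `accession_of` and `filter_records` are hypothetical, and two assumptions are made: the accession is the part of the first header token before the first dot, and every copy of a duplicated accession is discarded rather than one copy being kept.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

MIN_LENGTH = 1200  # nt; the length threshold stated in the text

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header: str) -> str:
    """Assumed header layout: '<accession>.<start>.<stop> <taxonomy>'."""
    tokens = header.split()
    return tokens[0].split(".")[0] if tokens else ""


def filter_records(records: Iterable[Record],
                   min_length: int = MIN_LENGTH) -> List[Record]:
    """Keep records with a unique accession and length >= min_length.

    Assumption: all copies of a duplicated accession are dropped.
    """
    records = list(records)
    counts = Counter(accession_of(h) for h, _ in records)
    return [
        (h, s)
        for h, s in records
        if counts[accession_of(h)] == 1 and len(s) >= min_length
    ]
```

For a FASTA file on disk, the same functions could be applied as `filter_records(parse_fasta(open(path)))`, where `path` is a placeholder for the downloaded SILVA export.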

