k-mer counting software tools | High-throughput sequencing data analysis
Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers.
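To make the definition concrete, here is a minimal Python sketch of naive in-memory k-mer counting. The function name `count_kmers` is illustrative and does not come from any tool listed here. Because every distinct k-mer is held in one dictionary, this approach runs into exactly the memory problem described above on large data sets.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every substring of length k in a DNA sequence (naive, in memory)."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


counts = count_kmers("ACGTACGT", 3)
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```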
A k-mer counting algorithm based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Suffix arrays have long been the data structure of choice for many string problems because of their flexibility. For k-mer counting, however, which is important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
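The sketch below illustrates one way a hash-based counter can store short k-mers compactly. It assumes a 2-bit-per-base encoding, under which 31 bases occupy 62 bits and fit in a single 64-bit word. It is single-threaded and uses an ordinary Python dictionary, so it is not Jellyfish's lock-free table. The names `count_packed_kmers` and `decode_kmer` are hypothetical.

```python
from collections import defaultdict

# Assumed encoding: two bits per base.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_packed_kmers(sequence: str, k: int) -> dict:
    """Count k-mers keyed by their 2-bit packed integer representation."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    counts = defaultdict(int)
    code = 0
    valid = 0  # consecutive unambiguous bases seen so far
    for base in sequence.upper():
        bits = ENCODE.get(base)
        if bits is None:
            # Ambiguous base (for example N): restart the window.
            code, valid = 0, 0
            continue
        code = ((code << 2) | bits) & mask
        valid += 1
        if valid >= k:
            counts[code] += 1
    return dict(counts)


def decode_kmer(code: int, k: int) -> str:
    """Turn a packed integer back into its k-mer string."""
    return "".join("ACGT"[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))


packed = count_packed_kmers("ACGTNACGT", 3)
for code, count in sorted(packed.items()):
    print(decode_kmer(code, 3), count)
```

Updating the packed value with a shift and a mask means each new base costs constant work, regardless of k.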
Aims to find the best value of k for assembly. KmerGenie is a sampling approach that lets users construct k-mer abundance histograms with a performance improvement of multiple orders of magnitude. It works in several steps: (i) it computes the k-mer abundance histogram for each value of k, (ii) it estimates the number of distinct genomic k-mers in the dataset, and (iii) it outputs the k-mer length that maximizes that number.
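The three steps can be sketched in Python as follows. This is a heavily simplified illustration, not KmerGenie's method: it counts every k-mer exactly instead of sampling, and it stands in for the estimation step with a plain abundance cutoff (`min_abundance`, a hypothetical parameter) that treats rarer k-mers as likely errors. All function names are illustrative.

```python
from collections import Counter


def abundance_histogram(reads, k):
    """Step (i): map each abundance value to the number of distinct k-mers with it."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genomic_kmers(histogram, min_abundance=2):
    """Step (ii), simplified: count distinct k-mers at or above the cutoff."""
    return sum(n for abundance, n in histogram.items() if abundance >= min_abundance)


def choose_k(reads, k_values, min_abundance=2):
    """Step (iii): return the k with the largest estimate (first one wins ties)."""
    best_k, best_estimate = None, -1
    for k in k_values:
        histogram = abundance_histogram(reads, k)
        estimate = estimate_genomic_kmers(histogram, min_abundance)
        if estimate > best_estimate:
            best_k, best_estimate = k, estimate
    return best_k


reads = ["ACGTTGCA", "ACGTTGCA"]
print(choose_k(reads, [2, 3, 4]))
```

On such short toy reads the smallest k always wins, because longer k-mers leave fewer windows per read. With real data the estimate typically rises and then falls as k grows.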
Provides molecular ecologists with a high-throughput option for comparing large sequence sets to find similarity. Simrank was developed to enable advances in curation and annotation practices for large biomarker data sets. It is specifically designed for matching queries against large reference sets. The module is maintained by molecular biologists who use it to search for similarities among strings representing contiguous DNA or RNA sequences.
Enables creation and modification of k-mer lists. GenomeTester is a toolkit consisting of three programs: (1) GListMaker, which generates k-mer count lists from nucleotide sequences; (2) GListCompare, which performs basic algebraic set operations (union, intersection and complement) on these lists; and (3) GListQuery, which searches for user-provided sequences in lists generated by either GListMaker or GListCompare.
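The kind of set algebra described here can be illustrated with Python's `Counter`. This is a sketch of the concept only and does not reproduce GenomeTester's file formats or command-line interface. How counts are combined is an assumption made for the example: union sums counts, intersection keeps the smaller count, and complement keeps k-mers present in the first list but absent from the second.

```python
from collections import Counter


def make_kmer_list(sequence: str, k: int) -> Counter:
    """Build a k-mer count list from a nucleotide sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def union(a: Counter, b: Counter) -> Counter:
    """All k-mers from either list; counts are summed (assumed rule)."""
    return a + b


def intersection(a: Counter, b: Counter) -> Counter:
    """K-mers present in both lists; the smaller count is kept (assumed rule)."""
    return a & b


def complement(a: Counter, b: Counter) -> Counter:
    """K-mers present in a but absent from b, with their counts from a."""
    return Counter({kmer: n for kmer, n in a.items() if kmer not in b})


list_a = make_kmer_list("ACGTA", 2)
list_b = make_kmer_list("CGTT", 2)
print(sorted(union(list_a, list_b).items()))
print(sorted(intersection(list_a, list_b).items()))
print(sorted(complement(list_a, list_b).items()))
```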
A flexible and memory-efficient collection of programs for k-mer counting and indexing of large sequence sets. Unlike previous methods, Tallymer is based on enhanced suffix arrays, which gives much greater flexibility in the choice of k-mer size. Tallymer can process large data sets of several billion bases. It has been used in a variety of applications to study the genomes of maize and other plant species.
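The following sketch shows why a suffix array with a longest-common-prefix (LCP) table allows k to be chosen freely: the index is built once, and counts for any k come from a single scan. The construction here is naive and suited only to small inputs, and the function names are illustrative, so it should not be read as Tallymer's implementation.

```python
def build_suffix_array(text: str) -> list:
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(text: str, i: int, j: int) -> int:
    """Length of the longest common prefix of the suffixes starting at i and j."""
    n, length = len(text), 0
    while i + length < n and j + length < n and text[i + length] == text[j + length]:
        length += 1
    return length


def build_lcp(text: str, sa: list) -> list:
    """lcp[r] is the common prefix length of the suffixes at ranks r - 1 and r."""
    return [0] + [common_prefix_length(text, sa[r - 1], sa[r]) for r in range(1, len(sa))]


def count_kmers_from_index(text: str, sa: list, lcp: list, k: int) -> dict:
    """Count k-mers for any k by scanning the same suffix array and LCP table."""
    counts = {}
    current = None
    for rank, start in enumerate(sa):
        if start + k > len(text):
            current = None  # suffix is shorter than k, so it holds no k-mer
            continue
        if current is not None and lcp[rank] >= k:
            counts[current] += 1  # same k-prefix as the previous suffix
        else:
            current = text[start:start + k]
            counts[current] = 1
    return counts


text = "ACGTACGT"
sa = build_suffix_array(text)
lcp = build_lcp(text, sa)
print(count_kmers_from_index(text, sa, lcp, 3))
print(count_kmers_from_index(text, sa, lcp, 2))
```

Suffixes that share their first k characters are adjacent in sorted order, so each run of LCP values of at least k corresponds to one distinct k-mer.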
A streaming algorithm for k-mer counting that requires only a fixed, user-defined amount of memory and disk space. The approach trades off memory, time and disk usage. The multi-set of all k-mers present in the reads is partitioned, and the partitions are saved to disk. Each partition is then loaded separately into memory in a temporary hash table, and the k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered.
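The partition-then-count idea can be sketched in Python as below. Several simplifications are assumed: k-mers are assigned to partitions by a CRC32 hash, the number of partitions is passed in directly instead of being derived from memory and disk budgets, and all partitions are written in a single pass. The function name and its parameters are hypothetical.

```python
import os
import tempfile
import zlib
from collections import Counter


def iter_kmer_counts(reads, k, n_partitions=4, min_abundance=1):
    """Yield (k-mer, count) pairs, holding one partition in memory at a time."""
    with tempfile.TemporaryDirectory() as tmpdir:
        paths = [os.path.join(tmpdir, f"part_{p}.txt") for p in range(n_partitions)]
        files = [open(path, "w") for path in paths]
        try:
            # Pass 1: route every k-mer to a partition file chosen by its hash,
            # so all copies of the same k-mer land in the same file.
            for read in reads:
                for i in range(len(read) - k + 1):
                    kmer = read[i:i + k]
                    part = zlib.crc32(kmer.encode("ascii")) % n_partitions
                    files[part].write(kmer + "\n")
        finally:
            for handle in files:
                handle.close()

        # Pass 2: count each partition in its own temporary hash table.
        for path in paths:
            table = Counter()
            with open(path) as handle:
                for line in handle:
                    table[line.strip()] += 1
            for kmer, count in table.items():
                if count >= min_abundance:  # optional low-abundance filter
                    yield kmer, count


for kmer, count in sorted(iter_kmer_counts(["ACGTACGT"], 3, min_abundance=2)):
    print(kmer, count)
```

Because identical k-mers always hash to the same partition, each per-partition count is already final and no merging step is needed.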