A method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human. Turtle is designed to minimize cache misses in a cache-efficient manner by using a pattern-blocked Bloom filter to remove infrequent k-mers from consideration in combination with a novel sort-and-compact scheme, instead of a hash, for the actual counting. Although this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant of method can resort to a counting Bloom filter for even larger savings in memory at the expense of false-negative rates in addition to the false-positive rates common to all Bloom filter-based approaches.
Department of Computer Science, Rutgers University, NB, NJ, USA; Department of Ecology, Evolution and Natural Resources, Rutgers University, NB, NJ, USA; Institute of Marine and Coastal Sciences, Rutgers University, NB, NJ, USA; BioMaPS Institute for Quantitative Biology, Rutgers University, NB, NJ, USA
Turtle funding source(s)
A grant from the National Science Foundation (DEB-1004213)