Index construction software tools | High-throughput sequencing data analysis
FM-index plays an important role in DNA sequence alignment, de novo assembly (Simpson and Durbin, 2012) and compression (Cox et al., 2012). Fast and lightweight construction of FM-index for a large dataset is the key to these applications. Source text: Li, 2014.
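To make the FM-index concrete, here is a minimal Python sketch of the textbook version: a BWT built by naive suffix sorting, the C and Occ tables, and a backward search that counts pattern occurrences. It is not the construction method of Li (2014) or of any tool listed here; the naive sort is only practical for tiny inputs, and the function names and the `$` sentinel are assumptions made for illustration.

```python
from collections import Counter


def bwt_from_suffix_array(text):
    """Return (bwt, suffix_array) for text + '$' using naive suffix sorting."""
    s = text + "$"  # assumed sentinel: absent from text and smaller than A/C/G/T
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character is the one preceding each suffix; s[-1] is '$' when i == 0.
    bwt = "".join(s[i - 1] for i in sa)
    return bwt, sa


def build_fm_index(bwt):
    """Build C (number of smaller characters) and Occ (prefix counts per character)."""
    counts = Counter(bwt)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for key in occ:
            occ[key][i + 1] = occ[key][i] + (1 if key == ch else 0)
    return c_table, occ


def count_occurrences(pattern, c_table, occ, n):
    """Backward search: count occurrences of pattern in the indexed text."""
    lo, hi = 0, n  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

A production index would sample the Occ table and compress the BWT rather than storing full prefix counts, which is where the memory savings of real FM-index implementations come from.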
Allows users to interact with high-throughput sequencing data. SAMtools permits the manipulation of alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting SAM/BAM/CRAM format. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
Aligns short reads and is geared toward mammalian re-sequencing. Bowtie uses a Burrows-Wheeler index based on the full-text minute-space (FM) index. It follows two steps: an initial, ungapped seed-finding stage that takes advantage of the speed and memory efficiency of the FM index, and a gapped extension stage that employs dynamic programming and benefits from the efficiency of single-instruction multiple-data (SIMD) parallel processing available on modern processors.
Permits users to perform gapped alignment. Bowtie2 enables gapped alignment by dividing the algorithm broadly into two stages: (1) an ungapped seed-finding stage that benefits from the FM index; and (2) a gapped extension stage that uses single-instruction multiple-data (SIMD) parallel processing. Furthermore, this tool indexes the genome with an FM index to keep its memory footprint small.
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community. It supports the creation and release of software in an open source spirit and integrates a range of sequence analysis tools into a seamless whole. It is free of charge and open source.
A software suite for the comparison, manipulation and annotation of genomic features in browser extensible data (BED) and general feature format (GFF) formats. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. The BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions about large genomic datasets.
Allows users to conduct large-scale comparisons of their results with thousands of reference datasets and genome annotations in seconds. GIGGLE helps identify novel and unexpected relationships among local datasets as well as the vast amount of publicly available genomics data. It uses a temporal indexing scheme to create a single index of the genome intervals from thousands of annotations and genomic data files.
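As an illustration of the kind of feature comparison described above, the following Python sketch reports the overlapping portion of every pair of intervals drawn from two sets. It assumes BED-style coordinates (0-based start, exclusive end). The function name `intersect` and the bisect-based lookup are choices made for this example, not a description of how BEDTools is implemented.

```python
from bisect import bisect_left
from collections import defaultdict


def intersect(a_intervals, b_intervals):
    """Return (chrom, start, end) for each overlap between an A and a B interval.

    Intervals are (chrom, start, end) tuples with 0-based, half-open coordinates.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in b_intervals:
        by_chrom[chrom].append((start, end))

    starts = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts[chrom] = [s for s, _ in intervals]

    hits = []
    for chrom, a_start, a_end in a_intervals:
        intervals = by_chrom.get(chrom, [])
        # Only B intervals that start before a_end can overlap the A interval.
        limit = bisect_left(starts.get(chrom, []), a_end)
        for b_start, b_end in intervals[:limit]:
            if b_end > a_start:
                hits.append((chrom, max(a_start, b_start), min(a_end, b_end)))
    return hits
```

The scan over `intervals[:limit]` is linear in the worst case; dedicated tools rely on sorted sweeps or hierarchical binning to avoid that cost on large files.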
A high performance robust tool and library for working with SAM, BAM and CRAM sequence alignment files; the most common file formats for aligned next generation sequencing (NGS) data. Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time. Sambamba is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability.
Deals with high-throughput data from RNA structure probing and post-transcriptional modification mapping experiments. RNA Framework is a modular toolkit. Its main features are (i) automatic reference transcriptome creation, (ii) automatic read preprocessing (adapter clipping and trimming) and mapping, (iii) scoring and data normalization and (iv) accurate RNA folding prediction by incorporating structural probing data. It can perform not only RNA structure analysis, but also analysis of RNA post-transcriptional modification mapping experiments (such as m1A-seq, m6A-seq, 2OMe-seq, and Pseudo-seq).
Examines epigenomic and transcriptomic next generation sequencing (NGS) data. Octopus-toolkit can be used for antibody- or enzyme-mediated experiments and studies for the quantification of gene expression. It can accelerate the data mining of public epigenomic and transcriptomic NGS data for basic biomedical research. This tool provides a private and a public mode: one to process the user’s own data, and the other to analyze public NGS data by retrieving raw files from the GEO database.
Indexes reference sequences. SOAP2 relies on a Burrows-Wheeler transform (BWT) compressed index. It can align single-end and paired-end reads and identify the best alignment hits. This tool can map short reads onto a reference sequence for large-scale resequencing projects. It can compare the assembled sequence with the reference genome to find single nucleotide polymorphisms (SNPs). By using a BWT compressed index instead of a seed algorithm, this version aligns faster and uses less memory.
Assists users in manipulating high-throughput sequencing (HTS) data and formats. Picard is a Java toolkit that provides a set of command line scripts. It comprises Java-based utilities that manipulate SAM files, and a Java API for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported. It also works with next generation sequencing (NGS) data.
Indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access.
A software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. The BamTools C++ API/library has been successfully integrated into a variety of applications. It provides the BAM file support for several utilities in the BEDtools suite.
Maps short reads using a redesigned data structure. SOAP3 can determine whether a pattern would introduce too many branches at runtime. It enables researchers to conduct alignments with up to four mismatches. This third version of SOAP uses the multiprocessors of the GPU to improve its speed. It is not heuristic-based and reports all answers for an input file.
A tool for constructing the FM-index for a collection of DNA sequences. ropeBWT works by incrementally inserting one or multiple sequences into an existing pseudo-BWT position by position, starting from the end of the sequences. This algorithm can be largely considered a mixture of BCR and dynamic FM-index. Nonetheless, ropeBWT2 is unique in that it may implicitly sort the input into reverse lexicographical order (RLO) or reverse-complement lexicographical order (RCLO) while building the index.
Accelerates the locating operation of FM-indexes for genomic data. FMtree is a locating algorithm that builds a conceptual multiway tree. By utilizing this multiway tree, FMtree is able to calculate the non-sampled positions block-by-block. It can also be applied to any implementation of FM-indexes without modification. This algorithm is cache-friendly and avoids many unnecessary operations.
A program that can chop a BAM index (BAI) file into small pieces. The program outputs a list of BAI files each indexing a specified genomic interval. The output files are much smaller in size but maintain compatibility with existing software tools. We show how preprocessing BAI files with chopBAI can lead to a reduction of I/O by more than 95% during the analysis of 10Kbp genomic regions, eventually enabling the joint analysis of more than 10,000 individuals. As sequencing is becoming more and more common, chopBAI will be equally useful for analyzing large sequencing cohorts of other species where the BAI indexing scheme allows for fast access to small subsets of reads.
Implements an indexing data structure for the compacted de Bruijn graph (dBG) and the colored compacted dBG. pufferfish exploits a minimum perfect hash function (MPHF) and provides users with k-mer lookup. The data structure of this tool is available in two variants: a dense variant for fast queries and a sparse variant that offers the ability to trade off space for speed in a fine-grained way.
Allows users to reformat and filter bioinformatics files. JVARKIT aims to simplify the grammar used to filter bioinformatics files, making it possible to write a loop or a custom function. JVARKIT is a set of more than 100 Java-based tools for bioinformatics.
A highly hardware-acceleration friendly k-ordered FM-index for exact string matching, overlap graph construction for de novo assembly, and more. sBWT is a Burrows–Wheeler transform (BWT) based fast indexer/aligner specialized in parallelized indexing and searching for next-generation sequencing data. In our tests, the implementation achieves significant speedups in indexing and searching compared to other BWT based tools and can be applied to a variety of domains.
Allows users to map long reads with high error rates. lordFAST is designed to align reads from PacBio sequencing technology. It also allows the user to modify alignment parameters according to the reads and the application. This application supports both clipped and split read alignments, allowing reads from regions with long structural variations (SVs) to be aligned.
Permits multiple mismatches and gaps in production environments. SOAP3-dp recognizes candidate regions by exact or mismatch alignment of short substrings in the reads. It computes a detailed alignment of the read to the regions using dynamic programming. This version of SOAP3 is suitable for real data alignments. It can align reads across the whole genome, and its alignments have been used to identify multi-nucleotide polymorphisms (MNPs).
Stores the k-words corresponding to the edges of a de Bruijn subgraph in a compact manner. kFM-index is a data structure that enables random access to vertices and edges. It avoids the direct storage of k-words and pointers, which makes it compact. The software aims to assist users in representing the k-mer composition of the sequences.
Provides utility functions implementing commonly used genomic operations. bedr is a formal BED-operations framework that offers a formal R interface to interact with BEDTools and BEDOPS. In addition to sort operations, it also supports identification of overlapping regions which can be collapsed to avoid downstream analytical challenges. This method is compatible with the ubiquitous BED tools paradigm and integrates with R-based workflows.
An open-source tool written in Clojure, a functional programming language that runs on the Java Virtual Machine. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The Clojure code of cljam has fewer lines than SAMtools and Picard, which are similar tools, and offers equivalent performance.
Grabs arbitrary lines from a BGZIP compressed file. Grabix provides random access into text files that have been compressed with bgzip. This tool creates its own index of the file; users can then extract arbitrary lines from the file with the grab command or choose random lines with the random command.
Gives access to high-throughput sequencing (HTS) data formats. Htsjdk does not support the latest Variant Call Format specification, for example VCFv4.3 and BCFv2.2. It can be useful for manipulating HTS data.
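The idea of indexing a file once and then jumping straight to arbitrary lines can be sketched in Python as follows. This is a simplified stand-in that works on an uncompressed text file and records one byte offset per line; Grabix itself operates on bgzip-compressed files, and its actual index layout is not described in the text above. The names `build_line_index`, `grab_lines` and `demo` are hypothetical.

```python
import os
import tempfile


def build_line_index(path):
    """Record the byte offset at which each line of the file starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets


def grab_lines(path, offsets, first, last):
    """Return lines first..last (1-based, inclusive) without scanning earlier lines."""
    if first < 1 or last < first or last > len(offsets):
        raise ValueError("line range out of bounds")
    with open(path, "rb") as handle:
        handle.seek(offsets[first - 1])
        return [
            handle.readline().decode().rstrip("\r\n")
            for _ in range(last - first + 1)
        ]


def demo():
    """Write a 1000-line file, index it, and fetch its last three lines."""
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(b"".join(b"record %d\n" % i for i in range(1, 1001)))
    try:
        offsets = build_line_index(tmp.name)
        return grab_lines(tmp.name, offsets, 998, 1000)
    finally:
        os.remove(tmp.name)
```

Storing every line offset keeps the sketch short; an index for very large files would more plausibly store an offset for every block of lines and scan within the block.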
Indexes a fasta file database. dbifasta needs a flat file database of one or more files, and builds EMBL CD-ROM format index files. The resulting index-file format is used by the software on the EMBL database CD-ROM distribution and by the Staden package in addition to EMBOSS, and appears to be the most generally used and publicly available index file format for these databases.
Generates a FASTA index for FASTA files. Fastahack is an application for indexing and extracting sequences and subsequences from FASTA files. The included library provides a FASTA reader and indexer that can be embedded into applications which would benefit from directly reading subsequences from FASTA files. This resource also uses the C function fseek64 to extract sequences and subsequences. This permits the fastest possible extraction and makes fastahack a useful tool for bioinformaticians who need to quickly extract many subsequences from a reference FASTA sequence.
A multi-threaded sort/merge tool for BAM files. NovoSort reduces run times through multi-threading and by combining sort and merge in one step. It uses a stable sort/merge algorithm that will not change the order of alignments with the same sort key, and it can optionally create a BAM index file. This is a two-phase sort/merge: the first phase sorts as many reads as possible in memory and then writes segments of sorted records to temporary disk files. The second phase merges the sorted fragments to produce the final sorted file.
Allows users to work with PacBio BAM files and their associated indices. pbbam is mainly composed of a core C++ library that can create, query and edit files conforming to the PacBio BAM file specification. In addition, the software can be configured to build additional language bindings and command-line utilities. It can also be integrated into CMake-based projects.
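The seek-based extraction described above can be sketched in Python. The index below stores, per sequence, its length, the byte offset of its first base, and the number of bases and bytes per line, which is the information needed to turn a base coordinate into a file position. This layout is an assumption modelled on common FASTA index conventions, not a statement of fastahack's exact format. The sketch assumes a well-formed file in which all lines of a record except the last have the same width; all function names are hypothetical.

```python
import os
import tempfile


def build_fasta_index(path):
    """Map each sequence name to its length, data offset, and line geometry."""
    index = {}
    name = None
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                name = line[1:].split()[0].decode()
                index[name] = {"length": 0, "offset": position + len(line),
                               "line_bases": 0, "line_width": 0}
            elif name is not None:
                entry = index[name]
                bases = len(line.rstrip(b"\r\n"))
                if entry["line_bases"] == 0:
                    entry["line_bases"] = bases      # bases per full line
                    entry["line_width"] = len(line)  # bytes per line, newline included
                entry["length"] += bases
            position += len(line)
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of a sequence (0-based, half-open) using seek."""
    entry = index[name]
    if not 0 <= start <= end <= entry["length"]:
        raise ValueError("coordinates out of range")
    if start == end:
        return ""

    def byte_of(base):
        full_lines, column = divmod(base, entry["line_bases"])
        return entry["offset"] + full_lines * entry["line_width"] + column

    first, last = byte_of(start), byte_of(end - 1)
    with open(path, "rb") as handle:
        handle.seek(first)
        raw = handle.read(last - first + 1)
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


def demo():
    """Index a small two-record FASTA file and pull out two subsequences."""
    data = b">seq1 example\nACGTACGTAC\nGGGTTTAAAC\nCC\n>seq2\nTTTT\n"
    with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
        tmp.write(data)
    try:
        index = build_fasta_index(tmp.name)
        return fetch(tmp.name, index, "seq1", 8, 13), fetch(tmp.name, index, "seq2", 1, 3)
    finally:
        os.remove(tmp.name)
```

Because only the requested byte range is read, extraction time depends on the length of the subsequence rather than on the size of the reference file.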
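The two-phase sort/merge described above can be sketched in Python as follows. It is a single-threaded illustration of the general technique, not NovoSort's implementation. Records are arbitrary picklable Python objects rather than BAM alignments, and `external_sort` with its `chunk_size` parameter is a hypothetical name chosen for this example.

```python
import heapq
import os
import pickle
import tempfile


def external_sort(records, key, chunk_size):
    """Yield records in sorted order while holding at most chunk_size in memory.

    Phase 1 sorts fixed-size chunks in memory and writes each to a temporary file.
    Phase 2 merges the sorted chunk files. Ties keep their input order (stable).
    """
    chunk_paths = []
    chunk = []

    def flush():
        chunk.sort(key=key)  # list.sort is stable
        fd, path = tempfile.mkstemp(suffix=".chunk")
        with os.fdopen(fd, "wb") as out:
            for record in chunk:
                pickle.dump(record, out)
        chunk_paths.append(path)
        chunk.clear()

    for record in records:
        chunk.append(record)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    def read_chunk(path):
        with open(path, "rb") as handle:
            while True:
                try:
                    yield pickle.load(handle)
                except EOFError:
                    return

    try:
        # heapq.merge breaks ties by iterable order, so earlier chunks win ties.
        yield from heapq.merge(*(read_chunk(p) for p in chunk_paths), key=key)
    finally:
        for path in chunk_paths:
            os.remove(path)
```

For example, sorting `(chromosome, position, name)` tuples with `key=lambda r: (r[0], r[1])` and `chunk_size=2` keeps records that share a position in their original relative order, which is the stability property the description mentions.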
Allows manipulation of SAM, FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files by providing an interface to the “samtools”, “bcftools”, and “tabix” utilities. Rsamtools is an R package that also offers facility for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. It can be used as a starting point for creating R objects suitable for a diversity of workflows.