Genome assembly software tools | De novo sequencing data analysis
Genome assembly is an obligatory step in de novo DNA sequencing, where long and short reads are put together to reconstitute a complete genome sequence. In de novo genome assembly, no reference genome is used, which renders the task more complex and time-consuming than mapping. De novo genome assembly software tools detect overlaps between reads, assemble overlaps into contigs, and then combine contigs into scaffolds in order to obtain a draft genome sequence.
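As a rough illustration of the overlap-then-merge idea described above, here is a minimal Python sketch. It assumes error-free reads on a single strand and uses a naive greedy strategy. The function names `overlap_length` and `greedy_assemble` are hypothetical and are not taken from any tool listed on this page.

```python
def overlap_length(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap_length(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
    print(greedy_assemble(reads))
```

Real assemblers must also handle sequencing errors, reverse complements and repeats, which is why they rely on overlap graphs or de Bruijn graphs rather than this quadratic pairwise search.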
A single-cell assembler for capturing and sequencing “microbial dark matter” that forms small pools of randomly selected single cells (called a mini-metagenome) and then sequences all genomes from the mini-metagenome at once. SPAdes is intended for both standard isolates and single-cell MDA bacteria assemblies. It works with Illumina or IonTorrent reads and is capable of providing hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads. Additional contigs can also be provided and used as long reads. SPAdes supports paired-end reads, mate-pairs and unpaired reads and can take several paired-end and mate-pair libraries as input simultaneously.
Allows correction of the errors inherent in long, single-molecule sequences using short, high-identity sequences. PBcR is a program that trims and corrects individual long-read sequences by first mapping short-read sequences to them, and then computes an accurate hybrid consensus sequence. The corrected PBcR reads can then be exported for other applications or can be de novo assembled alone or in combination with other data.
Builds transcriptomes from RNA-seq data. Trinity is standalone software composed of three main components: (i) Inchworm, which first generates transcript contigs; (ii) Chrysalis, which clusters them and constructs complete de Bruijn graphs for each cluster; and (iii) Butterfly, which processes the individual graphs in parallel to reconstruct the transcript sequences.
A reference implementation of a probabilistic sequence overlapping algorithm. MHAP is designed to efficiently detect all overlaps between noisy long-read sequence data. It efficiently estimates Jaccard similarity by compressing sequences to their representative fingerprints composed of min-mers (minimum k-mers). MHAP is included within the Canu assembler, which is a fork of the Celera Assembler designed for high-noise single-molecule sequencing (such as the PacBio RSII or Oxford Nanopore MinION).
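The Jaccard estimation mentioned above can be sketched with a generic MinHash scheme. This is a minimal illustration, not MHAP's actual implementation: the helper names, the use of SHA-1 as the hash family, and the parameter values are all assumptions chosen for readability.

```python
import hashlib


def kmers(seq: str, k: int) -> set[str]:
    """All distinct substrings of length k in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """One member of a seeded hash family (assumed: SHA-1 of 'seed:kmer')."""
    digest = hashlib.sha1(f"{seed}:{kmer}".encode()).digest()
    return int.from_bytes(digest[:8], "little")


def minhash_sketch(seq: str, k: int, num_hashes: int) -> list[int]:
    """For each hash function, keep only the minimum hash over all k-mers."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list[int], sketch_b: list[int]) -> float:
    """Fraction of sketch positions where both sequences share the minimum."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def exact_jaccard(a: str, b: str, k: int) -> float:
    """Exact Jaccard similarity of the two k-mer sets, for comparison."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb)


if __name__ == "__main__":
    read1 = "ACGTACGTTGCAAGGCTTAACCGGTATTGC"
    read2 = "CGTACGTTGCAAGGCTTAACCGGTATTGCA"  # read1 shifted by one base
    s1 = minhash_sketch(read1, k=8, num_hashes=128)
    s2 = minhash_sketch(read2, k=8, num_hashes=128)
    print("estimated:", estimate_jaccard(s1, s2))
    print("exact:    ", exact_jaccard(read1, read2, k=8))
```

The sketch has a fixed size regardless of read length, so comparing two reads costs a fixed number of integer comparisons instead of a full k-mer set intersection.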
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community. It permits the creation and the release of software in an open source spirit. It integrates tools for sequence analysis into a seamless whole. It is free of charge and open source.
Conducts next generation sequencing (NGS) investigation. Geneious provides visual sequence alignment and editing, sequence assembly, comprehensive molecular cloning and phylogenetic analysis. It increases process efficiency and improves data organization. This tool enables the importation and conversion of a vast range of data types and offers a solution to customize researchers’ algorithms.
A variant caller and small genome assembler. The heart of DISCOVAR is a de novo genome assembler, one that is accurate enough to produce assemblies that can be used for variant calling given a reference sequence. DISCOVAR can also generate de novo assemblies for small genomes, but consider using DISCOVAR de novo instead which can assemble genomes up to mammalian size. DISCOVAR provides a more complete inventory of an individual’s genetic variants than had been previously possible. As such, it adds to the tools that can be used to probe the genetic basis of disease. It may be particularly useful in cases where targeted or exome sequencing fails to find causal mutations.
A program for high-quality de novo microbial genome assemblies using only a single, long-insert shotgun DNA library in conjunction with Single Molecule, Real-Time (SMRT) DNA sequencing. The process itself relies on a succession of steps to generate de novo assemblies of a genome.
Provides a de novo assembler for short DNA sequence reads. SSAKE is designed to help leverage the information from short sequence reads by assembling them into contigs and scaffolds that can be used to characterize novel sequencing targets. SSAKE assembles whole reads (not k-mers) and, as such, is well suited to structural variant assembly and detection. SSAKE is written in Perl and runs on Linux. SSAKE cycles through short sequence reads stored in a hash table and progressively searches through a prefix tree for extension candidates. The algorithm has assembled 25 to 300 bp (genome, transcriptome, amplicon) reads from viral, bacterial and fungal genomes. SSAKE is lightweight, robust, and simple to set up and run.
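The read-extension idea described above can be illustrated with a toy greedy 3' extension in Python. This is only a sketch under simplifying assumptions: it uses a plain dictionary keyed by read prefixes instead of a real prefix tree, ignores reverse complements and base-call errors, and the function names are hypothetical rather than taken from SSAKE.

```python
from collections import defaultdict


def build_prefix_index(reads: list[str], prefix_len: int) -> dict[str, list[str]]:
    """Hash table mapping the first `prefix_len` bases of each read to the reads."""
    index = defaultdict(list)
    for read in reads:
        if len(read) >= prefix_len:
            index[read[:prefix_len]].append(read)
    return index


def extend_contig(seed: str, reads: list[str], min_overlap: int = 4) -> str:
    """Greedily extend `seed` to the right using whole reads.

    At each step the longest contig suffix that matches the start of an
    unused read is preferred, and the read's overhang is appended.
    """
    index = build_prefix_index(reads, min_overlap)
    max_read = max((len(r) for r in reads), default=0)
    used = {seed}
    contig = seed
    while True:
        extension, overlap = None, 0
        for n in range(min(len(contig), max_read - 1), min_overlap - 1, -1):
            suffix = contig[-n:]
            for read in index.get(suffix[:min_overlap], []):
                if read not in used and len(read) > n and read.startswith(suffix):
                    extension, overlap = read, n
                    break
            if extension is not None:
                break
        if extension is None:
            return contig  # no unused read overlaps the contig end
        used.add(extension)
        contig += extension[overlap:]


if __name__ == "__main__":
    reads = ["ACGTTGCA", "TTGCAGGT", "AGGTCCAA"]
    print(extend_contig("ACGTTGCA", reads))
```

Because whole reads are matched rather than k-mers, each extension step keeps the full read context, which is the property the description above highlights.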
Computes an improved consensus sequence for the assembly. LQS uses accurate short-read data and/or Pacific Biosciences circular consensus reads to correct error-prone long reads sufficiently for assembly. This approach uses three steps: (i) overlaps between reads are detected and the reads are corrected with a multiple alignment process, (ii) the corrected reads are assembled using the Celera Assembler, and (iii) the assembly is improved using a probabilistic model of the signal-level data.
Aligns sequences with or without local alignment. MECAT is a program based on a pseudolinear alignment scoring algorithm that exploits distance difference factors (DDFs). The software is composed of four modules: (i) a single-molecule real-time (SMRT) read pairwise mapper, (ii) an SMRT read reference mapper, (iii) a noise corrector, and (iv) a pipeline for hierarchical assembly, which is an extended version of the Canu pipeline.
Allows de novo genome assembly and multisample variant calling. Cortex is a modular set of multi-threaded programs for manipulating assembly graphs. Linked de Bruijn Graph (LdBG) data structure and associated algorithms are implemented as part of the software. It was used for two tasks where long-range information is likely to be beneficial: finding large differences from a reference and analysis of genomic context for drug resistance genes, which was validated using a PacBio reference assembled for the sample.
Allows integrative investigation of next generation sequencing (NGS) microbiology data. Orione supports the whole life cycle of microbiology research data from production and annotation to publication and sharing. It can be used for a variety of microbiological projects including bacteria resequencing, de novo assembling and microbiome investigations. This tool is implemented on the Galaxy web platform.
Provides a whole‐genome shotgun assembler that can generate high‐quality genome assemblies using short reads (~100bp) such as those produced by the new generation of sequencers. The ALLPATHS-LG assemblies are not necessarily linear, but instead are presented in the form of a graph. This graph representation retains ambiguities, such as those arising from polymorphism, uncorrected read errors, and unresolved repeats, thereby providing information that has been absent from previous genome assemblies. ALLPATHS‐LG requires high sequence coverage of the genome in order to compensate for the shortness of the reads. The precise coverage required depends on the length and quality of the paired reads, but typically is of the order 100x or above.
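To make the coverage requirement above concrete, the following Python sketch applies the usual mean-coverage relation, coverage = number of reads × read length / genome size. The function names and the example genome sizes are illustrative assumptions, not values prescribed by ALLPATHS-LG.

```python
import math


def mean_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target mean coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Assumed example: 100 bp reads, 100x target, 5 Mb bacterial genome.
    print(reads_needed(100, 100, 5_000_000))
    # Same settings for an assumed 3 Gb mammalian-sized genome.
    print(reads_needed(100, 100, 3_000_000_000))
```

This only gives the mean depth. Actual requirements also depend on library insert sizes and read quality, as the description notes.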
A package for assembling genome sequence using paired-end whole-genome shotgun reads. ARACHNE has several key features, including an efficient and sensitive procedure for finding read overlaps, a procedure for scoring overlaps that achieves high accuracy by correcting errors before assembly, read merger based on forward-reverse links, and detection of repeat contigs by forward-reverse link inconsistency.
Allows alignment of long reads for consensus and assembly. FALCON is a set of tools that follows a hierarchical genome assembly process, made up of several steps, for generating a genome assembly from a set of sequencing reads. Each step is carried out by different command-line tools implementing different sets of algorithms.
Enables the assembly of a human genome using short reads from a high-throughput sequencing platform. ABySS consists of a parallelized sequence assembler that distributes the computation of the assembly algorithm across a network of commodity computers. The algorithm proceeds in two stages: (1) it generates all possible substrings of length k (termed k-mers) from the sequence reads; and (2) it uses mate-pair information to extend contigs by resolving ambiguities in contig overlaps.
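The first stage described above, enumerating k-mers and joining them into contigs, can be sketched in a few lines of Python. This is a single-machine toy under stated assumptions: it ignores reverse complements, sequencing errors and the distributed hash table, and the function names are hypothetical rather than part of ABySS.

```python
def enumerate_kmers(reads: list[str], k: int) -> set[str]:
    """All distinct substrings of length k found in the reads."""
    found = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            found.add(read[i:i + k])
    return found


def successors(kmer: str, kmers: set[str]) -> list[str]:
    """k-mers that overlap `kmer` by k-1 bases on its right side."""
    return [kmer[1:] + b for b in "ACGT" if kmer[1:] + b in kmers]


def predecessors(kmer: str, kmers: set[str]) -> list[str]:
    """k-mers that overlap `kmer` by k-1 bases on its left side."""
    return [b + kmer[:-1] for b in "ACGT" if b + kmer[:-1] in kmers]


def extend_right(start: str, kmers: set[str]) -> str:
    """Follow the graph to the right while the path is unambiguous."""
    contig, current, visited = start, start, {start}
    while True:
        nxt = successors(current, kmers)
        if len(nxt) != 1:
            break  # dead end or branch
        if nxt[0] in visited or len(predecessors(nxt[0], kmers)) != 1:
            break  # cycle or converging paths
        current = nxt[0]
        visited.add(current)
        contig += current[-1]
    return contig


if __name__ == "__main__":
    reads = ["ATGGCGT", "GGCGTAC"]
    kmer_set = enumerate_kmers(reads, k=4)
    print(sorted(kmer_set))
    print(extend_right("ATGG", kmer_set))
```

Adjacency is found by looking up the four possible neighbours of each k-mer in a set, so no explicit edge list is stored. Branch points are where the second stage's mate-pair information becomes necessary.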
Provides tools and class interfaces for the assembly of DNA reads. AMOS includes modular assembly pipelines, as well as tools for overlapping, consensus generation, contigging, and assembly manipulation. Users can modify the AMOS pipeline config file to add processing steps. The software includes a number of conversion utilities that allow data from a variety of input sources to be processed and output in commonly used assembly formats.
Constructs de novo draft assemblies for human-sized genomes. SOAPdenovo is specially designed to assemble Illumina GA short reads and is able to resolve longer repeat regions in contig assembly. SOAPdenovo is made up of six modules that handle read error correction, de Bruijn graph (DBG) construction, contig assembly, paired-end (PE) read mapping, scaffold construction, and gap closure. It was used as a basis for the MEGAHIT software.