Assembly scaffolding software tools | De novo genome sequencing data analysis
The de novo assembly of short-read sequencing data usually leads to a fragmented set of genomic sequences (contigs). Ordering and orientating such contigs (scaffolding) represents the first, nontrivial step towards genome finishing and usually requires extensive processing and manual editing of large blocks of sequence. The preferred approach to genome scaffolding is currently based on assembling the sequenced reads into contigs and then using paired-end information to join them into scaffolds.
Assists in parallelizing the scaffolding process. DNA triangulation proceeds by assembling an initial set of contigs or scaffolds with any available assembly software. The software maps Hi-C data to the contigs with any available Hi-C mapping/correction pipeline. It then partitions contigs into putative chromosomes via a karyotyping step and assigns each bin to a cluster/chromosome number.
Allows integrative investigation of next generation sequencing (NGS) microbiology data. Orione supports the whole life cycle of microbiology research data from production and annotation to publication and sharing. It can be used for a variety of microbiological projects including bacterial resequencing, de novo assembly and microbiome investigations. This tool is implemented on the Galaxy web platform.
Constructs de novo draft assemblies for human-sized genomes. SOAPdenovo is specially designed to assemble Illumina GA short reads and is able to resolve longer repeat regions in contig assembly. SOAPdenovo is made up of six modules that handle read error correction, de Bruijn graph (DBG) construction, contig assembly, paired-end (PE) read mapping, scaffold construction, and gap closure. It was used as a basis for the MEGAHIT software.
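The de Bruijn graph construction step named above can be illustrated with a minimal sketch. This is not SOAPdenovo's implementation; the function name `build_de_bruijn`, the toy reads and the choice of k are assumptions made purely for illustration. Each k-mer in a read contributes one edge from its (k-1)-mer prefix to its (k-1)-mer suffix.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the sorted list of (k-1)-mers that follow it.

    Illustrative only: real assemblers also handle reverse complements,
    k-mer counts and sequencing errors, which are omitted here.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix -> suffix edge
    return {node: sorted(successors) for node, successors in graph.items()}


# Hypothetical toy reads that overlap by five bases.
toy_reads = ["ACGTAC", "CGTACG"]
dbg = build_de_bruijn(toy_reads, k=4)
for node, successors in sorted(dbg.items()):
    print(node, "->", successors)
```

Contigs would then correspond to unambiguous paths through such a graph; that traversal is left out of this sketch.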
Allows de novo assembly of a transcriptome using a reference proteome. STM exploits the fact that, by translating contigs into amino acid sequences, it is possible to search for orthologous regions in a reference proteome, even when it belongs to a distantly related organism. The method can join multiple transcript fragments that are part of a single gene, providing new and valuable information on the order and the orientation of these fragments along the original transcript. Multiple-k, a method that performs multiple assemblies with various k-mer lengths and retains the best part of each one to form the final assembly, is also available.
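As a rough illustration of the translation step described above, the following sketch translates a contig in all six reading frames using the standard genetic code. It is not STM's code: the function names and the toy contig are hypothetical, and the subsequent search against a reference proteome is not shown.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(contig):
    """Return a dict such as {'+1': ..., '-3': ...} of translated frames."""
    frames = {}
    for strand, seq in (("+", contig), ("-", reverse_complement(contig))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# Hypothetical toy contig: start codon, one alanine codon, stop codon.
frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Each translated frame could then be compared with protein sequences from the reference proteome using any protein alignment tool.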
Produces midrange scaffolding comparable to 40-kbp fosmid-based mate-pair libraries. fragScaff leverages coincidences between the content of different pools as a source of contiguity information. Specifically, contiguity preserving transposase sequencing (CPT-seq) data is mapped to a de novo genome assembly, followed by the identification of pairs of contigs or scaffolds whose ends disproportionately co-occur in the same indexed pools, consistent with true adjacency in the genome. Such candidate “joins” are used to construct a graph, which is then resolved by a minimum spanning tree.
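The final step above, resolving a graph of candidate joins with a minimum spanning tree, can be sketched with Kruskal's algorithm. This is a generic illustration rather than fragScaff's code: the contig names, the edge weights and the convention that a lower weight means stronger support for a join are all assumptions.

```python
def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm; edges are (weight, node_a, node_b) tuples."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the set representative, halving the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # adding this edge does not create a cycle
            parent[root_a] = root_b
            tree.append((a, b, weight))
    return tree


# Hypothetical candidate joins between four contigs
# (assumed convention: lower weight = better supported join).
contigs = ["c1", "c2", "c3", "c4"]
joins = [
    (0.2, "c1", "c2"),
    (0.9, "c1", "c3"),
    (0.3, "c2", "c3"),
    (0.4, "c3", "c4"),
    (0.8, "c2", "c4"),
]
mst = minimum_spanning_tree(contigs, joins)
print(mst)  # [('c1', 'c2', 0.2), ('c2', 'c3', 0.3), ('c3', 'c4', 0.4)]
```

In this toy case the tree is a single chain, which reads directly as a contig order; branching trees would need further resolution that is not covered here.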
Calculates bacterial artificial chromosome (BAC) end length constraints from clone fingerprint map contigs created by the FPC package. FASSI aims to improve the contiguity of whole-genome shotgun sequence (WGS) assemblies. Its output is fully compatible with the PCAP assembly program and could be easily formatted to work with other assembly algorithms able to accept length constraints for individual clones as input.
A scalable, exact algorithm for the scaffold assembly of large, repeat-rich genomes, with consistent improvement over state-of-the-art programs for scaffold correctness and contiguity. OPERA provides a rigorous framework for scaffolding of repetitive sequences and a systematic approach for combining data from different second-generation (Illumina, Ion Torrent) and third-generation (PacBio, ONT) sequencing technologies. OPERA efficiently scaffolds large genomes with provable scaffold properties, providing an avenue for systematic augmentation and improvement of thousands of existing draft eukaryotic genome assemblies.
A scaffolding program for complicated genomes. GOBOND assembles contigs across sequencing platforms: pre-assembled contigs from one sequencing platform can be oriented and linked using paired-end/mate-pair reads from other platforms.
A stand-alone program for scaffolding pre-assembled contigs using long reads (e.g. PacBio RS reads). Using the long-read information, contigs (or scaffolds) are placed in the right order and orientation in so-called super-scaffolds. The SSPACE-LongRead hybrid assembly approach has been tested on a number of bacterial genomes and in most cases results in fewer than 10 super-scaffolds (numbers based on draft assemblies constructed with one Illumina MiSeq paired-end and one PacBio RS C2 SMRT library, both at 100X coverage).
Combines direct link and paired link graphs to address scaffolding obstacles in repetitive regions. inGAP-sf employs direct links to provide extra routes and decrease the complexity of regions enriched in repetitive contigs. The main advantages of inGAP-sf are that it introduces the direct link graph to cluster and link Killer Ig-Like Receptor (KIR) contigs, and that it uses a statistics-based estimation model to screen out correct routes from numerous noise routes in repetitive regions.
A random-barcode-based BAC library system. pBACode is a pool of Bacterial Artificial Chromosome (BAC) cloning vectors. It is able to introduce random sequences into BAC vectors so that each BAC clone has a pair of unique barcodes flanking its cloning site. From the unique barcode pairs in every BAC clone, long and accurate BAC paired-end sequences are obtained, and clones in a BAC library can be located in a high-throughput manner.
A package for scaffolding genomic assemblies. BESST contains several modules for, e.g., building a "contig graph" from available information, obtaining scaffolds from this graph, and estimating gap sizes accurately (based on GapEst). The scaffolder accurately infers scaffolds, even with high levels of contamination, and its authors showed that other scaffolders are vulnerable to PE-contaminated libraries.
A stand-alone program for scaffolding pre-assembled contigs using NGS paired-read data. It is unique in offering the possibility to manually control the scaffolding process. By using the distance information of paired-end and/or mate-pair data, SSPACE is able to assess the order, distance and orientation of your contigs and combine them into scaffolds.
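The general idea of using read pairs to link contigs can be sketched as follows. This is a simplified illustration, not SSPACE's algorithm: it assumes forward-reverse (paired-end style) libraries, takes each read as an already mapped (contig, strand) tuple, and uses a hypothetical `min_links` threshold to discard weakly supported joins. Distance estimation from the insert size is omitted.

```python
from collections import Counter


def count_links(pairs, min_links=2):
    """Count read pairs that connect ends of two different contigs.

    Assumption: a read mapped on '+' points to its contig's right end and a
    read mapped on '-' points to its contig's left end, so each pair votes
    for joining those two ends.
    """
    links = Counter()
    for (contig_a, strand_a), (contig_b, strand_b) in pairs:
        if contig_a == contig_b:
            continue  # both reads on one contig carry no scaffolding signal
        end_a = (contig_a, "right" if strand_a == "+" else "left")
        end_b = (contig_b, "right" if strand_b == "+" else "left")
        # Sort the two ends so that A-B and B-A count as the same link.
        links[tuple(sorted((end_a, end_b)))] += 1
    return {link: n for link, n in links.items() if n >= min_links}


# Hypothetical mapped read pairs.
mapped_pairs = [
    (("c1", "+"), ("c2", "-")),
    (("c2", "-"), ("c1", "+")),
    (("c1", "+"), ("c2", "-")),
    (("c1", "+"), ("c3", "+")),  # single link, below the threshold
    (("c1", "+"), ("c1", "-")),  # same contig, ignored
]
supported = count_links(mapped_pairs)
print(supported)  # {(('c1', 'right'), ('c2', 'left')): 3}
```

Here the right end of c1 is joined to the left end of c2, which implies that both contigs sit in the same orientation with c1 first.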
Calculates covariance scores for constrained regions of background conservation. McBASC can be utilized as a covarying or highly conserved filter. This software gives an equally high score to conserved or covarying alignments and allows, without a reduction in score, substitution of conserved pairs of residues for covarying ones.
Provides a de novo transcriptome assembler for short RNA-seq reads. Oases assembles unmapped RNA-seq reads into full-length transcripts. It enables reconstruction with different k-values via dynamic cutoffs. Its features include an array of hash lengths, dynamic filtering of noise, resolution of alternative splicing (AS) events and merging of multiple assemblies.
Provides a way to analyze Pacific Biosciences long-read sequencing data. PBSuite is composed of two projects: PBJelly and PBHoney. The first is an automated pipeline for aligning long sequencing reads to draft assemblies. The second provides variant identification approaches that exploit the high mappability of long reads by considering intra-read discordance and soft-clipped tails.
Provides a de novo transcriptome assembler specifically made for RNA-Seq. SOAPdenovo-Trans is derived from the SOAPdenovo2 genome assembler and is adapted for transcriptome assembly. The software aims to process RNA-Seq data and handles alternative splicing (AS). It uses a multiple k-mer method either to merge the resultant assemblies into one final set or to iterate over several k-mer de Bruijn graph (DBG) assemblies during contig construction.
An algorithm for genome scaffolding. MeDuSa exploits information obtained from a set of (draft or closed) genomes from related organisms to determine the correct order and orientation of the contigs. MeDuSa formalises the scaffolding problem by means of a combinatorial optimisation formulation on graphs and implements an efficient constant factor approximation algorithm to solve it. In contrast to currently used scaffolders, it requires neither prior knowledge of the microorganism dataset under analysis (e.g. phylogenetic relationships) nor the availability of paired-end read libraries.