Read simulation software tools | Whole-genome sequencing data analysis
In the past few years, high-throughput next-generation sequencing technologies have effectively replaced earlier data types for genome-wide studies measuring gene expression changes and discovering genomic/epigenetic variations, and many tools have been developed for analyzing such datasets. Simulated data is indispensable for guiding tool development and evaluating tool performance. It is therefore essential to develop simulation software that can produce next-generation sequencing reads that capture the most essential characteristics of real data.
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community. It supports the creation and release of software in an open-source spirit. It integrates sequence analysis tools into a seamless whole. It is free of charge and open source.
An Illumina paired-end and mate-pair short read simulator. This project attempts to model as many of the quirks that exist in Illumina data as possible. These quirks include the potential for chimeric reads and non-biotinylated fragment pull-down in mate-pair libraries.
Supports simulation studies such as those used to design the data collection modalities of the 1000 Genomes Project. ART builds ‘synthetic’ sequencing reads in a manner that mimics the technology-specific sequencing process. It can generate sequencing data with customized read length and error characteristics. This tool supports all three common types of sequencing error: base substitutions, insertions and deletions.
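The three error types named in this entry can be illustrated with a toy read generator. This is a minimal sketch under stated assumptions, not ART's actual error model: the function name `simulate_read` and the uniform, position-independent error rates are hypothetical choices made only for illustration.

```python
import random

BASES = "ACGT"

def simulate_read(reference, start, read_length, sub_rate, ins_rate, del_rate, rng):
    """Build a read of up to read_length bases from reference[start:],
    injecting substitutions, insertions and deletions at fixed rates
    (hypothetical uniform rates, chosen for illustration only)."""
    read = []
    pos = start
    while len(read) < read_length and pos < len(reference):
        r = rng.random()
        if r < ins_rate:
            # Insertion: emit a random base without consuming the reference.
            read.append(rng.choice(BASES))
        elif r < ins_rate + del_rate:
            # Deletion: skip the reference base without emitting anything.
            pos += 1
        elif r < ins_rate + del_rate + sub_rate:
            # Substitution: emit a base that differs from the reference base.
            read.append(rng.choice([b for b in BASES if b != reference[pos]]))
            pos += 1
        else:
            # Error-free copy of the reference base.
            read.append(reference[pos])
            pos += 1
    return "".join(read)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the run is repeatable
    ref = "ACGTACGTACGTACGTACGT"
    print(simulate_read(ref, 0, 10, sub_rate=0.05, ins_rate=0.02, del_rate=0.02, rng=rng))
```

With all three rates set to zero the function returns an exact copy of the reference window, which makes it easy to check that any differences in a simulated read come from the injected errors.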
Implements a probabilistic model of the evolution of RNA-, DNA-, or protein-like sequences. Rose allows for varying rates of mutation within the sequences, making it possible to establish so-called sequence motifs. The data created by Rose are suitable for evaluating methods for multiple sequence alignment and for predicting phylogenetic relationships. It can also be useful when teaching courses on sequence evolution, developing models of it, or studying evolutionary processes.
The commercial launch of 454 pyrosequencing in 2005 was a milestone in genome sequencing in terms of performance and cost. Flowsim is a simulator that generates realistic pyrosequencing data files of arbitrary size from a given set of input DNA sequences.
PacBio sequencers produce two types of characteristic reads, both of which can be useful for de novo assembly of genomes: continuous long reads (long, with a high error rate) and circular consensus sequencing reads (short, with a low error rate). PBSIM was developed at a time when no available simulator specifically targeted the generation of PacBio libraries. It simulates both types of PacBio reads using either a model-based or a sampling-based simulation.
A targeted re-sequencing simulator that generates synthetic exome sequencing reads from a given sample genome. Wessim emulates conventional exome capture technologies, including Agilent's SureSelect and NimbleGen's SeqCap, to generate DNA fragments from genomic target regions. The target regions can be either specified by genomic coordinates or inferred from in silico probe hybridization. Coupled with existing next-generation sequencing simulators, Wessim generates realistic artificial exome sequencing data, which is essential for developing and evaluating exome-targeted variant callers.
A set of programs aimed at simulating ancient DNA (aDNA) fragments. Gargammel can simulate the most common features of aDNA sequences, including post-mortem DNA damage and base misincorporations. It simulates base compositional bias due to the molecular tools used in library preparation, sequencing bias against GC-rich fragments and errors introduced by the sequencing platform. Gargammel gives researchers the opportunity to evaluate the robustness of various analyses to aDNA properties.
A Python-based read simulator whose error model corresponds to third-generation SMRT error characteristics, with default parameters based on public datasets. SimLoRD offers options to choose the read length distribution and to model error probabilities depending on the number of passes through the sequencer. According to its authors, the new error model makes SimLoRD the most realistic SMRT read simulator available.
Simulates targeted capture sequencing data. CapSim simulates the dynamics of probe hybridisation in silico to generate a set of fragments to be sequenced. It emulates the various stages of the sequencing process, including fragmentation, fragment capture, and sequencing. Users can modify experimental parameters at each of these stages in order to optimise sequencing protocols in silico.
Creates prokaryotic pseudo-genomes. Simulome provides options that can be used in combination to create mutated variants of the simulated genome, which allows for controlled testing of specific genomic conditions. It can be used in combination with real reads generated from next-generation sequencing (NGS) platforms, or with simulated reads. The tool allows users to analyze the effect of specific mutation types on a large scale, so that researchers can investigate the efficacy of analysis methodologies on a large number of genes that contain similar mutation events.
Profiles the characteristics of third-generation, single-molecule sequencing technologies and simulates reads accordingly. It uses context-dependent error profiles for realistic simulation (a learn-and-simulate approach). LongISLND learns from alignment data by recording base calls, with and without error, according to the sequencing contexts of the reference. The output format is also easily customizable.
Builds a list of prioritized representative genomes from either supervised or unsupervised clustering of related genomes. GGRaSP can reduce the loss of information by prioritizing medoids as representative genomes. It employs supervised methods such as specifying the number of clusters or the cluster cut-off distance to cluster genomes. This tool can serve for the identification of a cut-off value that separates the most closely related genomes from the more diverse genomes.
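The idea of a medoid as a cluster representative can be shown in a few lines. This is a minimal sketch, assuming clusters have already been formed and that pairwise genome distances are available as a matrix; the function name `medoid` and the toy distances are hypothetical and do not reflect GGRaSP's interface.

```python
def medoid(members, distance):
    """Return the cluster member with the smallest total distance
    to all members of the same cluster.

    members  -- list of genome indices belonging to one cluster
    distance -- square matrix (list of lists) of pairwise distances
    """
    return min(members, key=lambda m: sum(distance[m][o] for o in members))

if __name__ == "__main__":
    # Toy symmetric distance matrix for three genomes (made-up values).
    dist = [
        [0, 1, 4],
        [1, 0, 2],
        [4, 2, 0],
    ]
    # Total distances are 5, 3 and 6, so genome 1 is the medoid.
    print(medoid([0, 1, 2], dist))
```

Unlike a centroid, a medoid is always an actual member of the cluster, which is why it can serve directly as a representative genome.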
Generates realistic Illumina reads via a sequencing simulator. InSilicoSeq relies on kernel density estimators to model the read quality of real sequencing data. This software is suited to simulating metagenomic samples and to creating sequencing data from a single genome. It can model GC bias, insert size distribution and PHRED scores, and it features substitution, insertion and deletion errors.
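Sampling quality scores from a kernel density estimate can be sketched without external libraries. The example below is an illustration of Gaussian kernel density sampling in general, not InSilicoSeq's implementation: the function names, the bandwidth value and the 0 to 41 score range are assumptions made for the example.

```python
import random

def sample_from_kde(observed, bandwidth, n, rng):
    """Draw n values from a Gaussian KDE built on `observed`.

    Sampling from a Gaussian KDE is equivalent to picking an observed
    value at random and adding Gaussian noise with standard deviation
    equal to the bandwidth."""
    return [rng.gauss(rng.choice(observed), bandwidth) for _ in range(n)]

def sample_quality_scores(observed_quals, bandwidth, n, rng, q_min=0, q_max=41):
    """Sample integer Phred-like scores, clamped to [q_min, q_max]
    (the range is an assumed, illustrative choice)."""
    draws = sample_from_kde(observed_quals, bandwidth, n, rng)
    return [min(q_max, max(q_min, round(x))) for x in draws]

if __name__ == "__main__":
    rng = random.Random(7)  # fixed seed so the run is repeatable
    observed = [38, 37, 38, 35, 30, 38, 36]  # made-up scores at one read position
    print(sample_quality_scores(observed, bandwidth=1.5, n=10, rng=rng))
```

A realistic model would presumably fit one such distribution per read position, since quality tends to vary along a read; that extension is left out here to keep the sketch short.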
Allows users to simulate reads (Illumina or PacBio) based on a population with an arbitrarily complex transposable element (TE) landscape. SimulaTE is composed of four scripts that define the landscape, construct the population genome and finally simulate reads for pool-seq or for sequencing individuals. It can be used to estimate the suitability of both given genomic resources and software for TE identification.
Simulates the entire procedure of Nanopore sequencing. DeepSimulator generates both the raw electrical current signals and the nucleotide reads, reproducing the statistical patterns of real data. It can be used to create benchmark datasets for evaluating newly developed methods for Nanopore sequencing data analysis. This tool is composed of several modules: sequence generation, sequence feeding, and creation of the simulated current signals.
Captures the technology-specific features of Oxford Nanopore Technologies (ONT) data and allows adjustments as nanopore sequencing technology improves. NanoSim is a read simulator that works in two steps. It first performs a comprehensive alignment-based analysis and generates a set of read profiles, which serve as input to the simulation stage. The simulation stage then uses the model built in the previous step to produce in silico reads for a given reference genome.
A customized tool that generates synthetic next-generation sequencing reads, supporting read simulation for major letter-base sequencing platforms. CuReSim is developed in Java and is distributed as an executable jar file. Wrappers to integrate CuReSim in Galaxy are also available.
Takes the reference genome (in FASTA format) as input and outputs artificial FASTQ files in the Sanger format. It can accept Phred base quality scores from existing FASTQ files, and use them to simulate sequencing errors. Since the artificial FASTQs are derived from the reference genome, the reference genome provides a gold-standard for calling variants (Single Nucleotide Polymorphisms (SNPs) and insertions and deletions (indels)).
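One plausible way to turn Phred scores into simulated errors is sketched below. It relies on two standard facts: a Phred score Q corresponds to an error probability of 10^(-Q/10), and Sanger-format FASTQ encodes each score as the ASCII character with code Q + 33. The function names and the uniform choice of the replacement base are assumptions for illustration, not a description of this tool's algorithm.

```python
import random

def phred_to_error_prob(q):
    """Convert a Phred quality score to its error probability."""
    return 10 ** (-q / 10)

def decode_sanger_qualities(quality_string):
    """Decode a Sanger-format (Phred+33) FASTQ quality string."""
    return [ord(c) - 33 for c in quality_string]

def apply_quality_errors(sequence, quality_string, rng):
    """Substitute each base with the probability implied by its quality.

    The replacement base is drawn uniformly from the three other bases
    (an illustrative assumption)."""
    out = []
    for base, q in zip(sequence, decode_sanger_qualities(quality_string)):
        if rng.random() < phred_to_error_prob(q):
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)

if __name__ == "__main__":
    rng = random.Random(3)  # fixed seed so the run is repeatable
    # 'I' encodes Q40 (high confidence); '#' encodes Q2 (very low confidence).
    print(apply_quality_errors("ACGTACGT", "IIII####", rng))
```

Because the read is derived from a known reference position, any base that differs from the reference can be attributed to a simulated error, which is what makes the reference usable as a gold standard.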
Automates the identification of single nucleotide polymorphisms (SNPs) that discriminate between groups. mPSQed helps researchers construct multiplex pyrosequencing assays, which can serve a broader range of challenging diagnostic applications. This tool can group sequences and compute consensus sequences both for the alignment as a whole and for each group individually.