
Vecuum

A Java-based variant caller designed for detecting contamination-induced point mutations in genome sequencing data (WGS, WES, targeted hybrid capture, etc.). Vecuum specializes in identifying false variants caused by recombinant vector contamination, but it can also be applied to detect spurious calls from various external contaminants, such as xenogeneic genomes, mRNA (cDNA) libraries, or even pseudogenes. The key features of Vecuum are: (i) estimation of the genomic location of vector-contaminated regions; and (ii) identification of false variants originating from vector inserts. Tests on simulated and spike-in experimental data validated that Vecuum could detect 93% of vector contaminants and remove up to 87% of variant-like false calls with 100% precision. Application to public sequence datasets demonstrated its utility in detecting false variants resulting from various types of external contamination.

AFS / All-Food-Seq

Quantifies species composition in food. AFS uses metagenomic shotgun sequencing and sequence read counting to infer species proportions, relying on sophisticated read-alignment algorithms and published reference genomes to determine both composition and relative quantities. Using Illumina data from a reference sausage comprising four species, the study shows that AFS is independent of the sequencing assay and library-preparation protocol.
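The core idea of read counting for species quantification can be sketched as follows. This is an illustration of the general approach, not the AFS implementation; the function name and the optional genome-length normalisation are assumptions.

```python
def species_proportions(read_counts, genome_sizes=None):
    """Infer species proportions from per-species aligned-read counts.

    If genome sizes are given, counts are divided by genome length so
    that larger genomes do not inflate their apparent share.
    """
    if genome_sizes:
        weighted = {sp: n / genome_sizes[sp] for sp, n in read_counts.items()}
    else:
        weighted = dict(read_counts)
    total = sum(weighted.values())
    return {sp: w / total for sp, w in weighted.items()}

# Toy example: a four-species mixture, as in the reference sausage.
counts = {"pig": 60000, "cow": 25000, "horse": 10000, "sheep": 5000}
props = species_proportions(counts)
```

In practice the read counts would come from aligning shotgun reads against the published reference genomes of the candidate species.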

MCSC / Model-based Categorical Sequence Clustering

Provides an efficient way to decontaminate assemblies from non-model organisms by using the information contained in the sequences themselves. MCSC is a decontamination method based on a hierarchical clustering algorithm. It uses frequent patterns found in sequences to create clusters. It can effectively clean de novo assembled transcriptomes from two different types of samples: (i) golden nematode cysts highly contaminated with unknown soil-borne microorganisms and (ii) carrot weevils infected with a parasitic nematode.
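MCSC's actual model clusters categorical sequence patterns hierarchically; the sketch below only illustrates the underlying premise that sequences from the same organism share compositional signal. The k-mer profile and cosine similarity here are a simplified stand-in, not MCSC's algorithm.

```python
import math
from collections import Counter

def kmer_profile(seq, k=3):
    """k-mer frequency profile of a nucleotide sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def profile_similarity(p, q):
    """Cosine similarity of two k-mer profiles; sequences from the same
    genome tend to score high, contaminant sequences low."""
    num = sum(p[m] * q[m] for m in set(p) & set(q))
    den = (math.sqrt(sum(v * v for v in p.values()))
           * math.sqrt(sum(v * v for v in q.values())))
    return num / den

target_a = kmer_profile("ATATATATAT")
target_b = kmer_profile("ATATATATAA")   # same organism, one variant base
contam   = kmer_profile("GCGCGCGCGC")   # compositionally distinct contaminant
```

A decontamination pass could then group contigs whose profiles are mutually similar and discard the outlying cluster.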

HYSYS / Have You Swapped Your Samples

A statistical method to estimate the relatedness of samples and to test for sample swaps and contamination. The test uses the concordance of homozygous single-nucleotide polymorphisms between samples and is motivated by the observation that homozygous germline population variants rarely change in disease and are not affected by loss of heterozygosity. HYSYS offers several advantages: (i) it handles allele-specific copy-number changes and loss of heterozygosity; (ii) it visualises relationships between samples; (iii) it automatically models and flags unusual relationships; and (iv) it is easy to use.
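The concordance statistic described above can be sketched in a few lines. This is a minimal illustration of homozygous-SNP concordance, not the HYSYS implementation; genotype encoding and function names are assumptions.

```python
def homozygous_concordance(sample_a, sample_b):
    """Fraction of sites homozygous in sample_a at which sample_b has
    the same genotype. Genotypes are two-letter strings like 'AA', 'AT';
    keys are site identifiers shared between the two samples."""
    shared = [s for s in sample_a if s in sample_b]
    hom = [s for s in shared if len(set(sample_a[s])) == 1]
    if not hom:
        return float("nan")
    matches = sum(1 for s in hom if sample_b[s] == sample_a[s])
    return matches / len(hom)

a = {"chr1:100": "AA", "chr1:200": "GG", "chr2:50": "AT", "chr3:10": "CC"}
b = {"chr1:100": "AA", "chr1:200": "GA", "chr2:50": "AT", "chr3:10": "CC"}
# Of a's three homozygous sites, b agrees at two -> concordance 2/3.
```

Related samples from the same individual score near 1; swapped or heavily contaminated samples score markedly lower.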

Conpair / Concordance/Contamination of paired samples

A tool for detecting sample swaps and cross-individual contamination in whole-genome and whole-exome tumor-normal sequencing experiments. Conpair is a fast and robust method dedicated to human tumor-normal studies that performs concordance verification as well as cross-individual contamination-level estimation. Importantly, its estimate of contamination in tumor samples is not affected by copy-number changes, and it can detect contamination levels as low as 0.1%.
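One crude way to see how contamination becomes measurable: at sites where the individual's germline genotype is homozygous reference, any reads carrying the alternate allele hint at DNA from another individual. The sketch below is illustrative only; the published tool uses genotype likelihoods at preselected marker SNPs, not this naive average.

```python
def naive_contamination_signal(alt_fracs):
    """Mean alternate-allele fraction across sites where the matched
    normal is homozygous reference. A clean sample yields ~0 (only
    sequencing error); cross-individual contamination inflates it.
    Illustrative stand-in, not the Conpair estimator."""
    return sum(alt_fracs) / len(alt_fracs)

# Alt-allele fractions observed at five hom-ref germline sites.
alt_fracs = [0.012, 0.008, 0.010, 0.011, 0.009]
level = naive_contamination_signal(alt_fracs)
```

A real estimator must additionally model sequencing error, population allele frequencies, and the contaminant's own genotypes, which is why likelihood-based methods can resolve levels as low as 0.1%.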

decontam

Identifies and removes contaminants in metagenomics sequencing (MGS) data. decontam allows generation of accurate profiles of microbial communities. The tool implements two simple statistical tests based on widely reproduced signatures of contamination: (1) sequences from contaminating taxa are likely to have frequencies that inversely correlate with sample DNA concentration; and (2) sequences from contaminating taxa are likely to have higher prevalence in control samples than in true samples.
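The first signature, inverse correlation between taxon frequency and sample DNA concentration, can be sketched as a simple correlation test. This is a hand-rolled illustration under assumed names and a made-up threshold; the actual decontam package fits a different statistical model.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def looks_like_contaminant(freqs, dna_conc, threshold=-0.5):
    """Flag a taxon whose log-frequency falls as DNA concentration
    rises -- the expected signature of a constant contaminant input
    diluted by real sample DNA. Threshold is illustrative."""
    r = pearson([math.log(c) for c in dna_conc],
                [math.log(f) for f in freqs])
    return r < threshold

conc    = [1.0, 2.0, 4.0, 8.0]          # sample DNA concentrations
contam  = [0.08, 0.04, 0.02, 0.01]      # halves as concentration doubles
genuine = [0.20, 0.21, 0.19, 0.20]      # stable across concentrations
```

The second signature (higher prevalence in negative controls than in true samples) would be tested analogously by comparing detection rates between the two sample groups.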

ACDC / Automated Contamination Detection and Confidence estimation

Detects both known and de novo contaminants. ACDC was specifically developed to aid the quality control process of genomic sequence data. First, 16S rRNA gene prediction and the inclusion of ultrafast exact alignment techniques allow sequence classification using existing knowledge from databases. Second, reference-free inspection is enabled by the use of state-of-the-art machine learning techniques that include fast, non-linear dimensionality reduction of oligonucleotide signatures and subsequent clustering algorithms that automatically estimate the number of clusters. The latter also enables the removal of any contaminant, yielding a clean sample.

MIDAS / Metagenomic Intra-species Diversity Analysis System

Allows users to measure bacterial strain-level gene content, single-nucleotide polymorphisms (SNPs) and species abundance from shotgun metagenomes. MIDAS can categorize genetic variants into strains, enabling large-scale population-genetic analyses of metagenomes. The application provides a computational pipeline that combines taxonomic profiling with pan-genome and whole-genome alignment, permitting users to compare samples against more than 30,000 reference genomes.

PhagePhisher

Extracts relevant information from complex and mixed datasets. PhagePhisher improves the examination of bacteriophages, viruses, and virally related sequences in a range of environments. It can be used on a standard workstation with limited bioinformatics expertise. The tool can resolve virus-related sequences that may be obscured by or embedded in bacterial genomes. With PhagePhisher, the user needs to inspect only a handful of sequences rather than those from the full run.

CS-SCORE

Identifies host sequences contaminating metagenomic datasets. Validation results indicate that CS-SCORE is 2-6 times faster than the current state-of-the-art methods. Furthermore, the memory footprint of CS-SCORE is in the range of 2-2.5GB, which is significantly lower than other available tools. CS-SCORE achieves this efficiency by incorporating (1) a heuristic pre-filtering mechanism and (2) a directed-mapping approach that utilizes a novel sequence composition metric (cs-score). CS-SCORE is expected to be a handy 'pre-processing' utility for researchers analyzing metagenomic datasets.

Cookiecutter

A computational tool for rapid read extraction or removal according to a provided list of k-mers generated from a FASTA file. Cookiecutter is based on an implementation of the Aho-Corasick algorithm and is useful in routine processing of high-throughput sequencing datasets. It can be used both to remove undesirable reads and to extract reads matching a user-defined region of interest. The extraction functionality can be an effective alternative to, or a complete replacement for, read-mapping-based pipelines.
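The k-mer screening idea can be sketched with plain set membership. Cookiecutter itself uses Aho-Corasick multi-pattern matching for speed; the simplified version below, with assumed function names, shows only the removal/extraction logic.

```python
def build_kmers(sequences, k):
    """Collect every k-mer from the given target sequences
    (e.g. adapters or other unwanted material from a FASTA file)."""
    kmers = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmers.add(seq[i:i + k])
    return kmers

def filter_reads(reads, kmers, k, keep_matching=False):
    """Split reads by whether they contain any target k-mer.
    keep_matching=False mimics removal mode; True mimics extraction."""
    def hits(read):
        return any(read[i:i + k] in kmers for i in range(len(read) - k + 1))
    return [r for r in reads if hits(r) == keep_matching]

adapters = ["ACGTACGT"]
kmers = build_kmers(adapters, 4)
reads = ["TTTTACGTACGTTTTT", "GGGGGGGGGGGG"]
clean   = filter_reads(reads, kmers, 4)         # removal mode
matched = filter_reads(reads, kmers, 4, True)   # extraction mode
```

An Aho-Corasick automaton does the same matching in a single pass over each read regardless of how many patterns are loaded, which is what makes the approach fast at scale.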

vectorstrip

Removes vector sequences from the ends of one or more nucleotide sequences. vectorstrip writes the nucleotide sequences out again with any of a specified set of vector sequences removed from the 5' and 3' termini. The vector sequences to strip are typically provided in an input file. Each pair of 5' and 3' vector sequences is searched against each input sequence, allowing a specified maximum level of mismatches. Each 5' hit is paired with each 3' hit, and the resulting subsequences are output.
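The search-and-pair procedure described above can be sketched as follows. This is a simplified illustration (substitutions only, no indels) with assumed function names, not the EMBOSS implementation.

```python
def find_matches(seq, pattern, max_mismatch):
    """All start positions where pattern matches seq with at most
    max_mismatch substitutions (no indels, for simplicity)."""
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        mm = sum(1 for a, b in zip(seq[i:i + len(pattern)], pattern)
                 if a != b)
        if mm <= max_mismatch:
            hits.append(i)
    return hits

def strip_vector(seq, five_prime, three_prime, max_mismatch=1):
    """Pair each 5' hit with each downstream 3' hit and emit the
    subsequence between them, mirroring the pairing described above."""
    out = []
    for i5 in find_matches(seq, five_prime, max_mismatch):
        start = i5 + len(five_prime)
        for i3 in find_matches(seq, three_prime, max_mismatch):
            if i3 >= start:
                out.append(seq[start:i3])
    return out

# 5' vector GGGCCC, insert AAATTT, 3' vector CCCGGG:
inserts = strip_vector("GGGCCCAAATTTCCCGGG", "GGGCCC", "CCCGGG")
```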

UniVec

Identifies segments within nucleic acid sequences which may be of vector origin. UniVec is an efficient database: redundant subsequences have been eliminated so that it contains only one copy of every unique sequence segment from a large number of vectors. In addition to vector sequences, UniVec also contains sequences for the adapters, linkers, and primers commonly used in the process of cloning cDNA or genomic DNA. This enables contamination with these oligonucleotide sequences to be found during the vector screen.

Blobology

Extracts, from mixed DNA sequence data, subsets that correspond to individual species' genomes, thereby improving genome assembly. Blobology creates blobplots, or Taxon-Annotated GC-Coverage (TAGC) plots, to visualise the contents of genome assembly data sets as a quality-control (QC) step. The method can create a preliminary assembly and then collate GC content, read coverage and taxon annotation for it. The results are displayed with the Blobsplorer visualiser.
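The coordinates of a TAGC blobplot are just (GC content, read coverage) per contig; taxa are then overlaid as colours. A minimal sketch of computing those coordinates, with assumed function names (the real pipeline derives coverage from read mappings):

```python
def gc_content(seq):
    """Fraction of G/C bases in a contig sequence."""
    seq = seq.upper()
    return sum(1 for b in seq if b in "GC") / len(seq)

def tagc_points(contigs, coverage):
    """One (GC, coverage) point per contig -- the coordinates plotted
    in a Taxon-Annotated GC-Coverage (TAGC) blobplot."""
    return {name: (gc_content(seq), coverage[name])
            for name, seq in contigs.items()}

contigs = {"c1": "ATGCATGC", "c2": "GGGGCCCC"}
cov = {"c1": 30.0, "c2": 5.0}
points = tagc_points(contigs, cov)
```

Contaminant contigs typically form separate blobs because they differ from the target genome in GC content, coverage, or both, which is what makes the plot useful for screening.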

QC-Chain

A fast, accurate and holistic NGS data quality-control method. The tool combines user-friendly components for (1) quality assessment and trimming of raw reads using Parallel-QC, a fast read-processing tool, and (2) identification, quantification and filtration of unknown contamination to obtain high-quality clean reads. QC-Chain is optimized for parallel computation, so its processing speed is significantly higher than that of other QC methods. Experiments on simulated and real NGS data showed that reads with low sequencing quality could be identified and filtered, and that possible contamination sources could be identified and quantified de novo, accurately and quickly. Comparison between raw and processed reads also showed that downstream analyses (genome assembly, gene prediction, gene annotation, etc.) based on processed reads improved significantly in completeness and accuracy. In terms of processing speed, QC-Chain achieves a 7-8 times speed-up over traditional methods through parallel computation.