Extracts, from mixed DNA sequence data, subsets that correspond to single species’ genomes, thereby improving genome assembly. Blobology creates blobplots, also called Taxon-Annotated GC-Coverage (TAGC) plots, to visualise the contents of genome assembly data sets as a Quality Control (QC) step. The method builds a preliminary assembly, then computes and collates GC content, read coverage and taxon annotation for that assembly. The results are displayed with the Blobsplorer visualiser.
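The per-contig values behind a TAGC plot can be sketched as follows. This is a minimal illustration, not Blobology's own code: the function names, the input dictionaries (contig sequences, mapped bases per contig, taxon per contig) and the "no-hit" label are all assumptions made for the example.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def mean_coverage(mapped_bases: int, contig_length: int) -> float:
    """Average read depth: total mapped bases divided by contig length."""
    return mapped_bases / contig_length if contig_length else 0.0


def tagc_rows(contigs: dict, mapped_bases: dict, taxa: dict) -> list:
    """Collate GC, coverage and taxon annotation into one row per contig.

    All three inputs are hypothetical dictionaries keyed by contig name.
    """
    rows = []
    for name, seq in contigs.items():
        rows.append({
            "contig": name,
            "gc": gc_fraction(seq),
            "coverage": mean_coverage(mapped_bases.get(name, 0), len(seq)),
            "taxon": taxa.get(name, "no-hit"),  # label for unannotated contigs
        })
    return rows
```

Plotting `gc` against `coverage` and colouring points by `taxon` gives the blobplot layout: contigs from different organisms tend to separate into distinct clouds.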
Decontaminates genomes. ProDeGe is a program that combines homology-based and sequence composition-based (feature-based) approaches to separate contaminant sequences from the target genome draft, classifying each sequence as either clean or contaminant.
Detects contaminant DNA by exploring oligonucleotide composition similarity between assembly contigs or scaffolds. PhylOligo generates an all-by-all contig distance matrix and regroups contigs by compositional similarity. It then extracts DNA segments with homogeneous oligonucleotide composition from the genome assembly.
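An all-by-all compositional distance matrix of this kind can be sketched as below. This is an illustration under stated assumptions rather than PhylOligo's implementation: the word length k=4, the Euclidean distance, the absence of reverse-complement merging and the function names are all choices made for the example.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq: str, k: int = 4) -> list:
    """Relative frequency of every A/C/G/T word of length k in seq."""
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing N or other symbols
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in words]


def distance_matrix(contigs: dict, k: int = 4):
    """Return contig names and a symmetric matrix of profile distances."""
    names = list(contigs)
    profiles = [kmer_profile(contigs[n], k) for n in names]
    n = len(names)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sqrt(sum((a - b) ** 2
                         for a, b in zip(profiles[i], profiles[j])))
            matrix[i][j] = matrix[j][i] = d
    return names, matrix
```

Contigs with small mutual distances share a similar oligonucleotide composition and can be grouped together; a contig far from the main group is a candidate contaminant.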
Provides an efficient way to decontaminate assemblies from non-model organisms by using the information contained in the sequences themselves. MCSC is a decontamination method based on a hierarchical clustering algorithm that uses frequent patterns found in sequences to create clusters. It can effectively clean de novo assembled transcriptomes from two different types of samples: (i) golden nematode cysts highly contaminated with unknown soil-borne microorganisms and (ii) carrot weevils infected with a parasitic nematode.
Classifies VecScreen matches as true or false automatically and deterministically. VecScreen compares submitted nucleotide sequence(s) as a query to a database of vector segments. Some of the higher-scoring matches it produces may be false positives for reasons that can be detected systematically. VecScreen_plus_taxonomy aims to identify these false positives and discriminate them from true positive vector matches.
Detects sequences of different phylogenetic origins, as in the published genome of the kelp Saccharina latissima. Taxoblast is an analysis pipeline with a graphical user interface, based on multiple blastn searches with small sequence fragments. It provides outputs compatible with most spreadsheet programs, which makes it easy to combine results with other sources of information or integrated alignment-free tools.
Classifies sequences using a decision tree. SIDR is a supervised machine learning method that identifies target and contaminant DNA in de novo genome assembly projects. An advantage of decision trees is that they require no data transformation or normalization and produce simple, easily interpretable relationships. The software can decontaminate sequences and is useful for non-model organisms.
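The idea of a decision tree working directly on raw features can be sketched with a single-split tree (a "stump"). This is a simplified illustration, not SIDR's code: the choice of GC fraction and coverage as features, the Gini impurity criterion, the function names and the toy training data are all assumptions made for the example, and a real tree would usually have more than one split.

```python
from collections import Counter


def gini(labels: list) -> float:
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(rows: list, labels: list) -> tuple:
    """Find the (feature, threshold) split with the lowest weighted Gini."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between adjacent values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for r, lab in zip(rows, labels) if r[f] <= thr]
            right = [lab for r, lab in zip(rows, labels) if r[f] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, thr,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, f, thr, left_label, right_label = best
    return f, thr, left_label, right_label


def predict(stump: tuple, row: tuple) -> str:
    """Classify one contig by the side of the threshold it falls on."""
    f, thr, left_label, right_label = stump
    return left_label if row[f] <= thr else right_label


# Hypothetical training contigs: (GC fraction, mean coverage).
train = [(0.35, 50), (0.36, 55), (0.65, 5), (0.66, 6)]
train_labels = ["target", "target", "contaminant", "contaminant"]
stump = fit_stump(train, train_labels)
```

Note that GC fraction (0 to 1) and coverage (tens of reads) sit on very different scales, yet no rescaling is needed: each split compares one feature against one threshold, which is also what makes the learned rule easy to read.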