Duplication identification software tools | High-throughput sequencing data analysis
Segmental duplication or low-copy repeat. A segment of DNA >1 kb in size that occurs in two or more copies per haploid genome, with the different copies sharing >90% sequence identity. Segmental duplications are often variable in copy number and can therefore also be CNVs.
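The two thresholds in this definition can be written as a simple filter. The sketch below is only an illustration: the AlignedCopyPair record and its field names are hypothetical, and identity is taken as matching bases divided by aligned length, which is one of several possible conventions.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 1000   # ">1 kb in size" from the definition
MIN_IDENTITY = 0.90    # ">90% sequence identity" from the definition


@dataclass
class AlignedCopyPair:
    """Hypothetical summary of a pairwise alignment between two copies."""
    aligned_length_bp: int
    matching_bases: int

    @property
    def identity(self) -> float:
        # Assumed convention: matches / aligned length.
        if self.aligned_length_bp == 0:
            return 0.0
        return self.matching_bases / self.aligned_length_bp


def is_segmental_duplication(pair: AlignedCopyPair) -> bool:
    """Apply the size and identity thresholds to one pair of copies."""
    return pair.aligned_length_bp > MIN_LENGTH_BP and pair.identity > MIN_IDENTITY
```

For example, is_segmental_duplication(AlignedCopyPair(5000, 4800)) passes both thresholds, while a 5 kb pair at 80% identity does not.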
Enables discovering and genotyping structural variations using sequencing data. Genome STRiP performs discovery and genotyping of copy number variations (CNVs) by analyzing the data from many samples simultaneously in a population-based framework. The software can discover polymorphisms and produce genotypes. It can be used to find novel structural variations or to genotype known variants in new samples.
Identifies large segmental duplications and deletions. mrCaNaVaR is a copy number caller based on the analysis of whole-genome sequencing read depth. It includes features for masking common and tandem repeats, mapping reads in conjunction with the mrFAST/mrsFAST software and building search indexes. It can also be used to determine absolute copy numbers of genomic intervals.
A tool to generate local assemblies of breakpoints genome-wide. novoBreak is an algorithm used in cancer genomic studies to discover the breakpoints of structural variants (both somatic and germline) in whole-genome sequencing data. The assemblies produced by novoBreak are based on clusters of reads that share a set of short nucleotide stretches of length K (K-mers) present in a subject genome but not in the reference genome or control data.
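The K-mer idea described here can be sketched as a set difference. This is not novoBreak's implementation: the function names are hypothetical, reverse complements and sequencing errors are ignored, and everything is held in memory, which would not scale to whole genomes.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def novel_kmers(subject_reads, reference, control_reads, k):
    """K-mers seen in subject reads but in neither the reference nor the controls."""
    known = kmers(reference, k)
    for read in control_reads:
        known |= kmers(read, k)
    novel = set()
    for read in subject_reads:
        novel |= kmers(read, k) - known
    return novel


def reads_with_novel_kmers(subject_reads, novel, k):
    """Select the subject reads that carry at least one novel k-mer."""
    return [read for read in subject_reads if kmers(read, k) & novel]
```

Reads selected this way are the candidates one would cluster and assemble locally; the clustering and assembly steps are outside the scope of this sketch.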
Integrates prior knowledge about the characteristics of structural variants (SVs). forestSV is a statistical learning approach based on Random Forests (RFs) that improves SV discovery in high-throughput sequencing (HTS) data. This application offers high sensitivity and specificity coupled with the flexibility of a data-driven approach. It is particularly well suited to the detection of rare variants because it is not reliant on finding variant support in multiple individuals.
A high performance robust tool and library for working with SAM, BAM and CRAM sequence alignment files; the most common file formats for aligned next generation sequencing (NGS) data. Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time. Sambamba is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability.
A computational tool for copy number variant (CNV) detection in human whole-genome sequence data using read depth (RD) coverage. CNV detection is based on the Event-Wise Testing (EWT) algorithm. The read depth coverage is estimated in non-overlapping intervals (100 bp windows) across an individual genome based on the pileup generated by SAMtools.
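The windowing step described here can be illustrated as follows. This is a minimal sketch that assumes per-base depth is already available as (1-based position, depth) pairs for one chromosome; the function name is hypothetical, and the Event-Wise Testing step itself is not implemented.

```python
from collections import defaultdict


def windowed_depth(per_base_depth, window=100):
    """Mean depth per non-overlapping window.

    per_base_depth: iterable of (pos, depth) with 1-based positions.
    Positions missing from the input count as zero depth.
    Returns {window_start (1-based): mean depth}.
    """
    totals = defaultdict(int)
    for pos, depth in per_base_depth:
        totals[(pos - 1) // window] += depth
    return {idx * window + 1: total / window for idx, total in sorted(totals.items())}
```

A region covered at depth 30 over positions 1 to 100 and depth 15 over positions 101 to 200 yields {1: 30.0, 101: 15.0}.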
Allows structural variant (SV) discovery. LUMPY is a general probabilistic SV discovery framework that integrates multiple SV detection signals, including those generated from read alignments or prior evidence. The software is based upon a general probabilistic representation of an SV breakpoint that allows any number of alignment signals to be integrated into a single discovery process. It can detect SVs from multiple alignment signals in files from one or more samples. A simplified wrapper for standard analyses, LUMPY Express, is also available.
Identifies structural variant (SV) breakpoint junctions by clustering split reads. NanoSV first orders all mapped segments of each split read by their positions within the originally sequenced read. This tool uses split-read mapping to discover all defined types of SVs. It finishes by gathering evidence from different reads supporting the same candidate breakpoint junction. NanoSV is suited to Nanopore and Pacific Biosciences data.
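The ordering and evidence-gathering steps described here can be sketched as below. The Segment record and both function names are hypothetical, and junctions are matched by exact coordinates for simplicity; a real caller would cluster junctions within a tolerance because long-read alignments rarely agree to the base.

```python
from collections import Counter
from typing import NamedTuple


class Segment(NamedTuple):
    """Hypothetical record for one aligned piece of a split read."""
    read_start: int   # start offset within the sequenced read
    read_end: int
    chrom: str
    ref_start: int
    ref_end: int
    strand: str       # "+" or "-"


def candidate_junctions(segments):
    """Order segments by position in the read and pair up neighbours."""
    ordered = sorted(segments, key=lambda s: s.read_start)
    junctions = []
    for left, right in zip(ordered, ordered[1:]):
        # The reference coordinate where a segment ends or begins in read
        # order depends on the strand it aligned to.
        left_pos = left.ref_end if left.strand == "+" else left.ref_start
        right_pos = right.ref_start if right.strand == "+" else right.ref_end
        junctions.append(
            (left.chrom, left_pos, left.strand, right.chrom, right_pos, right.strand)
        )
    return junctions


def junction_support(reads):
    """Count how many reads support each candidate junction."""
    support = Counter()
    for segments in reads:
        support.update(set(candidate_junctions(segments)))
    return support
```

Two reads that both jump from chr1:10500 to chr1:20000 would give that junction a support of 2, consistent with a deletion between those positions.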
Allows identification of genomic rearrangements. GRIDSS is a modular software suite containing tools that perform genome-wide break-end assembly prior to variant calling using a positional de Bruijn graph assembler. The GRIDSS pipeline comprises three distinct stages: extraction, assembly, and variant calling. The software identifies non-template sequence insertions, microhomologies and large imperfect homologies, and supports multi-sample analysis.
Identifies segmental duplications (SDs) and reveals previously unknown ancient SDs in the human genome. SDquest proceeds by decomposing mosaic SDs into elementary SDs that are more amenable to evolutionary analysis. This software then builds the breakpoint graph of these mosaic SDs. It can also reveal SD-blocks and allows the analysis of cyclical components in the breakpoint graph of SDs.
Detects a wide range of structural variation within repetitive sequence, including tandem elements of transposable elements (TEs). ConTExt aims to establish the repeat family from which each read originates and the read's location within that family. This software exploits read-pair information, coverage and sequence polymorphism via arranged reads to pinpoint structures involving repetitive sequence.
A data processing pipeline for copy number variations and aberrations (CNVs and CNAs) from next generation sequencing (NGS) data. The package supplies functions to convert BAM files into read count matrices or genomic ranges objects, which are the input objects for cn.MOPS. It models the depths of coverage across samples at each genomic position. Therefore, it does not suffer from read count biases along chromosomes. Using a Bayesian approach, cn.MOPS decomposes read variations across samples into integer copy numbers and noise by its mixture components and Poisson distributions, respectively.
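The idea of explaining a read count by integer copy numbers with Poisson noise can be illustrated with a small posterior calculation. This is a simplified sketch, not the cn.MOPS model fit: it assumes the expected count for copy number 2 (lambda_cn2) is already known, uses a uniform prior over copy numbers, and gives copy number 0 a small hypothetical background rate (eps) so the Poisson rate stays positive.

```python
import math


def poisson_logpmf(k, rate):
    """Log of the Poisson probability of observing k events at the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def copy_number_posterior(read_count, lambda_cn2, copy_numbers=range(0, 7), eps=0.025):
    """Posterior over integer copy numbers for one sample at one segment.

    lambda_cn2: expected read count at copy number 2 (assumed known here).
    eps: hypothetical background fraction used as the rate for copy number 0.
    """
    def rate(cn):
        return lambda_cn2 * (cn / 2 if cn > 0 else eps)

    log_w = {cn: poisson_logpmf(read_count, rate(cn)) for cn in copy_numbers}
    # Normalise in log space to avoid underflow for large counts.
    top = max(log_w.values())
    total = sum(math.exp(v - top) for v in log_w.values())
    return {cn: math.exp(v - top) / total for cn, v in log_w.items()}
```

With lambda_cn2 = 100, a count near 50 is best explained by copy number 1 and a count near 150 by copy number 3. In the full method the rates and mixture weights are estimated jointly across samples rather than supplied by hand.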
A versatile variant caller for both DNA- and RNA-sequencing data. VarDict contains many features that distinguish it from other variant callers, including performance that scales linearly with depth, intrinsic local realignment, built-in de-duplication, detection of polymerase chain reaction (PCR) artifacts, support for both DNA- and RNA-seq, paired analysis to detect variant frequency shifts alongside somatic and loss of heterozygosity (LOH) variant detection, and structural variant (SV) calling. VarDict facilitates the application of next-generation sequencing in cancer research, enabling researchers to use one tool in place of a computationally expensive ensemble of tools.
Detects common DNA events (recurrent CNVs) across individuals. JointSLM is an algorithm extending the univariate shifting level model (SLM). This application can be used to identify small shifts in the signals, to identify the boundaries of common DNA events, and to analyze data from multiple tumor samples for the discovery of recurrent copy number alterations.
Allows users to detect all pairs of duplicate genes in a genome. GenomeHistory assists users in determining the degree of synonymous and non-synonymous divergence between each duplicate pair. It provides a method for analyzing the relationship between the number of genes in a family and the family’s rate of sequence evolution. The tool can also indicate the relationship between a gene's function and its propensity for duplication.
An accurate structural variation (SV) detection method, which compares the statistics of the mapped read pairs in tumor samples with isogenic normal control samples in a distinct asymmetric manner. COSMOS also prioritizes the candidate SVs using strand-specific read-depth information. Performance tests on modeled tumor genomes revealed that COSMOS outperformed existing methods in terms of F-measure.
Retrieves balanced and unbalanced forms of structural variation, such as deletions, tandem duplications, inversions and translocations. DELLY is based on a combination of short-range and long-range paired-end mapping and split-read analysis. It is useful for massively parallel sequencing (MPS) data from various sources, including deep whole-genome sequencing data and low-pass mate-pair sequencing data with longer inserts.
Marks duplicates and extracts discordant and split reads from SAM files. SAMBLASTER is able to mark duplicates in a single pass over a SAM file in which all alignments for the same read-id are grouped together. The software can extract reads directly from the SAM output of an aligner, such as BWA-MEM.
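The single-pass idea described here can be sketched with a streaming generator. This is not SAMBLASTER's algorithm in detail: the signature below uses only reference name, POS and strand of each mapped primary alignment, whereas a production tool would account for clipping at the 5' end. The function name is hypothetical; the flag bits are those defined in the SAM format.

```python
from itertools import groupby

FLAG_UNMAPPED = 0x4
FLAG_REVERSE = 0x10
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400
FLAG_SUPPLEMENTARY = 0x800
SKIP = FLAG_UNMAPPED | FLAG_SECONDARY | FLAG_SUPPLEMENTARY


def mark_duplicates(sam_lines):
    """Yield SAM lines, setting the duplicate flag on repeated read groups.

    Assumes all alignments with the same read id are adjacent, as in
    aligner output that has not been coordinate-sorted.
    """
    seen = set()
    lines = (line.rstrip("\n") for line in sam_lines)
    for is_header, chunk in groupby(lines, key=lambda l: l.startswith("@")):
        if is_header:
            yield from chunk  # header lines pass through unchanged
            continue
        records = (line.split("\t") for line in chunk)
        for _, group in groupby(records, key=lambda f: f[0]):
            group = list(group)
            # Simplified signature: (reference, position, strand) per mapped primary.
            signature = tuple(sorted(
                (f[2], int(f[3]), "-" if int(f[1]) & FLAG_REVERSE else "+")
                for f in group if not int(f[1]) & SKIP
            ))
            is_dup = bool(signature) and signature in seen
            seen.add(signature)
            for f in group:
                if is_dup:
                    f[1] = str(int(f[1]) | FLAG_DUPLICATE)
                yield "\t".join(f)
```

Because only the set of signatures is kept in memory, the input can be consumed as a stream, for example from an aligner's standard output.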