Index construction software tools | High-throughput sequencing data analysis
The FM-index plays an important role in DNA sequence alignment, de novo assembly (Simpson and Durbin, 2012) and compression (Cox et al., 2012). Fast and lightweight construction of the FM-index for a large dataset is key to these applications. Source text: Li, 2014.
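To make the role of the FM-index concrete, here is a minimal sketch of how one can be built and queried. It assumes a naive construction that sorts all suffixes, which is only practical for tiny inputs and is not the lightweight construction the tools below aim for. The class and method names (`FMIndex`, `count`) are illustrative and do not come from any of the listed tools.

```python
def bwt_from_suffix_array(text):
    """Return (bwt, suffix_array) for a text that ends with the sentinel '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character of a suffix is the character that precedes it;
    # for the suffix starting at 0, text[-1] wraps around to '$'.
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


class FMIndex:
    """Toy FM-index: the BWT plus the C array and full Occ tables."""

    def __init__(self, text):
        if "$" in text:
            raise ValueError("'$' is reserved as the sentinel")
        self.bwt, self.sa = bwt_from_suffix_array(text + "$")
        counts = {}
        for ch in self.bwt:
            counts[ch] = counts.get(ch, 0) + 1
        # C[c]: number of characters in the text strictly smaller than c.
        self.C, total = {}, 0
        for ch in sorted(counts):
            self.C[ch] = total
            total += counts[ch]
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {ch: [0] * (len(self.bwt) + 1) for ch in counts}
        for i, ch in enumerate(self.bwt):
            for c in self.occ:
                self.occ[c][i + 1] = self.occ[c][i] + (1 if c == ch else 0)

    def count(self, pattern):
        """Count occurrences of pattern by backward search."""
        lo, hi = 0, len(self.bwt)
        for ch in reversed(pattern):
            if ch not in self.C:
                return 0
            lo = self.C[ch] + self.occ[ch][lo]
            hi = self.C[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo
```

With this sketch, `FMIndex("GATTACAGATTACA").count("ATTA")` walks the pattern right to left and narrows a range of suffix-array rows at each step. Real implementations replace the full Occ tables with sampled or compressed structures to save memory.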
Aligns short reads and is geared toward mammalian re-sequencing. Bowtie uses a Burrows-Wheeler index based on the full-text minute-space (FM) index. Bowtie 2 aligns in two steps: an initial, ungapped seed-finding stage that benefits from the speed and memory efficiency of the FM-index, and a gapped extension stage that uses dynamic programming and benefits from the efficiency of single-instruction multiple-data (SIMD) parallel processing available on modern processors.
Allows users to interact with high-throughput sequencing data. SAMtools manipulates alignments in the SAM/BAM/CRAM formats: reading, writing, editing, indexing, viewing and converting between them. It limits the mapping quality of reads with excessive mismatches and applies base alignment quality to fix alignment errors. This tool can sort and merge alignments, remove polymerase chain reaction (PCR) duplicates or generate per-position information.
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community. It supports the creation and release of software in an open source spirit, and it integrates packages and tools for sequence analysis into a seamless whole. It is free of charge and open source.
A software suite for the comparison, manipulation and annotation of genomic features in browser extensible data (BED) and general feature format (GFF) files. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. The BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions about large genomic datasets.
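The core operation behind this kind of feature comparison is interval intersection. The sketch below is only an illustration, assuming BED-style coordinates (0-based start, exclusive end) held as `(chrom, start, end)` tuples. The function name `intersect` is hypothetical, and the quadratic per-chromosome scan is far simpler than what BEDTools itself does.

```python
from collections import defaultdict


def intersect(features_a, features_b):
    """Return the overlapping part of every pair of features, one from each set.

    Coordinates are assumed to be BED-style: 0-based start, exclusive end.
    """
    # Group the second set by chromosome so only same-chromosome pairs are compared.
    by_chrom = defaultdict(list)
    for chrom, start, end in features_b:
        by_chrom[chrom].append((start, end))

    overlaps = []
    for chrom, a_start, a_end in features_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            start = max(a_start, b_start)
            end = min(a_end, b_end)
            if start < end:  # half-open intervals that merely touch do not overlap
                overlaps.append((chrom, start, end))
    return overlaps
```

For example, intersecting `("chr1", 100, 200)` with `("chr1", 150, 250)` yields `("chr1", 150, 200)`, while a feature starting exactly at position 200 produces no overlap.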
Allows users to conduct large-scale comparisons of their results with thousands of reference datasets and genome annotations in seconds. GIGGLE helps identify novel and unexpected relationships among local datasets as well as the vast amount of publicly available genomics data. It uses a temporal indexing scheme to create a single index of the genome intervals from thousands of annotations and genomic data files.
A high performance robust tool and library for working with SAM, BAM and CRAM sequence alignment files; the most common file formats for aligned next generation sequencing (NGS) data. Sambamba is a faster alternative to samtools that exploits multi-core processing and dramatically reduces processing time. Sambamba is being adopted at sequencing centers, not only because of its speed, but also because of additional functionality, including coverage analysis and powerful filtering capability.
Deals with high-throughput data from RNA structure probing and post-transcriptional modification mapping. RNA Framework is a modular toolkit. Its main features are (i) automatic reference transcriptome creation, (ii) automatic read preprocessing (adapter clipping and trimming) and mapping, (iii) scoring and data normalization and (iv) accurate RNA folding prediction by incorporating structural probing data. It can perform not only RNA structure analysis, but also analysis of experiments that map RNA post-transcriptional modifications (such as m1A-seq, m6A-seq, 2OMe-seq, and Pseudo-seq).
Examines epigenomic and transcriptomic next generation sequencing (NGS) data. Octopus-toolkit can be used for antibody- or enzyme-mediated experiments and studies for the quantification of gene expression. It can accelerate the data mining of public epigenomic and transcriptomic NGS data for basic biomedical research. This tool provides a private and a public mode: one to process the user’s own data, and the other to analyze public NGS data by retrieving raw files from the GEO database.
Assists users in manipulating high-throughput sequencing (HTS) data and formats. Picard is a Java toolkit that provides a set of command line tools. It comprises Java-based utilities that manipulate SAM files, and a Java API for creating new programs that read and write SAM files. Both the SAM text format and the SAM binary (BAM) format are supported. It also works with next generation sequencing (NGS) data.
Indexes position sorted files in TAB-delimited formats such as GFF, BED, PSL, SAM and SQL export, and quickly retrieves features overlapping specified regions. Tabix features include few seek function calls per query, data compression with gzip compatibility and direct FTP/HTTP access.
A software suite for programmers and end users that facilitates research analysis and data management using BAM files. BamTools provides both the first C++ API publicly available for BAM file support as well as a command-line toolkit. The BamTools C++ API/library has been successfully integrated into a variety of applications. It provides the BAM file support for several utilities in the BEDtools suite.
A tool for constructing the FM-index for a collection of DNA sequences. ropeBWT2 works by incrementally inserting one or multiple sequences into an existing pseudo-BWT position by position, starting from the end of the sequences. This algorithm can largely be considered a mixture of BCR and the dynamic FM-index. Nonetheless, ropeBWT2 is unique in that it may implicitly sort the input into reverse lexicographical order (RLO) or reverse-complement lexicographical order (RCLO) while building the index.
Accelerates the locating operation of FM-indexes for genomic data. FMtree is a locating algorithm that builds a conceptual multiway tree. By using this multiway tree, FMtree is able to calculate the non-sampled positions block by block. It can also be applied to any implementation of FM-indexes without modification. This algorithm is cache-friendly and avoids many unnecessary operations.
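For context, the sketch below shows the conventional locating step that algorithms such as FMtree set out to speed up. It is not FMtree itself. It assumes a toy index with a sampled suffix array, where each non-sampled row is resolved by walking LF-mapping one position at a time. The function names (`build_index`, `locate`) and the `sample_rate` parameter are illustrative.

```python
def build_index(text, sample_rate):
    """Build a toy FM-index that keeps only every sample_rate-th text position."""
    text += "$"  # sentinel, assumed absent from the input
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # C[c]: number of characters in the text strictly smaller than c.
    C, total = {}, 0
    for ch in sorted(set(bwt)):
        C[ch] = total
        total += bwt.count(ch)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] for ch in C}
    for bwt_char in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (1 if ch == bwt_char else 0))
    # Keep suffix-array entries whose text position is a multiple of sample_rate.
    sampled = {row: pos for row, pos in enumerate(sa) if pos % sample_rate == 0}
    return bwt, C, occ, sampled


def locate(pattern, bwt, C, occ, sampled):
    """Return the sorted text positions at which pattern occurs."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):  # backward search
        if ch not in C:
            return []
        lo, hi = C[ch] + occ[ch][lo], C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    positions = []
    for row in range(lo, hi):
        steps = 0
        # Walk LF-mapping (one text position to the left per step) until a
        # sampled row is reached; position 0 is always sampled, so this ends.
        while row not in sampled:
            ch = bwt[row]
            row = C[ch] + occ[ch][row]
            steps += 1
        positions.append(sampled[row] + steps)
    return sorted(positions)
```

Each occurrence may need up to `sample_rate - 1` dependent LF steps here, and every step is a random memory access. That per-position cost is what a block-by-block approach aims to reduce.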
A program that can chop a BAM index (BAI) file into small pieces. The program outputs a list of BAI files each indexing a specified genomic interval. The output files are much smaller in size but maintain compatibility with existing software tools. We show how preprocessing BAI files with chopBAI can lead to a reduction of I/O by more than 95% during the analysis of 10Kbp genomic regions, eventually enabling the joint analysis of more than 10,000 individuals. As sequencing is becoming more and more common, chopBAI will be equally useful for analyzing large sequencing cohorts of other species where the BAI indexing scheme allows for fast access to small subsets of reads.
A highly hardware-acceleration friendly k-ordered FM-index for exact string matching, overlap graph construction for de novo assembly, and more. sBWT is a Burrows–Wheeler transform (BWT) based fast indexer/aligner specialized in parallelized indexing and searching for next-generation sequencing data. In our tests, the implementation achieves significant speedups in indexing and searching compared to other BWT based tools and can be applied to a variety of domains.
Allows users to reformat and filter bioinformatics files. JVARKIT aims to simplify the grammar used to filter bioinformatics files, making it possible to write a loop or a custom function. JVARKIT is a set of more than 100 Java-based tools for bioinformatics.
Stores the k-words corresponding to the edges of a de Bruijn subgraph in a compact manner. kFM-index is a data structure that enables random access to vertices and edges. It avoids the direct storage of k-words and pointers, which makes it compact. The software is intended to help users represent the k-mer composition of sequences.
Provides utility functions implementing commonly used genomic operations. bedr is a BED-operations framework that offers a formal R interface to BEDTools and BEDOPS. In addition to sort operations, it supports the identification of overlapping regions, which can be collapsed to avoid downstream analytical challenges. This method is compatible with the ubiquitous BED tools paradigm and integrates with R-based workflows.
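Collapsing overlapping regions is a small sort-and-sweep operation. The sketch below shows it in Python rather than through the bedr R interface, assuming regions are `(chrom, start, end)` tuples with BED-style coordinates. The function name `merge_regions` is hypothetical, and merging regions that merely touch is a choice made for this sketch.

```python
def merge_regions(regions):
    """Collapse overlapping or touching regions on the same chromosome."""
    merged = []
    # Sorting by (chrom, start, end) puts candidates for merging next to each other.
    for chrom, start, end in sorted(regions):
        last = merged[-1] if merged else None
        if last is not None and last[0] == chrom and start <= last[2]:
            # Extend the previous region instead of starting a new one.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]
```

Sorting first means each region only has to be compared with the most recently emitted one, so the sweep itself is linear in the number of regions.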
Open-source software written in Clojure, a functional programming language that runs on the Java Virtual Machine. Cljam can process and analyze SAM/BAM files in parallel and at high speed. The execution time with cljam is almost the same as with SAMtools. The Clojure code of cljam has fewer lines than SAMtools and Picard, which are similar tools, and offers equivalent performance.
Grabs arbitrary lines from a BGZIP-compressed file. Grabix provides random access into text files that have been compressed with bgzip. This tool creates its own index of the file; users can then extract arbitrary lines from the file with the grab command or choose random lines with the random command.
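The idea of a line index can be sketched on an uncompressed file. The example below assumes plain text rather than bgzip blocks, so it shows the index-then-seek principle but not Grabix's actual index format. The function names `build_line_index` and `grab` are illustrative.

```python
def build_line_index(path):
    """Record the byte offset at which every line of the file starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)
            position += len(line)
    return offsets


def grab(path, offsets, first, last):
    """Return lines first..last (1-based, inclusive) without scanning earlier lines."""
    if not 1 <= first <= last <= len(offsets):
        raise ValueError("line range out of bounds")
    with open(path, "rb") as handle:
        handle.seek(offsets[first - 1])  # jump straight to the first wanted line
        return [
            handle.readline().decode().rstrip("\r\n")
            for _ in range(last - first + 1)
        ]
```

The index costs one pass over the file. After that, a request such as `grab(path, offsets, 3, 5)` reads only the requested lines.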
Provides access to high-throughput sequencing (HTS) data formats. Htsjdk does not support the latest Variant Call Format specifications, for example VCFv4.3 and BCFv2.2. It can be useful for manipulating data in HTS formats.
Indexes a fasta file database. dbifasta needs a flat file database of one or more files, and builds EMBL CD-ROM format index files. The resulting index-file format is used by the software on the EMBL database CD-ROM distribution and by the Staden package in addition to EMBOSS, and appears to be the most generally used and publicly available index file format for these databases.
Provides multi-threaded sort/merge tools for BAM files. NovoSort reduces run times through multi-threading and by combining sort and merge in one step. It uses a stable sort/merge algorithm that will not change the order of alignments with the same sort key and can optionally create a BAM index file. It is a two-phase sort-merge: the first phase sorts as many reads as possible in memory and then writes segments of sorted records to temporary disk files. The second phase merges the sorted segments to produce the final sorted file.
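The two-phase scheme described above can be sketched as follows. This is a single-threaded illustration, assuming records are text lines and the sort key is supplied by the caller. It is not NovoSort's implementation, and the names `external_sort` and `max_in_memory` are hypothetical.

```python
import heapq
import os
import tempfile


def _spill(sorted_chunk):
    """Write one sorted chunk to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".chunk")
    with os.fdopen(fd, "w") as out:
        out.writelines(line + "\n" for line in sorted_chunk)
    return path


def _read(path):
    """Stream the lines of a spilled chunk back in order."""
    with open(path) as handle:
        for line in handle:
            yield line.rstrip("\n")


def external_sort(lines, key, max_in_memory):
    """Yield lines in sorted order while holding at most max_in_memory in RAM."""
    paths, chunk = [], []
    # Phase 1: sort bounded chunks in memory and spill each one to disk.
    for line in lines:
        chunk.append(line)
        if len(chunk) >= max_in_memory:
            chunk.sort(key=key)  # list.sort is stable
            paths.append(_spill(chunk))
            chunk = []
    if chunk:
        chunk.sort(key=key)
        paths.append(_spill(chunk))
    # Phase 2: k-way merge of the sorted chunk files.
    try:
        yield from heapq.merge(*(_read(p) for p in paths), key=key)
    finally:
        for p in paths:
            os.remove(p)
```

Because chunks are spilled in input order and ties in the merge are resolved in favour of earlier chunks, records with equal keys keep their original relative order, which is the stability property the description mentions.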
Generates a FASTA index for FASTA files. Fastahack is an application for indexing and extracting sequences and subsequences from FASTA files. The included library provides a FASTA reader and indexer that can be embedded into applications that would benefit from directly reading subsequences from FASTA files. This resource uses the C function fseek64 to extract sequences and subsequences. This permits the fastest possible extraction and makes fastahack a useful method for bioinformaticians who need to quickly extract many subsequences from a reference FASTA sequence.
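Indexed subsequence extraction comes down to recording where each sequence starts and how its lines are laid out, then seeking directly to the requested bases. The sketch below uses Python's `seek` in place of the C call and assumes every line of a sequence except the last has the same length. The function names `build_fasta_index` and `fetch` and the dictionary layout are illustrative, not fastahack's API.

```python
def build_fasta_index(path):
    """Map each sequence name to its length, byte offset and line layout."""
    index = {}
    name = None
    offset = 0
    with open(path, "rb") as handle:
        for raw in handle:
            line = raw.rstrip(b"\r\n")
            if raw.startswith(b">"):
                # The name is the first word of the header line.
                name = line[1:].split()[0].decode()
                index[name] = {
                    "length": 0,
                    "offset": offset + len(raw),  # first base of the sequence
                    "line_bases": 0,
                    "line_width": 0,
                }
            elif name is not None and line:
                entry = index[name]
                if entry["line_bases"] == 0:
                    entry["line_bases"] = len(line)  # bases per line
                    entry["line_width"] = len(raw)   # bytes per line, with newline
                entry["length"] += len(line)
            offset += len(raw)
    return index


def fetch(path, index, name, start, end):
    """Return bases [start, end) of a sequence, using 0-based coordinates."""
    entry = index[name]
    if not 0 <= start <= end <= entry["length"]:
        raise ValueError("coordinates out of range")
    if start == end:
        return ""

    def file_position(i):
        full_lines, remainder = divmod(i, entry["line_bases"])
        return entry["offset"] + full_lines * entry["line_width"] + remainder

    with open(path, "rb") as handle:
        handle.seek(file_position(start))
        raw = handle.read(file_position(end - 1) - file_position(start) + 1)
    # Drop the line breaks that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()
```

Only the bytes covering the requested span are read, so extraction time depends on the length of the subsequence rather than on the size of the reference file.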
Allows manipulation of SAM, FASTA, binary variant call (BCF) and compressed indexed tab-delimited (tabix) files by providing an interface to the “samtools”, “bcftools”, and “tabix” utilities. Rsamtools is an R package that also offers facilities for file access such as record counting, index file creation, and filtering to create new files containing subsets of the original. It can be used as a starting point for creating R objects suitable for a diversity of workflows.
Allows users to work with PacBio BAM files and their associated indices. pbbam consists mainly of a core C++ library for creating, querying and editing files that conform to the PacBio BAM specification. The software can also be configured to support additional languages and command-line utilities, and it can be integrated into CMake-based projects.
PhD in Neurosciences. I worked for 8 years on the brain and its diseases. I then specialized in bioinformatics (NGS, epigenetics) and worked at CEA and GENETHON before joining OMICX to help the OMICtools community.
Gene fusion detection in Plants
Fusion transcripts (i.e., chimeric RNAs) resulting from gene fusions are well known in humans, but in plants this phenomenon has not yet been explored. We plan to discover fusion transcripts and gene fusions in different types of plants using RNA-Seq datasets. We also plan to study the mechanism of gene fusion formation and the significance of fusions in plants.
Whole genome and transcriptome sequencing data analysis of Plants
In this era of next-generation sequencing (NGS), a huge amount of sequencing data is available in the public domain. Extracting novel findings from these datasets is a major challenge for a computational biologist. We are interested in analyzing whole genome and transcriptome sequencing data from different plants to extract useful information with the help of bioinformatics tools. Currently, we are planning to study the gene clusters of secondary metabolite pathways in different plants.
Development of webservers, databases and computational pipelines for plant research
Developing databases is necessary to compile and share information with the scientific community. We are dedicated to developing useful databases and webservers for plant research.
Another area of interest is to develop automated pipelines and tools for the analysis of high throughput genomics data, generated by NGS technologies.
Professional & Academic Background
Staff Scientist II (May 2017-present): National Institute of Plant Genome Research (NIPGR), New Delhi, India
Postdoctoral Research Associate (2015-2017): University of Virginia, Charlottesville, VA, USA
Research Scientist (2014-2015): Sir Ganga Ram Hospital, New Delhi, India
PhD Bioinformatics (2009-2014): Bioinformatics Centre, Institute of Microbial Technology (IMTECH), Chandigarh under Jawaharlal Nehru University (JNU), New Delhi, India
M.Sc. Life Sciences (2007-2009): Jawaharlal Nehru University (JNU), New Delhi, India
B.Sc. Biotechnology (2004-2007): Jamia Millia Islamia (JMI), New Delhi, India
Awards and Fellowships
Junior and Senior Research Fellowship (2009-2014): Council of Scientific and Industrial Research (CSIR), New Delhi, India
GATE (Graduate Aptitude Test in Engineering): Qualified in years 2008 and 2009
Scientific Contributions/ Recognitions
Associate editor: Journal of Translational Medicine.
Editorial board member: Theoretical Biology and Medical Modelling.
Reviewer: PLoS One, BMC Genomics, BMC Bioinformatics, BMC Biology, BMC Biotechnology, Frontiers in Physiology and several other journals.
Web Resources/ Databases (Developed/ Contributed)
A Platform for Designing Genome-Based Personalized Immunotherapy or Vaccine against Cancer (http://www.imtech.res.in/raghava/cancertope/)
GenomeABC: A webserver for benchmarking of genome assemblers. (http://crdd.osdd.net/raghava/genomeabc/).
Genomics web portal page. (http://crdd.osdd.net/raghava/genomesrs/).
Map/Alignment module of CancerDr: Cancer Drug Resistance Database. (http://crdd.osdd.net/raghava/cancerdr/).
Short reads and contigs alignment module of PCMDB: Pancreatic cancer methylation database. (http://crdd.osdd.net/raghava/pcmdb/).
Burkholderia sp. SJ98 database. (http://crdd.osdd.net/raghava/genomesrs/burkholderia/).
Rhodococcus imtechensis RKJ300 database. (http://crdd.osdd.net/raghava/genomesrs/rkj300/).
Genotrick: A pipeline for whole genome assembly and annotation of Genomes (http://crdd.osdd.net/raghava/genomesrs/genotrick/)
Development of Debian packages in OSDDlinux: A Customized Operating System for Drug Discovery. (http://osddlinux.osdd.net/).
A Web-Based Platform for Designing Vaccines against Existing and Emerging Strains of Mycobacterium tuberculosis. (http://crdd.osdd.net/raghava/mtbveb/).