Data warehousing bioinformatics tools | Genome annotation
To facilitate the integration and querying of genomics data, a number of generic data warehousing frameworks have been developed. They differ in their design and capabilities, as well as their intended audience.
Integrates BioMart data resources with data analysis software in Bioconductor. biomaRt can annotate a wide range of gene or gene product identifiers with information such as gene symbol, chromosomal coordinates, Gene Ontology and Online Mendelian Inheritance in Man (OMIM) annotation. Furthermore, biomaRt enables retrieval of genomic sequences and single nucleotide polymorphism information, which can be used in data analysis. Fast and up-to-date data retrieval is possible as the package executes direct SQL queries to the BioMart databases. The biomaRt package tightly integrates large, public or locally installed BioMart databases with data analysis in Bioconductor, creating a powerful environment for biological data mining.
Gives access to many free software tools for sequence analysis. EMBOSS aims to serve the molecular biology community and supports the creation and release of software in an open source spirit. It integrates tools for sequence analysis into a seamless whole and is available free of charge.
Allows construction of online genome databases. Tripal is an open source toolkit built on the Drupal content management system that stores data in the recommended standardized biological database schema, Chado. This software consists of a core set of modules that encompass a variety of common biological and genetic data types, such as organisms, sequence features and genetic markers, which include attributes such as genus and species, nucleotide residues and marker locations, respectively.
Provides a framework for genomic data storage, display and analysis, and offers integration of existing and novel genome analysis tools. xGDB furnishes a packaged solution to many types of research applications. In particular, it is well suited for small to moderately sized research groups desiring local access to genomic data or an out-of-the-box system for analyzing emerging data. xGDB differs from and is complementary to database systems such as GMOD, EnsEMBL, and GenBank.
Provides user interface and database connection code for annotation data packages using SQLite data storage. AnnotationDbi is the virtual base class for all annotation packages. It contains a database connection and is meant to be the parent for a set of classes in the Bioconductor annotation packages. These annotation packages fall into four types: organism-level, platform-level, homology-level and system-biology-level.
Provides extensive automatically generated and configurable RESTful web services. InterMine is especially designed to integrate and analyse complex biological data. It can be used from other applications to find and filter data, to export it in a flexible and structured way, and to upload, use, manipulate and analyse lists. The tool enables the user to create biological databases accessed by sophisticated web query tools.
Provides an easy to use and intuitive way to explore and store genome data and gene predictions. Badger can be used as a central hub for genome projects allowing project members to search and access data as and when it is available. The database can hold multiple species, each with multiple genome versions and each genome with multiple gene prediction sets.
Provides a plug-in for Pathway Tools, an integrated systems biology software package for creating, maintaining and querying Pathway/Genome Databases (PGDBs). ACIB PGDB Toolbox is fully integrated into the graphical user interface (GUI) and menu. It extends the application’s functionality with the ability to create multiple sequence alignments, systematically annotate insertion sequence (IS) elements and analyse their activity with cross-species comparison tools. Microarray probes can be automatically mapped to target genes, and expression data obtained with these arrays can be transformed into the input formats needed to visualize them in the various omics viewers of Pathway Tools. The plug-in API itself allows developers to integrate their own functions into the Pathway Tools menu.
Offers comprehensive support ranging from the management of personal digital research resources to their sharing in open-access neuroinformatics databases. Concierge introduces a desktop application for managing personal digital research resources. The metadata stored in this application can also be uploaded (together with the primary resource) to neuroinformatics databases without additional effort. This interaction between personal and shared neuroinformatics databases is expected to enhance the circulation of digital research resources.
Enables the management of sample metadata and next-generation sequencing (NGS) data pre-processing, quality assessment and visualization. MAV-seq allows users to (i) manage research, experiments, samples and NGS metadata; (ii) control access to the centralized and distributed storage and high-performance computing resources; and (iii) automate and standardize quality checking, pre-processing and analysis of NGS data, with visualization and report generation for the results obtained.
Enables storage and management of large volumes of single nucleotide polymorphism (SNP) data. TheSNPpit is a database system that can handle panels of any size, including those derived from whole genome scans. The software allows subsets of the original data to be defined at essentially no storage cost, and a subset can then be exported simply by specifying its name. Exports can be integrated into pipelines.
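The idea of a named subset that costs almost no storage can be illustrated with a minimal Python sketch. This is not TheSNPpit's schema or API: the `SnpPanel` class, its method names and the `rs1`-style identifiers are all hypothetical. The assumption is simply that a subset is stored as a list of column indices into the original panel, so genotype data is never copied until export.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class SnpPanel:
    """Hypothetical in-memory panel: one genotype call per sample per SNP."""
    snp_ids: List[str]
    genotypes: Dict[str, List[int]]  # sample name -> calls, ordered like snp_ids
    subsets: Dict[str, List[int]] = field(default_factory=dict)

    def define_subset(self, name: str, wanted_ids: Sequence[str]) -> None:
        # Store only column indices; the genotype data itself is not duplicated.
        position = {snp_id: i for i, snp_id in enumerate(self.snp_ids)}
        self.subsets[name] = [position[snp_id] for snp_id in wanted_ids]

    def export(self, name: str) -> Dict[str, List[int]]:
        # Materialize the subset only when it is requested by name.
        columns = self.subsets[name]
        return {
            sample: [calls[c] for c in columns]
            for sample, calls in self.genotypes.items()
        }


panel = SnpPanel(
    snp_ids=["rs1", "rs2", "rs3"],
    genotypes={"s1": [0, 1, 2], "s2": [2, 2, 0]},
)
panel.define_subset("pair", ["rs3", "rs1"])
exported = panel.export("pair")
```

Because `export` takes only the subset name, a call like this could be dropped into a scripted pipeline step.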
Interlinks multiple tools and datasets into a unified database system. GKDB is an application that incorporates and merges data related to evolution, expression, physical location and functional annotation. This program is composed of several customizable modules which can be adapted to various tasks, including defining a set of candidate genes or retrieving genes that share a conserved protein domain in a given tissue.
Allows searching, exploration and ranking-aware combination of distributed bio-data. Bio-SeCo is an application that enables explorative search and automatic ranking-aware integration of heterogeneous biomedical-molecular data provided by the individual services registered in the framework. In the user interface, the registered services can be used and combined, according to the connection patterns defined at service registration time, to explore and globally search the data that they provide.
A bioinformatics platform that integrates clinical data, NGS data and whole-slide bioimages from tissue sections. POS is a web-based platform that is scalable, flexible and expandable. The underlying database is based on a data warehouse schema, which is used to integrate information from different sources.
Manages and shares sets of genes. GSB is a web platform which aims to provide a team with a set of utilities that ease collaborative work. The application includes functions for assisting users in (i) annotating a pre-existing set of genes, coupled with a confidence rating system; (ii) sharing results with a defined community; and (iii) importing datasets and exporting results to a local computer in various formats, including FASTA files.
A modified version of Spark SQL, a distributed SQL execution engine, that efficiently handles joins that use genomic intervals as keys. With this modification, Spark SQL serves such joins more than 50× faster than its existing brute-force approach and 8× faster than similar distributed implementations. Thus, Spark SQL can replace existing practices to retrieve genomic data.
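To make the contrast with a brute-force join concrete, here is a minimal single-machine Python sketch of an interval-overlap join. It is not the modified Spark SQL implementation: the function names are hypothetical, intervals are assumed to be half-open `(chrom, start, end)` tuples, and the faster variant is a simple sort-and-binary-search pruning chosen only for illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), half-open [start, end)
Pair = Tuple[Interval, Interval]


def brute_force_join(left: List[Interval], right: List[Interval]) -> List[Pair]:
    """Compare every left interval with every right interval."""
    return sorted(
        (a, b)
        for a in left
        for b in right
        if a[0] == b[0] and a[1] < b[2] and b[1] < a[2]
    )


def sorted_join(left: List[Interval], right: List[Interval]) -> List[Pair]:
    """Group the right side by chromosome, sort by start, prune by binary search."""
    by_chrom: Dict[str, List[Interval]] = {}
    for iv in right:
        by_chrom.setdefault(iv[0], []).append(iv)
    starts: Dict[str, List[int]] = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort(key=lambda iv: iv[1])
        starts[chrom] = [iv[1] for iv in ivs]

    out: List[Pair] = []
    for a in left:
        ivs = by_chrom.get(a[0], [])
        # Only right intervals whose start lies before a's end can overlap a.
        hi = bisect_left(starts.get(a[0], []), a[2])
        for b in ivs[:hi]:
            if b[2] > a[1]:  # b must also end after a starts
                out.append((a, b))
    return sorted(out)


reads = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
genes = [("chr1", 150, 160), ("chr1", 190, 510), ("chr1", 600, 700),
         ("chr2", 10, 50), ("chr2", 79, 90)]
pairs = sorted_join(reads, genes)
```

Both functions return the same pairs. The sorted variant skips whole chromosomes and every right-hand interval that starts at or after the query's end, which is the kind of pruning a brute-force comparison of all pairs cannot do.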
Stores multiple biological sequences of variable alphabet size, with customizable character transformations, wildcard support, and an assortment of internal representations optimized for different distributions of wildcards and sequence lengths. GtEncseq offers a flexible and easily configurable alphabet transformation which is useful for many applications. It offers unique features like accessing the same sequence using different reading directions and access to a sequence virtually concatenated with its reverse complement.
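Accessing a sequence "virtually concatenated with its reverse complement" can be sketched in a few lines of Python. This is not the GtEncseq API: the `MirroredSequence` class is hypothetical, it assumes a DNA alphabet with `N` as the only wildcard, and it places no separator between the two halves, which a real implementation may well do.

```python
# Complement table for the assumed alphabet (N is treated as a wildcard).
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


class MirroredSequence:
    """Read-only view of a sequence followed by its reverse complement.

    Only the forward sequence is stored; positions in the second half are
    computed on access, so the view costs no extra sequence storage.
    """

    def __init__(self, seq: str) -> None:
        self._seq = seq

    def __len__(self) -> int:
        return 2 * len(self._seq)

    def __getitem__(self, i: int) -> str:
        n = len(self._seq)
        if not 0 <= i < 2 * n:
            raise IndexError(i)
        if i < n:
            return self._seq[i]
        # Position i in the mirrored half maps to forward position 2n - 1 - i.
        return _COMPLEMENT[self._seq[2 * n - 1 - i]]


mirrored = MirroredSequence("AACG")
full = "".join(mirrored[i] for i in range(len(mirrored)))
```

Here `full` spells the forward sequence followed by its reverse complement, although only the four forward characters are ever stored.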
Automates the real-time data import process for the i2b2 data warehouse. HIStream allows users to employ the warehouse as a plug-and-play system without having to standardize every imported clinical information system in its entirety. The application operates on data streams and allows i2b2 to be used in real time. Additionally, users can define rules using logic formulas with temporal extensions.