Computational protocol: High-Throughput Sequencing, a Versatile Weapon to Support Genome-Based Diagnosis in Infectious Diseases: Applications to Clinical Bacteriology

Protocol publication

[…] HTS technologies can currently sequence millions or even billions of bases per run. Once sequences are obtained, the next steps of storing, analyzing and interpreting this huge amount of data require bioinformatics skills and tools. The cost of sequencing has dropped faster than Moore’s law would have predicted. Unfortunately, this rapid decrease has not been matched by a comparable decrease in the cost of the computational infrastructure required to mine the data []. Every new project in the field will require proportionally less money for the sequencing itself, but will have to allocate more resources to the bioinformatics management and analysis of the data. That is why bioinformatics, if it is not already, will become the bottleneck for a complete and rapid exploitation and interpretation of HTS data. The evolution of HTS technologies has required the parallel development of specific, adapted bioinformatics tools. The scientific community makes many tools dedicated to HTS data available (more than 600 tools are listed at []). Several tools have to be carefully linked to obtain a functional pipeline that produces final results. Choosing the appropriate tool set, depending on the sequencing technology and the application, can become a real brainteaser. However, some complete and specific pipelines for viral pathogen discovery from HTS data are already available (see ; and for a review, see []).
In the field of pathogen discovery, bioinformatics tools can be categorized into two groups depending on the application: pathogen identification and pathogen characterization. In the case of identification, the aim is to distinguish closely related strains in order to rapidly choose a suitable treatment, whereas for characterization, the pathogen genome is studied in depth in order to highlight gene transfers and infra-specific variations. For both of these applications, a step of mapping against a reference genome is often necessary.
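The mapping step can be illustrated with a minimal sketch. It is not the algorithm of any particular mapper: it assumes a simple seed-and-verify strategy, in which every k-mer of the reference is indexed and each read is placed where one of its k-mers matches and the full read agrees with the reference up to a few mismatches. The function names (`build_kmer_index`, `map_read`, `remove_host_reads`) and the parameters `k` and `max_mismatches` are hypothetical and chosen for illustration only; production mappers use far more elaborate indexes and scoring schemes, and handle insertions, deletions and reverse-complemented reads, which this sketch ignores.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Return (start, mismatches) for the best ungapped placement, or None if unmapped."""
    best = None
    # Use every k-mer of the read as a seed, so a mismatch in one seed
    # does not prevent the read from being placed.
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for hit in index.get(seed, ()):
            start = hit - offset
            if start < 0 or start + len(read) > len(reference):
                continue  # the read would hang over an end of the reference
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


def remove_host_reads(reads, host_reference, k=4, max_mismatches=1):
    """Keep only the reads that do NOT map to the host reference."""
    index = build_kmer_index(host_reference, k)
    return [
        read for read in reads
        if map_read(read, host_reference, index, k, max_mismatches) is None
    ]
```

The same `map_read` call serves both purposes: the returned position localizes a read on a pathogen reference, while a `None` result against a host reference marks a read as a non-host candidate worth keeping.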
Mapping algorithms can be used to localize the reads on the genome or to filter out reads from the host. Many mappers are available: Fonseca et al. listed more than 80 mappers [] (see for examples of mappers used in pathogen studies). In the case of pathogen identification, the mapping step is often sufficient to identify the pathogen or a closely related strain and to obtain relevant information for choosing a treatment. However, the pathogen may need to be better characterized, for example to identify gene transfers or an emerging pathogen. In this case, other bioinformatics tools are required.
De novo assembly algorithms align and merge reads to obtain longer fragments in order to reconstruct the original sequence without a reference sequence. Assembly is useful for studying an emerging pathogen or for identifying gene transfers by assembling reads that were not mapped onto a reference genome. Most assemblers are specific to one or a subset of sequencing technologies (see ). In single-cell sequencing, multiple displacement amplification (MDA) leads to non-uniform read coverage, as well as elevated levels of sequencing errors and chimeric reads. Some recent assemblers have been developed to deal with these specificities (see ).
In the field of pathogen discovery, another important task is to annotate the sequences obtained from mapping or assembly, which is often done by comparison, i.e., by searching for sequence similarity in current databases. For example, PATRIC (PathoSystems Resource Integration Center) is the Bacterial Bioinformatics Resource Center, an information system designed to support the biomedical research community’s work on bacterial infectious diseases []. ViPR, the Virus Pathogen Resource, is a publicly available resource that supports research on viral pathogens []. A well-known tool for similarity searching is BLAST [], which is widely used in annotation tools and pipelines. Some pipelines for prokaryotic sequence annotation are also available (see ).
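The idea of aligning and merging reads without a reference can be sketched as a toy greedy overlap assembler. This is only an illustration under strong assumptions: reads are error-free, come from a single strand, and are merged by exact suffix-prefix overlaps. It is not the method used by the assemblers cited in the text, which rely on more robust graph-based approaches. The names `overlap_length` and `greedy_assemble` and the parameter `min_overlap` are hypothetical.

```python
def overlap_length(left, right, min_overlap):
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 when the longest such overlap is shorter than `min_overlap`.
    """
    for length in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads are merged into a single contig.
reads = ["ACGTTGCA", "TGCAAGGC", "AGGCTTAC"]
print(greedy_assemble(reads))  # ['ACGTTGCAAGGCTTAC']
```

Reads that share no sufficiently long overlap are returned as separate contigs, which mirrors in a very simplified way how an assembly of unmapped reads may yield several fragments rather than one complete sequence.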
In the case of public health applications, identification of specific sequences, such as virulence and resistance genes, is essential to adapt the medical treatment. Drug resistance genes can be detected using BLAST against specific databases, such as ARDB for antibiotic resistance gene screening [], or using a dedicated tool, such as ResFinder []. For the typing of bacterial species, the MLST scheme [] can be used to identify the sequence type directly from reads or from the assembled sequence. Several free online tools exist for the identification of prophage sequences within bacterial genomes (e.g., PHAST []). For plasmid identification, one possible approach is to BLAST the sequence against a plasmid database, such as the PATRIC plasmid database. Another possible analysis is to identify SNPs and DIPs (deletion insertion polymorphisms) suitable for downstream phylogenetic analyses (see [] for a survey of tools for variant analysis).
One part of culture-independent sequencing analysis is based on metagenomic approaches and requires specific bioinformatics tools. The analysis of metagenomics data represents a major challenge, as it relies on identifying each individual organism in a mixed sample. Two main metagenomic approaches are used in microbial community analysis: 16S rRNA and whole genome shotgun metagenomics. In 16S rRNA metagenomic approaches, the main step is to assemble overlapping reads and to reduce the dataset complexity by clustering the sequences into operational taxonomic units (OTUs) (for example, using UCLUST/USEARCH [] or CD-HIT []). In whole genome shotgun approaches, reads are assembled using, for example, MIRA, MetaVelvet or IDBA-UD (see ). In both approaches, the final step is to perform taxonomic classification and compute diversity metrics, which can be done using ARB [] and the SILVA database [] or the Greengenes database [].
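The OTU clustering and diversity steps can be illustrated with a minimal sketch. It assumes a simplified greedy centroid clustering: each sequence joins the first existing OTU whose centroid it matches at or above an identity threshold, and otherwise founds a new OTU. This is only in the spirit of the tools named above and does not reproduce their actual algorithms. It further assumes that the sequences are already aligned, non-empty and of equal length, so that identity is a simple per-position comparison. The 97% default is a commonly used convention, not a value given in the text, and the Shannon index is shown as one example of a diversity metric. All function names are hypothetical.

```python
import math
from collections import Counter


def identity(a, b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(sequences, threshold=0.97):
    """Assign each sequence to the first centroid it matches, else create a new OTU."""
    centroids = []    # one representative sequence per OTU
    assignments = []  # OTU index for each input sequence, in input order
    for seq in sequences:
        for otu_id, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                assignments.append(otu_id)
                break
        else:
            # No centroid was close enough: this sequence founds a new OTU.
            centroids.append(seq)
            assignments.append(len(centroids) - 1)
    return centroids, assignments


def shannon_index(assignments):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTU relative abundances p_i."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())
```

Because the result of greedy clustering depends on input order, such procedures are usually run on sequences sorted by abundance or length; that ordering is left to the caller in this sketch.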
Some programs integrate all the analysis steps (e.g., QIIME [] and mothur [] for 16S metagenomics and MEGAN [] for whole genome shotgun metagenomics).
The analysis of HTS data requires high-performance computational resources. Even though CPU speeds and memory capacities have increased, the huge amount of data to be handled in HTS analyses often requires dedicated computational solutions. Several solutions are widely used for HTS programs, such as computer clustering and cloud computing. A computer cluster consists of a set of connected computers with a centralized management approach. With cloud computing, researchers have the option of simply paying for their computing requirements, rather than building and maintaining their own physical computing infrastructure.
Most academic bioinformatics tools for HTS are technology-dependent, open-source and designed to be run from the command line in a UNIX environment (see ). Computational skills are often necessary, and only a few of these tools offer a graphical user interface to make them easy to use []. To fill this gap, some frameworks have been developed, a well-known example being Galaxy [], an open web-based platform for accessible and reproducible genomic research. Some tools are available as web-based resources, which makes them easy to use (see ). Another way to make the tools accessible to a broader community is the use of virtual machines (VMs). A VM is a software implementation of a machine (i.e., a computer) that executes programs like a physical machine. A snapshot of a given configuration can be taken and distributed to the community. For example, Cloud BioLinux is a publicly accessible VM containing more than 135 bioinformatics packages, along with documentation, a desktop interface and graphical software applications []. Bioinformatics is playing, and will undoubtedly continue to play, a central role in the development of HTS in medicine.
One of the many examples showing how bioinformatics impacts the medical management of infectious diseases is the German outbreak caused by the entero-hemorrhagic O104:H4 E. coli strain in 2011 [,]. The genome sequence was rapidly made available through HTS, and at the same time, the microbial characterization, including the clinically relevant antibiotic susceptibility profile, was provided. The authors showed that the antibiotic profile can be computationally identified. The response to the German outbreak also used crowd-sourcing as a powerful tool to fight pathogens. The genome sequence was released in open access, and the scientific community was asked to help annotate the genome. The crowd-sourced analysis produced the first annotated version of the genome within a few days. The routine use of HTS data in clinical microbiology will depend on the availability of bioinformatics tools, which have to be integrated, i.e., ready and easy to use. In addition to the bioinformatics solutions provided by the academic scientific community, IVD companies offer and will develop specific bioinformatics tools (most likely proprietary) for these targeted applications. With benchtop sequencers, many laboratories and clinical centers can invest in these HTS technologies, whose informatics and bioinformatics requirements are increasingly within reach. HTS and the surrounding analytical systems seem to have entered a phase of maturity and are now widely adopted; nevertheless, some barriers must still be overcome before this (r)evolution finds application in routine genome-based diagnosis. […]

Pipeline specifications