Computational protocol: Computational methods for predicting genomic islands in microbial genomes

Protocol publication

[…] Methods based on gene sequence composition are often designed to detect lateral gene transfer (LGT), or laterally transferred genes, and only a few methods are specifically developed to detect genomic islands (GIs). The methods for LGT detection can be used to identify GIs by combining clusters of laterally transferred genes, but they are expected to be less sensitive, since some genes inside a GI may not be atypical enough for the whole GI to be captured. Here we mainly discuss methods designed specifically for GI detection.

Some GI detection methods combine multiple discrimination criteria, such as Karlin's method and PAI-IDA. Both predict GIs and pathogenicity islands (PAIs) by evaluating multiple compositional features (GC content, dinucleotide frequencies, codon usage, and amino acid usage). Karlin's method is a single-threshold method, while PAI-IDA uses iterative discriminant analysis. Both methods use a sliding window to scan the genome, and sequences or genes inside each window are used for computation.

Other methods use only a single discrimination criterion, such as IslandPath-DINUC and SIGI-HMM. IslandPath-DINUC uses a single threshold to predict GIs as multiple consecutive genes with dinucleotide bias alone. SIGI-HMM predicts GIs and the putative donors of laterally transferred genes based solely on the codon usage bias of individual genes. As an extension of SIGI, an earlier method based on scores derived from codon frequencies, SIGI-HMM replaces the previous heuristic method with a hidden Markov model (HMM) that models laterally transferred genes and native genes as different states.

Methods based on gene sequence composition are generally easy to implement and apply. However, what they actually find are genomic regions that are compositionally atypical with respect to certain criteria, so there are many false positives and false negatives. Native regions may easily be detected as false positives owing to their atypical composition for reasons other than LGT, such as highly expressed genes.
At the same time, ameliorated GIs or GIs that originated from genomes with similar composition may not be detected. The false positives can, however, be reduced by eliminating well-known non-GIs. For example, by filtering out putative highly expressed genes based on codon usage, SIGI-HMM was reported to have the highest precision in a previous evaluation.

For methods that compare against the genomic average, laterally transferred regions may contaminate the genome and reduce the accuracy of predictions. Furthermore, the predicted boundaries of GIs are not precise, since the boundaries between laterally transferred genes and native genes can be compositionally ambiguous. Additionally, these gene-level methods require reliable gene annotations. Thus, they may not be applicable to newly sequenced genomes, which have no or incomplete annotations.

[...] The increase of newly sequenced genomes without complete annotations necessitates GI prediction based on DNA sequences alone. Without the aid of gene boundaries, the large genome has to be segmented by other means. According to their genome segmentation approach, methods based on DNA sequence composition can be classified into two major kinds: window-based methods and windowless methods.

Window-based single-threshold methods are commonly used for GI detection. These methods use a sliding window to segment the whole genome sequence into a set of smaller regions. There are several representative programs, including AlienHunter, Centroid, INDeGenIUS, Design-Island and GI-SVM. They differ mainly in the size of the sliding window, the choice of discrimination criterion and similarity measure, and the determination of the threshold.

Both AlienHunter and GI-SVM use a fixed-size overlapping window with a fixed step size. AlienHunter is the first program for GI detection on raw genomic sequences. It measures segment atypicality via relative entropy based on interpolated variable order motifs (IVOM).
The threshold can be obtained by either k-means clustering or standard deviation (when there are fewer samples). GI-SVM is a recent method using either fixed or variable order k-mer frequencies. It detects atypical windows via a one-class SVM with a spectrum kernel. An automatic threshold can be obtained from one-dimensional k-means clustering.

Centroid partitions the genome with a non-overlapping window of fixed size. The average of the k-mer frequency vectors of all windows is taken as the centroid. Based on the Manhattan distance from each frequency vector to the centroid, outlier windows are selected by a threshold derived from the standard deviation. INDeGenIUS is similar to Centroid, but it uses overlapping windows of fixed size and computes the centroid via hierarchical clustering.

Design-Island is a two-phase method utilizing k-mer frequencies. It incorporates statistical tests based on different distance measures to determine the atypicality of a segment via pre-specified thresholds. In the first phase, a variable-size window is used to obtain initial GIs, whereas in the refinement phase a smaller window of fixed size scans these putative GIs to obtain the final GI predictions.

Some of these methods are designed to alleviate the problem of genome contamination. Design-Island excludes the initially obtained putative GIs when computing parameters for the entire genome in the second phase. GI-SVM measures the atypicality of all the windows simultaneously via a one-class SVM, and only some windows contribute to the genomic signature.

To deal with the imprecise GI boundaries that result from a large step size, AlienHunter uses an HMM to further localize the boundaries between predicted GIs and non-GIs. Most other programs do not address this issue.

The few windowless methods mainly include GC Profile and MJSD.

GC Profile is an intuitive method for calculating the global GC content distribution of a genome at high resolution.
An abrupt drop in the profile indicates a sharp decrease in GC content and thus the potential presence of a GI. This method was later developed into a web-based tool for analyzing GC content in genome sequences. However, other features have to be used together with GC Profile for GI prediction because of the poor discriminative power of GC content.

MJSD is a recursive segmentation method based on the Markov Jensen-Shannon divergence (MJSD) measure. The genome is recursively cut into two segments by finding a position where the sequences to its left and to its right have statistically significant compositional differences. Subsequently, each segment is compared against the whole genome to check its atypicality via a predefined threshold.

Methods based on DNA sequence composition have advantages and disadvantages similar to those of methods based on gene sequence composition.

Specifically, window-based methods can be highly sensitive with appropriate implementations. For example, AlienHunter was reported to have the highest recall in a previous evaluation, and GI-SVM was recently shown to have even higher sensitivity than AlienHunter. But their precision is quite low owing to the limited input information. They are also inherently incapable of identifying the precise boundaries between regions with compositional differences.

In contrast, windowless methods can delineate the boundaries between GIs and non-GIs more accurately. GC Profile has successfully discovered a few reliable GIs in several genomes. But assessing the abruptness of a jump in the GC profile appears subjective, and only GIs with low GC content can be detected. MJSD is better at predicting GIs larger than 10 kb, but the procedure for determining segment atypicality still suffers from contamination of the whole genome.

[...] The presence of compositional bias is usually not sufficient to assure the foreign origin of putative GIs.
Thus, it is necessary to develop methods based on multiple GI-related structural features. According to how they integrate different features, methods based on GI structure can be divided into direct integration methods and machine learning methods.

The direct integration methods adopt a series of filters to obtain more reliable GIs. But some integrated features are only used for validation, since it is difficult to use them systematically for prediction given the extreme structural variation of GIs. There are two main representative programs: IslandPath and Islander.

IslandPath is the first program integrating multiple features (GC bias, dinucleotide bias, the presence of tDNAs and mobility-related genes) to aid GI detection. But IslandPath only annotates and displays these features across the whole genome, leaving it to the user to decide whether a region is a GI. Based on these computed features, a GI can be identified as multiple consecutive genes with both dinucleotide bias and the presence of mobility-related genes (IslandPath-DIMOB).

Islander incorporates a method to accurately detect tDNA-borne GIs. It searches for specific tDNA signatures to find candidate GIs. Several filters are used to exclude potential false positives, such as regions without integrase genes. The filtering algorithms were recently refined by incorporating the more precise annotations now available.

Several machine learning approaches based on constructed GI datasets have been proposed, including Relevance Vector Machine (RVM), GIDetector, and GIHunter. They differ mainly in the choice of training datasets, GI-related features, and learning algorithms.

RVM is the first machine learning method to study structural models of GIs. It is based on datasets constructed by comparative genomics methods.
Eight features of each genomic region are used to train GI models: IVOM score, insertion point, GI size, gene density, repeats, phage-related protein domains, integrase protein domains and non-coding RNAs.

GIDetector utilizes the same features and training datasets as RVM, but it implements a decision-tree-based ensemble learning algorithm. GIHunter uses an algorithm similar to that of GIDetector, but adopts slightly different features and datasets: GI size and repeats are replaced by highly expressed genes and average intergenic distance, and the training datasets are replaced by IslandPick datasets. The predictions of GIHunter for thousands of microbial genomes are available online at http://www5.esu.edu/cpsc/bioinfo/dgi/index.php.

Methods utilizing GI structure can generate more robust predictions. For example, the high reliability of GIs inserted at tDNA sites leads to very few false positives in the predictions from Islander. But these methods depend on accurate identification of multiple related features, such as tRNA genes, mobility-related genes, and virulence factors.

Direct integration methods are straightforward, but many GIs may be filtered out because they lack certain features. For example, IslandPath-DIMOB was shown to have very low recall in spite of high accuracy and precision.

Conversely, machine learning approaches can systematically integrate multiple GI features to improve GI prediction. This is partly reflected by the high recall and precision of GIHunter. However, the performance of supervised methods is closely tied to the quality of the training datasets.

[...] Methods based on several genomes detect GIs based on their sporadic phylogenetic distribution. They compare multiple related genomes to find regions present in a subset of the genomes but not in all of them.
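
As a minimal sketch of this idea, and not the procedure of any tool named here, the following Python snippet flags runs of consecutive query-genome genes that are missing from at least one related genome. The gene identifiers, the `related` mapping (assumed to come from some prior ortholog or alignment analysis), and the `min_genes` cutoff are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def sporadic_gene_runs(
    query_genes: List[str],
    related: Dict[str, Set[str]],
    min_genes: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, for runs of at least
    `min_genes` consecutive query genes that are absent from at least one
    related genome. The grouping rule is an illustrative assumption."""
    runs: List[Tuple[int, int]] = []
    start = None  # index where the current run of sporadic genes began
    for i, gene in enumerate(query_genes):
        # A gene is "sporadic" if any related genome lacks it.
        sporadic = any(gene not in genes for genes in related.values())
        if sporadic and start is None:
            start = i
        elif not sporadic and start is not None:
            if i - start >= min_genes:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the gene list.
    if start is not None and len(query_genes) - start >= min_genes:
        runs.append((start, len(query_genes)))
    return runs


# Toy input: genes are listed in genome order for the query strain.
query = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
related = {
    "strainA": {"g1", "g2", "g6", "g7"},
    "strainB": {"g1", "g2", "g3", "g6", "g7"},
}
print(sporadic_gene_runs(query, related))  # [(2, 5)]
```

With this toy input, genes g3 to g5 form the only run of three or more sporadic genes. A presence/absence pattern of this kind cannot by itself tell gene gain from gene loss.
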
The comparison procedure often involves analyzing results from sequence alignment tools, such as the local alignment tool BLAST and the whole-genome alignment tool MAUVE.

BLAST and MAUVE can be used to find unique strain-specific regions (GI candidates), whereas MAUVE can also be used to find conserved regions. For example, Vernikos and Parkhill performed genome-wide comparisons via all-against-all BLAST, and then applied manual inspection to find reliable GIs for training GI structural models. They also differentiated gene gain from gene loss via a maximum parsimony model obtained from MAUVE alignments. Despite the tediousness of manual analysis, there are only two automatic methods based on several genomes: tRNAcc and IslandPick.

The tRNAcc method utilizes alignments from MAUVE to find GIs between a conserved tRNA gene and a conserved downstream flanking region across the selected genomes. It was later integrated into MobilomeFINDER, an integrative web-based application to predict GIs with both computational and experimental methods. Complementary analyses are also incorporated in tRNAcc to provide additional support, including GC Profile, strain-specific coding sequences derived from BLAST analysis, and dinucleotide differences. But appropriate genomes to compare have to be selected manually.

To facilitate genome selection, IslandPick builds an all-against-all genome distance matrix and utilizes several cut-offs to select suitable genomes to compare with the query genome, making it the first completely automatic comparative genomics method. The pairwise whole-genome alignments are done by MAUVE to get large unique regions in the query genome. After being filtered by BLAST to eliminate genome duplications, these regions are considered putative GIs.

Because of the inaccuracies of composition-based methods, methods based on several genomes are preferred if there are appropriate genomes for comparison. But uncertainties still exist in their predictions.
Firstly, the results depend on the genomes compared with the query genome. Secondly, it is hard to distinguish between gene gain via LGT and gene loss. Thirdly, genomic rearrangements can cause difficulties in accurate sequence alignments. In addition, the applications of methods based on several genomes are limited, since the genome sequences of related organisms may not be available for some query genomes.

[...] Different kinds of methods often predict non-overlapping GIs and complement each other. To make the best of available methods, ensemble methods have been proposed to combine different methods.

One way of combination is to merge the predictions from multiple programs. This approach is implemented in IslandViewer and EGID. IslandViewer is a web-based application combining three programs: SIGI-HMM, IslandPath-DIMOB, and IslandPick. It provides the first user-friendly integrated interface for visualizing and downloading predicted GIs. Newer versions of IslandViewer include further improvements, such as improved efficiency and flexibility, additional gene annotations, and interactive visualizations. But the underlying integration method is mainly a union of predictions from individual programs. Unlike IslandViewer, EGID uses a voting approach to combine predictions from five programs: AlienHunter, IslandPath, SIGI-HMM, INDeGenIUS, and PAI-IDA. A user-friendly interface for EGID is provided in the program GIST.

Another way of combination is to filter the predictions from one method by other methods. This approach is common for PAI prediction, since it is critical to utilize multiple features to discern PAIs from other GIs. Several PAI detection programs adopt this approach, including PAIDB, PredictBias and PIPS. These programs often combine composition-based methods, comparative genomics methods, and homology-based methods.

Both PAIDB and PredictBias first identify putative GIs based on compositional bias.
For PAIDB, the putative GIs homologous to published PAIs (overlapping with PAI-like regions obtained from homology searches) are seen as candidate PAIs. SIGI-HMM and IslandPath-DIMOB were later integrated into PAIDB for GI predictions. To overcome the dependency on known PAIs, PredictBias constructs a profile database of virulence factors (VFPD). If the putative GIs (or eight contiguous genes) have a pre-specified number of significant hits to VFPD, they are seen as potential PAIs. PredictBias also integrates comparative analysis to validate the potential PAIs.

PIPS integrates multiple available tools for computing PAI-associated features. It filters the initial predictions from comparative genomics analysis via empirical logic rules on selected features (GC content, codon usage, virulence factors and hypothetical proteins).

Combining the predictions of several programs is expected to perform better than individual programs. Indeed, IslandViewer was shown to increase recall and accuracy without much sacrifice of precision, and EGID was reported to yield balanced recall and precision.

The available ensemble methods are mostly characterized by user-friendly interfaces, but the combination procedures do not seem to be sophisticated enough. Some valuable predictions made by one method may be discarded in the ensemble method. For example, PredictBias was shown to have lower sensitivity and accuracy than PIPS on two bacterial strains, which to some extent reflects the effect of different integration strategies on performance.

[...] Thanks to low-cost high-throughput sequencing, an increasing number of microbial genomes are being sequenced. However, many of these genomes are in draft status, so there is a need to predict GIs in incomplete genomes. Currently, there are only two programs for this purpose: GI-GPS and IslandViewer 3.
Both programs first assemble the sequence contigs into a draft genome, and then use methods similar to those for predicting GIs in complete genomes.

GI-GPS is a component of GI-POP, a web-based application integrating annotations and GI predictions for ongoing microbial genome projects. GI-GPS uses an assembler within GI-POP for genome assembly. Then an SVM classifier with a radial basis function kernel is applied to segments obtained from a sliding window of fixed size along the genome. The classifier is trained on IslandPick datasets and selected GIs from PAIDB. GI-GPS utilizes compositional features in model training to tolerate potential errors in the assembled genome. The predictions from the classifier are filtered by homology searches to keep only sequences with MGE evidence. Then the boundaries of the filtered sequences are refined by repeats and tRNA genes.

IslandViewer 3 maps the annotated contigs to a completed reference genome to generate a concatenated genome. Then it uses this single genome as input to the normal IslandViewer pipeline.

GI-GPS and IslandViewer 3 make it feasible to predict GIs for draft genomes. But they are still simplistic and limited. For example, IslandViewer 3 is restricted to genomes that have very few contigs and for which reference genomes of closely related strains of the same species are available. Furthermore, it seems inappropriate to apply methods similar to those developed for complete genomes, since draft genome sequences are of lower quality than complete genome sequences. […]
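
The concatenation step described for IslandViewer 3 can be illustrated with a short Python sketch. It is not IslandViewer 3's implementation, which this text does not detail: the `MappedContig` fields, the assumption that each contig already has a mapped reference position and strand, and the run of `N` characters used as a spacer are all hypothetical choices made for illustration.

```python
from typing import Dict, List, NamedTuple, Tuple


class MappedContig(NamedTuple):
    name: str
    sequence: str
    ref_start: int  # hypothetical: leftmost mapped position on the reference
    reverse: bool   # hypothetical: True if the contig maps to the minus strand


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (other characters are left unchanged)."""
    return seq.translate(_COMPLEMENT)[::-1]


def concatenate_contigs(
    contigs: List[MappedContig],
    spacer: str = "N" * 10,
) -> Tuple[str, Dict[str, Tuple[int, int]]]:
    """Order contigs by reference position and join them into one sequence.

    Returns the concatenated sequence and, for each contig, its
    (start, end) coordinates (end exclusive) in that sequence, so that
    predictions made on the concatenated genome can be traced back.
    """
    ordered = sorted(contigs, key=lambda c: c.ref_start)
    parts: List[str] = []
    offsets: Dict[str, Tuple[int, int]] = {}
    position = 0
    for i, contig in enumerate(ordered):
        if i > 0:
            # Separate neighbouring contigs with the assumed spacer.
            parts.append(spacer)
            position += len(spacer)
        seq = reverse_complement(contig.sequence) if contig.reverse else contig.sequence
        offsets[contig.name] = (position, position + len(seq))
        parts.append(seq)
        position += len(seq)
    return "".join(parts), offsets


contigs = [
    MappedContig("c2", "GGGG", ref_start=500, reverse=False),
    MappedContig("c1", "AACC", ref_start=10, reverse=True),
]
genome, offsets = concatenate_contigs(contigs, spacer="NN")
print(genome)   # GGTTNNGGGG
print(offsets)  # {'c1': (0, 4), 'c2': (6, 10)}
```

Keeping the per-contig offsets is one way to report a predicted island in contig coordinates; a prediction that spans a spacer would then be a sign that it crosses a contig boundary.
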

Pipeline specifications