Computational protocol: Detecting laterally transferred genes: use of entropic clustering methods and genome position

Protocol publication

[…] Compositional properties of genes rarely lie as points about a single defining set of parameters; rather, they fall along a range of parameter values (for example, of codon usage bias). At high stringency (significance threshold), the JS clustering algorithm may sort native genes, or the genes from a donor organism, into more than one class representing this spectrum; relaxing the stringency may raise the misclassification error and lead to the undesirable merger of classes of genes. Gene-context information can be used to identify classes of genes that may have originated from the same source organism. If a gene belongs to class ci whereas its two flanking genes are grouped in class cj, we define this adjacency as a link between classes ci and cj. To quantify the significance of this link, we define P(ci↔cj) (Equation 6) in terms of N(ci→cj), the total number of connections from class ci to cj, and L(cx), the number of genes in class cx. If P(ci↔cj) exceeds an established threshold, the genes comprising the two classes are physically associated within the genome, perhaps due to common origin; the genes from these two entropic classes are assigned to a single logical class.
In the next post-processing step, we again use the genome-context information of genes to refine the composition of gene classes. Here, a gene is reassigned to the class of its neighbors only if it plausibly lies within that class. Specifically, if a gene belongs to logical class ci whereas its immediate neighbors are grouped in logical class cj, the gene is reassigned to class cj if and only if it is either not atypical or only slightly atypical with respect to class cj (determined by slightly relaxing the stringency), as inferred within a hypothesis-testing framework.
[...] Other parametric methods for foreign gene detection were coded as follows.
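The link counting and neighbor-based reassignment described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function names, the linear (non-circular) gene order, and the `is_plausible` callback that stands in for the relaxed-stringency hypothesis test are all assumptions. The normalization that turns N(ci→cj) and L(cx) into P(ci↔cj) is not given in this excerpt, so the sketch stops at the raw counts.

```python
from collections import Counter


def count_class_links(classes):
    """Count links N(ci -> cj) and class sizes L(cx) along a linear gene order.

    A link (ci, cj) is recorded when a gene of class ci is flanked on both
    sides by genes of the same, different class cj.
    """
    links = Counter()
    for left, mid, right in zip(classes, classes[1:], classes[2:]):
        if left == right and mid != left:
            links[(mid, left)] += 1
    sizes = Counter(classes)  # L(cx): number of genes in each class
    return links, sizes


def reassign_to_neighbors(classes, is_plausible):
    """Reassign a gene to its neighbors' class when the test allows it.

    `is_plausible(i, cj)` is a hypothetical callback: it should return True
    when gene i is not atypical (or only slightly atypical) for class cj.
    Decisions are based on the original labels, not on earlier reassignments.
    """
    updated = list(classes)
    for i in range(1, len(classes) - 1):
        left, mid, right = classes[i - 1], classes[i], classes[i + 1]
        if left == right and mid != left and is_plausible(i, left):
            updated[i] = left
    return updated


# Toy gene order with made-up class labels.
order = ["A", "B", "A", "A", "C", "A", "B", "B", "A", "B"]
links, sizes = count_class_links(order)
refined = reassign_to_neighbors(order, lambda i, cj: True)
```

In practice the callback would wrap whatever test the clustering method uses to decide whether a gene is atypical for a class; accepting every candidate, as in the toy call, is only for demonstration.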
Karlin suggested dinucleotide bias as a genome signature, assessed through the odds ratio ρXY = fXY/(fX fY), where fXY is the frequency of the dinucleotide XY and fX is the frequency of the nucleotide X. If the dinucleotide average relative abundance difference between gene g and genome G (average over all genes) exceeds an established threshold, the gene is classified as foreign. Karlin's codon usage difference between gene g and genome G, B(g|G), is defined in terms of fc, the frequency of codon c normalized within its synonymous codon group a, and Pa, the normalized frequency of amino acid a. If B(g|G) exceeds an established threshold, g is classified as a foreign gene.
Hayes and Borodovsky developed a k-means gene clustering algorithm that uses the Kullback–Leibler distance D as a measure of codon usage difference between gene g and cluster C to decide the algorithm's convergence (in the definition of D, na is the size of the ath group of synonymous codons and fc denotes the normalized frequency of codon c, as described above). Initial seeds for the typical and atypical clusters were obtained from GeneMark predictions; each gene was reassigned to the cluster with the closest cluster center as determined through D, cluster centers were recomputed, and this process was repeated until convergence.
Our recently developed AIC-based gene clustering algorithm is similar in spirit to our proposed JS divergence-based gene clustering method: gene classes are populated in a hierarchical agglomerative clustering fashion; here, however, clustering is decided in a model selection framework. We used a generalized version of AIC as a stopping criterion for clustering (defined in terms of the maximum likelihood, the number of free parameters K, the sample size n and the tuning parameter n0).
Garcia-Vallve et al. used multiple metrics, namely G+C content, codon usage and amino acid usage, to compile putative horizontally transferred genes in their HGT-DB database.
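As a rough illustration of the dinucleotide signature, the odds ratio ρXY = fXY/(fX fY) can be computed as below. The averaging step is an assumption: the excerpt does not show the formula for the average relative abundance difference, so the sketch uses the mean absolute difference over the 16 dinucleotides, which is the form commonly attributed to Karlin. The sketch also works on a single strand and does not symmetrize over the reverse complement; the function names are hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]


def dinucleotide_odds_ratios(seq):
    """Return rho_XY = f_XY / (f_X * f_Y) for all 16 dinucleotides."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_pairs = sum(pairs[d] for d in DINUCS)  # ignore pairs with non-ACGT symbols
    rho = {}
    for d in DINUCS:
        if n_mono == 0 or n_pairs == 0:
            rho[d] = 0.0
            continue
        fx = mono[d[0]] / n_mono
        fy = mono[d[1]] / n_mono
        fxy = pairs[d] / n_pairs
        # Undefined ratios (a base is absent) are set to 0.0 in this sketch.
        rho[d] = fxy / (fx * fy) if fx > 0 and fy > 0 else 0.0
    return rho


def mean_abs_difference(rho_gene, rho_genome):
    """Assumed form of the difference: average |rho_XY(g) - rho_XY(G)| over 16 pairs."""
    return sum(abs(rho_gene[d] - rho_genome[d]) for d in DINUCS) / len(DINUCS)


def is_foreign(gene_seq, genome_rho, threshold):
    """Flag a gene whose signature differs from the genome's by more than `threshold`."""
    return mean_abs_difference(dinucleotide_odds_ratios(gene_seq), genome_rho) > threshold
```

The threshold is left as a parameter because the excerpt refers only to "an established threshold" without giving a value.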
The machine-learning method Wn-SVM uses a one-class support vector machine to identify alien genes. Alien-Hunter detects putative alien genes using variable-order motif distributions. […]

Pipeline specifications

Software tools: GeneMark, Alien_hunter
Databases: HGT-DB
Application: Genome annotation
Organisms: Escherichia coli K-12