Computational protocol: Factors Influencing the Diversity of Iron Uptake Systems in Aquatic Microorganisms

Protocol publication

[…] The set of non-redundant (UniRef100) protein sequences for the genes listed in Table was downloaded from UniProt, and HMM-ModE profiles were created as described earlier (Srivastava et al., ). The HMM-ModE protocol allows the construction of HMM profiles with increased specificity by using negative training sequences.

The training sequences for each protein were first clustered using the Markov Clustering Algorithm (MCL) (Enright et al., ). For each subgroup of each protein, the training sequences were aligned with MUSCLE (Edgar, ) and HMMs were generated using hmmbuild from the HMMER2 package (Eddy, ). The discrimination threshold of each protein HMM was optimized by n-fold cross-validation. The training sequences for each protein were divided into n test sets such that each sequence is part of at least one test set. For each test set t, the remaining (n-1) sets were combined to form the training set and used to build an HMM. The sequences in t were scored against this HMM with the hmmsearch program to obtain a True Positive (TP) score distribution. False positives (FP) were identified from the negative training set (in this case the entire UniProt database excluding the training sequences for the gene in question). The sensitivity, specificity, and Matthews Correlation Coefficient (MCC) distributions were calculated for each of the n sets (Hannenhalli and Russell, ). The optimal discrimination threshold was identified as the mid-point of the MCC distribution averaged over the n sets. A further increase in specificity was obtained by modifying the emission probabilities of the gene HMM using the FP alignment, as described earlier (Srivastava et al., ).

These HMM-ModE profiles, with their optimized thresholds, were used with the program hmmsearch to scan the protein sequences from the marine microbial genomes as well as from the GOS metagenomes.
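The cross-validation and threshold selection described above can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' Perl scripts: the function names `split_folds`, `mcc`, and `best_threshold` are hypothetical, the scores are assumed to have been parsed from hmmsearch output already, and taking the midpoint of the score range with the highest MCC is only one possible reading of "mid-point of the MCC distribution".

```python
import math
from typing import List, Sequence, Tuple


def split_folds(items: Sequence[str], n: int) -> List[List[str]]:
    """Divide items into n test sets so that every item lands in one set."""
    return [list(items[i::n]) for i in range(n)]


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_threshold(
    pos_scores: Sequence[float], neg_scores: Sequence[float]
) -> Tuple[float, float]:
    """Scan every observed score as a candidate threshold.

    A sequence is called positive when its score is >= the threshold.
    Returns (threshold, mcc), where threshold is the midpoint of the
    candidate thresholds that reach the highest MCC (an assumption,
    see the lead-in).
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    scored = []
    for t in candidates:
        tp = sum(s >= t for s in pos_scores)   # true positives above t
        fn = len(pos_scores) - tp              # true members missed
        fp = sum(s >= t for s in neg_scores)   # negatives above t
        tn = len(neg_scores) - fp              # negatives rejected
        scored.append((t, mcc(tp, fp, tn, fn)))
    top = max(m for _, m in scored)
    best = [t for t, m in scored if m == top]
    return (min(best) + max(best)) / 2.0, top
```

In a full run, `best_threshold` would be applied once per test set (TP scores from the held-out sequences, FP scores from the negative set) and the per-fold results combined; how exactly the original scripts average over the n sets is not specified in the text.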
All the above steps were performed using customized Perl scripts that are available for download from https://sites.google.com/site/dhwanidesai/home/bioinformatics. [...]

The 16S rRNA gene sequences for the genomes were retrieved from GenBank along with the E. coli 16S rRNA sequence. These were aligned using MUSCLE and imported into the ARB software (Ludwig et al., ). A Maximum Likelihood tree was calculated with the fastDNAml (Olsen et al., ) implementation in ARB, using a filter spanning base 800 to base 1300, which encompasses the V6 hypervariable region in E. coli and extends beyond it on both sides. An in-house script was used to calculate the average phylogenetic distance of a gene g as the mean of Dp(g), where Dp(g) is the set of pairwise phylogenetic distances between all pairs of genomes in which gene g occurs.

For the genes discriminating between taxa or locations, the nucleotide sequences were retrieved from GenBank, and Maximum Likelihood estimates of the average pairwise ratio of non-synonymous to synonymous mutations (dN/dS) were calculated using CodeML (runmode = −2) from the PAML package (Yang, ). […]
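The average phylogenetic distance of a gene mentioned above could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' in-house script: the function name `average_phylogenetic_distance` and the distance lookup keyed by `frozenset` pairs are hypothetical, and the distances are assumed to have been exported from the tree beforehand.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, FrozenSet, Iterable


def average_phylogenetic_distance(
    genomes_with_gene: Iterable[str],
    distances: Dict[FrozenSet[str], float],
) -> float:
    """Mean pairwise tree distance over all genome pairs carrying the gene.

    genomes_with_gene: identifiers of genomes in which gene g occurs.
    distances: pairwise phylogenetic distance for each unordered genome
               pair, keyed by frozenset({genome_a, genome_b}).
    """
    genomes = sorted(set(genomes_with_gene))
    # Dp(g): one distance per unordered pair of genomes carrying the gene
    pairs = [frozenset(p) for p in combinations(genomes, 2)]
    if not pairs:
        raise ValueError("need at least two genomes carrying the gene")
    return mean(distances[p] for p in pairs)


# Hypothetical distances between three genomes A, B and C
example = {
    frozenset(("A", "B")): 0.25,
    frozenset(("A", "C")): 0.75,
    frozenset(("B", "C")): 0.5,
}
```

Calling `average_phylogenetic_distance(["A", "B", "C"], example)` averages all three pairwise distances, while passing only `["A", "C"]` uses the single A-C distance.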

Pipeline specifications

Software tools: MUSCLE, HMMER, ARB, fastDNAml, PAML
Databases: UniRef
Applications: Phylogenetics, Metagenomic sequencing analysis, Nucleotide sequence alignment
Chemicals: Heme, Iron, Vitamin B12