Computational protocol: Genome bias influences amino acid choices: analysis of amino acid substitution and re-compilation of substitution matrices exclusive to an AT-biased genome

Protocol publication

[…] To understand the role of nucleotide bias in the amino acid and codon usage of an organism, we selected the GC-rich genome of M. tuberculosis and the AT-rich genome of P. falciparum for comparison. A complete list of P. falciparum proteins was obtained from the FTP site at NCBI. Entries with incomplete annotations (putative, predicted and hypothetical) were filtered out to give a final set of 302 proteins. A protein BLAST (version 2.2.10) search was performed with this set of annotated proteins as the query against the M. tuberculosis proteins at an E-value cutoff of 10^-5. This yielded an exclusive set of 88 protein hits, for which the amino acid composition was calculated per protein using a Perl script. A t-test for correlated samples was performed for each amino acid fraction obtained from this set for the two organisms. A simple Perl script was written to calculate the frequencies of the codons encoding each amino acid, using the corresponding coding regions of the 88 protein orthologs. The ptt and ffn files at NCBI's FTP site were used to retrieve these coding regions for both genomes. [...] All the statistical tests used here, such as ANOVA and the t-test, were performed using VassarStats, a website for statistical computation. [...]

Our approach was to use a fully annotated protein set from P. falciparum and its orthologs (mostly BLAST hits with similar annotation were picked) to study amino acid substitutions. For this, a complete list of P. falciparum proteins was obtained from the FTP site of NCBI. Entries with incomplete annotations such as ‘hypothetical’, ‘probable’ and ‘predicted’ were filtered out, giving a set of 302 proteins. Distantly related orthologs were picked manually for this set using the genomic BLAST (blastp search against both microbial and eukaryotic genomes) at NCBI (E < 1) from 10–20 taxa representing all three domains of life. Organisms chosen as subjects were distantly related to P. falciparum.

For manual selection of the orthologs, first, only those proteins were selected that had an annotation similar to that of the query sequence. Second, annotated hits were picked irrespective of the order of their E-values, to obtain distant orthologs. Third, overrepresentation of subject hits from a particular taxonomic group was avoided. Lastly, where hypothetical hits were picked (to represent a particular taxon that lacked an annotated hit), an E-value near zero (<10^-5) and a length similar to that of the query were required. However, this last option was rarely used, and hypothetical proteins constituted only 6–7% of the total sequences used to build the matrices.

Clustering was performed to remove redundancy in the ortholog protein set with the BLASTCLUST program from the blast-2.2.10 package. Sequences were clustered at 90% identity over 80% of the sequence length. Proteins with too few orthologs (hits to fewer than 10 organisms) or with a biased representation of orthologs from a particular kingdom (hits to a biased group of organisms only, e.g. only to the genus Plasmodium) were eliminated, reducing the working set to 265 proteins. […]
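The annotation filter described above (dropping entries whose description marks them as incompletely annotated) can be sketched as follows. This is a minimal Python illustration, not the authors' code. The term list is the union of the words named in the two passages above, and the (identifier, description) record layout and all function names are assumptions made for the example.

```python
import re

# Terms the protocol text names as marking an incomplete annotation.
# The union of both passages is used here; this is an assumption.
INCOMPLETE_TERMS = ("hypothetical", "putative", "predicted", "probable")

# Whole-word, case-insensitive match against the description line.
_INCOMPLETE = re.compile(
    r"\b(" + "|".join(INCOMPLETE_TERMS) + r")\b", re.IGNORECASE
)


def is_fully_annotated(description):
    """Return True if the description contains none of the incomplete terms."""
    return _INCOMPLETE.search(description) is None


def filter_annotated(records):
    """Keep only (identifier, description) pairs with a complete annotation.

    The record layout is hypothetical; adapt it to however the protein
    table is actually parsed.
    """
    return [(ident, desc) for ident, desc in records if is_fully_annotated(desc)]
```

Applied to a parsed protein table, `filter_annotated` would return the subset to use as BLAST queries. Matching on whole words avoids discarding names that merely contain one of the terms as a substring.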
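The per-protein amino acid composition and the per-amino-acid codon frequencies mentioned above were computed by the authors with Perl scripts that are not shown. A rough Python equivalent, under stated assumptions, might look like the sketch below. It assumes the standard genetic code, skips codons containing ambiguous bases, and uses function names invented for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Standard genetic code, built from the usual TCAG ordering ('*' = stop).
# Using the standard code for both genomes is an assumption of this sketch.
BASES = "TCAG"
AA_STRING = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AA_STRING[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_fractions(protein):
    """Fraction of each of the 20 standard amino acids in one protein."""
    counts = Counter(ch for ch in protein.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def codon_counts(cds):
    """Count in-frame codons, ignoring any trailing partial codon."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def codon_fractions_by_amino_acid(cds):
    """Share of each codon among the codons encoding the same amino acid."""
    counts = codon_counts(cds)
    totals = Counter()
    for codon, n in counts.items():
        aa = CODON_TABLE.get(codon)  # None for codons with ambiguous bases
        if aa is not None:
            totals[aa] += n
    return {
        codon: n / totals[CODON_TABLE[codon]]
        for codon, n in counts.items()
        if codon in CODON_TABLE
    }
```

Normalising within each amino acid, rather than over all codons, separates synonymous codon preference from differences in amino acid composition.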
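The t-test for correlated samples mentioned above was run by the authors on VassarStats. To make the calculation concrete, the sketch below computes the paired t statistic and its degrees of freedom for one amino acid, given its fraction in each ortholog pair. It is an illustration only: the function name is hypothetical, and the p-value lookup is deliberately left to a statistics package.

```python
import math
import statistics


def paired_t_statistic(x, y):
    """Paired (correlated-samples) t statistic and degrees of freedom.

    x[i] and y[i] are the two measurements for pair i, for example the
    fraction of one amino acid in a protein and in its ortholog.
    """
    if len(x) != len(y):
        raise ValueError("samples must have the same length")
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

With the 88 ortholog pairs described above, each amino acid would give a test with 87 degrees of freedom.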

Pipeline specifications

Software tools DELTA-BLAST, VassarStats, BLASTP, BLASTclust
Applications Miscellaneous, Protein sequence analysis, Amino acid sequence alignment
Organisms Plasmodium falciparum
Diseases Ataxia Telangiectasia