Computational protocol: The evolution of protostome GATA factors: Molecular phylogenetics, synteny, and intron/exon structure reveal orthologous relationships

Protocol publication

[…] To identify putative GATA conserved domains, whole-genome traces were downloaded to a local database, and each genome was searched by TblastN using two previously described Platynereis GATA factors as queries. Genome sequence from T. castaneum (Tcas_2.0) and A. mellifera (Amel4.0) was obtained from the Baylor College of Medicine Human Genome Sequencing Center []. I. scapularis (iscapularis.TRACE-WIKEL.june07) and A. gambiae (AgamP3) genome sequence was obtained from VectorBase []. D. pulex, C. capitata, and L. gigantea sequence data (v.1.0) was obtained from the US Department of Energy Joint Genome Institute []. S. mediterranea (v.3.1) sequence data was produced by the Genome Sequencing Center at Washington University School of Medicine in St. Louis [].

The TblastN hits from the genomic trace archives were validated and grouped using subsequent blast analyses. First, TblastN hits were validated by blastx against the GenBank NR database, with a positive hit showing highest similarity to GATA sequences in other organisms. Validated hits were then clustered: blastn was used to search the organism's trace archive for similar hits, and these were used to group all positive traces and to remove duplicates from the list of positive TblastN hits. The best deuterostome TblastN hit from each of the blastx analyses was recorded and used for reciprocal best hit BLAST analysis to assign initial orthology to known deuterostome classes. This process was repeated until no additional exons could be identified.

To assemble the individual exons, we used two distinct methods. In cases where a genome assembly was publicly available, contigs containing these exons were identified by blastn and compared to define the assembled exon structure for individual genes. In cases where no genome assembly was available, we first attempted to connect these exons by searching for traces that overlapped two exons.
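The search for a single trace that bridges two exons can be sketched as a grouping of blastn hits by trace. This is a minimal illustration, not the procedure used in the study: the input tuples, the identifiers, and the 95% identity cutoff are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def connecting_traces(hits, min_identity=95.0):
    """Map each exon pair to the set of traces hit by both exons.

    hits: iterable of (exon_id, trace_id, percent_identity) tuples, for
    example parsed from BLAST tabular output. The 95% cutoff is an
    assumed default, not a value taken from the protocol.
    """
    # Collect, for every trace, the exons that match it well enough.
    exons_by_trace = defaultdict(set)
    for exon_id, trace_id, identity in hits:
        if identity >= min_identity:
            exons_by_trace[trace_id].add(exon_id)

    # A trace carrying two or more exons links every pair of them.
    links = defaultdict(set)
    for trace_id, exons in exons_by_trace.items():
        for pair in combinations(sorted(exons), 2):
            links[pair].add(trace_id)
    return dict(links)


# Hypothetical identifiers and identities, for illustration only.
example_hits = [
    ("exon1", "traceA", 99.1),
    ("exon2", "traceA", 98.4),
    ("exon2", "traceB", 97.0),
    ("exon3", "traceB", 80.0),  # below the cutoff, so ignored
    ("exon3", "traceC", 99.0),
]

links = connecting_traces(example_hits)
for pair, traces in links.items():
    print(pair, sorted(traces))
```

Exon pairs that are absent from the result (here exon2 and exon3) have no single bridging trace and would need a different way of establishing linkage.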
In cases where no single trace could be identified to connect two exons, we performed chromosome walks on the individual exons using the Tracembler program []. These larger contigs, which were based upon overlapping sequence and also mate-pair relationships, were then used to determine linkage between genes. [...]

A sequence file was made for each of the GATA genes using the conserved dual-zinc finger domain, consisting of the two zinc finger exons and the N-terminal portion of the following exon. These sequences were aligned using ClustalW [], and manual improvements were then made in MacVector (see Additional File ). Maximum likelihood analysis was conducted using PhyML-aLRT [,] with a JTT model of evolution, and branch support was given by the aLRT CHI2-based parametric statistic. Bayesian inference was conducted using MrBayes v3.1 [], also with the JTT model of evolution. The results are a consensus of two converged runs of 3,000,000 generations, with branch support given as posterior probabilities. Neighbor-joining distance-based analysis was conducted using the MacVector program (v7.2.3) [], with support given by bootstrap percentages from 10,000 replicates. For the nematode GATA factors, the complete sequence for each factor was aligned using ClustalX and a tree was generated using PhyML-aLRT; the tree includes support from both the PhyML-aLRT CHI2-based parametric statistic and bootstrap percentages from a neighbor-joining analysis in MacVector. […]
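Bootstrap support in a distance-based analysis rests on resampling alignment columns with replacement and repeating the analysis on each pseudo-replicate. The sketch below shows only that resampling idea. It uses a simple p-distance and a nearest-neighbour check as a simplified stand-in for building a full neighbor-joining tree, and the toy sequences are invented for the example. It is not a description of how MacVector computes its values.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def p_distance(a, b):
    """Fraction of differing sites, ignoring columns with a gap in either."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbour(alignment, query):
    """Name of the sequence closest to `query` (ties broken alphabetically)."""
    others = sorted(name for name in alignment if name != query)
    return min(others, key=lambda n: p_distance(alignment[query], alignment[n]))


def neighbour_support(alignment, query, partner, replicates=1000, seed=0):
    """Fraction of replicates in which `partner` is nearest to `query`."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(replicates):
        sample = bootstrap_alignment(alignment, rng)
        if nearest_neighbour(sample, query) == partner:
            wins += 1
    return wins / replicates


# Invented toy sequences of equal length; not real GATA domains.
toy_alignment = {
    "seqA": "CRNCGATSTPLWRRD",
    "seqB": "CRNCGATTTPLWRRD",
    "seqC": "CVNCGTTATTLWRRN",
    "seqD": "CSNCHTTTTTLWRKN",
}

support = neighbour_support(toy_alignment, "seqA", "seqB")
print(f"seqA/seqB nearest-neighbour support: {support:.3f}")
```

A fixed seed makes the run reproducible. In a real analysis, each pseudo-replicate would be passed to the tree-building method and the support for each clade tallied across replicates.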

Pipeline specifications

Software tools: TBLASTN, BLASTX, BLASTN, Tracembler, Clustal W, MacVector, PhyML, MrBayes
Databases: VectorBase
Applications: Phylogenetics, Amino acid sequence alignment
Organisms: Caenorhabditis elegans, Drosophila melanogaster