Computational protocol: Sequence Motifs in MADS Transcription Factors Responsible for Specificity and Diversification of Protein-Protein Interaction

Protocol publication

[…] Protein structures were modeled using Modeller 8.2, with the structures 1EGW and 1N6J as templates, using the automodel module and generating 1000 structures. The best model according to the objective function was selected. SNP data were obtained from www.1001genomes.org (data for 80 ecotypes). Intron/exon structures were defined using the software tool Scipio. To compare the observed overlap of predicted motifs with SNPs, or their observed distance from intron/exon boundaries, with random expectation, random motif instances were generated by randomly choosing a number of motif locations in each protein equal to the predicted number of motif locations. This was repeated 1000 times for each sequence.

Conservation of motif occurrences was assessed as follows. First, MIKC MADS sequences were obtained from the genomes of rice, poplar, grapevine, maize (www.maizesequence.org), Medicago truncatula (www.medicago.org), papaya and sorghum. For rice and poplar we used the MIKC MADS domain protein sequences as provided in the respective publications; for the other genomes, sequences were obtained from the full set of coding sequences using the profile HMM software HMMER with the Pfam models for the MADS-domain (SRF-TF) and the K-domain (K-box). Next, putative orthologs for the Arabidopsis protein sequences were identified by aligning each sequence to each Arabidopsis sequence using MUSCLE and using sequence identity as the criterion in a bi-directional best hit approach.
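The bi-directional best hit step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the input format (a dictionary of pairwise identities) and the choice to compute identity over ungapped alignment columns are assumptions made for illustration, and the pairwise alignments themselves are assumed to have been produced beforehand (for example with MUSCLE).

```python
def pairwise_identity(aligned_a, aligned_b):
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return sum(x == y for x, y in columns) / len(columns)


def bidirectional_best_hits(identity):
    """Return {arabidopsis_id: other_id} for pairs that are each other's best hit.

    identity maps (arabidopsis_id, other_id) -> sequence identity (hypothetical format).
    Ties are resolved in favour of the pair seen first.
    """
    best_for_ath = {}    # Arabidopsis protein -> (best partner, identity)
    best_for_other = {}  # protein from the other genome -> (best partner, identity)
    for (ath, other), score in identity.items():
        if ath not in best_for_ath or score > best_for_ath[ath][1]:
            best_for_ath[ath] = (other, score)
        if other not in best_for_other or score > best_for_other[other][1]:
            best_for_other[other] = (ath, score)
    return {
        ath: other
        for ath, (other, _) in best_for_ath.items()
        if best_for_other[other][0] == ath
    }


# Toy identities with made-up identifiers, for illustration only.
example = {
    ("AP1", "Os1"): 0.8,
    ("AP1", "Os2"): 0.6,
    ("CAL", "Os1"): 0.7,
    ("CAL", "Os2"): 0.5,
}
orthologs = bidirectional_best_hits(example)
```

In this toy input only AP1 and Os1 are reciprocal best hits, so CAL is left without a putative ortholog rather than being assigned its one-directional best match.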
Subsequently, for each motif occurrence in a particular Arabidopsis protein, its conservation was calculated as the fraction of characters in the motif that were identical in the homologous regions of its orthologs; the same was calculated for all Arabidopsis proteins that did not have a motif occurrence at that particular location in the sequence.

To obtain insight into the dynamics of interaction motifs upon duplications, we analyzed a set of 1,459 MIKC MADS domain protein sequences from 257 species (obtained from InterPro by requiring the presence of both a MADS and a K-box domain, IPR002100 and IPR002487, respectively). From these, we obtained pairs of putative duplicates, which we defined simply as two proteins from the same species both having their highest sequence similarity with members of the same clade in Arabidopsis (as defined in ).

We focused on indels because the occurrence of an insertion or deletion could be interpreted as a signature of disruption of the interaction motif. For each pair of protein sequences, indel positions in their sequence alignments were probed by looking for stretches of length d with high sequence identity and exactly one insertion/deletion. d was set to 6, and the cutoff for identity was set to 5 (i.e. all positions except the indel were required to be identical). Subsequently, the overlap between those indels and the predicted interaction motifs was assessed.

To perform intramolecular correlated mutation analysis of AP1, sequences of MADS proteins were obtained using blastp on the NR database, filtering with hmmsearch to retain only sequences with a MADS-domain and a K-domain, and assigning sequences as putative AP1 orthologs using a best-hit criterion. These sequences were aligned with MUSCLE. Subsequently, the CAPS algorithm was used to obtain correlated mutations, using a reasonably stringent cutoff of 0.4 on the Pearson correlation coefficient that this algorithm returns for pairs of sites. […]
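The indel scan described above (window length d = 6, identity cutoff 5) can be sketched in Python as follows. This is one possible reading of the description, not the published implementation: the function names, the use of alignment-column coordinates, and the half-open motif intervals are assumptions introduced for illustration.

```python
def find_isolated_indels(aligned_a, aligned_b, d=6, min_identical=5):
    """Return alignment columns holding a single indel flanked by identical residues.

    A window of d columns qualifies when it contains exactly one gap column and
    at least min_identical identical residue pairs. With d=6 and min_identical=5,
    every non-gap column in the window must be identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    hits = set()
    for start in range(len(aligned_a) - d + 1):
        columns = list(zip(aligned_a[start:start + d], aligned_b[start:start + d]))
        # Columns where exactly one of the two sequences has a gap.
        gap_offsets = [i for i, (x, y) in enumerate(columns) if (x == "-") != (y == "-")]
        identical = sum(x == y and x != "-" for x, y in columns)
        if len(gap_offsets) == 1 and identical >= min_identical:
            hits.add(start + gap_offsets[0])
    return sorted(hits)


def indels_in_motifs(indel_columns, motif_intervals):
    """Keep indel columns that fall inside any motif interval.

    motif_intervals is a list of (start, end) pairs in alignment columns,
    end exclusive (an assumed convention).
    """
    return [
        column
        for column in indel_columns
        if any(start <= column < end for start, end in motif_intervals)
    ]


# Toy pairwise alignment with one deleted residue in the second sequence.
seq_a = "MKRIENKTNRQV"
seq_b = "MKRIE-KTNRQV"
indels = find_isolated_indels(seq_a, seq_b)
overlapping = indels_in_motifs(indels, [(3, 8)])
```

Sliding the window one column at a time means the same indel is found by several windows, so the positions are collected in a set and reported once.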

Pipeline specifications

Software tools: MODELLER, HMMER, MUSCLE, BLASTP
Databases: Pfam
Applications: Protein structure analysis, Nucleotide sequence alignment
Organisms: Arabidopsis thaliana, Homo sapiens