Computational protocol: Antigen Diversity in the Parasitic Bacterium Anaplasma phagocytophilum Arises from Selectively-Represented, Spatially Clustered Functional Pseudogenes

Protocol publication

[…] Ninety-four msp2 pseudogene DNA sequences were obtained for A. phagocytophilum HZ (NC_007797), a human-origin strain from New York State, US. Ninety-one of these sequences could be aligned using Prankster with the Hasegawa-Kishino-Yano (HKY85) substitution model and program defaults. Phylogenetic analysis was conducted in PAUP* to construct a parsimony tree with default values; gaps were treated as missing data. The unrooted tree was edited and displayed using FigTree v1.2.
To determine the 1∶1 amino acid sequence identities between the pseudogenes and expression cassette sequences, 199 complete A. phagocytophilum msp2 expression site gene sequences were included in the study. The sources of these sequences, including host origin, are described in three earlier publications. (Although more expression cassette sequences were available, only one per animal was used for the analysis in this paper.) The expression cassettes and the 94 HZ pseudogenes were translated to amino acids, and the resulting FASTA file was analyzed with MatGAT v2.02 (Matrix Global Alignment Tool) using the BLOSUM50 matrix. These 94 pseudogenes were chosen because they contained 5′ or 3′ conserved regions and a LAKT motif that allowed for analysis. The percent identities determined by MatGAT for an all-against-all comparison were exported to a Microsoft (Redmond, WA) Excel spreadsheet.
Summary statistics of percent identities were obtained in R (R Development Core Team). The maximum identity for each pseudogene against all possible expression cassettes was determined, and the mean, standard deviation, mode, and range of maximum identities were calculated. These statistics indicate the likelihood that any particular pseudogene was used in any expression cassette. Maximum identities were discretized for analysis as nearly perfect (99–100% identity between a pseudogene and an expression cassette), high (90–98%), moderate (70–89%), low moderate (60–69%), low (40–59%), and not used (<40%).
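The maximum-identity summary and the discretization described above can be sketched in a few lines. This is a minimal illustration in Python rather than the authors' R workflow. The `identity` table, the sequence names, and the helper `identity_class` are hypothetical, and treating each category as starting at its lower bound (so that 98.5% counts as "high") is an assumption, since the source lists only whole-number ranges.

```python
from statistics import mean, stdev, multimode

# Hypothetical percent identities (pseudogene -> expression cassette -> %),
# standing in for the all-against-all table exported from MatGAT.
identity = {
    "pg_A": {"ec_1": 99.4, "ec_2": 71.0, "ec_3": 48.2},
    "pg_B": {"ec_1": 65.3, "ec_2": 92.8, "ec_3": 50.0},
    "pg_C": {"ec_1": 38.1, "ec_2": 35.6, "ec_3": 39.9},
}

def identity_class(pct):
    """Map a maximum percent identity to the categories named in the text.

    Assumption: each category is half-open at its lower bound.
    """
    if pct >= 99:
        return "nearly perfect"
    if pct >= 90:
        return "high"
    if pct >= 70:
        return "moderate"
    if pct >= 60:
        return "low moderate"
    if pct >= 40:
        return "low"
    return "not used"

# Best match of each pseudogene against all expression cassettes.
max_by_pseudogene = {pg: max(row.values()) for pg, row in identity.items()}

# Transposed view: best-matching pseudogene identity for each cassette.
cassettes = sorted({ec for row in identity.values() for ec in row})
max_by_cassette = {
    ec: max(row[ec] for row in identity.values()) for ec in cassettes
}

# Summary statistics of the per-pseudogene maxima.
values = list(max_by_pseudogene.values())
summary = {
    "mean": mean(values),
    "sd": stdev(values),
    # Mode is taken on rounded values so that near-ties can coincide.
    "mode": multimode(round(v) for v in values),
    "range": (min(values), max(values)),
}

classes = {pg: identity_class(v) for pg, v in max_by_pseudogene.items()}
print(classes)
print(summary)
```

With real data, `identity` would be filled from the exported spreadsheet, keeping only the pseudogene-versus-cassette cells of the all-against-all matrix.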
A transposed summary was also created to evaluate the maximum identity of all pseudogenes against each given expression cassette, to capture the likelihood that any given cassette could have sampled from particular vs. multiple pseudogenes. To determine whether pseudogenes that had high or nearly perfect identity with expression cassettes were more likely to match with particular types of hosts than pseudogenes with moderate and low moderate identities, a chi-square contingency test was performed with the following hosts: deer, human, horse, carnivore, European animals, and woodrats.
For spatial analysis of the distribution of pseudogenes on the two genomes, we used a Wald-Wolfowitz runs test in the R package “lawstat”, with genes as the unit of analysis. An ANOVA was used to compare mean percent identities of pseudogenes with expression cassettes in the three identified spatial clusters on the genome. A Spearman rank correlation coefficient was calculated to assess whether maximum identities of pseudogenes near the expression site were higher than those of more distantly positioned pseudogenes. For all tests, a value of P≤0.05 was considered evidence of statistical significance. […]
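To illustrate the idea behind the spatial analysis, the following is a minimal sketch of a Wald-Wolfowitz runs test using the large-sample normal approximation. It is not the lawstat implementation the authors used. The function name `runs_test` and the example gene order are hypothetical, and encoding each gene along the genome as 1 (msp2 pseudogene) or 0 (other gene) is an assumption about how "genes as the unit of analysis" might be set up.

```python
import math

def runs_test(sequence):
    """Wald-Wolfowitz runs test on a binary sequence (normal approximation).

    Returns (number of runs, z statistic, two-sided p-value).
    Fewer runs than expected (negative z) suggests spatial clustering.
    """
    n1 = sum(1 for x in sequence if x)
    n0 = len(sequence) - n1
    if n1 == 0 or n0 == 0:
        raise ValueError("both categories must be present")

    # A new run starts wherever adjacent elements differ.
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if bool(a) != bool(b))

    n = n0 + n1
    expected = 2 * n0 * n1 / n + 1
    variance = 2 * n0 * n1 * (2 * n0 * n1 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)

    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p

# Hypothetical gene order: 1 = pseudogene, 0 = any other gene.
genome = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
runs, z, p = runs_test(genome)
print(f"runs={runs}, z={z:.2f}, p={p:.4f}")
```

The normal approximation is only reasonable when both categories are well represented; for short sequences an exact runs distribution would be the safer choice.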

Pipeline specifications

Software tools: FigTree, MatGAT
Application: Phylogenetics
Organisms: Anaplasma phagocytophilum
Diseases: Infection