Computational protocol: Ancient gene duplications have shaped developmental stage-specific expression in Pristionchus pacificus

Similar protocols

Protocol publication

[…] Raw reads were aligned to the P. pacificus Hybrid1 genome assembly using tophat (version 2.0.3) []. Cufflinks (version 2.0.1) was used to quantify expression levels as FPKM values in all ten libraries for all predicted genes []. We used the ‘prcomp’ function of the R package ‘stats’, to perform a PCA for all genes with Cufflinks FPKM values >0. In parallel, we used the program Cuffdiff (version 2.0.1) in blind mode, i.e. without information about biological replicates, to predict significant differential expression across all pairwise comparisons (FDR <0.01). For the Cuffdiff results, we used a hierarchical clustering approach, implemented in R, to complement the results obtained from the PCA.To identify clusters of coregulated genes, we applied the biclustering algorithm SAMBA [], implemented in the Expander package (version 6.3.1), to the pairwise results from Cuffdiff and run in blind mode. We generated an n×m matrix with n genes and m=45 comparisons, with the individual entries indicating whether a gene was significantly differentially expressed (-1 : = downregulation and 1: = upregulation) or not. We then loaded the matrix as relative expression data into Expander and ran SAMBA using default settings, resulting in 29 biclusters.FPKM values for all genes (Cufflinks), fold changes and FDR corrected p-values (Cuffdiff) for all genes that were found to be significantly differentially expressed in at least one comparison, and the assignments of genes to expression biclusters (SAMBA) are shown in Additional file . [...] We used the program hmmsearch (option -E 0.001) to search for known protein domains in the set of 30,884 predicted P. pacificus protein sequences (version TAU []), as defined by the PFAM database. The search program hmmsearch and the PFAM domain database were both obtained from the HMMER package (version 3.0).In order to identify homologs for P. pacificus genes, we downloaded protein sequences for C. elegans, C. briggsae, C. angaria, Haemonchus contortus, Meloidogyne hapla, Brugia malayi, Bursaphelenchelus xylophilus, Ascaris suum, and Trichinella spiralis from Wormbase WS230 and Heterorhabditis bacteriophora sequences from Wormbase WS231. Furthermore we downloaded protein sequences for Loa loa and Wuchereria bancrofti from the filarial worms sequencing project, Broad Institute of Harvard and MIT (, Meloidogyne incognita protein sequences from the M. incognita resources website (, Panagrellus redivivus sequences from a website provided by Jagan Srinivasan, and protein sequences for Dirofilaria immitis from Homologs for P. pacificus were identified by searching these data sets for BLASTP (version 2.2.28+) hits with e-value <0.001. This resulted in 20,999 (68 %) P. pacificus proteins with homologs in other nematode species and 9885 (32 %) P. pacificus proteins without homologs (orphan genes).We predicted one-to-one pairs between P. pacificus and C. elegans by using a variant of the widely employed methodology of best-reciprocal hits [, , ]. More precisely, we first defined inparalogs and then assigned best-reciprocal hits as one-to-one orthologs, only if neither the C. elegans nor the P. pacificus protein had any inparalog. Hereby, inparalogs were defined by following a similar methodology as implemented in the Inparanoid method [], i.e. by identifying intraspecies BLASTP pairs that are more closely related than the best inter-species pairs. This procedure predicted 5985 one-to-one orthologous pairs. We evaluated the quality of one-to-one orthology predictions using a data set of 107 C. elegans genes for which orthology relationships were manually investigated using alternative versions of P. pacificus gene predictions, TBLASTN searches to complement incomplete gene models, and subsequent phylogenetic analysis including all potential paralogous sequences. Out of 57 C. elegans genes with manually identified one-to-one orthologs 42 were also correctly predicted by our automatic method. For 50 C. elegans genes, for which manual analysis could not identify one-to-one orthologs, 48 did not have predicted one-to-one orthologs by our automatic method. False and missing orthology assignments could be attributed in most cases to missing or only incomplete gene predictions. Although our set of manually annotated C. elegans genes is not a random subset and therefore is not representative for the whole genome, our results indicate that even though a potentially large fraction of true orthology relationships were missed, most of the predicted one-to-one orthologs are indeed correct. The evaluation results and the manually curated sequences for the P. pacificus orthologs of 57 C. elegans are presented in the Tables S4 and S5 and Data S1 (Additional file ).The set of P. pacificus orphan genes and genes with homologs in other nematode species (excluding P. pacificus one-to-one orthologs with C. elegans), were further subdivided into singleton sequences and genes with putative paralogs by computing an adjacency matrix out of BLASTP hits within P. pacificus and extracting all connected components. Proteins that were members of connected components of size greater than one, were classified as “with paralogs”, or “singletons” otherwise. In the case of genes with homologs in other nematode species, the category “with paralogs” (N=11,919) represents multiple P. pacificus genes with many-to-many, many-to-one, and many-to-zero orthology relationships [] with respect to C. elegans genes, while singletons (N=3095) represent one-to-many orthology relationships. Similarly, the 9885 P. pacificus orphan genes were divided into 4820 genes with paralogs and 5065 singletons. [...] Multiple sequence alignment for HSP20 and HSP70 protein sequences were computed using Clustal Omega tool []. In order to test which model of amino acid substitution better explain the evolution of these proteins we have used Prottest server []. For both analyzed gene families, the LG substitution model was identified as best model by Prottest. The final maximum likelihood trees (Figs. and ) were constructed using LG substitution model, as implemented in the R Library Phangorn []. […]

Pipeline specifications

Software tools TopHat, Cufflinks, HMMER, BLASTP, TBLASTN, Clustal Omega, ProtTest, Phangorn
Databases InParanoid Meloidogyne
Applications Phylogenetics, RNA-seq analysis, Amino acid sequence alignment
Organisms Caenorhabditis elegans, Pristionchus pacificus, Parazen pacificus