Computational protocol: Tissue Specific Evolution of Protein Coding Genes in Human and Mouse

Similar protocols

Protocol publication

[…] We used RNA-seq data for mouse from the ENCODE project [] and for human from Fagerberg et al. []. For mouse, the raw reads in FASTQ format obtained from the ENCODE FTP server (ftp://hgdownload.cse.ucsc.edu/goldenPath/mm9/encodeDCC/wgEncodeCshlLongRnaSeq/) were processed with TopHat2 (version 2.0.9) and Cufflinks2 (version 2.1.1) [], using the gene models from Ensembl version 69 []. For human, processed data from Fagerberg et al. [] were retrieved from the ArrayExpress database (E-MTAB-1733) []. For mouse 22 tissues and 27 tissues for human were analyzed, of which 16 are homologous between the two species. Processed RNA-seq expression was further treated as follows (R script in ): multiplied by 106 (to avoid values under 1, which are negative after log transformation); log2 transformation; and quantile normalization, replacing zero values by log2(1.0001).We used either global parameters of expression: median expression, maximal expression, and specificity among all tissues; or expression in each tissue separately. Expression specificity τ was calculated as described in Yanai et al. []. Values of expression specificity close to zero indicate that a gene is broadly expressed, and close to one that it is specific of one tissue.Additional analysis was performed on microarray expression data from 22 human and 22 mouse tissues selected from the Bgee database [] (), as well as 8 human and 6 mouse tissues from Brawand et al. RNA-seq []. The corresponding results are presented in .As measures of evolutionary rate of protein-coding genes, we used the estimates from the branch-site model [] as precomputed in Selectome []: purifying selection estimated by the dN/dS ratio ω0 on the proportion of codons under purifying selection (noted "Omega" in the figures), evidence for positive selection estimated by the log-likelihood ratio ΔlnL of H1 to H0 (models with or without positive selection), and the proportion of neutrally evolving sites p1. In Selectome, computations are done on multiple sequence alignments filtered to remove ambiguously aligned regions []; all filtered and unfiltered alignments can be downloaded at http://selectome.unil.ch/cgi-bin/download.cgi. The evolutionary rate parameters were estimated from the Euteleostomi gene trees on the Murinae branch for mouse and the Homininae branch for human. We also present in supplementary materials () another estimation of evolutionary rate, using the exon based MI score [,].For all parameters the longest coding transcript was chosen as a representative of the gene, as the evolutionary rate data were available only for this transcript. Analysis was also redone for mouse using the most expressed transcript; results are presented in .Intron number, intron length, CDS length (coding sequence length) and GC content were taken from Ensembl 69 []. Essentiality data were manually mapped and curated (Walid Gharib, personal communication) for human from the OMIM database [] and for mouse from the MGI database [].Data of expression at different developmental stages were obtained from Bgee []. The parameter stage number indicates the number of stages in which the gene was reported as expressed. Mouse development was divided in 10 stages: 1. Zygote 2. Cleavage 3. Blastula 4. Gastrula 5. Neurula 6. Organogenesis 7. Fetus 8. Infant 9. Adolescent 10. Adult; and human in 8 stages: 1. Cleavage 2. Blastula 3. Organogenesis 4. Fetus 5. Infant 6. Adolescent 7. Early adult 8. Late Adult.Phyletic age and connectivity (protein-protein interactions) data were downloaded from the OGEE database [], as ordinal data. Phyletic age stages used are: 1. Mammalia 2. Chordata 3. Metazoa 4. Fungi/Metazoa 5. Eukaryota 6. Cellular organism.Recombination rate was calculated from Cox et al. 2009 data [].Correlation between the different parameters was performed in two ways: simple pairwise Spearman correlation and partial Spearman correlation (results for Pearson correlation are also presented in ). For partial correlation each pair of parameters were compared taking into account all other parameters: first a linear model according to all other parameters for each of the two analyzed parameters was calculated; then the Spearman correlation was calculated on the corresponding residuals. All R code is available as .Partial correlation was used to determine the correlation between two parameters excluding dependencies from other parameters. The principle of the partial correlation can be shown on a toy example (Table A in ). As example data for human height and leg length were simulated, so that either a) the length of both legs is calculated depending on height, or b) the length of the left leg is calculated from height, and the length of the right leg is calculated from the length of left leg. With simple correlation the two cases cannot be distinguished, as all three parameters correlate strongly with each other. With partial correlation we can distinguish the two cases: in case a) left leg and right leg length don’t correlate with each other if we exclude influence of the height, but in case b) we see a strong correlation between them, as expected, while right leg length no longer correlates with height.Expression, intron length, intron number, CDS length, τ, ω0, paralog number were log2 transformed before calculations. p1 and ΔlnL were normalized by taking the fourth root [,]. For parameters containing zeros a small value was added before log transformation, chosen as the minimal non zero value of the parameter (except for RNA-seq, see detailed treatment above). We note that despite the transformations, the distributions of several parameters remain quite far from normal, with notably peaks of null values. While this probably has some impact on the analysis, we minimize it by using Spearman correlation; truncating distributions would not allow to analyze globally the influence of all parameters. All distributions, before and after transformation, are shown in Figure A in . Altogether 9553 protein-coding genes for human and 9485 protein-coding genes for mouse were analyzed.All the analysis was performed in R [] using Lattice [] and plyr []. For the representation of the data Cytoscape version 2.8.2 [] with library RCytoscape [] and Circos version 0.62–1 [] were used. […]

Pipeline specifications

Software tools TopHat, Cufflinks, RCytoscape, Circos
Databases OMIM ArrayExpress Bgee Selectome
Applications RNA-seq analysis, Genome data visualization
Organisms Mus musculus, Homo sapiens, Saccharomyces cerevisiae