Computational protocol: Peeling Back the Evolutionary Layers of Molecular Mechanisms Responsive to Exercise-Stress in the Skeletal Muscle of the Racing Horse

Similar protocols

Protocol publication

[…] We generated RNA-sequence (RNA-seq) data in six horses before exercise (BE) and AE and described elsewhere. Briefly, samples of skeletal muscle and blood were taken from six Thoroughbred horses BE and AE. ‘BE’ samples were collected from the triceps brachii of the right leg and from the jugular vein and carotid artery of each horse. After a resting period of several hours, ‘AE’ samples were collected immediately after a 30-min trot from the same tissues of each individual. Thoroughbred horses usually canter for 17–18 min per day. For the purposes of this study, a 30-min trot was taken to be the equivalent to 17–18 min of cantering. Total RNAs from skeletal muscle and blood samples were isolated using TRIzol (Invitrogen) and the RNeasy RNA purification kit with DNase treatment (Qiagen). mRNA was isolated from the total RNA using oligo-dT beads and reverse transcribed into double-stranded cDNA fragments. Construction and sequencing of an RNA-seq library for each sample was carried out based on Illumina HiSeq2000 protocols in order to generate 90 pair-end reads. Twenty-four sets of transcriptome data were generated for muscle and blood from six horses both BE and AE. TopHat (version 1.4.1) was used to map the sequences to a horse reference genome and annotated using the EquCab2 database (http://hgdownload.cse.ucsc.edu/downloads.html#horse). [...] We used the R package edgeR, which is based on a negative binomial model, to examine differential expression of replicated count data. This is because RNA-seq data may exhibit more variability than expected in a Poisson distribution, because it is more widely dispersed in the genome. When a negative binomial model is used, the dispersion has to be estimated before a differentially expressed gene (DEG) analysis is carried out. EdgeR provides an estimation of this dispersion using a Cox-Reid profile-adjusted likelihood method. After negative binomial models are fitted and dispersion estimates are obtained, it is possible to proceed with the tests for determining differential expression using a generalized linear model (GLM) likelihood ratio test. GLMs specify probability distributions according to the mean–variance relationship; for example, the quadratic mean–variance relationship for a read count. The GLM likelihood ratio test is based on the idea of fitting negative binomial GLMs with Cox-Reid dispersion estimates. This automatically takes all known sources of variation into account. Significant DEGs were detected with a cut-off value of false discovery rate (FDR) <0.01, based on a paired design between ‘BE’ and ‘AE’. The equine Ensembl gene IDs were converted to official gene symbols by cross-matching with human Ensembl gene IDs and the official gene symbols. The official gene symbols of human homologues of equine genes were used for functional clustering and enrichment analyses using the Database for Annotation, Visualization, and Integrated Discovery (DAVID). The representation of functional groups in blood and skeletal muscle relative to the whole genome was investigated using the Expression Analysis Systematic Explorer (EASE) tool within the DAVID, of which the EASE is a modified Fisher's exact test used to measure enrichment of gene ontology (GO) terms. [...] Mapped reads were assembled using Cufflinks (version 1.3.0) to estimate the abundance of genes. Fragments per kilobase of exon per million fragments (FPKM) of each sample were calculated to estimate the expression levels of the genes. The FPKM value set for all genes in the muscle and blood tissue samples was normalized using Quantile normalization. Connectivity was calculated using the Weighted Gene Co-expression Network Analysis (WGCNA) R-package for both up-regulated and down-regulated genes (DRGs) having six points of FPKM for both ‘BE’ and ‘AE’. This connectivity, referred to as the degree of connectivity, is the sum of the connection strengths with other genes or networks. Following calculation of the degree of connectivity, a gene co-expression network was generated and the genes clustered onto a topological overlap matrix (TOM), based on their dissimilarity. Genes with a high connectivity to each other clustered at the same module. Modules were formed based on one of the exercise conditions, and the extent to which the module was preserved was calculated for a different condition in order to identify any reciprocal disruption of gene expression. An Eigengene-based connectivity was calculated for up-regulated genes (URGs) of the modules, AE, in both muscle and blood. The cut-off value, module membership values >0.95, and gene significance (GS) values >20 were applied to identify the genes with a high GS and high intra-modular connectivity in each module. [...] Whole-blood samples were collected from 14 Thoroughbred racing stallions from the Korean Racing Authority and from four male and two female Jeju domestic ponies (Equus caballus) from the Jeju Provincial Livestock Institute, Korea. Blood (10 ml) was drawn from the carotid artery and was treated with heparin to prevent clotting. A genomic DNA quality check was carried out using fluorescence-based quantification on an agarose gel, a standard electrophoresis on a 0.6% agarose gel, and via a pulsed-field gel, using 200 ng of DNA. Manufacturers' instructions were followed to create a paired library of 500-bp fragments. This consisted of the following: purified genomic DNA fragments of <800 bp, fragments with blunt ends, fragments with 5′ phosphorylated ends, fragments with a 3′-dA overhang, some with adaptor-modified ends, purified ligation product, and a genomic DNA library. Following this, we generated a sequence data using HiSeq 2000 (Illumina, Inc.). Using the Burrows-Wheeler Aligner with the default setting, pair-end sequence reads were mapped to the reference horse genome (equCab2). We used the following open-source software packages; Picard Tools, SAMtools, and the Genome analysis toolkit, for downstream processing and variant calling. Substitution calls were made with GATK UnifiedGenotyper. All calls with a Phred-scaled quality of <20 were filtered out. For each chromosome, we inferred the phased haplotype and imputed the missing alleles for the entire set of Thoroughbred populations simultaneously using BEAGLE. After phasing, observed heterozygosity and minor allele frequency (MAF) were calculated from autosome set using PLINK (version 1.07). Average observed heterozygosities were 0.285 in Thoroughbred horses and 0.345 in Jeju ponies. Average MAFs were 0.207 in Thoroughbred horses and 0.220 in Jeju ponies.We additionally genotyped 11 Thoroughbred horses using EquineSNP50 Genotyping BeadChip (Illumina, Inc.). Common loci in both SNP chip and DNA re-sequencing data were extracted and compared using VCFtools. We identified 98.278–99.572% of genotype concordance (Supplementary Table S2). [...] The iHS was calculated using the rehh R package and all SNPs from the horse genomes with the same ancestral state as that from the Jeju domestic ponies. Values were accepted when core SNPs were located in genes. We selected a significant iHS at P-value where Φ(x) represents the Gaussian cumulative distribution function of iHS, which is >2, and iHS <0. Conventional FST values were calculated for genes using Arlequin 3.5 based on pairwise differences between the haplotypes of Thoroughbred racing horses and those of Jeju domestic ponies. The gene region was derived from the phased haplotype of the horse genomes, using genomic information (Ensembl Genes67 and equCab2). The cut-off point for FST is 95% quantile of the empirical distribution of FST. [...] We downloaded the protein and reference mRNA sequences for humans, mouse, dog, pig, cow, and horse from ENSEMBL. The 1 : 1 orthologs of all six species were made using Mestortho. Following this, 9000 1 : 1 orthologs were found and used to estimate the synonymous and non-synonymous substitution rates in mammals. Phylogenetic trees were obtained using Timetree. Orthologous gene sets were aligned using PRANK with the default settings, and poorly aligned sites were eliminated using Gblocks. The maximum likelihood method (codeml of PAML 4) was used to estimate the dN (the rate of non-synonymous substitutions), dS (the rate of synonymous substitutions), and ω (the ratio of non-synonymous substitutions to the rate of synonymous substitutions), with F3 × 4 codon frequencies under the branch model (model = 2, NSsites = 0) and the basic mode l (model = 0, NSsites = 0). Orthologs with dS > 3 or ω > 5 were filtered. After this process, 8417 orthologs remained. A log-likelihood ratio test was performed to compare these models, and we applied an FDR correction. [...] To ascertain the FPKM values of the URGs in skeletal muscle tissue, signed correlation values [adjacency values: (0.5 × (1 + correlation))soft-thresholding power] were calculated using WGCNA, and a weighted network adjacency matrix was drawn up. We applied a cut-off value of >0.6 to the adjacency, which signifies 0.86 of R2 and 0.92 of the correlation to a linear line fitting with in a power-law distribution. This resulted in a scale free topology of the network. Those genes directly connected to the module core genes and that were associated with the dN/dS, FST, and iHS statistics, which were selected to form the single-depth connection network for the core genes. The correlation network plot was generated using the network visualization tool, Cytoscape (version 2.83). A KEGG pathway enrichment test was performed using EASE, with a cut-off value of <0.1 for the selected and core genes associated with dN/dS, FST, and iHS. […]

Pipeline specifications

Software tools TopHat, edgeR, DAVID, EASE, Cufflinks, WGCNA, BWA, Picard, SAMtools, GATK, PLINK, VCFtools
Applications RNA-seq analysis, GWAS
Organisms Equus caballus, Homo sapiens
Diseases Nervous System Diseases