Computational protocol: Murine HPV16 E7-expressing transgenic skin effectively emulates the cellular and molecular features of human high-grade squamous intraepithelial lesions

Similar protocols

Protocol publication

[…] The publically available microarray data from the den Boon et al. manuscript was downloaded from the GEO database (GSE63514) and used to test for enrichment of the K14E7 signatures in cervical cancer and CIN cohorts. Procedures involved in murine skin grafting, sample collection, RNA extraction, cDNA library preparation, sequencing and data pre-processing for the mouse skin RNA-sequencing (RNA-seq) data have been previously described , . Primary data are available via NCBI sequence read archive (SRA ID: SRP113560, BioProject accession ID: PRJNA395772). The ‘UP’ and ‘DOWN’ gene signatures (gene sets) of K14E7 mice were curated using the top-ranked ~200–300 genes up/down-regulated in K14E7 skin versus age-matched C57BL/6 skin from the RNA-seq analysis. The ranking of genes was performed using standardised differential gene expression testing based on Empirical Bayes moderated t-statistics embedded within the Linear Models for Microarray Data (LIMMA) R package . The gene expression heat maps were generated using HeatMapImage module within the GenePattern public server . [...] Gene symbols from the murine gene sets (GRCm38.p5/mm10) were converted to human orthologous symbols (GRCh38.p5) for the gene set tests with Gene Set Enrichment Analysis (GSEA) . GSEA is a popular analysis tool that can be used to assess whether samples in a microarray or an RNA-seq experiment would display correlation with biological phenotypes defined by a priori classified gene sets/signatures, such as gene sets contained in the Molecular Signatures Database (MSigDB) . It is a competitive gene set test that uses sample permutation/re-sampling to detect the enrichment of genes at the top or bottom of gene sets, correlating the enrichment between either of two phenotypes. This is typically performed a large database of gene sets where a false discovery rate (FDR) is applied to the nominal p-value calculations to adjust for multiple testings. In this paper, we used GSEA on a small list of gene sets (n=2 for K14E7 ‘UP’ and ‘DOWN’ gene sets, and n=4 ‘UP’ and ‘DOWN’ gene sets from K14E7 grafting data) to broadly evaluate if the mouse K14E7 gene sets would be enriched in the human microarray data. To supplement this approach, we used two additional gene set tests: Correlation Adjusted MEan RAnk (CAMERA) and Rotation (ROAST) gene set tests . These two methods are included in the LIMMA R package and can be applied to data fitted to a linear model, such as RNA-seq count data analysed using the LIMMA-Voom pipeline , , displaying improved statistical power even for small data sets . CAMERA is a competitive gene set test that evaluates whether ranking of genes in a gene set is highly ranked relative to all other genes not in the gene set, while accounting for inter-gene correlation in a linear model fit . ROAST is a self-contained gene set test that tests whether genes in the gene set are differentially expressed, returning the proportion of genes that are up- or down-regulated in the gene set . A pre-ranked GSEA analysis was also performed where we tested for enrichment of CIN3 or Cancer ‘UP’ and ‘DOWN’ gene sets in a pre-ranked list of genes from K14E7 skin versus C57BL/6 skin, ranked according to t-statistics values assigned by LIMMA. The CIN3 and cancer gene signatures were curated using the top ~200–300 significant differentially expressed genes identified via LIMMA.CAMERA and ROAST were also used for gene set testing for enrichment of immunologic gene sets curated at the MSigDB. The gene sets were downloaded from the Walter and Eliza Hall Institute of Medical Research Bioinformatics Resources website: http://bioinf.wehi.edu.au/software/MSigDB/ where complete human and converted murine versions of the MSigDB C7 gene sets (annotated with Entrez Gene IDs) are available. [...] The CIBERSORT gene expression deconvolution package was used to estimate the immune cell composition in the grafting RNA-seq data . Read counts (per million), equalised for library size using edgeR , were used as input for the analysis. The LM22 signature was used as the immune cell gene signature and the settings for the run were: 1000 permutation with quantile normalisation disabled. Two-way ANOVA with Tukey's multiple corrections was used to infer the statistical significance of the predicted immune cell populations where P<0.05 was considered significant. […]

Pipeline specifications

Software tools limma, GenePattern, GSEA, voom, CIBERSORT, edgeR
Databases MSigDB
Application RNA-seq analysis