Computational protocol: An atlas of transcriptional, chromatin accessibility, and surface marker changes in human mesoderm development

Similar protocols

Protocol publication

[…] Obtained RNA-seq reads were trimmed for base call quality (PHRED score >=21) and for adapter sequences (using Skewer), and then were subsequently processed using a slightly-modified version of the ENCODE long RNA-seq pipeline for quantification of mRNA expression (https://www.encodeproject.org/rna-seq/long-rnas/). Specifically, reads were aligned to hg38 using STAR 2.4 (ref. ); gene-level expression was then quantified using RSEM 1.2.21 (ref. ). We only kept samples with at least 10,000,000 uniquely mapping reads and with at least 50% of reads uniquely mapping, which meant rejecting one sample (from sclerotome) out of 34. The numbers and percentages of uniquely mapping reads for each sample are listed in . The full parameter settings used can be found in our versions of STAR_RSEM.sh and STAR_RSEM_prep.py (see Code Availability below).To facilitate global comparisons of gene expression levels across cell types, we first took the log2TPM (transcripts per million) values for each gene, before filtering out all genes where there was a difference of less than 2 (in log2TPM units, i.e., a 4-fold difference in expression) between the cell types with the highest and lowest expression. Next, we used ComBat with non-parametric priors (as implemented through the sva R package) to correct for batch effects. This sometimes left small negative values for the expression of some genes, which we set to 0. The R Markdown script implementing this batch correction is bulkDataViz.Rmd.For ease of use, we also prepared a spreadsheet with TPM values for each gene, augmented with the following information on each gene: 1) whether the gene product is present on the cell surface (GO code GO:0009986); 2) for each pair of adjacent conditions, whether the gene was differentially expressed between those conditions; and 3) the shrunken log-fold-change for that gene between those conditions. We provide (1) as a convenience to help in finding potential surface markers that were not included in our high-throughput screen (e.g., because an antibody was not available). (2) and (3) were calculated by DESeq2 (ref. ) using batch information; genes were called as differentially expressed at a false discovery rate (FDR) of 0.1.The raw data from the bulk-population RNA-seq can be found in [Data Citation 1]. A spreadsheet of TPM values can be found in [Data Citation 2]. The annotated spreadsheet, as described in the previous paragraph, is in [Data Citation 3]. […]

Pipeline specifications

Software tools skewer, STAR, RSEM, ComBat, DESeq2
Application RNA-seq analysis