Computational protocol: Intrinsic transcriptional heterogeneity in B cells controls early class switching to IgE

Similar protocols

Protocol publication

[…] A reference transcriptome file was constructed by appending the following sequences to the mouse GRCm38 genome reference transcripts: External RNA Controls Consortium (ERCC) spike-in sequences, Iγ1-eGFP and Iε-tdTom transcripts, ncRNAs (germline transcripts) from the Iγ1 (), Iε (), Iα (), Iγ2a (), Iγ2b (), and Iγ3 () promoters, the Iμ–Cμ transcript (GenBank accession no. AH005309.2), and switched transcripts composed of the Iμ promoter with sequences from the Cγ1 (ENSMUST00000194304)-, Cε (ENSMUST00000137336)-, Cα (ENSMUST00000178282)-, Cγ2a (ENSMUST00000103416)-, Cγ2b (ENSMUST00000103418)-, and Cγ3 (ENSMUST00000103423)-constant regions. This reference transcriptome was used to construct an index for Salmon (v0.6.; , preprint) in quasi-mapping mode. Transcript-level expression measured as TPM (Tables S1 and S2) was then estimated using Salmon (v0.6.0) and converted to gene-level expression by passing a map for conversion of transcript identifiers to gene identifiers using Salmon’s −g flag.Cells were excluded (28 from batch 1 and 23 from batch 2) from further analyses if: >80% of reads mapped to ERCC sequences or had <2,000 expressed (TPM ≥ 1) genes; <70% of reads mapped to any reference sequence; or >10% of reads mapped to mitochondrial genomes (see Fig. S4, A and B; and Table S3). After quality control, TPM values for the ERCC spike-ins were removed from the dataset for each cell, and the values for the remaining endogenous transcripts were rescaled so that they summed to one million.For stringent mapping, a reference transcriptome was constructed containing solely the Iγ1-GFP and Iε-Tom transcripts; the ncRNAs from the Iγ1, Iε, Iα, Iγ2a, Iγ2b, and Iγ3 promoters; and the Iμ–Cμ transcript and the switched transcripts Iμ-Cγ1, Iμ-Cε, Iμ-Cα, Iμ-Cγ2a, Iμ-Cγ2b, and Iμ-Cγ3. An index for Bowtie (v1.1.2; ) was created using the bowtie-build indexer. Reads were mapped against this index using Bowtie (v1.1.2) and the options–best –m 1 –X 2000 to ensure that only the best uniquely mapping reads were reported as aligned. Unaligned reads were removed from the resulting output SAM file using samtools view with the following options: -h -f 3. Only reads uniquely mapped to each transcript sequence were counted using a custom Python script. Plots indicating mapping coverage of these reads were generated using bedtools genomecov to count read depth at each base within the transcripts followed by visualization of these depths using a custom Python script. This stringent mapping restricts detection to reads that map uniquely to one of the transcripts. It is likely that this lowers the detection sensitivity and hence underestimates the numbers of transcript-positive cells, but it ensures high confidence in those transcripts that are detected.Markers of B cell activation were determined by analyzing existing transcriptomic data (), which provides gene expression values (as fragments per kilobase million) for one sample of resting B cells and one of undivided B cells 24 h after activation. 626 genes indicative of activation were chosen as those with at least a fourfold increase in expression in the activated sample and with a minimum fragments per kilobase million of 2 in that sample. The activation score for each cell in this study was calculated as the sum of TPM values for the activation marker genes normalized to the maximum activation score from each batch (see Tables S3 and S4).Cell cycle assignment was performed using Cyclone (Tables S3 and S4; ). Gene expression values measured in TPM were used as input. This generated G1, G2M, and S-phase scores for each cell. Cells were assigned to a particular cell cycle stage as previously described () such that if either the G1 or G2M score was above 0.5, the cell was assigned to the stage with the highest score. If neither of these scores was above 0.5, the cell was assigned to S phase.HVGs and LVGs were determined using BASiCs (data as described in ). Estimated read counts from Salmon (rounded to integer values) were used as input. HVGs were detected at a variance contribution threshold of 89% leading to an optimized evidence threshold of 0.5145 and giving an estimated false discovery rate of 5.71% and an estimated false negative rate of 5.7%. LVGs were detected at a variance contribution threshold of 40% with an optimized evidence threshold of 0.8115 and estimated false discovery rate and estimated false negative rate both of 2.28%. The value sigma indicates the proportion of the total variability that is due to biological heterogeneity.Hierarchical clustering of cells was performed and visualized using the clustermap function provided by the Seaborn Python package with the Ward method of clustering. Genes were selected for use in hierarchical clustering if they were expressed (TPM ≥ 1) in at least one cell. Validation of the clusters was performed using the clusterboot function from the fpc R package. The data were resampled by bootstrapping, and Jaccard similarities were calculated between the original clusters and the most similar clusters in the resampled data. Mean Jaccard similarities were calculated from 5,000 iterations.Sequence data described in this manuscript can be accessed at Experiment ArrayExpress (accession no. E-MTAB-4825). […]

Pipeline specifications

Software tools Bowtie, SAMtools, BEDTools, BASiCS, ClusterMap, Seaborn
Databases ArrayExpress
Applications Miscellaneous, scRNA-seq analysis
Organisms Mus musculus
Diseases Asthma, Hypersensitivity, Parasitic Diseases