Computational protocol: A cloud-based workflow to quantify transcript-expression levels in public cancer compendia

Protocol publication

[…] We used the Google Genomics Pipeline service to control execution of tasks on preemptible computing nodes. When a preemptible node became available, the service (1) created a VM on the node, (2) deployed the relevant software container to the VM, (3) copied data files to the secondary disk drive, (4) executed the software container, (5) copied output files to persistent storage, and (6) destroyed the VM after the sample finished processing (or was preempted). To facilitate this process, we used a software framework provided by the Institute for Systems Biology. Via a command-line interface, this framework streamlined the submission of samples to the Google Genomics Pipeline for processing. In addition, the framework monitored each sample’s status at 60-second intervals and resubmitted samples that had been preempted. [...]

We created three different Docker containers to house the software required to process each combination of data source and cloud configuration. The first container was used by the Google Cluster Engine to process BAM files from CCLE. The cluster-based configuration required us to use the container to copy input files to and from the computing nodes. After copying the files, the container used Sambamba (version 0.6.0) to sort the BAM files by name and then Picard Tools (version 2.1.1, SamToFastq module) to convert the BAM files to FASTQ format. In accordance with kallisto’s documentation, we used the “OUTPUT_PER_RG” flag in Picard Tools to ensure that paired-end reads were placed in separate output files. The FASTQ files were then used as input to kallisto (version 0.43.0), which pseudoaligned the reads to the GENCODE reference transcriptome (version 24) and quantified transcript-expression levels. Based on the kallisto authors’ recommendation, we used 30 bootstrap samples to estimate technical variance; we also used the “--bias” flag to correct for sequence-specific bias.
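The first container’s per-sample steps can be sketched as follows. This is an illustrative outline only, not the protocol’s actual script: the file names, thread count, output paths, and index name are placeholder assumptions.

```shell
# Sketch of the first container's per-sample workflow (placeholder file names).
THREADS=8

# Sort the BAM by read name with Sambamba so that paired mates are adjacent.
sambamba sort -t "$THREADS" -n -o sample.namesorted.bam sample.bam

# Convert BAM to FASTQ with Picard's SamToFastq module; OUTPUT_PER_RG writes
# separate files per read group, keeping paired-end mates in distinct files.
java -jar picard.jar SamToFastq \
    INPUT=sample.namesorted.bam \
    OUTPUT_PER_RG=true \
    OUTPUT_DIR=fastq/

# Build (once) a kallisto index from the GENCODE v24 transcriptome FASTA.
kallisto index -i gencode_v24.idx gencode.v24.transcripts.fa.gz

# Pseudoalign and quantify with 30 bootstrap samples and bias correction.
kallisto quant -i gencode_v24.idx -t "$THREADS" -b 30 --bias \
    -o output/ fastq/sample_1.fastq fastq/sample_2.fastq
```

The `-t` flags shown here correspond to the parallelization features mentioned below; actual thread counts would depend on the VM’s vCPU allocation.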
We used the parallelization features in Sambamba and kallisto to enable faster processing.

The second software container was similar to the first but was modified for use with the Google Genomics Pipeline service. Because this service handles copying data files between Google Cloud Storage and the computing nodes, these tasks were not performed by the container. We also added a read-trimming step using Trim Galore! (version 0.4.1), a wrapper around Cutadapt (version 1.10); this tool trims adapter sequences and low-quality bases and reads. To process multiple FASTQ files (or pairs of FASTQ files for paired-end reads) in parallel, we used GNU Parallel (version 20141022).

The third software container was designed specifically for the TCGA data. It extracted FASTQ files from a tar archive (whether compressed or not), performed quality trimming, and executed kallisto. Where applicable, it used the pigz tool (version 2.3.1) to decompress the input files in parallel.

All three containers used the sar module of the sysstat program (version 11.2.0) to log each machine’s vCPU, memory, disk, and network activity throughout data processing. The containers copied these data to persistent storage prior to the job’s completion. We converted the timestamp of each log entry to a percentage of total job time, so that the activity metrics could be summarized consistently across all jobs. […]
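The timestamp-normalization step described above can be sketched as follows. This is a minimal illustration, not the protocol’s actual code: the log representation and function name are assumptions, with each sar entry modeled as a (timestamp, metrics) pair.

```python
from datetime import datetime

def normalize_log_times(entries, time_format="%H:%M:%S"):
    """Convert each log entry's timestamp to a percentage of total job time.

    `entries` is a chronologically ordered list of (timestamp_string, metrics)
    tuples; returns a list of (percent_of_job_time, metrics) tuples.
    """
    times = [datetime.strptime(ts, time_format) for ts, _ in entries]
    start, end = times[0], times[-1]
    total_seconds = (end - start).total_seconds()
    if total_seconds == 0:  # degenerate single-sample log: place it at 0%
        return [(0.0, metrics) for _, metrics in entries]
    return [
        (100.0 * (t - start).total_seconds() / total_seconds, metrics)
        for t, (_, metrics) in zip(times, entries)
    ]

# Example: three sar samples spanning a 100-second job
log = [("12:00:00", {"cpu": 10}),
       ("12:00:50", {"cpu": 90}),
       ("12:01:40", {"cpu": 20})]
print(normalize_log_times(log))
# → [(0.0, {'cpu': 10}), (50.0, {'cpu': 90}), (100.0, {'cpu': 20})]
```

Expressing each entry as a percentage of elapsed job time lets activity profiles from jobs of very different durations be aligned and averaged on a common 0–100% axis.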

Pipeline specifications