Computational protocol: Investigating reproducibility and tracking provenance – A genomic workflow case study

Protocol publication

[…] We have implemented a complex, end-to-end variant calling workflow based on the Genome Analysis Toolkit (GATK) [] recommended best practices, using three different exemplars of workflow definition approaches: Galaxy [], Cpipe [] and CWL []. The GATK best-practice variant discovery workflow was selected because it provides clear, community-advocated, step-by-step recommendations for variant discovery analysis of high-throughput sequencing data from human germline samples. The next section broadly discusses the categories of approaches typically followed for workflow design and implementation and justifies our choice of systems for this case study.

[...] Several automated, bioinformatics-specific pipelines such as Cpipe [], bcbio-nextgen [] and others [, ] have been developed using command-line tools to support genomic data analysis. These pipelines are driven and supported by individual laboratories, which have customized them for their own data processing; this approach has resulted in considerable variability in the methods used for data interpretation and processing. Their advantages include the ability to edit pipelines on remote servers without requiring access to a GUI, so that they are easily administered through source code management tools []. However, the command-line pipeline frameworks such as bpipe [], Snakemake [] and Ruffus [] used to develop these systems are not flexible enough to support the integration of new user-defined steps and analysis tools. Working with such systems requires command-line programming expertise and broad computational knowledge, as they rely extensively on individual scripts to tie together the different components of the pipeline. These scripts control variables, dependencies and conditional logic for efficient data processing and are therefore often difficult to reproduce. These systems also assume the provision of the same physical or virtualized infrastructure used to run the initial analysis, including scripts, test data, tools, reference data and databases. The implementation overheads of such pipelines include configuration and installation of software packages, alteration of parameter settings, debugging and input/output interfacing. In summary, considerable effort and time are required to create, understand and reproduce a ready-to-use pipeline.

[...] Cpipe belongs to the category of bioinformatics-specific prebuilt pipelines. It was deployed on the National eResearch Collaboration Tools and Resources (NeCTAR) research cloud, following the instructions on the official Cpipe GitHub page [] to set up the pipeline. The instance launched for executing Cpipe had 16 cores and 64 GB RAM. No automated mechanism is defined for documenting and conveying the compute requirements of a specific customized analysis; rather, prebuilt pipelines presume that sufficient compute power is available for data-intensive steps such as sequence alignment. To cater for the storage requirements of the pipeline, a 1000 GB volume was mounted on the cloud instance. As with the compute requirements, there is no automated mechanism for explicitly recording the storage requirements.
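The absence of a machine-readable record of these compute and storage requirements is a recurring observation in this case study. The sketch below shows one minimal way such requirements could be captured alongside an analysis; it is illustrative only and not part of Cpipe, the output file name and fields are our own assumptions, and the RAM query is Linux-specific.

```python
import json
import os
import shutil


def record_compute_requirements(outfile="compute_requirements.json", data_dir="."):
    """Capture the compute and storage context of an analysis run.

    Hypothetical provenance helper: records CPU count, physical RAM
    (via a Linux-specific sysconf call) and disk capacity of the volume
    holding the analysis data, so the requirements travel with the results.
    """
    total_ram_gb = (os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")) / 1024 ** 3
    disk = shutil.disk_usage(data_dir)
    record = {
        "cpu_cores": os.cpu_count(),
        "ram_gb": round(total_ram_gb, 1),
        "disk_total_gb": round(disk.total / 1024 ** 3, 1),
        "disk_free_gb": round(disk.free / 1024 ** 3, 1),
    }
    with open(outfile, "w") as fh:
        json.dump(record, fh, indent=2)
    return record


if __name__ == "__main__":
    print(record_compute_requirements())
```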
As genomic sequence analysis involves huge input and intermediate datasets (including whole-genome reference data), prebuilt pipelines assume that sufficient capacity is available to meet the data storage requirements.

The installation script provided with Cpipe compiled tools such as BWA and downloaded databases such as the Variant Effect Predictor (VEP) and human reference sequence files. Prebuilt pipelines connect to online resources to download and compile the tools and reference datasets used in the analysis; FTP clients and SSH transfer tools are used to move datasets across distributed resources, and the availability of high-performance networking infrastructure is assumed for moving bulk data over a wide area network (WAN).

Base software dependencies for underlying programming frameworks such as Java and Python were required to execute the tools in Cpipe. Prebuilt pipelines assume that users are responsible for resolving these base software dependencies; otherwise the pipeline fails to execute.

Cpipe requires downloading and pre-processing the reference dataset to generate secondary files, since the indexing step is not explicitly defined as part of the pipeline but is included in a separate script. Prebuilt pipelines expect users to perform such pre-processing steps and hence assume that the required input data files are available before the pipeline is executed.

Cpipe uses a copyrighted tool, ANNOVAR, for annotating variant calls. Prebuilt pipelines that deploy copyrighted or proprietary tools instead of open-source software assume that users will ensure the availability of all such licensed resources.

Cpipe requires a specific directory structure in order to execute the analysis on any sample. Because prebuilt pipelines are customized to support explicit analysis requirements, they assume a specific analysis environment with a set directory structure, in which tools and datasets are appropriately located to support seamless execution of the pipeline. Files and tools are expected to be placed according to a particular file system hierarchy, since paths are hard-coded in the scripts.
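To make the pre-processing and directory-layout assumptions above concrete, the following sketch performs a simple pre-flight check for the secondary reference files that Cpipe's separate indexing script would normally generate. It is not part of Cpipe; the reference path is hypothetical, and the expected suffixes assume the conventional outputs of samtools faidx, Picard CreateSequenceDictionary and bwa index.

```python
from pathlib import Path

# Suffixes conventionally produced by `bwa index`; adjust for your tool versions.
BWA_SUFFIXES = [".amb", ".ann", ".bwt", ".pac", ".sa"]


def check_reference(reference_fasta):
    """Report missing secondary reference files before running the pipeline.

    Illustrative pre-flight check, not part of Cpipe: it makes the implicit
    pre-processing and directory-layout assumptions explicit before the
    pipeline's hard-coded paths are exercised.
    """
    ref = Path(reference_fasta)
    expected = [
        ref,
        ref.with_suffix(ref.suffix + ".fai"),  # samtools faidx output
        ref.with_suffix(".dict"),              # Picard sequence dictionary
    ]
    expected += [ref.with_suffix(ref.suffix + s) for s in BWA_SUFFIXES]
    missing = [p for p in expected if not p.exists()]
    for p in missing:
        print(f"missing: {p}")
    return not missing


if __name__ == "__main__":
    if not check_reference("reference_data/ucsc.hg19.fasta"):  # hypothetical location
        raise SystemExit("reference pre-processing incomplete")
```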
[...] CWL aims for a standardized approach to workflow definition. A reference implementation of CWL, designed specifically for Python 2.7, was cloned and installed following the instructions in the GitHub repository []. The availability of the specific underlying language and version required by the reference implementation (Python in this case) is assumed for its successful installation and functioning.

Working with CWL was more challenging than working with Cpipe and Galaxy because it is an ongoing, constantly developing community effort, and tool wrappers for most of the tools required for this study were not available. Implementing the GATK workflow in CWL required knowledge of YAML and JavaScript Object Notation (JSON) for the development of a number of CWL definition files, including YAML tool wrappers, JSON job files containing the input parameters, and YAML test files for conformance tests (Fig. -Additional file ). It is assumed that any user wanting to use these definition files together with the workflow definition has a basic understanding of YAML and JSON. In addition, if a newer version of a tool, or a different tool, is required for any step, the user is expected to develop the corresponding definition files, which requires in-depth knowledge of the underlying languages. The standardized approaches therefore give users the freedom to declare every aspect of the workflow, but at the same time assume implicit knowledge of the underlying languages, leading to a steep learning curve for naïve users.

The workflow implementation used tools such as BWA, GATK and the Picard toolkit, which were provided through container-based Docker images including all required software packages. This step required the installation of Docker, which again was assumed to be available on the system executing the workflow.
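As a concrete illustration of the definition files described above, the sketch below writes a minimal JSON job file for a hypothetical bwa-mem.cwl tool wrapper and enacts it with the cwltool reference implementation. The wrapper name, input identifiers and file paths are assumptions made for illustration; they must match the ids declared in the actual CWL tool wrapper used.

```python
import json
import subprocess

# Hypothetical input ids; they must match the `inputs` section of the
# corresponding CWL tool wrapper (assumed here to be bwa-mem.cwl).
job = {
    "reference": {"class": "File", "path": "reference_data/ucsc.hg19.fasta"},
    "reads": [
        {"class": "File", "path": "input_data/sample_R1.fastq.gz"},
        {"class": "File", "path": "input_data/sample_R2.fastq.gz"},
    ],
    "output_name": "sample.aligned.sam",
}

# Write the job file containing the input parameters for this step.
with open("bwa-mem-job.json", "w") as fh:
    json.dump(job, fh, indent=2)

# Enact the single step with the CWL reference implementation (cwltool).
subprocess.run(["cwltool", "bwa-mem.cwl", "bwa-mem-job.json"], check=True)
```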
Although CWL encourages the use of Docker, it also allows local installation of the required tools; this should not be preferred, as it leads to localized solutions that fail to execute elsewhere. In both cases, certain assumptions are made regarding the availability of the underlying tools and their link to the tool definitions. Hence, the standardized approaches, despite their efforts to explicitly declare every step of the workflow, still assume the availability of the underlying software for the enactment of a workflow, which is not always the case. As genomic workflows usually involve working with large datasets, the availability of compute and storage resources is likewise assumed to be managed by users in order to successfully enact workflows. […]
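Because both the Docker-based and locally installed routes rest on software-availability assumptions, a short pre-flight check such as the sketch below can surface missing dependencies before enactment. The container image tag is a hypothetical example; only standard docker CLI calls (docker image inspect, docker pull) are used.

```python
import shutil
import subprocess

REQUIRED_IMAGE = "biocontainers/bwa:v0.7.17_cv1"  # hypothetical image tag


def ensure_docker_image(image=REQUIRED_IMAGE):
    """Verify that Docker is installed and the required image is present.

    Illustrative check for the software-availability assumption discussed
    above; pulls the image from the registry if it is not available locally.
    """
    if shutil.which("docker") is None:
        raise RuntimeError("Docker is assumed by the tool wrappers but is not installed")
    inspect = subprocess.run(
        ["docker", "image", "inspect", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if inspect.returncode != 0:
        subprocess.run(["docker", "pull", image], check=True)


if __name__ == "__main__":
    ensure_docker_image()
```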

Pipeline specifications

Software tools: Cpipe, Galaxy, GATK, BWA, Picard
Application: whole-exome sequencing (WES) analysis