Pipeline publication

[…] hey are slow to parse, and they do not support random access to subsets of the reads, a feature that is critical to support parallelization of the alignment of the reads to a reference genome., Most analyses require aligning read data to a reference genome, a process that yields HTS alignment data. When computed, HTS alignment data has traditionally been stored in a variety of file formats. Early text formats were quickly abandoned in favor of binary formats, and among these, the BAM (Binary Alignment/Map) format has become very popular and is now widely used . A key problem with BAM is that the BAM format cannot be seamlessly adapted to support new applications. For instance, developers of TopHat fusion were not able to extend the BAM format to store information about gene fusions, and instead had to create a variant of the SAM text format (http://tophat.cbcb.umd.edu/fusion_manual.html#output). Developers of new analysis software based on BAM cannot seamlessly extend BAM for new applications because various programs that read/write BAM, developed worldwide, would need to be manually modified for each change to the specification, a process that is all but practical. Another key weakness of the BAM format is that BAM files require approximately the same amount of storage as the unaligned reads, for each alignment represented in the format., The CRAM format, developed for the European Nucleotide Archive, was developed to try and compress HTS alignments better than can be achieved with BAM. CRAM achieves strong compression of alignment files using custom developed compression approaches parameterized on characteristics of simulated HTS alignment data . A key innovation of CRAM was the recognition that different applications need to preserve different subsets of the data contained in a BAM file. Preserving different subsets of the information can yield substantial storage savings for these applications that do not require all the data. However, CRAM shares a key weakness with BAM. Namely, it is unable to seamlessly support changes to the data format. Changes to extend the file format require manual coding and careful design of custom compression approaches for the new data to be stored. An additional problem is that CRAM cannot compress HTS alignments when they are not already sorted by genomic position. This is a significant problem because alignments are first determined in read order before they can they be sorted by genomic position. This problem limits the usefulness of the CRAM format to HTS archives, and prevents its use as a full replacement of the BAM format. We believe that these weaknesses are serious drawbacks because the HTS field is progressing very rapidly, sequencing throughput increases exponentially and new experimental advances often require extensions to the data schemas used to store and analyze the new types of data., In summary, current approaches are unable to strongly compress HTS data while supporting the full life-cycle of the data, from storage of sequenced reads to parallel processing of the reads and alignments during data analysis to archiving of study results. In this manuscript, we present a comprehensive approach that addresses thes […]

Pipeline specifications

Software tools TopHat-Fusion, TopHat, CRAM