Computational protocol: TIGER: tiled iterative genome assembler

Similar protocols

Protocol publication

[…] Based on aforementioned ideas, we developed "Tiled Iterative GEnome assembleR," or Tiger. Here "tile" is a synonym of "set" or "cluster", representing the tiled computation nature in the assembly process. The conceptual flow is illustrated in Figure .Step 1. Reads partitioning: We first partition the input reads into multiple read tiles. In our current implementation, the input reads are randomly partitioned evenly into N subsets, which can be determined by users based on the available resources and the total size of the input reads.Step 2. Read assembly: Read tiles are assembled individually, using an off-the-shelf assembler, such as Velvet, embedded into Tiger. Depending on the available system memory, the assembly of read tiles can be done independently in serial or in parallel on a shared or distributed memory computer cluster. There is no communication between the assemblies of different read tiles.For the embedded assembler requiring specifying a k-mer size, k-mer sizes are decided either manually by users or automatically through the auto-k-mer scheme in Tiger. For the manual k-mer designation, a k-mer size is used in all read tile assemblies for all Tiger iterations. Otherwise, the auto-k-mer scheme randomly picks k-mer sizes within a given range and records the best k-mer size in the assembly results. The best k-mer size and the randomly picked ones will be considered in the subsequent assemblies. User-specified k-mer size can be introduced into this k-mer history database but may not be used again if the first attempt is not good. The number of used reads in the assembly, the total length of the contigs, and the resultant N50s are used to evaluate whether a k-mer size can help produce the best result without knowing the target genome. This avoids the problem of picking a contig set with high N50 and low coverage and enables Tiger to find a good direction in the iterative process and to converge to high quality results.Since Step 2 is the first time to assemble the initial read tiles, the contigs can be short and may cause long running time in the later iterations. We address this issue by merging the contig sets and feed the merged contig set to Velvet with the LONGSEQUENCE flag enabled. Velvet may further elongate the contigs by treating the input contigs as long reads. The new contig set is used when it is better than the merged contig set. The output contig set is scaffolded by SSPACE []. The scaffolded contig set is the input to Step 3. The purpose of this scaffolding process is to leverage paired-end information to bridge contigs which may be from different assemblies. This is beneficial for better clustered contigs at Step 3. The scaffolding process also helps resolve duplicated contigs from different assemblies.Step 3. Contig clustering: The overall contig clustering algorithm is depicted in Figure . A graph that models the contig connectivity intensity is built from the merged contig set. This graph is called the contig connectivity graph. Graph vertices are the contigs. Vertex weights are contig lengths. Edge weights are defined based on the contig overlapping degree with each other. Note that the contig connectivity graph is much smaller than the DBG so it uses much smaller amount of memory. We then apply a graph-partitioning tool, METIS [], to partition the graph into contig clusters. METIS is known to be fast and memory-efficient in processing millions of graph vertexes.The contig lengths (vertex weights) are given less importance than the contig overlapping degrees (edge weights) in the graph partitioning process. This is because we want the partitioned contig connectivity sub-graphs to be more edge-oriented instead of being vertex-oriented. But we still need to consider the vertex weights for the situations where there exist many short contigs with little connectivity in-between. This is very common for the assembly results in the first few iterations on assembling randomly partitioned read tiles. These short contigs ought to be distributed to all clusters evenly. This not only preserves their existence in the following Tiger iterations but also reduces their influence on the rest of the clustered contigs.By focusing graph partitioning on edge intensity, overlapping contigs will be grouped together and would be rebuilt as one long complete contig later at Step 5. These contigs are used to produce well-clustered reads in the read clustering process at Step 4. That is, this contig clustering step makes crucial contribution to the quality of results of the later steps.Building of a contig connectivity graph can be time-consuming if a traditional sequence alignment method is used, like the Smith-Waterman algorithm []. Since the degree of overlap between contigs need not be determined exactly for our purposes, we apply a heuristic algorithm based on the image recognition algorithm using vocabulary trees in [], with inverse document frequency scoring dropped. We begin by extracting consecutive sequences of equal length (called words) from each of the contigs in the set. The extracted words are used to build a map (or inverted file) from the words to the contigs containing them. A contig connectivity graph is then built from the map with the edge weights being set to the number of words in common between the vertices (contigs) connected by that edge. Since the connectivity graph stores only the contig connectivity information, the memory usage of this step is much lower than that at the read assembly step. Regarding runtime, building a connectivity graph dominates the whole step. Building the word map can be done in parallel but we leave it as future work. Overall the runtime is still much shorter than Step 4 and 5.Step 4. Read clustering: The entire input read set is aligned to the contig sets from Step 3. The reads having high similarity to each contig set are collected as one read cluster. Each read is collected once for each cluster. This means a read can appear in multiple read clusters if similar contigs are not clustered together. This process guarantees any read potentially contributing to a contig set will be collected. The read-to-contig alignment is done by Bowtie []. For paired-end reads, if one of a read pair aligns to the given contig cluster, both reads are collected. This step provides the opportunity to extend and/or bridge the contigs. This clustering process can be done in parallel or in serial on a shared or distributed memory computer cluster. No communication is needed between read tiles. The required memory is also proportional to the size of a read tile.Step 5. Read assembly: We assemble the read tiles, the same as we do at Step 2. But the assembly of the merged contigs from all read tile assemblies may not be performed. If the assembly of the merged contigs is not improving, it is skipped in later iterations to save time. Based on our experience, we found it is useful to have this additional assembly in the first few iterations.Step 6. Post processing: If we have reached the given number of iterations, we will just exit. Otherwise, go to Step 3. Step 3, 4, and 5 form an iterative process.To sum up, the rationale behind our framework is that the improvement on contig quality of the current iteration can be carried over to the next iteration through more accurate read clustering. An optimal clustering solution will be achieved if only reads contributing to a contig are clustered for assembling the contig. This approach differentiates our algorithm from the previous work and provides our framework the capability of improving an existing contig set further. […]

Pipeline specifications

Software tools Velvet, SSPACE, Bowtie
Application De novo sequencing analysis
Organisms Panthera tigris, Homo sapiens