Pipeline publication

[…] dataset. First, we focused on understanding the limitations of our approach and the effects of the design decisions made during the discovery step in select_STR_reads. To this end, we ran the pipeline without imposing any restriction on the minimum length of the extended contigs before calculating performance metrics. For the second simulation, we required that the extended contigs have a minimum total length of 500 nt and minimum non-repeat flanks of 200 nt (similar to the requirements imposed in our analyses of real datasets) before calculating performance metrics. In each of these simulations, Illumina short-read sequences were simulated using pIRS. The average coverage was varied from 5-fold to 39-fold, and BaitSTR was run with k-mer lengths between 9 and 31 bp for each coverage value. In each run, we required the flanking sequence around the STR to be at least as long as the k-mer used in that run. The extended contigs from extend_STR_reads were then aligned back to the reference genome using BLAT. The alignments were processed with an in-house custom script to calculate the mapping locations of the identified STRs on the reference sequence, and the true positives and false positives were subsequently calculated using BEDTools. False positives are defined as (i) chimeric contigs where extension resulted in incorrect local assembly, (ii) contigs incorrectly aligned back to the reference, (iii) collapsed repeats that could masquerade as polymorphic segments, or (iv) STRs detected by the pipeline that were not explicitly introduced during genome simulation. Importantly, STRs of the fourth type are legitimately present at random in a simulated genome, and so are not strictly false positives with respect to pipeline specificity.
For instance, 12 out of 4^10 (1 048 576) possible 10 nt DNA sequences are 2-mer STRs repeated five times, predicting ∼23 instances of random 10 nt 2-mer STRs (95% CI: 14–33) in a 2 Mb random sequenc […]
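
The arithmetic behind this estimate can be checked with a short sketch. It assumes a uniformly random genome with independent bases, counts every 10 nt window as a possible start position, and uses a Poisson approximation for the interval; the excerpt does not state how the authors derived their confidence interval, so the Poisson step and all names below are illustrative assumptions rather than the published method.

```python
from itertools import product
from math import exp

BASES = "ACGT"
WINDOW = 10              # STR length in nt (2-mer motif repeated five times)
GENOME_LEN = 2_000_000   # 2 Mb random sequence, as in the text

# Dinucleotide motifs made of two different bases; AA, CC, GG and TT are
# homopolymers and are excluded, which leaves 16 - 4 = 12 motifs.
motifs = ["".join(p) for p in product(BASES, repeat=2) if p[0] != p[1]]
str_10mers = {m * 5 for m in motifs}

# Probability that a random 10 nt window is one of those sequences.
p_hit = len(str_10mers) / 4 ** WINDOW

# Expected number of matching windows in the genome.
n_windows = GENOME_LEN - WINDOW + 1
expected = n_windows * p_hit


def poisson_quantile(mean, q):
    """Return the smallest k with P(X <= k) >= q for X ~ Poisson(mean)."""
    k = 0
    pmf = exp(-mean)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= mean / k
        cdf += pmf
    return k


# Assumption: treat the count as Poisson and take the central 95% interval.
lo = poisson_quantile(expected, 0.025)
hi = poisson_quantile(expected, 0.975)

print(f"{len(str_10mers)} of {4 ** WINDOW} 10-mers are 2-mer STRs")
print(f"expected random instances in 2 Mb: {expected:.1f}")
print(f"approximate 95% interval: {lo} to {hi}")
```

Under these assumptions the expected count comes out close to the ∼23 quoted above. Overlapping windows are not strictly independent, so the Poisson interval is only an approximation.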

Pipeline specifications

Software tools: pIRS, BLAT, BEDTools