Computational protocol: Repetitive part of the banana (Musa acuminata) genome investigated by low-depth 454 sequencing

Protocol publication

[…] Following removal of linker/primer contamination and artificially duplicated reads, the remaining 477,699 reads (average length of 206 nucleotides) were used for repeat analysis. The analysis was performed as described by Macas et al. (2007), employing TGICL and a set of custom-made BioPerl scripts for similarity-based clustering and assembly of reads. The clustering parameters used by the tclust program (part of TGICL) were set to consider the pairwise similarity of two reads significant if it involved an overlap of at least 150 nucleotides with 90% or better similarity, representing at least 55% and 70% of the length of the longer and shorter read, respectively (OVL = 150, PID = 90, LCOV = 55, SCOV = 70). The reads within individual clusters were assembled into contigs using TGICL run with the -O '-p 80 -o 40' parameters, specifying the overlap percent identity and minimal overlap length cutoffs for the cap3 assembler. Repeat type identification was done using blastn and blastx sequence-similarity searches of assembled contigs against GenBank, and by detection of conserved protein domains using RPS-BLAST. Tandem repeats within contig sequences were identified using dotter. The classification of LTR retrotransposons into distinct lineages and clades was done using phylogenetic analyses of their RT sequences. Alignment of RT sequences was carried out with ClustalX and the phylogenetic trees were calculated using the neighbour-joining method. The trees were drawn and edited using the FigTree program.
Microsatellite sequences were identified using the Tandem Repeats Finder and TRAP programs, while a BioPerl script was used to identify ISBP loci. Identification and classification of repetitive sequences within BAC clones were done via the PROFREP web server, utilizing repeat-specific databases of 454 reads prepared in this study.
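The tclust similarity criterion quoted above (OVL = 150, PID = 90, LCOV = 55, SCOV = 70) can be illustrated with a minimal Python sketch. It is not the tclust implementation: the `Overlap` record, its field names, and the `is_significant` function are hypothetical, and coverage is assumed here to be simply the overlap length divided by the read length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Overlap:
    """One pairwise read overlap (hypothetical record, not a tclust type)."""
    overlap_len: int         # length of the overlap in nucleotides
    percent_identity: float  # similarity over the overlap, in percent
    len_a: int               # length of the first read
    len_b: int               # length of the second read


# Thresholds quoted in the protocol text.
OVL = 150    # minimal overlap length (nt)
PID = 90.0   # minimal percent identity
LCOV = 55.0  # minimal coverage of the longer read (%)
SCOV = 70.0  # minimal coverage of the shorter read (%)


def is_significant(hit: Overlap) -> bool:
    """Return True if the overlap passes all four thresholds.

    Assumption: coverage = overlap length / read length, which is a
    simplification of how a real overlap detector would report it.
    """
    longer = max(hit.len_a, hit.len_b)
    shorter = min(hit.len_a, hit.len_b)
    cov_longer = 100.0 * hit.overlap_len / longer
    cov_shorter = 100.0 * hit.overlap_len / shorter
    return (
        hit.overlap_len >= OVL
        and hit.percent_identity >= PID
        and cov_longer >= LCOV
        and cov_shorter >= SCOV
    )
```

For example, a 160-nt overlap at 92% identity between reads of 206 and 200 nt passes all four tests, whereas a 150-nt overlap between reads of 300 and 200 nt fails because it covers only 50% of the longer read.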
The PROFREP server performs BLAST-based searches against databases of whole-genome or repeat-specific 454 reads and generates plots of similarity hits along the query sequence (the number of hits is proportional to the copy number of the query in the genome). […]
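The idea of a hit profile along a query sequence can be sketched as follows. This is an illustration only, not the PROFREP code: the function name `hit_profile` and the input format (1-based, inclusive query coordinates of each similarity hit, for instance taken from tabular BLAST output) are assumptions.

```python
from typing import Iterable, List, Tuple


def hit_profile(query_len: int, hits: Iterable[Tuple[int, int]]) -> List[int]:
    """Count how many hits cover each position of the query.

    `hits` holds (start, end) query coordinates, 1-based and inclusive.
    Returns a list of length `query_len`; element i is the hit depth at
    query position i + 1.
    """
    # Difference array: +1 where a hit starts, -1 just after it ends.
    diff = [0] * (query_len + 1)
    for start, end in hits:
        if start > end:
            start, end = end, start  # tolerate reversed coordinates
        start = max(start, 1)        # clip to the query boundaries
        end = min(end, query_len)
        if start > end:
            continue                 # hit lies entirely outside the query
        diff[start - 1] += 1
        diff[end] -= 1

    # The running sum of the difference array gives per-position depth.
    profile: List[int] = []
    depth = 0
    for delta in diff[:query_len]:
        depth += delta
        profile.append(depth)
    return profile
```

Plotting the returned list against position gives a curve whose height at each point reflects how many reads matched that part of the query.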

Pipeline specifications

Software tools: TGICL, BioPerl, BLASTN, BLASTX, Dotter, Clustal W, FigTree, TRF
Applications: Phylogenetics, Transcription analysis, Genome data visualization
Organisms: Musa acuminata, Homo sapiens