Computational protocol: Satellite DNA: An Evolving Topic

Protocol publication

[…] The existence of repetitive DNAs in the genomes of eukaryotes was first unveiled in 1961 by Kit et al. [] and by Sueoka et al. []. These pioneering works revealed that genomic DNA from mouse separates into two buoyant-density bands in cesium chloride density-gradient ultracentrifugation. One of these two bands, the minor component, representing about 10% of the genome, was called satellite DNA. The DNA of this band renatured faster than the rest of the nuclear mouse DNA, and faster than the simplest genomes composed mostly of unique sequences, demonstrating that the satellite peak of the mouse genome was composed of repeated nucleotide sequences [,,,]. Britten and colleagues [,,,] developed Cot analysis, based on the principles of DNA renaturation kinetics, according to which a DNA sequence reassociates at a rate directly proportional to the number of times it occurs in the genome. They demonstrated that a variable proportion of every eukaryotic genome is composed of repetitive elements and that eukaryotic genome sequences can be divided into highly repetitive, moderately repetitive, or single-copy classes according to their reiteration frequency. Cot analysis thus represented a powerful tool by which highly repetitive, moderately repetitive, and single/low-copy DNA can be selectively and efficiently fractionated, cloned, and characterized in terms of complexity, composition, and abundance []. These two methods dominated the analysis of repeated DNA sequences during the 1960s and 1970s []. They were later superseded by the isolation of satDNA through restriction endonuclease digestion of genomic DNA. This method simplified and popularized the study of satDNA, helped to separate satDNA families from other repetitive sequences, and aided in uncovering cryptic satDNAs []. 
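The relationship between copy number and reassociation rate that underlies Cot analysis can be illustrated with a minimal sketch. It assumes the ideal second-order renaturation form C/C0 = 1/(1 + k·C0t), which is the textbook expression and is not spelled out in the text above; the rate constants and copy number are hypothetical values chosen only for illustration.

```python
def fraction_single_stranded(c0t, k):
    """Ideal second-order renaturation: C/C0 = 1 / (1 + k * C0t)."""
    return 1.0 / (1.0 + k * c0t)


def cot_half(k):
    """C0t value at which half of the DNA has reassociated (C/C0 = 0.5)."""
    return 1.0 / k


# Hypothetical values: a single-copy rate constant and a repeat present
# 1000 times. Following the proportionality stated in the text, the repeat
# reassociates 1000 times faster than a single-copy sequence.
k_single_copy = 0.001
copies = 1000
k_repeat = k_single_copy * copies

for label, k in (("single-copy", k_single_copy), ("repeat", k_repeat)):
    print(label, "Cot1/2 =", cot_half(k))
```

Under these assumptions the repeated fraction reaches half-reassociation at a C0t value 1000 times lower than the single-copy fraction, which is why the classes can be fractionated by stopping renaturation at different C0t values.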
Following digestion of genomic DNA with a restriction enzyme and electrophoresis of the resulting DNA fragments on agarose gels, satDNA sequences are revealed as a prominent band against the background smear []. In this way, the individual members of a repeated set of sequences become available for cloning after the prominent band is excised from the agarose gel, the agarose slice is melted, and the DNA is purified (). DNA fragments isolated from agarose gels in this way have been the source of numerous studies of satDNA families from a large number of eukaryotic species. Quantification and global genomic organization can be addressed, at least roughly, by dot-blot and Southern blot hybridization techniques (). In dot-blot analyses, defined amounts of total genomic DNA and defined amounts of the unlabeled probe (repeat units of the satellite DNA family) are denatured, immobilized on nylon, and hybridized with the labeled probe. The relative amount of the satDNA family in the genome is then estimated by comparing densitometric scans of the hybridization signals in genomic DNAs with those obtained for the reconstruction standards. Dot-blot hybridization was also commonly used to detect satDNA families of one species within the genomes of related species, which even permitted an applied use of the technique: establishing phylogenetic relationships by the cladistic association of species in monophyletic groups. The organization of a repetitive DNA family in a genome may be analyzed by Southern blot hybridization (). Three typical patterns can be observed after hybridization [,]. Southern blot hybridization patterns give a rough indication of sequence variation within and between species and contain certain useful phylogenetic signals []. In addition to Southern blot hybridization, the development of in situ hybridization techniques was a breakthrough in the characterization of a satDNA (). 
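The dot-blot quantification described above amounts to reading a genomic signal off a standard curve built from the reconstruction standards. The following is a minimal sketch under the assumption that signal is linear in the amount of satDNA over the range used; the function names and all numeric values are hypothetical and do not come from any cited study.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def satdna_fraction(standard_ng, standard_signal, sample_signal, genomic_ng):
    """Estimate the satDNA fraction of a genomic DNA dot.

    The standards (known ng of unlabeled repeat) define the curve; the
    genomic signal is converted back to ng of satDNA and divided by the
    amount of genomic DNA loaded.
    """
    slope, intercept = fit_line(standard_ng, standard_signal)
    sat_ng = (sample_signal - intercept) / slope
    return sat_ng / genomic_ng


# Hypothetical densitometric readings (arbitrary units).
standards_ng = [1.0, 2.0, 4.0, 8.0]
standards_signal = [100.0, 200.0, 400.0, 800.0]
fraction = satdna_fraction(standards_ng, standards_signal,
                           sample_signal=500.0, genomic_ng=50.0)
print(f"estimated satDNA fraction: {fraction:.2%}")
```

In practice several genomic dilutions would be spotted and only signals falling within the range of the standards would be used, since densitometric response saturates at high loads.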
First attempts using radiolabeled probes were promising [] and revealed that satellite sequences were located in the heterochromatin, in this case the centromeric heterochromatin of mouse chromosomes. Nevertheless, the real revolution in, and popularization of, this technique came with nonradioactive methods [] and above all with the advent of fluorescent in situ hybridization (FISH), a powerful tool that can combine several probes labeled with different fluorochromes [,]. The sequence of a satDNA family can be obtained from the entire set of monomeric units purified from excised, melted agarose gel slices. Unambiguous consensus sequences are frequently obtained with uncloned sets [,,]. Although this method allowed a rough estimate of satDNA variation [,], the sequencing of cloned members of the set has been the common procedure during the last four decades. The advent and advantages of the polymerase chain reaction (PCR) extended its use to the isolation of repeats of a satDNA family from one species or from several related species, either with a single primer pair or with a combination of several primer pairs, in order to uncover the whole range of sequence variants found in each genome []. Sequencing of satDNA repeats obtained by both cloning and PCR, combined with Southern blot and in situ hybridization, has provided a wealth of information about the organization and location of satDNA repeats, their length and copy number, their variability, the functional role of satDNA sequences, and satDNA evolution, as well as its (moderate) use in phylogenetic analysis and in taxonomic studies [,,,,,,]. Meanwhile, for decades, satDNA sequences were left out of the great genome projects because of the difficulties that arose in assembling contigs containing repeat sequences. 
A genomic perspective on the isolation and analysis of satDNA repeats would overcome the bias of sequence data obtained by cloning or PCR. For example, many monomers escape cloning when they are isolated from prominent bands excised from agarose gels, because those bands contain only the monomer subset obtained after complete digestion. In addition, this procedure fails to clone the multimers resulting from undigested repeats that lack the site for the restriction enzyme used. This bias in the isolation procedure is carried over into subsequent PCR experiments that use primers designed from the information gathered in this way. Furthermore, there are added difficulties when a satDNA family is poorly represented in a genome. Obstacles also arise when dealing with species that have a large genome or small quantities of satDNA. Further, one may ask how many satDNA families are present within a genome. Some of these families may be overlooked in a routine examination that relies on restriction enzymes for their identification. In this context, the integration of Cot analysis, DNA cloning, and high-throughput sequencing was proposed as an attractive methodology that facilitates genome characterization []. Thus, satDNA analysis has found an ally in high-throughput sequencing of genomes using Next-Generation Sequencing (NGS). NGS and high-throughput in silico analysis of the information contained in NGS reads have transformed the study of repetitive DNA [,]. An efficient pipeline called RepeatExplorer [,] has been developed that allows the de novo identification of repetitive DNA families in species lacking a reference genome [,,,,]. RepeatExplorer follows a similarity-based read clustering approach that detects repetitive sequences as groups of frequently overlapping sequence reads in all-to-all read comparisons [,,,,]. 
The clustering procedure employs graph-based methods that transform read similarities into a virtual graph, where reads are represented as nodes and their similarities as edges connecting the nodes. The identification of communities of densely connected nodes allows for the identification of various families of repetitive DNA sequences (). The reads within the sequence clusters can be assembled to generate contigs that represent the repeats they contain [,,,,]. The combination of NGS and computer analysis favours an in-depth global analysis of the repetitive content of genomes and gives us the opportunity to uncover satDNA families that eluded isolation by other methods [,,,,,,,,]. Building on RepeatExplorer, Ruiz-Ruano et al. [] have implemented a bioinformatic toolkit (satMiner) that allows for the identification of satDNA families that are extremely rare in the genome. This pipeline consists of several rounds of RepeatExplorer clustering, separated by filtering out the reads that contain already known satellites, thus increasing the likelihood of finding new rare satellite families. The method is highly reliable for species with a large genome or with small quantities of satDNA. In a further improvement of RepeatExplorer, Novak et al. [] have developed Tandem Repeat Analyzer (TAREAN), a computational pipeline for the unsupervised identification of satellite repeats from unassembled sequence reads. The pipeline uses low-pass whole-genome sequence reads and performs their graph-based clustering. The resulting clusters, representing all types of repeats, are then examined for the presence of circular structures characteristic of tandem repeats. 
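The graph-based idea described above (reads as nodes, similarities as edges, clusters as groups of connected reads) can be sketched in a few lines. This is a deliberately simplified toy and not RepeatExplorer's implementation: similarity is approximated here by the number of shared k-mers rather than by read alignment, and clusters are plain connected components rather than communities of densely connected nodes. All function names, parameters, and sequences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Set of all k-mers contained in a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_graph(reads, k=8, min_shared=3):
    """Connect two reads (nodes) when they share at least min_shared k-mers."""
    kmer_sets = [kmers(r, k) for r in reads]
    graph = defaultdict(set)
    for i, j in combinations(range(len(reads)), 2):
        if len(kmer_sets[i] & kmer_sets[j]) >= min_shared:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(n_reads, graph):
    """Group read indices into clusters, largest cluster first."""
    seen, clusters = set(), []
    for start in range(n_reads):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return sorted(clusters, key=len, reverse=True)


# Toy data: four reads sampled from a tandem array of a hypothetical
# 14-bp monomer, plus two unrelated single-copy reads.
array = "ACGTTGCAAGGCTA" * 5
reads = [array[o:o + 24] for o in (0, 5, 9, 17)]
reads += ["CCATGGATCCGATTACAGGTCAAT", "TGCATCGGATATCCGCGTAATGCC"]

clusters = connected_components(len(reads), build_graph(reads))
print(clusters)
```

Reads derived from the tandem array fall into one cluster because every read overlaps every other, while the single-copy reads remain isolated. The all-against-all comparison is quadratic in the number of reads, which is acceptable only for a toy example of this size.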
Reads from these clusters are then decomposed into k-mers, and fractions of the most frequent k-mers are used for reconstructing representative monomer sequences for each satellite repeat []. RepeatExplorer, satMiner, and TAREAN are being extensively used in the analysis of the repetitive DNA content of many plant and some animal species [,,]. The development of all these computer tools is opening new opportunities to uncover the core details of satDNA evolution and to gain insights into the different repeat families making up a given genome, their relative abundance and variability, as well as their roles in different genetic and genomic processes [,]. Further, this new perspective greatly contributes to the development of comparative genomics and of phylogenomics [,]. However, graph-based approaches are not the only ones. Wei et al. [] have developed a method called k-Seek that analyzes unassembled Illumina sequence reads to identify and quantify short tandemly repeating sequences (k-mers) of 2–10 bp, repeat lengths that are usual among satDNAs in Drosophila. While a reference genome is not needed for the RepeatExplorer, satMiner, or TAREAN pipelines, or for k-Seek, the existence of a reference genome has facilitated the development of other computer programs for the analysis of the satDNA content of model species such as Caenorhabditis elegans or Tribolium castaneum. In fact, there are up to 25 conventional programs designed to retrieve tandem repeats (usually short tandem repeats) from complete or nearly complete genomes, which were not intended for processing the billions of short reads generated by Illumina or 454 sequencing in a practical amount of time [,]. Even so, Subirana and Messeguer [] have recently developed SATFIND and used it for the identification and analysis of the satellite families in Caenorhabditis []. Also, Pavlek et al. [] have revived the use of the Tandem Repeat Finder (TRF) algorithm [] through the Tandem Repeats Database (TRDB) [] for the identification of satDNA families of Tribolium castaneum. […]
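The read-level detection of short tandem repeats of 2–10 bp, as attributed to k-Seek above, can be illustrated with a small sketch. It is not the k-Seek algorithm itself: it is a naive periodicity scan written for illustration, it collapses rotations of the same unit but does not handle reverse complements or sequencing errors, and the function names, the `min_copies` threshold, and the example reads are all hypothetical.

```python
from collections import Counter


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d)
                   for d in range(1, n))


def canonical_rotation(unit):
    """Lexicographically smallest rotation, so ACG, CGA and GAC collapse."""
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def find_tandem_repeats(read, min_k=2, max_k=10, min_copies=3):
    """Return (unit, copies, start) for perfect tandem runs in one read."""
    hits = []
    for k in range(min_k, max_k + 1):
        i = 0
        while i + k <= len(read):
            j = i + k
            # Extend the run while the sequence stays periodic with period k.
            while j < len(read) and read[j] == read[j - k]:
                j += 1
            copies = (j - i) // k
            unit = read[i:i + k]
            if copies >= min_copies and is_primitive(unit):
                hits.append((canonical_rotation(unit), copies, i))
                i = j  # continue after the run
            else:
                i += 1
    return hits


def count_repeat_units(reads):
    """Total number of tandem copies per canonical unit across all reads."""
    totals = Counter()
    for read in reads:
        for unit, copies, _start in find_tandem_repeats(read):
            totals[unit] += copies
    return totals


example_reads = ["GGCT" + "AACAG" * 4 + "TTGA", "AC" * 8]
print(count_repeat_units(example_reads))
```

Skipping non-primitive units prevents a dinucleotide run such as (AC)n from also being reported as (ACAC)n, and the canonical rotation keeps reads that start at different phases of the same repeat from being counted as different units.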

Pipeline specifications