Computational protocol: GenomeLandscaper: Landscape analysis of genome-fingerprints maps assessing chromosome architecture

Protocol publication

[…] The weighted distance matrix (Fig. ) can be transformed into a phylogenetic tree (Fig. ) and a phylogenetic network (Fig. ) by conventional software. We construct a phylogenetic tree (Fig. , left) from our weighted distance matrix using FastME software, and a bootstrap consensus tree (Fig. , right) using the MEGA6 package under the minimum-evolution (ME) model. The two unrooted trees closely resemble one another, indicating that our alignment-free and bootstrap-free method gives an adequate approximation (Fig. ). Such an approximation is advantageous when analysing many large genomes (e.g., megabase-scale, Mbp) and divergent sequences (e.g., differing in size, gaps, and divergence). If necessary, our method can be applied to each set of resampled sequences created by traditional bootstrap approaches; however, traditional bootstrapping may not work for large genomes and divergent sequences because of algorithmic and computational constraints. [...]
We re-locate the 1.30-Mbp target segment (Supplementary Dataset ) against the newest assembly GRCh38p10 (GCF_000001405.36). The UCSC Human BLAT analysis indicates a match in the centromeric region (Fig. ). The NCBI Genome Data Viewer (Fig. ) shows that the segment maps to a broad region of GRCh38p10.chrY comprising three blocks (Fig. ): Block I (227.10 Kbp, from 10,316,945 bp to 10,544,039 bp), Block II (100.15 Kbp, from 10,594,040 bp to 10,694,192 bp), and Block III (848.71 Kbp, from 10,744,193 bp to 11,592,902 bp) (Fig. ). These three blocks constitute the assigned centromeric and pericentromeric region (1.18 Mbp of sequence in total, lying between 10,316,945 bp and 11,592,902 bp) of GRCh38p10.chrY (Fig. ), which roughly coincides with the assigned gap placeholder (3.04 Mbp, from 10,248,904 bp to 13,291,760 bp) of GRCh37p13.chrY.
Figure 5
To justify our findings (Fig. ), we survey the literature but find no document describing how such a megabase-sized target segment was assembled step by step.
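The block sizes quoted above can be checked directly from the coordinates. The following is a minimal sketch, assuming the coordinates are 1-based and inclusive (an assumption, although it reproduces the quoted sizes); the names `BLOCKS`, `GRCH37_GAP`, and `span_bp` are illustrative and do not come from the original protocol.

```python
# Coordinates on GRCh38p10.chrY as quoted in the text.
# Assumption: intervals are 1-based and inclusive at both ends.
BLOCKS = {
    "Block I": (10_316_945, 10_544_039),
    "Block II": (10_594_040, 10_694_192),
    "Block III": (10_744_193, 11_592_902),
}
# Gap placeholder on GRCh37p13.chrY, also as quoted in the text.
GRCH37_GAP = (10_248_904, 13_291_760)


def span_bp(start, end):
    """Length in bp of an inclusive interval [start, end]."""
    return end - start + 1


# Length of each block and the total sequence they contain.
block_lengths = {name: span_bp(*iv) for name, iv in BLOCKS.items()}
total_block_bp = sum(block_lengths.values())

# Unsequenced gaps between consecutive blocks.
ordered = sorted(BLOCKS.values())
gaps_bp = [nxt[0] - prev[1] - 1 for prev, nxt in zip(ordered, ordered[1:])]

# Full extent from the start of Block I to the end of Block III.
region_bp = span_bp(ordered[0][0], ordered[-1][1])

for name, length in block_lengths.items():
    print(f"{name}: {length:,} bp")
print(f"Blocks total: {total_block_bp:,} bp")
print(f"Gaps between blocks: {gaps_bp}")
print(f"Region extent: {region_bp:,} bp")
print(f"GRCh37 gap placeholder: {span_bp(*GRCH37_GAP):,} bp")
```

Under this assumption the three blocks sum to 1,175,958 bp (the quoted 1.18 Mbp) and are separated by two 50,000-bp gaps, so the outer coordinates span 1,275,958 bp; the GRCh37 placeholder spans 3,042,857 bp (the quoted 3.04 Mbp).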
We find that Block I (227.10 Kbp) was documented as the DYZ3 alpha satellite array in a centromeric database created from the HuRef WGS read library, whereas the assembly processes for Block II (100.15 Kbp) and Block III (848.71 Kbp), which caused dramatic changes from GRCh37p13.chrY to GRCh38p1.chrY (and throughout GRCh38p7) (Figs  and ), remain unclear (Fig. ). We conclude that the 1.30-Mbp target segment (Supplementary Dataset ) does lie in the centromeric and pericentromeric region of GRCh38p1.chrY (and throughout GRCh38p10.chrY) (Figs , and ), which encourages us to trace the sub-constituents that contributed to the long, sharp straight line with an abrupt turn observed on the GGFM (Figs  and ). [...]
A BLAST search of the 1.30-Mbp segment (Supplementary Dataset ) against the NCBI nr/nt database yields no hit spanning the entire megabase-sized sequence, but it does identify homologous BACs (>150.0 Kbp) (Fig. ). Requiring cover >20% and identity >80%, we choose 15 BACs to compose a dataset for phylogenetic analysis (Fig. ). The results show that the 15 BACs are divergent homologues stemming from the human (H. sapiens) autosomes chr16, chr10, chr9 and chr7 as well as from the chimpanzee (P. troglodytes) autosome chr15 and sex chromosome chrY (Fig. ), and imply that these BACs might be either shared or contaminated. Given that the 1.30-Mbp target segment (Supplementary Dataset ) debuts in GRCh38p1.chrY (rather than in GRCh37p13.chrY) (Figs  and ) but is absent from HuRef.chrY and YH.chrY, which did not use BACs for sequencing and assembly, the BACs are unlikely to be shared.
Figure 6
Note that our method is 2,880 times faster at creating a distance matrix for the 15 chosen BACs: it took only 1 minute to calculate genome fingerprints and create a weighted distance matrix, whereas the MEGA6 package took 48 hours (2,880 minutes) to complete base-to-base alignments and calculate a pair-wise distance matrix (i.e., the MEGA distance matrix).
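The BAC selection criteria described above (cover >20% and identity >80%) amount to a simple threshold filter over BLAST hits. The sketch below is illustrative only: the `Hit` record, its field names, and the example accessions are hypothetical placeholders rather than real BLAST output, and "cover" is assumed to mean query coverage in percent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical, simplified view of one BLAST hit."""
    accession: str
    query_cover_pct: float  # assumption: "cover" means query coverage (%)
    identity_pct: float     # percent identity


def select_hits(hits, min_cover=20.0, min_identity=80.0):
    """Keep hits strictly above both thresholds, as stated in the text."""
    return [
        h for h in hits
        if h.query_cover_pct > min_cover and h.identity_pct > min_identity
    ]


# Placeholder records for illustration; these are NOT real results.
example_hits = [
    Hit("HYPOTHETICAL_A", 35.0, 92.5),  # passes both thresholds
    Hit("HYPOTHETICAL_B", 12.0, 99.0),  # fails the cover threshold
    Hit("HYPOTHETICAL_C", 48.0, 78.0),  # fails the identity threshold
    Hit("HYPOTHETICAL_D", 20.0, 85.0),  # boundary case: cover must exceed 20
]

selected = select_hits(example_hits)
print([h.accession for h in selected])
```

The comparison is strict (`>`), matching the wording of the criteria, so a hit with exactly 20% cover or 80% identity is excluded.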
We use the MEGA6 package to construct traditional bootstrap consensus trees; both the ME (minimum-evolution) tree (Fig. ) and the NJ (neighbour-joining) tree (Fig. ) have low confidence at the questionable sub-branches (e.g., those containing FP565576.7, AC_138511.2, AC_113435.1, AC_137800.3 and AC_136625.2). In contrast, we use our weighted distance matrix to construct an NJ tree (Fig. ) with NEIGHBOR.exe (from the PHYLIP package) and a FastME tree (Fig. ) with FastME software (an updated version of ME). Our trees (Fig. ) show better resolution at the questionable sub-branches observed on the corresponding MEGA trees (Fig. ). Further, we construct two phylogenetic networks (Fig. ) using SplitsTree4 software, based on our weighted distance matrix and the MEGA distance matrix, respectively. The two networks closely resemble one another, but ours (Fig. ) better discriminates the major discrepancies (e.g., FP565576.7, AC185982.3 and FO203515.7) observed on the phylogenetic trees (Fig. ). Accordingly, we find that FP565576.7 (114.29 Kbp, H. sapiens chr10 clone CH17-310O3) has a 35.74-Kbp segment of 7,149 copies of a 5-bp (TGGAA) unit (i.e., a cluster of (TGGAA)7149); AC185982.3 (179.50 Kbp, P. troglodytes chr15 clone CH251-487L24) has a 53.18-Kbp segment of 311 copies of a 171-bp unit; and FO203515.7 (147.99 Kbp, H. sapiens chr9 clone RP11-366F14) has a 9.00-Kbp segment of 1,801 copies of a 5-bp (TCATT) unit. These findings demonstrate that our method gives better resolution for taxa containing high copy numbers of repeats. We therefore use our method to construct phylogenetic networks in the following sections when dealing with many large and divergent sequences, on which traditional approaches may not work. [...]
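The tandem-array sizes reported above follow from copy number times unit length. As a quick consistency check, the sketch below recomputes them and the share of each clone they occupy, assuming perfect, uninterrupted copies (a simplification); the variable and function names are illustrative.

```python
# (accession, repeat unit label, unit length in bp, copies, clone size in Kbp)
# All numbers are taken from the text.
REPEAT_ARRAYS = [
    ("FP565576.7", "TGGAA", 5, 7_149, 114.29),
    ("AC185982.3", "171-bp unit", 171, 311, 179.50),
    ("FO203515.7", "TCATT", 5, 1_801, 147.99),
]


def array_length_bp(unit_bp, copies):
    """Length of a tandem array, assuming perfect uninterrupted copies."""
    return unit_bp * copies


summary = {}
for accession, unit, unit_bp, copies, clone_kbp in REPEAT_ARRAYS:
    length_bp = array_length_bp(unit_bp, copies)
    fraction = length_bp / (clone_kbp * 1_000)  # share of the clone
    summary[accession] = (length_bp, fraction)
    print(f"{accession}: {copies:,} x {unit} = {length_bp:,} bp "
          f"({fraction:.1%} of the clone)")
```

Under this assumption the arrays come to 35,745 bp, 53,181 bp and 9,005 bp, in line with the quoted 35.74, 53.18 and 9.00 Kbp, and the first two occupy roughly 30% of their clones.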
We search the 1.30-Mbp target segment (Supplementary Dataset ) against the RepeatMasker/Repbase database and summarise about 2,720 hits to known repeats (Supplementary Table ), including (1) 1,156 interspersed repeats (350.57 Kbp in total) such as DNA transposons, LTR retrotransposons, and non-LTR retrotransposons; (2) 1,396 tandem repeats (584.95 Kbp in total) such as satellite DNA; and (3) 168 endogenous retroviruses (58.17 Kbp in total). These data suggest that such large-scale repeats (up to 76.43% of the 1.30-Mbp segment) are responsible for its likely misassembly. Because most known repeats are short (Supplementary Table ), which prevents us from analysing them exhaustively one by one, we evaluate selected examples of predicted long repeats (e.g., LTR retrotransposons and satellite DNA). Hence we conduct de novo predictions of long repeats from the 1.30-Mbp segment (Supplementary Dataset ), trace homologues, and analyse evolutionary relationships.
Using LTR-FINDER software, we predict 6 LTR retrotransposons (Fig. ) dispersed over a 98.54-Kbp cluster (from 211,266 bp to 309,809 bp) of the 1.30-Mbp segment (Supplementary Dataset ). We use each of them as a BLAST query against the NCBI nr/nt database and select the top 10 hits (where available) to compose a dataset for phylogenetic analysis. These LTR retrotransposons are mono-centred on the phylogenetic network (Fig. ), consistent with their close locations (Fig. ). These findings weaken the case for an impact of interspersed repeats and thus strengthen the case for an impact of tandem repeats, which remains to be elucidated.
Figure 7
[…]
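The repeat totals quoted above can be tallied as follows. This is a minimal sketch under two assumptions: the three repeat classes are treated as non-overlapping (as the summed hit count implies), and the segment length is taken as the nominal 1.30 Mbp because the exact length is not given in this excerpt. The names used are illustrative.

```python
# (number of hits, total length in Kbp) per repeat class, from the text.
REPEAT_CLASSES = {
    "interspersed repeats": (1_156, 350.57),
    "tandem repeats": (1_396, 584.95),
    "endogenous retroviruses": (168, 58.17),
}

# Assumption: nominal segment length; the exact value is not in the excerpt.
SEGMENT_KBP = 1_300.0

total_hits = sum(hits for hits, _ in REPEAT_CLASSES.values())
total_kbp = sum(kbp for _, kbp in REPEAT_CLASSES.values())
repeat_fraction = total_kbp / SEGMENT_KBP

# Extent of the LTR retrotransposon cluster, assuming inclusive coordinates.
LTR_CLUSTER = (211_266, 309_809)
ltr_cluster_bp = LTR_CLUSTER[1] - LTR_CLUSTER[0] + 1

print(f"Total hits: {total_hits:,}")
print(f"Total repeat length: {total_kbp:.2f} Kbp")
print(f"Share of segment: {repeat_fraction:.2%}")
print(f"LTR cluster extent: {ltr_cluster_bp:,} bp")
```

With the nominal length, the repeats cover about 76.4% of the segment, close to the reported 76.43%; any small difference would come from the unrounded segment length. The LTR cluster spans 98,544 bp, matching the quoted 98.54 Kbp.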

Pipeline specifications

Software tools: FastME, MEGA, BLAT, PHYLIP, SplitsTree, RepeatMasker, LTR_Finder
Databases: Repbase, GDV
Applications: Phylogenetics, WGS analysis, Genome data visualization
Organisms: Homo sapiens