Computational protocol: Genome wide analysis of LTR retrotransposons in oil palm

Protocol publication

[…] A combination of manual approaches and automated programs (REPET package V.2.2-RC []) was used to identify, classify and annotate repeated sequences from the largest scaffolds (size > 300 kbp) assembled for each of the studied genomes. The sequences investigated comprise 991 scaffolds totalling 730,618,412 bp of genome sequence from the O8-build and 846 scaffolds representing 1,068,102,326 bp from the P5-build. TE consensus nucleotide sequences were classified according to the Repbase database [] and named according to the classification proposed by Wicker et al. []: DHX (Helitron), DMX (Maverick), DTX (TIR transposon) and DXX (MITE) for Class II elements, and RIX (LINE), RLX (LTR retrotransposon), RSX (SINE), RXX (unclassified or non-autonomous retrotransposons) and RYX (DIRS) for Class I elements.

Consensus sequences assigned as LTR retrotransposons were further classified through phylogenetic analysis of their reverse transcriptase (RT) amino-acid domains. Putative RT coding domains were first identified in the consensus nucleotide sequences using BLASTX [] and translated using Genewise []. The resulting RT amino-acid sequences (minimum length of 150 residues) and reference RTs from the Gypsy Database 2.0 [] were then aligned with ClustalW to construct a neighbor-joining (NJ) tree, which was finally edited with FigTree. RepeatMasker [] was used with default parameters so that sequences with less than 80 % identity to the reference sequence were masked.

The LTR_STRUC 1.1 algorithm [] was used with default parameters to detect full-length LTR retrotransposons in the complete dataset of Elaeis guineensis scaffolds (1,535,150,282 bp).
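As an illustration of the scaffold selection step described above (only scaffolds longer than 300 kbp were analysed), the following Python sketch filters FASTA records by length. It is not the authors' code: the function names `read_fasta` and `select_large_scaffolds` and the toy input are hypothetical, and the only value taken from the protocol is the 300 kbp threshold.

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_SCAFFOLD_BP = 300_000  # threshold stated in the protocol: size > 300 kbp


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the record name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def select_large_scaffolds(lines: Iterable[str],
                           min_bp: int = MIN_SCAFFOLD_BP) -> Dict[str, int]:
    """Return {scaffold name: length} for scaffolds strictly longer than min_bp."""
    return {name: len(seq) for name, seq in read_fasta(lines) if len(seq) > min_bp}


# Toy input (hypothetical); a real run would pass an open FASTA file handle.
demo = [">s1 demo", "ACGT", "ACGT", ">s2", "AC"]
kept = select_large_scaffolds(demo, min_bp=4)
print(len(kept), "scaffolds kept,", sum(kept.values()), "bp in total")
```

Summing the returned lengths gives the kind of per-build totals reported above (number of scaffolds and cumulative bp).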
The following structural definition was used for a full-length LTR retrotransposon, regardless of its transcriptional or transpositional ability: a repeat element that i) is delimited by highly similar 5’ and 3’ LTRs; ii) has generated a Target Site Duplication (TSD) on each border of the insertion site in the host genome; iii) includes putative primer binding site (PBS) and polypurine tract (PPT) sequences at the 5’ and 3’ ends of its internal sequence, respectively. [...]

All identified full-length LTR retrotransposons were clustered into families or groups using the CD-HIT software [] with a minimum of 70 % nucleotide identity and a minimum sequence coverage of 70 % between related elements. Within each family, the longest sequence displaying a high percentage of nucleotide sequence identity between both LTR regions was selected as the reference sequence. The copy number of each super-family was determined using Censor []. A copy was considered complete if it covered at least 70 % of the reference sequence with at least 70 % nucleotide identity. The density of retrotransposon distribution along pseudo-chromosomes was calculated using a custom shell script with a 1 Mbp sliding window (step size of 500 kbp) and plotted using CIRCOS [].

The insertion times of the previously identified full-length LTR retrotransposons were estimated from the sequence divergence between the 5’ and 3’ LTRs of each element: the two LTR sequences were aligned with Stretcher, and the Kimura 2-parameter distance was then computed with Distmat (both from the EMBOSS package). An average base substitution rate of 1.3E-8 substitutions per site per year was used, in accordance with Ma and Bennetzen []. [...]

The transcriptional analysis was carried out using data deposited in NCBI’s databases (Bioproject number PRJNA201497).
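The insertion-time estimate described above can be sketched in a few lines of Python. This is a stand-alone re-implementation of the Kimura 2-parameter formula, not a call to EMBOSS Distmat, so details such as gap handling may differ from the tool the authors used. The conversion T = K / (2r) is the usual convention for LTR pairs and is an assumption here, since the protocol does not spell out the formula. The function names and the toy sequences are hypothetical; only the rate of 1.3E-8 comes from the text.

```python
import math

SUBSTITUTION_RATE = 1.3e-8  # substitutions per site per year (value from the protocol)

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(ltr5: str, ltr3: str) -> float:
    """Kimura 2-parameter distance between two aligned, equal-length sequences.

    Columns containing a gap or any non-ACGT symbol are skipped (an assumption).
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the pair is too divergent for the model.
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time_years(distance: float, rate: float = SUBSTITUTION_RATE) -> float:
    """T = K / (2r): the two LTRs diverge independently after insertion."""
    return distance / (2 * rate)


# Toy aligned LTR pair with a single transition in ten sites (hypothetical data).
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")
print(d, insertion_time_years(d))
```

The factor of 2 reflects that both LTRs are assumed identical at insertion and accumulate substitutions separately afterwards.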
Eight sets of oil palm (Elaeis guineensis) transcriptome data from different tissues were re-analyzed: root (SRX278062), leaf (SRX278048), shoot apex (SRX278055), young female flower (SRX278052), mature female flower (SRX278053), pollen (SRX278051), kernel (SRX278018) and mesocarp (SRX278017). Data quality was evaluated with FastQC [] and low-quality reads were excluded with Cutadapt []. Reads were mapped against our reference library of full-length LTR retrotransposons using BWA-MEM with default parameters []. Samtools [] was used to calculate the number of mapped reads (counts) for each reference sequence, and normalization was performed using the edgeR package []. The graphical representation of full-length LTR retrotransposon expression in the different tissues was generated with the pheatmap R package []. […]
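To make the count-normalization step above more concrete, here is a minimal Python sketch of a library-size normalization (counts per million, then a log2 transform suitable for a heatmap). It is a simplified stand-in, not the edgeR procedure: the protocol does not say which edgeR normalization was applied, and edgeR's scaling factors are not reproduced here. The element names and read counts are invented for illustration.

```python
import math
from typing import Dict


def counts_per_million(counts: Dict[str, int]) -> Dict[str, float]:
    """Scale the raw mapped-read counts of one library to counts per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no mapped reads")
    return {ref: c * 1_000_000 / total for ref, c in counts.items()}


def log2_cpm(counts: Dict[str, int], pseudocount: float = 1.0) -> Dict[str, float]:
    """log2(CPM + pseudocount); the pseudocount keeps zero counts finite."""
    return {ref: math.log2(v + pseudocount)
            for ref, v in counts_per_million(counts).items()}


# Hypothetical per-tissue counts, e.g. as tallied per reference sequence.
libraries = {
    "root": {"LTR_family_A": 120, "LTR_family_B": 30, "LTR_family_C": 0},
    "leaf": {"LTR_family_A": 10, "LTR_family_B": 400, "LTR_family_C": 90},
}
for tissue, counts in libraries.items():
    print(tissue, log2_cpm(counts))
```

Normalizing each tissue separately makes libraries of different sequencing depth comparable before the values are arranged as a reference-by-tissue matrix for plotting.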

Pipeline specifications