Computational protocol: Genome wide discovery of long intergenic non-coding RNAs in Diamondback moth (Plutella xylostella) and their expression in insecticide resistant strains

[…] A stringent filtering pipeline was developed to discard transcripts with evidence of protein-coding potential. The pipeline for P. xylostella lincRNA discovery is summarized in . We identified 55,793 potential genes using the CLC Genomics Workbench transcript discovery algorithm. Genes annotated as known P. xylostella genes were discarded, and 35,425 potential genes were then checked for any exon or intron overlap with other known P. xylostella genes. We selected 14,663 sequences located more than 1 kb away from any known transcript and searched them for putative open reading frames (ORFs). Each selected sequence was translated in all six reading frames, and the translated sequences were searched against the Pfam v27.0 database to identify any putative conserved protein domains. We discarded 4,746 sequences with a potential ORF longer than 100 aa or with a conserved protein domain. The remaining 9,917 sequences were checked for similarity to known proteins using the BLASTx algorithm against the nr and Swiss-Prot databases (E-value cutoff 10^-5). We also applied an expression threshold to strengthen the identification pipeline: sequences with more than 10 mappable reads in at least three of the eight RNA-seq libraries were considered valid and kept for the next step. A total of 4,522 sequences were submitted to the publicly available Coding Potential Calculator (CPC) tool to check for any other potential coding regions. CPC is a support vector machine-based classifier that assesses the protein-coding potential of a transcript from six biologically meaningful sequence features. Sequences with a score above −1 were classified as putative protein-coding genes and removed from the list.
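The ORF-length and expression filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the standard genetic code is assumed, and an ORF is taken here to run from the first methionine to the next stop codon (or to the end of the frame), a definition the protocol does not specify.

```python
from itertools import product

# Standard genetic code, built in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(seq):
    """Translate a DNA sequence in all six reading frames."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    proteins = []
    for strand in (seq, reverse):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons containing ambiguous bases (e.g. N) are written as "X".
            proteins.append("".join(CODON_TABLE.get(c, "X") for c in codons))
    return proteins


def longest_orf_aa(seq):
    """Length (in aa) of the longest ORF over all six frames.

    Assumption: an ORF starts at the first M of a stop-free segment and
    ends at the next stop codon or at the end of the frame.
    """
    longest = 0
    for protein in six_frame_translations(seq):
        for segment in protein.split("*"):
            start = segment.find("M")
            if start != -1:
                longest = max(longest, len(segment) - start)
    return longest


def passes_orf_filter(seq, max_orf_aa=100):
    """Keep sequences whose longest ORF does not exceed 100 aa."""
    return longest_orf_aa(seq) <= max_orf_aa


def passes_expression_filter(read_counts, min_reads=10, min_libraries=3):
    """Keep sequences with more than `min_reads` reads in at least
    `min_libraries` of the supplied RNA-seq libraries."""
    return sum(1 for count in read_counts if count > min_reads) >= min_libraries
```

In the protocol the translated frames are additionally searched against Pfam, and the surviving sequences against nr and Swiss-Prot with BLASTx; those steps rely on external tools and are not reproduced here.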
The data were also submitted to a second tool, the Coding-Potential Assessment Tool (CPAT), which uses a logistic regression model built from four sequence features: ORF size, ORF coverage, Fickett TESTCODE statistic and hexamer usage bias. We applied a coding probability threshold of 0.3, which led to 27 sequences being discarded as putative coding RNAs. Finally, identical and overlapping sequences were removed from the Px lincRNA set, and 3,844 potential lincRNAs were used for further study.

To identify putative P. xylostella lincRNAs that can be regarded as small RNA-associated lincRNAs, we used the BLAST algorithm to search for diamondback moth (DBM) pre-miRNA sequences in the predicted DBM lincRNA dataset. […]
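As an illustration of how the CPC and CPAT cutoffs and the removal of identical sequences might be combined, the following minimal sketch filters a list of candidates. The `Candidate` record, its field names and the helper functions are hypothetical and do not correspond to the output format of either tool. Treating a CPAT probability of exactly 0.3 as coding is also an assumption, and the removal of overlapping sequences is omitted because the protocol does not say how overlaps were resolved.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record holding one transcript and its two scores."""
    name: str
    sequence: str
    cpc_score: float         # CPC score; above -1 is treated as coding
    cpat_probability: float  # CPAT coding probability


def is_putative_noncoding(candidate, cpc_cutoff=-1.0, cpat_cutoff=0.3):
    """True if both tools place the transcript below their coding cutoffs."""
    return (candidate.cpc_score <= cpc_cutoff
            and candidate.cpat_probability < cpat_cutoff)


def filter_candidates(candidates):
    """Drop putative coding transcripts and exact duplicate sequences.

    Assumption: "identical" means the same sequence string, ignoring case;
    the first occurrence is kept.
    """
    seen = set()
    kept = []
    for candidate in candidates:
        key = candidate.sequence.upper()
        if key in seen or not is_putative_noncoding(candidate):
            continue
        seen.add(key)
        kept.append(candidate)
    return kept


# Made-up example values, for illustration only.
example = [
    Candidate("linc_a", "ACGTACGT", -1.3, 0.05),
    Candidate("coding_b", "TTGACCAA", 0.4, 0.80),
    Candidate("coding_c", "GGATCCAA", -1.2, 0.35),
    Candidate("linc_a_copy", "acgtacgt", -1.3, 0.05),
]
putative_lincrnas = filter_candidates(example)
```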

Pipeline specifications

Software tools BLASTX, CPC, CPAT
Databases Pfam
Applications RNA-seq analysis, Transcription analysis
Organisms Plutella xylostella, Bacillus thuringiensis
Diseases Neoplasms
Chemicals Chlorpyrifos