Computational protocol: The folded k-spectrum kernel: A machine learning approach to detecting transcription factor binding sites with gapped nucleotide dependencies

[…] After filtering, we compared the individual gapped k-mers against a relevant motif database (i.e., JASPAR’s core database for insects) to elucidate the putative regulatory factors. For this task, we used the motif comparison tool Tomtom, which calculates a statistical measure (p-value) to quantify the similarity between two motifs. For each sequence, we obtained all the motif hits found by Tomtom and retained those with a significant p-value ($p < 10^{-3}$).

To improve the filtering procedure, we further discard “insignificant” false positives by filtering out any gapped k-mer that does not lead to a significant JASPAR hit. That is, for each false positive input $\chi_j^f$, $j = 1, \ldots, 10t$, we construct all enriched gapped k-mer motifs and search the motif database via Tomtom. If a gapped k-mer motif results in a significant hit to a JASPAR motif (p-value < 0.001), we keep it, assuming that it consistently appears on the genomic background; otherwise, if the motif is not found in JASPAR, we discard it as an “insignificant” false positive.

The remaining gapped k-mers (features) are referred to as “false” and are used to remove the corresponding features from the top features of the positive training sequence. That is, we filter out the n-th gapped k-mer (set $r_i(n) = 0$) if it belongs to one of the “gapped models” of the falsely enriched features (corresponding to $r_j^f(n)$) in the respective scrambled copies $j = i, \ldots, i + 9$. After removal, the remaining gapped k-mers (with enrichment scores $r_i^h(n) > 0.005$) at each input sequence are assumed to be “high-confidence” predictions, and we constrained the feature set to this collection, i.e., $\Psi_k^h = \left(\bigcup_i I(r_i^h)\right) \cap \Psi_k$, where $I(r_i^h)$ represents the indices of the remaining features with $r_i^h(n) > 0.005$, $n \in \{1, \ldots, N\}$. […]
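The filtering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the input data structures, and the tab-separated column names (`Query_ID`, `p-value`) are assumptions made for illustration, and membership in a "gapped model" is simplified to a match on the feature index n. Only the two thresholds (p-value < 0.001 and enrichment score > 0.005) come from the text.

```python
import csv
import io

P_CUTOFF = 1e-3       # Tomtom p-value threshold stated in the text
SCORE_CUTOFF = 0.005  # enrichment-score threshold stated in the text


def significant_queries(tomtom_tsv, p_cutoff=P_CUTOFF):
    """Return IDs of query motifs with at least one hit below p_cutoff.

    Assumes (hypothetically) a tab-separated table with a header row that
    contains 'Query_ID' and 'p-value' columns; blank lines and lines
    starting with '#' are skipped.
    """
    lines = (
        line
        for line in io.StringIO(tomtom_tsv)
        if line.strip() and not line.startswith("#")
    )
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["Query_ID"] for row in reader if float(row["p-value"]) < p_cutoff}


def high_confidence_features(scores, false_features, cutoff=SCORE_CUTOFF):
    """Return the union over sequences i of indices n with r_i^h(n) > cutoff.

    scores:         dict mapping sequence id i -> list of scores r_i(n)
    false_features: dict mapping sequence id i -> set of feature indices n
                    flagged as "false" by the scrambled copies of sequence i
                    (assumption: a match on the index n stands in for
                    membership in a "gapped model")
    """
    kept = set()
    for seq_id, r in scores.items():
        banned = false_features.get(seq_id, set())
        for n, value in enumerate(r):
            # Setting r_i(n) = 0 for falsely enriched features yields r_i^h(n).
            r_h = 0.0 if n in banned else value
            if r_h > cutoff:
                kept.add(n)
    return kept
```

Because every returned index already refers to a feature in $\Psi_k$, the intersection with $\Psi_k$ in the formula is implicit in this sketch.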

Pipeline specifications

Software tools: Tomtom, BaMM!motif
Application: ChIP-seq analysis
Organisms: Middle East respiratory syndrome-related coronavirus, Drosophila melanogaster
Chemicals: Nucleotides