Computational protocol: An Improved Canine Genome and a Comprehensive Catalogue of Coding Genes and Non-Coding Transcripts

Protocol publication

[…] After target gaps were identified in the assembly, we selected those that were covered by at least two fosmid templates. Gaps that met this coverage requirement were analyzed with primer3 to find suitable paired primers. Finishing reads (see below) were then generated using primer walks and integrated into a “chunked” sub-assembly. A sub-assembly “chunk” is an approximately 2 kb blunt-end extract of the whole genome assembly surrounding each gap. This integration was done using the shotgun assembly module of GAP4. After all finishing reads were integrated into the chunk assembly, we analyzed a novel data structure called a “read coverage signature” (RCS) to determine whether the gap had been closed or extended and to ensure that the chunk had not been misassembled (see explanation below). The consensus sequence from the chunks that satisfied the RCS analysis was subsequently patched into the whole genome assembly at the positions defined when the chunk was created.

[…] We mapped annotations from the human genome (Gencode version 9) onto the new genome build using syntenic relationships between the two genomes in a two-step pipeline (Zamani et al. submitted). First, we used the synteny aligner Satsuma to establish syntenic relationships between the human and dog genome sequences. This information was then used to produce a rough mapping of annotated features from the query species to the target species within the syntenic regions. Candidate mappings were subsequently re-aligned using a local alignment strategy between feature boundaries. In a final step, we checked the intron-exon boundaries annotated in this manner for neighboring canonical splice sites and adjusted the boundaries if such sites were found within 5 nucleotides of the predicted boundaries.
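The final boundary-adjustment step can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the forward-strand and 0-based half-open coordinate conventions, the restriction to GT/AG motifs, and the rule of preferring the nearest site (upstream on ties) are all assumptions made for illustration. Only the 5-nucleotide search window comes from the text.

```python
# Hypothetical sketch: snap predicted intron boundaries to a nearby
# canonical GT...AG splice site. Only MAX_SHIFT is taken from the protocol.
MAX_SHIFT = 5  # "within 5 nucleotides" of the predicted boundary


def snap_boundary(seq, pos, motif, max_shift=MAX_SHIFT):
    """Return the position nearest to `pos` (within max_shift) where
    `motif` starts in `seq`; return `pos` unchanged if none is found."""
    seq = seq.upper()
    # Visit offsets by increasing distance: 0, -1, +1, -2, +2, ...
    # (upstream first on ties; this tie-break is an assumption).
    offsets = [0]
    for d in range(1, max_shift + 1):
        offsets.extend((-d, d))
    for off in offsets:
        cand = pos + off
        if cand >= 0 and seq[cand:cand + len(motif)] == motif:
            return cand
    return pos


def adjust_intron(seq, start, end):
    """Adjust a forward-strand intron given as 0-based half-open [start, end)."""
    # The donor site GT occupies the first two intron bases.
    new_start = snap_boundary(seq, start, "GT")
    # The acceptor site AG occupies the last two intron bases,
    # so the motif starts at end - 2.
    new_end = snap_boundary(seq, end - 2, "AG") + 2
    return new_start, new_end
```

For the toy sequence `CCCCCCGTAAACCCAGCCCC`, `adjust_intron(seq, 8, 14)` returns `(6, 16)`, moving both boundaries by two nucleotides onto the GT and AG motifs. A sequence with no canonical site in the window is returned unchanged.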
While all sequences mapped to canFam2.0 could also be located in canFam3.1, we note that 75 transcripts could be mapped from human onto canFam3.1 thanks to the recovered sequences but are absent from the canFam2.0 build. Out of the ∼1,000 additional exons in canFam3.1, about 60% are first exons, regions that are known to be rich in GC content and were therefore absent from canFam2.0. Of the ∼18,300 loci that could not be mapped from human due to a lack of orthology, the most prominent group consists of retrotransposed pseudogenes (5,827), followed by uncharacterized novel genes with unconfirmed transcriptional support level (4,134). Together these groups account for more than half of the unmapped gene loci. Gene families with missing members include olfactory receptors (330), immunoglobulin and immunoglobulin-related genes (306), zinc finger proteins (204), microRNAs (209), uncharacterized gene families (143), and keratins and keratin-associated proteins (103).

[…] We aligned all RNA-Seq reads against the unmasked, euchromatic portion of the dog reference genome (canFam3.1, obtained from EnsEMBL release 68) using the splice-junction mapper TopHat (version 2.0.5) with default parameters. We filtered the resulting read alignments using a quality cut-off of 15 and assembled them into transcript models individually per tissue using the Cufflinks package (version 2.0.2). We performed both steps without a reference annotation to avoid the introduction of biases. We then used the cuffmerge tool, part of the Cufflinks package, to merge the transcriptome annotations from all tissues, first into one consensus annotation for each library preparation and then into one combined annotation across all samples (poly-A and DSN). […]
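The align, filter, assemble and merge steps above could be laid out roughly as follows. This is a sketch under stated assumptions, not the authors' scripts: the directory layout, file names and helper names are hypothetical, and the use of `samtools view -q` for the quality filter is an assumption, since the text does not name the filtering tool or say which quality score was thresholded. The functions only build argument lists and do not run anything.

```python
# Hypothetical sketch of the per-tissue RNA-Seq workflow as command lines.
# Paths, names and the choice of samtools are assumptions for illustration.
MIN_QUALITY = 15  # quality cut-off stated in the protocol


def tissue_commands(tissue, index, reads, outdir="out"):
    """Return argv lists for aligning, filtering and assembling one tissue."""
    aln_dir = f"{outdir}/{tissue}/tophat"
    asm_dir = f"{outdir}/{tissue}/cufflinks"
    filtered = f"{aln_dir}/filtered.bam"
    return [
        # TopHat with default parameters and no reference annotation.
        ["tophat", "-o", aln_dir, index, *reads],
        # Assumed filter: keep alignments with mapping quality >= 15.
        ["samtools", "view", "-b", "-q", str(MIN_QUALITY),
         "-o", filtered, f"{aln_dir}/accepted_hits.bam"],
        # Cufflinks without a reference annotation (no -G or -g option).
        ["cufflinks", "-o", asm_dir, filtered],
    ]


def merge_command(label, gtf_list, outdir="out"):
    """Return the cuffmerge argv for a text file listing GTF paths."""
    return ["cuffmerge", "-o", f"{outdir}/merged_{label}", gtf_list]


if __name__ == "__main__":
    for cmd in tissue_commands("liver", "canFam3.1", ["liver_1.fq", "liver_2.fq"]):
        print(" ".join(cmd))
    # One merge per library preparation, then one across all samples.
    for label in ("polyA", "DSN", "combined"):
        print(" ".join(merge_command(label, f"{label}_gtfs.txt")))
```

Keeping the commands as plain lists makes the two-level merge explicit: `merge_command` is called once per library preparation with that preparation's per-tissue GTF files, and once more with the list for all samples.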

Pipeline specifications

Software tools: Satsuma, TopHat, Cufflinks
Databases: GENCODE
Applications: RNA-seq analysis, Nucleotide sequence alignment
Organisms: Mus musculus, Canis lupus familiaris, Homo sapiens