Computational protocol: Gene Body Methylation Patterns in Daphnia Are Associated with Gene Family Size

[…] Gene models were extracted from the 2011 frozen annotation version of the D. pulex reference genome, downloaded from the DOE JGI Genome Portal. Given the fragmented state of the D. pulex reference genome, current gene numbers and gene copies within a family may be inflated. We therefore filtered these gene models to a conservative but representative gene list using the following previously suggested criteria. All gene models occurring within poorly covered regions or having gapped alignments were removed. In particular, all genes with 50 or more consecutive unidentified bases (labeled as N) were excluded. In addition, only gene models with protein sequences containing both a start and a stop codon were retained. Finally, only D. pulex gene models with a significant hit in a reciprocal blast (cutoff e-value 1e−05) against the available D. magna gene set were retained (last accessed April 4, 2016). These filtering steps resulted in a conserved D. pulex gene set of 14,102 genes and, through the reciprocal blast, a conserved orthologous D. magna gene set of 8,800 genes. Genes within the D. pulex set have been transcriptionally validated through several microarray experiments, while D. magna gene models have been validated using extensive RNAseq experiments (Orsini et al. submitted for publication).

To evaluate potential bias in the conservative gene set we used BUSCO, a software tool that provides quantitative measures of gene set completeness. BUSCO uses single-copy orthologs from OrthoDB, called benchmarks, to evaluate the completeness of a gene set. We used it to assess how representative the conserved gene sets were compared with the complete, unfiltered gene set as previously reported (last accessed April 4, 2016). We found 72% of the benchmark single-copy orthologs defined by BUSCO in the conserved D. magna gene set and 69% in the conserved D. pulex gene set, while 94% of the orthologs were present when using all available gene models (30,940 genes). By using a conserved gene set rather than the full gene set, we reduce the chance of inflating gene copy numbers and gene family size due to errors in sequence assembly.

Cytosine-specific methylation levels for each gene body within the conservative set were obtained by overlapping these gene models, using BEDtools 2.17.0, with the cytosine-specific methylation levels determined above. The methylation level of a gene was inferred as the sum of all methylation rates within the gene divided by the total number of cytosines covering the feature. […]
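
The gene-level methylation calculation described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: a plain coordinate comparison stands in for the BEDtools overlap, and the function name `methylation_by_gene`, the tuple layouts, and the half-open BED-style coordinates are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical layouts (assumptions, not from the protocol):
#   gene:     gene_id -> (scaffold, start, end), half-open [start, end)
#   cytosine: (scaffold, position, methylation_rate in [0, 1])
Gene = Tuple[str, int, int]
Cytosine = Tuple[str, int, float]


def methylation_by_gene(
    genes: Dict[str, Gene], cytosines: Iterable[Cytosine]
) -> Dict[str, Optional[float]]:
    """Sum of per-cytosine rates in each gene body divided by the
    number of covered cytosines; None when no cytosine is covered."""
    # Group cytosines by scaffold so each gene only scans its own scaffold.
    by_scaffold = defaultdict(list)
    for scaffold, pos, rate in cytosines:
        by_scaffold[scaffold].append((pos, rate))

    levels: Dict[str, Optional[float]] = {}
    for gene_id, (scaffold, start, end) in genes.items():
        rates = [
            rate
            for pos, rate in by_scaffold.get(scaffold, [])
            if start <= pos < end
        ]
        levels[gene_id] = sum(rates) / len(rates) if rates else None
    return levels
```

A gene with no covered cytosines is reported as `None` rather than 0, so that missing coverage is not mistaken for an unmethylated gene body; whether the original analysis made the same choice is not stated.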

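The sequence-level filters described earlier in the protocol (runs of 50 or more N bases, and complete protein sequences) can be sketched as follows. This is an illustrative sketch only: the helper names, the `(genomic_sequence, protein_sequence)` record layout, and the convention that a complete protein starts with `M` and ends with a `*` stop symbol are assumptions, not details given in the protocol.

```python
import re
from typing import Dict, Tuple

# Threshold taken from the text: 50 or more consecutive unidentified bases.
LONG_N_RUN = re.compile(r"N{50,}", re.IGNORECASE)


def has_long_n_run(genomic_seq: str) -> bool:
    """True if the sequence contains at least 50 consecutive N bases."""
    return LONG_N_RUN.search(genomic_seq) is not None


def is_complete_protein(protein_seq: str) -> bool:
    """Assumed convention: translated start codon appears as a leading 'M'
    and the stop codon as a trailing '*'."""
    return protein_seq.startswith("M") and protein_seq.endswith("*")


def filter_gene_models(
    models: Dict[str, Tuple[str, str]]
) -> Dict[str, Tuple[str, str]]:
    """Keep gene models without long N runs and with complete proteins.

    `models` maps a gene id to (genomic_sequence, protein_sequence).
    """
    return {
        gene_id: (genomic, protein)
        for gene_id, (genomic, protein) in models.items()
        if not has_long_n_run(genomic) and is_complete_protein(protein)
    }
```

The reciprocal blast step is not covered here, since it depends on external search results rather than on the sequences alone.
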
Software tools: EvidentialGene, BUSCO, BEDTools
Databases: OrthoDB