Genomicus: a database and a browser to study gene synteny in modern and ancestral genomes
Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria
Summary: Comparative genomics remains a pivotal strategy to study the evolution of gene organization, and this primacy is reinforced by the growing number of full genome sequences available in public repositories. Despite this growth, bioinformatic tools available to visualize and compare genomes and to infer evolutionary events remain restricted to two or three genomes at a time, thus limiting the breadth and the nature of the question that can be investigated. Here we present Genomicus, a new synteny browser that can represent and compare unlimited numbers of genomes in a broad phylogenetic view. In addition, Genomicus includes reconstructed ancestral gene organization, thus greatly facilitating the interpretation of the data. Availability: Genomicus is freely available for online use at http://www.dyogen.ens.fr/genomicus while data can be downloaded at ftp://ftp.biologie.ens.fr/pub/dyogen/genomicus Contact: [email protected]
Developing predictive models of multi-protein genetic systems to understand and optimize their behavior remains a combinatorial challenge, particularly when measurement throughput is limited. We developed a computational approach to build predictive models and identify optimal sequences and expression levels, while circumventing combinatorial explosion. Maximally informative genetic system variants were first designed by the RBS Library Calculator, an algorithm to design sequences for efficiently searching a multi-protein expression space across a > 10,000-fold range with tailored search parameters and well-predicted translation rates. We validated the algorithm's predictions by characterizing 646 genetic system variants, encoded in plasmids and genomes, expressed in six gram-positive and gram-negative bacterial hosts. We then combined the search algorithm with system-level kinetic modeling, requiring the construction and characterization of 73 variants to build a sequence-expression-activity map (SEAMAP) for a biosynthesis pathway. Using model predictions, we designed and characterized 47 additional pathway variants to navigate its activity space, find optimal expression regions with desired activity response curves, and relieve rate-limiting steps in metabolism. Creating sequence-expression-activity maps accelerates the optimization of many protein systems and allows previous measurements to quantitatively inform future designs.
Jflow: a workflow management system for web applications
ReliableGenome: annotation of genomic regions with high/low variant calling concordance
Summary: Biologists produce large data sets and need rich yet simple web portals in which they can upload and analyze their files. Providing such tools requires masking the complexity of the underlying High Performance Computing (HPC) environment. The connection between interface and computing infrastructure is usually specific to each portal. With Jflow, we introduce a Workflow Management System (WMS) composed of jQuery plug-ins, which can easily be embedded in any web application, and a Python library providing all the features required to set up, run and monitor workflows.
Availability and implementation: Jflow is available under the GNU General Public License (GPL) at http://bioinfo.genotoul.fr/jflow. The package comes with full documentation, a quick start guide and a running test portal.
The increasing adoption of clinical whole-genome resequencing (WGS) demands highly accurate and reproducible variant calling (VC) methods. The observed discordance between state-of-the-art VC pipelines, however, indicates that current practice still suffers from non-negligible numbers of false positive and false negative SNV and INDEL calls, which were shown to be enriched among discordant calls and in genomic regions with low sequence complexity. Here, we describe our method ReliableGenome (RG) for partitioning genomes into high and low concordance regions with respect to a set of surveyed VC pipelines. Our method combines call sets derived by multiple pipelines from arbitrary numbers of datasets and interpolates expected concordance for genomic regions without data. By applying RG to 219 deep human WGS datasets, we demonstrate that VC concordance depends predominantly on genomic context rather than on the actual sequencing data, which manifests in high recurrence of regions that can/cannot be reliably genotyped by a single method. This enables the application of pre-computed regions to other data created with comparable sequencing technology and software. RG outperforms comparable efforts in predicting VC concordance and false positive calls in low-concordance regions, which underlines its usefulness for variant filtering, annotation and prioritization. RG allows focusing resource-intensive algorithms (e.g. consensus calling methods) on the smaller, discordant share of the genome (20–30%), which might result in increased overall accuracy at reasonable costs. Our method and analysis of discordant calls may further be useful for development, benchmarking and optimization of VC algorithms and for the relative comparison of call sets between different studies/pipelines. RG was implemented in Java; source code and binaries are freely available for non-commercial use at https://github.com/popitsch/wtchg-rg/.
Supplementary data are available at Bioinformatics online.
FastRFS: fast and accurate Robinson-Foulds Supertrees using constrained exact optimization
The estimation of phylogenetic trees is a major part of many biological dataset analyses, but maximum likelihood approaches are NP-hard and Bayesian MCMC methods do not scale well to even moderate-sized datasets. Supertree methods, which are used to construct trees from trees computed on subsets, are critically important tools for enabling the statistical estimation of phylogenies for large and potentially heterogeneous datasets. Supertree estimation is itself NP-hard, and no current supertree method is sufficiently accurate and scalable for the large datasets that supertree methods were designed for, containing thousands of species and many subset trees. We present FastRFS, a new method based on a dynamic programming algorithm we have developed to find an exact solution to the Robinson-Foulds Supertree problem within a constrained search space. FastRFS has excellent accuracy in terms of criterion scores and topological accuracy of the resultant trees, substantially improving on competing methods on a large collection of biological and simulated data. In addition, FastRFS is extremely fast, finishing in minutes on even very large datasets, and in under an hour on a biological dataset with 2228 species. FastRFS is available on GitHub at https://github.com/pranjalv123/FastRFS
Supplementary data are available at Bioinformatics online.
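To make the optimization criterion concrete, the Robinson-Foulds distance between two trees on the same leaf set can be computed as the number of bipartitions found in one tree but not the other. The sketch below is not the FastRFS dynamic programming algorithm; it only evaluates the distance, and it assumes (for illustration) that trees are given as nested Python tuples with string leaf names. The function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf names of a tree given as nested tuples."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions (splits) induced by internal edges.

    Each split is stored as the side that does NOT contain a fixed anchor
    leaf, so the same split always has the same representation.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds distance: size of the symmetric difference of splits."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must share the same leaf set")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(rf_distance(t1, t2))  # 2
```

In the supertree setting each source tree covers only a subset of the species, so a candidate supertree would first be restricted to that subset before the distance is taken; that restriction step is omitted here.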
IDP-ASE: haplotyping and quantifying allele-specific expression at the gene and gene isoform level by hybrid sequencing
Allele-specific expression (ASE) is a fundamental problem in studying gene regulation and diploid transcriptome profiles, with two key challenges: (i) haplotyping and (ii) estimation of ASE at the gene isoform level. Existing ASE analysis methods are limited by a dependence on haplotyping from laborious experiments or extra genome/family trio data. In addition, there is a lack of methods for gene isoform level ASE analysis. We developed a tool, IDP-ASE, for full ASE analysis. By innovative integration of Third Generation Sequencing (TGS) long reads with Second Generation Sequencing (SGS) short reads, the accuracy of haplotyping and ASE quantification at the gene and gene isoform level was greatly improved, as demonstrated on the gold standard GM12878 data and semi-simulation data. In addition to methodology development, applications of IDP-ASE to human embryonic stem cells and breast cancer cells indicate that the imbalance of ASE and non-uniformity of gene isoform ASE are widespread, including in tumorigenesis-relevant genes and pluripotency markers. These results show that gene isoform expression and allele-specific expression cooperate to provide high diversity and complexity of gene regulation and expression, highlighting the importance of studying ASE at the gene isoform level. Our study provides a robust bioinformatics solution to understand ASE using RNA sequencing data only.
PanViz: interactive visualization of the structure of functionally annotated pangenomes
Supplementary data are available at Bioinformatics online.
Re-annotation and re-analysis of the Campylobacter jejuni NCTC11168 genome sequence
Campylobacter jejuni is the leading bacterial cause of human gastroenteritis in the developed world. To improve our understanding of this important human pathogen, the C. jejuni NCTC11168 genome was sequenced and published in 2000. The original annotation was a milestone in Campylobacter research, but is outdated. We now describe the complete re-annotation and re-analysis of the C. jejuni NCTC11168 genome using current database information, novel tools and annotation techniques not used during the original annotation. Re-annotation was carried out using sequence database searches such as FASTA, along with programs such as TMHMM for additional support. The re-annotation also utilises sequence data from additional Campylobacter strains and species not available during the original annotation. Re-annotation was accompanied by a full literature search that was incorporated into the updated EMBL file [EMBL: AL111168]. The C. jejuni NCTC11168 re-annotation reduced the total number of coding sequences from 1654 to 1643, of which 90.0% have additional information regarding the identification of new motifs and/or relevant literature. Re-annotation has led to 18.2% of coding sequence product functions being revised. Major updates were made to genes involved in the biosynthesis of important surface structures such as lipooligosaccharide, capsule and both O- and N-linked glycosylation. This re-annotation will be a key resource for Campylobacter research and will also provide a prototype for the re-annotation and re-interpretation of other bacterial genomes.
Genome dynamics and diversity of Shigella species, the etiologic agents of bacillary dysentery
The Shigella bacteria cause bacillary dysentery, which remains a significant threat to public health. The genus status and species classification appear no longer valid, as compelling evidence indicates that Shigella, as well as enteroinvasive Escherichia coli, are derived from multiple origins of E. coli and form a single pathovar. Nevertheless, Shigella dysenteriae serotype 1 causes deadly epidemics but Shigella boydii is restricted to the Indian subcontinent, while Shigella flexneri and Shigella sonnei are prevalent in developing and developed countries respectively. To begin to explain these distinctive epidemiological and pathological features at the genome level, we have carried out comparative genomics on four representative strains. Each of the Shigella genomes includes a virulence plasmid that encodes conserved primary virulence determinants. The Shigella chromosomes share most of their genes with that of E. coli K12 strain MG1655, but each has over 200 pseudogenes, 300–700 copies of insertion sequence (IS) elements, and numerous deletions, insertions, translocations and inversions. There is extensive diversity of putative virulence genes, mostly acquired via bacteriophage-mediated lateral gene transfer. Hence, via convergent evolution involving gain and loss of functions, through bacteriophage-mediated gene acquisition, IS-mediated DNA rearrangements and formation of pseudogenes, the Shigella spp. became highly specific human pathogens with variable epidemiological and pathological features.
Phyx: phylogenetic tools for unix
The ease with which phylogenomic data can be generated has drastically escalated the computational burden for even routine phylogenetic investigations. To address this, we present phyx: a collection of programs written in C++ to explore, manipulate, analyze and simulate phylogenetic objects (alignments, trees and MCMC logs). Modelled after Unix/GNU/Linux command line tools, individual programs perform a single task and operate on standard I/O streams that can be piped to quickly and easily form complex analytical pipelines. Because of the stream-centric paradigm, memory requirements are minimized (often only a single tree or sequence in memory at any instance), and hence phyx is capable of efficiently processing very large datasets.
phyx runs on POSIX-compliant operating systems. Source code, installation instructions, documentation and example files are freely available under the GNU General Public License at https://github.com/FePhyFoFum/phyx
Supplementary data are available at Bioinformatics online.
BioJava ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank
We developed a new software tool, BioJava-ModFinder, for identifying protein modifications observed in 3D structures archived in the Protein Data Bank (PDB). Information on more than 400 types of protein modifications was collected and curated from annotations in PDB, RESID, and PSI-MOD. We divided these modifications into three categories: modified residues, attachment modifications, and cross-links. We have developed a systematic method to identify these modifications in 3D protein structures. We have integrated this package with the RCSB PDB web application and added protein modification annotations to the sequence diagram and structure display. By scanning all 3D structures in the PDB using BioJava-ModFinder, we identified more than 30 000 structures with protein modifications, which can be searched, browsed, and visualized on the RCSB PDB website. BioJava-ModFinder is available as open source (LGPL license) at https://github.com/biojava/biojava/tree/master/biojava-modfinder. The RCSB PDB can be accessed at http://www.rcsb.org.
RNAscClust: clustering RNA sequences using structure conservation and graph-based motifs
Clustering RNA sequences with common secondary structure is an essential step towards studying RNA function. Whereas structural RNA alignment strategies typically identify common structure for orthologous structured RNAs, clustering seeks to group paralogous RNAs based on structural similarities. However, existing approaches for clustering paralogous RNAs do not take into account the compensatory base pair changes obtained from structure conservation in orthologous sequences. Here, we present RNAscClust, the implementation of a new algorithm to cluster a set of structured RNAs taking their respective structural conservation into account. For a set of multiple structural alignments of RNA sequences, each containing a paralog sequence included in a structural alignment of its orthologs, RNAscClust computes minimum free-energy structures for each sequence using conserved base pairs as prior information for the folding. The paralogs are then clustered using a graph kernel-based strategy, which identifies common structural features. We show that the clustering accuracy clearly benefits from an increasing degree of compensatory base pair changes in the alignments.
RNAscClust is available at http://www.bioinf.uni-freiburg.de/Software/RNAscClust.
Supplementary data are available at Bioinformatics online.
The RNASeq-er API: a gateway to systematically updated analysis of public RNA-seq data
The exponential growth of publicly available RNA-sequencing (RNA-Seq) data poses an increasing challenge to researchers wishing to discover, analyse and store such data, particularly those based in institutions with limited computational resources. EMBL-EBI is in an ideal position to address these challenges and to allow the scientific community easy access to not just raw, but also processed RNA-Seq data. We present a Web service to access the results of a systematically and continually updated standardized alignment as well as gene and exon expression quantification of all public bulk (and in the near future also single-cell) RNA-Seq runs in 264 species in the European Nucleotide Archive (ENA), using Representational State Transfer. The RNASeq-er API (Application Programming Interface) enables ontology-powered search for and retrieval of CRAM, bigwig and bedGraph files, gene and exon expression quantification matrices (Fragments Per Kilobase Of Exon Per Million Fragments Mapped, Transcripts Per Million, raw counts) as well as sample attributes annotated with ontology terms. To date over 270 000 RNA-Seq runs in nearly 10 000 studies (1 PB of raw FASTQ data) in 264 species in ENA have been processed and made available via the API. The RNASeq-er API can be accessed at http://www.ebi.ac.uk/fg/rnaseq/api. The commands used to analyse the data are available in supplementary materials and at https://github.com/nunofonseca/irap/wiki/iRAP-single-library.
Supplementary data are available at Bioinformatics online.
Hierarchical probabilistic models for multiple gene/variant associations based on next generation sequencing data
The identification of genetic variants influencing gene expression (known as expression quantitative trait loci or eQTLs) is important in unravelling the genetic basis of complex traits. Detecting multiple eQTLs simultaneously in a population based on paired DNA-seq and RNA-seq assays employs two competing types of models: models which rely on appropriate transformations of RNA-seq data (and are powered by a mature mathematical theory), or count-based models, which represent digital gene expression explicitly, thus rendering such transformations unnecessary. The latter constitutes an immensely popular methodology, which is however plagued by mathematical intractability. We develop tractable count-based models, which are amenable to efficient estimation through the introduction of latent variables and the appropriate application of recent statistical theory in a sparse Bayesian modelling framework. Furthermore, we examine several transformation methods for RNA-seq read counts and we introduce arcsin, logit and Laplace smoothing as preprocessing steps for transformation-based models. Using natural and carefully simulated data from the 1000 Genomes and gEUVADIS projects, we benchmark both approaches under a variety of scenarios, including the presence of noise and violation of basic model assumptions. We demonstrate that an arcsin transformation of Laplace-smoothed data is at least as good as state-of-the-art models, particularly at small samples. Furthermore, we show that an over-dispersed Poisson model is comparable to the celebrated Negative Binomial, but much easier to estimate. These results provide strong support for transformation-based versus count-based (particularly Negative-Binomial-based) models for eQTL mapping. All methods are implemented in the free software eQTLseq: https://github.com/dvav/eQTLseq
Supplementary data are available at Bioinformatics online.
SPRINT: an SNP-free toolkit for identifying RNA editing sites
RNA editing generates post-transcriptional sequence alterations. Detection of RNA editing sites (RESs) typically requires the filtering of SNVs called from RNA-seq data using an SNP database, an obstacle that is difficult to overcome for most organisms. Here, we present a novel method named SPRINT that identifies RESs without the need to filter out SNPs. SPRINT also integrates the detection of hyper RESs from remapped reads, and has been fully automated for any RNA-seq data with a reference genome sequence available. We have rigorously validated SPRINT’s effectiveness in detecting RESs using RNA-seq data of samples in which genes encoding RNA editing enzymes are knocked down or over-expressed, and have also demonstrated its superiority over current methods. We have applied SPRINT to investigate RNA editing across tissues and species, and also in the development of the mouse embryonic central nervous system. A web resource (http://sprint.tianlab.cn) of RESs identified by SPRINT has been constructed. The software and related data are available at http://sprint.tianlab.cn.
Supplementary data are available at Bioinformatics online.
pgRNAFinder: a web-based tool to design distance-independent paired gRNA
The CRISPR/Cas system has been shown to be an efficient and accurate genome-editing technique. There exist a number of tools to design the guide RNA sequences and predict potential off-target sites. However, most of the existing computational tools for gRNA design are restricted to small deletions. To address this issue, we present pgRNAFinder, with an easy-to-use web interface, which enables researchers to design single or distance-free paired-gRNA sequences. The web interface of pgRNAFinder contains both a gRNA search and a scoring system. After users input query sequences, it searches for gRNAs by their 3' protospacer-adjacent motif (PAM), identifies possible off-targets, and rapidly scores the conservation of the deleted sequences. Filters can be applied to identify high-quality CRISPR sites. pgRNAFinder offers gRNA design functionality for 8 vertebrate genomes. Furthermore, to keep pgRNAFinder open and extensible to any organism, we provide the source package for local use. pgRNAFinder is freely available at http://songyanglab.sysu.edu.cn/wangwebs/pgRNAFinder/, and the source code and user manual can be obtained from https://github.com/xiexiaowei/pgRNAFinder.
Supplementary data are available at Bioinformatics online.
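The PAM-based search step described above can be illustrated with a short scan for candidate sites. This is only a sketch and not the pgRNAFinder implementation: it assumes a 20-nt protospacer followed by an NGG PAM (the SpCas9 convention, which the abstract does not state), looks at the forward strand only, and omits off-target search and conservation scoring. The function name is hypothetical.

```python
import re

# A 20-nt protospacer immediately followed by a 3' NGG PAM.
# The lookahead lets overlapping candidate sites all be reported.
SITE = re.compile(r"(?=([ACGT]{20})([ACGT]GG))")


def find_grna_sites(sequence):
    """Return (start, protospacer, pam) for each forward-strand candidate."""
    seq = sequence.upper()
    return [(m.start(), m.group(1), m.group(2)) for m in SITE.finditer(seq)]


example = "A" * 20 + "TGG" + "CC"
for start, protospacer, pam in find_grna_sites(example):
    print(start, protospacer, pam)
```

A paired design for a deletion would then combine two such sites flanking the target region; because any two sites can be paired regardless of their separation, no distance filter is needed at this stage.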
Snaptron: querying splicing patterns across tens of thousands of RNA-seq samples
As more and larger genomics studies appear, there is a growing need for comprehensive and queryable cross-study summaries. These enable researchers to leverage vast datasets that would otherwise be difficult to obtain. Snaptron is a search engine for summarized RNA sequencing data with a query planner that leverages R-tree, B-tree and inverted indexing strategies to rapidly execute queries over 146 million exon-exon splice junctions from over 70 000 human RNA-seq samples. Queries can be tailored by constraining which junctions and samples to consider. Snaptron can score junctions according to tissue specificity or other criteria, and can score samples according to the relative frequency of different splicing patterns. We describe the software and outline biological questions that can be explored with Snaptron queries. Documentation is at http://snaptron.cs.jhu.edu. Source code is at https://github.com/ChristopherWilks/snaptron and https://github.com/ChristopherWilks/snaptron-experiments with a CC BY-NC 4.0 license.
Supplementary data are available at Bioinformatics online.
nala: text mining natural language mutation mentions
The extraction of sequence variants from the literature remains an important task. Existing methods primarily target standard (ST) mutation mentions (e.g. ‘E6V’), leaving relevant natural language (NL) mentions largely untapped (e.g. ‘glutamic acid was substituted by valine at residue 6’). We introduced three new corpora suggesting named-entity recognition (NER) to be more challenging than anticipated: 28–77% of all articles contained mentions only available in NL. Our new method nala captured NL and ST by combining conditional random fields with word embedding features learned unsupervised from the entire PubMed. In our hands, nala substantially outperformed the state-of-the-art. For instance, we compared all unique mentions in new discoveries correctly detected by any of three methods (SETH, tmVar, or nala). Neither SETH nor tmVar discovered anything missed by nala, while nala uniquely tagged 33% of mentions. For NL mentions the corresponding value shot up to 100% nala-only. Source code, API and corpora are freely available at: http://tagtog.net/-corpora/IDP4+.
Supplementary data are available at Bioinformatics online.
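The gap between ST and NL mentions can be seen with a simple pattern matcher. The sketch below is not nala (which combines conditional random fields with word embeddings) and is not how SETH or tmVar work; it is a hypothetical regular expression for single-letter protein substitutions of the form 'E6V', shown only to illustrate why pattern-style matching misses the NL phrasing.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Wild-type residue, position, mutant residue, e.g. "E6V".
ST_MENTION = re.compile(rf"\b([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])\b")


def find_st_mentions(text):
    """Return standard-format substitution mentions found in the text."""
    return [m.group(0) for m in ST_MENTION.finditer(text)]


print(find_st_mentions("The E6V variant causes sickle cell disease."))
# ['E6V']
print(find_st_mentions("glutamic acid was substituted by valine at residue 6"))
# []
```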
graphkernels: R and Python packages for graph comparison
Measuring the similarity of graphs is a fundamental step in the analysis of graph-structured data, which is omnipresent in computational biology. Graph kernels have been proposed as a powerful and efficient approach to this problem of graph comparison. Here we provide graphkernels, the first R and Python graph kernel libraries including baseline kernels such as label histogram based kernels, classic graph kernels such as random walk based kernels, and the state-of-the-art Weisfeiler-Lehman graph kernel. The core of all graph kernels is implemented in C++ for efficiency. Using the kernel matrices computed by the package, we can easily perform tasks such as classification, regression and clustering on graph-structured samples. The R and Python packages including source code are available at https://CRAN.R-project.org/package=graphkernels and https://pypi.python.org/pypi/graphkernels.
Supplementary data are available online at Bioinformatics.
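As an illustration of the simplest baseline mentioned above, a vertex label histogram kernel compares two graphs by the inner product of their label count vectors. The sketch below is plain Python and deliberately does not use the graphkernels package API; it assumes each graph is represented only by its list of vertex labels, and the function names are hypothetical.

```python
from collections import Counter


def vertex_histogram_kernel(labels_a, labels_b):
    """Inner product of the vertex label histograms of two graphs."""
    hist_a, hist_b = Counter(labels_a), Counter(labels_b)
    # Labels absent from hist_b contribute zero (Counter returns 0).
    return sum(count * hist_b[label] for label, count in hist_a.items())


def kernel_matrix(graphs):
    """Pairwise kernel values for a list of graphs (symmetric matrix)."""
    return [[vertex_histogram_kernel(g, h) for h in graphs] for g in graphs]


# Two toy graphs described only by their vertex labels.
g1 = ["C", "C", "O"]
g2 = ["C", "N"]
print(kernel_matrix([g1, g2]))  # [[5, 2], [2, 2]]
```

A matrix of this kind is what a kernel method such as a support vector machine with a precomputed kernel would consume for classification of graph-structured samples.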
TSIS: an R package to infer alternative splicing isoform switches for time series data
An alternative splicing isoform switch is where a pair of transcript isoforms reverse their relative expression abundances in response to external or internal stimuli. Although computational methods are available to study differential alternative splicing, few tools for detection of isoform switches exist and these are based on pairwise comparisons. Here, we provide the TSIS R package, which is the first tool for detecting significant transcript isoform switches in time-series data. The main steps of TSIS are to search for the isoform switch points in the time-series, characterize the switches and filter the results with user input parameters. All the functions are integrated into a Shiny App for ease of implementation of the analysis. The TSIS package is available on GitHub: https://github.com/wyguo/TSIS.
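The first step, searching for switch points, can be sketched as finding where the abundance curves of two isoforms cross. The example below is an assumption-laden illustration rather than the TSIS implementation: it uses piecewise-linear interpolation between consecutive time points, ignores replicates and significance testing, and the function name is hypothetical.

```python
def switch_points(times, iso1, iso2):
    """Estimate times at which two isoform abundance curves cross.

    A crossing is reported when the sign of (iso1 - iso2) changes between
    consecutive time points; its position is linearly interpolated.
    Points where the curves merely touch (difference exactly zero) are
    not reported by this simple sketch.
    """
    points = []
    for k in range(len(times) - 1):
        d0 = iso1[k] - iso2[k]
        d1 = iso1[k + 1] - iso2[k + 1]
        if d0 * d1 < 0:
            frac = d0 / (d0 - d1)  # fraction of the interval before crossing
            points.append(times[k] + frac * (times[k + 1] - times[k]))
    return points


times = [0, 1, 2, 3]
isoform_a = [1, 2, 4, 5]
isoform_b = [3, 3, 3, 3]
print(switch_points(times, isoform_a, isoform_b))  # [1.5]
```

Characterizing and filtering the switches, as the abstract describes, would then operate on these candidate points, for example by requiring a minimum abundance difference before and after each crossing.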
EDEN: evolutionary dynamics within environments
Unbiased Taxonomic Annotation of Metagenomic Samples
Metagenomics revolutionized the field of microbial ecology, giving access to Gb-sized datasets of microbial communities under natural conditions. This enables fine-grained analyses of the functions of community members, studies of their association with phenotypes and environments, as well as of their microevolution and adaptation to changing environmental conditions. However, phylogenetic methods for studying adaptation and evolutionary dynamics are not able to cope with big data. EDEN is the first software for the rapid detection of protein families and regions under positive selection, as well as their associated biological processes, from meta- and pangenome data. It provides an interactive result visualization for detailed comparative analyses. EDEN is available as a Docker installation under the GPL 3.0 license, allowing its use on common operating systems, at http://www.github.com/hzi-bifo/eden.
Supplementary data are available at Bioinformatics online.
The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node under the lowest common ancestor of the candidate sequences in the reference taxonomy with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this article, we show that the Rand index is a better indicator of classification error than the often used area under the receiver operating characteristic (ROC) curve and F-measure for both balanced and imbalanced reference taxonomies, and we also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time and an exact solution can be obtained by integer linear programming. Experimental results with a proof-of-concept implementation of the set cover approach to taxonomic annotation in a next release of the TANGO software show that the set cover approach further reduces ambiguity in the taxonomic annotation obtained with TANGO without distorting the relative abundance profile of the metagenomic sample.
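The logarithmic approximation for set cover is typically obtained with the greedy algorithm, which repeatedly picks the set covering the most still-uncovered elements. The sketch below shows that idea under stated assumptions: reads play the role of elements and candidate taxonomy nodes play the role of sets (a mapping assumed here for illustration), the names are hypothetical, and this straightforward version does not achieve the linear running time mentioned in the abstract, nor is it the TANGO implementation.

```python
def greedy_set_cover(universe, candidates):
    """Greedy approximation to minimum set cover.

    universe: iterable of elements to cover (e.g. read identifiers).
    candidates: dict mapping a name (e.g. a taxonomy node) to the set of
        elements it can cover.
    Returns the chosen names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("some elements cannot be covered")
        chosen.append(best)
        uncovered -= gained
    return chosen


reads = {"r1", "r2", "r3", "r4"}
nodes = {
    "taxonA": {"r1", "r2", "r3"},
    "taxonB": {"r3", "r4"},
    "taxonC": {"r4"},
}
print(greedy_set_cover(reads, nodes))  # ['taxonA', 'taxonB']
```

The exact alternative mentioned in the abstract would instead introduce one binary variable per candidate node and minimize their sum subject to every read being covered at least once.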
Abundance estimation and differential testing on strain level in metagenomics data
Current metagenomics approaches allow analyzing the composition of microbial communities at high resolution. Important changes to the composition are known to even occur on strain level and to go hand in hand with changes in disease or ecological state. However, specific challenges arise for strain level analysis due to highly similar genome sequences present. Only a limited number of tools approach taxa abundance estimation beyond species level and there is a strong need for dedicated tools for strain resolution and differential abundance testing. We present DiTASiC (Differential Taxa Abundance including Similarity Correction) as a novel approach for quantification and differential assessment of individual taxa in metagenomics samples. We introduce a generalized linear model for the resolution of shared read counts which cause a significant bias on strain level. Further, we capture abundance estimation uncertainties, which play a crucial role in differential abundance analysis. A novel statistical framework is built, which integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level. As a result, we obtain highly accurate abundance estimates down to sub-strain level and enable fine-grained resolution of strain clusters. We demonstrate the relevance of read ambiguity resolution and integration of abundance uncertainties for differential analysis. Accurate detections of even small changes are achieved and false-positives are significantly reduced. Superior performance is shown on latest benchmark sets of various complexities and in comparison to existing methods. DiTASiC code is freely available from https://rki_bioinformatics.gitlab.io/ditasic.
Supplementary data are available at Bioinformatics online.
Prediction and modeling of pre-analytical sampling errors as a strategy to improve plasma NMR metabolomics data
Biobanks are important infrastructures for life science research. Optimal sample handling regarding e.g. collection and processing of biological samples is highly complex, with many variables that could alter sample integrity and even more complex when considering multiple study centers or using legacy samples with limited documentation on sample management. Novel means to understand and take into account such variability would enable high-quality research on archived samples. This study investigated whether pre-analytical sample variability could be predicted and reduced by modeling alterations in the plasma metabolome, measured by NMR, as a function of pre-centrifugation conditions (1–36 h pre-centrifugation delay time at 4 °C and 22 °C) in 16 individuals. Pre-centrifugation temperature and delay times were predicted using random forest modeling and performance was validated on independent samples. Alterations in the metabolome were modeled at each temperature using a cluster-based approach, revealing reproducible effects of delay time on energy metabolism intermediates at both temperatures, but more pronounced at 22 °C. Moreover, pre-centrifugation delay at 4 °C resulted in large, specific variability at 3 h, predominantly of lipids. Pre-analytical sample handling error correction resulted in significant improvement of data quality, particularly at 22 °C. This approach offers the possibility to predict pre-centrifugation delay temperature and time in biobanked samples before use in costly downstream applications. Moreover, the results suggest potential to decrease the impact of undesired, delay-induced variability. However, these findings need to be validated in multiple, large sample sets and with analytical techniques covering a wider range of the metabolome, such as LC-MS. The sampleDrift R package is available at https://gitlab.com/CarlBrunius/sampleDrift.
Supplementary data are available at Bioinformatics online.
BIOSSES: a semantic sentence similarity estimation system for the biomedical domain
The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric. A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.
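The string-based similarity measures and the Pearson-correlation evaluation described above can be illustrated with a minimal sketch. Token-level Jaccard overlap is assumed here as a representative string measure; it is not necessarily one of the measures used in BIOSSES, and the function names are hypothetical.

```python
import math


def jaccard_similarity(sentence_a, sentence_b):
    """Token-level Jaccard similarity: |A & B| / |A | B| (assumed measure)."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pearson(xs, ys):
    """Pearson correlation between predicted scores and human annotations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

In an evaluation of this kind, each sentence pair would be scored by the similarity function and the resulting scores correlated with the mean human annotation for that pair.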
PatternLab for proteomics: a tool for differential shotgun proteomics
A goal of proteomics is to distinguish between states of a biological system by identifying protein expression differences. Liu et al. demonstrated a method to perform semi-relative protein quantitation in shotgun proteomics data by correlating the number of tandem mass spectra obtained for each protein, or "spectral count", with its abundance in a mixture; however, two issues have remained open: how to normalize spectral counting data and how to efficiently pinpoint differences between profiles. Moreover, Chen et al. recently showed how to increase the number of identified proteins in shotgun proteomics by analyzing samples with different MS-compatible detergents while performing proteolytic digestion. The latter introduced new challenges as seen from the data analysis perspective, since replicate readings are not acquired. To address the open issues above, we present a program termed PatternLab for proteomics. This program implements existing strategies and adds two new methods to pinpoint differences in protein profiles. The first method, ACFold, addresses experiments with less than three replicates from each state or having assays acquired by different protocols as described by Chen et al. ACFold uses a combined criterion based on expression fold changes, the AC test, and the false-discovery rate, and can supply a "bird's-eye view" of differentially expressed proteins. The other method addresses experimental designs having multiple readings from each state and is referred to as nSVM (natural support vector machine) because of its roots in evolutionary computing and in statistical learning theory. Our observations suggest that nSVM's niche comprises projects that select a minimum set of proteins for classification purposes; for example, the development of an early detection kit for a given pathology. We demonstrate the effectiveness of each method on experimental data and confront them with existing strategies. 
PatternLab offers easy, unified access to a variety of feature selection and normalization strategies, each having its own niche. Additionally, graphing tools are available to aid in the analysis of high-throughput experimental data. PatternLab is available at .
Phylogenetic convolutional neural networks in metagenomics
Convolutional Neural Networks can be effectively used only when data are endowed with an intrinsic concept of neighbourhood in the input space, as is the case of pixels in images. We introduce here Ph-CNN, a novel deep learning architecture for the classification of metagenomics data based on the Convolutional Neural Networks, with the patristic distance defined on the phylogenetic tree being used as the proximity measure. The patristic distance between variables is used together with a sparsified version of MultiDimensional Scaling to embed the phylogenetic tree in a Euclidean space. Ph-CNN is tested with a domain adaptation approach on synthetic data and on a metagenomics collection of gut microbiota of 38 healthy subjects and 222 Inflammatory Bowel Disease patients, divided in 6 subclasses. Classification performance is promising when compared to classical algorithms like Support Vector Machines and Random Forest and a baseline fully connected neural network, e.g. the Multi-Layer Perceptron. Ph-CNN represents a novel deep learning approach for the classification of metagenomics data. Operatively, the algorithm has been implemented as a custom Keras layer taking care of passing to the following convolutional layer not only the data but also the ranked list of neighbourhood of each sample, thus mimicking the case of image data, transparently to the user.
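The patristic distance used as the proximity measure is the sum of branch lengths along the path joining two leaves of the tree. A minimal sketch follows, assuming the tree is stored as a child-to-(parent, branch length) mapping; this representation and all names are hypothetical and are not taken from the Ph-CNN implementation.

```python
def distances_to_ancestors(node, parent):
    """Map the node and each of its ancestors to their distance from the node."""
    distances = {node: 0.0}
    total = 0.0
    while node in parent:
        node, length = parent[node]
        total += length
        distances[node] = total
    return distances


def patristic_distance(leaf_a, leaf_b, parent):
    """Sum of branch lengths on the path between two leaves."""
    up_a = distances_to_ancestors(leaf_a, parent)
    node, total = leaf_b, 0.0
    # Climb from leaf_b until the first ancestor shared with leaf_a.
    while node not in up_a:
        if node not in parent:
            raise ValueError("leaves do not share a root")
        node, length = parent[node]
        total += length
    return up_a[node] + total


def ranked_neighbours(leaf, leaves, parent, k):
    """The k leaves closest to `leaf`, nearest first (ties broken by name)."""
    others = [x for x in leaves if x != leaf]
    others.sort(key=lambda x: (patristic_distance(leaf, x, parent), x))
    return others[:k]


# Toy tree: ((A:1.0, B:2.0)n1:0.5, C:3.0)root
toy_parent = {
    "A": ("n1", 1.0),
    "B": ("n1", 2.0),
    "n1": ("root", 0.5),
    "C": ("root", 3.0),
}
```

A ranked neighbour list of this kind is what would be handed to the convolutional layer alongside the data; the embedding by sparsified MultiDimensional Scaling is not covered by this sketch.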
An information theoretic approach to the modeling and analysis of whole genome bisulfite sequencing data
DNA methylation is a stable form of epigenetic memory used by cells to control gene expression. Whole genome bisulfite sequencing (WGBS) has emerged as a gold-standard experimental technique for studying DNA methylation by producing high resolution genome-wide methylation profiles. Statistical modeling and analysis are employed to computationally extract and quantify information from these profiles in an effort to identify regions of the genome that demonstrate crucial or aberrant epigenetic behavior. However, the performance of most currently available methods for methylation analysis is hampered by their inability to directly account for statistical dependencies between neighboring methylation sites, thus ignoring significant information available in WGBS reads. We present a powerful information-theoretic approach for genome-wide modeling and analysis of WGBS data based on the 1D Ising model of statistical physics. This approach takes into account correlations in methylation by utilizing a joint probability model that encapsulates all information available in WGBS methylation reads and produces accurate results even when applied to single WGBS samples with low coverage. Using the Shannon entropy, our approach provides a rigorous quantification of methylation stochasticity in individual WGBS samples genome-wide. Furthermore, it utilizes the Jensen-Shannon distance to evaluate differences in methylation distributions between a test and a reference sample. Differential performance assessment using simulated and real human lung normal/cancer data demonstrates a clear superiority of our approach over DSS, a recently proposed method for WGBS data analysis. Critically, these results demonstrate that marginal methods become statistically invalid when correlations are present in the data.
This contribution demonstrates clear benefits and the necessity of modeling joint probability distributions of methylation using the 1D Ising model of statistical physics and of quantifying methylation stochasticity using concepts from information theory. By employing this methodology, substantial improvement of DNA methylation analysis can be achieved by effectively taking into account the massive amount of statistical information available in WGBS data, which is largely ignored by existing methods. The online version of this article (10.1186/s12859-018-2086-5) contains supplementary material, which is available to authorized users.
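The two information-theoretic quantities named above can be sketched for generic discrete distributions. This is only an illustration: it does not fit the 1D Ising model or derive the distributions from WGBS reads, and taking the Jensen-Shannon distance as the square root of the base-2 Jensen-Shannon divergence is an assumption about normalization.

```python
import math


def shannon_entropy(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence; lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    mixture = [(a + b) / 2 for a, b in zip(p, q)]
    divergence = shannon_entropy(mixture) - 0.5 * (
        shannon_entropy(p) + shannon_entropy(q)
    )
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(divergence, 0.0))
```

For example, `p` and `q` could be the distributions of methylation level within a genomic region in a test and a reference sample; identical distributions give a distance of 0 and distributions with disjoint support give 1.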
SpliceDetector: a software for detection of alternative splicing events in human and model organisms directly from transcript IDs
In eukaryotes, different combinations of exons give rise to multiple transcripts whose protein products have various functions, in a process called alternative splicing (AS). Unravelling the complexity of functional genomics through genome-wide profiling of AS and determining the altered end products provides new insights into many biological processes, disease progression, and drug development programs that target harmful splicing variants. Currently available AS tools work with raw data and are computationally intensive. In particular, there is a shortage of tools that discover AS events directly from transcripts. Here, we developed a user-friendly Windows-based tool for identifying AS events from transcripts without the need for advanced computer skills or database downloads. Because it works online, our application uses up-to-date SpliceGraphs without requiring any resource updates. First, a SpliceGraph is formed based on the frequency of active splice sites in pre-mRNA. Then, the presented approach compares query transcript exons to SpliceGraph exons. The tool supports statistical analysis of AS events as well as visualization of AS relative to the SpliceGraph. The developed application works for transcript sets in human and model organisms.
QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA seq data
Alternative polyadenylation (APA) affects most mammalian genes. The genome-wide investigation of APA has been hampered by an inability to reliably profile it using conventional RNA-seq. We describe ‘Quantification of APA’ (QAPA), a method that infers APA from conventional RNA-seq data. QAPA is faster and more sensitive than other methods. Application of QAPA reveals discrete, temporally coordinated APA programs during neurogenesis and that there is little overlap between genes regulated by alternative splicing and those by APA. Modeling of these data uncovers an APA sequence code. QAPA thus enables the discovery and characterization of programs of regulated APA using conventional RNA-seq. The online version of this article (10.1186/s13059-018-1414-4) contains supplementary material, which is available to authorized users.
Transcriptional program for nitrogen starvation induced lipid accumulation in Chlamydomonas reinhardtii
Algae accumulate lipids to endure different kinds of environmental stresses including macronutrient starvation. Although this response has been extensively studied, an in depth understanding of the transcriptional regulatory network (TRN) that controls the transition into lipid accumulation remains elusive. In this study, we used a systems biology approach to elucidate the transcriptional program that coordinates the nitrogen starvation-induced metabolic readjustments that drive lipid accumulation in Chlamydomonas reinhardtii. We demonstrate that nitrogen starvation triggered differential regulation of 2147 transcripts, which were co-regulated in 215 distinct modules and temporally ordered as 31 transcriptional waves. An early-stage response was triggered within 12 min that initiated growth arrest through activation of key signaling pathways, while simultaneously preparing the intracellular environment for later stages by modulating transport processes and ubiquitin-mediated protein degradation. Subsequently, central metabolism and carbon fixation were remodeled to trigger the accumulation of triacylglycerols. Further analysis revealed that these waves of genome-wide transcriptional events were coordinated by a regulatory program orchestrated by at least 17 transcriptional regulators, many of which had not been previously implicated in this process. We demonstrate that the TRN coordinates transcriptional downregulation of 57 metabolic enzymes across a period of nearly 4 h to drive an increase in lipid content per unit biomass. Notably, this TRN appears to also drive lipid accumulation during sulfur starvation, while phosphorus starvation induces a different regulatory program. The TRN model described here is available as a community-wide web-resource at http://networks.systemsbiology.net/chlamy-portal. In this work, we have uncovered a comprehensive mechanistic model of the TRN controlling the transition from N starvation to lipid accumulation. 
The program coordinates sequentially ordered transcriptional waves that simultaneously arrest growth and lead to lipid accumulation. This study has generated predictive tools that will aid in devising strategies for the rational manipulation of regulatory and metabolic networks for better biofuel and biomass production. The online version of this article (doi:10.1186/s13068-015-0391-z) contains supplementary material, which is available to authorized users.
Accurate and Robust Prediction of Genetic Relationship from Whole Genome Sequences
Computing the genetic relationship between two humans is important to studies in genetics, genomics, genealogy, and forensics. Relationship algorithms may be sensitive to noise, such as that arising from sequencing errors or imperfect reference genomes. We developed an algorithm for estimation of genetic relationship by averaged blocks (GRAB) that is designed for whole-genome sequencing (WGS) data. GRAB segments the genome into blocks, calculates the fraction of blocks sharing identity, and then uses a classification tree to infer 1st- to 5th- degree relationships and unrelated individuals. We evaluated GRAB on simulated and real sequenced families, and compared it with other software. GRAB achieves similar performance, and does not require knowledge of population background or phasing. GRAB can be used in workflows for identifying unreported relationships, validating reported relationships in family-based studies, and detection of sample-tracking errors or duplicate inclusion. The software is available at familygenomics.systemsbiology.net/grab.
FEELnc: a tool for long non coding RNA annotation and its application to the dog transcriptome
Whole transcriptome sequencing (RNA-seq) has become a standard for cataloguing and monitoring RNA populations. One of the main bottlenecks, however, is to correctly identify the different classes of RNAs among the plethora of reconstructed transcripts, particularly those that will be translated (mRNAs) from the class of long non-coding RNAs (lncRNAs). Here, we present FEELnc (FlExible Extraction of LncRNAs), an alignment-free program that accurately annotates lncRNAs based on a Random Forest model trained with general features such as multi k-mer frequencies and relaxed open reading frames. Benchmarking versus five state-of-the-art tools shows that FEELnc achieves similar or better classification performance on GENCODE and NONCODE data sets. The program also provides specific modules that enable the user to fine-tune classification accuracy, to formalize the annotation of lncRNA classes and to identify lncRNAs even in the absence of a training set of non-coding RNAs. We used FEELnc on a real data set comprising 20 canine RNA-seq samples produced by the European LUPA consortium to substantially expand the canine genome annotation to include 10 374 novel lncRNAs and 58 640 mRNA transcripts. FEELnc moves beyond conventional coding potential classifiers by providing a standardized and complete solution for annotating lncRNAs and is freely available at https://github.com/tderrien/FEELnc.
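The multi k-mer frequency features mentioned above can be sketched as follows. This is a minimal illustration with hypothetical function names; the choice of k values, the handling of ambiguous bases, and the normalization are assumptions and may differ from what FEELnc does.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Relative frequency of every DNA k-mer over the alphabet A, C, G, T."""
    sequence = sequence.upper()
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # k-mers containing other symbols (e.g. N) are left out of the total.
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


def multi_kmer_features(sequence, ks=(1, 2, 3)):
    """Concatenate k-mer frequency profiles for several values of k."""
    features = {}
    for k in ks:
        features.update(kmer_frequencies(sequence, k))
    return features
```

A feature dictionary of this kind, computed per transcript, is the sort of input a Random Forest classifier could be trained on to separate coding from non-coding transcripts.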
LncFunNet: an integrated computational framework for identification of functional long noncoding RNAs in mouse skeletal muscle cells
Long noncoding RNAs (lncRNAs) are key regulators of diverse cellular processes. Recent advances in high-throughput sequencing have allowed for an unprecedented discovery of novel lncRNAs. Identifying functional lncRNAs among thousands of candidates for further functional validation remains a challenging task. Here, we present a novel computational framework, lncFunNet (lncRNA Functional inference through integrated Network) that integrates ChIP-seq, CLIP-seq and RNA-seq data to predict, prioritize and annotate lncRNA functions. In mouse embryonic stem cells (mESCs), using lncFunNet we not only recovered most of the functional lncRNAs known to maintain mESC pluripotency but also predicted a plethora of novel functional lncRNAs. Similarly, in mouse myoblast C2C12 cells, applying lncFunNet led to prediction of reservoirs of functional lncRNAs in both proliferating myoblasts (MBs) and differentiating myotubes (MTs). Further analyses demonstrated that these lncRNAs are frequently bound by key transcription factors, interact with miRNAs and constitute key nodes in biological network motifs. Further experiments validated their dynamic expression profiles and functionality during myoblast differentiation. Collectively, our studies demonstrate the use of lncFunNet to annotate and identify functional lncRNAs in a given biological system.
LEON BIS: multiple alignment evaluation of sequence neighbours using a Bayesian inference system
A standard procedure in many areas of bioinformatics is to use a multiple sequence alignment (MSA) as the basis for various types of homology-based inference. Applications include 3D structure modelling, protein functional annotation, prediction of molecular interactions, etc. These applications, however sophisticated, are generally highly sensitive to the alignment used, and neglecting non-homologous or uncertain regions in the alignment can lead to significant bias in the subsequent inferences. Here, we present a new method, LEON-BIS, which uses a robust Bayesian framework to estimate the homologous relations between sequences in a protein multiple alignment. Sequences are clustered into sub-families and relations are predicted at different levels, including ‘core blocks’, ‘regions’ and full-length proteins. The accuracy and reliability of the predictions are demonstrated in large-scale comparisons using well annotated alignment databases, where the homologous sequence segments are detected with very high sensitivity and specificity. LEON-BIS uses robust Bayesian statistics to distinguish the portions of multiple sequence alignments that are conserved either across the whole family or within subfamilies. LEON-BIS should thus be useful for automatic, high-throughput genome annotations, 2D/3D structure predictions, protein-protein interaction predictions etc.
Domain analysis of symbionts and hosts (DASH) in a genome wide survey of pathogenic human viruses
In the coevolution of viruses and their hosts, viruses often capture host genes, gaining advantageous functions (e.g. immune system control). Identifying functional similarities shared by viruses and their hosts can help decipher mechanisms of pathogenesis and accelerate virus-targeted drug and vaccine development. Cellular homologs in viruses are usually documented using pairwise-sequence comparison methods. Yet, pairwise-sequence searches have limited sensitivity resulting in poor identification of divergent homologies. Methods based on profiles from multiple sequences provide a more sensitive alternative to identify similarities in host-pathogen systems. The present work describes a profile-based bioinformatics pipeline that we call the Domain Analysis of Symbionts and Hosts (DASH). DASH provides a web platform for the functional analysis of viral and host genomes. This study uses Human Herpesvirus 8 (HHV-8) as a model to validate the methodology. Our results indicate that HHV-8 shares at least 29% of its genes with humans (fourteen immunomodulatory and ten metabolic genes). DASH also suggests functions for fifty-one additional HHV-8 structural and metabolic proteins. We also perform two other comparative genomics studies of human viruses: (1) a broad survey of eleven viruses of disparate sizes and transcription strategies; and (2) a closer examination of forty-one viruses of the order Mononegavirales. In the survey, DASH detects human homologs in 4/5 DNA viruses. None of the non-retro-transcribing RNA viruses in the survey showed evidence of homology to humans. The order Mononegavirales are also non-retro-transcribing RNA viruses, however, and DASH found homology in 39/41 of them. Mononegaviruses display larger fractions of human similarities (up to 75%) than any of the other RNA or DNA viruses (up to 55% and 29% respectively). 
We conclude that gene sharing probably occurs between humans and both DNA and RNA viruses, in viral genomes of differing sizes, regardless of transcription strategies. Our method (DASH) simultaneously analyzes the genomes of two interacting species thereby mining functional information to identify shared as well as exclusive domains to each organism. Our results validate our approach, showing that DASH has potential as a pipeline for making therapeutic discoveries in other host-symbiont systems. DASH results are available at http://tinyurl.com/spouge-dash.
CUFID query: accurate network querying through random walk based network flow estimation
Functional modules in biological networks consist of numerous biomolecules and their complicated interactions. Recent studies have shown that biomolecules in a functional module tend to have similar interaction patterns and that such modules are often conserved across biological networks of different species. As a result, such conserved functional modules can be identified through comparative analysis of biological networks. In this work, we propose a novel network querying algorithm based on the CUFID (Comparative network analysis Using the steady-state network Flow to IDentify orthologous proteins) framework combined with an efficient seed-and-extension approach. The proposed algorithm, CUFID-query, can accurately detect conserved functional modules as small subnetworks in the target network that are expected to perform similar functions to the given query functional module. The CUFID framework was recently developed for probabilistic pairwise global comparison of biological networks, and it has been applied to pairwise global network alignment, where the framework was shown to yield accurate network alignment results. In the proposed CUFID-query algorithm, we adopt the CUFID framework and extend it for local network alignment, specifically to solve network querying problems. First, in the seed selection phase, the proposed method utilizes the CUFID framework to compare the query and the target networks and to predict the probabilistic node-to-node correspondence between the networks. Next, the algorithm selects and greedily extends the seed in the target network by iteratively adding nodes that have frequent interactions with other nodes in the seed network, in a way that the conductance of the extended network is maximally reduced. Finally, CUFID-query removes irrelevant nodes from the querying results based on the personalized PageRank vector for the induced network that includes the fully extended network and its neighboring nodes. 
Through extensive performance evaluation based on biological networks with known functional modules, we show that CUFID-query outperforms the existing state-of-the-art algorithms in terms of prediction accuracy and biological significance of the predictions.
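The conductance-driven seed extension step can be sketched on an unweighted graph. This is a simplified illustration under stated assumptions: the graph is an adjacency dictionary, the seed is given rather than chosen from CUFID node correspondence scores, the stopping rule is a guess, and the personalized PageRank pruning is omitted. All names are hypothetical.

```python
def conductance(graph, nodes):
    """Edges leaving `nodes` divided by the smaller of the two side volumes."""
    nodes = set(nodes)
    cut = sum(1 for u in nodes for v in graph[u] if v not in nodes)
    vol_in = sum(len(graph[u]) for u in nodes)
    vol_out = sum(len(graph[u]) for u in graph if u not in nodes)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 1.0


def greedy_extend(graph, seed, max_size):
    """Add the neighbour that lowers conductance most until no node helps."""
    module = set(seed)
    while len(module) < max_size:
        frontier = {v for u in module for v in graph[u]} - module
        if not frontier:
            break
        # sorted() makes tie-breaking deterministic.
        best = min(sorted(frontier), key=lambda v: conductance(graph, module | {v}))
        if conductance(graph, module | {best}) >= conductance(graph, module):
            break  # assumed stopping rule: no candidate reduces conductance
        module.add(best)
    return module


# Toy network: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
toy_graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
```

Starting from the seed `{"a"}`, the extension on this toy network stops at the first triangle, because adding the bridge node `d` would raise the conductance from 1/7 to 1/2.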
Phaser crystallographic software
A description is given of Phaser-2.1: software for phasing macromolecular crystal structures by molecular replacement and single-wavelength anomalous dispersion phasing.
Phaser is a program for phasing macromolecular crystal structures by both molecular replacement and experimental phasing methods. The novel phasing algorithms implemented in Phaser have been developed using maximum likelihood and multivariate statistics. For molecular replacement, the new algorithms have proved to be significantly better than traditional methods in discriminating correct solutions from noise, and for single-wavelength anomalous dispersion experimental phasing, the new algorithms, which account for correlations between F+ and F−, give better phases (lower mean phase error with respect to the phases given by the refined structure) than those that use mean F and anomalous differences ΔF. One of the design concepts of Phaser was that it be capable of a high degree of automation. To this end, Phaser (written in C++) can be called directly from Python, although it can also be called using traditional CCP4 keyword-style input. Phaser is a platform for future development of improved phasing methods and their release, including source code, to the crystallographic community.
High Resolution Epigenomic Atlas of Human Embryonic Craniofacial Development
Defects in patterning during human embryonic development frequently result in craniofacial abnormalities. The gene regulatory programs that build the craniofacial complex are likely controlled by information located between genes and within intronic sequences. However, systematic identification of regulatory sequences important for forming the human face has not been performed. Here, we describe comprehensive epigenomic annotations from human embryonic craniofacial tissues and systematic comparisons with multiple tissues and cell types. We identified thousands of tissue-specific craniofacial regulatory sequences and likely causal regions for rare craniofacial abnormalities. We demonstrate significant enrichment of common variants associated with orofacial clefting in enhancers active early in embryonic development, while those associated with normal facial variation are enriched near the end of the embryonic period. These data are provided in easily accessible formats for both craniofacial researchers and clinicians to aid future experimental design and interpretation of noncoding variation in those affected by craniofacial abnormalities. Wilderman et al. report the global identification of gene regulatory sequences active in early human craniofacial development. Systematic comparisons with over 120 different human tissues and cell types reveal shared and craniofacial-specific enhancers. Craniofacial enhancers are enriched with genetic associations for both orofacial clefting risk and face shape.
Bioinformatic analysis of genotype by sequencing (GBS) data with NGSEP
The recent development and availability of different genotype by sequencing (GBS) protocols provided a cost-effective approach to perform high-resolution genomic analysis of entire populations in different species. The central component of all these protocols is the digestion of the initial DNA with known restriction enzymes, to generate sequencing fragments at predictable and reproducible sites. This allows genotyping of thousands of genetic markers on populations with hundreds of individuals. Because GBS protocols achieve parallel genotyping through high throughput sequencing (HTS), every GBS protocol must include a bioinformatics pipeline for analysis of HTS data. Our bioinformatics group recently developed the Next Generation Sequencing Eclipse Plugin (NGSEP) for accurate, efficient, and user-friendly analysis of HTS data. Here we present the latest functionalities implemented in NGSEP in the context of the analysis of GBS data. We implemented a one step wizard to perform parallel read alignment, variant identification and genotyping from HTS reads sequenced from entire populations. We added different filters for variants, samples and genotype calls as well as calculation of summary statistics overall and per sample, and diversity statistics per site. NGSEP includes a module to translate genotype calls to some of the most widely used input formats for integration with several tools to perform downstream analyses such as population structure analysis, construction of genetic maps, genetic mapping of complex traits and phenotype prediction for genomic selection. We assessed the accuracy of NGSEP on two highly heterozygous F1 cassava populations and on an inbred common bean population, and we showed that NGSEP provides similar or better accuracy compared to other widely used software packages for variant detection such as GATK, Samtools and Tassel.
NGSEP is a powerful, accurate and efficient bioinformatics software tool for analysis of HTS data, and also one of the best bioinformatic packages to facilitate the analysis and to maximize the genomic variability information that can be obtained from GBS experiments for population genomics. The online version of this article (doi:10.1186/s12864-016-2827-7) contains supplementary material, which is available to authorized users.
Transmembrane protein topology prediction using support vector machines
Alpha-helical transmembrane (TM) proteins are involved in a wide range of important biological processes such as cell signaling, transport of membrane-impermeable molecules, cell-cell communication, cell recognition and cell adhesion. Many are also prime drug targets, and it has been estimated that more than half of all drugs currently on the market target membrane proteins. However, due to the experimental difficulties involved in obtaining high quality crystals, this class of protein is severely under-represented in structural databases. In the absence of structural data, sequence-based prediction methods allow TM protein topology to be investigated. We present a support vector machine-based (SVM) TM protein topology predictor that integrates both signal peptide and re-entrant helix prediction, benchmarked with full cross-validation on a novel data set of 131 sequences with known crystal structures. The method achieves topology prediction accuracy of 89%, while signal peptides and re-entrant helices are predicted with 93% and 44% accuracy respectively. An additional SVM trained to discriminate between globular and TM proteins detected zero false positives, with a low false negative rate of 0.4%. We present the results of applying these tools to a number of complete genomes. Source code, data sets and a web server are freely available from . The high accuracy of TM topology prediction which includes detection of both signal peptides and re-entrant helices, combined with the ability to effectively discriminate between TM and globular proteins, make this method ideally suited to whole genome annotation of alpha-helical transmembrane proteins.
Digital DNA-DNA hybridization for microbial species delineation by means of genome-to-genome sequence comparison
The pragmatic species concept for Bacteria and Archaea is ultimately based on DNA-DNA hybridization (DDH). While enabling the taxonomist, in principle, to obtain an estimate of the overall similarity between the genomes of two strains, this technique is tedious and error-prone and cannot be used to incrementally build up a comparative database. Recent technological progress in the area of genome sequencing calls for bioinformatics methods to replace the wet-lab DDH by in-silico genome-to-genome comparison. Here we investigate state-of-the-art methods for inferring whole-genome distances in their ability to mimic DDH. Algorithms to efficiently determine high-scoring segment pairs or maximally unique matches perform well as a basis of inferring intergenomic distances. The examined distance functions, which are able to cope with heavily reduced genomes and repetitive sequence regions, outperform previously described ones regarding the correlation with and error ratios in emulating DDH. Simulation of incompletely sequenced genomes indicates that some distance formulas are very robust against missing fractions of genomic information. Digitally derived genome-to-genome distances show a better correlation with 16S rRNA gene sequence distances than DDH values. The future perspectives of genome-informed taxonomy are discussed, and the investigated methods are made available as a web service for genome-based species delineation.
Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico chemical and structural features into Chou’s general PseAAC
Antimicrobial peptides (AMPs) are important components of the innate immune system that have been found to be effective against disease-causing pathogens. Identifying AMPs through wet-lab experiments is expensive. Therefore, developing efficient computational tools is essential to identify the best candidate AMPs prior to in vitro experimentation. In this study, we attempted to develop a support vector machine (SVM) based computational approach for predicting AMPs with improved accuracy. Initially, compositional, physico-chemical and structural features of the peptides were generated and subsequently used as input to the SVM for prediction of AMPs. The proposed approach achieved higher accuracy than several existing approaches when compared on a benchmark dataset. Based on the proposed approach, an online prediction server, iAMPpred, has also been developed to help the scientific community predict AMPs; it is freely accessible at http://cabgrid.res.in:8080/amppred/. The proposed approach is believed to supplement the tools and techniques that have been developed in the past for prediction of AMPs.
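As an illustration of what a compositional feature can look like, the sketch below computes a plain amino acid composition vector for a peptide. This is an assumption for illustration only: the abstract does not list the exact features used by iAMPpred, and the function name `amino_acid_composition` is hypothetical. A vector like this is the kind of fixed-length numeric input an SVM requires.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, fixed order


def amino_acid_composition(peptide: str) -> List[float]:
    """Return the fraction of each standard residue in the peptide."""
    counts = Counter(peptide.upper())
    # Only standard residues contribute to the denominator.
    n = sum(counts[a] for a in AMINO_ACIDS)
    if n == 0:
        raise ValueError("peptide contains no standard residues")
    return [counts[a] / n for a in AMINO_ACIDS]
```

Because every peptide maps to a vector of the same length (20), peptides of different lengths can be compared by the same classifier.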
“Plug and play” investigation of the human phosphoproteome by targeted high resolution mass spectrometry
Systematic approaches to study cellular signaling require new phosphoproteomic techniques that reproducibly measure the same phosphopeptides across multiple replicates, conditions, and time points. Here we present a method to mine information from large-scale, heterogeneous phosphoproteomics datasets to rapidly generate robust targeted assays. We demonstrate the performance of our method by interrogating the IGF-1/AKT signaling pathway; and show that even rarely observed phosphorylation events can be consistently detected and precisely quantified.
CNV discovery for milk composition traits in dairy cattle using whole genome resequencing
Copy number variations (CNVs) are important and widely distributed in the genome. CNV detection opens a new avenue for exploring genes associated with complex traits in humans, animals and plants. Herein, we present a genome-wide assessment of CNVs that are potentially associated with milk composition traits in dairy cattle. In this study, CNVs were detected based on whole genome re-sequencing data of eight Holstein bulls from four half- and/or full-sib families, with extremely high and low estimated breeding values (EBVs) of milk protein percentage and fat percentage. The range of coverage depth per individual was 8.2–11.9×. Using CNVnator, we identified a total of 14,821 CNVs, including 5025 duplications and 9796 deletions. Among them, 487 differential CNV regions (CNVRs) comprising ~8.23 Mb of the cattle genome were observed between the high and low groups. Annotation of these differential CNVRs was performed based on the cattle genome reference assembly (UMD3.1), and a total of 235 functional genes were found within the CNVRs. By Gene Ontology and KEGG pathway analyses, we found that genes were significantly enriched for specific biological functions related to protein and lipid metabolism, insulin/IGF pathway-protein kinase B signaling cascade, prolactin signaling pathway and AMPK signaling pathways. These genes included INS, IGF2, FOXO3, TH, SCD5, GALNT18, GALNT16, ART3, SNCA and WNT7A, implying their potential association with milk protein and fat traits. In addition, 95 CNVRs overlapped with 75 known QTLs that are associated with milk protein and fat traits of dairy cattle (Cattle QTLdb). In conclusion, based on NGS of 8 Holstein bulls with extremely high and low EBVs for milk PP and FP, we identified a total of 14,821 CNVs, 487 differential CNVRs between groups, and 10 genes, which were suggested as promising candidate genes for milk protein and fat traits.
The online version of this article (doi:10.1186/s12864-017-3636-3) contains supplementary material, which is available to authorized users.
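The overlap between CNVRs and known QTLs reported above amounts to an interval-intersection test on genomic coordinates. The sketch below shows that test under stated assumptions: the `Region` type, the 1-based inclusive coordinate convention, and the function names are hypothetical, and no coordinates from the study are used. It also checks that the reported duplication and deletion counts add up to the reported CNV total.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical genomic interval, 1-based and inclusive at both ends."""
    chrom: str
    start: int
    end: int


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions lie on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def regions_overlapping(cnvrs: List[Region], qtls: List[Region]) -> List[Region]:
    """Return the CNVRs that overlap at least one QTL interval."""
    return [c for c in cnvrs if any(overlaps(c, q) for q in qtls)]


# Consistency check on the counts quoted in the abstract.
duplications, deletions, total_cnvs = 5025, 9796, 14821
assert duplications + deletions == total_cnvs
```

The nested scan is quadratic in the number of regions; for a few hundred CNVRs and QTLs that is adequate, while larger sets would usually be handled with sorted intervals or an interval tree.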
First‐generation HapMap in Cajanus spp. reveals untapped variations in parental lines of mapping populations
Whole genome re‐sequencing (WGRS) was conducted on a panel of 20 Cajanus spp. accessions (crossing parents of recombinant inbred lines, introgression lines, multiparent advanced generation intercross and nested association mapping populations) comprising two wild species accessions and 18 cultivated species accessions. A total of 791.77 million paired‐end reads were generated with an effective mapping depth of ~12X per accession. Analysis of WGRS data provided 5 465 676 genome‐wide variations including 4 686 422 SNPs and 779 254 InDels across the accessions. Large structural variations in the form of copy number variations (2598) and presence and absence variations (970) were also identified. Additionally, 2 630 904 accession‐specific variations comprising 2 278 571 SNPs (86.6%), 166 243 deletions (6.3%) and 186 090 insertions (7.1%) were reported. The polymorphic sites identified in this study provide the first‐generation HapMap in Cajanus spp., which will be useful in mapping the genomic regions responsible for important traits.
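The variant tallies quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the abstract; the variable names are illustrative.

```python
# Genome-wide variations: SNPs plus InDels should equal the stated total.
snps, indels, total_variations = 4_686_422, 779_254, 5_465_676
assert snps + indels == total_variations

# Accession-specific variations and their shares of the stated total.
specific = {"SNPs": 2_278_571, "deletions": 166_243, "insertions": 186_090}
specific_total = 2_630_904
assert sum(specific.values()) == specific_total

shares = {kind: round(100 * n / specific_total, 1) for kind, n in specific.items()}
print(shares)  # {'SNPs': 86.6, 'deletions': 6.3, 'insertions': 7.1}
```

Both sums match the reported totals, and the rounded shares agree with the percentages given in the text.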
Characterization of Three New Insect Specific Flaviviruses: Their Relationship to the Mosquito Borne Flavivirus Pathogens
Three novel insect-specific flaviviruses, isolated from mosquitoes collected in Peru, Malaysia (Sarawak), and the United States, are characterized. The new viruses, designated La Tina, Kampung Karu, and Long Pine Key, respectively, are antigenically and phylogenetically more similar to the mosquito-borne flavivirus pathogens, than to the classical insect-specific viruses like cell fusing agent and Culex flavivirus. The potential implications of this relationship and the possible uses of these and other arbovirus-related insect-specific flaviviruses are reviewed.
Development of a comparative genomic fingerprinting assay for rapid and high resolution genotyping of Arcobacter butzleri
Molecular typing methods are critical for epidemiological investigations, facilitating disease outbreak detection and source identification. Study of the epidemiology of the emerging human pathogen Arcobacter butzleri is currently hampered by the lack of a subtyping method that is easily deployable in the context of routine epidemiological surveillance. In this study we describe a comparative genomic fingerprinting (CGF) method for high-resolution and high-throughput subtyping of A. butzleri. Comparative analysis of the genome sequences of eleven A. butzleri strains, including eight strains newly sequenced as part of this project, was employed to identify accessory genes suitable for generating unique genetic fingerprints for high-resolution subtyping based on gene presence or absence within a strain. A set of eighty-three accessory genes was used to examine the population structure of a dataset comprised of isolates from various sources, including human and non-human animals, sewage, and river water (n=156). A streamlined assay (CGF40) based on a subset of 40 genes was subsequently developed through marker optimization. High levels of profile diversity (121 distinct profiles) were observed among the 156 isolates in the dataset, and a high Simpson’s Index of Diversity (ID) observed (ID > 0.969) indicate that the CGF40 assay possesses high discriminatory power. At the same time, our observation that 115 isolates in this dataset could be assigned to 29 clades with a profile similarity of 90% or greater indicates that the method can be used to identify clades comprised of genetically similar isolates. The CGF40 assay described herein combines high resolution and repeatability with high throughput for the rapid characterization of A. butzleri strains. This assay will facilitate the study of the population structure and epidemiology of A. butzleri. 
The online version of this article (doi:10.1186/s12866-015-0426-4) contains supplementary material, which is available to authorized users.
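The discriminatory power quoted above is expressed as Simpson's Index of Diversity. The abstract does not state which form of the index was used; the sketch below assumes the Hunter and Gaston form commonly applied to typing methods, ID = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where n_j is the number of isolates sharing profile j and N is the total number of isolates. The function name and the toy input are illustrative, not data from the study.

```python
from collections import Counter
from typing import Hashable, Iterable


def simpsons_index_of_diversity(profiles: Iterable[Hashable]) -> float:
    """Probability that two isolates drawn without replacement differ in profile."""
    counts = Counter(profiles)
    n = sum(counts.values())
    if n < 2:
        raise ValueError("need at least two isolates")
    # Number of ordered pairs of isolates that share the same profile.
    same = sum(c * (c - 1) for c in counts.values())
    return 1.0 - same / (n * (n - 1))


# Toy example: four isolates, two of which share a fingerprint.
toy_profiles = ["0110", "0110", "1010", "1111"]
toy_id = simpsons_index_of_diversity(toy_profiles)
```

A value near 1 means that two randomly chosen isolates almost always receive different fingerprints, which is what a high discriminatory power refers to.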
Draft Genome Sequence of Sphingobium quisquiliarum Strain P25T, a Novel Hexachlorocyclohexane (HCH) Degrading Bacterium Isolated from an HCH Dumpsite
Here, we report the draft genome sequence (4.2 Mb) of Sphingobium quisquiliarum strain P25T, a natural lin (genes involved in degradation of hexachlorocyclohexane [HCH] isomers) variant genotype, isolated from a heavily contaminated (450 mg HCH/g of soil) HCH dumpsite.
Cell Cycle Control of Bivalent Epigenetic Domains Regulates the Exit from Pluripotency
Here we show that bivalent domains and chromosome architecture for bivalent genes are dynamically regulated during the cell cycle in human pluripotent cells. Central to this is the transient increase in H3K4-trimethylation at developmental genes during G1, thereby creating a “window of opportunity” for cell-fate specification. This mechanism is controlled by CDK2-dependent phosphorylation of the MLL2 (KMT2B) histone methyl-transferase, which facilitates its recruitment to developmental genes in G1. MLL2 binding is required for changes in chromosome architecture around developmental genes and establishes promoter-enhancer looping interactions in a cell-cycle-dependent manner. These cell-cycle-regulated loops are shown to be essential for activation of bivalent genes and pluripotency exit. These findings demonstrate that bivalent domains are established to control the cell-cycle-dependent activation of developmental genes so that differentiation initiates from the G1 phase.
Bivalent domains are unstable, dynamic, and cell-cycle regulated
CDK2 phosphorylates MLL2 and establishes bivalent domains in G1
Chromosome remodeling in G1 is required for the “poised” pluripotent state
In this report, Dalton and colleagues show that developmental genes are primed for activation in G1 phase of the cell cycle by a mechanism requiring convergence of the cell-cycle machinery with cell signaling pathways. This priming mechanism involves the establishment of bivalent epigenetic domains and dynamic changes in chromosome architecture around developmental genes.
First Draft Genome Sequences of Two Bartonella tribocorum Strains from Laos and Cambodia
Bartonella tribocorum is a Gram-negative bacterium known to infect animals, and rodents in particular, throughout the world. In this report, we present the draft genome sequences of two strains of B. tribocorum isolated from the blood of a rodent in Laos and a shrew in Cambodia.
Exome sequencing in undiagnosed inherited and sporadic ataxias
Inherited ataxias are difficult to diagnose genetically. Pyle et al. use whole-exome sequencing to provide a likely molecular diagnosis in 14 of 22 families with ataxia. The approach reveals de novo mutations, broadens the phenotype of other disease genes, and is equally effective in young and older-onset patients. Inherited ataxias are clinically and genetically heterogeneous, and a molecular diagnosis is not possible in most patients. Having excluded common sporadic, inherited and metabolic causes, we used an unbiased whole exome sequencing approach in 35 affected individuals, from 22 randomly selected families of white European descent. We defined the likely molecular diagnosis in 14 of 22 families (64%). This revealed de novo dominant mutations, validated disease genes previously described in isolated families, and broadened the clinical phenotype of known disease genes. The diagnostic yield was the same in both young and older-onset patients, including sporadic cases. We have demonstrated the impact of exome sequencing in a group of patients notoriously difficult to diagnose genetically. This has important implications for genetic counselling and diagnostic service provision.
Genotype and clinical course in 2 Chinese Han siblings with Wilson disease presenting with isolated disabling premature osteoarthritis
Premature osteoarthritis (POA) is a rare condition in Wilson disease (WD). In particular, when POA is the only complaint of a WD patient for a long time, misdiagnosis or missed diagnosis can occur, delaying treatment. Two Chinese Han siblings were diagnosed with WD on the basis of corneal K-F rings, laboratory tests, and mutation analysis. They presented with isolated POA during the first 2 decades or more of their disease course, and the diagnosis was missed throughout that time. The older affected sibling became disabled by severe osteoarthritis when he was as young as 38 years old. Two compound heterozygous pathogenic variants, c.2790_2792del and c.2621C>T, were revealed in the ATP7B gene through targeted next-generation sequencing (NGS). Adolescent-onset POA could be the only complaint of an individual with WD for at least 2 decades. A long delay in treating POA in WD could lead to disability in early adulthood. Detailed physical examination, special biochemical tests, and genotyping through targeted NGS should greatly reduce diagnostic delay in atypical WD patients with an isolated POA phenotype.
Whole Exome Sequencing Reveals Homozygous Mutations in RAI1, OTOF, and SLC26A4 Genes Associated with Nonsyndromic Hearing Loss in Altaian Families (South Siberia)
Hearing loss (HL) is one of the most common sensorineural disorders and several dozen genes contribute to its pathogenesis. Establishing a genetic diagnosis of HL is of great importance for clinical evaluation of deaf patients and for estimating recurrence risks for their families. Efforts to identify genes responsible for HL have been challenged by high genetic heterogeneity and different ethnic-specific prevalence of inherited deafness. Here we present the utility of whole exome sequencing (WES) for identifying candidate causal variants for previously unexplained nonsyndromic HL of seven patients from four unrelated Altaian families (the Altai Republic, South Siberia). The WES analysis revealed homozygous missense mutations in three genes associated with HL. Mutation c.2168A>G (SLC26A4) was found in one family, a novel mutation c.1111G>C (OTOF) was revealed in another family, and mutation c.5254G>A (RAI1) was found in two families. Sanger sequencing was applied to screen for the identified variants in an ethnically diverse cohort of other patients with HL (n = 116) and in Altaian controls (n = 120). Identified variants were found only in patients of Altaian ethnicity (n = 93). Several lines of evidence support the association of homozygosity for discovered variants c.5254G>A (RAI1), c.1111C>G (OTOF), and c.2168A>G (SLC26A4) with HL in Altaian patients. Local prevalence of identified variants implies a possible founder effect in a significant number of HL cases in the indigenous population of the Altai region. Notably, this is the first reported instance of patients with a RAI1 missense mutation whose HL is not accompanied by specific traits typical for Smith-Magenis syndrome. The presumed association of RAI1 gene variant c.5254G>A with isolated HL needs to be confirmed by further experimental studies.
ANO10 mutations cause ataxia and coenzyme Q10 deficiency
Inherited ataxias are heterogeneous disorders affecting both children and adults, with over 40 different causative genes, making molecular genetic diagnosis challenging. Although recent advances in next-generation sequencing have significantly improved mutation detection, few treatments exist for patients with inherited ataxia. In two patients with adult-onset cerebellar ataxia and coenzyme Q10 (CoQ10) deficiency in muscle, whole exome sequencing revealed mutations in ANO10, which encodes anoctamin 10, a member of a family of putative calcium-activated chloride channels, and the causative gene for autosomal recessive spinocerebellar ataxia-10 (SCAR10). Both patients presented with slowly progressive ataxia and dysarthria leading to severe disability in the sixth decade. Epilepsy and learning difficulties were also present in one patient, while retinal degeneration and cataract were present in the other. The detection of mutations in ANO10 in our patients indicates that ANO10 defects cause secondary low CoQ10 and that SCAR10 patients may benefit from CoQ10 supplementation. The online version of this article (doi:10.1007/s00415-014-7476-7) contains supplementary material, which is available to authorized users.
Identification of a de novo DYNC1H1 mutation via WES according to published guidelines
De novo mutations that contribute to rare Mendelian diseases, including neurological disorders, have been recently identified. Whole-exome sequencing (WES) has become a powerful tool for the identification of inherited and de novo mutations in Mendelian diseases. Two important guidelines were recently published regarding the investigation of causality of sequence variants in human disease and the interpretation of novel variants identified in human genome sequences. In this study, a family with supposed movement disorders was sequenced via WES (including the proband and her unaffected parents), and a standard investigation and interpretation of the identified variants was performed according to the published guidelines. We identified a novel de novo mutation (c.2327C > T, p.P776L) in the DYNC1H1 gene and confirmed that it was the causal variant. The phenotype of the affected twins included delayed motor milestones, pes cavus, lower limb weakness and atrophy, and a waddling gait. Electromyographic (EMG) recordings revealed typical signs of chronic denervation. Our study demonstrates the power of WES to discover de novo mutations associated with a neurological disease on the whole exome scale, and shows that following guidelines for conducting WES studies and interpreting identified variants is a preferable option for exploring the pathogenesis of rare neurological disorders.
Identification of cell type specific mutations in nodal T cell lymphomas
Recent genetic analysis has identified frequent mutations in ten-eleven translocation 2 (TET2), DNA methyltransferase 3A (DNMT3A), isocitrate dehydrogenase 2 (IDH2) and ras homolog family member A (RHOA) in nodal T-cell lymphomas, including angioimmunoblastic T-cell lymphoma and peripheral T-cell lymphoma, not otherwise specified. We examined the distribution of mutations in these subtypes of mature T-/natural killer cell neoplasms to determine their clonal architecture. Targeted sequencing was performed for 71 genes in tumor-derived DNA of 87 cases. The mutations were then analyzed in a programmed death-1 (PD1)-positive population enriched with tumor cells and in CD20-positive B cells purified by laser microdissection from 19 cases. TET2 and DNMT3A mutations were identified in both the PD1+ cells and the CD20+ cells in 15/16 and 4/7 cases, respectively. All the RHOA and IDH2 mutations were confined to the PD1+ cells, indicating that some mutations, including those in RHOA and IDH2, are specific events in tumor cells. Notably, we found that all NOTCH1 mutations were detected only in the CD20+ cells. In conclusion, we identified both B- and T-cell-specific mutations, as well as mutations common to both T and B cells. These findings indicate the expansion of a clone after multistep and multilineal acquisition of gene mutations.
Mutations in histone modulators are associated with prolonged survival during azacitidine therapy
Early therapeutic decision-making is crucial in patients with higher-risk MDS. We evaluated the impact of clinical parameters and mutational profiles in 134 consecutive patients treated with azacitidine using a combined cohort from Karolinska University Hospital (n=89) and from King's College Hospital, London (n=45). While neither clinical parameters nor mutations had a significant impact on response rate, both karyotype and mutational profile were strongly associated with survival from the start of treatment. IPSS high-risk cytogenetics negatively impacted overall survival (median 20 vs 10 months; p<0.001), whereas mutations in histone modulators (ASXL1, EZH2) were associated with prolonged survival (22 vs 12 months, p=0.01). This positive association was present in both cohorts and remained highly significant in the multivariate Cox model. Importantly, patients with mutations in histone modulators lacking high-risk cytogenetics showed a survival of 29 months compared to only 10 months in patients with the opposite pattern. While TP53 mutations were negatively associated with survival, neither RUNX1 mutations nor the number of mutations appeared to influence survival in this cohort. We propose a model combining histone modulator mutational screening with cytogenetics in the clinical decision-making process for higher-risk MDS patients eligible for treatment with azacitidine.
STUB1 mutations in autosomal recessive ataxias – evidence for mutation specific clinical heterogeneity
A subset of hereditary cerebellar ataxias is inherited as autosomal recessive traits (ARCAs). Classification of recessive ataxias based on phenotypic differences in the cerebellum and cerebellar structures is constantly evolving as new disease genes are identified. Recently, reports have linked mutations in genes involved in ubiquitination (RNF216, OTUD4, STUB1) to ARCA with hypogonadism. With a combination of homozygosity mapping and exome sequencing, we identified three mutations in STUB1 in two families with ARCA and cognitive impairment: a homozygous missense variant (c.194A > G, p.Asn65Ser) that segregated in three affected siblings, and a missense change (c.82G > A, p.Glu28Lys) which was inherited in trans with a nonsense mutation (c.430A > T, p.Lys144Ter) in another patient. STUB1 encodes CHIP (C-terminus of Heat shock protein 70 – Interacting Protein), a dual function protein with a role in ubiquitination as a co-chaperone with heat shock proteins, and as an E3 ligase. We show that the p.Asn65Ser substitution impairs CHIP’s ability to ubiquitinate HSC70 in vitro, despite retaining the ability to self-ubiquitinate. These results are consistent with previous studies highlighting this as a critical residue for the interaction between CHIP and its co-chaperones. Furthermore, we show that the levels of CHIP are strongly reduced in vivo in patients’ fibroblasts compared to controls. These results suggest that STUB1 mutations might cause disease by impacting not only the E3 ligase function, but also its protein interaction properties and protein amount.
Whether the clinical heterogeneity seen in STUB1 ARCA can be related to the location of the mutations remains to be understood, but interestingly, all siblings with the p.Asn65Ser substitution showed a marked appearance of accelerated aging not previously described in STUB1 related ARCA, none display hormonal aberrations/clinical hypogonadism while some affected family members had diabetes, alopecia, uveitis and ulcerative colitis, further refining the spectrum of STUB1 related disease. The online version of this article (doi:10.1186/s13023-014-0146-0) contains supplementary material, which is available to authorized users.
Glioblastoma adaptation traced through decline of an IDH1 clonal driver and macro evolution of a double minute chromosome
In a glioblastoma tumour with multi-region sequencing before and after recurrence, we find an IDH1 mutation that is clonal in the primary but lost at recurrence. We also describe the evolution of a double-minute chromosome encoding regulators of the PI3K signalling axis that dominates at recurrence, emphasizing the challenges of an evolving and dynamic oncogenic landscape for precision medicine. Glioblastoma (GBM) is the most common malignant brain cancer occurring in adults, and is associated with dismal outcome and few therapeutic options. GBM has been shown to predominantly disrupt three core pathways through somatic aberrations, rendering it ideal for precision medicine approaches. We describe a 35-year-old female patient with recurrent GBM following surgical removal of the primary tumour, adjuvant treatment with temozolomide and a 3-year disease-free period. Rapid whole-genome sequencing (WGS) of three separate tumour regions at recurrence was carried out and interpreted relative to WGS of two regions of the primary tumour. We found extensive mutational and copy-number heterogeneity within the primary tumour. We identified a TP53 mutation and two focal amplifications involving PDGFRA, KIT and CDK4, on chromosomes 4 and 12. A clonal IDH1 R132H mutation in the primary, a known GBM driver event, was detectable at only very low frequency in the recurrent tumour. After sub-clonal diversification, evidence was found for a whole-genome doubling event and a translocation between the amplified regions of PDGFRA, KIT and CDK4, encoded within a double-minute chromosome also incorporating miR26a-2. The WGS analysis uncovered progressive evolution of the double-minute chromosome converging on the KIT/PDGFRA/PI3K/mTOR axis, superseding the IDH1 mutation in dominance in a mutually exclusive manner at recurrence; consequently, the patient was treated with imatinib.
Despite rapid sequencing and cancer genome-guided therapy against amplified oncogenes, the disease progressed, and the patient died shortly after. This case sheds light on the dynamic evolution of a GBM tumour, defining the origins of the lethal sub-clone, the macro-evolutionary genomic events dominating the disease at recurrence and the loss of a clonal driver. Even in the era of rapid WGS analysis, cases such as this illustrate the significant hurdles for precision medicine success.
The clinical features, outcomes and genetic characteristics of hypertrophic cardiomyopathy patients with severe right ventricular hypertrophy
Severe right ventricular hypertrophy (SRVH) is a rare phenotype in hypertrophic cardiomyopathy (HCM) for which limited information is available. This study was undertaken to investigate the clinical, prognostic and genetic characteristics of HCM patients with SRVH. HCM with SRVH was defined as HCM with a maximum right ventricular wall thickness ≥10 mm. Whole-genome sequencing (WGS) was performed in HCM patients with SRVH. Multivariate Cox proportional hazards regression models were used to identify risk factors for cardiac death and events in HCM with SRVH. Patients with apical hypertrophic cardiomyopathy (ApHCM) were selected as a comparison group. The clinical features and outcomes of 34 HCM patients with SRVH and 273 ApHCM patients were compared. Compared with the ApHCM group, the HCM with SRVH group included younger patients and a higher proportion of female patients and also displayed higher cardiovascular morbidity and mortality. The multivariate Cox proportional hazards regression models identified 2 independent predictors of cardiovascular death in HCM patients with SRVH, a New York Heart Association class ≥III (hazard ratio [HR] = 8.7, 95% confidence interval (CI): 1.43-52.87, p = 0.019) and an age at the time of HCM diagnosis ≤18 (HR = 5.5, 95% CI: 1.24-28.36, p = 0.026). Among the 11 HCM patients with SRVH who underwent WGS, 10 (90.9%) were identified as carriers of at least one specific sarcomere gene mutation. MYH7 and TTN mutations were the most common sarcomere mutations noted in this study. Two or more HCM-related gene mutations were observed in 9 (82%) patients, and mutations in either other cardiomyopathy-related genes or ion-channel disease-related genes were found in 8 (73%) patients. HCM patients with SRVH were characterized by poor clinical outcomes and the presentation of multiple gene mutations.
Isolated inclusion body myopathy caused by a multisystem proteinopathy–linked hnRNPA1 mutation
To identify the genetic cause of isolated inclusion body myopathy (IBM) with autosomal dominant inheritance in 2 families. Genetic investigations were performed using whole-exome and Sanger sequencing of the heterogeneous nuclear ribonucleoprotein A1 gene (hnRNPA1). The clinical and pathologic features of patients in the 2 families were evaluated with neurologic examinations, muscle imaging, and muscle biopsy. We identified a missense p.D314N mutation in hnRNPA1, which is also known to cause familial amyotrophic lateral sclerosis, in 2 families with IBM. The affected individuals developed muscle weakness in their 40s, which slowly progressed toward a limb-girdle pattern. Further evaluation of the affected individuals revealed no apparent motor neuron dysfunction, cognitive impairment, or bone abnormality. The muscle pathology was compatible with IBM, lacking apparent neurogenic change and inflammation. Multiple immunohistochemical analyses revealed the cytoplasmic aggregation of hnRNPA1 in close association with autophagosomes and myonuclei. Furthermore, the aberrant accumulation was characterized by coaggregation with ubiquitin, sequestome-1/p62, valosin-containing protein/p97, and a variety of RNA-binding proteins (RBPs). The present study expands the clinical phenotype of hnRNPA1-linked multisystem proteinopathy. Mutations in hnRNPA1, and possibly hnRNPA2B1, will be responsible for isolated IBM with a pure muscular phenotype. Although the mechanisms underlying the selective skeletal muscle involvement remain to be elucidated, the immunohistochemical results suggest a broad sequestration of RBPs by the mutated hnRNPA1.
New perspective in diagnostics of mitochondrial disorders: two years’ experience with whole exome sequencing at a national paediatric centre
Whole-exome sequencing (WES) has led to an exponential increase in identification of causative variants in mitochondrial disorders (MD). We performed WES in 113 patients with suspected MD from a Polish paediatric reference centre, in whom routine testing failed to identify a molecular defect. WES was performed using TruSeqExome enrichment, followed by variant prioritization, validation by Sanger sequencing, and segregation with the disease phenotype in the family. Likely causative mutations were identified in 67 (59.3 %) patients; these included variants in mtDNA (6 patients) and nDNA: X-linked (9 patients), autosomal dominant (5 patients), and autosomal recessive (47 patients, 11 homozygotes). Novel variants accounted for 50.5 % (50/99) of all detected changes. In 47 patients, changes in 31 MD-related genes (ACAD9, ADCK3, AIFM1, CLPB, COX10, DLD, EARS2, FBXL4, MTATP6, MTFMT, MTND1, MTND3, MTND5, NAXE, NDUFS6, NDUFS7, NDUFV1, OPA1, PARS2, PC, PDHA1, POLG, RARS2, RRM2B, SCO2, SERAC1, SLC19A3, SLC25A12, TAZ, TMEM126B, VARS2) were identified. The ACAD9, CLPB, FBXL4 and PDHA1 genes recurred more than twice, suggesting higher general/ethnic prevalence. In 19 cases, variants in 18 non-MD related genes (ADAR, CACNA1A, CDKL5, CLN3, CPS1, DMD, DYSF, GBE1, GFAP, HSD17B4, MECP2, MYBPC3, PEX5, PGAP2, PIGN, PRF1, SBDS, SCN2A) were found. The percentage of positive WES results rose gradually with increasing probability of MD according to the Mitochondrial Disease Criteria (MDC) scale (from 36 to 90 % for low and high probability, respectively). The percentage of detected MD-related genes compared with non MD-related genes also grew with the increasing MD likelihood (from 20 to 97 %). Molecular diagnosis was established in 30/47 (63.8 %) neonates and in 17/28 (60.7 %) patients with basal ganglia involvement. Mutations in the CLPB, SERAC1 and TAZ genes were identified in neonates with 3-methylglutaconic aciduria (3-MGA) as a discriminative feature.
A new MD-related candidate gene (NDUFB8) is under verification. We suggest WES rather than targeted NGS as the method of choice in diagnostics of MD in children, including neonates with 3-MGA who died without determination of disease cause and with limited availability of laboratory data. There is a strong correlation between the degree of MD diagnosis by WES and MD likelihood expressed by the MDC scale. The online version of this article (doi:10.1186/s12967-016-0930-9) contains supplementary material, which is available to authorized users.
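The diagnostic yields reported above can be reproduced from the stated counts. The sketch below uses only numbers given in the abstract; the helper `pct` and the variable names are illustrative.

```python
def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Patients with a likely causative mutation, split by genome and inheritance.
by_inheritance = {
    "mtDNA": 6,
    "X-linked": 9,
    "autosomal dominant": 5,
    "autosomal recessive": 47,
}
diagnosed, cohort = 67, 113
assert sum(by_inheritance.values()) == diagnosed

overall_yield = pct(diagnosed, cohort)  # 59.3
novel_variant_share = pct(50, 99)       # 50.5
neonate_yield = pct(30, 47)             # 63.8
basal_ganglia_yield = pct(17, 28)       # 60.7
```

The four inheritance categories sum to the 67 diagnosed patients, and each computed percentage matches the value quoted in the text.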
Germline mutations in ETV6 are associated with thrombocytopenia, red cell macrocytosis and predisposition to lymphoblastic leukemia
Some familial platelet disorders are associated with predisposition to leukemia, myelodysplastic syndrome (MDS) or dyserythropoietic anemia.1,2 We identified a family with autosomal dominant thrombocytopenia, high erythrocyte mean corpuscular volume (MCV) and two occurrences of B-cell precursor acute lymphoblastic leukemia (ALL). Whole exome sequencing identified a heterozygous single nucleotide change in ETV6 (Ets Variant Gene 6), c.641C>T, encoding a p.Pro214Leu substitution in the central domain, segregating with thrombocytopenia and elevated MCV. A screen of 23 families with similar phenotype found two with ETV6 mutations. One family had the p.Pro214Leu mutation and one individual with ALL. The other family had a c.1252A>G transition producing a p.Arg418Gly substitution in the DNA binding domain, with alternative splicing and exon-skipping. Functional characterization of these mutations showed aberrant cellular localization of mutant and endogenous ETV6, decreased transcriptional repression and altered megakaryocyte maturation. Our findings underscore a key role for ETV6 in platelet formation and leukemia predisposition.
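The two variants can be sanity-checked against the standard genetic code. The sketch below maps a coding-DNA position to its codon number and position within the codon, then asks which amino acid results from the stated base change. It assumes the positions are numbered from the first base of the start codon; the helper names are hypothetical and the codon dictionary is deliberately restricted to the four amino acids involved.

```python
def codon_position(cdna_pos):
    """Return (codon number, position within codon) for a 1-based coding position."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3 + 1


# Standard-code codons for the amino acids involved in the two substitutions.
CODON_TO_AA = {}
for aa, codons in {
    "Pro": "CCT CCC CCA CCG",
    "Leu": "CTT CTC CTA CTG TTA TTG",
    "Arg": "CGT CGC CGA CGG AGA AGG",
    "Gly": "GGT GGC GGA GGG",
}.items():
    for codon in codons.split():
        CODON_TO_AA[codon] = aa


def possible_changes(ref_aa, offset, ref_base, alt_base):
    """Amino acids reachable from ref_aa by ref_base>alt_base at a 1-based codon offset."""
    outcomes = set()
    for codon, aa in CODON_TO_AA.items():
        if aa == ref_aa and codon[offset - 1] == ref_base:
            mutant = codon[:offset - 1] + alt_base + codon[offset:]
            outcomes.add(CODON_TO_AA.get(mutant, "other"))
    return outcomes


print(codon_position(641), possible_changes("Pro", 2, "C", "T"))
print(codon_position(1252), possible_changes("Arg", 1, "A", "G"))
```

Position 641 falls on the second base of codon 214 and position 1252 on the first base of codon 418, which is consistent with the reported p.Pro214Leu and p.Arg418Gly substitutions.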
Behr’s Syndrome is Typically Associated with Disturbed Mitochondrial Translation and Mutations in the C12orf65 Gene
Behr’s syndrome is a classical phenotypic description of childhood-onset optic atrophy combined with various neurological symptoms, including ophthalmoparesis, nystagmus, spastic paraparesis, ataxia, peripheral neuropathy and learning difficulties. Here we describe 4 patients with the classical Behr’s syndrome phenotype from 3 unrelated families who carry homozygous nonsense mutations in the C12orf65 gene encoding a protein involved in mitochondrial translation. Whole exome sequencing was performed on genomic DNA and oxygen consumption was measured in patient cell lines. We detected 2 different homozygous C12orf65 nonsense mutations in 4 patients with a homogeneous clinical presentation matching the historical description of Behr’s syndrome. The first symptom in all patients was childhood-onset optic atrophy, followed by spastic paraparesis, distal weakness, motor neuropathy and ophthalmoparesis. We think that C12orf65 mutations are more frequent than previously suggested and that screening of this gene should be considered not only in patients with mitochondrial respiratory chain deficiencies, but also in inherited peripheral neuropathies, spastic paraplegias and ataxias, especially with pre-existing optic atrophy.
A survey on cellular RNA editing activity in response to Candida albicans infections
Adenosine-to-Inosine (A-to-I) RNA editing is catalyzed by the adenosine deaminase acting on RNA (ADAR) family of enzymes, which induces alterations in mRNA sequence. It has been shown that A-to-I RNA editing events are of significance in the cell’s innate immunity and cellular response to viral infections. However, whether RNA editing plays a role in cellular response to microorganism/fungi infection has not been determined. Candida albicans, one of the most prevalent human pathogenic fungi, usually acts as a commensal on skin and superficial mucosa, but has been found to cause candidiasis in immunosuppressed patients. Previously, we have revealed the up-regulation of A-to-I RNA editing activity in response to different types of influenza virus infections. The current work is designed to study the effect of microorganism/fungi infection on the activity of A-to-I RNA editing in infected hosts. We first detected and characterized the A-to-I RNA editing events in oral epithelial cells (OKF6) and primary human umbilical vein endothelial cells (HUVEC), under normal growth conditions or with C. albicans infection. A total of 89,648 and 60,872 A-to-I editing sites were detected in normal OKF6 and HUVEC cells, respectively. They were validated against the RNA editing databases, DARNED, RADAR, and REDIportal with 50, 80, and 80% success rates, respectively. While over 95% of editing sites were detected in Alu regions, among the rest of the editing sites in non-repetitive regions, the majority were located in introns and UTRs. The distributions of A-to-I editing activity and editing depth were analyzed during the course of C. albicans infection. While the normalized editing levels of common editing sites exhibited a significant increase, especially in Alu regions, no significant change in the expression of ADAR1 or ADAR2 was observed. Second, we performed further analysis on data from an in vivo mouse study with C. albicans infection. 
A total of 1,133 and 955 A-to-I editing sites were identified in mouse tongue and kidney tissues, respectively. The number of A-to-I editing events was much smaller than in human epithelial or endothelial cells, due to the lack of Alu elements in the mouse genome. Furthermore, during the course of C. albicans infection we observed a stable level of A-to-I editing activity in 131 and 190 common editing sites in the mouse tongue and kidney tissues, and found no significant change in ADAR1 or ADAR2 expression (with the exception of ADAR2 displaying a significant increase at 12 h after infection in mouse kidney tissue before returning to normal). This work represents the first comprehensive analysis of the A-to-I RNA editome in human epithelial and endothelial cells. C. albicans infection of human epithelial and endothelial cells led to the up-regulation of A-to-I editing activities, through a mechanism different from that of viral infections in human hosts. However, the in vivo mouse model with C. albicans infection did not show significant changes in A-to-I editing activities in tongue and kidney tissues. The different results in the mouse model were likely due to the presence of more complex in vivo environments, e.g. circulation and mixed cell types. The online version of this article (10.1186/s12864-017-4374-2) contains supplementary material, which is available to authorized users.
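The abstract does not define how editing levels were normalized or how sites were validated. As a rough illustration only, a commonly used per-site editing level is the fraction of reads carrying G (inosine is read as G) among reads carrying A or G, and validation can be approximated as set overlap with a known-site catalogue. All read counts, coordinates, and names below are hypothetical.

```python
def editing_level(a_reads, g_reads):
    """Fraction of edited (G) reads at one site; None when the site has no coverage."""
    total = a_reads + g_reads
    return g_reads / total if total else None


# Hypothetical (chromosome, position) -> (A reads, G reads) counts.
sites = {
    ("chr1", 1000): (90, 10),
    ("chr1", 2000): (40, 10),
    ("chr2", 500): (0, 0),
}
# Hypothetical catalogue of previously reported editing sites.
known_sites = {("chr1", 1000)}

levels = {site: editing_level(*counts) for site, counts in sites.items()}
covered = [site for site, level in levels.items() if level is not None]
validated_fraction = sum(site in known_sites for site in covered) / len(covered)

print(levels, validated_fraction)
```

Comparing such per-site levels at sites shared between conditions is one simple way to ask whether editing activity changes while the set of edited sites stays the same.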
Congenital myasthenic syndromes due to mutations in ALG2 and ALG14
Congenital myasthenic syndromes are a heterogeneous group of inherited disorders that arise from impaired signal transmission at the neuromuscular synapse. They are characterized by fatigable muscle weakness. We performed linkage analysis, whole-exome and whole-genome sequencing to determine the underlying defect in patients with an inherited limb-girdle pattern of myasthenic weakness. We identify ALG14 and ALG2 as novel genes in which mutations cause a congenital myasthenic syndrome. Through analogy with yeast, ALG14 is thought to form a multiglycosyltransferase complex with ALG13 and DPAGT1 that catalyses the first two committed steps of asparagine-linked protein glycosylation. We show that ALG14 is concentrated at the muscle motor endplates and small interfering RNA silencing of ALG14 results in reduced cell-surface expression of muscle acetylcholine receptor expressed in human embryonic kidney 293 cells. ALG2 is an alpha-1,3-mannosyltransferase that also catalyses early steps in the asparagine-linked glycosylation pathway. Mutations were identified in two kinships, with mutation ALG2 p.Val68Gly found to severely reduce ALG2 expression both in patient muscle and in cell cultures. Identification of DPAGT1, ALG14 and ALG2 mutations as a cause of congenital myasthenic syndrome underscores the importance of asparagine-linked protein glycosylation for proper functioning of the neuromuscular junction. These syndromes form part of the wider spectrum of congenital disorders of glycosylation caused by impaired asparagine-linked glycosylation. It is likely that further genes encoding components of this pathway will be associated with congenital myasthenic syndromes or impaired neuromuscular transmission as part of a more severe multisystem disorder. Our findings suggest that treatment with cholinesterase inhibitors may improve muscle function in many of the congenital disorders of glycosylation.
Splicing Variants of SERPINA1 Gene in Ovine Milk: Characterization of cDNA and Identification of Polymorphisms
The serine protease inhibitor, clade A, member 1 (SERPINA1) is the gene for a protein called alpha-1-antitrypsin (AAT), which is a member of the serine protease inhibitor (serpin) superfamily of proteins. By conformational change, serpins control several chemical reactions by inhibiting the activity of proteases. AAT is the most abundant endogenous serpin in blood circulation and it is present in relatively high concentration in human milk as well as in bovine and porcine colostrum. Here we report for the first time the molecular characterization and sequence variability of the ovine SERPINA1 cDNA and gene. cDNAs from mammary gland and from milk were PCR amplified, and three different transcripts (1437, 1166 and 521 bp) of the SERPINA1 gene were identified. We amplified and sequenced different regions of the gene (5’ UTR, from exon 2 to exon 5 and 3’ UTR), and we found that the exon-intron structure of the gene is similar to that of the human and bovine genes. We detected a total of 97 SNPs in cDNAs and gene sequences from 10 sheep of three different breeds. In adult sheep tissues, a SERPINA1 gene expression analysis indicated differential expression of the three transcripts. The findings reported in this paper will aid further studies on the possible involvement of the SERPINA1 gene in different physiological states and its possible association with production traits.
Copy number alterations detected by whole exome and whole genome sequencing of esophageal adenocarcinoma
Esophageal adenocarcinoma (EA) is among the leading causes of cancer mortality, especially in developed countries. A high level of somatic copy number alterations (CNAs) accumulates over the decades in the progression from Barrett’s esophagus, the precursor lesion, to EA. Accurate identification of somatic CNAs is essential to understand cancer development. Many studies have been conducted for the detection of CNA in EA using microarrays. Next-generation sequencing (NGS) technologies are believed to have advantages in sensitivity and accuracy to detect CNA, yet no NGS-based CNA detection in EA has been reported. In this study, we analyzed whole-exome (WES) and whole-genome sequencing (WGS) data for detecting CNA from a published large-scale genomic study of EA. Two specific comparisons were conducted. First, the recurrent CNAs based on WGS and WES data from 145 EA samples were compared to those found in five previous microarray-based studies. We found that the majority of the previously identified regions were also detected in this study. Interestingly, some novel amplifications and deletions were discovered using the NGS data. In particular, SKI and PRKCZ detected in a deletion region are involved in transforming growth factor-β pathway, suggesting the potential utility of novel biomarkers for EA. Second, we compared CNAs detected in WGS and WES data from the same 15 EA samples. No large-scale CNA was identified statistically more frequently by WES or WGS, while more focal-scale CNAs were detected by WGS than by WES. Our results suggest that NGS can replace microarrays to detect CNA in EA. WGS is superior to WES in that it can offer finer resolution for the detection, though if the interest is on recurrent CNAs, WES can be preferable to WGS for its cost-effectiveness. The online version of this article (doi:10.1186/s40246-015-0044-0) contains supplementary material, which is available to authorized users.
Reconstructing the Population Genetic History of the Caribbean
The Caribbean basin is home to some of the most complex interactions in recent history among previously diverged human populations. Here, we investigate the population genetic history of this region by characterizing patterns of genome-wide variation among 330 individuals from three of the Greater Antilles (Cuba, Puerto Rico, Hispaniola), two mainland (Honduras, Colombia), and three Native South American (Yukpa, Bari, and Warao) populations. We combine these data with a unique database of genomic variation in over 3,000 individuals from diverse European, African, and Native American populations. We use local ancestry inference and tract length distributions to test different demographic scenarios for the pre- and post-colonial history of the region. We develop a novel ancestry-specific PCA (ASPCA) method to reconstruct the sub-continental origin of Native American, European, and African haplotypes from admixed genomes. We find that the most likely source of the indigenous ancestry in Caribbean islanders is a Native South American component shared among inland Amazonian tribes, Central America, and the Yucatan peninsula, suggesting extensive gene flow across the Caribbean in pre-Columbian times. We find evidence of two pulses of African migration. The first pulse—which today is reflected by shorter, older ancestry tracts—consists of a genetic component more similar to coastal West African regions involved in early stages of the trans-Atlantic slave trade. The second pulse—reflected by longer, younger tracts—is more similar to present-day West-Central African populations, supporting historical records of later transatlantic deportation. Surprisingly, we also identify a Latino-specific European component that has significantly diverged from its parental Iberian source populations, presumably as a result of small European founder population size. 
We demonstrate that the ancestral components in admixed genomes can be traced back to distinct sub-continental source populations with far greater resolution than previously thought, even when limited pre-Columbian Caribbean haplotypes have survived. Latinos are often regarded as a single heterogeneous group, whose complex variation is not fully appreciated in several social, demographic, and biomedical contexts. By making use of genomic data, we characterize ancestral components of Caribbean populations on a sub-continental level and unveil fine-scale patterns of population structure distinguishing insular from mainland Caribbean populations as well as from other Hispanic/Latino groups. We provide genetic evidence for an inland South American origin of the Native American component in island populations and for extensive pre-Columbian gene flow across the Caribbean basin. The Caribbean-derived European component shows significant differentiation from parental Iberian populations, presumably as a result of founder effects during the colonization of the New World. Based on demographic models, we reconstruct the complex population history of the Caribbean since the onset of continental admixture. We find that insular populations are best modeled as mixtures absorbing two pulses of African migrants, coinciding with the early and maximum activity stages of the transatlantic slave trade. These two pulses appear to have originated in different regions within West Africa, imprinting two distinguishable signatures on present-day Afro-Caribbean genomes and shedding light on the genetic impact of the slave trade in the Caribbean.
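The link between tract length and pulse age can be illustrated with a textbook-style approximation. The sketch below is not the demographic model fitted in the study: it assumes a single admixture pulse T generations ago, an ancestry proportion m for the migrating source, and ancestry tracts with mean length of roughly 1 / ((1 - m)(T - 1)) Morgans. The function name and the example values are hypothetical.

```python
def mean_tract_length_cm(generations, ancestry_proportion):
    """Approximate mean ancestry-tract length in centimorgans.

    Single-pulse approximation: mean = 1 / ((1 - m) * (T - 1)) Morgans,
    valid for T > 1 and 0 <= m < 1.
    """
    if generations <= 1 or not 0 <= ancestry_proportion < 1:
        raise ValueError("need generations > 1 and 0 <= ancestry_proportion < 1")
    morgans = 1.0 / ((1.0 - ancestry_proportion) * (generations - 1))
    return 100.0 * morgans


# Hypothetical pulse ages, in generations, for an older and a younger pulse.
older_pulse = mean_tract_length_cm(15, 0.3)
younger_pulse = mean_tract_length_cm(8, 0.3)
print(round(older_pulse, 1), round(younger_pulse, 1))
```

Under this approximation, recombination has had more generations to break up tracts from the older pulse, so its tracts are shorter, which is the qualitative pattern the abstract uses to separate the two African migration pulses.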
Identification of selective sweeps reveals divergent selection between Chinese Holstein and Simmental cattle populations
The identification of signals left by recent positive selection provides a feasible approach for targeting genomic variants that underlie complex traits and fitness. A better understanding of the selection mechanisms that occurred during the evolution of species can also be gained. In this study, we simultaneously detected the genome-wide footprints of recent positive selection that occurred within and between Chinese Holstein and Simmental populations, which have been subjected to artificial selection for distinct purposes. We conducted analyses using various complementary approaches, including LRH, XP-EHH and FST, based on the Illumina 770K high-density single nucleotide polymorphism (SNP) array, to enable more comprehensive detection. We successfully constructed profiles of selective signals in both cattle populations. To further annotate these regions, we identified a set of novel functional genes related to growth, reproduction, immune response and milk production. There were no overlapping candidate windows between the two breeds. Finally, we investigated the distribution of SNPs that had low FST values across five distinct functional regions in the genome. In the low-minor allele frequency bin, we found a higher proportion of low-FST SNPs in the exons of the bovine genome, which indicates strong purifying selection of the exons. The selection signatures identified in these two populations demonstrated positive selection pressure on a set of important genes with potential functions that are involved in many biological processes. We also demonstrated that in the bovine genome, exons were under strong purifying selection. Our findings provide insight into the mechanisms of artificial selection and will facilitate follow-up functional studies of potential candidate genes that are related to various economically important traits in cattle. 
The online version of this article (doi:10.1186/s12711-016-0254-5) contains supplementary material, which is available to authorized users.
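Of the three statistics named in this abstract, FST is the simplest to write down. The sketch below computes Wright's FST for a single biallelic SNP from the allele frequencies in two equally weighted populations, as (HT - HS) / HT. The study may well have used a different estimator (for example one that corrects for sample size), and the example frequencies are hypothetical.

```python
def fst_biallelic(p1, p2):
    """Wright's FST for one biallelic SNP in two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                          # expected heterozygosity, pooled
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2     # mean within-population value
    if h_total == 0:
        return 0.0  # SNP is monomorphic in both populations
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies in two breeds.
print(fst_biallelic(0.5, 0.5))   # identical frequencies
print(fst_biallelic(0.9, 0.1))   # strongly differentiated SNP
print(fst_biallelic(1.0, 0.0))   # fixed difference
```

High values flag SNPs whose frequencies have diverged between the breeds, while low values in functional regions such as exons are what the abstract interprets as evidence of purifying selection.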
Investigating the relationship between UMODL1 gene polymorphisms and high myopia: a case–control study in Chinese
The UMODL1 gene was found to be associated with high myopia in Japanese. This study aimed to investigate this gene for association with high myopia in Chinese. Two groups of unrelated Han Chinese from Hong Kong were recruited using the same criteria: Sample Set 1 comprising 356 controls (spherical equivalent, SE, within ±1 diopter or D) and 356 cases (SE ≤ −8D), and Sample Set 2 comprising 394 controls and 526 cases. Fifty-nine tag single nucleotide polymorphisms (SNPs) were selected and genotyped for Sample Set 1. Four SNPs were followed up with Sample Set 2. Both single-marker and haplotype analyses were performed with cases defined by different SE thresholds. Secondary phenotypes were also analyzed for association with genotypes. Data filtering left 57 SNPs for analysis. Single-marker analysis did not reveal any significant differences between cases and controls in the initial study. However, haplotype GCT for markers rs220168-rs220170-rs11911271 showed marginal significance (empirical P = 0.076; SE ≤ −12D for cases), but could not be replicated in the follow-up study. In contrast, non-synonymous SNP rs3819142 was associated with high myopia (SE ≤ −10D) in the follow-up study, but could not be confirmed using Sample Set 1. The SNP rs2839471, positive in the original Japanese study, gave negative results in all our analyses. Exploratory analysis of secondary phenotypes indicated that allele C of rs220120 was associated with anterior chamber depth (adjusted P = 0.0460). Common UMODL1 polymorphisms were unlikely to be important in the genetic susceptibility to high myopia in Han Chinese.
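A single-marker case-control comparison of the kind described here can be sketched as an allelic chi-square test on a 2x2 table of allele counts. This is a simplified stand-in for the study's analysis, which also used empirical P values and haplotype tests. The allele counts are hypothetical; only the total of 712 alleles per group follows from the 356 cases and 356 controls of Sample Set 1.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical minor/major allele counts: 2 alleles x 356 individuals per group.
case_minor, case_major = 300, 412
control_minor, control_major = 270, 442

chi2 = chi_square_2x2(case_minor, case_major, control_minor, control_major)
print(round(chi2, 3), round(p_value_1df(chi2), 4))
```

With 57 SNPs tested, any nominal P value from such a test would still need correction for multiple comparisons before being called significant.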
EIN2-dependent regulation of acetylation of histone H3K14 and non-canonical histone H3K23 in ethylene signalling
Ethylene gas is essential for many developmental processes and stress responses in plants. EIN2 plays a key role in ethylene signalling but its function remains enigmatic. Here, we show that ethylene specifically elevates acetylation of histone H3K14 and the non-canonical acetylation of H3K23 in etiolated seedlings. The up-regulation of these two histone marks positively correlates with ethylene-regulated transcription activation, and the elevation requires EIN2. Both EIN2 and EIN3 interact with a SANT domain protein named EIN2 nuclear associated protein 1 (ENAP1), overexpression of which results in elevation of histone acetylation and enhanced ethylene-inducible gene expression in an EIN2-dependent manner. On the basis of these findings we propose a model where, in the presence of ethylene, the EIN2 C terminus contributes to downstream signalling via the elevation of acetylation at H3K14 and H3K23. ENAP1 may potentially mediate ethylene-induced histone acetylation via its interactions with EIN2 C terminus.
The translocation of the C-terminal domain of EIN2 to the nucleus is essential for induction of gene expression in response to the plant hormone ethylene. Here, Zhang et al. show that EIN2 is required for ethylene-inducible elevation of histone acetylation marks associated with transcriptional activation.
Dataset of TWIST1-regulated genes in the cranial mesoderm and a transcriptome comparison of cranial mesoderm and cranial neural crest
This article contains data related to the research article entitled “Transcriptional targets of TWIST1 in the cranial mesoderm regulate cell-matrix interactions and mesenchyme maintenance” by Bildsoe et al. (2016). The data presented here are derived from: (1) a microarray-based comparison of sorted cranial mesoderm (CM) and cranial neural crest (CNC) cells from E9.5 mouse embryos; (2) comparisons of transcription profiles of head tissues from mouse embryos with a CM-specific loss-of-function of Twist1 and control mouse embryos collected at E8.5 and E9.5; (3) ChIP-seq using a TWIST1-specific monoclonal antibody with chromatin extracts from TWIST1-expressing MDCK cells, a model for a TWIST1-dependent mesenchymal state.
A Genome-Wide Linkage Study for Chronic Obstructive Pulmonary Disease in a Dutch Genetic Isolate Identifies Novel Rare Candidate Variants
Chronic obstructive pulmonary disease (COPD) is a complex and heritable disease, associated with multiple genetic variants. Specific familial types of COPD may be explained by rare variants, which have not been widely studied. We aimed to discover rare genetic variants underlying COPD through a genome-wide linkage scan. Affected-only analysis was performed using the 6K Illumina Linkage IV Panel in 142 cases clustered in 27 families from a genetic isolate, the Erasmus Rucphen Family (ERF) study. Potential causal variants were identified by searching for shared rare variants in the exome-sequence data of the affected members of the families contributing most to the linkage peak. The identified rare variants were then tested for association with COPD in a large meta-analysis of several cohorts. Significant evidence for linkage was observed on chromosomes 15q14–15q25 [logarithm of the odds (LOD) score = 5.52], 11p15.4–11q14.1 (LOD = 3.71) and 5q14.3–5q33.2 (LOD = 3.49). In the chromosome 15 peak, which harbors the known COPD locus for nicotinic receptors, and in the chromosome 5 peak, we could not identify shared variants. In the chromosome 11 locus, we identified four rare (minor allele frequency (MAF) <0.02), predicted pathogenic, missense variants. These were shared among the affected family members. The identified variants localize to genes including neuroblast differentiation-associated protein (AHNAK), previously associated with blood biomarkers in COPD, phospholipase C Beta 3 (PLCB3), shown to increase airway hyper-responsiveness, solute carrier family 22-A11 (SLC22A11), involved in amino acid metabolism and ion transport, and metallothionein-like protein 5 (MTL5), involved in nicotinate and nicotinamide metabolism. Associations of SLC22A11 and MTL5 variants were confirmed in the meta-analysis of 9,888 cases and 27,060 controls. In conclusion, we have identified novel rare variants in plausible genes related to COPD. 
Further studies utilizing large sample whole-genome sequencing should further confirm the associations at chromosome 11 and investigate the chromosome 15 and 5 linked regions.
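A LOD score is the base-10 logarithm of a likelihood ratio, so the reported values translate directly into odds in favour of linkage. The sketch below shows that conversion and, as a toy illustration, the classic two-point LOD for phase-known meioses. The study's affected-only pedigree analysis is considerably more involved, and the meiosis counts used here are hypothetical.

```python
import math


def odds_from_lod(lod):
    """Likelihood ratio (linkage versus no linkage) implied by a LOD score."""
    return 10 ** lod


def two_point_lod(recombinants, meioses, theta):
    """LOD for phase-known meioses at recombination fraction 0 < theta <= 0.5."""
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1 - theta))
    log_l_null = meioses * math.log10(0.5)   # free recombination, theta = 0.5
    return log_l_theta - log_l_null


# Reported peak LOD scores, taken from the abstract.
for region, lod in [("15q14-15q25", 5.52), ("11p15.4-11q14.1", 3.71), ("5q14.3-5q33.2", 3.49)]:
    print(region, f"{odds_from_lod(lod):,.0f} : 1")

# Hypothetical toy family data: 1 recombinant in 10 informative meioses.
print(round(two_point_lod(1, 10, 0.1), 2))
```

A LOD of 3 corresponds to odds of 1000 to 1, the conventional threshold for significant linkage, so all three reported peaks exceed it.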
A nonsense mutation in PRNP associated with clinical Alzheimer's disease
Here, we describe a nonsense haplotype in PRNP associated with clinical Alzheimer's disease. The patient presented with early-onset cognitive decline, with memory loss as the primary cognitive problem. Whole-exome sequencing revealed a nonsense mutation in PRNP (NM_000311, c.C478T; p.Q160*; rs80356711) associated with homozygosity for the V allele at position 129 of the protein, further highlighting how very similar genotypes in PRNP result in strikingly different phenotypes.
AR-13, a Celecoxib Derivative, Directly Kills Francisella In Vitro and Aids Clearance and Mouse Survival In Vivo
Francisella tularensis (F. tularensis) is the causative agent of tularemia and is classified as a Tier 1 select agent. No licensed vaccine is currently available in the United States and treatment of tularemia is confined to a few antibiotics. In this study, we demonstrate that AR-13, a derivative of the cyclooxygenase-2 inhibitor celecoxib, exhibits direct in vitro bactericidal killing activity against Francisella including a type A strain of F. tularensis (SchuS4) and the live vaccine strain (LVS), as well as toward the intracellular proliferation of LVS in macrophages, without causing significant host cell toxicity. Identification of an AR-13-resistant isolate indicates that this compound has an intracellular target(s) and that efflux pumps can mediate AR-13 resistance. In the mouse model of tularemia, AR-13 treatment protected 50% of the mice from lethal LVS infection and prolonged survival time from a lethal dose of F. tularensis SchuS4. Combination of AR-13 with a sub-optimal dose of gentamicin protected 60% of F. tularensis SchuS4-infected mice from death. Taken together, these data support the translational potential of AR-13 as a lead compound for the further development of new anti-Francisella agents.
Complimentary mechanisms of dual checkpoint blockade expand unique T-cell repertoires and activate adaptive anti-tumor immunity in triple-negative breast tumors
Triple-negative breast cancer (TNBC) is an aggressive and molecularly diverse breast cancer subtype typified by the presence of p53 mutations (∼80%), elevated immune gene signatures and neoantigen expression, as well as the presence of tumor infiltrating lymphocytes (TILs). As these factors are hypothesized to be strong immunologic prerequisites for the use of immune checkpoint blockade (ICB) antibodies, multiple clinical trials testing single ICBs have advanced to Phase III, with early indications of heterogeneous response rates of <20% to anti-PD1 and anti-PDL1 ICB. While promising, these modest response rates highlight the need for mechanistic studies to understand how different ICBs function, how their combination impacts functionality and efficacy, as well as what immunologic parameters predict efficacy to different ICB regimens in TNBC. To address these issues, we tested anti-PD1 and anti-CTLA4 in multiple models of TNBC and found that their combination profoundly enhanced the efficacy of either treatment alone. We demonstrate that this efficacy is due to anti-CTLA4-driven expansion of an individually unique T-cell receptor (TCR) repertoire whose functionality is enhanced by both intratumoral Treg suppression and anti-PD1 blockade of tumor-expressed PDL1. Notably, the individuality of the TCR repertoire was observed regardless of whether the tumor cells expressed a nonself antigen (ovalbumin) or if tumor-specific transgenic T-cells were transferred prior to sequencing. However, responsiveness was strongly correlated with systemic measures of tumor-specific T-cell and B-cell responses, which, along with systemic assessment of TCR expansion, may serve as the most useful predictors for clinical responsiveness in future clinical trials of TNBC utilizing anti-PD1/anti-CTLA4 ICB.
Rare variants of small effect size in neuronal excitability genes influence clinical outcome in Japanese cases of SCN1A truncation-positive Dravet syndrome
Dravet syndrome (DS) is a rare, devastating form of childhood epilepsy that is often associated with mutations in the voltage-gated sodium channel gene, SCN1A. There is considerable variability in expressivity within families, as well as among individuals carrying the same primary mutation, suggesting that clinical outcome is modulated by variants at other genes. To identify modifier gene variants that contribute to clinical outcome, we sequenced the exomes of 22 individuals at both ends of a phenotype distribution (i.e., mild and severe cognitive condition). We controlled for variation associated with different mutation types by limiting inclusion to individuals with a de novo truncation mutation resulting in SCN1A haploinsufficiency. We performed tests aimed at identifying 1) single common variants that are enriched in either phenotypic group, 2) sets of common or rare variants aggregated in and around genes associated with clinical outcome, and 3) rare variants in 237 candidate genes associated with neuronal excitability. While our power to identify enrichment of a common variant in either phenotypic group is limited as a result of the rarity of mild phenotypes in individuals with SCN1A truncation variants, our top candidates did not map to functional regions of genes, or in genes that are known to be associated with neurological pathways. In contrast, we found a statistically-significant excess of rare variants predicted to be damaging and of small effect size in genes associated with neuronal excitability in severely affected individuals. A KCNQ2 variant previously associated with benign neonatal seizures is present in 3 of 12 individuals in the severe category. To compare our results with the healthy population, we performed a similar analysis on whole exome sequencing data from 70 Japanese individuals in the 1000 genomes project. 
Interestingly, the frequency of rare damaging variants in the same set of neuronal excitability genes in healthy individuals is nearly as high as in severely affected individuals. Rather than a single common gene/variant modifying clinical outcome in SCN1A-related epilepsies, our results point to the cumulative effect of rare variants with little to no measurable phenotypic effect (i.e., typical genetic background) unless present in combination with a disease-causing truncation mutation in SCN1A.
De novo derivation of proteomes from transcriptomes for transcript and protein identification
Identification of proteins by tandem mass spectrometry requires a database of the proteins that could be in the sample. This is available for model species (e.g. humans) but not for non-model species. Ideally, for a non-model species the sequencing of expressed mRNA would generate a protein database for mass spectrometry-based identification, allowing detection of genes and proteins using high-throughput sequencing and protein identification technologies. Here we use human cells infected with human adenovirus as a complex and dynamic model to demonstrate that this approach is robust. Our Proteomics Informed by Transcriptomics technique identifies >99% of over 3700 distinct proteins identified using traditional analysis reliant on comprehensive human and adenovirus protein lists. This facilitates high-throughput acquisition of direct evidence for transcripts and proteins in non-model species. Critically, we show this approach can also be used to highlight genes and proteins undergoing dynamic changes in post-transcriptional protein stability.
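The core idea, turning sequenced transcripts into a searchable protein list, can be sketched as a six-frame translation that keeps every stop-to-stop segment above a minimum length. This is a deliberately naive illustration, not the published pipeline: the function names, the minimum length, and the toy transcript are all hypothetical, and a real workflow would start from assembled transcripts and apply proper open-reading-frame prediction.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from the start of seq; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def candidate_peptides(transcript, min_len=3):
    """Stop-to-stop segments from all six reading frames, at least min_len residues long."""
    peptides = set()
    for strand in (transcript, reverse_complement(transcript)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) >= min_len:
                    peptides.add(segment)
    return peptides


# Toy transcript, for illustration only.
print(sorted(candidate_peptides("ATGGCCAAGTGA")))
```

The resulting peptide set would then be written out as a sequence database for the mass spectrometry search engine in place of a curated proteome.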
New Genetic Loci Associated with Preharvest Sprouting and Its Evaluation Based on the Model Equation in Rice
Preharvest sprouting (PHS) in rice panicles is an important quantitative trait that causes both yield losses and the deterioration of grain quality under unpredictable moisture conditions at the ripening stage. However, the molecular mechanism underlying PHS has not yet been elucidated. Here, we explored the genetic loci associated with PHS in rice and formulated a model regression equation for rapid screening for use in breeding programs. After re-sequencing 21 representative accessions for PHS and performing enrichment analysis, we found that approximately 20,000 SNPs revealed distinct allelic distributions between PHS resistant and susceptible accessions. Of these, 39 candidate SNP loci were selected, including previously reported QTLs. We analyzed the genotypes of 144 rice accessions to determine the association between PHS and the 39 candidate SNP loci, 10 of which were identified as significantly affecting PHS based on allele type. Based on the allele types of the SNP loci, we constructed a regression equation for evaluating PHS, accounting for an R2 value of 0.401 in japonica rice. We validated this equation using additional accessions, which exhibited a significant R2 value of 0.430 between the predicted values and actual measurements. The newly detected SNP loci and the model equation could facilitate marker-assisted selection to predict PHS in rice germplasm and breeding lines.
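The abstract does not give the regression equation itself, so the sketch below only shows the general shape of such a model: a predicted PHS value computed as an intercept plus per-SNP allele effects, and an R2 obtained as the squared Pearson correlation between predicted and measured values. Every SNP name, effect size, genotype, and measurement here is hypothetical.

```python
def predict_phs(genotype, intercept, effects):
    """Linear prediction: intercept plus the effect of each SNP times its allele code (0 or 1)."""
    return intercept + sum(effects[snp] * allele for snp, allele in genotype.items())


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Hypothetical model and accessions (allele coded 1 = susceptible type, 0 = resistant type).
intercept = 10.0
effects = {"snp_a": 5.0, "snp_b": -3.0, "snp_c": 8.0}
accessions = [
    {"snp_a": 0, "snp_b": 0, "snp_c": 0},
    {"snp_a": 1, "snp_b": 1, "snp_c": 0},
    {"snp_a": 1, "snp_b": 0, "snp_c": 1},
    {"snp_a": 0, "snp_b": 1, "snp_c": 1},
]
measured = [12.0, 9.0, 30.0, 14.0]   # hypothetical sprouting rates

predicted = [predict_phs(g, intercept, effects) for g in accessions]
print(predicted, round(r_squared(measured, predicted), 3))
```

Applying the same equation to accessions that were not used to fit it, as the study did for validation, is what makes the second R2 value an estimate of predictive rather than explanatory power.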
Shared ancestral susceptibility to colorectal cancer and other nutrition related diseases
The majority of non-syndromic colorectal cancers (CRCs) can be described as a complex disease. A two-stage case–control study on CRC susceptibility was conducted to assess the influence of the ancestral alleles in the polymorphisms previously associated with nutrition-related complex diseases. In stage I, 28 single nucleotide polymorphisms (SNPs) were genotyped in a hospital-based Czech population (1025 CRC cases, 787 controls) using an allele-specific PCR-based genotyping system (KASPar®). In stage II, replication was carried out for the five SNPs with the lowest p values. The replication set consisted of 1798 CRC cases and 1810 controls from a population-based German study (DACHS). Odds ratios (ORs) and 95% confidence intervals (CIs) for associations between genotypes and CRC risk were estimated using logistic regression. To identify signatures of selection, Fay-Wu’s H and Integrated Haplotype Score (iHS) were estimated. In the Czech population, carriers of the ancestral alleles of AGT rs699 and CYP3A7 rs10211 showed an increased risk of CRC (OR 1.26 and 1.38, respectively; two-sided p≤0.05), whereas carriers of the ancestral allele of ENPP1 rs1044498 had a decreased risk (OR 0.79; p≤0.05). For rs1044498, the strongest association was detected in the Czech male subpopulation (OR 0.61; p=0.0015). The associations were not replicated in the German population. Signatures of selection were found for all three analyzed genes. Our study showed evidence of association for the ancestral alleles of polymorphisms in AGT and CYP3A7 and for the derived allele of a polymorphism in ENPP1 with an increased risk of CRC in Czechs, but not in Germans. The ancestral alleles of these SNPs have previously been associated with nutrition-related diseases hypertension (AGT and CYP3A7) and insulin resistance (ENPP1). Future studies may shed light on the complex genetic and environmental interactions between different types of nutrition-related diseases.
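As a rough illustration of how an odds ratio and its 95% confidence interval are obtained, the sketch below computes an unadjusted OR with a Wald (Woolf) interval from a 2×2 table of carrier counts. The study itself used logistic regression; for a single binary predictor with no covariates the point estimate coincides with this simple calculation. The counts in the example are hypothetical and are not taken from the Czech or German data.

```python
import math


def odds_ratio_ci(case_carriers, case_noncarriers,
                  ctrl_carriers, ctrl_noncarriers, z=1.96):
    """Unadjusted odds ratio with a Wald (Woolf) confidence interval.

    z=1.96 gives an approximate 95% interval. All four cell counts
    must be positive, otherwise the log odds ratio is undefined.
    """
    cells = (case_carriers, case_noncarriers, ctrl_carriers, ctrl_noncarriers)
    if min(cells) <= 0:
        raise ValueError("all cell counts must be positive")
    odds_ratio = (case_carriers * ctrl_noncarriers) / (case_noncarriers * ctrl_carriers)
    # Standard error of log(OR) is the root of the summed reciprocal counts.
    se_log_or = math.sqrt(sum(1.0 / c for c in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts: 20 carriers and 10 non-carriers among cases,
# 10 carriers and 20 non-carriers among controls.
or_value, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
```

An interval that excludes 1 corresponds to a two-sided p value below roughly 0.05 for the unadjusted association.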
Accurate Breakpoint Mapping in Apparently Balanced Translocation Families with Discordant Phenotypes Using Whole Genome Mate Pair Sequencing
Familial apparently balanced translocations (ABTs) segregating with discordant phenotypes are extremely challenging for interpretation and counseling due to the scarcity of publications and lack of routine techniques for quick investigation. Recently, next generation sequencing has emerged as an efficacious methodology for precise detection of translocation breakpoints. However, studies so far have mainly focused on de novo translocations. The present study focuses specifically on familial cases in order to shed some light on this diagnostic dilemma. Whole-genome mate-pair sequencing (WG-MPS) was applied to map the breakpoints in nine two-way ABT carriers from four families. Translocation breakpoints and patient-specific structural variants were validated by Sanger sequencing and quantitative Real Time PCR, respectively. Identical sequencing patterns and breakpoints were identified in affected and non-affected members carrying the same translocations. PTCD1, ATP5J2-PTCD1, CADPS2, and STPG1 were disrupted by the translocations in three families, rendering them initially as possible disease candidate genes. However, subsequent mutation screening and structural variant analysis did not reveal any pathogenic mutations or unique variants in the affected individuals that could explain the phenotypic differences between carriers of the same translocations. In conclusion, we suggest that NGS-based methods, such as WG-MPS, can be successfully used for detailed mapping of translocation breakpoints, which can also be used in routine clinical investigation of ABT cases. Unlike de novo translocations, no associations were determined here between familial two-way ABTs and the phenotype of the affected members, in which the presence of cryptic imbalances and complex chromosomal rearrangements has been excluded. Future whole-exome or whole-genome sequencing will potentially reveal unidentified mutations in the patients underlying the discordant phenotypes within each family. In addition, larger studies are needed to determine the exact percentage for phenotypic risk in families with ABTs.
A Multi Breed Genome Wide Association Analysis for Canine Hypothyroidism Identifies a Shared Major Risk Locus on CFA12
Hypothyroidism is a complex clinical condition found in both humans and dogs, thought to be caused by a combination of genetic and environmental factors. In this study we present a multi-breed analysis of predisposing genetic risk factors for hypothyroidism in dogs using three high-risk breeds: the Gordon Setter, Hovawart and the Rhodesian Ridgeback. Using a genome-wide association approach and meta-analysis, we identified a major hypothyroidism risk locus shared by these breeds on chromosome 12 (p = 2.1 × 10⁻¹¹). Further characterisation of the candidate region revealed a shared ~167 kb risk haplotype (4,915,018–5,081,823 bp), tagged by two SNPs in almost complete linkage disequilibrium. This breed-shared risk haplotype includes three genes (LHFPL5, SRPK1 and SLC26A8) and does not extend to the dog leukocyte antigen (DLA) class II gene cluster located in the vicinity. These three genes have not been identified as candidate genes for hypothyroid disease previously, but have functions that could potentially contribute to the development of the disease. Our results implicate the potential involvement of novel genes and pathways for the development of canine hypothyroidism, raising new possibilities for screening, breeding programmes and treatments in dogs. This study may also contribute to our understanding of the genetic etiology of human hypothyroid disease, which is one of the most common endocrine disorders in humans.
Draft Genome Sequence of Saccharomycopsis fermentans CBS 7830, a Predacious Yeast Belonging to the Saccharomycetales
Saccharomycopsis fermentans is an ascomycetous necrotrophic fungal pathogen that penetrates and kills fungal prey cells via targeted penetration pegs. Here, we report the draft genome sequence and scaffold assembly of this mycoparasite.
Complete mitogenome sequences of four flatfishes (Pleuronectiformes) reveal a novel gene arrangement of L strand coding genes
Few mitochondrial gene rearrangements are found in vertebrates and large-scale changes in these genomes occur even less frequently. It is difficult, therefore, to propose a mechanism to account for observed changes in mitogenome structure. Mitochondrial gene rearrangements are usually explained by the recombination model or tandem duplication and random loss model. In this study, the complete mitochondrial genomes of four flatfishes, Crossorhombus azureus (blue flounder), Grammatobothus krempfi, Pleuronichthys cornutus, and Platichthys stellatus were determined. A striking finding is that eight genes in the C. azureus mitogenome are located in a novel position, differing from that of available vertebrate mitogenomes. Specifically, the ND6 and seven tRNA genes (the Q, A, C, Y, S1, E, P genes) encoded by the L-strand have been translocated to a position between tRNA-T and tRNA-F though the original order of the genes is maintained. These special features are used to suggest a mechanism for C. azureus mitogenome rearrangement. First, a dimeric molecule was formed by two monomers linked head-to-tail, then one of the two sets of promoters lost function and the genes controlled by the disabled promoters became pseudogenes, non-coding sequences, and even were lost from the genome. This study provides a new gene-rearrangement model that accounts for the events of gene-rearrangement in a vertebrate mitogenome.
The Genome Sequence of Rickettsia felis Identifies the First Putative Conjugative Plasmid in an Obligate Intracellular Parasite
We sequenced the genome of Rickettsia felis, a flea-associated obligate intracellular α-proteobacterium causing spotted fever in humans. Besides a circular chromosome of 1,485,148 bp, R. felis exhibits the first putative conjugative plasmid identified among obligate intracellular bacteria. This plasmid is found in a short (39,263 bp) and a long (62,829 bp) form. R. felis contrasts with previously sequenced Rickettsia in terms of many other features, including a number of transposases, several chromosomal toxin–antitoxin genes, many more spoT genes, and a very large number of ankyrin- and tetratricopeptide-motif-containing genes. Host-invasion-related genes for patatin and RickA were found. Several phenotypes predicted from genome analysis were experimentally tested: conjugative pili and mating were observed, as well as β-lactamase activity, actin-polymerization-driven mobility, and hemolytic properties. Our study demonstrates that complete genome sequencing is the fastest approach to reveal phenotypic characters of recently cultured obligate intracellular bacteria.
Rickettsia felis is an obligate intracellular bacterium that lives in fleas and causes spotted fever in humans. Its genome sequence provides the first evidence that such bacteria can undergo conjugation.
Draft Genome Sequence of Saccharomycopsis fodiens CBS 8332, a Necrotrophic Mycoparasite with Biocontrol Potential
Saccharomycopsis fodiens is an ascomycetous necrotrophic mycoparasite. Predator-prey interaction leads to killing of the host cell by a penetration peg and utilization of cell content by the predator. Here, we report the 14.9-Mb S. fodiens draft genome sequence assembled into 9 large scaffolds and 13 minor scaffolds (<20 kb).
A novel ammonia oxidizing archaeon from wastewater treatment plant: Its enrichment, physiological and genomic characteristics
Ammonia-oxidizing archaea (AOA) were recently found to participate in ammonia removal in wastewater treatment plants (WWTPs), similar to their bacterial counterparts. However, owing to the lack of cultivated AOA strains from WWTPs, their functions and contributions in these systems remain unclear. Here we report a novel AOA strain, SAT1, enriched from activated sludge, and describe its physiological and genomic characteristics. The maximal 16S rRNA gene similarity between SAT1 and other reported AOA strains is 96% (with “Ca. Nitrosotenuis chungbukensis”), and SAT1 is affiliated with Wastewater Cluster B (WWC-B) based on amoA gene phylogeny, a cluster within group I.1a that is specific to activated sludge. The strain is autotrophic, mesophilic (25 °C–33 °C) and neutrophilic (pH 5.0–7.0). Its genome is 1.62 Mb in size and contains a large inverted fragment (accounting for 68% of the genome). The strain cannot utilize urea because its urea transporter gene is truncated. Lacking the pathways to synthesize the usual compatible solutes, it is intolerant of high salinity (>0.03%) but can adapt to low-salinity (0.005%) environments. This adaptation, together with a possibly enhanced cell-biofilm attachment ability, makes it suited to the WWTP environment. We propose the name “Candidatus Nitrosotenuis cloacae” for strain SAT1.
A marine inducible prophage vB_CibM P1 isolated from the aerobic anoxygenic phototrophic bacterium Citromicrobium bathyomarinum JL354
A prophage, vB_CibM-P1, was induced by mitomycin C from the epipelagic strain Citromicrobium bathyomarinum JL354, a member of the alpha-IV subcluster of marine aerobic anoxygenic phototrophic bacteria (AAPB). The induced bacteriophage vB_CibM-P1 had Myoviridae-like morphology, with polyhedral heads (capsid approximately 60–100 nm) and tail fibers. The vB_CibM-P1 genome is ~38 kb in size, with 66.0% GC content. The genome contains 58 proposed open reading frames that are involved in integration, DNA packaging, morphogenesis and bacterial lysis. vB_CibM-P1 is a temperate phage that can be directly induced in hosts. In response to mitomycin C induction, virus-like particles can increase to 7 × 10⁹ per ml, while host cells decrease by an order of magnitude. The vB_CibM-P1 bacteriophage is the first inducible prophage from AAPB.
Complete Genome Sequence of Enterococcus Bacteriophage EFLK1
We previously isolated EFDG1, a lytic phage against enterococci for therapeutic use. Nevertheless, EFDG1-resistant bacterial strains (EFDG1r) have evolved. EFLK1, a new highly effective phage against EFDG1r strains, was isolated in this study. The genome of EFLK1 was fully sequenced, analyzed, and deposited in GenBank.
Clear Genetic Distinctiveness between Human and Pig Derived Trichuris Based on Analyses of Mitochondrial Datasets
The whipworm, Trichuris trichiura, causes trichuriasis in ∼600 million people worldwide, mainly in developing countries. Whipworms also infect other animal hosts, including pigs (T. suis), dogs (T. vulpis) and non-human primates, and cause disease in these hosts, which is similar to trichuriasis of humans. Although Trichuris species are considered to be host specific, there has been considerable controversy, over the years, as to whether T. trichiura and T. suis are the same or distinct species. Here, we characterised the entire mitochondrial genomes of human-derived Trichuris and pig-derived Trichuris, compared them and then tested the hypothesis that the parasites from these two host species are genetically distinct in a phylogenetic analysis of the sequence data. Taken together, the findings support the proposal that T. trichiura and T. suis are separate species, consistent with previous data for nuclear ribosomal DNA. Using molecular analytical tools, employing genetic markers defined herein, future work should conduct large-scale studies to establish whether T. trichiura is found in pigs and T. suis in humans in endemic regions.
Trichuriasis is a neglected tropical disease (NTD) caused by parasitic nematodes of the genus Trichuris (Nematoda), causing significant human and animal health problems as well as considerable socio-economic consequences world-wide. Although Trichuris species are considered to be relatively host specific, there has been significant controversy as to whether Trichuris infecting humans (recognized as T. trichiura) is a distinct species from that found in pigs (recognized as T. suis), or not. In the present study, we sequenced, annotated and compared the complete mitochondrial genomes of Trichuris from these two hosts and undertook a phylogenetic analysis of the mitochondrial datasets. This analysis showed clear genetic distinctiveness and strong statistical support for the hypothesis that T. trichiura and T. suis are separate species, consistent with previous studies using nuclear ribosomal DNA sequence data. Future studies could explore, using mitochondrial genetic markers defined in the present study, cross-transmission of Trichuris between pigs and humans in endemic regions, and the population genetics of T. trichiura and T. suis.
Draft Genome Sequence of Zymomonas mobilis ZM481 (ATCC 31823)
Zymomonas mobilis ZM481 (ATCC 31823) is an ethanol-tolerant strain that can produce the highest level of ethanol in Z. mobilis from glucose in the shortest time. Here, we report a draft genome sequence of ZM481, which can help us understand the genes related to the ethanol tolerance of this strain.
Draft Genome Sequence of Polaromonas glacialis Strain R3 9, a Psychrotolerant Bacterium Isolated from Arctic Glacial Foreland
Here we report the draft genome sequence of the psychrotolerant Polaromonas glacialis strain R3-9, isolated from Midtre Lovénbreen glacial foreland near Ny-Ålesund, Svalbard Archipelago, Norway.
Rapid Evolution of the Mitochondrial Genome in Chalcidoid Wasps (Hymenoptera: Chalcidoidea) Driven by Parasitic Lifestyles
Among the Chalcidoids, hymenopteran parasitic wasps that have diversified lifestyles, a partial mitochondrial genome has been reported only from Nasonia. This genome had many unusual features, especially a dramatic reorganization and a high rate of evolution. Comparisons based on more mitochondrial genomic data from the same superfamily were required to reveal whether these unusual features are peculiar to Nasonia or not. In the present study, we sequenced the nearly complete mitochondrial genomes of Philotrypesis pilosa and Philotrypesis sp., both of which are associated with Ficus hispida. The acquired data included all of the protein-coding genes, rRNAs, and most of the tRNAs, and in P. pilosa the control region. High levels of nucleotide divergence separated the two species. A comparison of all available hymenopteran mitochondrial genomes (including a submitted partial genome from Ceratosolen solmsi) revealed that the Chalcidoids had dramatic mitochondrial gene rearrangements, involving not only the tRNAs but also several protein-coding genes. The AT-rich control region was translocated and inverted in Philotrypesis. The mitochondrial genomes also exhibited rapid rates of evolution involving elevated nonsynonymous mutations.
Complete Genome Sequence of the Streptococcus suis Temperate Bacteriophage ϕNJ2
Streptococcus suis is an important cause of meningitis, arthritis, and sudden death in young piglets and of meningitis in humans. A novel temperate S. suis-specific bacteriophage (ϕNJ2) was identified. The phage was induced from the S. suis strain NJ2 by using mitomycin C, and the whole genome sequence was determined. The ϕNJ2 genome is 37,282 bp in length and contains 56 open reading frames (ORFs). While 31 ORFs (55%) encoded hypothetical proteins, other ORFs were predicted to be functional, clearly indicating the novelty of ϕNJ2.
A Framework for Assessing the Concordance of Molecular Typing Methods and the True Strain Phylogeny of Campylobacter jejuni and C. coli Using Draft Genome Sequence Data
Tracking of sources of sporadic cases of campylobacteriosis remains challenging, as commonly used molecular typing methods have limited ability to unambiguously link genetically related strains. Genomics has become increasingly prominent in the public health response to enteric pathogens as methods enable characterization of pathogens at an unprecedented level of resolution. However, the cost of sequencing and expertise required for bioinformatic analyses remains prohibitive, and these comprehensive analyses are limited to a few priority strains. Although several molecular typing methods are currently widely used for epidemiological analysis of campylobacters, it is not clear how accurately these methods reflect true strain relationships. To address this, we have developed a framework and associated computational tools to rapidly analyze draft genome sequence data for the assessment of molecular typing methods against a “gold standard” based on the phylogenetic analysis of highly conserved core (HCC) genes with high sequence quality. We analyzed 104 publicly available whole genome sequences (WGS) of C. jejuni and C. coli. In addition to in silico determination of multi-locus sequence typing (MLST), flaA, and porA type, as well as comparative genomic fingerprinting (CGF) type, we inferred a “reference” phylogeny based on 389 HCC genes. Molecular typing data were compared to the reference phylogeny for concordance using the adjusted Wallace coefficient (AWC) with confidence intervals. Although MLST targets the sequence variability in core genes and CGF targets insertions/deletions of accessory genes, both methods are based on multi-locus analysis and provided better estimates of true phylogeny than methods based on single loci (porA, flaA). 
A more comprehensive WGS dataset including additional genetically related strains, both epidemiologically linked and unlinked, will be necessary to more comprehensively assess the performance of subtyping methods for outbreak investigations and surveillance activities. Analyses of the strengths and weaknesses of widely used typing methodologies in inferring true strain relationships will provide guidance in the interpretation of this data for epidemiological purposes.
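The concordance measure named above can be sketched briefly. The Wallace coefficient W(A→B) is the probability that two strains sharing a type under method A also share a type under method B. The adjusted Wallace coefficient rescales it against the value expected if the two partitions were independent. The Python sketch below assumes the standard definitions and omits the confidence intervals used in the study. The type labels in the example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def wallace(a_labels, b_labels):
    """W(A->B): fraction of strain pairs sharing an A type that also share a B type."""
    same_a = same_both = 0
    for i, j in combinations(range(len(a_labels)), 2):
        if a_labels[i] == a_labels[j]:
            same_a += 1
            if b_labels[i] == b_labels[j]:
                same_both += 1
    if same_a == 0:
        raise ValueError("method A puts every strain in its own type")
    return same_both / same_a


def expected_wallace(b_labels):
    """Expected W(A->B) under independence: chance that a random pair shares a B type."""
    n = len(b_labels)
    counts = Counter(b_labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def adjusted_wallace(a_labels, b_labels):
    """Adjusted Wallace coefficient; undefined if method B assigns a single type."""
    w = wallace(a_labels, b_labels)
    w_indep = expected_wallace(b_labels)
    return (w - w_indep) / (1 - w_indep)


# Hypothetical labels for six strains: a finer typing scheme and a
# coarser reference clustering.
fine = ["ST1", "ST1", "ST1", "ST2", "ST2", "ST3"]
coarse = ["A", "A", "A", "B", "B", "B"]
fine_to_coarse = adjusted_wallace(fine, coarse)
coarse_to_fine = adjusted_wallace(coarse, fine)
```

The coefficient is directional: in this toy example the finer scheme fully predicts the coarser one, but not the other way round.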
The Dynamic Regulatory Genome of Capsaspora and the Origin of Animal Multicellularity
The unicellular ancestor of animals had a complex repertoire of genes linked to multicellular processes. This suggests that changes in the regulatory genome, rather than in gene innovation, were key to the origin of animals. Here, we carry out multiple functional genomic assays in Capsaspora owczarzaki, the unicellular relative of animals with the largest known gene repertoire for transcriptional regulation. We show that changing chromatin states, differential lincRNA expression, and dynamic cis-regulatory sites are associated with life cycle transitions in Capsaspora. Moreover, we demonstrate conservation of animal developmental transcription-factor networks and extensive network interconnection in this premetazoan organism. In contrast, however, Capsaspora lacks animal promoter types, and its regulatory sites are small, proximal, and lack signatures of animal enhancers. Overall, our results indicate that the emergence of animal multicellularity was linked to a major shift in genome cis-regulatory complexity, most notably the appearance of distal enhancer regulation.
Dynamic chromatin states and cis-regulatory sites in a unicellular context
Elaborate lincRNA regulation associated with a unicellular life cycle
Premetazoan origin of core metazoan developmental transcription-factor networks
Distal enhancer elements are a metazoan innovation
Analysis of the regulatory genome in one of our closest unicellular relatives suggests that the appearance of developmental promoters and distal enhancer elements, rather than of gene innovations, may have been the critical events underlying the origin of multicellular organisms.
Correction of the auditory phenotype in C57BL/6N mice via CRISPR/Cas9 mediated homology directed repair
Nuclease-based technologies have been developed that enable targeting of specific DNA sequences directly in the zygote. These approaches provide an opportunity to modify the genomes of inbred mice, and allow the removal of strain-specific mutations that confound phenotypic assessment. One such mutation is the Cdh23ahl allele, present in several commonly used inbred mouse strains, which predisposes to age-related progressive hearing loss. We have used targeted CRISPR/Cas9-mediated homology directed repair (HDR) to correct the Cdh23ahl allele directly in C57BL/6NTac zygotes. Employing offset-nicking Cas9 (D10A) nickase with paired RNA guides and a single-stranded oligonucleotide donor template we show that allele repair was successfully achieved. To investigate potential Cas9-mediated ‘off-target’ mutations in our corrected mouse, we undertook whole-genome sequencing and assessed the ‘off-target’ sites predicted for the guide RNAs (≤4 nucleotide mis-matches). No induced sequence changes were identified at any of these sites. Correction of the progressive hearing loss phenotype was demonstrated using auditory-evoked brainstem response testing of mice at 24 and 36 weeks of age, and rescue of the progressive loss of sensory hair cell stereocilia bundles was confirmed using scanning electron microscopy of dissected cochleae from 36-week-old mice. CRISPR/Cas9-mediated HDR has been successfully utilised to efficiently correct the Cdh23ahl allele in C57BL/6NTac mice, and rescue the associated auditory phenotype. The corrected mice described in this report will allow age-related auditory phenotyping studies to be undertaken using C57BL/6NTac-derived models, such as those generated by the International Mouse Phenotyping Consortium (IMPC) programme. The online version of this article (doi:10.1186/s13073-016-0273-4) contains supplementary material, which is available to authorized users.
Patterns of Genome Wide Variation in Glossina fuscipes fuscipes Tsetse Flies from Uganda
The tsetse fly Glossina fuscipes fuscipes (Gff) is the insect vector of the two forms of Human African Trypanosomiasis (HAT) that exist in Uganda. Understanding Gff population dynamics and the underlying genetics of epidemiologically relevant phenotypes is key to reducing disease transmission. Using ddRAD sequence technology, complemented with whole-genome sequencing, we developed a panel of ∼73,000 single-nucleotide polymorphisms (SNPs) distributed across the Gff genome that can be used for population genomics and to perform genome-wide-association studies. We used these markers to estimate genomic patterns of linkage disequilibrium (LD) in Gff, and used the information, in combination with outlier-locus detection tests, to identify candidate regions of the genome under selection. LD in individual populations decays to half of its maximum value (r²max/2) between 1359 and 2429 bp. The overall LD estimated for the species reaches r²max/2 at 708 bp, an order of magnitude slower than in Drosophila. Using 53 infected (Trypanosoma spp.) and uninfected flies from four genetically distinct Ugandan populations adapted to different environmental conditions, we were able to identify SNPs associated with the infection status of the fly and local environmental adaptation. The extent of LD in Gff likely facilitated the detection of loci under selection, despite the small sample size. Furthermore, it is probable that LD in the regions identified is much higher than the average genomic LD due to strong selection. Our results show that even modest sample sizes can reveal significant genetic associations in this species, which has implications for future studies given the difficulties of collecting field specimens with contrasting phenotypes for association analysis.
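For readers unfamiliar with the LD statistics quoted above, the following Python sketch shows one common way to compute r² between two biallelic SNPs from phased haplotypes, and a simple way to read off the distance at which LD first falls to half of its maximum. It assumes phased 0/1-coded haplotypes and pre-binned mean r² values. The data in the example are hypothetical, and the study's own estimation procedure may differ.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes.

    haplotypes is a list of (allele_at_locus_1, allele_at_locus_2)
    tuples with alleles coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


def half_decay_distance(binned_r2):
    """First distance at which mean r^2 is at or below half of its maximum.

    binned_r2 is a list of (distance_bp, mean_r2) pairs sorted by
    distance. Returns None if LD never decays that far in the data.
    """
    half_max = max(r2 for _, r2 in binned_r2) / 2
    for distance, r2 in binned_r2:
        if r2 <= half_max:
            return distance
    return None


# Hypothetical examples: two SNPs in complete LD, two independent
# SNPs, and a toy decay curve.
perfect = ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)])
independent = ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)])
decay_bp = half_decay_distance([(100, 0.8), (500, 0.6), (1000, 0.4), (2000, 0.2)])
```

A shorter half-decay distance means LD breaks down faster, so denser markers are needed to tag causal variants.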