Computational protocol: Emergent Transcriptomic Technologies and Their Role in the Discovery of Biomarkers of Liver Transplant Tolerance

Protocol publication

[…] Rigorous quality control criteria help to ensure the collection of high-quality array data that are reproducible and comparable. The MicroArray Quality Control (MAQC) project, an unprecedented, community-wide effort to appraise microarray reliability and quality control metrics, reported that, with careful experimental design and appropriate data transformation and analysis, data can be reproducible and comparable across laboratories, institutions, and researchers (). A number of commercial software packages have been developed to aid the quality control process (–). Specialist software is also available to aid with data normalization, a crucial step in the conversion of raw data into scaled relative expression levels (, , ). Statistical packages are also used for calculating differential expression, controlling for false positives, selecting significance cut-offs, clustering genes thought to be similar or co-regulated, and performing final pathway analyses, enabling the identification of gene sets associated with specific biological functions (–).

Much of the complexity involved in the statistical analyses stems not only from the high number of genes measured per sample (“curse of dimensionality”) but also from the disproportion between this and the limited number of samples available for testing (“curse of scarcity”), a difficulty often faced in biomarker research. This is addressed, in part, through adjusted p-values (q-values) that control the false discovery rate (). Analysis tools, many of them free to download and now widely used, include significance analysis of microarrays (SAM), GenePattern, and GenMAPP (). Despite these robust analytical tools, an undiscerning researcher can erroneously “discover” sets of genes that appear to differentiate the very samples on which the model was built, even when the data are completely random. This problem should be circumvented by ensuring that a gene model is tested on a validation group that is independent of the training set used to create the model in the first place; this approach is preferable to the cross-validation techniques sometimes employed (–). Further, technical validation of microarray results on a different transcriptional platform, usually RT-PCR, is recommended to minimize inter- or intra-platform variability in hybridization noise that may arise between batches or laboratories.

In order to verify the reproducibility of analyses and to corroborate clinical validity, public microarray databases serve as essential repositories. In the transplant setting, where studies often include only small numbers of recipients, these resources are especially important. The Functional Genomics Data (FGED) Society (formerly the MGED Society), a non-profit, volunteer-run organization promoting the sharing of high-throughput research data, helped to define the Minimum Information About a Microarray Experiment (MIAME) guidelines for data content standards. The Society also set the standard data exchange format, known as the MicroArray Gene Expression Markup Language (MAGE-ML). Thorough reviews of the numerous databases in existence have been set out in the literature ().
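By way of illustration of the normalization step mentioned above, the short Python sketch below applies quantile normalization to a simulated probe-by-array matrix. The protocol does not prescribe a particular normalization method or software package, so the choice of quantile normalization, the matrix dimensions, and the function name are assumptions made here for illustration; in practice, dedicated Bioconductor packages would typically be used.

    # Minimal quantile-normalization sketch (illustrative only): rows are probes,
    # columns are arrays/samples. Assumes a small in-memory matrix of raw intensities.
    import numpy as np

    def quantile_normalize(expr: np.ndarray) -> np.ndarray:
        """Force every array (column) to share the same empirical distribution."""
        ranks = np.argsort(np.argsort(expr, axis=0), axis=0)   # per-column ranks
        mean_sorted = np.sort(expr, axis=0).mean(axis=1)       # reference distribution
        return mean_sorted[ranks]                              # map ranks back to values

    raw = np.random.lognormal(mean=6.0, sigma=1.0, size=(1000, 8))  # 1000 probes x 8 arrays
    norm = quantile_normalize(raw)
    print(norm.mean(axis=0))   # column means are now identical across arrays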
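The differential expression and false-positive control described above can likewise be sketched in a few lines: a per-gene Welch t-test comparing two simulated groups, followed by a Benjamini-Hochberg adjustment to obtain q-values. The group sizes, the spiked-in effect, and the q < 0.05 cut-off are illustrative assumptions, not values taken from the protocol; established packages (e.g., Bioconductor's limma or SAM itself) would normally handle this step.

    # Hedged sketch of per-gene differential expression with false-discovery-rate control.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_genes, n_tol, n_ctrl = 5000, 10, 10
    tol  = rng.normal(0.0, 1.0, size=(n_genes, n_tol))    # e.g., tolerant recipients
    ctrl = rng.normal(0.0, 1.0, size=(n_genes, n_ctrl))   # e.g., comparison group
    tol[:50] += 1.5                                        # spike in 50 truly changed genes

    # Per-gene two-sample Welch t-test, then Benjamini-Hochberg adjustment.
    _, pvals = stats.ttest_ind(tol, ctrl, axis=1, equal_var=False)

    def benjamini_hochberg(p):
        p = np.asarray(p)
        n = len(p)
        order = np.argsort(p)
        ranked = p[order] * n / np.arange(1, n + 1)
        q = np.minimum.accumulate(ranked[::-1])[::-1]      # enforce monotone q-values
        out = np.empty_like(q)
        out[order] = np.clip(q, 0, 1)
        return out

    qvals = benjamini_hochberg(pvals)
    print("genes passing q < 0.05:", int((qvals < 0.05).sum()))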
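To make the overfitting warning above concrete, the following sketch generates a completely random “expression” matrix, selects the most discriminative genes, and scores a simple nearest-centroid classifier in two ways: on the same samples used for gene selection (apparently accurate) and on an independent held-out half (accuracy near chance). The sample sizes, the classifier, and the gene-selection rule are illustrative assumptions.

    # Why an independent validation set matters: random data can look "predictive"
    # when gene selection and evaluation reuse the same samples.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    n_samples, n_genes, k_top = 40, 10000, 20
    X = rng.normal(size=(n_samples, n_genes))      # purely random "expression" data
    y = np.array([0] * 20 + [1] * 20)              # arbitrary group labels: no real signal

    def top_genes(X, y, k):
        """Select the k genes with the smallest two-sample t-test p-values."""
        _, p = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
        return np.argsort(p)[:k]

    def centroid_accuracy(X_tr, y_tr, X_te, y_te, genes):
        """Nearest-centroid classification accuracy on the chosen genes."""
        c0 = X_tr[y_tr == 0][:, genes].mean(axis=0)
        c1 = X_tr[y_tr == 1][:, genes].mean(axis=0)
        pred = (np.linalg.norm(X_te[:, genes] - c1, axis=1)
                < np.linalg.norm(X_te[:, genes] - c0, axis=1)).astype(int)
        return (pred == y_te).mean()

    # Biased estimate: genes selected and accuracy measured on the same samples.
    genes_all = top_genes(X, y, k_top)
    print("resubstitution accuracy:", centroid_accuracy(X, y, X, y, genes_all))

    # Honest estimate: genes selected on a training half, scored on a held-out half.
    train, test = np.arange(0, n_samples, 2), np.arange(1, n_samples, 2)
    genes_tr = top_genes(X[train], y[train], k_top)
    print("held-out accuracy:", centroid_accuracy(X[train], y[train], X[test], y[test], genes_tr))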
Microarray data output is necessarily dependent on the quality of the original biological samples. RNA is considerably more susceptible to rapid enzymatic degradation than DNA, making efficient processing and appropriate storage under robust protocols essential. Microarrays offer snapshots of gene expression. The kinetics of transcripts, and the variability of their expression levels relative to baseline, remain little understood and so are not readily amenable to statistical interpretation (, ). Matters are further complicated by tissue heterogeneity, as in blood samples, for instance. This heterogeneity obscures anatomical detail in the microarray approach, making it difficult to know which cells’ gene expression profiles are being analyzed. Cell sorting and microdissection are ways to tackle this difficulty, as is the application of statistical deconvolution methods such as cell-specific significance analysis of microarrays (csSAM) (), sketched below. While peripheral blood has been at the forefront of efforts to identify biomarkers, the possibility of interrogating RNA extracted from paraffin-embedded biopsies is a useful addition to investigative efforts.

It becomes clear, then, that to discern biological fact from mere noise, due attention must be paid to the analytical complexities involved in microarray interpretation. Although, as we will see, microarray profiling has yielded important data in the pursuit of biomarkers of tolerance, and the technology is becoming more commonplace in transplantation research, the promise of emerging next-generation sequencing (NGS) technologies is likely to eclipse many microarray applications. In essence, NGS involves the sequential identification of the bases of small DNA fragments from signals emitted as each fragment is re-synthesized from a DNA template strand. By extending this process across millions of reactions in parallel, the technology enables rapid sequencing of large stretches of DNA base-pairs spanning entire genomes ().

In part, the promise of NGS stems from sidestepping some of the aforementioned problems inherent in microarray technology. NGS is highly reliable and has a greater dynamic range, because it directly quantifies discrete, digital sequencing readouts rather than relying on hybridization steps. Loss of specificity due to cross-hybridization is controlled; the detection of rare and low-abundance transcripts is made more achievable; the unbiased detection of novel transcripts becomes possible, since the transcript-specific probes required by microarrays are rendered redundant; and errors in probe design, which are relatively common in microarray chips, are avoided. In addition to these technical considerations, NGS technology is advancing at such a pace that the prospect of “sequencing everything” (genome, epigenome, transcriptome) in a timely and cost-effective manner is well within reach. In the 4 years between 2007 and 2011, the output of a single sequencing run increased 1000×, far outstripping Moore’s law, while the cost of sequencing an entire genome fell from over 150,000 USD in 2009 to less than 5,000 USD in 2014 (). Of course, NGS presents its own technological and bioinformatics challenges, which have been comprehensively reviewed elsewhere (, ). […]
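The deconvolution idea behind methods such as csSAM, referenced above, can be illustrated with a few lines of linear algebra: bulk expression is modelled as measured cell-type proportions multiplied by unknown cell-type-specific expression, which is then recovered by least squares. This is a minimal sketch of the general principle under simulated data, not the csSAM algorithm itself, and all dimensions and values are assumptions.

    # Sketch of proportion-based deconvolution: E (samples x genes) is modelled as
    # W (samples x cell types) times H (cell types x genes), and H is estimated
    # from the observed bulk profiles and measured cell fractions.
    import numpy as np

    rng = np.random.default_rng(2)
    n_samples, n_celltypes, n_genes = 30, 4, 200

    W = rng.dirichlet(alpha=np.ones(n_celltypes), size=n_samples)      # per-sample cell fractions
    H_true = rng.gamma(shape=2.0, scale=50.0, size=(n_celltypes, n_genes))
    E = W @ H_true + rng.normal(scale=5.0, size=(n_samples, n_genes))  # noisy bulk profiles

    # Recover cell-type-specific expression: solve W @ H ≈ E in the least-squares sense.
    H_hat, *_ = np.linalg.lstsq(W, E, rcond=None)

    # Correlation between the true and estimated profile of the first cell type.
    print(np.corrcoef(H_true[0], H_hat[0])[0, 1])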

Pipeline specifications

Software tools GenePattern, GenMAPP, csSAM
Application ChIP-on-chip analysis
Diseases Liver Diseases