Computational protocol: Shotgun Pyrosequencing Metagenomic Analyses of Dusts from Swine Confinement and Grain Facilities

Protocol publication

[…] Unexpectedly, we found that additional bioinformatic processing steps had to be taken with dust-derived read datasets from environments with a resident mammal, i.e., swine facility (swine and human) and household (human) datasets. Each of these datasets had to be computationally partitioned into “swine/human” and “filtered” data subsets in order to maintain the original focus of this study, which was only on the latter subset, i.e., agricultural dust microbiota. Thus, post-QC reads were aligned using the blastn program in BLAST+ v.2.2.25 against the unmasked swine (Sus scrofa) draft genome sequence (ssc_ref_Sscrofa10) and/or the unmasked human genome sequence (hs_ref_GRCh37.p5). The seed size used was six nucleotides and only the best BLASTn hit per read was considered. The NCBI BLASTn program reports expect values <1e−179 as zero; hence zero expect values were converted to 1e−179 before log10 transformation.
The possibility remained that some reads that aligned to the swine/human genomes may align with an even lower BLASTn expect value to a known bacterial genome sequence. Thus, the 1,480 complete and 1,659 draft bacterial genome sequences that were available on the NCBI FTP site on 11/16/2011 were downloaded and formatted as a BLASTn database for a second round of alignments, using the mammalian best-hit reads as queries. Very few of these reads yielded a significant alignment to any of the available bacterial genome sequences, but in cases where such an alignment yielded a lower BLASTn expect value than was obtained with the same read’s best mammalian genome BLASTn hit, the read was re-classified and added to the “filtered” dataset.
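The partitioning rule described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the dictionary inputs (read ID mapped to best-hit expect value), and the assumption that any read with a reported mammalian best hit counts as "swine/human" unless a bacterial hit beats it are all hypothetical choices made for illustration. Only the 1e−179 floor and the "strictly lower expect value" re-classification rule come from the text.

```python
import math

# BLASTn reports expect values below 1e-179 as zero (stated in the protocol).
E_FLOOR = 1e-179


def log10_evalue(evalue):
    """Return log10 of an expect value, replacing zero with the 1e-179 floor."""
    return math.log10(evalue if evalue > 0 else E_FLOOR)


def partition_reads(all_reads, mammal_hits, bacterial_hits):
    """Split read IDs into ("swine/human", "filtered") sets.

    mammal_hits and bacterial_hits are hypothetical dicts mapping a read ID
    to the expect value of that read's best BLASTn hit in each database.
    """
    mammal, filtered = set(), set()
    for read in all_reads:
        if read not in mammal_hits:
            # No mammalian hit: the read stays in the "filtered" subset.
            filtered.add(read)
            continue
        mammal_score = log10_evalue(mammal_hits[read])
        bacterial = bacterial_hits.get(read)
        # Re-classify only when the bacterial hit is strictly better.
        if bacterial is not None and log10_evalue(bacterial) < mammal_score:
            filtered.add(read)
        else:
            mammal.add(read)
    return mammal, filtered
```

Note that a read whose mammalian and bacterial best hits both have an expect value of zero is tied after the floor is applied, so under this sketch it stays in the "swine/human" subset.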
For the control read alignments against mammalian genome sequences using the swine feces shotgun metagenomic read datasets (post-QC read datasets comprising 127,088; 427,661; and 563,638 reads for the swine feces 1, 2, and 3 datasets, respectively), the alignment workload was reduced by using an evenly sampled subset of 20,000 reads for each of these three datasets. Except where explicitly indicated, all results reported for the swine facility and household dust-derived datasets are based on analyses of only their respective “filtered” read subsets.
[…] During MG-RAST (v. 3.0) read dataset upload, the default options for quality control (QC) were selected, i.e., base-call quality filtering, read-length filtering, and de-replication of reads, but screening against a model organism genome sequence was not selected. Individual read datasets were then used for MG-RAST organism abundance profiling. These individual read dataset MG-RAST abundance profiles were then used as input for Principal Component Analysis (PCA) performed at multiple levels of the relevant classification hierarchy, as well as two-group statistical tests performed at the lowest or “leaf” level of the relevant classification hierarchy, using the “Statistical Analysis of Metagenomic Profiles” (STAMP v. 2.0) software.
Read dataset collections (e.g., swine confinement facility dust [n = 2 samples], grain elevator dust [n = 2 samples] or household dust [n = 2 samples]) were also created in MG-RAST, and these collections were also used for MG-RAST organism abundance profiling. These read dataset collection results were then used as input for summary histograms at all levels in the relevant classification hierarchy.
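As an aside on the swine feces control alignments described earlier in this section, the "evenly sampled subset of 20,000 reads" could be drawn as in the sketch below. The text does not say how the even sampling was done, so taking reads at evenly spaced positions in file order is an assumption, and the function name `evenly_sample` is hypothetical.

```python
def evenly_sample(reads, n):
    """Return n reads taken at evenly spaced positions from a list of reads.

    Assumption: "evenly sampled" means evenly spaced in input order.
    If n is at least the number of reads, all reads are returned.
    """
    total = len(reads)
    if n >= total:
        return list(reads)
    # Integer arithmetic keeps the indices strictly increasing and in range.
    return [reads[(i * total) // n] for i in range(n)]
```

For example, `evenly_sample(reads, 20000)` applied to the 127,088-read swine feces 1 dataset would keep roughly every sixth read.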
Read dataset collection results were also used for comparisons of MG-RAST’s organism abundance profiles between swine feces control datasets obtained using either 16S rRNA amplicon-based or shotgun metagenomic-based approaches.
Organism abundance profiling using shotgun metagenomic read datasets was carried out using the “best hit classification” alignment procedure against the M5 non-redundant protein database (M5NR), using the following parameter values: Max. e-Value Cutoff: 1e−5; Min. % Identity Cutoff: 60%; Min. Alignment Length Cutoff: 50. MG-RAST’s Lowest Common Ancestor (LCA) organism abundance profiling procedure for shotgun reads did not produce profiles that could be used with STAMP, and hence the LCA results were not compared.
Organism abundance profiling using the swine feces control “16S” read dataset collection was carried out using the “best hit classification” alignment procedure against the Ribosomal Database Project database (RDP, Michigan State University), using the following parameter values: Max. e-Value Cutoff: 1e−5; Min. % Identity Cutoff: 97%; Min. Alignment Length Cutoff: 50. The top 10 taxa in the RDP-based organism abundance profiles were ranked based on their relative abundances (taxon-specific abundance/total abundance). Relative abundance values for the same taxa were obtained from the organism abundance profiles generated using the swine feces control shotgun read dataset collection and the M5NR database. Relative abundance ratios were then calculated as the M5NR-based relative abundance divided by the RDP-based relative abundance. Perfect concordance between the RDP and M5NR organism abundance profiles would yield a relative abundance ratio of 1. […]
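The relative abundance ratio calculation can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names and the dictionary inputs (taxon mapped to raw abundance) are hypothetical, and treating a taxon that is absent from the M5NR profile as having a relative abundance of 0 is an assumption.

```python
def relative_abundances(profile):
    """Convert raw abundances to taxon-specific abundance / total abundance."""
    total = sum(profile.values())
    return {taxon: count / total for taxon, count in profile.items()}


def abundance_ratios(rdp_profile, m5nr_profile, top_n=10):
    """Return M5NR/RDP relative abundance ratios for the top RDP taxa.

    Both arguments are hypothetical dicts mapping taxon name to abundance.
    Taxa with zero RDP abundance are skipped to avoid division by zero.
    """
    rdp_rel = relative_abundances(rdp_profile)
    m5nr_rel = relative_abundances(m5nr_profile)
    present = [taxon for taxon in rdp_rel if rdp_rel[taxon] > 0]
    # Rank taxa by RDP-based relative abundance, highest first.
    top = sorted(present, key=rdp_rel.get, reverse=True)[:top_n]
    # Assumption: a taxon missing from the M5NR profile counts as 0.
    return {taxon: m5nr_rel.get(taxon, 0.0) / rdp_rel[taxon] for taxon in top}
```

A ratio above 1 means the taxon is relatively more abundant in the shotgun (M5NR) profile than in the 16S (RDP) profile, and a ratio below 1 means the opposite.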

Pipeline specifications

Software tools: BLASTN, STAMP
Databases: MG-RAST
Applications: Metagenomic sequencing analysis, 16S rRNA-seq analysis
Organisms: Sus scrofa, Bacteria, Viruses, Homo sapiens, Firmicutes
Diseases: Asthma, Bronchitis, Drug Hypersensitivity, Chronic Obstructive Pulmonary Disease