Computational protocol: Airway Microbial Community Turnover Differs by BPD Severity in Ventilated Preterm Infants

Similar protocols

Protocol publication

[…] As previously described, Illumina MiSeq paired-end sequences were sorted by sample via barcodes in the paired reads with a python script[]. Sorted paired end sequence data were deposited in the NCBI Short Read Archive under accession number SRP066782. The sorted paired reads were assembled using phrap [, ]. Pairs that did not assemble were discarded. Assembled sequence ends were trimmed over a moving window of 5 nucleotides until average quality met or exceeded 20. Trimmed sequences with more than 1 ambiguity or shorter than 200 nt were discarded. Potential chimeras identified with Uchime (usearch6.0.203_i86linux32) [] using the Schloss [] Silva reference sequences were removed from subsequent analyses. Assembled sequences were aligned and classified with SINA (1.2.11) [] using the 629,125 bacterial sequences in Silva 111[] as reference configured to yield the Silva taxonomy. Operational taxonomic units (OTUs) were produced by clustering sequences with identical taxonomic assignments. This process generated 22,512,493 sequences for 350 samples (average size: 64,321 sequences/sample; min: 6,451; max: 196,691). The median Goods coverage score was ≥ 99.6% at the rarefaction point of 6,451. The software package Explicet (v2.10.5,[] was used to calculate rarefied Good’s coverage and alpha diversity (Shannon index) measures. [...] All data was prospectively collected and managed using REDCap (Research Electronic Data Capture)[] database hosted at the University of Colorado Denver Development and Informatics Service Center. Chi-squared tests and Wilcoxon ranked sum tests were used to compare clinical characteristics across BPD groups for categorical and continuous variables, respectively. To account for differences in sequencing depth, the relative abundance (RA) of each taxon was calculated (number of sequences for specific taxa/ total number of sequences*100). Random forests (RF) consisting of 5,000 classification trees were used to multivariately evaluate microbial taxa cross-sectionally across BPD groups []. A random forest is an ensemble method which fits many classification trees and is often used for high dimensional data. Multidimensional Scaling using the proximity matrix from the RF was used to visualize the separation between BPD groups based on the microbial communities. Communities across collection times within a subject were compared using Shannon Beta, a measure of beta diversity []. This approach extends the beta diversity measure to apply to a group of samples rather than just for pair-wise comparisons. For our example, we used the following Hβi=∑jcijci++∑kcijkcij+ln(cijkcij+ci+kc+++)(1) where cijk is the sequence count for subject i, from sample j and taxon k, the + in the subscript denotes the summation of the counts over the specified indicator.For ease of clinical interpretation, Shannon Beta is expressed as a Hill's number which indicates the effective number of communities represented by the group of samples. This measure is dependent on the number of samples from which it was calculated, and ranges from 1 to N (in our study n = 4). A normalizing transformation was used to rescale the Hill's numbers to allow comparison across subjects with differences in the number of collected samples []. Hni=Hβi−1ji−1(2) where ji is the number of samples for subject i.Morisita-Horn (MH, Beta-diversity) for pairwise samples within each subject were calculated. Both MH and total bacterial load were compared across the BPD severity groups using a log-normal model and generalized estimating equations to account for repeated measures.Individual taxa of interest were modeled over time with the age of the subject at the specimen collection as the time variable and using a generalized linear mixed model. A beta-binomial join-point model was fit with a knot placed at 10 days and included a random subject effect and group specific dispersion parameters. Placement of the knot was determined by fitting a cubic function, setting the first derivative to zero and solving. Details of this analysis are described in the online data supplement (). The RF was performed using the RandomForest R package, all other analyses were performed using SAS version 9.4 software (SAS Institute Inc.: Cary, NC, 2014). […]

Pipeline specifications

Software tools UCHIME, USEARCH, Explicet, randomforest
Applications Miscellaneous, Metagenomic sequencing analysis, 16S rRNA-seq analysis
Organisms Homo sapiens
Diseases Bacterial Infections, Bronchopulmonary Dysplasia
Chemicals Oxygen