Computational protocol: Targeted metagenomic sequencing data of human gut microbiota associated with Blastocystis colonization

Protocol publication

[…] The analysis of all samples was performed using a home-made Galaxy (v1.0.0) pipeline (http://www.pegase-biosciences.com/pub_2014/#ECCB) linking three main analytical steps: raw data preprocessing, clustering analysis and OTU classification, and read count normalization. Unless otherwise mentioned, default parameters were used for all software programs. The home-made scripts are available in Figshare (scripts.zip file, Data Citation 2).

The preprocessing step, using mothur (v1.27.0) and home-made scripts (scripts.zip file, Data Citation 2), filtered the raw data (Data Citation 1) to minimize erroneous reads generated by the Ion Torrent PGM sequencer. Reads shorter than 150 bases and/or containing large homopolymers were removed. The reads were then aligned against the SILVA 102 bacterial database, and reads with alignments of fewer than 100 bases were filtered out. Finally, the filtered reads were deduplicated to reduce the size of the datasets. Three of the 96 samples, considered to be outliers (see the Technical validation section), were discarded in this step before proceeding with the subsequent analyses. The number of reads remaining after this preprocessing step was 2,742,108.

In the second analytical step, OTU clustering was performed using ESPRIT-Tree (version 11152011), which offers the same OTU definition precision as standard hierarchical clustering procedures but requires less execution time. The OTUs were classified using classify.seqs in mothur (v1.27.0) with the SILVA 102 database and the RDP taxonomy.

Intra-sample rarefaction curves were generated using the home-made rarefaction curve plotting tool (rarefaction.R in scripts.zip file, Data Citation 2). This tool provides a way of comparing the richness observed in the samples.
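The idea behind a rarefaction curve can be illustrated with a minimal Python sketch. It is not the authors' rarefaction.R script, whose implementation is not shown here; it assumes the standard analytic expectation of richness under sampling without replacement, and the function names `expected_richness` and `rarefaction_curve` are hypothetical.

```python
from math import comb


def expected_richness(otu_counts, depth):
    """Expected number of OTUs seen in a random subsample of `depth` reads.

    Assumption: reads are drawn without replacement from one sample whose
    per-OTU read counts are given in `otu_counts`.
    """
    total = sum(otu_counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the total read count")
    all_draws = comb(total, depth)
    # An OTU with c reads is missed only if every drawn read comes from the
    # other (total - c) reads; comb() returns 0 when that is impossible.
    return sum(1 - comb(total - c, depth) / all_draws
               for c in otu_counts if c > 0)


def rarefaction_curve(otu_counts, step):
    """Return (depth, expected richness) points from 0 reads up to all reads."""
    total = sum(otu_counts)
    depths = list(range(0, total, step)) + [total]
    return [(d, expected_richness(otu_counts, d)) for d in depths]


if __name__ == "__main__":
    # Toy sample with three OTUs holding 2, 1 and 1 reads.
    for depth, richness in rarefaction_curve([2, 1, 1], step=1):
        print(depth, round(richness, 3))
```

To compare several clustering distances, the same function would be called once per distance, each time with the OTU counts obtained at that distance.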
Graphically, the rarefaction curve presents the number of OTUs theoretically observed for a range of sequence counts in the sample, at variable distances (91%, 93%, 95%, 97%).

For each sample, the output of this second analytical step is an OTU table file (OTU_count_tables tsv file, Data Citation 2) containing four columns: the first column is the name of the consensus read associated with the OTU, the second column is the raw count of the OTU, the third column repeats the consensus read name (same as the first column), and the fourth column is the associated taxon. The consensus sequence for a given OTU is the most abundant sequence in this OTU. The characteristics of these OTU_count_tables files are summarized in .

In the third analytical step of the home-made pipeline, all the annotated OTU tables were merged into a global OTU table, based on each OTU's taxonomic annotation, using a home-made Python (v2.7.3) script (OTU_tables_merge.py script from scripts.zip file, Data Citation 2).

The Global OTU table is a tab-separated values (TSV) file in which each column represents one sample and each line represents one taxon (identified by its OTU identifier in the first column and by its taxonomic annotation in the last column). This merged OTU table describes 474 OTUs, their annotation, and the number of reads belonging to each OTU per sample. Finally, this annotated OTU table was converted into a global BIOM file with the biom (v2.1.4) convert command. The HDF5 format was chosen to optimize storage. The characteristics of the BIOM file (Data Citation 2) are summarized in .

As advised by previous recommendations, the DESeq2 package integrated into QIIME (v1.9.0) was used to normalize the total read counts and avoid rarefaction of the read count data. The normalize_table.py Python script from QIIME (v1.9.0) was run with the DESeq2 algorithm and with the option that replaces negative numbers produced by the DESeq normalization technique with zeros.
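The merge of the per-sample OTU tables into one global table, described above, can be sketched as follows. This is an illustration only and not the authors' OTU_tables_merge.py. It assumes that the per-sample files have no header line, that counts are summed when two OTUs of one sample share a taxon, and that the first consensus read name seen for a taxon serves as its identifier. The function names and the `OTU_ID` and `taxonomy` column labels are hypothetical.

```python
import csv
import io


def read_otu_table(handle):
    """Parse one per-sample table: read name, raw count, read name, taxon."""
    rows = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # tolerate blank lines
        name, count, _name_again, taxon = row
        rows.append((name, int(count), taxon))
    return rows


def merge_otu_tables(tables):
    """Merge per-sample rows into a global table keyed by taxon.

    `tables` maps a sample name to the rows returned by read_otu_table().
    Returns (header, body): one body line per taxon, one column per sample.
    """
    samples = list(tables)
    merged = {}  # taxon -> {"id": identifier, "counts": {sample: reads}}
    for sample, rows in tables.items():
        for name, count, taxon in rows:
            # Assumption: the first read name seen for a taxon is its identifier.
            entry = merged.setdefault(taxon, {"id": name, "counts": {}})
            entry["counts"][sample] = entry["counts"].get(sample, 0) + count
    header = ["OTU_ID"] + samples + ["taxonomy"]
    body = [[entry["id"]]
            + [entry["counts"].get(sample, 0) for sample in samples]
            + [taxon]
            for taxon, entry in merged.items()]
    return header, body


if __name__ == "__main__":
    s1 = read_otu_table(io.StringIO("r1\t5\tr1\tBacteroides\nr2\t3\tr2\tPrevotella\n"))
    s2 = read_otu_table(io.StringIO("r9\t2\tr9\tBacteroides\n"))
    header, body = merge_otu_tables({"S1": s1, "S2": s2})
    print("\t".join(header))
    for line in body:
        print("\t".join(str(value) for value in line))
```

Keying the merge on the taxon rather than on the read name is what lets OTUs from different samples, which carry different consensus read names, end up on the same line.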
A normalized Global BIOM file was then produced. Note that the taxonomic information disappears after this normalization step. The biom (v2.1.4) add-metadata command was used to add the taxonomic information back and obtain a fully annotated and normalized BIOM file (Data Citation 2). […]
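The two post-processing points of the normalization step (negative values set to zero, taxonomy restored afterwards) can be illustrated in plain Python. In the protocol these operations are carried out by the QIIME and biom tools. The sketch below does not use their APIs; it works on ordinary dictionaries, and the function names and data layout are hypothetical.

```python
def clip_negatives(normalized):
    """Replace negative normalized values with zero.

    `normalized` maps an OTU identifier to its list of per-sample values.
    """
    return {otu: [max(0.0, value) for value in values]
            for otu, values in normalized.items()}


def reattach_taxonomy(normalized, taxonomy):
    """Pair every normalized OTU with its taxon from the pre-normalization table.

    Raises KeyError if an OTU has no annotation, so that a silent loss of
    taxonomy cannot go unnoticed.
    """
    missing = [otu for otu in normalized if otu not in taxonomy]
    if missing:
        raise KeyError("no taxonomy for: " + ", ".join(missing))
    return [(otu, values, taxonomy[otu]) for otu, values in normalized.items()]


if __name__ == "__main__":
    # Toy values; real input would come from the normalized table.
    normalized = {"otu1": [-0.5, 2.0], "otu2": [1.5, 0.0]}
    taxonomy = {"otu1": "Bacteroides", "otu2": "Prevotella"}
    for row in reattach_taxonomy(clip_negatives(normalized), taxonomy):
        print(row)
```

Joining on the OTU identifier works only because normalization changes the values and leaves the identifiers untouched, which is the assumption made here.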

Pipeline specifications

Software tools: Galaxy, mothur, ESPRIT-Tree, DESeq2, QIIME
Application: 16S rRNA-seq analysis
Organisms: Homo sapiens
Diseases: Colonic Neoplasms