Computational protocol: Intestinal Bacterial Communities of Trypanosome-Infected and Uninfected Glossina palpalis palpalis from Three Human African Trypanomiasis Foci in Cameroon

Protocol publication

[…] Illumina MiSeq reads were analyzed using in-house pipelines (Richard Christen; described in Boissière et al.; Hartmann et al.; Gimonneau et al.; Massana et al.). Briefly, Silva 119 NR (analyses performed in 2014) was used as the reference database for taxonomic identification (Quast et al.). An in silico extraction of Silva amplicons using the forward primer 5′-CCTACGGGNGGCWGCAG-3′ and the reverse primer 5′-GACTACHVGGGTHTCTAATCC-3′, followed by an analysis by length/number of amplicons, yielded the following results: 1–50/0; 51–100/0; 101–150/2; 151–200/1; 201–250/1; 251–300/11; 301–350/40; 351–400/21,669; 401–450/448,575; 451–500/450; 501–550/192; 551–600/1,198; 601–650/0. This shows that most amplicons (over 99%) were between 351 and 450 nucleotides in length. The extracted database is referred to below as the refseq. Each sequence identifier was reformatted to eight taxonomic fields as in PR2 (Guillou et al.), which makes the sequences easier to process in an analysis pipeline.
To assemble paired-end reads, the software programs PEAR (Zhang et al.) and FLASH (Magoc and Salzberg) were both tested. PEAR with default parameters merged more pairs, so paired-end reads were assembled and quality filtered (using Illumina quality scores) with PEAR. Of the 3,842,672 total reads, 94.26% (3,622,233 reads) were assembled; 0.003% were discarded and 5.734% were not assembled.
Assembled reads were then sorted by length using a dedicated Python script, yielding the following results: 0–0/0; 1–50/3; 51–100/1,556; 101–150/1,435; 151–200/1,535; 201–250/1,578; 251–300/2,251; 301–350/9,401; 351–400/38,323; 401–450/132,487; 451–500/3,433,239; 501–550/381; 551–600/44; 601–650/0. Reads shorter than 300 nucleotides (0.23% of the assembled reads) were discarded, leaving 3,613,875 reads. The FASTA file was then demultiplexed using a dedicated C++ program.
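The length binning and the 300-nucleotide cutoff described above can be illustrated with a minimal Python sketch. This is not the authors' script, which is not shown in the source; the function names `length_bins`, `format_bins`, and `drop_short` are hypothetical, and the toy reads are invented for illustration. Whether reads of exactly 300 nucleotides were kept is not stated in the text, so the `>=` comparison below is an assumption.

```python
from collections import Counter


def length_bins(lengths, width=50):
    """Count lengths in bins 1-50, 51-100, ...; zero-length reads go to (0, 0)."""
    counts = Counter()
    for n in lengths:
        if n <= 0:
            counts[(0, 0)] += 1
        else:
            low = ((n - 1) // width) * width + 1
            counts[(low, low + width - 1)] += 1
    return dict(sorted(counts.items()))


def format_bins(bins):
    """Render bins in a 'low-high/count' style similar to the text."""
    return "; ".join(f"{lo}-{hi}/{n:,}" for (lo, hi), n in bins.items())


def drop_short(reads, min_len=300):
    """Discard reads shorter than min_len nucleotides (assumed: length 300 is kept)."""
    return [r for r in reads if len(r) >= min_len]


# Toy reads (hypothetical), not data from the study.
reads = ["A" * 120, "C" * 299, "G" * 300, "T" * 455, "A" * 460]
print(format_bins(length_bins(len(r) for r in reads)))
print(len(drop_short(reads)), "reads kept")
```

Binning with `(n - 1) // width` makes each bin closed on both ends (1 to 50, 51 to 100, and so on), which matches the ranges reported in the text.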
Primer trimming was performed using CutAdapt v1.8.1 (Martin).
Each file was dereplicated, sorted by decreasing abundance, chimera checked with UCHIME (Edgar et al.), and then clustered using Crunchclust (Mondani et al.; Gimonneau et al.; Massana et al.; Tchioffo et al.; available from; at a Levenshtein distance of 5). After this clustering step, clusters that contained <2 reads were discarded as artifacts (see Boissière et al.). These steps yielded 2,562,144 high-quality sequences in clusters of at least two reads (70.73% of the total assembled reads), which were subsequently used for taxonomic assignment. In each cluster, the most abundant sequence was kept as the representative, since it was assumed to contain the fewest errors.
Taxonomic assignment was done as in Pawlowski et al. Briefly, a Needleman–Wunsch algorithm was used to search the refseq for the 30 sequences most similar to each representative sequence. The reference sequences with the highest percentage similarity were then used, and taxonomy to a given level was obtained. When more than one result emerged, the two highest hits were reported. When similarity was <80%, sequences were not assigned. Abundance matrices were generated for statistical analyses at each taxonomic level. Several abundant Operational Taxonomic Units (OTUs) could not be identified satisfactorily down to the genus or species level. In these cases (rough estimate: 1%), read sequences and similar refseq sequences were selected and then aligned using ClustalO (Sievers et al.) and SeaView (Gouy et al.). Trees were plotted using TreeDyn (Chevenet et al.) or MetaPhlAn (Segata et al.), and distinct robust subtrees were annotated as distinct species whenever possible.
[...] Statistical analyses were performed using the R package vegan. Rarefaction curves (Figure ) were generated prior to comparative analyses between infected and uninfected flies, and between the sampling sites.
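To make the idea behind a rarefaction curve concrete, the following Python sketch computes the expected number of OTUs in a random subsample of a given depth, using the analytic expectation usually attributed to Hurlbert. The source states only that the R package vegan was used and does not name the function or settings, so this is an illustration of the concept rather than the authors' procedure; the function names and the OTU counts are hypothetical.

```python
from math import comb


def rarefied_richness(counts, depth):
    """Expected number of OTUs in a subsample of `depth` reads drawn without replacement."""
    total = sum(counts)
    if not 0 <= depth <= total:
        raise ValueError("depth must lie between 0 and the total read count")
    denom = comb(total, depth)
    # For each OTU, 1 - P(no read of that OTU is drawn).
    return sum(1 - comb(total - c, depth) / denom for c in counts if c > 0)


def rarefaction_curve(counts, step):
    """Expected richness at depths step, 2*step, ... up to the total read count."""
    total = sum(counts)
    return [(n, rarefied_richness(counts, n)) for n in range(step, total + 1, step)]


# Hypothetical OTU read counts for one sample, not data from the study.
otu_counts = [120, 40, 25, 10, 3, 1, 1]
for depth, richness in rarefaction_curve(otu_counts, step=50):
    print(depth, round(richness, 2))
```

A curve that flattens as depth increases suggests that additional sequencing would recover few new OTUs, which is why such curves are typically inspected before comparing richness between groups.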
Differences in bacterial richness between infected and uninfected flies, and among the three sampling sites, were tested for significance using the non-parametric Kruskal–Wallis test. Hierarchical clustering was performed with vegdist and hclust using the single, average, and complete linkage methods, and the resulting trees were compared for the presence of sub-trees. Nonmetric multidimensional scaling (nMDS) plots were generated using the R packages ggplot2 and phyloseq (McMurdie and Holmes). […]
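The Kruskal–Wallis comparison of richness between groups can be sketched in plain Python as follows. The analyses in the source were run in R, so this is only an illustration of the test statistic under stated assumptions: average ranks for ties, the standard tie correction, and the chi-square approximation for the p-value, restricted here to two or three groups (the cases relevant to infection status and the three sampling sites) so that closed-form tail probabilities can be used. The function name and the richness values are hypothetical.

```python
import math
from collections import Counter


def kruskal_wallis(*groups):
    """Return (H, p) for 2 or 3 groups, using the chi-square approximation."""
    k = len(groups)
    if k not in (2, 3) or any(len(g) == 0 for g in groups):
        raise ValueError("this sketch supports 2 or 3 non-empty groups")
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    rank_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

    # Tie correction: divide H by 1 - sum(t^3 - t) / (N^3 - N).
    ties = Counter(pooled).values()
    correction = 1 - sum(t**3 - t for t in ties) / (n_total**3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = max(h / correction, 0.0)

    # Chi-square upper tail: erfc(sqrt(H/2)) for 1 df, exp(-H/2) for 2 df.
    p = math.erfc(math.sqrt(h / 2)) if k == 2 else math.exp(-h / 2)
    return h, p


# Hypothetical OTU richness per fly at three sites, not data from the study.
site_a = [12, 15, 11, 14]
site_b = [18, 21, 17, 20]
site_c = [13, 16, 19, 22]
h_stat, p_value = kruskal_wallis(site_a, site_b, site_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

Because the test works on ranks rather than raw values, it makes no assumption that richness is normally distributed, which is the usual reason for preferring it over a one-way ANOVA with small, skewed samples.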

Pipeline specifications

Software tools: PEAR, cutadapt, UCHIME, SeaView, TreeDyn, MetaPhlAn, hclust, ggplot2, phyloseq
Applications: Miscellaneous, Phylogenetics, 16S rRNA-seq analysis
Diseases: Trypanosomiasis; Trypanosomiasis, African