Computational protocol: Conserved Units of Co-Expression in Bacterial Genomes: An Evolutionary Insight into Transcriptional Regulation

Protocol publication

[…] The number Mij(x) of genomes in which genes i and j are at normalized distance xij ≤ x is computed as Mij(x) = ∑g ωg 1(xij ≤ x), where 1(·) is the indicator function (we consider genomes where at least one gene is present, and we set xij = 1 if one of the two genes is missing). The genome weights are defined by ωg = 1/|{h: Dgh < δ}|, where |{h: Dgh < δ}| denotes the number of genomes h at phylogenetic distance less than δ from g. Here, we fix δ = 0.25, which is large enough to treat the different strains of the same species as equivalent (larger values of δ may reveal more conserved syntenic relations []). This weighting procedure defines an effective number of genomes M′ = ∑g ωg, with here M′ = 500. For the pair ij, we define the corresponding effective number of genomes, Mij′, by considering only the genomes where i and j are present (Mij′ ≤ M′).
We use a simple definition of evolutionary distance based on the sequence similarity of a few representative conserved genes (quantifying the phylogenetic distance between bacterial genomes is a notoriously difficult task, given that different genes in the same genome often have different histories []). Specifically, we selected the 10 genes associated with the COGs 126G, 173J, 202K, 2255L, 481M, 497L, 541U, 544O, 556L, 1158K. These genes were taken from a list of genes shown to reflect phylogenetic distances between bacterial strains [], with the additional constraint that they be present in a single copy in most of the genomes of our dataset. We aligned the amino acid sequences of these genes with MAFFT [] and defined the similarity between any two genes as the fraction of identical amino acids in the resulting multiple sequence alignment, excluding positions with gaps in the two genes. The evolutionary similarity Sgh between two strains g and h was obtained by averaging these similarities over the representative genes, taking into account only those genes present in single copy in both strains. We then defined an evolutionary distance between strains as Dgh = 1 − Sgh.
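The distance and weighting steps above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the dictionary-based data layout, and the toy sequences are hypothetical, and it assumes that a column is excluded whenever either of the two sequences has a gap (the text leaves this detail open).

```python
from itertools import combinations
from statistics import mean

GAP = "-"

def identity(a, b):
    """Fraction of identical residues between two aligned sequences.
    Assumption: columns where either sequence has a gap are skipped."""
    cols = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    return sum(x == y for x, y in cols) / len(cols) if cols else None

def distances(alignments):
    """alignments: one dict per marker gene, mapping genome id -> aligned
    sequence. A genome is left out of a dict when the gene is missing or
    not single-copy. Returns the genome ids and D[g, h] = 1 - S[g, h]."""
    genomes = sorted(set().union(*alignments))
    D = {(g, g): 0.0 for g in genomes}
    for g, h in combinations(genomes, 2):
        sims = [identity(aln[g], aln[h])
                for aln in alignments if g in aln and h in aln]
        sims = [s for s in sims if s is not None]
        if sims:  # pairs sharing no marker gene get no distance
            D[g, h] = D[h, g] = 1.0 - mean(sims)
    return genomes, D

def genome_weights(genomes, D, delta=0.25):
    """w_g = 1 / |{h : D_gh < delta}|; g always counts itself (D_gg = 0)."""
    return {g: 1.0 / sum(1 for h in genomes
                         if D.get((g, h), float("inf")) < delta)
            for g in genomes}

def weighted_count(x_ij, weights, x):
    """M_ij(x) = sum_g w_g * 1(x_ij <= x). x_ij maps genome id to the
    normalized distance (1 if one gene is missing); genomes where
    neither gene is present are simply absent from x_ij."""
    return sum(w for g, w in weights.items() if g in x_ij and x_ij[g] <= x)

# Toy example: A and B are near-identical strains, C is more distant.
toy = [{"A": "MKV-L", "B": "MKVAL", "C": "MRI-L"}]
genomes, D = distances(toy)
w = genome_weights(genomes, D)
effective_genomes = sum(w.values())  # M' for the toy data
```

In the toy data, A and B fall within δ of each other and therefore share one unit of weight, so the three genomes count as two effective genomes.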
We checked that this procedure yields a robust estimate of evolutionary distance by repeating the analysis with subsets of only 5 of the 10 genes and verifying that it leads to equivalent results (). [...]
To analyze transcription in non-coding, inter-operon regions of E. coli, we use RNA-seq data from [], which we retrieved in the form of .sra files. RNA reads were mapped to the genome of E. coli K12 MG1655 using bowtie2. The number of reads per bp was then computed as the genomic coverage of the data (using genomeCoverageBed with the flags “-d -split”), and the final expression levels were taken as the logarithm of the mean number of reads found in the regions of interest. We considered only datasets for which more than ~90% of the reads were uniquely mapped. Our results are averaged over 7 different conditions, corresponding to the following GEO accession numbers: GSM1104381 (sgrS- with vector), GSM1104384 (sgrS- with sgrS+ plasmid), GSM1104387 (WT in LB +αMG), GSM1104401 (WT in defined medium with glycerol +αMG), GSM1104402 (WT in defined medium with glycerol −αMG), GSM1104405 (sgrS- in defined medium with glycerol +αMG) and GSM1104408 (sgrS- in defined medium with glycerol −αMG). Analyzing inter-operonic transcription also requires identifying transcription start sites (TSSs). We retrieved TSS datasets from the most recent update of RegulonDB (Morett dataset []) and from the recent dataset of Palsson’s group []. We combined these two datasets into a single list of TSSs and considered operons for which the first gene had an associated TSS in the immediate upstream inter-operonic region. For genes with several potential TSSs in the inter-operonic region, we considered the closest upstream start site.
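As an illustration of the expression-level and TSS-selection steps, the following Python sketch works on per-base depth lines of the kind a per-position coverage report produces. The function names are hypothetical, the three-column layout (chromosome, 1-based position, depth) is an assumption about the input, and the natural logarithm is used only because the text does not state a base. Regions with zero coverage return None, which is also a choice made here, not one described in the protocol.

```python
import math

def read_depth(lines):
    """Parse whitespace-separated lines of (chromosome, 1-based position,
    depth) into a dict keyed by (chromosome, position)."""
    depth = {}
    for line in lines:
        chrom, pos, d = line.split()
        depth[(chrom, int(pos))] = float(d)
    return depth

def expression_level(depth, chrom, start, end):
    """Log of the mean per-base depth over [start, end] (inclusive,
    1-based). Positions absent from the table count as zero depth."""
    values = [depth.get((chrom, p), 0.0) for p in range(start, end + 1)]
    avg = sum(values) / len(values)
    return math.log(avg) if avg > 0 else None

def closest_upstream_tss(tss_positions, gene_start, strand,
                         region_start, region_end):
    """Among TSSs inside the inter-operonic region, return the one
    closest to the gene start on its upstream side, or None."""
    inside = [t for t in tss_positions if region_start <= t <= region_end]
    if strand == "+":
        # Upstream means smaller coordinates on the forward strand.
        return max((t for t in inside if t <= gene_start), default=None)
    # On the reverse strand, upstream means larger coordinates.
    return min((t for t in inside if t >= gene_start), default=None)

# Toy usage with three covered positions on a hypothetical contig "chr".
table = read_depth(["chr\t1\t2", "chr\t2\t4", "chr\t3\t0"])
level = expression_level(table, "chr", 1, 3)  # log of mean depth 2.0
```

Averaging over the 7 conditions then amounts to taking the mean of the per-condition levels for each region.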
To assess whether synteny segments display any specific inter-operon transcriptional activity between co-directional consecutive operons, we further limited biases from mis-annotations by considering only inter-operon regions larger than 100 bp, which corresponds in E. coli to 243 cases of co-directional consecutive operons (of which 29 are intra-segment pairs). Considering the 7 different RNA-seq conditions of E. coli, we thus analyzed 203 (29 × 7) situations inside the same segment and 1498 (214 × 7) situations outside segments.
To investigate the phenomenon of transcriptional read-through in B. subtilis, we analyzed the tendency of adjacent genes from different operons to belong to one of the transcriptional units identified by the BaSysBio consortium. These transcriptional units represent blocks of contiguous expression that often extend the known operons of B. subtilis []. […]
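The selection of co-directional consecutive operon pairs described for E. coli above could be sketched as follows. This is a hedged illustration, not the authors' pipeline: the Operon record, its field names, and the segment labels are hypothetical, and the sketch treats the genome as linear, ignoring the pair that wraps around the origin of a circular chromosome.

```python
from collections import namedtuple

# Hypothetical record: 1-based inclusive coordinates, strand "+" or "-",
# and a synteny-segment label (None when the operon is in no segment).
Operon = namedtuple("Operon", "name start end strand segment")

def codirectional_pairs(operons, min_gap=100):
    """Return (upstream name, downstream name, gap in bp, intra_segment)
    for consecutive operons on the same strand whose inter-operon
    region is strictly larger than min_gap."""
    ordered = sorted(operons, key=lambda o: o.start)
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        gap = b.start - a.end - 1  # bases strictly between the operons
        if a.strand == b.strand and gap > min_gap:
            intra = a.segment is not None and a.segment == b.segment
            pairs.append((a.name, b.name, gap, intra))
    return pairs

def count_situations(pairs, n_conditions=7):
    """Number of (pair, condition) situations inside and outside segments."""
    n_intra = sum(1 for p in pairs if p[3])
    return n_intra * n_conditions, (len(pairs) - n_intra) * n_conditions

# Toy usage with five hypothetical operons.
toy = [
    Operon("a", 1, 100, "+", "s1"),
    Operon("b", 250, 400, "+", "s1"),
    Operon("c", 450, 600, "+", None),
    Operon("d", 800, 900, "-", None),
    Operon("e", 1100, 1200, "-", None),
]
pairs = codirectional_pairs(toy)
inside, outside = count_situations(pairs)
```

With the counts reported in the text, the same arithmetic gives 29 × 7 = 203 situations inside segments and (243 − 29) × 7 = 1498 outside.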

Pipeline specifications

Software tools: MAFFT, Bowtie2
Applications: Phylogenetics, RNA-seq analysis
Organisms: Bacillus subtilis, Escherichia coli, Bacteria