Computational protocol: Open-Phylo: a customizable crowd-computing platform for multiple sequence alignment

Protocol publication

[…] Once uploaded to our system, levels (that is, casual puzzles of 20 nucleotides used in the classic/casual version of the game) are extracted from the MSAs submitted by Open-Phylo users. The protocol used to determine the levels automatically is:

1. Pan a reading frame of ten to twenty nucleotides across the sequences.
2. Calculate the number of nucleotides (without gaps) for each species. Then, compute the average and standard deviation.
3. Calculate the number of pairwise matches between nucleotides in columns (ignoring the tree structure). From this number, derive the ratio of matches to all possible pairwise comparisons within columns.
4. The level is accepted if the standard deviation in Step 2 is greater than 1, and the match ratio in Step 3 is between 0.32 and 0.38.
5. If accepted, the reading frame jumps past the current nucleotides (to prevent an overlap). Otherwise, it shifts by one position to the right.

In addition, users can also create their own levels through the Open-Phylo web interface. To do so, a user selects a region of the MSA (using the shift key) that is between ten and twenty columns wide. All non-empty rows (that is, rows with at least one nucleotide) are included in the new level. [...]

We evaluated Open-Phylo on MSAs of promoter regions of tumor suppressor genes: the P53 tumor suppressor protein, breast cancer type 1 susceptibility protein (BRCA1) and retinoblastoma protein (RB1). The sequences and initial Multiz alignments were downloaded from the UCSC Genome Browser []. These initial alignments were divided into smaller MSAs of 300 columns. Each of these MSAs was realigned with one of the four alignment programs used in this study (Multiz [], MUSCLE [], PRANK [] or T-Coffee []) using the default alignment settings. The realigned MSAs were the initial MSAs uploaded to the Open-Phylo web-user interface. All data (initial MSAs together with the MSAs improved with Open-Phylo) are available at []. [...]
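The five-step level-extraction protocol above can be sketched in a few lines of Python. This is an illustrative reading of the description, not the Open-Phylo implementation: the function names, the fixed `width` parameter, the use of the population standard deviation, and the choice to count every pair of rows in the denominator (with gap-gap pairs not counting as matches) are all assumptions.

```python
from itertools import combinations
from statistics import pstdev

GAP = "-"

def match_ratio(window):
    """Matching nucleotide pairs divided by all within-column row pairs.

    Assumption: every pair of rows counts as a comparison, and a pair
    only matches when both characters are the same nucleotide (not gaps).
    """
    matches = total = 0
    for column in zip(*window):
        for a, b in combinations(column, 2):
            total += 1
            if a == b and a != GAP:
                matches += 1
    return matches / total if total else 0.0

def window_accepted(window, min_std=1.0, low=0.32, high=0.38):
    """Steps 2 to 4: check nucleotide-count spread and match ratio."""
    counts = [sum(ch != GAP for ch in row) for row in window]
    # Assumption: population standard deviation over the species.
    return pstdev(counts) > min_std and low <= match_ratio(window) <= high

def extract_levels(msa, width=20):
    """Steps 1 and 5: slide a frame and return accepted (start, end) spans."""
    levels = []
    n_cols = len(msa[0])
    start = 0
    while start + width <= n_cols:
        window = [row[start:start + width] for row in msa]
        if window_accepted(window):
            levels.append((start, start + width))
            start += width  # jump past the level to prevent overlap
        else:
            start += 1      # otherwise shift one position to the right
    return levels
```

For example, `extract_levels(["AAAAAAAAAA", "AAAAACCCCC", "---CCCCCCC"], width=10)` accepts the single window: the per-row nucleotide counts are 10, 10 and 7, and 10 of the 30 within-column pairs match.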
The casual and expert versions of the video game Phylo use the same scoring scheme. This is a simplified version of more realistic objective functions used to estimate the quality of an MSA. In our case, the scoring scheme for a given puzzle alignment must be evolutionarily realistic while being intuitive and fast to compute (as it is recomputed in real time every time the player modifies the alignment).

We made minor modifications to the scoring scheme to improve on that used in the first version of the casual game []. The Phylo interface displays a simplified and entertaining representation of an MSA instance with its associated phylogenetic tree. Each nucleotide is represented with a brick whose color indicates its type (adenine, cytosine, guanine or thymine). To evaluate a given alignment, the game infers ancestral nucleotides or gaps at each ancestral node of the phylogenetic tree using a maximum parsimony approach (the Fitch algorithm []), considering a gap as a fifth character, independently for each position. The scores for induced pairwise alignments, each evaluated using an affine gap cost model, are summed over all edges of the tree. To make the scoring intuitive, our scheme uses integer values (the score for a match is +1, for a mismatch -1, for a gap opening -4 and for a gap extension -1), which approximate those used by BLASTZ []. Compared to the value used in the original casual Phylo game [], the gap opening penalty has been reduced in our new implementation. This change allows gamers to accommodate more gaps and it makes the game more entertaining while keeping the scoring realistic.

Because it infers ancestral nucleotides independently at each position, the original Fitch algorithm is not designed to accommodate an affine gap penalty model and may result in sub-optimal ancestral sequences, which would yield a pessimistic alignment evaluation. However, exact algorithms or better approximations are computationally more expensive [,], and we considered that the simplicity of our scoring method and its speed largely compensate for the slight accuracy loss. Nonetheless, we addressed this issue in the expert version and enabled users to modify the ancestor sequences (see section 'Advanced player (expert) version'). Therefore, advanced players can improve sub-optimal ancestors calculated by the game, and identify good MSAs that would be missed by the classical scoring algorithm.

Finally, our new version of Phylo also ignores gaps at the beginning and the end of each pairwise alignment. This modification enabled us to counter a basic strategy used in the first version of the casual Phylo game [], which consisted of pushing all sequences to the left (or right) to minimize the number of gaps. While solutions using this technique often improve the score of the initial casual puzzle within the game, they rarely improve complete MSAs using more realistic objective functions. This new feature also made the game more challenging and thus entertaining. […]
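The scoring scheme described above (per-column Fitch parsimony with the gap as a fifth character, then affine-gap scores of the induced pairwise alignments summed over tree edges, with terminal gaps ignored) can be sketched as follows. This is a minimal illustration under stated assumptions, not the game's code: the nested-tuple tree encoding, all function names, the tie-breaking rule for ambiguous ancestral states, the skipping of columns where both sequences have a gap, and the convention that the first column of a gap costs the opening score and each further column the extension score are hypothetical choices.

```python
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 1, -1, -4, -1
GAP = "-"

def fitch_sets(node, column, sets):
    """Bottom-up pass: candidate states for one alignment column."""
    if isinstance(node, str):               # leaf, identified by its name
        sets[node] = {column[node]}
    else:
        a, b = (fitch_sets(child, column, sets) for child in node)
        sets[node] = (a & b) or (a | b)     # intersection if non-empty, else union
    return sets[node]

def assign(node, parent_state, sets, states):
    """Top-down pass: keep the parent's state when possible.

    Assumption: remaining ties are broken by taking the smallest character.
    """
    options = sets[node]
    states[node] = parent_state if parent_state in options else min(options)
    if not isinstance(node, str):
        for child in node:
            assign(child, states[node], sets, states)

def infer_ancestors(tree, leaves):
    """Return {node: sequence}, inferring each column independently."""
    length = len(next(iter(leaves.values())))
    seqs = {}
    for i in range(length):
        sets, states = {}, {}
        fitch_sets(tree, {name: seq[i] for name, seq in leaves.items()}, sets)
        assign(tree, None, sets, states)
        for node, state in states.items():
            seqs.setdefault(node, []).append(state)
    return {node: "".join(chars) for node, chars in seqs.items()}

def edge_score(parent, child):
    """Affine-gap score of one induced pairwise alignment."""
    # Assumption: columns where both sequences have a gap are skipped.
    cols = [(p, c) for p, c in zip(parent, child) if (p, c) != (GAP, GAP)]
    while cols and GAP in cols[0]:          # ignore gaps at the beginning
        cols.pop(0)
    while cols and GAP in cols[-1]:         # ignore gaps at the end
        cols.pop()
    score, open_side = 0, None
    for p, c in cols:
        if GAP in (p, c):
            side = "parent" if p == GAP else "child"
            score += GAP_EXTEND if side == open_side else GAP_OPEN
            open_side = side
        else:
            score += MATCH if p == c else MISMATCH
            open_side = None
    return score

def tree_score(tree, leaves):
    """Sum the edge scores over every parent-child edge of the tree."""
    seqs = infer_ancestors(tree, leaves)
    total, stack = 0, [tree]
    while stack:
        node = stack.pop()
        if not isinstance(node, str):
            for child in node:
                total += edge_score(seqs[node], seqs[child])
                stack.append(child)
    return total
```

A tree is written here as nested pairs of leaf names, for example `(("a", "b"), "c")`, and `leaves` maps each name to its aligned row. Because each edge is scored independently and the ancestors are recomputed per column, the whole evaluation is linear in the alignment size, which fits the requirement that the score be recomputed on every player move.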

Pipeline specifications