Discovery of patterns (motifs) in biological sequences

Introduction


The Motif Discovery Problem (MDP) is an important biological problem whose main objective is to discover over-represented short patterns, named motifs, in a set of sequences that may have some biological significance. More concretely, a motif is defined as a sequence pattern that has some biological relevance, such as being DNA binding sites for a protein. Motifs are mixed with a large amount of biological information in sequences; therefore, discovering them is not a trivial task. MDP tries to solve optimally the problem of finding motifs, applied to the specific task of discovering novel Transcription Factor Binding Sites (TFBS) in sequences. Tools for the assessment of similarity and the identification of related sequences are among the most important tools in bioinformatics as they may serve for the functional categorization of novel genes or proteins of unknown function. To discover motifs with biological significance we must fulfil specific objectives while satisfying a variety of constraints. Thus, we proceed proposing a multiobjective formulation adequately adapted according to the real-world biological requirements, defining three conflicting objectives to be maximized: motif length, support, and similarity.

 

Motif Discovery Problem


Given a set of sequences S = { S_i | i = 1,2,...,D } defined on the alphabet B. S_i = { S_i^j | j = 1,2,...,w_i } is a sequence, where w_i is the sequence width. The set of all the subsequences contained in S is {s_i^{j_i} | i = 1,2,...,D, j_i = 1,2,...,w_i - l + 1 }, where j_i is the binding site of a possible motif instance s_i^j on sequence S_i, and l is the motif length, the first objective to be maximized. To obtain the values of the other two objectives we have to build the Position Indicator Matrix (PIM) A = { A_i | i = 1,2,...,D } of the motif, where A_i = { A_i^j | j = 1,2,...,w_i } is the indicator row vector with respect to a sequence S_i. A_i^j is 1 if the position j in S_i is a binding site, and 0 otherwise. We refer to the number of motif instances as |A| = \sum_{i=1}^D{\sum_{j=1}^{w_i}{A_i^j}}. We also require to find the consensus motif, which is a string abstraction of the motif instances. In this work we consider a single motif instance per sequence. Only those sequences that achieve a motif instance of certain quality with respect to the consensus motif were taken into account when we perform the final motif. This is indicated by the second objective, the support.

Furthermore, S(A) = {S(A)_1,S(A)_2,...,S(A)_{|A|}} is a set of |A| motif instances, where S(A)_i = S(A)_i^1S(A)_i^2...S(A)_i^l is the ith motif instance in |A|. S(A) can also be expanded as (S(A)^1,S(A)^2,...,S(A)^l), where S(A)^j = S(A)_1^jS(A)_2^j...S(A)_{|A|}^j is the list of characters on the jth position in the motif instances.

Then, we build the Position Count Matrix (PCM) N(A) with the numbers of different characters on each position of the candidate motifs (A) which have passed the threshold marked by the support. N(A) = { N(A)^1,N(A)^2,...,N(A)^l }, and N(A)^j = { N(A)_b^j | b \in{B} }, where N(A)_b^j = |{ S(A)_i^j | S(A)_i^j = b}|. The dominant characters of each position are normalized in the Position Frequency Matrix (PFM) N = \frac{N(A)}{|A|}. Finally, we calculate the third objective, the similarity, averaging all the dominance values of each PFM column, as is indicated in the following expression:

where f(b,i) is the score of character b in column i in the PFM and max_b{f(b,i)} is the dominance value of the dominant character in column i.

To guide the pattern search to solutions that have biological relevance, we have incorporated several constraints that should be satisfied by each solution. In motif discovery, motifs are usually very short, so that, finding long motifs means to lose a great computational time. To avoid this, we have restricted the motif length to the range [6,64], where the minimum is 6 and the maximum is 64. In the second objective we have also set a minimum support value of 2 for the motifs of the data sets composed by 4 or less sequences, and of 3 for the other ones (more than 4 sequences). Normally the binding sites are composed by motifs of all or nearly all sequences, and without this constraint is very easy to predict motifs with a high similarity (even 100%) formed, for example, by candidates of only one sequence. Finally, we have applied the complexity concept. The complexity of the candidate motifs should be considered in order to avoid low complexity solutions, for example, the two candidate motifs 'AAAA' and 'AAAA' are very similar, in fact they are identical, but it is not a meaningful motif. The average complexity for a final motif represents the total complexity score for each candidate motif. We calculate the complexity of a candidate motif by using the following expression:

where N = 20 for protein sequences and N = 4 for DNA sequences, l is the motif length, and n_i is the number of characters of type i in B. For example, if we consider the motif 'AAAA' we will obtain a minimum complexity. Otherwise, if we have, for example, the 'AQEW' motif we will obtain a higher complexity. As we can see in the equation, if we do not normalize the complexities when we compare motifs, the maximum complexity was highly dependent of the motif length. The compositional complexity calculation was revised such that the maximum possible complexity score is calculated for each possible motif length prior to the evolutionary computation. During evolution, each complexity score was rescaled between [0,1] where the maximum possible complexity score was 1. This removed any potential bias in complexity relative to the motif length.

 

Example


For simplicity, the following example is based on DNA sequences. Table 1 illustrates an artificial MDP with motif length = 7. By using the motif instances shown in Table 1a and 1c, we obtain the consensus motif A[GT]TTGAA. Due to there is a tie in the second position of the motif, for this position, we select one of the winner nucleotides randomly, in this case we have chosen the nucleotide T. With this consensus motif, in Table 1d we can calculate the value of the second objective. Those sequences whose motif instances exceed a threshold value of concordance of 50% will be taken into account in the support, in this example we have support = 7. The last step is to build the PCM and the PFM by using the nucleotides of the motif instances that have passed the concordance threshold. Having done that, we can obtain the value of the similarity, applying the Similarity equation. In this example we obtain similarity = 0.65.

Table 1.- An artificial motif discovery problem

 

References


  1. "Searching for Common Patterns on Protein Sequences by means of a Parallel Hybrid Honey-Bee Mating Optimization Algorithm". David L. González-Álvarez, Miguel A. Vega-Rodríguez, Álvaro Rubio-Largo. Parallel Computing, Volume 76, Elsevier Science, Amsterdam, The Netherlands, 2018, pp. 1-17, ISSN: 0167-8191. DOI: 10.1016/j.parco.2018.04.001. (Impact factor = 1.281, Quartile Q3)
  2. "A Hybrid MPI/OpenMP Parallel Implementation of NSGA-II for Finding Patterns in Protein Sequences". David L. González-Álvarez, Miguel A. Vega-Rodríguez, Álvaro Rubio-Largo. Journal of Supercomputing, Volume 73, Issue 6, Springer, Dordrecht, The Netherlands, 2017, pp. 2285-2312, ISSN: 0920-8542. DOI: 10.1007/s11227-016-1916-3. (Impact factor = 1.532, Quartile Q2)