Data Augmentation Algorithms for Detecting Conserved Domains in

Dec 15, 2007 - Protein Sequences: A Comparative Study. Chengpeng Bi. Bioinformatics and Intelligent Computing Lab, Children's Mercy Hospitals and Clin...
0 downloads 0 Views 3MB Size
Data Augmentation Algorithms for Detecting Conserved Domains in Protein Sequences: A Comparative Study Chengpeng Bi Bioinformatics and Intelligent Computing Lab, Children’s Mercy Hospitals and Clinics Schools of Medicine, Computing and Engineering University of Missouri, Kansas City, Missouri 64108 Received July 27, 2007

Protein conserved domains are distinct units of molecular structure, usually associated with particular aspects of molecular function such as catalysis or binding. These conserved subsequences are often unobserved and thus in need of detection. Motif discovery methods can be used to find these unobserved domains given a set of sequences. This paper presents the data augmentation (DA) framework that unifies a suite of motif-finding algorithms through maximizing the same likelihood function by imputing the unobserved data. The data augmentation refers to those methods that formulate iterative optimization by exploiting the unobserved data. Two categories of maximum likelihood based motif-finding algorithms are illustrated under the DA framework. The first is the deterministic algorithms that are to maximize the likelihood function by performing an iteratively optimal local search in the alignment space. The second is the stochastic algorithms that are to iteratively draw motif location samples via Monte Carlo simulation and simultaneously keep track of the superior solution with the best likelihood. As a result, four DA motif discovery algorithms are described, evaluated, and compared by aligning real and simulated protein sequences. Keywords: data augmentation • expectation maximization (EM) • Gibbs sampling • Markov chain Monte Carlo • motif discovery • multiple local alignment • protein sequence analysis

1. Introduction Conserved domain detection by sequence alignment methods often gives the first clues about the protein structure and function. For example, the recognition of weakly conserved domains or motifs among distantly related protein sequences can be of great importance because such regions usually correspond to structural or functional core residues.1,2 They are distinct units of molecular structure, usually associated with particular aspects of molecular function such as catalysis or binding. The exact sequence locations of these domains are commonly unrevealed or unobserved, but they represent discrete units of three-dimensional (3D) structure. Therefore, identification of these conserved domains (or motifs) in protein sequences can give insight into understanding molecular structure and function. It can also help classify protein superfamilies. As a result, motif discovery algorithms2 as well as multiple sequence alignment methods have been widely deployed in bioinformatics, from determination and classification of sequence families to identification of conserved domains for sustaining protein structural or functional prediction (see reviews in refs 3-5). In principle, given a set of related protein sequences, a local alignment program identifies and organizes regions that are over-represented in the input sequences and produces local alignments of these regions. In practice, alignment profiles of conserved domains have been useful for detecting very distant relationships and, when compiled into profile databases, can be useful for scanning new sequences for motifs. There are various sequence homologies among 192 Journal of Proteome Research 2008, 7, 192–201 Published on Web 12/15/2007

related proteins: some protein families share global sequence similarity, whereas others may have short homologous regions embedded in longer regions of no conservation.6 Pairwise and multiple sequence alignment (MSA) methods6 have been used to detect global similarity among closely related sequences. However, as protein sequences accumulate, more short and subtle domains need identification with more efficient alignment methods.2 Motif discovery methods based on multiple local alignment have been developed to meet this challenge.1,2 There are two classes of methods developed to efficiently tackle the conserved domain detection or motif discovery problem: (1) block-based models including expectation maximization (EM)7–9 and Gibbs sampling,10,1,11 and (2) gap-based methods in the form of hidden Markov models (HMM).12 In addition, word enumerative methods are also well developed to tackle the motif problem, but they are not strictly rooted on the multiple local alignment and thus are not discussed in this paper. A comprehensive overview of motif algorithms can be found in refs 13 and 14. A large conserved domain may be composed of several discontinuous subunits or motifs, each of which is presumably a contiguous subsequence without indel (i.e., insert/deletion), and thus HMM models are not covered here. An introduction to the MSA and HMM methods in sequence analysis is nicely treated in the book by Durbin et al.,12 and a recent survey of the MSA algorithms and relevant software packages is given in ref 15. It has been observed that block- and gap-based modeling techniques are correlated.16 Notice that in the block-based model10.1021/pr070475q CCC: $40.75

 2008 American Chemical Society

research articles

Algorithms to Detect Conserved Domains in Protein Sequences ing methods both EM and Gibbs methods are based on a common statistical foundation1 that was less explicitly investigated. This paper builds such a foundation on the data augmentation framework which unifies the existing motiffinding algorithms and facilitates new extensions as well. In the literature, the EM algorithms are often used together with maximum log-likelihood modeling,8,9 whereas Gibbs samplings are ordinarily integrated with the Bayesian computation.11 It is thus hard to compare these algorithms. This paper uses data augmentation17 as the framework and then develops motif discovery algorithms based on the common statistical foundation to maximize the log-likelihood function by imputing the unobserved data. It aims to evaluate and compare the performance of different optimization schemes or algorithms based on the common objective function. The term data augmentation (DA) refers to methods for constructing iterative optimization via the introduction of unobserved or missing data.17 For multiple sequence local alignment of conserved domains, the domain locations are such unobserved or missing data. There are two categories of optimization algorithms under the data augmentation framework. The first is the deterministic algorithms including EM and other greedy heuristic methods that optimize a specified objective function by iteratively performing a local greedy search in the alignment space. The second is the stochastic algorithms such as the Gibbs sampling1,10,11 and Metropolis methods18,19 that repeatedly draw motif location samples according to a conditional probability distribution via Monte Carlo simulation. The article is organized as follows. Section 2 briefly outlines the data augmentation algorithms. Section 3 introduces the basics of motif discovery and multiple sequence local alignment. Section 4 is devoted to developing four schemes for motif discovery algorithms under the data augmentation framework. Section 5 evaluates and compares these algorithms through experimental examples. Section 6 concludes the paper with discussion.

maximize the expected log-likelihood. In practice, one can improve the log-likelihood iteratively instead of maximizing it at each iteration (t). This partially implemented M step or generalized EM (GEM) tactic has proven convergence7 and possesses a great advantage over the standard EM; that is, the capability of addressing the difficulty when computing the M step is intractable. One such generalized EM algorithm is the expectation/conditional maximization (CM), or ECM for short,22 that takes advantage of the simplicity of complete data conditional maximum likelihood estimation by replacing a complicated M step of EM with several computationally simpler CM steps. On the other hand, partial implementation of the E step is also inevitable in many cases because high dimensional integration is often hard to directly determine. Monte Carlo simulation is regularly used to draw samples based on a conditional density distribution of an unobserved variable and then forms psuedo-complete data to compute the E step. The Monte Carlo versions of the above E step can spawn several other DA algorithms such as Gibbs sampling10 and Metropolis.18 They can be treated as variants of the more general Monte Carlo based EM strategies.23,24 This paper only focuses on Gibbs and Metropolis. One can refer to the book by Liu25 for further reading on the Monte Carlo techniques. Neal and Hinton26 showed that incremental updating of one unobserved variable at one time and immediate re-estimation of the parameters for the next unobserved variable will overcome the difficulty in high dimensional sampling and eventually speed up the convergence. The winner-take-all EM algorithm (WEM) is one of such incremental EM variants. The WEM algorithm can be seen as a deterministic version of the Gibbs sampler; that is, it iteratively draws a sample with the maximum conditional probability until the log-likelihood converges. Therefore, WEM should converge faster than EM and the Gibbs sampler. A perfect archetype of the WEM spirit is the well-known k-means clustering algorithm which can be viewed as an incremental WEM algorithm as applied to the Gaussian mixture models.27 The WEM strategy is often used to estimate hidden Markov models for speech recognition.26

2. Data Augmentation Algorithms

3. Motif Discovery Problem 17

According to van Dyk and Meng, the data augmentation (DA) algorithms cover those which establish iterative optimization algorithms by introducing unobserved data. The unobserved data are often alternatively referred to as missing data. They are interchangeably used throughout this paper. By this definition, the expectation maximization (EM) algorithm is a deterministic version of the DA algorithms and performs maximization of a likelihood function by iteratively updating the conditional density functions of the unobserved variables and then estimating the parameters of interest. There are various versions of stochastic DA algorithms such as Metropolis18 and Gibbs,10 but the stochastic method was popularized in the statistical literature of Tanner and Wong’s data augmentation algorithm20 for posterior sampling and in the physics literature of Swendsen and Wang’s algorithm21 for sampling from the Ising and Potts models and their generalizations (they used a different term for unobserved data, i.e., latent variables). Note that the term DA adopted by van Dyk and Meng is borrowed from Tanner and Wong,20 but it has been generalized and used to cover all algorithms (both deterministic and stochastic) applying the missing data strategy.17 The EM algorithm starts with an initial seed parameter (θf(0)) and iterates the expectation (E) and maximization (M) steps. The E step computes the expected log-likelihood given the conditional distribution of missing data. The M step updates the current parameter to

3.1. Defining a Motif. The motif discovery problem is simply defined as finding some recurrent short sequence patterns or motifs that are likely embedded in a given set of biologically related sequences (S). A motif (i.e., conserved pattern) is often represented as either a consensus sequence or a positionspecific weight matrix (PWM).28 The consensus sequence characterizes a motif by the most frequent residue at each position. The PWM characterizes a motif by a table or matrix, counting the frequencies of each residue type at each position (Figure 1). The frequencies in a PWM can be summarized in two ways: (1) computing their observed frequencies via dividing the number of residue types (k) at each position j (cjk) from an alignment matrix (Figure 1C) by the number of sites (N), i.e., θjk ) cjk/N; (2) taking the logarithm of the likelihood ratio between the observed frequency (θjk) and the background frequency (θ0k) to form an information weight matrix (I), i.e., Ijk ) log(θjk/θ0k). Here, the background frequency is the frequency of observing a residue in the given sequence data set or user-specified otherwise. Note that the background sequence is often assumed from the zero-order Markov chain; however, a higher order Markov chain can also be allowed. A psuedo-count (β) is often added to the observed frequencies to avoid illegal zero-logarithm due to a small sample. For a protein sequence, a good estimate1 is β ) √N ⁄ |K|, and thus the adjusted frequency is computed as θjk ) (cjk + β)/(N + β|K|), Journal of Proteome Research • Vol. 7, No. 01, 2008 193

research articles

Bi

Figure 1. Sequence alignment and motif representation. (A) A collection of conserved domains or motif sites aligned using 20 annotated helix-turn-helix sequences. Only part of the conserved domain is calculated and displayed in B-D. Notice that the leading digital numbers are the starting positions of the whole aligned subsequences. The highlighted motif starts from +11 and runs only 13 residue long. (B) A protein sequence logo visualizing the conserved domain plotted using WebLogo (http://weblogo.berkeley.edu). (C) The alignment matrix derived from (A) by counting each residue type at each column (i.e., a motif position). The cells represent the number ^ , of residue k at position j of of times a residue k is observed at position j in the alignment of sites. The frequencies or probabilities, θ jk ^ ) 0, the motif sites can be obtained by dividing the values in the cells of the alignment matrix by the total number of sites (20), e.g., θ 1A ^ ) 1⁄10,θ ^ ) 1⁄5, and so on. (D) A consensus sequence is derived from (C) by taking the highest frequent residue type as the θ 1I 1V consensus letter at each column in the motif table.

where K is the sequence alphabet, e.g., for the protein sequence K ) {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V}, and 194

Journal of Proteome Research • Vol. 7, No. 01, 2008

|K| is the number of letters in an alphabet (e.g., |K| ) 20 for protein sequences). If the positions within a conserved domain

research articles

Algorithms to Detect Conserved Domains in Protein Sequences are assumed to be independent, the alignment matrix can be modeled as a product multinomial (PM); i.e., the observed residue counts (cjk) at each position j are assumed to follow an independent multinomial distribution M(θfj), where M(θfj) ) (θjA, θjR,...,θjH,...,θjV) is the probability that each of the 20 residue types occurs at the jth position of a conserved pattern or motif. Therefore, a w residue motif sequence (S) can be thought of as drawing a sample from a product of multinomial distributions (PM): (s1,· · ·, sw)∼PM(Θ), where Θ ) [θf1,...,θfj,...,θfw]. 3.2. De Novo Motif-Finding Using Multiple Local Alignment. Multiple local alignment is the most widely used method to locate over-represented sites in a given sequence set. The aligned over-represented sites are then used to build a frequency matrix that depicts a conserved domain or motif (Figure 1). Let S ) {S1,· · ·, Si,· · ·, SN}denote the sequence data set. Let Li be the length of the sequence i (Si) and Sij denote a residue symbol taking on a value in the alphabet K. If only one motif per sequence (i.e., oops model) is assumed, there are N motifs in total for N sequences. A zero or one motif per sequence (i.e., zoops) model is also frequently used. Nonetheless, both oops and zoops models assume that sequence data come from a two-component multinomial mixture model: background sequence model (θf0) and w-mer motif model (Θ ) [θf1,...,θfw]). To simplify notation, Θ is used to contain the background parameters (θf0) in the following derivation. Let Ai be an indicator variable drawn from the location space {0, 1}(Li-w+1) of sequence i, A ) [A1,· · ·, Ai,· · ·, AN]T be the set of indicator variables representing the motif start sites (i.e., a local alignment) in the sequences, and w be the motif width. Intuitively, the total number of local alignments (V) is LN (L is the average sequence length). Alternatively, an indicator variable ai ) l is used to represent the motif starting at position l on sequence i, which is equivalent to Ail ) 1. Note that ai ) 0 means that no motifs were found on sequence i. The alignment of motif sites is randomly initialized (A(0)), and then it is progressively refined until convergence. The alignment problem has proven NP-complete.30

4. Motif Discovery via DA Algorithms The goal is to perform local alignment using only sequence information and therefore to find the conserved patterns in the aligned sequences. Although many endeavors have been attempted and numerous algorithms have been developed in the past two decades,13,19 the motif discovery problem still remains challenging because motifs are typically short, degenerate, and obey few rules.31 The EM-based motif discovery algorithm was first developed using position weight matrixbased statistical modeling by Lawrence and Reilly8 that has spawned a number of other variants.9,19,24,32,33 This methodology has been generalized to one of the most popular motif discovery software packages called MEME.9 This section gives a concise description of a standard EM motif discovery algorithm and then extends its statistical formulation to other DA algorithms. 4.1. EM Motif Algorithm. Since the motif locations (A) are unobserved, the EM algorithm is used to estimate the model parameters given the observed sequences (S). The full data for the motif sequence model are (S, A) ) {(Si, Ai): i ) {1,· · ·, N}}. Given a motif location (ai), the conditional likelihood of sequence i is a product of two components: probability of background subsequences (i.e., nonsites) and probability of motif subsequences (motif site starting at ai). EM starts from

(0)

an initial alignment (A ) that is randomly generated. Note that an alignment (A(t)) can be uniquely mapped onto a model estimate (Θ(t)), but the reversal may be not true. Let Θ(t) be the parameter estimates after the t-th iteration. Then the EMbased motif-finding algorithms maximize the log-likelihood objective function (h) by iteratively performing the expectation (E) and maximization (M) steps.19,9,32 The E step calculates a site (ai) conditional probability19 as follows w

π(ai ) l|Si, Θ(t)) ∝

K

∏∏ j)1 k)1

( ) θ(t) jk

I(Si, l+j-1)k)

(1)

(t) θ0k

where I( · ) is the indicator function. The M step updates the parameter (Θ(t)) by averaging across all potential sites.19 The EM algorithm first randomly initializes an alignment (A(0)) and then merely iterates E and M steps a number of times until it converges. The objective function (h) maximizes the loglikelihood which is given by w

∑ ∑ θ^

h(A(t)) ) h(Θ(t)) ) |A|

(t) jk

^(t)) log(θ jk

(2)

j)0 k∈K

The above EM motif discovery algorithm has been implemented in C++, and it is termed as deterministic DA-EM (or DEM) which is comparable to other EM algorithms in performance.19 DA-EM lays a foundation for other DA algorithms. 4.2. WEM Motif Algorithm. The WEM algorithm initializes an alignment (A(0)) like EM and then iteratively performs socalled partial winning (W) and update (U) steps until the objective function (h) converges. In the W step, one can sequentially take a sequence and scan it by the target density function (π). A winner is the site with the largest probability. A partial W step is immediately followed by a partial U step. The partial U step plainly adjusts residues in the current PWM matrix via a self-tuning function (λ) that adjusts the residue type changes at the same motif location between two partial alignments of adjacent epochs (detailed in the Supporting Information). For example, in the previous alignment of the domain position j, sequence i contributes a residue type “R” (i.e., arginine), and supposing that in the current alignment it changes to a type “W” (i.e., tryptophan), the tuning function decreases the “R” count by 1 followed by an increase of “W” by 1 at the same location. Therefore, each partial update costs a linear time, O(L + w), that includes locating the winning position in O(L) time. Here W is the domain/motif width (it is often short), but a sequence can be very long (i.e., L is large). The WEM’s update is more efficient compared to the standard EM overall averaging update in O(wL) time.19 Note that the WEM motif algorithm was recently implemented, and here it is named DA-WEM. 4.3. Metropolis Motif Algorithm. It starts with any alignment A(0) by randomly initializing a seed, and then the Metropolis algorithm iterates the following two steps: First draw a new alignment (A′) by a series of independent random samplings according to eq 1; calculate the change ∆h ) h(A′) - h(A(t)); and compute the acceptance rate (RM) as RM(A′, A(t)) ) min{1, exp(∆h)}

(3)

Second, one decides by tossing a Bernoulli coin with a probability of RM coming up heads: if heads shows up, accept the new move A(t+1) ) A′; otherwise keep it as it was A(t+1) ) A(t). The above algorithm has been previously implemented in C++19 and is named here as GA-Metropolis. Journal of Proteome Research • Vol. 7, No. 01, 2008 195

research articles 4.4. Gibbs Motif Algorithm. The Gibbs sampler first initializes A(0) in the same way as Metropolis, then systematically chooses a sequence i from (1, ..., N), and then updates it with a new sample drawn from the site-conditional density distribution π( · ) that can be approximated by eq 1. A slight difference is that the PWM model Θ(t) is built from an alignment excluding sequence i. Consequently, a little extra work is put into removing the motif sites on sequence i from the current alignment A(t) first and then rebuilding a new PWM model. Equation 1 is used to scan the sequence i. The following two steps are repeatedly performed: (i) draw a new sample from π( · ), and (ii) update the PWM model (Θ(t)). As a special case of Metropolis, the Gibbs sampler indeed constructs a Markov chain whose equilibrium distribution is π. The motif optimization problem is solved throughout the Monte Carlo simulation inasmuch as all the best solutions are kept during the sampling process. Like other DA algorithms presented here, the Gibbs sampler outputs the top n motif solutions. The Gibbs motif sampler here is called DA-Gibbs. 4.5. Computational Complexity. Because four DA motiffinding algorithms discussed above are implemented in the same framework, they can be directly compared in running time. For one iteration, all DA algorithms share the same scanning time in O(LN), where L is the sequence length on average. However, they vary in the update step. DA-EM takes O(wLN) time to update a PWM table, whereas DA-WEM takes only O(wN) + O(LN). If efficiently implemented, both DAMetropolis and DA-Gibbs require O(LN) time to draw new samples in one iteration and O(|K|w) time to update a PWM table. Notice that DA-Gibbs needs a little more time to keep track of the old table. DA-EM and DA-WEM can automatically determine their iterations. However, a prerequisite for DAGibbs and DA-Metropolis algorithms is to specify how many iterations are needed in a run. More iterations are often prescribed for the stochastic DAs than their deterministic counterparts. In reality, DA-EM and DA-WEM converge much faster than the stochastic DAs.

5. Experimental Results The DA algorithms are first applied to a case study on HTH_ICLR domains using 92 sequences. Simulated data sets are generated to examine the relationship between the sequence length and algorithm accuracy. Finally, benchmark data sets are used to corroborate the DA’s capability to detect conserved domains and to compare with other gap-based MSA methods. 5.1. Case Study on HTH_ICLR Domains. The helix-turnhelix (HTH) is composed of two almost perpendicular R helices linked by a several-residue β turn. The HTH motif is a common recognition element used by transcription regulators in prokaryotes and eukaryotes. Many bacterial transcription regulators which bind DNA through a HTH motif can be classified into subfamilies on the basis of sequence similarities. IclR is one of the subfamilies, and it is the repressor of the acetate operon in bacteria.34 The NCBI conserved domain database (CDD) stores 94 IclR sequences with conserved HTH_ICLR domains (conserved domains ID: pfam09339.1).35 However, there are two protein sequences containing some unknown amino acids, and thus these sequences are removed from the data set. The length of the 92 available HTH_ICLR sequences ranges from 219 to 572 amino acids, and the total size is 25 107 residues. 5.1.1. Running Time Comparison. The protein data set used to test the running time is extracted from the NCBI conserved 196

Journal of Proteome Research • Vol. 7, No. 01, 2008

Bi 35

domain database (CDD) of IclR sequences. The core motif width is specified as 29 residues long in the run-time testing. DA motif-finding algorithms set their maximum iterations of 100 per run, and each program runs with 100 different restarts using the 92 sequences. The results show that DA-WEM is the fastest one with a running time at 0.08 s per run. DA-EM is the second, taking about 0.44 s per run. DA-Gibbs and DAMetropolis are slower than their deterministic counterparts as discussed: DA-Gibbs consumes 2.95 s per run, whereas Metropolis takes 2.57 s per run. Section 5.2 gives a more detailed comparison using simulated data. DA-EM and DA-WEM consume less than 15 iterations to converge, whereas their stochastic counterparts require more than 50 iterations to converge on average. Another drawback of the stochastic DA algorithms is that they often need a so-called burn-in time, and things are even worse because it is very hard to determine how long the burn-in time is. In practice, one has to specify the number of iterations as more than what is needed. The running time results are comparable to the analysis of the algorithm computational complex in Section 4.5. 5.1.2. Profiling the Log-Likelihoods. To challenge the DAalgorithms, the 39 most diverse protein sequences are used to test the distribution of the log-likelihoods in the DA motif algorithms by setting the motif length of 29 residues. Each algorithm is run 200 times, each of which initializes a different seed alignment, and then the top 100 solutions are taken to plot the profile of the log-likelihoods as shown in Figure 2A-D. Note that the log-likelihood is normalized by setting the minimum likelihood in the top 100 runs to zero in each case and dividing by the number of sequences used (i.e., 39). Therefore, there are 100 other runs with negative normalized likelihoods not shown in the figure. The highlighted zones in Figure 2A-D are the feasible solution space in each case according to the tested data set. They are only a small portion of all the simulated solutions. In other words, a large number of output motif models are not the solutions of specific interest, and solutions outside the feasible zone may represent noise signals or different kinds of motifs that are as of yet undefined. It is also interesting to see that DA-Gibbs and WEM are able to explore more feasible spaces (the wide zones in Figure 2B and D) than the other DA algorithms. The DA-Metropolis algorithm has the narrowest zone (Figure 2C) among all. Figure 2E demonstrates two similar alignments with two-residue phase shifts. The highlighted areas are the aligned domains with leading numbers that indicate the start positions of the domain. Two alignments are located in the feasible solution space; however, the left alignment is highlighted in Figure 2E with a slightly higher log-likelihood (17.9), and thus it exactly locates the annotated conserved domains in NCBI CDD.35 The second alignment spotlighted on the right of Figure 2E indeed detects the annotated domain with two-residue phase shifts, and its log-likelihood decreases a little bit (17.5) compared to the alignment spotlighted on the left. This discovery implies that the phase shifts problem previously mentioned1 may be easily solved by the likelihood profiling strategy (i.e., generating an ensemble of solutions via multiple runs and then picking the one of interest). 5.1.3. Multiple Motifs. All DA algorithms are able to detect the 29-residue long core motif in the 39 most diverse sequences. The core motif alignment is partially shown in the middle of Figure 3A. The core motif sequence logo is also displayed in the middle of Figure 3B. The same sequence set is fed to MEME9 and the Gibbs motif sampler,2 and both

Algorithms to Detect Conserved Domains in Protein Sequences

research articles

Figure 2. Profiles of log-likelihoods using four DA algorithms. The log-likelihoods calculated here are normalized by setting the minimum log-likelihood in the profile to zero. Notice that each likelihood data point (q) represents an interval of q ( 5.0. The highlighted area in each graph is the feasible solution space. (A) Log-likelihood profile with the DA-EM motif algorithm. (B) Log-likelihood profile with the DA-WEM motif algorithm. (C) Log-likelihood profile by the DA-Metropolis motif algorithm. (D) Log-likelihood profile by the DA-Gibbs motif algorithm. (E) Two similar alignments with two-residue phase shifts.

programs achieve the same accuracy as the DA algorithms. However, the Gibbs motif sampler2 automatically determines the core motif length as 41 residues long instead of 29. This actually covers both motif-1 (left alignment in Figure 3A) and the core-motif (mid-alignment in Figure 3A). Structure data show that three helixes are separated with a transition structure (e.g., loops or backbone) in between (see highlighted part in Figure 3D). This flexible alignment divides the whole conserved

domain into three submotifs, and thus it may maximize the total log-likelihood by allowing some spacers. Such refinement would presumably optimize the configuration energy while predicting the 3D structure of an unknown protein. To locate the three subunits in the HTH_ICLR conserved domain, one can use the common erasing technique1,2,9 to perform motif discovery recursively. The first or core motif sites are detected as discussed above, and then these core motif sites Journal of Proteome Research • Vol. 7, No. 01, 2008 197

research articles

Bi

Figure 3. Alignment of the HTH_ICLR domains using the 39 most diverse sequences from the NCBI CDD database.35 The 2G7U chain A structure data (mmdb: 38409, Zheng et al., unpublished) is used to plot the 3D structure using Cn3D software downloaded from NCBI.35 (A) Alignment of the HTH_ICLR domains. The core motif is aligned in the center, and the left subunit (motif-1) is aligned with three residues shifted to the left compared to the same alignment reported in the CDD.35 The right subunit (motif-2) is aligned with two residues shifted to the left. (B) Sequence logos of three motifs, motif-1, core motif, and motif-2, displayed from left to right. The sequence logos are plotted using WebLogo (http://weblogo.berkeley.edu). (C) The organization of a typical protein in the IclR family (Chain A) from NCBI CDD. (D) The 3D structure of the 2G7U chain A. The highlighted structure part is the HTH_ICLR domains.

are masked and perform another round of motif finding. The three subunits are found by the DA algorithms by deploying the motif-erasing technique. Their alignments are in part presented in Figure 3A, and corresponding sequence logos are illustrated in Figure 3B. Among the top likelihood solutions, the core motif is aligned in the center which is consistent with those annotated in CDD, whereas the left subunit (motif-1) is aligned with three residues left shifted compared to the same alignment reported in the CDD.35 The right subunit (motif-2) is aligned with two residues shifted to the left as well. Notice 198

Journal of Proteome Research • Vol. 7, No. 01, 2008

that the DA algorithms did find alignments among motif-1 and motif-2 solutions that are comparable to the CDD alignments, but they exhibit lower likelihoods in the profile (i.e., less than top 50 alignments in log-likelihood) than those shown in Figure 2A (i.e., the spotlighted regions). Notice that alignments from CDD come from the multiple sequence alignment methods.3,6 Figure 3D illustrates the 3D structure of the IclR protein 2G7U chain A. The highlighted structure part of 2G7U_A in Figure 3D is the reconstructed HTH_ICLR conserved domain where three helixes are clearly observed: motif-1 covers one helix

Algorithms to Detect Conserved Domains in Protein Sequences

research articles

Figure 4. Comparison of algorithm performance using simulated sequences. The relationship between performance and sequence length in (A) the small group with 25 HTH sequences and in (B) the large group with 35 HTH sequences. (C) Plot of running time vs sequence length. Each simulation is repeated five times, and each data point is their average. Note that all DA algorithms run 5000 times on each data set. The performance coefficient is computed according to ref 37. The simulation was performed using the Apple Xserve cluster.

object; the core motif involves one helix-turn-helix structure; and motif-2 indicates a conserved protein backbone. In addition, the spacer distributions can be deduced from the alignment of three subunits in the HTH_ICLR domain. 5.2. Motif Subtlety and Large-Scale Alignments. Subtle motif discovery is one of the toughest issues in bioinformatics. Despite the fact that many algorithmic efforts tackling this problem have been undertaken, none of them are very successful.31 The motif subtlety in unaligned sequences is dependent on two aspects: the degree of motif conservation and the amount of the unaligned sequences containing the motif sites. In the extreme case, if a motif is mimicking the background sequences, there seems to be no solution. Thus, a motif should be somewhat conserved. While dealing with a largescale motif alignment, two scenarios could arise: (1) the data set may contain a large number of sequences, but each sequence is relatively short; or (2) it may consist of a small number of unaligned sequences, but each sequence is very long. The first scenario is not challenging and often renders motif discovery more tractable. The second scenario is very

tough and can be described as alignment in the “twilight” zone. Simulation is deployed to inspect these scenarios. The simulated data are generated as follows: Given a number of known HTH sites, one can computationally insert each site into four sequences of different lengths ranging from 1000 to 4000 amino acids, and the HTH-embedded sequences with the same length form a data set. The background sequences are generated according to the protein sequences from which the HTH sites are extracted. The DA algorithms run 5000 times in all data sets. Two simulation groups are generated: in the small group, each data set consists of 25 HTH sequences, and in the large group, each set contains 35 sequences. Figure 4 shows the simulated results. In the small group, as shown in Figure 4A, WEM failed, whereas the other three DA algorithms performed about the same in the range of 1000–2000 amino acids. In the range of 3000–4000 amino acids, DEM was the best and Gibbs performed a little better than Metropolis in the 3000 sets. WEM exhibited a fluctuating performance curve (Figure 4A) that may indicate more cycles are needed to improve its performance. Journal of Proteome Research • Vol. 7, No. 01, 2008 199

research articles

Bi a

Table 1. Comparison of Core Block Alignments Using Benchmark Data REF

MUS

CLU

DEM

GIBBS

METRO

WEM

6 7

0.86 ( 0.08 0.75 ( 0.201

0.80 ( 0.11 0.71 ( 0.240

0.85 ( 0.19 0.89 ( 0.08

0.83 ( 0.18 0.89 ( 0.09

0.84 ( 0.19 0.89 ( 0.09

0.85 ( 0.19 0.90 ( 0.08

mean

0.805

0.755

0.870

0.860

0.865

0.875

a

Each entry is the precision of detected core blocks. The precision is the number of correct residues aligned in the block divided by the number of sequences aligned in the reference alignment. Note that: REF ) reference set, MUS ) MUSCLE, CLU ) ClustalLW.

In the large group (i.e., 35 sequences per data set), as shown in Figure 4B, the WEM’s performance was significantly improved, whereas the Metropolis’ performance did not change. In the range of 1000–3000 amino acid lengths, the DA-EM and Gibbs algorithms performed equally well. When the sequence length was extended to 5000 amino acids, all DA’s performances dropped dramatically. DA-EM is the overall best predictor. Gibbs is slightly better than its deterministic counterpart WEM at the price of computing time; for example, it spent 74-fold more time than WEM in the 5000 amino acid case (Figure 4C). Notice that the time consumed by two stochastic DAs increases sharply as the sequences become longer (Figure 4C). Contrarily, DA-WEM is strikingly timeefficient. Although DA-Metropolis may improve its accuracy through running a longer chain, this is prevented due to its lack of time efficiency. One would prefer to use the Gibbs sampler if time is of no concern. DA-EM would be of general preference if sufficient runs are provided. DA-WEM may be of practical use in large-scale alignment of highly conserved domains. 5.3. Benchmark Data Testing. References 6 (it contains 13 data sets) and 7 (eight data sets) are downloaded from the BAliBASE version 3.0.36 They are devoted to the particular problems posed by sequences with transmembrane regions, repeats, and inverted domains. These benchmark data are widely used and provide high-quality, manually refined reference alignments based on 3-D structural superpositions. These data sets contain 2555 protein sequences, and their core blocks are well annotated. As the benchmark data, they are fed into the DA algorithms as well as ClustalW 1.836 and MUSCLE 3.6.38 The DA algorithms run with 1000 restarts, and ClustalW and MUSCLE run as their defaults. The performance is evaluated by the average column-wise precision, that is, the number of correctly aligned residues in the core blocks divided by the number of sequences aligned in the reference alignment. The performance results are shown in Table 1. All DA motif algorithms were able to detect the core blocks in the benchmark data, and their average performance is comparable to those using the MSA methods (i.e., MUSCLE and ClustalW) in reference 6. Note that reference 6 contains small core blocks with a length of around 20 residues. In one case test of reference 6, the DA algorithms were predicted with a low precision (0.30) as indicated by a slightly large performance variance (0.19). Although more runs as well as different domain widths were adjusted, DA performances in this case were not improved. It is observed that the conservation of the annotated blocks in the case that DA failed is very low; therefore DA is easily trapped in a nonsite alignment that eventually has a much higher log-likelihood than that of the annotated ones. However, the MSA methods were able to detect the low conserved blocks with precision at 0.60. Reference 7 is comprised of large blocks (around 70 residues long). The DA algorithms outperform two MSA algorithms in reference 7. The MSA algorithms have larger variances in 200

Journal of Proteome Research • Vol. 7, No. 01, 2008

reference 7 because they were predicted with low accuracy (around 0.35) in two cases. While taking a close look at the aligned regions using the MSA algorithms, one could perceive that about half of the seqeunces with misaligned regions are due to various gap insertions, whereas the other half of the sequences were still correctly aligned. It is also shown that MUSCLE outperformed ClustalW on average (Table 1) which is consistent with a previous comparison using other benchmarks.38 All DA algorithms performed approximately equally.

6. Discussion This paper presents the data augmentation framework under which four DA motif-finding algorithms are described and evaluated using simulated and benchmark data sets. The deterministic DA algorithms are much faster than their stochastic DA counterparts. DA-WEM is very fast, and thus it is likely to be a good candidate for performing large-scale alignment of highly conserved domains. Notice that four DA algorithms (whether they are deterministic or stochastic) suffer from the local optimum. One may perform multiple runs to reduce the likelihood of getting trapped in the local optimum, but it may not work in some circumstances. Although DA-EM outperformed others on average, Gibbs may work better in subtle motif finding. Various extensions to the DA basic versions have already been studied. Besides the basic DA-EM algorithm, there are quite a few other variants, for example, the extension to dealing with scenarios such as zoops or multiple motifs that has been implemented in MEME.9 Quite recently, supervised EM algorithms are proposed by specifying a set of constraints that the PWM of the unknown motif must satisfy.33 Such constraints are often based on prior knowledge about the structure of the protein binding sites. Another interesting EM implementation is using the genetic algorithm to evolve a population of solutions and thus efficiently reduce the possibility of getting stuck in a local optimum.32 The first version of the Gibbs sampler1 assumed the oops model (i.e., so-called site sampler) and resolved the phase shift problem through an additional randomized step. However, as indicated in this paper, one additional run is often not enough to resolve the issue because there are likely to be a substantially large number of feasible solutions depending on the sequence data size. In practice, one needs a profile of a number of potential solutions such as the one presented in the paper. Later, the Gibbs motif sampler2 improved the site sampler to handle zoops and multiplemotif cases. It used a motif erasing technique to handle the multiple motifs that was first reported in MEME.9 One can find one motif in a time and then erase these motif sites from the sequence set and start over in the next round to find another type of motifs. This procedure can be easily applied recursively as used in the paper. The Gibbs motif sampler also implemented a column sampling to optimize the motif lengths.2 Most recently, the new Gibbs Centroid Sampler39 employed some innovative techniques to overcome the local optimum

research articles

Algorithms to Detect Conserved Domains in Protein Sequences difficulty in Gibbs sampling. It is developed to report a centroid alignment among multiple runs of the Gibbs motif sampler.2 The centroid alignment is defined as the one that has the minimum total distance to the set of samples chosen from the a posteriori probability distribution of transcription factor binding site alignments in an ensemble of solutions.39 The centroid sampler has been used in DNA sequence alignment and RNA secondary structure prediction; however, it has not yet been applied in conserved domain detection in protein sequences. In some cases, it may be better to enumerate the top n solutions by simply profiling the log-likelihood because a large set of biopolymer sequences are often embedded with a multitude of suboptimal signals and the correct solution may include a small subset of those signals or sites. The likelihood profiling can easily solve the phase shift issue in protein sequence motif discovery because a collection of phase shift alignments could form such a small set of similar likelihoods. However, similar likelihood alignments do not necessarily imply that they are precipitated by phase shifts. It is hard to simulate a real distribution of those solutions since the search space is so huge and limited amounts of sampling may miss a lot of information. It is imperative that further investigation be performed into the large-scale simulation of the log-likelihood function as well as theoretic studies on sampling from an ensemble of solutions. This may stimulate a new research thrust in bioinformatics and thus probably give new insights into the challenging motif discovery problem.

Acknowledgment. The author thanks the reviewers for their helpful comments for improving the paper. This research is supported in part by the Katharine B. Richardson Foundation. Supporting Information Available: Technical details are given. This material is available free of charge via the Internet at http://pubs.acs.org. References (1) Lawrence, C. E.; Altschul, S. F.; Boguski, M. S.; Liu, J. S.; Neuwald, A. F.; Wootton, J. C. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 1993, 262, 208– 214. (2) Neuwald, A. F.; Liu, J. S.; Lawrence, C. E. Gibbs motif sampling: Detection of bacterial outer membrane protein repeats. Protein Sci. 1995, 4, 1618–1632. (3) Livingstone, C. D.; Barton, G. J. Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput. Appl. Biosci. 1993, 9, 745–756. (4) Rost, B.; Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 1994, 19, 55–72. (5) Baker, D.; Sali, A. Protein structure prediction and structural genomics. Science 2001, 294, 93–96. (6) Thompson, J. D.; Plewniak, F.; Poch, O. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. 1999, 27, 2682–2690. (7) Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm (with Discussion). J. Royal Stat. Soc. B 1977, 39, 1–38. (8) Lawrence, C. E.; Reilly, A. A. An expectation maximization algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct., Funct., Genet. 1990, 7, 41–51. (9) Bailey, T. L.; Elkan, C. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning 1995, 21, 51–80. (10) Geman, S.; Geman, D. Stochastic relaxation, Gibbs distribution and Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 721–741. (11) Liu, J. S.; Neuwald, A. F.; Lawrence, C. E. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 1995, 90, 1156–1170.

(12) Durbin, R.; Eddy, S.; Krogh, A.; Mitchison, G. Biological sequence analysis: probabilistic models of proteins and nucleic acids; Cambridge University Press UK: London, 1998. (13) MacIsaac, K. D.; Fraenkel, E. Practical strategies for discovering regulatory DNA sequence motifs. PLoS Comput. Biol. 2006, 2, e26. (14) Ji, H.; Wong, W. W. Computational biology: Towards deciphering gene regulatory information in mammalian genomes. Biometrics 2006, 62, 645–663. (15) Edgar, R. C.; Batzoglou, S. Multiple sequence alignment. Curr. Opin. Struct. Biol. 2006, 16, 368-373. (16) Liu, J. S. Bayesian modeling and computation in bioinformatics research. In Current Topics in Computational Biology, Jiang, T., Xu, Y., Zhang, M., Ed.; MIT Press.: Cambridge, MA, 2002; pp 11-44. (17) van Dyk, D. A.; Meng, X.-L. The art of data augmentation. J. Comput. Graphical Stat. 2001, 10, 1–50. (18) Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A.; Teller, H. Equations of statecalculations by fast computing machines. J. Chem. Phys. 1953, 12, 1087–1091. (19) Bi, C.-P. SEAM: A stochastic EM-type algorithm for motif-finding in biopolymer sequences. J. Bioinf. Comput. Biol. 2007, 5, 47–77. (20) Tanner, M. A.; Wong, W. H. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 1987, 82, 528–550. (21) Swendsen, R. H.; Wang, J. S. Nonuniversal critical dynamics in Monte Carlo simulations. Phys. Rev. Lett. 1987, 58, 86–88. (22) Meng, X. L.; Rubin, D. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 1993, 80, 267–278. (23) Wei, G. C. G.; Tanner, M. A. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J. Am. Stat. Assoc. 1990, 85, 699–704. (24) Bi, C.-P. Multiple sequence local alignment using Monte Carlo EM algorithm. In ISBRA 2007 Lecture Notes in Bioinformatics; Mandoiu, I., Zelikovsky, A., Eds.; Springer-Verlag: Berlin Heidelberg, 2007; Vol. 4463, 465–476. (25) Liu, J. S. Monte Carlo Strategies in Scientific Computing; Springer: New York, 2001. (26) Neal, R. M.; Hinton, G. E. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in Graphical Models; Jordan, M., Ed.; NATO Science Series Kluwer: Norwell, MA, 1998; pp 355–368. (27) Fraley, C.; Raftery, A. Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 2002, 97, 611–631. (28) Stormo, G. D. DNA binding sites: representation and discovery. Bioinformatics 2000, 16, 16–23. (29) Berg, O. G.; von Hippel, P. H. Selection of DNA binding sites by regulatory proteins: Statistical-mechanical theory and application to operators and promoters. J. Mol. Biol. 1987, 193, 723–750. (30) Wang, L.; Jiang, T. On the complexity of multiple sequence alignment. J. Comput. Biol. 1994, 1, 337–348. (31) Tompa, M.; Li, N.; Bailey, T. L.; Church, G. M.; De Moor, B.; Eskin, E.; Favorov, A. V.; Frith, M. C.; Fu, Y.; Kent, W. J.; Makeev, V. J.; Mironor, A. A.; Noble, W. S.; Pavesi, G.; Pesole, G.; Regnier, M.; Simonis, N.; Sinha, S.; Thijs, G.; vanHelden, J.; Vandenbogaert, M.; Weng, Z.; Workman, C.; Ye, C.; Zhu, Z. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 2005, 23, 137–144. (32) Bi, C.-P. A genetic-based EM motif-finding algorithm for biological sequence analysis. Proc. of IEEE Symp. on Computat. Intell. Bioinf. Computat. Biol. 2007, 07, 275–282. (33) Bembom, O.; Keles, S.; van der Laan, M. J. Supervised detection of conserved motifs in DNA sequences with cosmo. Stat. Appl. Genet. Mol. Biol. 2006, 6,Article 8. (34) Krell, T.; Molina-Henares, A. J.; Ramos, J. L. The IclR family of transcriptional activators and repressors can be defined by a single profile. Protein Sci. 2006, 15, 1207–1213. (35) Marchler-Bauer, A.; Anderson, J. B.; Derbyshire, M. K.; DeWeeseScott, C.; Gonzales, N. R.; Gwadz, M.; Hao, L.; He, S.; Hurwitz, D. I.; Jackson, J. D.; Ke, Z.; Krylov, D. et al. CDD: a conserved domain database for interactive domain family analysis. Nucleic Acids Res. 2007, 35, D237–40. (36) Thompson, J. D.; Koehl, P.; Ripp, R.; Poch, O. BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61, 127–36. (37) Pevzner, P. A.; Sze, S.-H. Combinatorial approaches to finding subtle signals in DNA sequences. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2000, 8, 269–278. (38) Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high thoughput. Nucleic Acids Res. 2004, 32, 1792–1797. (39) Thompson, W. A.; Newberg, L. A.; Conlan, S.; McCue, L. A.; Lawrence, C. E. The Gibbs centroid sampler. Nucleic Acids Res. 2007, 35, W232–W237.

PR070475Q Journal of Proteome Research • Vol. 7, No. 01, 2008 201