Predicting Nuclear Localization

John Hawkins,* Lynne Davis, and Mikael Bodén

ARC Centre for Complex Systems, School of Information Technology and Electrical Engineering, University of Queensland, QLD 4072, Australia

Received October 25, 2006

Nuclear localization of proteins is a crucial element in the dynamic life of the cell. It is complicated by the massive diversity of targeting signals and the existence of proteins that shuttle between the nucleus and cytoplasm. Nevertheless, a majority of subcellular localization tools that predict nuclear proteins have been developed without involving dual localized proteins in the data sets. Hence, in general, the existing models are focused on predicting statically nuclear proteins, rather than nuclear localization itself. We present an independent analysis of existing nuclear localization predictors, using a nonredundant data set extracted from Swiss-Prot R50.0. We demonstrate that accuracy on truly novel proteins is lower than that of previous estimations, and that existing models generalize poorly to dual localized proteins. We have developed a model trained to identify nuclear proteins including dual localized proteins. The results suggest that using more recent data and including dual localized proteins improves the overall prediction. The final predictor NUCLEO operates with a realistic success rate of 0.70 and a correlation coefficient of 0.38, as established on the independent test set. (NUCLEO is available at: http://pprowler.itee.uq.edu.au.)

Keywords: bioinformatics • protein localization prediction • nuclear import • nuclear localization signals

1. Introduction

An essential feature of eukaryotic cells is the segregation of the genetic material from the rest of the cell via a nuclear membrane. The separation allows the cell to regulate those molecules that can interact with the genome through the process of membrane transport. However, because of the role the nucleus plays in information processing, the transport mechanism itself must accommodate the import of housekeeping proteins, a process for exporting RNA, and a process by which information about the changing environment of the cell can be imported to affect the transcription of the genome.¹ Thus, nuclear localization is much more than the mere functional compartmentalization that is observed in most subcellular localization processes. Nuclear localization is a complicated set of processes that play a crucial role in the dynamical self-regulation of the cell.²

At present, there are only two prediction services designed specifically to identify proteins imported into the nucleus: PredictNLS³ and NucPred.⁴ In addition to these specialized models, there are a number of general purpose subcellular localization predictors that include the nucleus in their list of targets. Hence, to investigate our current capacity to identify nuclear proteins and extensively benchmark our own predictor, we include these general purpose predictors.

In this study, we have developed a model based on a Support Vector Machine (SVM) with a custom kernel to identify proteins that are localized to the nucleus, either temporarily or permanently. The kernel employs a composite spectrum (or multiple k-mer) encoding conjoined with a bit vector indicating the presence or absence of a range of sequence motifs known to be important for nuclear proteins. The model is evaluated and compared against the existing suite of nuclear localization predictors, using an independent nonredundant data set.

* To whom correspondence should be addressed. E-mail: [email protected].

Journal of Proteome Research 2007, 6, 1402-1409. Published on Web 02/24/2007.

10.1021/pr060564n CCC: $37.00 © 2007 American Chemical Society

2. Background

The transport of molecules between the cytoplasm and nucleus is mediated by a large macromolecular machine called the Nuclear Pore Complex (NPC).⁵ The NPC allows passive diffusion of small molecules between the nucleus and cytoplasm, including proteins up to a molecular weight of approximately 50 kDa.⁶ The NPC also allows integral membrane proteins to diffuse through its lateral channels.⁷ Although passive diffusion is possible, most molecules that function within the nuclear matrix require the assistance of importins or exportins to be transported into or out of the nucleus. For the process of importing molecules, the importins bind to the cargo molecule, then the cytoplasmic fibrils bind to the cargo-importin complex and undergo conformational changes that effect the transport of the complex through the center of the NPC. The importins must recognize certain features, Nuclear Localization Signals (NLSs), within the cargo molecule. Hence, understanding the processes of nuclear localization revolves around understanding NLSs.⁸

Table 1. Comparison of the Current Performance Results Reported for Nuclear Localization Predictionᵃ

| model | year | tp | tn | fp | fn | data set | SucR | Sens | Spec | MCC |
|---|---|---|---|---|---|---|---|---|---|---|
| pSLIP | 2005 | 1799 | 5041 | 133 | 133 | 7106 | 0.963 | 0.931 | 0.931 | 0.91 |
| HSLPred | 2005 | | | | | 3532 | | 0.824 | | 0.79 |
| LOCtree (Animal) | 2005 | 438 | 2852 | 124 | 124 | 3538 | 0.930 | 0.780 | 0.958 | 0.74 |
| LOCtree (Plant) | 2005 | 26 | 948 | 11 | 6 | 991 | 0.983 | 0.813 | 0.989 | 0.75 |
| P2SL | 2005 | 671 | 2351 | 161 | 219 | 3402 | 0.888 | 0.753 | 0.806 | 0.71 |
| NucPred | 2004 | | | | | 471 | | | | 0.47 |
| GO-FunD-PseAA | 2004 | 1841 | | | 91 | 7579 | | 0.953 | | |
| LOCnet | 2003 | 130 | 308 | 53 | 48 | 539 | 0.813 | 0.730 | 0.710 | 0.58 |
| predictNLS | 2003 | 1354 | | | 1788 | | | 0.430 | 1.000 | |
| PLOC | 2003 | 1739 | | | 193 | 7589 | 0.780 | 0.900 | | |
| SubLoc | 2001 | | | | | 2427 | 0.794 | 0.874 | | 0.75 |
| PSORT II | 1999 | - | - | - | - | | - | 0.717 | - | - |

ᵃ Middle section shows numbers of true positives, true negatives, false positives, and false negatives, and the total number of sequences in the training data set. In most cases, the full set of statistics reported here was not published and has been calculated on the basis of what was published. Where this was not possible, the entry has been left blank. In particular, the actual numbers of sequences correctly classified have been reverse-engineered based on the sizes of the data sets in each study and the particular performance statistics given in each instance.

The localization signals are extremely varied. The first NLSs to be discovered, the so-called classical NLSs, are short patterns which may appear anywhere in the sequence, but are exposed on the surface of the folded proteins. The proteins are characteristically not processed after localization; therefore, the retention of the signal allows the protein to move in and out of the nucleus. The classical NLSs have been generalized to monopartite and bipartite motifs. The monopartite NLS is described by the pattern (K/R)₄₋₆ and the bipartite by the pattern (K/R)₂X₁₀₋₁₂(K/R)₂.⁹ After the identification of the classical NLS pattern, numerous nuclear proteins lacking a classical NLS were discovered. Christophe et al. give a listing of nonclassical sequences known to be involved in nuclear localization. They represent a diverse set of sequences with varying lengths up to 38 residues, none of which is particularly high in basic residues; one example they cite is even high in acidic residues.⁹

NLSdb is a recent database of nuclear localization signals. It contains 114 motifs taken from experimentally confirmed NLSs in the literature. It also contains a further 194 signals generated by performing in silico mutagenesis on the motifs and selecting those that match only known nuclear proteins.³,¹⁰

Aside from the large range of known, and potential, localization signals, there is also considerable evidence of dual localization of nuclear proteins in other compartments, notably, the sharing of proteins between cytoplasm and nucleus in a continual shuttling process, but also the existence of DNA/RNA metabolism in the nucleus, mitochondria, and chloroplasts.¹¹ To further complicate matters, among the proteins that qualify for passive diffusion, there are many that are not annotated as dually localized. In Swiss-Prot R50.0, there are 2899 nondually localized nuclear proteins with a molecular weight less than 50 kDa. This suggests that these proteins also contain NLSs that act against the equalizing forces of passive diffusion. It is apparent that one must overcome numerous obstacles to produce an accurate predictor of nuclear localization based on the presence of nuclear localization signals.
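The two classical patterns described in this section translate naturally into regular expressions. The following is a minimal sketch of that translation, not part of the original study: the function name and the toy peptides are hypothetical, and X is assumed to stand for any residue.

```python
import re

# Classical NLS patterns as stated in the text. Writing them as regular
# expressions (with X meaning "any residue") is an assumption of this sketch.
MONOPARTITE = re.compile(r"[KR]{4,6}")             # (K/R) repeated 4-6 times
BIPARTITE = re.compile(r"[KR]{2}.{10,12}[KR]{2}")  # (K/R)2 X10-12 (K/R)2


def classical_nls_hits(sequence):
    """Report which classical NLS patterns occur anywhere in the sequence."""
    return {
        "monopartite": MONOPARTITE.search(sequence) is not None,
        "bipartite": BIPARTITE.search(sequence) is not None,
    }


# Hypothetical toy peptides, used only to exercise the patterns.
for peptide in ("MSTPKKKRKVEDA", "MAKRPAATKKAGQAKKKKLD", "MSTAGEDLLVAG"):
    print(peptide, classical_nls_hits(peptide))
```

Because both patterns are short and unanchored, they will also match many sequences that never enter the nucleus, which is why a bare pattern match is a weak predictor on its own.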
First, the matching rules for classical NLSs are so loose that they are a very poor predictor of nuclear localization, simply because there are many known non-nuclear proteins that contain classical NLSs.¹² Second, there are known nuclear proteins that do not contain an NLS, and some of the nuclear proteins that contain classical NLSs have been demonstrated to use some other process for nuclear localization, the range of which seems to be constantly increasing.⁹

There are only two specialized predictors available for identifying nuclear proteins, predictNLS and NucPred. PredictNLS bases its decision on the presence of a known or putative NLS, derived from the contents of NLSdb.³ NucPred uses an ensemble of regular expression-based models that were evolved to identify proteins imported into the nucleus.⁴ However, a number of general subcellular localization predictors include the nucleus among the list of targets, dating back to PSORT, which relies on the existence of a small set of known NLSs and amino acid composition.¹³⁻¹⁶ In the last 3 years, a number of models have appeared using sophisticated machine learning techniques for predicting a range of localizations, including the nucleus.

The published performance of the available predictors is presented in Table 1. Although the performance results are based on different data sets and, hence, are not a valid comparison of the underlying techniques, they are nevertheless estimates of the performance of the trained version of the model that the authors have made available. Consequently, they are the primary source of information that biologists use when deciding which predictor to use. By tabulating these published results, we can summarize the overall impression one would have gained in examining the literature to identify the optimal nuclear localization predictor. In some instances, we have had to calculate the performance metrics from the published data to allow comparison. In a few cases, this was not possible.

We display the raw performance details in terms of the numbers of true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) where these results are available. We then report a number of performance summary statistics. We use the following measures: the sensitivity (Sens) = tp/(tp + fn) and the specificity (Spec) = tn/(tn + fp), which demonstrate the tendency to identify positive and negative samples, respectively. We report the success rate (SucR) = (tp + tn)/N as a measure of overall performance showing the proportion of the N proteins correctly classified. Finally, we report the Matthews correlation coefficient (MCC) (eq 1)¹⁷ as a measure of overall accuracy.

$$\mathrm{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fn)(tp + fp)(tn + fp)(tn + fn)}} \qquad (1)$$
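The four summary statistics can be computed directly from the confusion counts. The sketch below follows the definitions given in the text and eq 1; the function name and the zero-denominator convention are assumptions of this illustration. The counts are the "All" row reported for NUCLEO in Table 6.

```python
import math


def summary_statistics(tp, tn, fp, fn):
    """Compute SucR, Sens, Spec, and MCC from raw confusion counts."""
    n = tp + tn + fp + fn
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "SucR": (tp + tn) / n,
        "Sens": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        # Returning 0.0 when the denominator vanishes is a common
        # convention and an assumption here, not something the text states.
        "MCC": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Counts from the NUCLEO "All" row of Table 6.
stats = summary_statistics(tp=430, tn=246, fp=152, fn=134)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Rounded to two decimals, these values agree with the 0.70, 0.76, 0.62, and 0.38 listed for that row.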

Looking at the results in Table 1, one would be forgiven for assuming that the problem is all but solved; the latest models report accuracies that leave little room for improvement. However, closer examination of these studies reveals that many of them (pSLIP, GO-FunD-PseAA, HSLPred, and SubLoc) use homologous data for training and testing. Some of them also rely on homology searches in their prediction algorithm (P2SL, HSLPred). Genuine generalization by a model requires that it has extracted the essence of the real-world process that is generating the distinctions it wishes to predict. In the case of nuclear localization, this involves identifying the sequence motifs that interact with the importins and effect transport via the NPC. This remains true regardless of whether the subsequences are identified explicitly for the user, or remain hidden inside the ‘black-box’. A model that relies on sequence homology utilizes a different strategy by exploiting the rich evolutionary history of sequences and gambling that no novel protein sequence is far from known ones. When a model is trained and tested on redundant data, “the apparent predictive performance may be overestimated, reflecting the method’s ability to reproduce its own particular input rather than its generalization power.”¹⁸ Sequence alignments become a poor indicator of homology at 25-40% sequence identity,¹⁹ and this threshold has become a de facto standard for developing and evaluating machine learning applications. Higher thresholds are a compromise taken in the face of a paucity of data.

The pSLIP model is trained on the same data set as PLOC.²⁰ This data set derives from Swiss-Prot R39 and was reduced to remove sequences with over 80% similarity.²¹ The same data set is used in the development of GO-FunD-PseAA.²² The data used in the development of HSLPred was curated such that 90% similarity is accepted,²³ as is the data set used in the development of SubLoc.²⁴ The P2SL model utilizes the data set obtained from Lu et al.,²⁵ which they extend with hand-picked sequences from Swiss-Prot.²⁶ Neither Lu et al. nor Atalay et al. mention whether redundancy reduction is performed. Both models explicitly rely on information from BLAST queries in such a way that the classifications are based on sequence homology to known proteins.²⁶ The selection of models that are part of the HSLPred system also includes a homology-based model.
Although models produced in this fashion will perform well on proteins within the same family as known proteins, they may fail on novel ones. The performance statistics produced by these machines are not a reliable guide to their ability to generalize, and cannot fairly be compared with models tested on redundancy-reduced data sets.

The LOCtree model by Nair and Rost was trained and tested on data that was reduced so that no sequences had an HVAL1 greater than 5.²⁷ (HVAL is an alternative measure of similarity that takes the percentage of pairwise identical residues with a factor correcting for the actual number of aligned residues.) The LOCnet model was trained on data that was reduced so that no sequences had greater than 40% similarity. Furthermore, their performance results are based on a separate test data set consisting of sequences added between R40 and R41 of Swiss-Prot, themselves reduced so that no test sequence had greater than 40% similarity to any of the training sequences.

LOCtree and LOCnet, like the other predictors, were trained only on sequences with a single experimentally confirmed localization. The focus on exclusively localized proteins can be justified by the general dominance of singularly localized proteins in other organelles. However, the nucleus is notable for the presence of numerous nuclear-cytoplasmic shuttling proteins. Hence, the exclusion of dual or multicompartmentalized proteins conflicts with the dynamic nature of nuclear protein trafficking. In Swiss-Prot R50, there are 334 plant and 5055 non-plant soluble proteins that are experimentally determined to be localized only to the nucleus. With multiple localizations, these numbers increase to 376 plant and 6276 non-plant proteins. In other words, almost 20% of the available nuclear localization data comes from dual or multilocalized proteins. By contrast, consider that, in the same Swiss-Prot release, there are only 176 other soluble proteins that are dually localized between some subset of all the other compartments of the eukaryotic cell. In other words, 88% of the dually localized proteins include the nucleus among their list of locations.

The only existing prediction services that have included dually localized proteins in their development are PredictNLS and NucPred. PredictNLS predicts nuclear localization via the presence of an NLS contained in NLSdb. During the development of NLSdb, the 194 NLSs obtained by performing in silico mutagenesis were allowed to match dually localized nuclear proteins. The authors report that the service identifies 43% of known nuclear proteins.³,¹⁰ However, as these figures are generated on the same data that the service was created from, they do not represent a genuine guide to generalization ability.

NucPred uses an ensemble of 100 regular expressions that were evolved to identify proteins imported into the nucleus. The training set consisted of 760 nuclear proteins and 1147 non-nuclear proteins, redundancy-reduced so that no sequences with greater than 50% sequence identity over an alignment of 20% or more of the sequences are kept. Their validation set consisted of 178 nuclear and 293 non-nuclear proteins.⁴ The authors report that their model performed with an MCC of 0.47 on their validation set when using a threshold of 0.5, meaning that 50% of the regular expressions predict the protein to be nuclear. Given that the authors took a stringent data curation approach, this can be taken as a good estimate of the current generalization performance on predicting nuclear import via the nuclear pore complex.
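The two proportions quoted earlier in this section for dual localization (almost 20% and 88%) can be rechecked from the Swiss-Prot R50 counts given in the text. This is a short worked check; the variable names are hypothetical, and the way the 88% figure is derived is one plausible reading of the text rather than a stated formula.

```python
# Counts quoted in the text for Swiss-Prot R50 (plant + non-plant).
nuclear_only = 334 + 5055     # localized only to the nucleus
nuclear_any = 376 + 6276      # nuclear, including multiple localizations
other_dual = 176              # dually localized, nucleus not involved

# Nuclear proteins that also carry another localization.
dual_nuclear = nuclear_any - nuclear_only

# Share of nuclear data that comes from dual or multilocalized proteins.
dual_fraction = dual_nuclear / nuclear_any

# Share of all dually localized proteins that include the nucleus
# (assumed reading: nuclear duals over nuclear duals plus other duals).
nucleus_share = dual_nuclear / (dual_nuclear + other_dual)

print(dual_nuclear, f"{dual_fraction:.1%}", f"{nucleus_share:.1%}")
```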
It is apparent that predicting nuclear localization is a difficult task that has not yet been dealt with adequately. We curate a reliable and redundancy-reduced data set and then develop a model to predict whether a protein is likely to be localized to the nucleus, be it permanently or temporarily. The evaluation of the model and all other available predictors is done using strictly separate and nonhomologous test data.

3. Methods

Using a data set curated to focus on the problem of distinguishing proteins imported through the NPC, we have developed a model using a Support Vector Machine with a customized kernel. We first outline the data set development and follow with the details of the model.

3.1. Data Set. Of all the models surveyed, HSLPred used the most recent Swiss-Prot release in developing its training data, namely R44.1. We extracted our data from Swiss-Prot R50 and split it into a training/development set and a quarantined test set for evaluating the model in comparison to the alternative predictors. The training set consists of all sequences added to Swiss-Prot prior to 2005, and the test set of all sequences added from 2005 to R50.

As a positive set, we extracted all proteins annotated with ‘NUCLEUS’ or ‘NUCLEAR’ in the SUBCELLULAR LOCALIZATION region of the CC field. We excluded proteins with the keywords “PROBABLE”, “BY SIMILARITY”, or “POTENTIAL” in the SUBCELLULAR LOCALIZATION field. In an effort to exclude all protein fragments, we filtered proteins annotated as “Fragment” in the DE field as well as all of those that do not begin with a methionine residue. Although Swiss-Prot contains some protein sequences without the initiator methionine, an overwhelming majority still have it in place. Hence, because of the ample data available, this criterion allows us to be certain that we have excluded fragments with a minimal loss of data. We exclude membrane proteins as they utilize alternative import pathways. As stated previously, unlike other predictors, we do not exclude proteins annotated as dually localized. Our data set is focused on identifying the key factors in localization to the nucleus via the nuclear pore complex; thus, shuttling proteins may offer considerable insight. Finally, we exclude proteins annotated as possessing signal peptides (SP). Dual localization between the endoplasmic reticulum (ER) and nucleus constitutes a tiny subset, and research into retrograde transport of toxins suggests there may be some alternative transport mechanism between these compartments.²⁸,²⁹ The filtering resulted in a set of 6061 nuclear proteins.

For the negative set, we extract all proteins with an experimentally determined subcellular localization to an organelle other than the nucleus. Furthermore, we exclude protein fragments, membrane proteins, and proteins dually localized between other compartments. We do this because dually localized proteins have a tendency to have less well defined localization and we wish to exclude the possibility that the negative set has proteins with minimal amounts of nuclear localization. We also exclude proteins bound for the ER because the ER is known to be only involved in processing some nuclear membrane proteins,⁷ and our mature classifier uses PProwler³⁰ to exclude proteins with a signal peptide. Hence, the definition of the negative data set focuses the problem on determining whether a soluble matrix protein that has no signal peptide will be localized to the nucleus.
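The positive-set criteria above amount to a chain of exclusion tests. The sketch below shows one way to express them; it is illustrative only. The dictionary layout, field names, and toy entries are hypothetical, and real Swiss-Prot flat files would first need to be parsed into such records.

```python
UNCERTAIN_TAGS = ("PROBABLE", "BY SIMILARITY", "POTENTIAL")


def is_positive_candidate(entry):
    """Apply the positive-set criteria to one simplified record.

    `entry` is a hypothetical dict, not a real Swiss-Prot parser object.
    """
    location = entry["subcellular_location"].upper()
    if "NUCLEUS" not in location and "NUCLEAR" not in location:
        return False                      # not annotated as nuclear
    if any(tag in location for tag in UNCERTAIN_TAGS):
        return False                      # annotation is not experimental
    if "FRAGMENT" in entry["description"].upper():
        return False                      # annotated fragment
    if not entry["sequence"].startswith("M"):
        return False                      # no initiator methionine
    if entry["is_membrane"] or entry["has_signal_peptide"]:
        return False                      # membrane or SP protein
    return True                           # dual localization is allowed


# Toy records invented for this example.
entries = [
    {"id": "toy1", "subcellular_location": "Nucleus. Cytoplasm.",
     "description": "Toy shuttling protein", "sequence": "MKRPAATKK",
     "is_membrane": False, "has_signal_peptide": False},
    {"id": "toy2", "subcellular_location": "Nucleus (Probable).",
     "description": "Toy protein", "sequence": "MSTAG",
     "is_membrane": False, "has_signal_peptide": False},
    {"id": "toy3", "subcellular_location": "Nuclear.",
     "description": "Toy protein (Fragment)", "sequence": "MKKRK",
     "is_membrane": False, "has_signal_peptide": False},
    {"id": "toy4", "subcellular_location": "Nucleus.",
     "description": "Toy protein", "sequence": "AKRKK",
     "is_membrane": False, "has_signal_peptide": False},
]

kept = [entry["id"] for entry in entries if is_positive_candidate(entry)]
print(kept)
```

Only the first toy record passes: the others fail on an uncertain annotation, a fragment flag, and a missing initiator methionine, respectively.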
A model trained on this data requires a preprocessing step to remove membrane and SP proteins. Before redundancy reduction, 9990 negative proteins were extracted. Because of the ample amount of data available, we performed stringent redundancy reduction on both the positive and negative data set to ensure integrity of the training and testing procedure. Both the positives and negatives were redundancy-reduced using BlastClust with a 10% identity threshold, so that the remaining sequences have no more than 10% identical residues in the aligned regions covering at least 90% of the sequences. For each cluster generated by BlastClust, we retained the protein with the longest sequence. The numbers of proteins in the final training sets are as follows: positives, 2842; negatives, 2606. The numbers of proteins in the testing sets are as follows: positives, 564; negatives, 398. The test data set contains 177 dually localized nuclear proteins and 387 nondually localized proteins. Of these dually localized proteins, only 5 are from plant proteomes.

3.2. Model Development. The task of predicting categories for biological sequences is generally difficult because they have variable lengths. In certain localization problems, this is mitigated by the positioning of the targeting signal at one of the ends of the sequence. However, in nuclear localization, there is no consistent sequence position for the signals used by the physical processes, necessitating the use of a model that makes a prediction on the basis of the entire sequence. Support Vector Machines are powerful machine learning tools that have proven effective in a range of bioinformatics applications. The SVM training procedure results in a decision function of the form shown in eq 2,³¹

$$f(x) = \operatorname{sign}\left(\sum_{i=1}^{N} y_i \alpha_i K(x_i, x) + \alpha_0\right) \qquad (2)$$

where y_i ∈ {-1, +1} is the target class for sample i ∈ {1, ..., N}, x_i is the vector describing the ith sample, and α_i is the ith Lagrange multiplier, which is adapted during the training procedure. The kernel function K(x_1, x_2) is used to perform the comparison between two samples, x_1 and x_2. In theory, this kernel function calculates the dot product between two samples within the feature space being used. However, the utility of kernel methods lies in that one need not generate the feature space directly; many kernel functions offer a shortcut in which the feature space is implicit in their definition. The kernel function can be customized for the problem, and thus, kernel methods allow bioinformatics researchers to design methods for comparing variable length protein sequences.

A range of custom sequence kernels have emerged that allow the comparison of variable length sequences. One of the simplest of these kernels is the spectrum kernel, which involves depicting a sequence by the composition of k-mers.³² The popular amino acid composition of a sequence is merely a normalized spectrum of k = 1. The kernel computes the dot product of two sequences in a feature space consisting of all possible k-mers for the given k. However, to calculate the dot product, one need not explicitly generate the entire feature space; hence, the spectrum kernel is a computationally efficient way to compare sequences in potentially large feature spaces.

The operation of the spectrum kernel is depicted in Figure 1. We have used only a fragment of a protein corresponding to the Mat α2 yeast protein NLS on the left and a potential match within the known nuclear localization signal of the hDNA topo II protein, identified as NLSdb entry 34.¹⁰,³³ We show the translation of these peptide fragments into the feature space of a Spectrum k = 2 kernel. Note that only a subsection of the full feature space is shown.
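The spectrum kernel described here can be sketched in a few lines by counting only the k-mers that actually occur, so the full feature space is never built. This is a minimal illustration rather than the implementation used in the study; the function names and the toy peptides are hypothetical.

```python
from collections import Counter


def spectrum(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel(s1, s2, k):
    """Dot product of two sequences in the implicit k-mer feature space."""
    counts1, counts2 = spectrum(s1, k), spectrum(s2, k)
    # Only k-mers present in s1 can contribute; a Counter returns 0 for
    # k-mers that are missing from s2.
    return sum(count * counts2[kmer] for kmer, count in counts1.items())


# Toy peptides of different lengths, invented for this example.
print(spectrum_kernel("KKRK", "KRKK", k=2))
print(spectrum_kernel("KKRK", "KRKK", k=1))
print(spectrum_kernel("KKRKAG", "KRKK", k=2))
```

For the first pair, both peptides contain the 2-mers KK, KR, and RK once each, giving a kernel value of 3; at k = 1 the shared composition (three K, one R) gives 3·3 + 1·1 = 10.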
Figure 1. Demonstration of the operation of the Spectrum Kernel (k = 2) on protein fragments showing the Mat α2 yeast protein NLS on the left and a potential match within the hDNA topo II protein, identified as NLSdb entry 34.

The power of the spectrum kernel lies in its ability to take arbitrary length sequences and compare them in a space that specifies the structure of the sequence in terms of fixed-length subsequences. However, there are several weaknesses to the method: only a single length of subsequence is considered, and no allowance is made for the kinds of length varying elements common in biological motifs. In effect, all subsequences are treated equally, and the technique is in no way tuned to the particular patterns most important in the problem.

To address these issues and focus our model on the task of identifying NLSs, we have created a custom kernel that takes two additional vector parameters, K(x_1, x_2, K, P). The first additional parameter, K, is a list of integers to be used for consecutive k values in spectrum feature spaces. The kernel is therefore not restricted to k-mers of a single length: the feature vectors for each k are evaluated separately and concatenated. The second parameter, P, takes a list of sequence motifs in a regular expression format. A bit vector is generated that is as long as the number of patterns. Each pattern is checked against the sequence, and the corresponding position in the vector is assigned a 0 or 1 for absence or presence, respectively.

The motif sets we use for our feature vectors were chosen from the literature to give the model access to information about patterns that are known to be important in nuclear proteins. The three sets of patterns are outlined in Table 2.

Table 2. Motif Sets Used as Feature Vectors for Augmenting the Kernelᵃ

| set name | number | description |
|---|---|---|
| Loose NLS | 15 | The most general NLS motifs, classical and nonclassical |
| Tight NLS | 220 | The full set of specific NLSs available in NLSdb¹⁰ |
| DNA Domains | 180 | All DNA and RNA associated patterns from the PROSITE database |

ᵃ Description of the three sets of motifs used as feature vectors for augmenting the spectrum-based kernels.

Table 3. Support Vector Machine Simulations with Spectrum Kernelsᵃ

| spectrum k-value | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| single | 0.309 | 0.301 | 0.365 | 0.427 | 0.415 | 0.408 |
| 1 | - | 0.301 | 0.387 | 0.477 | 0.464 | 0.439 |
| 2 | - | - | 0.391 | 0.466 | 0.473 | 0.466 |
| 3 | - | - | - | 0.440 | 0.440 | 0.436 |
| 4 | - | - | - | - | 0.436 | 0.424 |
| 5 | - | - | - | - | - | 0.414 |
| 1,4 | - | 0.469 | 0.447 | - | 0.490 | 0.489 |

ᵃ The Matthews’ correlation coefficient of SVMs with different spectrum kernel settings. The top row shows performance of a spectrum kernel with a single k-value. Below that are the results for kernels utilizing a combination of two different k-values. All possible combinations of the k-values chosen are displayed in an upper triangular fashion. The final row displays spectrum K = [1,4] combined with one of the remaining values as a third composite in the kernel. All values result from a single run of 5-fold cross-validation.

In all cases, the motif match features are conjoined with the composite spectrum features, and the kernel function returns the dot product of these composite vectors. The SVM models were implemented using a customized version of the LibSVM package.³⁴ We have extended the software to allow us to plug in kernels, and have been using the modified package to test a variety of bioinformatics-specific kernels.³⁵
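The composite kernel K(x_1, x_2, K, P) can be sketched by building one sparse feature map per sequence: k-mer counts for every k in K, plus one presence bit per motif in P. This is a hedged illustration and not the customized LibSVM code; the function names, the absence of any normalization, and the example patterns are assumptions.

```python
import re
from collections import Counter


def composite_features(sequence, k_values, patterns):
    """Sparse composite feature map: k-mer counts plus motif presence bits."""
    features = Counter()
    for k in k_values:
        # Spectra for different k are kept in separate blocks by tagging
        # each key, which is equivalent to concatenating the vectors.
        for i in range(len(sequence) - k + 1):
            features[("kmer", k, sequence[i:i + k])] += 1
    for index, pattern in enumerate(patterns):
        if re.search(pattern, sequence):
            features[("motif", index)] = 1   # 1 = present, absent keys = 0
    return features


def composite_kernel(s1, s2, k_values, patterns):
    """Dot product of the two composite feature vectors."""
    f1 = composite_features(s1, k_values, patterns)
    f2 = composite_features(s2, k_values, patterns)
    return sum(value * f2[key] for key, value in f1.items())


# Hypothetical motif list; the real Loose/Tight/PROSITE sets are not shown.
toy_patterns = [r"[KR]{4,6}", r"RKK"]
print(composite_kernel("KKRK", "KRKK", [1, 2], toy_patterns))
```

In this toy case the k = 1 block contributes 10, the k = 2 block contributes 3, and only the first motif is present in both peptides, contributing 1, for a total of 14. A kernel of this form could in principle be supplied to an SVM library as a precomputed Gram matrix, although the study instead modified LibSVM directly.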

4. Results

In the initial round of simulations, we simply explored the effectiveness of different combinations of k for K. Each simulation was a 5-fold cross-validation for which we measure and report the MCC over the entire data set (eq 1). For each cross-validation run, the data was split into five roughly equal subsets by iterating over the subsets and randomly selecting sequences until all the sequences had been allocated. The results of our first round of simulations are summarized in Table 3. We first used a standard spectrum kernel with k varying from 1 up to 6; performance was observed to peak at k = 4 and deteriorate with values above this. We then tried all possible combinations of spectrums within this range: combining spectrum 1 with 2, spectrum 1 with 3, and so on. Of these combinations, the best proved to be K = [1,4]. We then added a third value to the K vector, and performance was seen to peak with K = [1,4,5], providing an MCC of 0.49 under 5-fold cross-validation.
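The fold allocation described above, cycling over the subsets and drawing a random sequence for each until none remain, can be sketched as follows. The function name and the fixed seed are assumptions, and the sketch does not stratify by class because the text does not say whether the folds were balanced.

```python
import random


def allocate_folds(items, n_folds=5, seed=0):
    """Deal items into folds by cycling over folds and drawing at random."""
    rng = random.Random(seed)
    remaining = list(items)
    folds = [[] for _ in range(n_folds)]
    fold_index = 0
    while remaining:
        # Draw one random unallocated item for the current fold.
        pick = rng.randrange(len(remaining))
        folds[fold_index].append(remaining.pop(pick))
        fold_index = (fold_index + 1) % n_folds
    return folds


# 2842 positives + 2606 negatives = 5448 training sequences.
folds = allocate_folds(range(5448))
print([len(fold) for fold in folds])
```

Because the folds are filled in turn, their sizes differ by at most one sequence, which matches the "roughly equal subsets" described in the text.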

Table 4. Support Vector Machine Simulations Trained with Motif Match Input Vectorsᵃ

| motif patterns | tp | tn | fp | fn | SucR | Sens | Spec | MCC |
|---|---|---|---|---|---|---|---|---|
| Loose NLSs | 1312 | 2083 | 523 | 1530 | 0.623 | 0.462 | 0.715 | 0.276 |
| Tight NLSs | 535 | 2520 | 86 | 2307 | 0.561 | 0.188 | 0.862 | 0.244 |
| DNA Binding Domains | 528 | 2448 | 158 | 2314 | 0.546 | 0.186 | 0.770 | 0.188 |
| All | 1364 | 2185 | 421 | 1478 | 0.651 | 0.480 | 0.764 | 0.339 |

ᵃ Performance statistics for a linear kernel SVM, which is trained on a feature space consisting of motif match bits. The results were produced from a single run of 5-fold cross-validation.

Table 5. Support Vector Machine Simulations with Augmented Spectrum Kernelsᵃ

| motif patterns | tp | tn | fp | fn | SucR | Sens | Spec | MCC |
|---|---|---|---|---|---|---|---|---|
| None | 2167 | 1895 | 711 | 675 | 0.746 | 0.762 | 0.753 | 0.490 |
| Loose NLSs | 2156 | 1923 | 683 | 686 | 0.749 | 0.759 | 0.759 | 0.497 |
| Tight NLSs | 2166 | 1896 | 710 | 676 | 0.746 | 0.762 | 0.753 | 0.490 |
| DNA Binding Domains | 2168 | 1900 | 706 | 674 | 0.747 | 0.763 | 0.754 | 0.492 |
| All | 2157 | 1924 | 682 | 685 | 0.749 | 0.759 | 0.760 | 0.497 |

ᵃ Performance statistics for the composite spectrum kernel SVM, which is then augmented with a feature space consisting of motif match bits. The results were produced from a single run of 5-fold cross-validation, using a base spectrum kernel with K = [1,4,5].

In the second set of simulations, we explore the predictive ability of the motifs outlined in Table 2. First, we trained models that used only the motif match vectors as the feature space; the results are shown in Table 4. We see here that the best individual motif set is the ‘loose NLSs’; however, the combination of all three considerably outperforms any of the individual motif sets.

Finally, we performed a set of simulations combining the spectrums and motifs. For K, we used the set of k values that performed best in the first round of simulations and augmented it with different motif sets, by adding patterns via the P parameter. The results are shown in Table 5. Here we see that each of the motif sets makes only minor modifications to the overall performance, in each case doing so in slightly different ways. The final simulation using all of the motif sets provides the best overall performance, albeit only very slightly better than the loose NLS set alone. In Table 5, we see that the specificity has increased, but this has not shown up in the MCC recorded to three decimal places. We use this final kernel as the basis of the NUCLEO application and train it on the entire training data set.

research articles

Table 6. Nuclear Localization Models Independent Test^a

model               data     tp    tn    fp    fn    SucR   Sens   Spec   MCC
NUCLEO              Dual     121   246   152   56    0.64   0.68   0.62   0.28
NUCLEO              Single   309   246   152   78    0.71   0.80   0.62   0.42
NUCLEO              All      430   246   152   134   0.70   0.76   0.62   0.38
LOCtree (Animal)    Dual     96    180   112   76    0.59   0.56   0.62   0.17
LOCtree (Animal)    Single   224   180   112   127   0.63   0.64   0.62   0.25
LOCtree (Animal)    All      320   180   112   203   0.61   0.61   0.62   0.22
LOCtree (Plant)     Dual     1     91    13    4     0.84   0.20   0.88   0.05
LOCtree (Plant)     Single   22    91    13    14    0.81   0.61   0.88   0.49
LOCtree (Plant)     All      23    91    13    18    0.79   0.56   0.88   0.45
LOCtree (Overall)   Dual     97    271   125   80    0.64   0.55   0.68   0.22
LOCtree (Overall)   Single   246   271   125   141   0.66   0.64   0.68   0.32
LOCtree (Overall)   All      343   271   125   221   0.64   0.61   0.68   0.29
P2SL                Dual     62    293   101   115   0.62   0.35   0.74   0.10
P2SL                Single   190   293   101   197   0.62   0.49   0.74   0.24
P2SL                All      252   293   101   312   0.57   0.45   0.74   0.19
HSLPred (Hybrid)    Dual     84    283   115   66    0.67   0.56   0.71   0.25
HSLPred (Hybrid)    Single   165   283   115   222   0.57   0.43   0.71   0.14
HSLPred (Hybrid)    All      249   283   115   288   0.57   0.46   0.71   0.18
NucPred             Dual     106   233   165   71    0.59   0.60   0.59   0.17
NucPred             Single   270   233   165   117   0.64   0.70   0.59   0.28
NucPred             All      376   233   165   188   0.63   0.67   0.59   0.25
predictNLS          Dual     37    369   29    140   0.71   0.21   0.93   0.20
predictNLS          Single   116   369   29    271   0.62   0.30   0.93   0.29
predictNLS          All      153   369   29    411   0.54   0.27   0.93   0.25
SubLoc              Dual     116   211   187   61    0.57   0.66   0.53   0.17
SubLoc              Single   290   211   187   97    0.64   0.75   0.53   0.29
SubLoc              All      406   211   187   158   0.64   0.72   0.53   0.25

a Independent test of the available nuclear localization prediction services. Results are generated using our test data set taken from Swiss-Prot R50, selecting proteins added from 2005 onward. We have split the results into two versions, one in which the positives are known to be dually localized and one in which they are not. In the middle section, we show the number of True Positives (tp), True Negatives (tn), False Positives (fp), and False Negatives (fn) for each model on each data set. In the final section, we give four standard measures of performance, as defined in Section 2. Note that the total number of proteins in each test varies because some of the classifiers would not accept certain proteins; e.g., P2SL does not recognize the symbol X in a sequence, and HSLPred would not return a result if a protein was too large. Note also that the numbers for tn and fp are identical for each of the three tests because the negative set is identical in each case.
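As a check on Table 6, the four measures can be recomputed from the raw counts. The sketch below assumes the usual definitions (success rate as overall accuracy, sensitivity as tp/(tp+fn), specificity as tn/(tn+fp), and the Matthews correlation coefficient); Section 2 remains the authoritative definition, and the function name is illustrative only.

```python
from math import sqrt

def performance(tp, tn, fp, fn):
    """Standard measures from confusion counts (assumed definitions)."""
    total = tp + tn + fp + fn
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "SucR": (tp + tn) / total,
        "Sens": tp / (tp + fn),
        "Spec": tn / (tn + fp),
        # Guard against an empty row or column in the confusion matrix.
        "MCC": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# NUCLEO counts as reported in Table 6.
NUCLEO = {
    "Dual": dict(tp=121, tn=246, fp=152, fn=56),
    "Single": dict(tp=309, tn=246, fp=152, fn=78),
    "All": dict(tp=430, tn=246, fp=152, fn=134),
}

if __name__ == "__main__":
    for name, counts in NUCLEO.items():
        stats = performance(**counts)
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

Under these assumed definitions, the NUCLEO rows of Table 6 are reproduced to two decimal places, including the success rate of 0.70 and MCC of 0.38 on the full test set.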

4.1. Independent Test. The quarantined test data were split into dually and nondually localized proteins to evaluate the models on the task they were trained on as well as on the broader task of detecting NPC-imported proteins. For submission to LOCtree, we split the test set into plant and non-plant proteins, with the following numbers: nuclear, 41 plant and 523 non-plant; non-nuclear, 105 plant and 293 non-plant. The P2SL model produces an output that indicates the localizations considered most likely in terms of the number of votes for each location in its ensemble SVM system. In the results returned from P2SL, it was very common for the nucleus to be included in the prediction; for 300 of the 398 non-nuclear proteins, the prediction contained some component of nuclear localization. Hence, for this evaluation, we accepted a protein as predicted nuclear only if the majority vote of 3 was achieved. The pSLIP model is absent from this test because the pSLIP Web service was unavailable during the testing period. In Table 6, we see that for all of the models bar HSLPred the performance was better on the proteins with a ‘single’ localization (non-dual), suggesting that the dually localized prediction task is somewhat different. Nevertheless, in every case, the performance revealed by our tests on the nondually localized data is significantly lower than that originally published by the developers. Note that the numbers of tn and fp produced by a predictor are identical across the three tests because the negative set is identical in each case.
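The three test conditions can be sketched as follows, assuming each test protein carries a label of 'dual', 'single', or 'negative' together with a boolean prediction. The record layout and function name are hypothetical and are not the format used in the study.

```python
def confusion_by_condition(records):
    """Tally tp/tn/fp/fn for the Dual, Single and All test conditions.

    records: iterable of (label, predicted_nuclear) pairs, where label is
    'dual', 'single' or 'negative' (hypothetical layout).
    """
    records = list(records)
    # The same negatives are reused in every condition.
    negatives = [pred for label, pred in records if label == "negative"]
    tn, fp = negatives.count(False), negatives.count(True)

    conditions = (
        ("Dual", {"dual"}),
        ("Single", {"single"}),
        ("All", {"dual", "single"}),
    )
    results = {}
    for name, labels in conditions:
        positives = [pred for label, pred in records if label in labels]
        results[name] = {
            "tp": positives.count(True),
            "tn": tn,
            "fp": fp,
            "fn": positives.count(False),
        }
    return results

if __name__ == "__main__":
    toy = [("dual", True), ("dual", False), ("single", True),
           ("single", True), ("negative", False), ("negative", True)]
    for name, counts in confusion_by_condition(toy).items():
        print(name, counts)
```

Because only the positive set changes between conditions, tp and fn for the combined condition are the sums of the dual and single values, while tn and fp stay fixed.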

5. Discussion

If we examine the results of our independent test, the LOCtree plant model was the only model that had a better MCC than NUCLEO, supporting their decision to produce separate models for plants and animals. However, the overall evaluation of the LOCtree model on identifying nuclear proteins produced an MCC of 0.29; hence, in general, NUCLEO outperforms LOCtree.

We consider the downward reappraisal of prediction ability on this problem to be due to several factors. First, and primarily, our knowledge of the number of functional nuclear targeting signals is still expanding, indicating that previous data sets likely had only a partial picture of the complete problem. Second, many of the previous predictors ignored dually localized data, which our analysis suggests can contribute to the overall prediction results. Finally, some of the previous models used redundant data sets that caused the model to overspecialize and the performance estimates to be considerable overestimates. It is a telling fact that the LOCtree model by Nair and Rost that used the most rigorous data curation techniques suffered the smallest drop in performance on the independent test set.

The performance of our model was stronger than that of other models on both dually and nondually targeted proteins, suggesting that there is enough overlap between the two classes that a model can identify the sequence features pertinent to localization via the Nuclear Pore Complex. We suspect that the inclusion of the dually localized proteins has steered the model away from basing its predictions on overall protein function, as is common in homology and purely amino acid composition based approaches. Instead, the model is forced to look to the common sequence features that affect localization via the NPC.

The fact that all but one of the models performed worse on the dually localized proteins suggests that either the mechanism of localization differs between the two categories of protein, or the models are relying on properties of the sequence that relate to function rather than transport. Our suspicion is that the truth is a mixture of these alternatives. Given that regulatory proteins are more likely to be dually localized, we foresee that further work in identifying the differences between the nuclear localization mechanisms will provide new insight into the role localization has to play in gene regulation.

6. Conclusion

The task of predicting nuclear proteins is complicated by the numerous targeting signals and the existence of shuttling proteins. Nuclear localization is a fundamental part of cellular dynamics and crucial to the regulation of the genome. Previously, the nucleus has generally been treated like any other organelle that has a discrete set of proteins that reside and work permanently within its confines. We have argued that nuclear localization is a special case that deserves an approach sensitive to the dynamical subtleties of the process. We have curated a data set to this end and developed a model that utilizes the power of the k-mer representation, extended through the use of motif-matched features extracted from a variety of sources. The use of stringently redundancy-reduced data and a motif-based machine learning model encourages the system to base its predictions on the presence of nuclear localizing sequences rather than on homology to known proteins. Under 5-fold cross-validation, our best model had an MCC of 0.50.

To critically evaluate the generalization ability of our own model and the other models that predict nuclear localization, we kept an independent nonredundant test set aside from the start. When we used this test set on the other available models, we observed that in all cases performance was far worse than initially estimated. Furthermore, all bar one of the models performed poorly at predicting dually localized proteins, indicating that, although dually localized proteins contain information that can assist the prediction of nondually localized proteins, there may be distinctions between the targeting mechanisms of proteins in either class. The final version of NUCLEO, with a success rate of 0.70 and an MCC of 0.38 on the independent test set, is currently the only prediction service trained on the full task of predicting nuclear matrix proteins imported via the Nuclear Pore Complex.

Contrary to the impression that might be gained by looking at the previous performance estimates of the existing nuclear localization predictors, we have demonstrated that the task of reliable computational identification of nuclear proteins is far from complete.

Acknowledgment. This research was supported by the ARC Centre for Complex Systems. The authors thank Rohan D. Teasdale for his advice on data curation, and Biter Bilen and Volkan Atalay for kindly taking the time to run our data through their predictor.

References

(1) Fujihara, S. M.; Nadler, S. G. Modulation of Nuclear Protein Import: A Novel Means of Regulating Gene Expression. Biochem. Pharmacol. 1998, 56, 157-161.
(2) Macara, I. G. Transport into and out of the Nucleus. Microbiol. Mol. Biol. Rev. 2001, 65, 570-594.
(3) Cokol, M.; Nair, R.; Rost, B. Finding nuclear localization signals. EMBO Rep. 2000, 1, 411-415.
(4) Heddad, A.; Brameier, M.; MacCallum, R. M. Evolving Regular Expression-Based Sequence Classifiers for Protein Nuclear Localisation. In Applications of Evolutionary Computing; Springer: Berlin/Heidelberg, 2004; Vol. 3005, pp 31-40.
(5) Suntharalingam, M.; Wente, S. R. Peering through the Pore: Nuclear Pore Complex Structure, Assembly, and Function. Dev. Cell 2003, 4, 775-789.
(6) Paine, P. L.; Moore, L. C.; Horowitz, S. B. Nuclear Envelope Permeability. Nature 1975, 254, 109-114.
(7) Holmer, L.; Worman, H. Inner Nuclear Membrane Proteins: Functions and Targeting. Cell. Mol. Life Sci. 2001, 58, 1741-1747.
(8) Weis, K. Importins and Exportins: How To Get in and out of the Nucleus. Trends Biochem. Sci. 1998, 23, 185-189.
(9) Christophe, D.; Christophe-Hobertus, C.; Pichon, B. Nuclear Targeting of Proteins: How Many Different Signals? Cell. Signalling 2000, 12, 337-341.
(10) Nair, R.; Carter, P.; Rost, B. NLSdb: Database of Nuclear Localization Signals. Nucleic Acids Res. 2003, 31, 397-399.
(11) Mueller, J. C.; Andreoli, C.; Prokisch, H.; Meitinger, T. Mechanisms for Multiple Intracellular Localization of Human Mitochondrial Proteins. Mitochondrion 2004, 3, 315-325.
(12) Hicks, G. R.; Raikhel, N. V. Protein Import into the Nucleus: An Integrated View. Annu. Rev. Cell Dev. Biol. 1995, 11, 155-188.
(13) Nakai, K.; Minoru, K. Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria. Proteins: Struct., Funct., Genet. 1991, 11, 95-110.
(14) Nakai, K.; Minoru, K. A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells. Genomics 1992, 14, 897-911.
(15) Nakai, K.; Horton, P. Better Prediction of Protein Cellular Localization Sites with the k Nearest Neighbors Classifier. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology; Gaasterland, T., Karp, P., Karplus, K., Ouzounis, C., Sander, C., Valencia, A., Eds.; AAAI Press: Menlo Park, CA, 1997.
(16) Nakai, K.; Horton, P. PSORT: A Program for Detecting Sorting Signals in Proteins and Predicting Their Subcellular Localization. Trends Biochem. Sci. 1999, 24, 34-35.
(17) Matthews, B. W. Comparison of Predicted and Observed Secondary Structure of T4 Phage Lysozyme. Biochim. Biophys. Acta 1975, 405, 442-451.
(18) Baldi, P.; Brunak, S. Bioinformatics: The Machine Learning Approach, 2nd ed.; MIT Press: Cambridge, MA, 2001.
(19) Rost, B. Twilight Zone of Protein Sequence Alignments. Protein Eng. 1999, 12, 85-94.
(20) Sarda, D.; Chua, G.; Li, K.-B.; Krishnan, A. pSLIP: SVM Based Protein Subcellular Localization Prediction Using Multiple Physicochemical Properties. BMC Bioinf. 2005, 6, 152.
(21) Park, K.-J.; Kanehisa, M. Prediction of Protein Subcellular Locations by Support Vector Machines Using Compositions of Amino Acids and Amino Acid Pairs. Bioinformatics 2003, 19, 1656-1663.
(22) Chou, K.-C.; Cai, Y.-D. Prediction of Protein Subcellular Locations by GO-FunD-PseAA Predictor. Biochem. Biophys. Res. Commun. 2004, 320, 1236-1239.
(23) Garg, A.; Bhasin, M.; Raghava, G. P. S. Support Vector Machine-Based Method for Subcellular Localization of Human Proteins Using Amino Acid Compositions, Their Order, and Similarity Search. J. Biol. Chem. 2005, 280, 14427-14432.
(24) Hua, S. J.; Sun, Z. R. Support Vector Machine Approach for Protein Subcellular Localization Prediction. Bioinformatics 2001, 17, 721-728.
(25) Lu, Z.; Szafron, D.; Greiner, R.; Lu, P.; Wishart, D.; Poulin, B.; Anvik, J.; Macdonell, C.; Eisner, R. Predicting Subcellular Localization of Proteins Using Machine-Learned Classifiers. Bioinformatics 2004, 20, 547-556.
(26) Atalay, V.; Cetin-Atalay, R. Implicit Motif Distribution Based Hybrid Computational Kernel for Sequence Classification. Bioinformatics 2005, 21, 1429-1436.
(27) Nair, R.; Rost, B. Mimicking Cellular Sorting Improves Prediction of Subcellular Localization. J. Mol. Biol. 2005, 348, 85-100.
(28) Cleves, A. E. Protein Transport: The Nonclassical ins and outs. Curr. Biol. 1997, 7, R318-R320.
(29) Sara Arab, C. A. L. Intracellular Targeting of the Endoplasmic Reticulum/Nuclear Envelope by Retrograde Transport May Determine Cell Hypersensitivity to Verotoxin via Globotriaosyl Ceramide Fatty Acid Isoform Traffic. J. Cell. Physiol. 1998, 177, 646-660.
(30) Bodén, M.; Hawkins, J. Prediction of Subcellular Localisation Using Sequence-Biased Recurrent Networks. Bioinformatics 2005, 21, 2279-2286.
(31) Vapnik, V. N. Statistical Learning Theory; Wiley: New York, 1998.
(32) Leslie, C.; Eskin, E.; Grundy, W. S. The Spectrum Kernel: A String Kernel for SVM Protein Classification. In Proceedings of the Pacific Symposium on Biocomputing; Altman, R. B., Dunker, A. K., Hunter, L., Lauderdale, K., Klein, T. E., Eds.; World Scientific: Hackensack, NJ, 2002.
(33) Kaneko, H.; Orii, K. O.; Matsui, E.; Shimozawa, N.; Fukao, T.; Matsumoto, T.; Shimamoto, A.; Furuichi, Y.; Hayakawa, S.; Kasahara, K.; Kondo, N. BLM (the Causative Gene of Bloom Syndrome) Protein Translocation into the Nucleus by a Nuclear Localization Signal. Biochem. Biophys. Res. Commun. 1997, 240, 348-353.
(34) Chang, C.-C.; Lin, C.-J. LIBSVM: a Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
(35) Davis, L.; Hawkins, J.; Maetschke, S.; Bodén, M. Comparing SVM Sequence Kernels: A Protein Subcellular Localization Theme. In Proceedings of the Workshop on Intelligent Systems for Bioinformatics; Boden, M., Bailey, T. L., Eds.; CRPIT, Vol. 73; ACS: Sydney, Australia, 2006; pp 39-47.

PR060564N

Journal of Proteome Research • Vol. 6, No. 4, 2007 1409