Protein Family Classification with Partial Least Squares Stephen O. Opiyo† and Etsuko N. Moriyama*,‡ Department of Agronomy and Horticulture, University of Nebraska-Lincoln, Lincoln, Nebraska, 68583-0915, and School of Biological Sciences and Plant Science Initiative, University of Nebraska-Lincoln, Lincoln, Nebraska, 68588-0660 Received October 11, 2006
The quality of protein function predictions relies on appropriate training of protein classification methods. Performance of these methods can be affected when only a limited number of protein samples are available, which is often the case in divergent protein families. Whereas profile hidden Markov models and PSI-BLAST presented significant performance decrease in such cases, alignment-free partial leastsquares classifiers performed consistently better even when used to identify short fragmented sequences. Keywords: partial least square • physico-chemical properties • amino acid composition • profile hidden Markov model • G-protein coupled receptors
Introduction Predicting protein functions is one of the most important problems in the post-genomic analyses. However, a large number of proteins in many genomes are still not annotated for their functions. The majority of protein classification methods currently used rely on building alignments: e.g., PSIBLAST,1 Pfam,2 and PROSITE.3 Some homologous proteins are highly diverged and lack enough sequence similarities even though they still share similar structures, biochemical properties, and functions. Obtaining reliable alignments among these protein sequences is difficult. Another disadvantage of these classifiers is that they are trained using only “positive” samples (proteins of interest). “Negative” samples (unrelated proteins) cannot be included in the alignments their models are built from. Recently efforts have been put to increase sensitivities against such non-alignable similarities. Support vector machines (SVMs)4 are used in, e.g., Karchin et al.5 and Liao and Noble.6 These methods combine profile hidden Markov models (HMMs) or pairwise alignments with SVMs. Although they are not completely alignment free, discrimination powers are increased by incorporating both positive and negative sample information in their training. They performed better than profile HMMs for classifying G-protein coupled receptors (GPCRs). Using various amino acid properties effectively avoids the alignment-building process. Kim et al.7 and Moriyama and Kim8 used parametric and nonparametric discriminant function analysis methods, and obtained 96% or higher true positive rates for discriminating GPCRs from non-GPCRs. These methods also worked well for identifying short sequences of GPCRs as short as 50 amino * To whom correspondence should be addressed. Etsuko N. Moriyama, School of Biological Sciences and Plant Science Initiative, University of Nebraska-Lincoln, N107 Beadle Center, Lincoln, NE 68588-0660. E-mail:
[email protected]; Phone: 402-472-4979; Fax: 402-472-3139. † Department of Agronomy and Horticulture. ‡ School of Biological Sciences and Plant Science Initiative.
846
Journal of Proteome Research 2007, 6, 846-853
Published on Web 12/20/2006
acids. Bhasin and Raghava9 used dipeptide composition with SVMs. Strope and Moriyama (submitted) reported that SVMs with simple amino acid composition outperformed profile HMMs for identifying remote similarities. A multivariate method, partial least-squares (PLS) regression, was used by Lapinsh et al.10 Their method was effective in classifying Class A GPCR sequences into its subfamilies. Using the covariant-discriminant algorithm with amino acid composition as descriptors, Chou and Elrod11 were able to classify 566 Class A GPCR sequences into its subfamilies. More recently, Chou12 used it to classify the three main GPCR classes: Class A, Class B, and Class C. To perform whole proteome prediction of protein function effectively and accurately, two often related problems need to be overcome: remote similarity detection and model-building based on limited samples. Such examples are found in extremely diverged protein families such as GPCRs. For example, the “mildew resistance locus O” (MLO) family is plant specific and currently only 15 member proteins are known. In total, only 22 GPCRs are known in the Arabidopsis thaliana genome, in a stark contrast to 1000 or more GPCRs found in human and mouse. It is possible that plants do not require this protein superfamily as much as animals. However, it is also possible that classifiers used to identify these proteins (mostly profile HMMs) are affected by insufficiently represented training datasets. It is thus important for the users to recognize how different classifiers perform in such difficult situations. In this study, we examined how the size of training datasets affects the classifier performance and also how different classifiers identify sequence fragments with different lengths. A large amount of expressed sequence tag (EST) sequences are now available, and their numbers are rapidly increasing. These ESTs usually contain only fragments of protein coding sequences. Because many recent genome projects, especially for plant genomes, concentrate on sequencing ESTs rather than complete genomes, it is important to utilize this information 10.1021/pr060534k CCC: $37.00
2007 American Chemical Society
research articles
Protein Family Classification with Partial Least Squares Table 1. Numbers of Samples Included in GPCR Datasets GPCRs datasets
class A
class B
class C
class D
class E
non-GPCRs
total
Training10 Training20 Training50 Training100 Training200 Test
1 3 8 23 73 136
1 2 7 17 17 50
1 2 7 7 7 10
1 2 2 2 2 4
1 1 1 1 1 0
5 10 25 50 100 2000
10 20 50 100 200 2200
source effectively. We therefore need to understand not only the sensitivity of various classifiers against non-alignable remote similarities but also against fragmented similarities. The GPCR superfamily has been used to test classifier performance in many previous studies. This is because GPCR proteins share little sequence similarities except a structural feature of having seven transmembrane regions. No reliable multiple alignment can be generated when different GPCR families are included, and attempting to find new GPCRs from new genomic or EST sequences is often hindered due to such extreme diversity. This provides a good test case for examining classifier performance in difficult situations. We thus used GPCR sequences as one example and compared the performance among profile HMMs, PSI-BLAST, and classifiers based on PLS with alignment-free descriptors: physicochemical properties and composition of amino acids. As another example, we used Cytochrome b561 proteins for examining EST mining performance. The results of this comparative study will be useful for choosing appropriate methods for identifying protein functions in various protein-mining situations.
Materials and Methods Data Sources. GPCR Data. All GPCR sequences were retrieved from the Swiss-Prot protein database.13 The GPCR superfamily is divided into five major classes (A, B, C, D, and E) based on the ligand types, functions, and sequence similarities (the Information System for G protein-Coupled Receptors; GPCRDB14). The sequences of different GPCR classes are highly diverged from each other. Class A is the largest class with more than 1800 entries in the Swiss-Prot database. Each class is divided into families, subfamilies, and so on. Non-GPCR Data. Non-GPCR sequences (negative samples) longer than 100 amino acids were randomly sampled from Swiss-Prot. The sequence identities, GPCR or not, were confirmed based on the Swiss-Prot annotations. Cytochrome b561 Data. Forty eight Cytochrome b561 (Cytb561) protein sequences were retrieved from GenBank.15 They included eleven sequences from plants (four from A. thaliana, one from maize, and six from rice) and 37 sequences from animals. Dataset Preparation. Training and Test Datasets. Five sets of training data each including both positive (GPCR) and negative (non-GPCR) samples were generated as shown in Table 1. Three non-overlapping samples were prepared for each training dataset. For the smallest training set, Training10, five replications were prepared. The test dataset included 200 positive and 2000 negative samples. Note that for SAM and PSIBLAST, only positive samples were used for training. Subsequence Test Datasets. From each sequence of the test dataset, 50, 75, or 100 amino acid subsequences were taken from its N- and C-terminals. These six subsequence test sets
Table 2. Five Principal Component Scores for the 20 Amino Acids principal component scores (% variance)a amino acids
PC1 (40.4)
PC2 (28.7)
PC3 (10.9)
PC4 (8.9)
PC5 (4.2)
alanine (A) arginine (R) aspartate (D) asparagine (N) cystine (C) glutamate (E) glutamine (Q) glycine (G) histidine (H) isoleucine (I) leucine (L) lysine (K) methionine(M) phenylalanine (F) proline (P) serine (S) threonine (T) tryptophan (W) tyrosine (Y) valine (V)
-1.74 0.68 -2.24 -1.82 -0.22 -0.61 -0.69 -4.02 0.04 2.16 1.93 -1.31 1.83 3.56 -2.07 -2.63 -1.36 4.96 2.59 0.98
-2.24 3.56 0.84 0.98 0.30 1.55 1.65 -2.56 2.36 -2.24 -1.85 2.26 0.07 -0.97 -0.11 -1.01 -0.44 0.41 0.50 -3.05
-1.88 0.79 0.23 0.53 -0.72 -1.14 -0.75 1.12 -1.18 0.02 -1.21 -0.91 -1.76 0.28 2.45 0.47 0.62 1.22 2.12 -0.28
0.41 2.28 -1.99 -0.53 -0.67 -2.07 -0.55 0.58 -0.05 0.51 0.26 2.51 -0.37 -0.23 -0.22 0.13 0.03 -0.46 -0.46 0.91
-0.06 1.08 0.51 0.27 -0.81 0.77 0.65 -0.42 -1.67 0.34 -0.60 -0.41 0.00 0.00 -1.39 0.24 0.91 -0.56 0.29 0.87
a Percent total variance of the original 12 physicochemical properties. See Table S2, Supporting Information, for the loadings for each principal component.
including 2200 sequences of a given length were used to examine classifier performance against short sequences. Cyt-b561 Training Sets. Three sets of training data were created from the Cyt-b561 protein family: “Arabidopsis only” including four A. thaliana Cyt-b561s, “Plants” including 11 plant Cyt-b561s (including four from A. thaliana), and “Plants and animals” including 48 Cyt-b561s (11 from plants and 37 from animals). All these datasets included the same number of negative (non-Cyt-b561) sequences. Arabidopsis Expressed Sequence Tag (EST) Sequences. 362, 202 A. thaliana EST sequences were downloaded from The Arabidopsis Information Resource database (ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/AtEST.Z derived from the dbEST release for November, 2004). Each EST sequence was translated into three reading frames. ESTs containing the four known A. thaliana Ctyb-561 coding sequences (At1g14730, At1g26100, At4g25570, and At5g38630) were identified using the Smith-Waterman local alignment algorithm16 implemented in SSEARCH.17 Similarly ESTs containing multidomain protein sequences with Cyt-b561 domains were identified. Table S1 (see Supporting Information) lists the accession numbers of these ESTs. These EST sequences were considered to be true positives. Descriptor Development. In Lapinsh et al.,10 five principal components (z-scales) originally derived by Sanberg et al.18 were used as descriptors for each amino acid. However, these Journal of Proteome Research • Vol. 6, No. 2, 2007 847
research articles
Opiyo and Moriyama
composition, and amino acid composition with principal component analysis. Transformation of an unaligned set of amino acid sequences into a uniform matrix makes PLS and other multivariate methods applicable without aligning sequences Transformation of Amino Acid Sequences. Mean Transformation. After each amino acid sequence was transformed to a set of five PC scores, the average was taken for each of the five PC scores. Regardless of the length, each amino acid sequence can be represented by an array of five average PC scores as follows:
(
1
L
1
L
L
1
1
L
L
1
∑ PC1 , L ∑ PC2 , L ∑ PC3 , L ∑ PC4 , L ∑ PC5 L
Figure 1. Training dataset size and classifier performance. The average values from three (or five) replications of training are plotted with error bars. (A) accuracy rate (%) as well as false positive rate (%, at the bottom); (B) false negative rate (%). For each classifier, the results are shown from left to right using the following training sets: Training10 (X), Training20 (0), Training50 (*), Training100 (9), and Training200 (+).
z-scales gave poor results for our classification problem. Lapinsh et al.10 classified Class A GPCRs into subfamilies, but our problem is to classify GPCRs from non-GPCRs. Therefore, we chose our own twelve physicochemical properties considering the methods used, protein family used, etc. They include: mass,19 volume,20 surface area,20 hydrophilicity,21 hydrophobicity,22 isoelectric point,23 transfer of energy solvent to water,24 refractivity,25 nonpolar surface area,26 frequencies of R-helix,27 β-sheet,27 and reverse turn.27 These physicochemical property values were first scaled to the unit variance. Principal component analysis (PCA) was performed on these twelve properties for 20 amino acids. The first five principal components (PCs) explaining 93.2% of the total variance were selected (Table 2 and Table S2, see Supporting Information). The first principal component (PC1) covered 40.4% of the total variance of the original 12 physicochemical properties. It represents all properties except isoelectric point and nonpolar surface. PC2 (28% of the total variance) has negative relationships with the hydrophobicity properties. PC3 and PC5, although their contributions are only 15% of the total variance, represent secondary structure properties. Using these five PC scores, each amino acid sequence was converted to a sequence of 5L descriptors, where L is the number of amino acids in the sequence. Because unaligned sequences have varied lengths, to transform them to a uniform matrix, four transformation methods were used: mean, auto/cross-covariance, amino acid 848
Journal of Proteome Research • Vol. 6, No. 2, 2007
i
i)1
i
i
i)1
i)1
i
i)1
i)1
i
)
(1)
where L is the length of the amino acid sequence and PC1i, ..., PC5i are five PC values for the ith amino acid. Auto and Cross-Covariance (ACC) Transformation. Auto/ cross covariance (ACC) transformation of amino acid sequences was proposed by Wold et al.28 and used in Lapinsh et al.10 The ACC describes the average correlations between residues a certain lag apart. After each amino acid sequence was transformed to a set of five PC scores, auto-covariances (AC) and cross-covariances (CC) for each sequence were calculated as follows. The auto-covariance of PC1 at the amino acid position i with the lag size 1, AC1, i(1), is calculated with PC1i multiplied by PC1i+1, where PC1i is the PC1 value of the ith amino acid. The auto-covariance of PC1 for a sequence with the lag size 1, AC1(1), is the average of these products from the position 1 to the position L-1 (L is the length of the sequence). The crosscovariance of PC1 and PC2 at the amino acid position i with the lag size 1, CC12, i(1), is calculated by multiplying PC1i with PC2i+1. The cross-covariance of PC1 and PC2 for a sequence with the lag size 1, CC12(1), is the average of these products from the position 1 to the position L-1. The following equations summarize these calculations: ACj(d) )
L-d
1
∑ (PCj )(PCj L-d i
i+d)
(2)
i)1
CCjk(d) )
L-d
1 L-d
∑ (PCj )(PCk i
i+d)
(3)
i)1
where d is the lag size, PCji and PCki are the jth and kth PC value for the ith amino acid, respectively (j * k; j, k ) 1, 2, 3, 4, or 5), and L is the length. Note that Lapinsh et al.10 used slightly modified and normalized version of ACC and reported improvement in classification performance compared to using the original Wold et al.’s ACC. We used the R function, acf with the “covariance” option, for the ACC implementation (R version 2.2.029). It uses the following equations: ACj(d) )
1
L-d
∑ (PCj - PCj)(PCj L i
i+d
- PCj)
(4)
- PCk)
(5)
i)1
CCjk(d) )
1 L
L-d
∑ (PCj - PCj)(PCk i
i+d
i)1
where PCj and PCk are the means of PCji and PCki, respectively. Use of the auto/cross-correlation (with normalization) did not show any difference in classification performance.
Protein Family Classification with Partial Least Squares
research articles
Figure 2. ROC graphs for classifiers using different sizes of training datasets. Plots are based on one training replication. ROC curves were similar between training replications. Classifier are shown as follows: PLS-ACC (O), PLS-Mean (X), PLS-AA (*), PLS-AA_PCA (0), SAM (9), and PSI-BLAST (+).
Whereas the auto-covariances emphasize the interactions between amino acids, interactions between different amino acid properties are incorporated into the cross-covariances. Note that the ACC transformation incorporates positional information from sequences. Amino Acid Composition Transformation. From each amino acid sequence, frequencies of 20 amino acids were simply calculated. We used 19 frequencies as a set of descriptors since the 20th amino acid frequency can be explained completely by the other 19. Amino Acid Composition/PCA Transformation. After each sequence is transformed to a 20 amino acid composition array, its dimension was further reduced by using PCA. As before, five top principal components were selected. Each sequence is represented with an array of five PC scores. Classifiers. Partial Least-Squares (PLS). PLS is a projection method similar to PCA where the independent variables, represented as the matrix X, are projected onto a low dimensional space. The main difference between PLS and PCA is that PCA uses only the independent variables X, and PLS uses both independent variables X and dependent variables Y30. We used an R implementation, the PLS package developed by Wehrens and Mevik,31 with the SIMPLS method and the cross-validation option. Both PCA and PLS reduce the high dimensional data to a fewer components called scores that explain much of the variations of the original variables. In PCA this is achieved without the regard of the dependent variables, whereas in PLS correlations between independent and dependent variables are also taken into account. Because PLS reduces the high dimension of the variables to a much lower dimension with scores, PLS performs well when there are more variables than observations. It deals with the
multicollinearity problem by choosing orthogonal components even if many original variables are correlated to each other. It does not rely on assumption of normality, either. Because of these advantages found in PLS, this method is frequently used in analyzing gene expression data that usually have extremely high dimensionality. In our protein classification application, PLS methods can be similarly advantageous because especially ACC transformation described above produces a high dimension descriptor array. In this study, for each of the training samples, a response variable is assigned as 1 for the positive (GPCR) label, and 0 for the negative (non-GPCR) label. Therefore, the training dataset including N protein sequences is represented with two matrices X and Y with dimensions N × K (for K descriptors) and N × 1, respectively. The group membership, GPCR or nonGPCR, of a new sequence is predicted based on its K descriptors and calculating the X-score, X-residual, and y-value. Group assignment is done based on the y-value. In this study, four PLS classifiers were examined. They are based on different descriptor sets described before: “PLS-ACC” using ACC descriptors, “PLS-mean” based on mean PC scores, “PLS-AA” based on simple amino acid composition, and “PLS-AA_PCA” using amino acid composition transformed with PCA. Goodness of Fit of PLS Model. The goodness of fit, R2, describes how well the dataset can be mathematically reproduced by the fitted model: R2 ) 1 - RSS/SSY
(6)
where RSS is the residual (error) sum of squares, SSY is the total sum of squares of observed Y. R2 varies between 0 and 1. The closer to 1, the better the model is. Journal of Proteome Research • Vol. 6, No. 2, 2007 849
research articles
Opiyo and Moriyama
Figure 3. Subsequence lengths and classifier performance. Results of N-terminal subsequence tests are shown. A: accuracy rate (%) as well as false positive rate (%, at the bottom); B: false negative rate (%). Method symbols are described in Figure 2 legend.
Goodness of Prediction of PLS Model. A model is not good enough if it has a high goodness of fit to the training data. More important is the predictive ability of the model. The goodness of prediction, Q2, describes how well the model can predict a data. It is calculated similar to R2, but using the cross-validation procedure: Q2 ) 1 - PRESS/SSY
(7)
where PRESS is the predictive residual sum of squares, which is calculated from the difference between observed and predicted Y values. Q2 > 0.5 is considered good. In this study, the leave-one-out cross-validation procedure was used for the Q2 calculation. Detailed results of PLS analyses are given in Tables S2 and S3 (see Supporting Information) for PLS-ACC and PLSAA, respectively. PSI-BLAST. PSI-BLAST1 builds position-specific scoring matrices (PSSMs) from a multiple alignment. In the regular implementation, PSI-BLAST first performs a regular protein similarity search (BLASTP) against a protein database using a single protein query. The first PSSM is built from a set of highly similar sequences obtained from this search. Subsequent searches are done based on the PSSM iteratively built from high-score hits. In this study, we used prealigned positive sequences as the first input. Multiple alignments were generated using ClustalX version 1.8332 with the default parameters. Ten iterations with E value ) 10 as the threshold for building PSSM were performed against the test dataset. 850
Journal of Proteome Research • Vol. 6, No. 2, 2007
Profile Hidden Markov Models (HMMs). Profile HMMs are the full probabilistic representation of sequence profiles.33 The profile HMMs are built using only positive samples. In this study, profile HMMs were built using the w0.5 script of the Sequence Alignment and Modeling Software System (SAM version 3.5).34 Multiple alignments used with the w0.5 script were built using the builmodel and align2model programs. The profile HMMs were built from multiple alignments using the modelfromalign program with Dirichlet mixture priors (recode3.2comp)35 to improve the models by assigning prior probabilities of amino acids to the models, and the weight option 0.5 was used to save 0.5 bits of information per column of the multiple alignment. The test sequences were scored against models using hmmscore with the “calibrate” option (for more accurate E value calculation) and with the option -sw 2 for local scoring. EST Mining. Classifiers were trained using the three datasets described earlier (“Arabidopsis only”, “Plants”, “Plants and animals”). The trained classifiers were applied to identify Cytb561 containing ESTs. The E-value thresholds for PSI-BLAST and SAM, 1.01 and 1.24, respectively, were chosen based on the minimum error points (described later) obtained from GPCR classification using the training sets of 200 samples. For the PLS classifiers the cutoff of 0.4665 was chosen based on the minimum error points. Performance Analysis. Statistics. Predictions are grouped as follows: • True positives (TP): the numbers of actual positives predicted as positives. • False positives (FP): the numbers of actual negatives predicted as positives. • True negatives (TN): the numbers of actual negatives predicted as negatives. • False negatives (FN): the numbers of actual positives predicted as negatives. The performance statistics are calculated as follows: • Accuracy ) (TP + TN)/ (TP +TN + FP + FN) • False positive rate ) FP/(FP + TN) ) 1 - specificity • False negative rate ) FN/(FN + TP) • True positive rate ) TP/(TP + FN) ) sensitivity Receiver Operating Characteristics Graph. Receiver operating characteristics (ROC) graph is useful for visualizing the performance of various methods.36 An ROC curve is a graphical representation of the trade off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity). It is illustrated as a plot with false positive rates (1-specificity) against true positive rates (sensitivity). The closer the curve follows the left-hand border and then the top border of the ROC space, the better classifiers perform. We can quantify the classifier performance by measuring the area under the curve. The closer the area to 1.0, the better the classifier is, and the closer the area to 0.5, the worse the classifier is (it is no better than random assignment). Minimum Error Point. Minimum error point (MEP) was used by Karchin et al.5 It is the threshold score where the method produces the minimum number of errors (false positives + false negatives).
Results Training Set Size and Classifier Performance. Datasets of five different sizes were used for training classifiers as shown in Table 1. PLS classifiers were compared against SAM and PSIBLAST for their GPCR identification performance. Figure 1
research articles
Protein Family Classification with Partial Least Squares Table 3. Identification of Cyt-b561 Containing Arabidopsis ESTs numbers of ESTs identified for each Cyt-b561a methods
At5g38630 (5)
At1g26100 (3)
At4g25570 (9)
At1g14730 (1)
PLS-ACC PLS-AA SAM PSI-BLAST PLS-ACC PLS-AA SAM PSI-BLAST PLS-ACC PLS-AA SAM PSI-BLAST
3 2 1 1 5 5 2 2 5 5 5 5
3 3 1 1 3 3 1 1 3 3 3 3
5 5 5 5 7 7 4 4 7 7 4 4
1 1 0 0 1 1 0 1 1 1 1 1
training datasets
Arabidopsis only
Plants
Plants and animals
average lengthsb
435 (359) 432 (359) 509 (411) 509 (411) 517 (359) 517 (359) 534 (406) 528 (406) 536 (359) 536 (359) 522 (401) 522 (401)
a The numbers of ESTs identified by SSEARCH are shown in the parentheses. b The average lengths (bp) of ESTs correctly identified. The minimum lengths are shown in the parentheses.
Table 4. Identification of multidomain Cyt-b561 Proteins from Arabidopsis ESTsa numbers of ESTs identified for each Cyt-b561 training datasets
Arabidopsis only
Plants
Plants and animals
a
methods
At5g47530 (7)
At3g07570 (3)
At3g61750 (1)
PLS-ACC PLS-AA SAM PSI-BLAST PLS-ACC PLS-AA SAM PSI-BLAST PLS-ACC PLS-AA SAM PSI-BLAST
7 7 5 5 7 7 7 7 7 7 7 7
3 3 3 3 3 3 3 3 3 3 3 3
1 1 0 0 1 1 1 1 1 1 1 1
average lengths
530 (430) 532 (430) 535 (438) 535 (438) 530 (430) 530 (430) 530 (430) 530 (430) 530 (430) 530 (430) 530 (430) 530 (430)
See the footnotes for Table 3.
summarizes the performance of the six classifiers. All classifiers were affected by the training dataset size but at varied degrees. The accuracy rates at the MEP among classifiers were affected by the size of training datasets similarly (Figure 1A top). With the Training200 dataset, all classifiers showed 97% or higher accuracy. With smaller datasets, the accuracy rates decreased as low as 88-92% (Training10). When small datasets were used for training, PLS classifiers maintained low false negative rates (1-sensitivity; 4-8.5% with Training10 and 2-6% with Training20) as shown in Figure 1B. SAM and PSI-BLAST, however, had much more false negatives (66-78% with Training10 or Training20). Performance of SAM and PSI-BLAST also fluctuated much more than PLS classifiers among replicated trainings as indicated by the long error bars in Figure 1. Note that even though SAM and PSI-BLAST had very high false negative rates, they had very low false positive rates (1-specificity) even when very small training datasets were used (3-4% with Training10 or Training20; Figure 1A bottom). PLS classifiers, on the other hand, had false positive rates ranging from 7 to 11% when trained with such small datasets. The ROC plots in Figure 2 show the difference in performance between SAM/PSI-BLAST and PLS classifiers more clearly. When trained on small training datasets, SAM and PSIBLAST performed almost as a random classifier. They seemed to require more than 50 positive samples in the training sets to produce performance equivalent to PLS classifiers.
Identification of Short Subsequences. In the second experiment, we examined how different classifiers can identify fragments of protein sequences with various lengths. The classifiers were trained with the dataset including 200 samples (Training200; see Table 1). Figure 3 shows the results of N-terminal subsequence tests. Although the difference in accuracy was not large, PLS-ACC and PLS-AA performed slightly better than SAM and PSI-BLAST when sequence lengths were 75 amino acids (aa) or shorter (88-94% by PLS-ACC or PLSAA vs 86-89% by SAM or PSI-BLAST). The accuracy rate of PLS-ACC was 91.4% even against 50-aa sequences. For short sequences, the false negative rates of SAM and PSI-BLAST were very high (72-84% for 50aa sequences and 56-66% for 75aa sequences). False negatives given by PLS classifiers, especially PLS-ACC and PLS-AA, were much fewer (12% or fewer). Consistent with the results obtained against full-length sequences, SAM and PSI-BLAST had low false positive rates (below 5.5%).PLS classifiers had slightly higher false positive rates (9-15% for 50aa sequences and 7-10% for 75aa sequences). Results against C-terminal subsequences were also very similar (data not shown). Identification of Cytochrome b561 from A. thaliana EST Sequences. To examine the performance of these classifiers against the actual short partial sequences, we applied the classifiers for identifying A. thaliana ESTs. ESTs are usually 5′ or 3′ fragments of cDNA sequences, and in addition to 5′ or 3′ Journal of Proteome Research • Vol. 6, No. 2, 2007 851
research articles non-coding transcribed region, they could include short coding regions at the start or end of genes. For this test, we used another protein family, Cytochrome b561 (Cyt-b561), as the second example. Cyt-b561 is an integral membrane protein that is found in various organisms from humans to plants. It has six transmembrane regions as well as a pair of hemes. Cytb561 sequences also exist in some multidomain proteins, linked to other domains such as the Dopamine β-hydroxylase (DOH) domain.37 When we used simply BLASTP1 for similarity search using the four Arabidopsis Cyt-b561 sequences (At5g38630, At1g26100, At4g25570, and At1g14730) as queries and using the default E-value threshold of 10, we could identify only 6 ESTs (one derived from At5g38630, two derived from At1g26100, and three derived from At5g25570) of 18 found by SSEARCH. With three multidomainproteinscontainingtheCyt-b561domain(At5g47530, At5g47530, and At361750) as queries, we could identify only 4 ESTs (three derived from At5g47530 and one from At3g07570) of 11 found by SSEARCH. Next, we trained classifiers for Cyt-b561 identification and used these classifiers to find Arabidopsis ESTs that include Cytb561 fragments. As shown in Table 3, PLS-ACC and PLS-AA were able to identify EST sequences as short as 359 bp including 225 bp of a 5′ Cyt-b561 coding region. Such short ESTs could not be correctly identified by SAM and PSI-BLAST. When classifiers were trained with more Cyt-b561 including other plant and animal sequences, PLS-ACC and PLS-AA only missed two ESTs derived from At4g25570, but SAM and PSIBLAST missed five. All eleven ESTs from multidomain Cyt-b561 sequences were identified by PLS-ACC and PLS-AA regardless of the size of the training sets, whereas SAM and PSI-BLAST could identify only eight when they were trained using the dataset including only four Arabidopsis Cyt-b516 samples (Table 4).
Discussion The number of sequences used in the training dataset did not affect the performance of alignment-free PLS classifiers. Small samples from diverged sequences such as GPCRs must have caused both over-fitting as well as unreliable alignments to decrease sensitivities of the alignment-based methods. The consistent results were shown with Cyt-b561 protein identification from Arabidopsis EST sequences. PLS classifiers did not require a large training set to gain optimal performance on these EST sequences. It confirmed that PLS classifiers are more sensitive (fewer false negatives) than SAM and PSI-BLAST regardless of the training data sizes, sequence lengths, or protein families. Note, however, that PLS classifiers are in general less specific (with more false positives) than SAM and PSI-BLAST. More work is required to reduce the number of false positives by alignment-free PLS-classifiers. Among PLS classifiers PLS-ACC and PLS-AA performed slightly better than others. Although PLS-ACC can incorporate the interaction between amino acid positions, the performance gain with using ACC over simpler amino acid composition was not very significant. It is reasonable to assume that amino acid residues in a protein sequence are not independent and such relationships are specific to each protein group considering the existence of functional domains or sites. Such correlations among sites depending on functional domains can be identified and utilized by classifiers using ACC transformation. Therefore, it is surprising to see much simpler descriptors like amino acid composition indeed worked as well or better than ACC descrip852
Journal of Proteome Research • Vol. 6, No. 2, 2007
Opiyo and Moriyama
tors. One drawback of ACC is that calculating it is computationally expensive. With lag ) 30, the input vector for each sequence includes 775 descriptors. Using much simpler descriptors as amino acid composition appears to be more attractive, for example, for the first-step classification of the whole proteome analyses. As our results showed, the alignment-free PLS classifiers can be used in a situation where there are not enough example sequences. For example, currently only 12 members of Class D and 4 members of Class E GPCRs are available. Those few sequences can be effectively used to train PLS classifiers to search new GPCR sequences in those classes from whole proteome data as well as EST sequences.
Acknowledgment. We thank Dr. Steve Kachman (Department of Statistics, UNL) for statistical advice and Dr. Han Asard (Department of Biology, University of Antwerp, Belgium) for helpful discussion on Cytochrome b561 mining from Arabidopsis ESTs. This work was funded by Nebraska EPSCoR Women in Science grant and NSF EPSCoR Type II grant to E.N.M. and by Bioinformatics Interdisciplinary Research Scholars sponsored by NSF EPSCoR Infrastructure Improvement grant: Bioinformatics Research Laboratory to S.O.O. Supporting Information Available: Table S1 lists the A. thaliana ESTs containing known Cyt-b561 coding sequences. Table S2 lists the loadings of the physicochemical properties of amino acids for the five principal components. Table S3 lists the number of PLS components and the predictive ability of PLS-ACC from the leave-one-out cross validation procedure. Table S4 lists the number of PLS components and the predictive ability of PLS-AA from the leave-one-out cross validation procedure. Table S5 lists the number of PLS components and the predictive ability of PLS-ACC and PLS-AA from the leaveone-out cross validation procedure obtained from Arabidopsis EST mining. This material is available free of charge via the Internet at http://pubs.acs.org. References (1) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389-3402. (2) Bateman, A.; Coin, L.; Durbin, R.; Finn, R. D.; Hollich, V.; GriffithsJones, S.; Khanna, A.; Marshall, M.; Moxon, S.; Sonnhammer, E. L. L.; Studholme, D. J.; Yeats, C.; Eddy, S. R. The Pfam protein families database. Nucleic Acids Res. 2004, 32, D138-141. (3) Hulo, N.; Sigrist, C. J. A.; Le Saux, V.; Langendijk-Genevaux, P. S.; Bordoli, L.; Gattiker, A.; De Castro, E.; Bucher, P.; Bairoch, A. Recent improvements to the PROSITE database. Nucleic Acids Res. 2004, 32, D134-D137. (4) Vapnik, V. N. The Nature of Statistical Learning Theory, 2nd ed.; Springer-Verlag: New York, 1999. (5) Karchin, R.; Karplus, K.; Haussler, D. Classifying G-protein coupled receptors with support vector machines. Bioinformatics 2002, 18, 147-159. (6) Liao, L.; Noble, W. S. Combining pairwise sequence similarity and Support vector machines for detecting remote protein evolutionary and structural relationships. J. Comput. Biol. 2003, 10, 857-868. (7) Kim, J.; Moriyama, E. N.; Warr, C. G.; Clyne, P. J.; Carlson, J. R. Identification of novel multi-transmembrane proteins from genomic databases using quasi-periodic structural properties. Bioinformatics 2000, 16, 767-775. (8) Moriyama, E. N.; Kim, J. Protein family classification with discriminant function analysis. In Genome Exploitation: Data Mining the Genome; Gustafson, J. P., Shoemaker, R., Snape, J. W., Eds.; Springer: New York, 2005; pp 121-132. (9) Bhasin, M.; Raghava, G. P. S. GPCRsclass: a web tool fro the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res. 2005, 33, W143-W147.
research articles
Protein Family Classification with Partial Least Squares (10) Lapinsh, M.; Gutcaits, A.; Prusis, P.; Post, C.; Lundstedt, T.; Wikberg, J. E. S. Classification of G-protein coupled receptors by alignment-independent extraction of principal chemical properties of primary amino acid sequences. Protein Sci. 2002, 11, 795805. (11) Chou, K. C.; Elrod, D. W. Bioinformatical analysis of G-protein coupled receptors. J. Proteome Res. 2002, 1, 429-433. (12) Chou, K. C. Prediction of G-protein-coupled receptors classes. J. Proteome Res. 2005, 4, 1413-1418. (13) Boeckmann, B.; Bairoch, A.; Apweiler, R.; Blatter, M.-C.; Estreicher, A.; Gasteiger, E.; Martin, M. J.; Michoud, K.; O’Donovan, C.; Phan, I.; Pilbout, S.; Schneider, M. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 2003, 31, 365-370. (14) Horn, F.; Bettler, E.; Oliveira, L.; Campagne, F.; Cohen, F. E.; Vriend, G. GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res. 2003, 31, 294297. (15) Benson, D. A.; Karsch-Mizrachi, I.; Lipman, D. J.; Ostell, J.; Wheeler, D. L. GenBank. Nucleic Acids Res. 2006, 34, D16-20. (16) Smith, T. F.; Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195-197. (17) Pearson, W. R. Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 1991, 11, 635-650. (18) Sanberg, M.; Eriksson, L.; Jonsson, J.; Wold, S. New chemical descriptors relevant for the design of biologically active peptides. A multivariate characterization of 87 amino acids. J. Med. Chem. 1998, 41, 2481-2491. (19) Biemann, K. Sequencing of peptides by tandem mass spectrometry and high-energy collision-induced dissociation. Methods Enzymol. 1990, 193, 455-479. (20) Chothia, C. The nature of the accessible and buried surfaces in proteins. J. Mol. Biol. 1976, 105, 1-12. (21) Parker, J. M.; Guo, D.; Hodges, R. S. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 1986, 25, 5425-5432. (22) Eisenberg, D.; Schwarz, E.; Komaromy, M.; Wall, R. Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J. Mol. Biol. 1984, 179, 125142. (23) Zimmerman, J. M.; Eliezer, N.; Simha, R. The characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol. 1968, 21, 170-201. (24) Nozaki, Y.; Tanford, C. The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions. Establishment of a hydrophobicity scale. J. Biol. Chem. 1971, 246, 2211-2217.
(25) Jones, D. D. Amino acid properties and side-chain orientation in proteins: a cross correlation approach. J. Theor. Biol. 1975, 50, 167-183. (26) Bordo, D.; Argos, P. Suggestions for “safe” residue substitutions in site-directed mutagenesis. J. Mol. Biol. 1991, 217, 721-729. (27) Levitt, M. Conformational preferences of amino acids in globular proteins. Biochemistry 1978, 17, 4277-4285. (28) Wold, S.; Jonsson, J.; Sjostrom, M.; Sandberg, M.; Rannar, S. DNA and Peptide Sequences and Chemical Processes Multivariately Modeled by Principal Component Analysis and Partial LeastSquares Projections to Latent Structures. Anal. Chim. Acta 1993, 277, 239-253. (29) R Development Core Team. R: A language and environment for statistical computing; R Foundation for Statistical Computing: Vienna, Austria, 2006; http://www.R-project.org. (30) Geladi, P.; Kowalski, B. R. Partial least squares regression: A tutorial. Anal. Chim. Acta 1986, 185, 1-17. (31) Wehrens, R.; Mevik, B. pls: Partial Least Squares Regression (PLSR) and Principal Component Regression (PCR), R package version 1.2-1; 2006; http://mevik.net/work/software/pls.html. (32) Thompson, J. D.; Higgins, D. G.; Gibson, T. J. Clustal-W Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22, 4673-4680. (33) Durbin, R.; Eddy, S.; Krogh, A.; Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, 1998. (34) Hughey, R.; Krogh, A. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. Comput. Appl. Biosci. 1996, 12, 95-107. (35) Sjo¨lander, K.; Karplus, K.; Brown, M.; Hughey, R.; Krogh, A.; Mian, I. S.; Haussler, D. Dirichlet mixtures: a method for improving detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 1996, 12, 327-345. (36) Hanley, J. A.; McNeil, B. J. The meaning and use of the area under the receiver operating characteristics (roc) curve. Radiology 1982, 143, 29-36. (37) Ponting, C. P. Domain homologues of dopamine b-hydroxylase and ferric reductase: roles for iron metabolism in neurodegenerative disorders. Hum. Mol. Genet. 2001, 10, 1853-1858.
PR060534K
Journal of Proteome Research • Vol. 6, No. 2, 2007 853