Prediction of G-Protein-Coupled Receptor Classes - Journal of

obtained merely represents some sort of optimal estimation.9,10,14,17 Nevertheless, ...... Delaunay triangulation with partial least squares proje...
1 downloads 0 Views 521KB Size
Prediction of G-Protein-Coupled Receptor Classes Kuo-Chen Chou* Gordon Life Science Institute, 13784 Torrey Del Mar, San Diego, California 92130 Received April 4, 2005

Being the largest family of cell surface receptors, G-protein-coupled receptors (GPCRs) are among the most frequent targets of therapeutic drugs. The functions of many of GPCRs are unknown, and it is both time-consuming and expensive to determine their ligands and signaling pathways. This forces us to face a critical challenge: how to develop an automated method for classifying the family of GPCRs so as to help us in classifying drugs and expedite the process of drug discovery. Owing to their highly divergent nature, it is difficult to predict the classification of GPCRs by means of conventional sequence alignment approaches. To cope with such a situation, the CD (Covariant Discriminant) predictor was introduced to predict the families of GPCRs. The overall success rate thus obtained by jack-knife test for 1238 GPCRs classified into three main families, i.e., class A-“rhodopsin like”, class B-“secretin like”, and class C-“metabotrophic/glutamate/pheromone”, was over 97%. The high success rate suggests that the CD predictor holds very high potential to become a useful tool for understanding the actions of drugs that target GPCRs and designing new medications with fewer side effects and greater efficacy. Keywords: GPCR • rhodopsin like • secretin like • metabotrophic • glutamate • pheromone • evolutionary pharmacology • CD predictor • amino acid composition

I. Introduction One of the largest gene families in the human genome is that encoding the G-protein-coupled receptors (GPCRs), with approximately 450 genes identified to date. GPCRs are plasma membrane receptors, with a trademark of seven-transmembrane helices (Figure 1). They bind to and transduce signals for a huge variety of ligands including neurotransmitters, peptide hormones, growth factors, morphogens, odorants, tastants, photons, and other small molecules. The action mechanism of GPCRs is thru molecules called “second messengers” that relay signals received at receptors on the cell surfacessuch as the arrival of protein hormones, growth factors, etc.sto target molecules in the cytosol and/or nucleus. In addition to the job as relay molecules, second messengers also serve to amplify the strength of the signal. Being the largest family of cell surface receptors, GPCRs are a pharmacologically important protein family; pathways involving these receptors are the targets of hundreds of drugs, including antihistamines, neuroleptics, antidepressants, and antihypertensives. GPCRs also mediate the actions of certain medications used to treat disorders as diverse as cardiovascular disease, drug dependency, and mental illness.1 The functions of many of GPCRs are unknown, and determining their ligands and signaling pathways is both timeconsuming and costly. This difficulty has motivated and challenged the development of a computational method which can predict the classification of the families and subfamilies of GPCRs based on their primary sequences so as to help us * To whom correspondence should be addressed. E-mail: kchou@ san.rr.com 10.1021/pr050087t CCC: $30.25

 2005 American Chemical Society

Figure 1. Schematic representation of a GPCR with a trademark of seven-transmembrane helices, depicted as cylinders and connected by alternating cytoplasmic and extracellular hydrophilic loops. The 7-helix bundle thus formed has a central pore on its extracellular surface. The red entity located in the central pore represents a ligand messenger.

classify drugs, a technique which might be called “evolutionary pharmacology”. Actually, a statistical analysis has been performed for 566 GPCRs within the rhodopsin-like family classified into 7 subfamily classes: (1) adrenoceptor, (2) chemokine, (3) dopamine, (4) neuropeptide, (5) olfactory type, (6) rhodopsin, and (7) serotonin. Each of the 7 subtypes contains at least more than 30 sequences. The results thus obtained were quite encouragJournal of Proteome Research 2005, 4, 1413-1418

1413

Published on Web 06/02/2005

research articles

Chou

Figure 2. Schematic drawing to show three different main families of GPCRs: (a) class A-rhodopsin like, (b) class B-secretin like, and (c) class C-metabotrophic/glutamate/pheromone.

ing.2 The present study was initiated in an attempt to extend the statistical analysis from predicting the subfamily classification limited within only a special main family to predicting the classification among several different main families of GPCRs.

II. Materials and Method The proteins used for this study were collected from the GPCRDB (G Protein-Coupled Receptor Data Base).3,4 where GPCRs are classified into the following 6 main families: class A-rhodopsin like; class B-secretin like; class C-metabotrophic/ glutamate/pheromone; class D-fungal pheromone; class E-cAMP receptors; and class F-Frizzled/Smoothened family. The sequences of proteins in GPCRDB are derived from the SWISS-PROT and TREMBL Data Banks.5 All of the incomplete sequences that only contained fragments of the receptors were removed. Meanwhile, the NRDB program6 was used to check that none of the sequences was identical to any of others in the data set. Next, those families that contain too few sequences to have any statistical significance were dropped for further consideration. The remaining families obtained through such a screening procedure are as follows: (1) class A-rhodopsin like (Figure 2a); (2) class B-secretin like (Figure 2b); and (3) class C-metabotrophic/glutamate/pheromone (Figure 2c). They each contain at least more than 50 sequences. Listed in Table 1 are the accession numbers of the 1238 GPCRs, of which 1103 are of class A, 84 of class B, and 51 of class C. The accession number rather than the SWISS-PROT name is used here because the accession number is more stable for representing a unique protein sequence. It is instructive to conduct an analysis of the sequence identity for the proteins in a same family subset. The sequence identity percentage between two protein sequences is defined as follows. Suppose one sequence is N1 residues long and the other N2 residues long (N1 g N2), and the maximum number of residues matched by sliding one sequence along the other is M. The sequence identity percentage between the two sequences is defined as (M/N1)%. The treatment for gaps is according to ref 7. The sequence matches performed between all members in each subset of Table 1 have indicated that the average sequence identity percentages for classes A, B, and C are 18.05%, 22.67%, and 26.94% with a standard deviation of 8.43%, 17.44%, and 15.66%, respectively. These numbers indicate that the majority of pairs in each of the subsets concerned have low relative sequence identities. The amino acid composition8 is used to represent the sample of GPCR, and the CD (Covariant Discriminant) predictor adopted to perform the prediction of the GPCR families. For readers’ convenience, a brief introduction of the CD predictor is given below. For the details about the predictor and its development, refer to a series of previous papers.9-14 Suppose 1414

Journal of Proteome Research • Vol. 4, No. 4, 2005

the GPCRs in classes A, B, and C are categorized into classes 1, 2, and 3, respectively. Thus, class 1 contains only GPCRs rhodopsin like, class 2 only secretin like, class 3 only metabotrophic/glutamate/pheromone. Suppose the kth GPCR in the class m is represented by the following vector

[] ak,1m

Rkm )

ak,2m

(1)

l ak,20m

where ak,1m, ak,2m, ..., ak,20m are the amino acid-composition8-10 for the kth GPCR of class m, and nm the total number of GPCRs in class m. The standard vector for class m is defined by10

[] a jm 1

R hm )

a jm 2 , l m a j 20

(m ) 1, 2, ..., µ)

(2)

where a jm i )

1

nm

∑a

nm k)1

m k,i,

(i ) 1, 2, ..., 20)

(3)

Suppose R is a query GPCR whose family class is to be identified. It can also be represented by a point or vector in the 20-D (dimensional) space with the components of (a1, a2, m ..., a20), where ai has the same meaning as ak,i of eq 1 but is associated with the receptor R instead of Rm k . The scale in measuring the difference between the query receptor R and the norm R h m of class m is by the following covariant discriminant function, as defined by Chou et al.:12 ∆(R,R h m) ) DM2(R,R h m) + ln|Bm|, (m ) 1, 2, ..., µ)

(4)

DM2(R,R h m) ) (R - R h m)TBm-1(R - R h m)

(5)

where

is the squared Mahalanobis distance,10,15,16 T is the transposition operator, while |Bm| and Bm-1 are respectively the determinant and inverse matrix of Bm, which is the covariance matrix for class m and given by

[

bm 1,1

Bm )

bm 1,2

‚‚‚ bm 1,20

bm bm ‚‚‚ bm 2,1 2,2 2,20 l l ‚‚‚ l m m bm 20,1 b20,2 ‚‚‚ b20,20

]

(6)

research articles

GPCR Class Prediction Table 1. List of the Accession Numbers for the 1238 GPCRs Classified into Three Families

Journal of Proteome Research • Vol. 4, No. 4, 2005 1415

research articles

Chou

Table 1 (Continued)

where the matrix elements are given by m bi,j )

1

nm

∑[a

nm - 1 k)1

m k,i

m -a jm jm i ][ak,j - a j ], (i, j ) 1, 2, ..., 20) (7)

According to the principle of similarity, the smaller the differ1416

Journal of Proteome Research • Vol. 4, No. 4, 2005

ence between the query receptor R and the norm of class m, the higher the likelihood that receptor R belongs to class m. Accordingly, the identification rule can be formulated as follows ∆(R,R h Λ) ) Min{∆(R,R h 1), ∆(R,R h 2), ..., ∆(R,R h µ)}

(8)

where Λ can be 1, 2, ..., or µ, and the operator Min means

research articles

GPCR Class Prediction Table 2. Success Rates in Identifying the Main Families of GPCRs Class A Rhodopsin like

1092/1103 ) 99.00% 1092/1103 ) 99.00% 84/84 ) 100% 83/84 ) 98.81%

Class B Secretin like

Class C Metabotrophic/ glutamate/pheromone

testa

Re-substitution 83/84 ) 98.81% 51/51 ) 100% Jack-knife testa 74/84 ) 88.10% 40/51 ) 78.43% Random re-substitution testb 84/84 ) 100% 51/51 ) 100% Random jack-knife testb 79/84 ) 94.05% 42/51 ) 82.35%

Overall

1226/1238 ) 99.03% 1206/1238 ) 97.42% 219/219 ) 100% 204/219 ) 93.15%

a Prediction was made on the data set given in Table 1. The CD predictor (see eqs 1-8) was used to perform the prediction. b Prediction was made for the data set that consists of 84 class A GPCRs randomly picked from the 1103 class A GPCRs of Table 1, as well as its 84 class B and 51 class C GPCRs. See the above footnote for further explanation.

Table 3. List of the Accession Numbers for the 84 GPCRs Randomly Picked from the 1103 GPCRs of Class A in Table 1

taking the minimal one among those in the brackets. The value of the superscript Λ derived from eq 8 indicates which class the query receptor R belongs to. If there is a tie case, then Λ is not uniquely determined, but that did not happen for the datasets studied here. Before using the above equations for practical calculations, the following point should be realized. Owing to the normalization condition imposed by the definition of amino acidcomposition, of the 20 components in eq 1, only 19 are independent,10 and hence the covariance matrix Bm as defined by eq 7 must be a singular one.9 This would lead the Mahalanobis distance defined by eq 5 and the covariant discriminant function by eq 4 to be divergent and meaningless. To cope with such a situation, the dimension-reducing procedure10 was adopted in practical calculations; i.e., instead of 20-D space, a receptor is defined in a (20-1)-D space by leaving out one of its 20 amino acid components. The remaining 19 components would be completely independent, thereby the corresponding covariance matrix Bm being no longer singular. In other words, the Mahalanobis distance (eq 5) and the covariant discriminant function (eq 4) based on such a 19-D space can be uniquely defined without any trouble. However, a question might be raised: which one of the 20 components can be left out? The answer is: any one of them. Will it lead to a different predicted result by leaving out a different component? The answer is: no. According to the invariance theorem given in Appendix A of Chou,10 both the value of the Mahalanobis distance and the value of the determinant of Bm will remain exactly the same regardless of which one of the 20 components is left out. Accordingly, the final value of the covariant discriminant function (eq 4) can be uniquely defined through such a dimension-reducing procedure.

III. Results and Discussion Now let us use the predictor formulated in the last section to examine the success rates in identifying the family classes for the 1238 GPCRs listed in Table 1. The examinations were

conducted by two different approaches, the re-substitution test and the jack-knife test, as reported below. Re-Substitution Test. The re-substitution test is used to examine the self-consistency of a prediction method. During the re-substitution process, the class for each of the GPCRs in the data set is in turn identified using the rule parameters derived from the same data set, the so-called training data set. The success rates thus obtained for the 1238 GPCRs in Table 1 are given in Table 2, from which we can see that the overall success rate is 99.03%, indicating that the current prediction method is highly self-consistent. It should be pointed out that during the above process the rule parameters derived from the training data set include the information of the query GPCR later plugged back for testing itself. This will certainly enhance the success rate because the same samples are used to derive the rule parameters and to test themselves. Therefore, the success rate thus obtained merely represents some sort of optimal estimation.9,10,14,17 Nevertheless, the re-substitution test is useful because it reflects the self-consistency. A predictor with a poor self-consistency certainly cannot be deemed as a good one. However, to really reflect the power of a predictor, a cross-validation test by excluding the tested samples from the training data set is needed. Jack-knife Test. Three different examinations are often used in statistical prediction for cross-validation. They are independent data set test, sub-sampling test, and jack-knife test. Of these three, however, the jack-knife test is deemed as the most rigorous and objective one [see ref 18 for a comprehensive discussion about this, and ref 19 for the underlying mathematical principle]. For the cross-validation by jack-knifing, each of the proteins in the data set is in turn singled out as a tested sample and all the rule-parameters are calculated based on the remaining proteins without including the one being identified. Therefore, both the training data set and testing data set during the jack-knifing process are actually open, and a sample will in turn move from one to the other. The results of jack-knife test thus obtained for the 1238 GPCRs are also given in Table Journal of Proteome Research • Vol. 4, No. 4, 2005 1417

research articles 2, from which we can see the following. As expected, the success identification rates by jack-knife test are lower than those by the re-substitution test, particularly for the smallest subset of class C. This is because the cluster-tolerant capacity20 for small subsets is usually low. Therefore, the information loss due to jack-knifing will have a greater impact on the small subsets than the large ones. Nevertheless, the overall success rate by jack-knife test for the data set of 1238 GPCRs is still as high as 97.42%. It is anticipated that the success rate for identifying class C of GPCRs can be enhanced by adding into its subset more newly found proteins that have been found belonging to this class. Because the number of samples for class A is overwhelming in the current dataset, the following argument might be brought up against the above high success rates. If the identification was made by always choosing class A, the overall success rate thus obtained could also reach as high as 1103/1238 ) 89.10%, implying that the high overall success rate was resulted from the extreme uneven distribution of the dataset investigated but not the power of the predictor. To address this problem, a sizereduced subset for class A was formed by randomly picking 84 samples from the 1,103 GPCRs of the original subset for class A. The accession numbers for the 84 GPCRs thus generated are given in Table 3. Now let us use the data of class A in Table 3 as well as the data for class B and class C in Table 1 to form a new working dataset, which contains 84 + 84 + 51 ) 219 GPCRs. For such a new dataset, the same jack-knife test was performed and the results are also given in Table 2, from which we can see that the overall success rate could reach over 93%. In contrast to this, if the identification was made by blindly sticking to class A, the overall success rate would be only 84/219 ) 38.35%, which is more than 50% lower than that by the CD predictor.

IV. Conclusion GPCRs are the largest family of cell surface receptors, accounting for >1% of the human genome. They play a key role in cellular signaling networks that regulate various physiological processes. The critical physiological roles of GPCRs have made them among the most frequent targets of therapeutic drugs. Many efforts in pharmaceutical research have been aimed at understanding their structure and function. Unfortunately, so far, very few GPCR structures have been determined by either X-ray or NMR technique because it is difficult to crystallize them and most of GPCRs will not dissolve in normal solvents. In contrast, more than thousand GPCR sequences are known, and much more are expected to come soon. To timely use the uncharacterized GPCRs for drug discovery and basic research, it is highly desirable to develop a computational method that can rapidly and accurately predict the classification of their families. It is difficult to predict the classification of GPCRs by using the conventional sequence alignment approach owing to the nature of their high divergence. To tackle the sequence divergent problem, the CD predictor is introduced that is formulated based on a series of discrete numbers such as those constituting the amino acid composition.

1418

Journal of Proteome Research • Vol. 4, No. 4, 2005

Chou

The high success rates obtained in this study imply that the families of GPCRs are closely correlated with their amino acid composition, and that the CD predictor is quite promising and may become a powerful tool in this area.

Acknowledgment. The author wishes to thank the three anonymous referees whose constructive suggestions are very helpful in strengthening the presentation of this paper. References (1) Roth, B. L.; Willins, D. L.; Kroeze, W. K. G protein-coupled receptor (GPCR) trafficking in the central nervous system: relevance for drugs of abuse. Drug Alcohol Depend. 1998, 51, 7385. (2) Chou, K. C.; Elrod, D. W. Bioinformatical analysis of G-proteincoupled receptors. J. Proteome Res. 2002, 1, 429-433. (3) Horn, F.; Weare, J.; Beukers, M. W.; Horsch, S.; Bairoch, A.; Chen, W.; Edvardsen, O.; Campagne, F.; Vriend, G. GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res. 1998, 26, 275-279. (4) Horn, F.; Vriend, G.; Cohen, F. E. Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 2001, 29, 346-349. (5) Bairoch, A.; Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res. 2000, 25, 31-36. (6) Gish, W. http://blast.wustl.edu/pub/nrdb/1999. (7) Thompson, J. D.; Higgins, D. G.; Gibson, T. J. CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice Nucleic Acids Res. 1994, 22, 46734680. (8) Chou, J. J.; Zhang, C. T. A joint prediction of the folding types of 1490 human proteins from their genetic codons J. Theor. Biol 1993, 161, 251-262. (9) Chou, K. C.; Zhang, C. T. Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem. 1994, 269, 22014-22020. (10) Chou, K. C. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 1995, 21, 319-344. (11) Liu, W.; Chou, K. C. Prediction of protein structural classes by modified Mahalanobis discriminant algorithm. J. Protein Chem. 1998, 17, 209-217. (12) Chou, K. C.; Liu, W.; Maggiora, G. M.; Zhang, C. T. Prediction and classification of domain structural classes. Proteins: Struct. Funct. Genet. 1998, 31, 97-103. (13) Chou, K. C.; Elrod, D. W. Protein subcellular location prediction. Protein Eng. 1999, 12, 107-118. (14) Zhou, G. P. An intriguing controversy over protein structural class prediction. J. Protein Chem. 1998, 17, 729-738. (15) Mahalanobis, P. C. On the generalized distance in statistics. Proc. Natl. Inst. Sci. India 1936, 2, 49-55. (16) Pillai, K. C. S. In Encyclopedia of Statistical Sciences; Kotz, S., Johnson, N. L., Eds.; John Wiley & Sons: New York, 1985; Vol. 5, pp 176-181. (17) Zhou, G. P.; Assa-Munt, N. Some insights into protein structural class prediction. Proteins: Struct. Funct. Genet. 2001, 44, 57-59. (18) Chou, K. C.; Zhang, C. T. Review: Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 1995, 30, 275-349. (19) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis; Academic Press: London, 1979; Chapter 11 Discriminant Analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 cluster analysis; pp. 322-381. (20) Chou, K. C. A key driving force in determination of protein structural classes. Biochem. Biophys. Res. Commun. 1999, 264, 216-224.

PR050087T