Possible Random Mechanism in Crystallization ... - ACS Publications

ARTICLE pubs.acs.org/crystal

Possible Random Mechanism in Crystallization Evidenced in Proteins from Plasmodium falciparum

Shaomin Yan† and Guang Wu*,†,‡

† State Key Laboratory of Non-food Biomass Enzyme Technology, National Engineering Research Center for Non-food Biorefinery, Guangxi Key Laboratory of Biorefinery, Guangxi Academy of Sciences, 98 Daling Road, Nanning, Guangxi, 530007, China
‡ DreamSciTech Consulting, 301, Building 12, Nanyou A-zone, Jiannan Road, Shenzhen, Guangdong, 518054, China

ABSTRACT: Protein crystallization is notoriously difficult because many known and unknown factors contribute to the process to varying degrees; the known factors are generally defined as amino acid attributes related to physicochemical properties. However, one might wonder whether randomness plays a role in crystallization. Two questions follow: (i) can we determine the role of randomness in the crystallization process, and (ii) does more or less randomness in a protein make it easier to crystallize? In this study, we used logistic regression and a neural network with each of 535 amino acid attributes, including randomness attributes, to fit the success rate of crystallization of 118 proteins from Plasmodium falciparum; we then developed a predictive model to check the role of random attributes in predicting protein crystallization, and we compared crystallized with noncrystallized proteins in terms of random amino acid attributes. The results provide three pieces of evidence that randomness plays a role in protein crystallization and that a protein with more randomness is more easily crystallized.

Received: June 28, 2011. Revised: July 25, 2011. Published: August 3, 2011. © 2011 American Chemical Society

■ INTRODUCTION

Knowledge of protein three-dimensional (3D) structures is vitally important for rational drug design. Although X-ray crystallography is a powerful tool for determining protein 3D structures, it is time-consuming and expensive. In particular, not all proteins can be successfully crystallized. For example, membrane proteins are very difficult to crystallize, and most of them will not dissolve in normal solvents. Therefore, very few membrane protein structures have been determined so far. Although NMR is indeed a very powerful tool for determining the 3D structures of membrane proteins,1–8 it is also time-consuming and costly. To acquire structural information in a timely manner, one has to resort to various structural bioinformatics tools.9–11 Unfortunately, the number of templates for developing high-quality 3D structures by structural bioinformatics is very limited.9 In view of this, it would be very useful to develop an effective method to enhance the success rate of crystallizing proteins. The present study aims to address and analyze this important problem.

Many efforts have been made to determine and define the factors that affect protein crystallization. Some studies, for example, explore the relationship between physicochemical properties and crystallization success,12,13 while others try to build a predictive model relating various conceivable factors to crystallization success.14,15 So far, the factors defined as playing roles in crystallization are almost exclusively related to amino acid physicochemical properties and other well-documented properties, but we wonder whether randomness could also play a role. We consider this not only because there are many unknown factors that could be regarded as random factors in the crystallization process, but also because a protein should itself contain some randomness, which can be represented as an amino acid attribute.

In this context, two questions raised here are (i) can we determine the role of randomness in the crystallization process, and (ii) does more or less randomness in a protein make it easier to crystallize? To answer the first question, a simple approach is to use a model to fit the success rate of protein crystallization with a different amino acid attribute each time, random attributes included, and to compare whether the fits with random attributes are equal to or better than the fits with other attributes. To answer the second question, we need a measure, itself an attribute of the protein, that determines the degree of randomness in the protein in question; we can then compare this measure between proteins that can and cannot be crystallized. Technically, such studies should not include too many samples from various species and sources because mixed samples might offset random factors in ways we may never know. In this study, we attempt to address

dx.doi.org/10.1021/cg200814k | Cryst. Growth Des. 2011, 11, 4198–4204

Crystal Growth & Design


Table 1. Difference between Nonrandom and Random Amino Acid Attributes (a)

amino   number      BAEK050101        BAEK050101 × no.    distribution probability   future composition, %
acid    P1    P2    P1       P2       P1       P2         P1       P2                P1      P2
A        2     3    0.0166   0.0166   0.0332   0.0498     0.1111   0.1111            3.25    3.42
R        3     7    0.0762   0.0762   0.2286   0.5334     0.1071   0.1071            7.22    7.70
N       17    10    0.0786   0.0786   1.3362   0.7860     0.0381   0.0381            6.70    6.30
D        4     4    0.1278   0.1278   0.5112   0.5112     0.1875   0.1875            4.45    3.65
C        8     1    0.5724   0.5724   4.5792   0.5724     1.0000   1.0000            2.70    2.47
E        7     4    0.1051   0.1051   0.7357   0.4204     0.5625   0.5625            3.27    3.39
Q        2     8    0.1794   0.1794   0.3588   1.4352     0.1682   0.1682            3.94    4.40
G        5     5    0.0442   0.0442   0.2210   0.2210     0.1920   0.1920            4.13    3.28
H        8     9    0.1643   0.1643   1.3144   1.4787     0.0007   0.0007            3.60    4.48
I       14    10    0.2758   0.2758   3.8612   2.7580     0.1524   0.1524            7.48    6.90
L        9    12    0.2523   0.2523   2.2707   3.0276     0.1163   0.1163            8.26   10.07
K       16    15    0.2134   0.2134   3.4144   3.2010     0.0374   0.0374            6.08    5.23
M        5     4    0.0197   0.0197   0.0985   0.0788     0.1406   0.1406            2.33    2.18
F        5    11    0.3561   0.3561   1.7805   3.9171     0.1077   0.1077            3.39    3.51
P        3     1    0.4188   0.4188   1.2564   0.4188     1.0000   1.0000            3.25    4.33
S        8     8    0.1629   0.1629   1.3032   1.3032     0.2523   0.2523            7.85    7.45
T        4    10    0.0701   0.0701   0.2804   0.7010     0.1905   0.1905            6.62    7.07
W        3     0    0.3836   0.3836   1.1508   0.0000     0.0000   0.0000            1.11    0.67
Y        4     7    0.2500   0.2500   1.0000   1.7500     0.2142   0.2142            4.13    3.79
V        4     2    0.1782   0.1782   0.7128   0.3564     0.5000   0.5000            5.34    5.00

(a) The accession numbers are Pfal002849AAA for protein 1 (P1) and Pfal008572AAA for protein 2 (P2). BAEK050101 is an attribute obtained from AAIndex;29 no. is the number of the given type of amino acid. The random attribute, amino acid distribution probability, was computed according to the equation $\frac{r!}{q_0! \times q_1! \times \cdots \times q_n!} \times \frac{r!}{r_1! \times r_2! \times \cdots \times r_n!} \times n^{-r}$, where ! is the factorial function, r is the number of a type of amino acid, q is the number of partitions with the same number of amino acids, and n is the number of partitions in the protein for a type of amino acid.24 The computation can be found on the web site.28 The future amino acid composition can be calculated on the web site.61
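As a rough check of the equation in the Table 1 footnote, the sketch below evaluates it for a given occupancy pattern. The function name `distribution_probability` and the list-of-counts input are illustrative assumptions, as is the choice of splitting the protein into as many partitions as there are residues of the given type (n = r). Under that assumption the sketch returns 0.1111 for three residues that all fall in one partition and 0.5625 for four residues distributed as 2, 1, 1, 0, values that also appear in Table 1.

```python
from collections import Counter
from math import factorial, prod


def distribution_probability(occupancy):
    """Evaluate r!/(q0!...qn!) * r!/(r1!...rn!) * n**(-r) for one amino acid type.

    occupancy: hypothetical input, the number of residues of this type
    found in each partition of the protein (one entry per partition).
    """
    r = sum(occupancy)   # total residues of this amino acid type
    n = len(occupancy)   # number of partitions (assumed equal to r here)
    if r == 0:
        # Convention matching the 0.0000 listed for an absent amino acid.
        return 0.0
    # q_k = number of partitions that hold exactly k residues.
    q = Counter(occupancy)
    first = factorial(r) / prod(factorial(count) for count in q.values())
    second = factorial(r) / prod(factorial(k) for k in occupancy)
    return first * second / n ** r


# Three residues, all in one of three partitions.
p_clustered = distribution_probability([3, 0, 0])
# Four residues spread as 2, 1, 1, 0 over four partitions.
p_spread = distribution_probability([2, 1, 1, 0])
```

The more evenly spread pattern gets the higher probability, which is the sense in which this attribute measures how "expected" an observed distribution is.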

these two questions based on the crystallization of proteins from Plasmodium falciparum.

According to a recent comprehensive review,16 establishing a really useful statistical predictor or model for a protein system requires the following procedures: (i) construct or select a valid benchmark data set to train and test the predictor; (ii) formulate the statistical samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web server for the predictor that is accessible to the public. Below, we describe how we deal with these steps with respect to protein crystallization.

■ EXPERIMENTAL SECTION

Data. A total of 118 proteins from Plasmodium falciparum were found in TargetDB17 under the purified criterion before 2010, of which 42 were found under the crystallized criterion before 2010. These two criteria were previously used in developing a web server for crystallization prediction.15

Random Attributes. For our purpose, we need attributes that represent randomness in a protein in the same way that many well-documented attributes, for example the physicochemical properties of amino acids, represent a particular aspect of an amino acid or protein. Over the past decade, we have developed three random attributes representing different aspects of amino acids and of the protein as a whole and have published around a hundred papers using these three attributes.18–22

The first attribute of randomness is the amino acid pair predictability based on permutation. For instance, there are 16 lysines (K), 17 asparagines (N), and 8 histidines (H) in the Pfal007889AAH protein. According to the permutation, the amino acid pair KN would appear twice (16/131 × 17/130 × 130 = 2.076), and there are indeed two KNs in this protein, so the pair KN is predictable. However, the amino acid pair HH would not be expected to appear (8/131 × 7/130 × 130 = 0.427), but it appears five times in reality, so the pair HH is unpredictable. In this way, all amino acid pairs are classified as predictable or unpredictable, which gives a measure to represent a protein; its computation can be found on the web site.23 For the Pfal007889AAH protein, the predictable and unpredictable portions are 74.25% and 25.75%.

The second attribute of randomness is the amino acid distribution probability, based on the occupancy of subpopulations and partitions. It is derived from statistical mechanics, which classifies the distribution of elementary particles in energy states according to three assumptions with respect to whether or not each particle and energy state is distinguished, that is, the Maxwell–Boltzmann, Fermi–Dirac, and Bose–Einstein assumptions.24 Two worked examples are listed in columns 8 and 9 of Table 1.

The third attribute of randomness is the amino acid future composition, based on the relationship between RNA codons and their translated amino acids (Supporting Information),25–27 whose computation can be found on the web site.28 Two worked examples are listed in columns 10 and 11 of Table 1. Naturally, each of the above attributes describes a different aspect of randomness in a protein.

Nonrandom Attributes. The nonrandom attributes are well-documented and can be found in AAIndex,29 which contains 540-plus amino acid properties representing various aspects of amino acids, such


Table 2. Results from Logistic Regression

no.   amino acid attribute       log likelihood   2*[LL(N)-LL(0)]   true positive   false positive   false negative   true negative   accuracy %   sensitivity %   specificity %
1     future composition, %      -50.78           52.08             25.20           16.80            16.80            59.20           0.72         0.60            0.78
1     current composition, %     -51.13           51.38             24.83           17.17            17.17            58.83           0.71         0.59            0.77
1     BAEK050101                 -53.23           47.18             24.08           17.92            17.92            58.08           0.70         0.57            0.76
439   amino acid number          -53.31           47.03             24.05           17.95            17.95            58.05           0.70         0.57            0.76
13    RADA880106                 -53.46           46.73             23.99           18.02            18.02            57.99           0.69         0.57            0.76
1     MIYS990101                 -53.47           46.71             24.01           17.99            17.99            58.01           0.70         0.57            0.76
2     QIAN880126                 -53.67           46.31             23.90           18.10            18.10            57.90           0.69         0.57            0.76
2     NADH010105                 -53.70           46.25             23.95           18.06            18.06            57.95           0.69         0.57            0.76
1     RICJ880115                 -53.73           46.20             23.90           18.10            18.10            57.90           0.69         0.57            0.76
1     RICJ880114                 -53.82           46.00             23.85           18.15            18.15            57.85           0.69         0.57            0.76
1     QIAN880130                 -53.85           45.95             23.87           18.13            18.13            57.87           0.69         0.57            0.76
4     JOND750101                 -54.06           45.54             23.82           18.18            18.18            57.82           0.69         0.57            0.76
2     QIAN880132                 -54.31           45.02             23.74           18.27            18.27            57.74           0.69         0.57            0.76
1     MIYS990102                 -54.32           45.00             23.68           18.32            18.32            57.68           0.69         0.56            0.76
6     FAUJ880108                 -54.34           44.97             23.69           18.31            18.31            57.69           0.69         0.56            0.76
1     GEIM800109                 -54.64           44.36             24.05           17.95            17.95            58.05           0.70         0.57            0.76
3     RADA880104                 -54.66           44.34             23.67           18.33            18.33            57.67           0.69         0.56            0.76
2     QIAN880117                 -54.94           43.77             24.05           17.95            17.95            58.05           0.70         0.57            0.76
1     ROBB760107                 -55.61           42.43             23.30           18.71            18.71            57.30           0.68         0.55            0.75
3     QIAN880107                 -55.94           41.77             23.00           19.00            19.00            57.00           0.68         0.55            0.75
6     TANS770102                 -56.37           40.92             22.80           19.20            19.20            56.80           0.67         0.54            0.75
1     RADA880107                 -56.43           40.78             23.15           18.85            18.85            57.15           0.68         0.55            0.75
1     ZASB820101                 -56.61           40.44             23.19           18.81            18.81            57.19           0.68         0.55            0.75
1     KHAG800101                 -56.82           40.01             23.22           18.78            18.78            57.22           0.68         0.55            0.75
1     TANS770108                 -56.88           39.89             23.17           18.83            18.83            57.17           0.68         0.55            0.75
1     MAXF760105                 -56.98           39.68             23.00           19.00            19.00            57.00           0.68         0.55            0.75
1     MAXF760104                 -57.35           38.95             22.47           19.53            19.53            56.47           0.67         0.53            0.74
1     ISOY800107                 -57.46           38.73             22.44           19.56            19.56            56.44           0.67         0.53            0.74
1     BUNA790103                 -57.86           37.93             22.32           19.68            19.68            56.32           0.67         0.53            0.74
1     VELV850101                 -58.02           37.60             22.29           19.71            19.71            56.29           0.67         0.53            0.74
1     COSI940101                 -58.03           37.59             22.29           19.71            19.71            56.29           0.67         0.53            0.74
1     QIAN880113                 -58.37           36.90             22.47           19.71            19.71            56.29           0.67         0.53            0.74
8     YUTK870103                 -58.42           36.81             22.24           19.76            19.76            56.24           0.67         0.53            0.74
1     CHOC760104                 -58.68           36.29             22.19           19.81            19.81            56.19           0.66         0.53            0.74
1     TANS770107                 -59.21           35.23             21.72           20.28            20.28            55.72           0.66         0.52            0.73
3     QIAN880129                 -59.85           33.95             21.87           20.13            20.13            55.87           0.66         0.52            0.74
1     PALJ810113                 -60.50           32.65             21.40           20.60            20.60            55.40           0.65         0.51            0.73
1     GOLD730101                 -61.22           31.20             21.27           20.73            20.73            55.27           0.65         0.51            0.73
1     CHAM820102                 -62.16           29.32             21.17           20.83            20.83            55.17           0.65         0.50            0.73
1     distribution probability   -62.62           28.41             20.27           21.73            21.73            54.27           0.63         0.48            0.71
1     CHAM830104                 -63.01           27.62             20.68           21.32            21.32            54.68           0.64         0.49            0.72
1     RACS820103                 -63.61           26.43             20.20           21.80            21.80            54.20           0.63         0.48            0.71
1     CHAM830108                 -65.37           22.91             19.60           22.40            22.40            53.60           0.62         0.47            0.71
1     GRAR740101                 -65.62           22.40             19.60           22.40            22.40            53.60           0.62         0.47            0.71
1     FAUJ880109                 -67.11           19.43             19.14           22.86            22.86            53.14           0.61         0.46            0.70
1     CHAM830105                 -68.52           16.61             18.24           23.76            23.76            52.24           0.60         0.43            0.69
1     FAUJ880110                 -68.87           15.91             18.56           23.44            23.44            52.56           0.60         0.44            0.69
1     NOZY710101                 -70.75           12.15             17.59           24.41            24.41            51.59           0.59         0.42            0.68
1     RACS820109                 -71.37           10.90             17.18           24.82            24.82            51.18           0.58         0.41            0.67
1     MITS020101                 -72.04            9.57             17.05           24.95            24.95            51.05           0.58         0.41            0.67
1     VENT840101                 -73.27            7.11             16.54           25.46            25.46            50.54           0.57         0.39            0.67
1     CHAM830107                 -73.72            6.20             16.20           25.80            25.80            50.20           0.56         0.39            0.66
1     FAUJ880111                 -74.60            4.45             15.89           26.11            26.11            49.89           0.56         0.38            0.66
1     KLEP840101                 -75.09            3.46             15.71           26.29            26.29            49.71           0.55         0.37            0.65
1     FAUJ880112                 -76.23            1.20             15.21           26.79            26.79            49.21           0.55         0.36            0.65
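The 2*[LL(N)-LL(0)] column of Table 2 can be cross-checked from the class counts alone. The sketch below is only an illustration under two assumptions that the paper does not state explicitly: that LL(0) is the log likelihood of an intercept-only model for 42 crystallized proteins out of 118, and that the log likelihood column holds negative values (so the first row is read as -50.78). The function names are hypothetical.

```python
from math import log


def null_log_likelihood(n_positive, n_total):
    """Log likelihood of an intercept-only (constant-probability) binary model."""
    p = n_positive / n_total
    n_negative = n_total - n_positive
    return n_positive * log(p) + n_negative * log(1 - p)


def likelihood_ratio_statistic(ll_model, ll_null):
    """The 2*[LL(N) - LL(0)] statistic used to rank attributes."""
    return 2 * (ll_model - ll_null)


# 42 of the 118 purified proteins were crystallized.
ll0 = null_log_likelihood(42, 118)
# First row of Table 2 (future composition), read as LL(N) = -50.78.
g_future = likelihood_ratio_statistic(-50.78, ll0)
```

Under these assumptions the statistic for the first row comes out within rounding of the tabulated 52.08, and the same holds for the second row (-51.13 against 51.38).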


Figure 1. Accuracy, sensitivity, and specificity of crystallization classified by logistic regression with 535 amino acid attributes.

as different composition, physicochemical properties, spatial properties,30 electronic properties,31 hydrophobic properties,32 predictors for secondary structures,33 and so on. Because some of these classical attributes lack values for several types of amino acids, 531 amino acid attributes are usable.
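The permutation arithmetic behind the amino acid pair predictability (the KN and HH examples for Pfal007889AAH under Random Attributes) can be sketched as follows. The function names are hypothetical, and the rule used in `is_predictable` (the rounded expected count equals the observed count) is an assumption inferred from the two worked examples rather than a definition taken from the paper.

```python
def expected_pair_count(first, second, counts, length):
    """Expected number of adjacent pairs first-second under random permutation.

    counts: residue counts per amino acid type; length: protein length.
    """
    n_first = counts[first]
    # For a pair of identical residues, one residue is already used up.
    n_second = counts[second] - 1 if first == second else counts[second]
    n_pairs = length - 1  # a protein of N residues has N - 1 adjacent pairs
    return n_first / length * n_second / n_pairs * n_pairs


def is_predictable(expected, observed):
    """Assumed rule: a pair is predictable if the rounded expectation is observed."""
    return round(expected) == observed


# Counts quoted in the text for Pfal007889AAH (131 residues, 130 pairs).
counts = {"K": 16, "N": 17, "H": 8}
kn = expected_pair_count("K", "N", counts, 131)   # about 2.076
hh = expected_pair_count("H", "H", counts, 131)   # about 0.427
kn_predictable = is_predictable(kn, 2)            # KN observed twice
hh_predictable = is_predictable(hh, 5)            # HH observed five times
```

With the counts from the text, KN is classified as predictable and HH as unpredictable, in line with the worked example.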

Methods to Determine Role of Randomness in Crystallization. Because whether a protein can be crystallized is a yes/no event that can be associated with various attributes of the 20 types of amino acids, this relationship can be modeled with either logistic regression or a neural network.34 We use logistic regression and then a neural network to model the relationship between amino acid attributes and the success rate of crystallization because the two work by different mechanisms.

Statistics. The fitted and predicted results are classified as true positive, true negative, false positive, and false negative. Accuracy, sensitivity, and specificity are then calculated as (true positive + true negative)/(true positive + false positive + true negative + false negative) × 100, (true positive)/(true positive + false negative) × 100, and (true negative)/(true negative + false positive) × 100, respectively. SYSTAT35 is used to perform logistic regression. MatLab36 is used to perform the neural network modeling. The Student's t-test and the Mann–Whitney U-test are used for parametric and nonparametric comparisons, respectively.
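The three statistics defined above translate directly into code. The sketch below uses a hypothetical helper, `classification_metrics`, and feeds it the counts from the first row of Table 2 (future composition) as an example; note that Table 2 reports the same quantities as fractions (0.72, 0.60, 0.78) rather than multiplied by 100.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, sensitivity, and specificity as percentages."""
    accuracy = (tp + tn) / (tp + fp + tn + fn) * 100
    sensitivity = tp / (tp + fn) * 100
    specificity = tn / (tn + fp) * 100
    return accuracy, sensitivity, specificity


# Counts from the first row of Table 2: TP, FP, FN, TN.
acc, sens, spec = classification_metrics(25.20, 16.80, 16.80, 59.20)
```

For this row the sketch gives roughly 71.5, 60.0, and 77.9, which round to the tabulated 0.72, 0.60, and 0.78.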

■ RESULTS AND DISCUSSION

In principle, we have no way to exclude random factors from the crystallization process, so we need ways to measure randomness in proteins. Hence, we first look at the difference between random and nonrandom attributes, which lies not only in their underlying mechanisms but also in their values: random attributes are calculated case by case, whereas nonrandom attributes were measured in the past. Intuitively, we would expect an amino acid to have different values at different positions in a protein because an amino acid plays different roles at different positions. Nonrandom attributes do not reflect this important aspect because they are constants for each type of amino acid, whereas the random attributes do reflect it because they vary with amino acid position, protein length, neighboring amino acids, etc. Table 1 shows this difference with two proteins as worked examples. As can be seen, the linker index (BAEK050101) is an

Figure 2. Accuracy, sensitivity, and specificity of modeling the relationship between the amino acid attribute and the success rate of crystallization using a 20-1 feedforward backpropagation neural network.

amino acid attribute that predicts protein interdomain linker regions using sequence alone.37 It is constant for a given type of amino acid no matter where the amino acid is located in a protein, which amino acids neighbor it, etc. However, this attribute can differ for the same type of amino acid if we weight it by multiplying it by the number of that amino acid in the protein (columns 6 and 7 in Table 1). On the other hand, the amino acid distribution probability (columns 8 and 9 in Table 1) and future composition (columns 10 and 11 in Table 1) are different for different proteins.

After clarifying the difference between random and nonrandom attributes, we began to model the relationship between each amino acid attribute and crystallization. Table 2 details the results from logistic regression, in which each time we used one amino acid attribute, weighted by amino acid composition as shown in Table 1, to represent the 20 types of amino acids and regressed the success rate of crystallization on it. Many amino acid attributes give the same regression results; for example, the fifth row shows that 439 amino acid attributes from AAIndex29 have the same result, while 13 amino acid attributes give the same result in the sixth row. The fourth column in Table 2 ranks the amino acid attributes, where we can see that a random attribute, the future composition, provides the highest accuracy, sensitivity, and specificity. Figure 1 further elaborates the grouped results in Table 2 with respect to accuracy, sensitivity, and specificity, where we can see the positions of two random attributes compared with nonrandom attributes. Table 2 and Figure 1 together are the first piece of evidence supporting our hypothesis that randomness


Figure 3. Accuracy, sensitivity, and specificity obtained from delete-1 jackknife validation with 535 amino acid attributes.

might play a role in protein crystallization (for details, see Supporting Information).

At this stage, one might feel that the relationship defined using logistic regression is somewhat simple, although logistic regression has been used in crystallization prediction.4 We therefore use a more powerful and sophisticated tool, the neural network, to model the relationship between amino acid attributes and crystallization, because a neural network can in principle account for various implicit or explicit relationships.36,38 Figure 2 illustrates the results obtained from modeling the relationship between amino acid attribute and crystallization. As in logistic regression, each time we used one amino acid attribute, weighted by amino acid composition, to represent the 20 types of amino acids and modeled the success rate of crystallization. As can be seen in Figure 2, the random attribute amino acid distribution probability works best in terms of accuracy and sensitivity. Meanwhile, many amino acid attributes have similar or identical accuracy, sensitivity, and specificity. Figure 2 provides the second piece of evidence that randomness plays a role in protein crystallization (for details, see Supporting Information).

Here, one might wonder why the results from logistic regression and the neural network favor different random attributes: the future amino acid composition is the best in logistic regression, whereas the amino acid distribution probability is the best in the neural network. This difference suggests that different random attributes should be modeled with different mechanisms. The above two pieces of evidence are obtained through modeling; this is necessary because each attribute should have


its own particular mechanism of contributing to the crystallization process. For example, some can be classified as cause–consequence relationships, whereas others can be classified as phenomenological relationships. Therefore, the use of a mechanism-free model for fitting is crucial for confirming the random effect on crystallization. In this sense, the neural network model is well suited for this purpose.

As the above methods reveal the role of randomness in crystallization, the next step is to use the neural network as a predictive model, dividing the 118 proteins into two data sets, one for model development and the other for model validation. Figure 3 shows the model validation using the delete-1 jackknife, which is considered to be the most robust validation method.16 As can be seen, the crystallization predictions using random attributes are workable with respect to accuracy, sensitivity, and specificity. This provides the third piece of evidence that randomness plays a role in the crystallization process (for details, see Supporting Information).

A statistical issue is the cutoff threshold. To avoid homology bias and remove redundant sequences from a benchmark data set, a cutoff threshold of 25% is commonly imposed, as indicated in the literature,39–43 to exclude proteins that have 25% or greater sequence identity to any other protein in the same subset. In this study, however, we did not use such a stringent criterion, not only because the currently available data do not allow it (otherwise the numbers of proteins in some subsets would be too few to have statistical significance) but also because our data are not mixed with proteins from species other than Plasmodium falciparum.
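The delete-1 jackknife used for Figure 3 can be sketched in a few lines: each protein is held out in turn, a model is trained on the remaining proteins, and the held-out protein is predicted. The code below is a minimal illustration only. The function names, the toy one-feature data, and the nearest-centroid learner are all assumptions chosen to keep the example dependency-free; the learner is a stand-in and not the 20-1 neural network used in this study.

```python
def jackknife_predictions(samples, labels, fit, predict):
    """Delete-1 jackknife: predict each sample with a model trained on all others."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_centroids(xs, ys):
    """Stand-in learner: the mean feature vector of each class."""
    centroids = {}
    for label in set(ys):
        rows = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


# Hypothetical toy data: one feature per sample, 1 = crystallized, 0 = not.
toy_x = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_pred = jackknife_predictions(toy_x, toy_y, fit_centroids, predict_centroid)
```

Because every sample is held out exactly once, the outcome is fully determined by the data set and the learner, with no random split involved.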
Still, in statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: the independent data set test, the subsampling test, and the jackknife test.44 Of the three, the jackknife test is deemed the most objective.45 The reasons are as follows. (i) For the independent data set test, although all the proteins used to test the predictor are outside the training data set used to train it, so as to exclude the "memory" effect or bias, the way the independent proteins are selected to test the predictor could be quite arbitrary unless the number of independent proteins is sufficiently large. This kind of arbitrariness might result in completely different conclusions.44 (ii) For the subsampling test, the actual approach used in the literature is usually 5-, 7-, or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark data set is extremely large even for a very simple data set, as elucidated in the literature44 and demonstrated by eqs 28–30 in the literature.16 Therefore, in any actual subsampling cross-validation test, only a very small fraction of the possible selections is taken into account. Since different selections will always yield different results even for the same benchmark data set and the same predictor, the subsampling test cannot avoid arbitrariness either. A test method unable to yield a unique outcome can certainly not be considered a good one. (iii) In the jackknife test, all the proteins in the benchmark data set are singled out one by one and tested by the predictor trained on the remaining protein samples. During the process of jackknifing, both the training data set and the testing data set are actually open, and each protein sample is moved in turn between the two. The jackknife test can exclude the "memory" effect. Also, the arbitrariness problem mentioned above for the independent data set test and the subsampling test can be avoided because the outcome obtained by the jackknife


Figure 4. Predictable portions of amino acid pairs in crystallized and noncrystallized proteins grouped according to protein length. The number of each group is shown above the bar. * and ** indicate the statistically significant difference compared with the crystallized group at p < 0.05 and p < 0.01 level (the Student’s t-test).

cross-validation is always unique for a given benchmark data set. Accordingly, the jackknife test has been increasingly and widely used by investigators with strong mathematical backgrounds to examine the quality of various predictors,41,42,46–59 and it is also used in this study to examine the accuracy of our prediction method. Since user-friendly and publicly accessible web servers represent the future direction for developing practically more useful models, simulated methods, or predictors,60 we also list the web servers for the methods presented in this paper.23,28,61

We believe the above three pieces of evidence are valid because they are obtained through very labor-intensive work; that is, we used each of 535 amino acid attributes individually throughout the three modeling processes. This differs from other studies developing predictive models, where several hundred amino acid attributes were used together to correlate with crystallization and the numerous attributes could overshadow the tiny effect of random attributes. The above three results answer our first question: randomness does play a role in the crystallization process of proteins.

Now let us look at the second question: does more or less randomness in a protein make it easier to crystallize? To answer it, we use the random attribute amino acid pair predictability as a measure of the portion of randomness in a protein. As explained in the Experimental Section, a protein has a certain percentage of predictable/unpredictable amino acid pairs, and we therefore compare the predictable portion in proteins with respect to whether they are crystallized. Previous studies indicated that crystallization correlates strongly with protein length;12 thus we grouped proteins according to their length. Figure 4 shows the statistical comparison in this regard.
As can be seen in Figure 4, there are statistically significant differences between crystallized and noncrystallized proteins in the 78–160 and 321–400 length groups; that is, the predictable portion in crystallized proteins is larger than that in noncrystallized proteins. Although this trend has yet to reach statistical significance in the other groups, it is a meaningful trend; that is, the more random the protein structure is, the easier the crystallization is. When a protein has more randomness, it would reach crystallization more easily and faster. To further confirm this conclusion, Figure 5 shows the following analysis: (i) we divided the 118 proteins into crystallized and noncrystallized groups; (ii) we did the statistical comparison for both

Figure 5. Accuracy of fitting and delete-1 jackknife validation in crystallized proteins (upper panel) and noncrystallized proteins (middle panel), and statistical comparison of their predictable portion of amino acid pairs (lower panel, the Mann–Whitney U-test). The data are presented as medians with interquartile ranges. The dotted lines indicate the cutoff point for separating the groups with low and high accuracy.

neural network fitting and delete-1 jackknife validation; and (iii) as the classification included true positive, true negative, false positive, and false negative, we compared high accuracy versus low accuracy. What Figure 5 suggests is that the more randomness a protein has, the more accurate the crystallization prediction is. Nevertheless, there are certainly other unknown factors that play roles in the crystallization of proteins, and great efforts are needed in future research.
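For readers unfamiliar with the nonparametric comparison used for Figure 5, the Mann–Whitney U statistic can be computed by counting, over all cross-group pairs, how often a value from one group exceeds a value from the other. The sketch below shows only the statistic, not the p-value; the function name and the numbers are hypothetical and are not data from this study.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: pairs with a > b count 1, ties count 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical values standing in for predictable portions in two groups.
high_accuracy = [3, 4, 5]
low_accuracy = [1, 2, 3]
u_high = mann_whitney_u(high_accuracy, low_accuracy)
u_low = mann_whitney_u(low_accuracy, high_accuracy)
```

The two U values always sum to the product of the group sizes, which gives a quick sanity check on any implementation.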

■ CONCLUSION

We used both logistic regression and a neural network to fit the relationship between amino acid attributes and protein crystallization, and the neural network to develop the predictive model; together these provide three clear pieces of evidence that randomness plays a role in protein crystallization. Further statistical analysis suggests that a protein that has more randomness is more easily crystallized.

■ ASSOCIATED CONTENT

Supporting Information. All 118 proteins from Plasmodium falciparum and their accuracy, sensitivity, and specificity of crystallization obtained from logistic regression, fitting by neural network, and delete-1 jackknife validation for each amino acid attribute. This information is available free of charge via the Internet at http://pubs.acs.org/.

■ AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

■ ACKNOWLEDGMENT
This study was partly supported by the Guangxi Science Foundation (07-109-001A, 08-115-011, 0907016, 09322001, 10-046-06, 11-031-11, 2010GXNSFF013003, and 2010GXNSFA013046).

■ REFERENCES
(1) Berardi, M. J.; Shih, W. M.; Harrison, S. C.; Chou, J. J. Nature 2011, 476, 109–113.
(2) Schnell, J. R.; Chou, J. J. Nature 2008, 451, 591–595.
(3) Oxenoid, K.; Chou, J. J. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 10870–10875.
(4) Call, M. E.; Wucherpfennig, K. W.; Chou, J. J. Nat. Immunol. 2010, 11, 1023–1029.
(5) Pielak, R. M.; Chou, J. J. Biochem. Biophys. Res. Commun. 2010, 401, 58–63.
(6) Pielak, R. M.; Jason, R.; Schnell, J. R.; Chou, J. J. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 7379–7384.
(7) Wang, J.; Pielak, R. M.; McClintock, M. A.; Chou, J. J. Nat. Struct. Mol. Biol. 2009, 16, 1267–1271.
(8) Pielak, R. M.; Chou, J. J. Biochim. Biophys. Acta 2011, 1808, 522–529.
(9) Chou, K. C. Curr. Med. Chem. 2004, 11, 2105–2134.
(10) Chou, K. C. Biochem. Biophys. Res. Commun. 2004, 316, 636–642.
(11) Chou, K. C. Biochem. Biophys. Res. Commun. 2004, 319, 433–438.
(12) Canaves, J. M.; Page, R.; Wilson, I. A.; Stevens, R. C. J. Mol. Biol. 2004, 344, 977–991.
(13) Kantardjieff, K. A.; Rupp, B. Bioinformatics 2004, 20, 2162–2168.
(14) Overton, I. M.; Padovani, G.; Girolami, M. A.; Barton, G. J. Bioinformatics 2008, 24, 901–907.
(15) Slabinski, L.; Jaroszewski, L.; Rychlewski, L.; Wilson, I. A.; Lesley, S. A.; Godzik, A. Bioinformatics 2007, 23, 3403–3405.
(16) Chou, K. C. J. Theor. Biol. 2011, 273, 236–247.
(17) Chen, L.; Oughtred, R.; Berman, H. M.; Westbrook, J. Bioinformatics 2004, 20, 2860–2862.
(18) Wu, G.; Yan, S. Mol. Biol. Today 2002, 3, 55–69.
(19) Wu, G.; Yan, S. Protein Pept. Lett. 2006, 13, 377–384.
(20) Wu, G.; Yan, S. Acta Pharmacol. Sin. 2006, 27, 513–526.
(21) Wu, G.; Yan, S. Lecture Notes on Computational Mutation; Nova Science Publishers: New York, 2008.
(22) Yan, S.; Wu, G. J. Guangxi Acad. Sci. 2010, 17, 145–150.
(23) http://www.dreamscitech.com/Web-Based-Computation/AA.htm, 2011.
(24) Feller, W. An Introduction to Probability Theory and Its Applications, 3rd ed.; Wiley: New York, 1968; Vol. I.
(25) Wu, G.; Yan, S. Biochem. Biophys. Res. Commun. 2005, 337, 692–700.
(26) Wu, G.; Yan, S. Protein Pept. Lett. 2006, 13, 601–609.
(27) Wu, G.; Yan, S. In Leading-Edge Messenger RNA Research Communications; Ostrovskiy, M. H., Ed.; Nova Science Publishers: New York, 2007; Chapter 3, pp 47–65.
(28) http://www.dreamscitech.com/Web-Based-Computation/FAAC.htm, 2011.


(29) Kawashima, S.; Pokarowski, P.; Pokarowska, M.; Kolinski, A.; Katayama, T.; Kanehisa, M. Nucleic Acids Res. 2008, 36, D202–D205.
(30) Darby, N. J.; Creighton, T. E. J. Mol. Biol. 1993, 232, 873–896.
(31) Dwyer, D. S. BMC Chem. Biol. 2005, 5, 2.
(32) Cooper, G. M. The Cell: A Molecular Approach; ASM Press: Washington, DC, 2004; p 51.
(33) Chou, P. Y.; Fasman, G. D. Adv. Enzymol. Relat. Subj. Biochem. 1978, 47, 45–148.
(34) Delucas, L. J.; Hamrick, D.; Cosenza, L.; Nagy, L.; McCombs, D.; Bray, T.; Chait, A.; Stoops, B.; Belgovskiy, A.; William Wilson, W.; Parham, M.; Chernov, N. Prog. Biophys. Mol. Biol. 2005, 88, 285–309.
(35) SYSTAT for Windows, version 11.00.01; SYSTAT Software Inc.: Chicago, IL, 2004.
(36) MatLab-The Language of Technical Computing, version 6.1.0.450, release 12.1; MathWorks Inc.: Natick, MA, 1984–2001.
(37) Bae, K.; Mallick, B. K.; Elsik, C. G. J. Bioinformatics 2005, 21, 2264–2270.
(38) Demuth, H.; Beale, M. Neural Network Toolbox for Use with MatLab. User’s Guide, version 4; MathWorks Inc.: Natick, MA, 2001.
(39) Chou, K. C.; Shen, H. B. Analyt. Biochem. 2007, 370, 1–16.
(40) Xiao, X.; Wu, Z. C.; Chou, K. C. J. Theor. Biol. 2011, 284, 42–51.
(41) Xiao, X.; Wu, Z. C.; Chou, K. C. PLoS ONE 2011, 6, e20592.
(42) Chou, K. C.; Shen, H. B. PLoS ONE 2010, 5, e11335.
(43) Chou, K. C.; Wu, Z. C.; Xiao, X. PLoS ONE 2011, 6, e18258.
(44) Chou, K. C.; Zhang, C. T. Crit. Rev. Biochem. Mol. Biol. 1995, 30, 275–349.
(45) Chou, K. C.; Shen, H. B. Nat. Prot. 2008, 3, 153–162.
(46) Esmaeili, M.; Mohabatkar, H.; Mohsenzadeh, S. J. Theor. Biol. 2010, 263, 203–209.
(47) Chen, C.; Chen, L.; Zou, X.; Cai, P. Protein Pept. Lett. 2009, 16, 27–31.
(48) Georgiou, D. N.; Karakasidis, T. E.; Nieto, J. J.; Torres, A. J. Theor. Biol. 2009, 257, 17–26.
(49) Chou, K. C. Proteins: Struct., Funct., Genet. 2001, 43, 246–255.
(50) Ding, H.; Luo, L.; Lin, H. Protein Pept. Lett. 2009, 16, 351–355.
(51) Lin, H. J. Theor. Biol. 2008, 252, 350–356.
(52) Gu, Q.; Ding, Y. S.; Zhang, T. L. Protein Pept. Lett. 2010, 17, 559–567.
(53) Mohabatkar, H. Protein Pept. Lett. 2010, 17, 1207–1214.
(54) Mohabatkar, H.; Mohammad, B. M.; Esmaeili, A. J. Theor. Biol. 2011, 281, 18–23.
(55) Yu, L.; Guo, Y.; Li, Y.; Li, G.; Li, M.; Luo, J.; Xiong, W.; Qin, W. J. Theor. Biol. 2010, 267, 1–6.
(56) Zeng, Y. H.; Guo, Y. Z.; Xiao, R. Q.; Yang, L.; Yu, L. Z.; Li, M. L. J. Theor. Biol. 2009, 259, 366–372.
(57) Qiu, J. D.; Huang, J. H.; Shi, S. P.; Liang, R. P. Protein Pept. Lett. 2010, 17, 715–722.
(58) Zhang, G. Y.; Fang, B. S. J. Theor. Biol. 2008, 253, 310–315.
(59) Zhou, X. B.; Chen, C.; Li, Z. C.; Zou, X. Y. J. Theor. Biol. 2007, 248, 546–551.
(60) Chou, K. C.; Shen, H. B. Nat. Sci. 2009, 2, 63–92.
(61) http://www.dreamscitech.com/Web-Based-Computation/ADP.htm, 2011.


dx.doi.org/10.1021/cg200814k | Cryst. Growth Des. 2011, 11, 4198–4204