Chapter 2
Reliability of X-ray Crystallographic Structures
Downloaded by UNIV OF CALIFORNIA SAN DIEGO on February 27, 2016 | http://pubs.acs.org Publication Date: December 14, 1994 | doi: 10.1021/bk-1994-0576.ch002
Richard Bott Department of Enzymology, Genencor International, 180 Kimball Way, South San Francisco, CA 94080
The process of X-ray crystallographic structure determination requires growing crystals and visualizing these structures in electron density maps. This process introduces some limitations in the reliability of these structures. The resolution limit, crystallographic R-factor and atomic temperature factors provide important clues in assessing the confidence a researcher can have in the coordinates of any particular segment in the protein structure. The structure of subtilisin, determined independently in a number of laboratories from crystals grown under different conditions, provides a means to obtain an empirical estimate of error. In the case of subtilisin BPN', the structure of which has been determined at resolutions ranging from 1.8 to 1.6 Å with R-factors ranging from 0.18 to 0.14, there is very good agreement between structures determined from different crystal forms. This observation suggests that the individual models are fair representations of the structure of the enzyme in solution.
0097-6156/94/0576-0018$08.00/0 © 1994 American Chemical Society
In Molecular Modeling; Kumosinski, Thomas F., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1994.
The three-dimensional structures of macromolecules, predominantly proteins, determined using X-ray crystallography now figure prominently in all biochemical textbooks. The number of crystallographic structures available in the Brookhaven Protein Data Bank is increasing exponentially, driven by commercial as well as academic interests to determine protein structures to serve as the basis for rational drug design and protein engineering. This growth is also a consequence of the increasing number of active crystallography laboratories and the continuous improvement in crystallographic techniques and hardware. All biochemists are familiar with the quaternary changes that occur to effect the allosteric regulation of oxygen uptake by haemoglobin in the bloodstream and the proposed mechanism of action for hydrolysis of peptide bonds in serine proteases. Both are derived from numerous crystallographic studies of the protein structures. Knowledge of the three-dimensional structure of a biological macromolecule is an essential component in dissecting the relationship between the structure of a particular molecule and its functionality in medical or chemical applications. A better understanding of the limitations placed on the structure by the process of X-ray crystallographic structure determination should be of value to all who use the results.
It is not possible to reduce all of X-ray crystallography into a short article, and it is not my intention to duplicate the many far more thorough developments of the theory and formulas of X-ray crystallography (1-3). Instead I will attempt to present an overview of the process of structure determination, namely crystallization, data collection, interpretation of electron density maps and refinement of model coordinates. I will also show how each of these factors provides clues regarding the overall confidence levels for the coordinates. The aim is to provide a general audience with a better insight into the evaluation of the reliability of coordinates based on the experimental results of the structure determination. I will rely heavily on the work done by a number of laboratories on subtilisin, where data on how the coordinates might compare with the molecule in "solution" are available.
Crystallographic Structure Determination
The first step is to grow crystals of the protein or macromolecule of interest. Crystals are necessary because of the radiation employed: X-rays have a wavelength comparable to the distance between covalently bonded atoms (0.15 nm), but they are also ionizing radiation, creating free radicals that randomly break covalent bonds throughout the molecule and degrade the protein sample. The crystals serve as diffraction gratings in which X-rays scattered from all molecules in the crystal interfere constructively. This gives diffraction patterns such as the one presented in Figure 1.
The spots in this figure represent as much as a 10 billionfold amplification of the X-rays scattered from a single molecule, depending on the number of repeats or "unit cells" present in the crystal. The diffraction will be limited by the degree of long-range periodicity that exists from molecule to molecule throughout the crystal. This limit is referred to as the "resolution". The time required to grow crystals of sufficient size for data collection varies from a matter of hours, in the best cases with the purest and most readily crystallized proteins, to many months; crystals grown for a year or more have in some cases been used. The conditions giving the crystals are selected empirically, often with the pH near the isoelectric point of the enzyme. The selection is opportunistic and may not coincide with the conditions under which the enzyme is optimally active. In some cases, the crystallization conditions can induce interesting alterations, such as the elevation of the pK of the catalytic histidine in α-lytic protease, which gave rise to the suggestion that there was no hydrogen bond between the catalytic serine and histidine in the active site of that enzyme (4). This result can now be reconciled with NMR studies, where the finding is that the hydrogen bond does form in solution but that under the conditions of the crystallization the histidine is in fact protonated at pH 7.9.
Once the crystals are grown, diffraction data can be collected. The diffraction data arise from the coherent interference of all molecules in the crystal.
Figure 1. X-ray diffraction pattern showing the OhOl projection of data from subtilisin BPN' grown at pH 6.0.
The diffraction data form a radially distributed array of diffraction maxima whose positions are determined by the lattice repeat and any internal symmetry found in the crystal. The data can most easily be visualized as a series of spots that are uniformly spaced but differ in intensity. The variation in the intensities of the spots (diffraction maxima) in Figure 1 arises from the interference of X-rays scattered from atoms within a single molecule and, as a consequence, contains information on the three-dimensional arrangement of the atoms within the protein molecule. This information can then be used to visualize the scattering matter, the electrons, by calculating an electron density map. The "data" used in this calculation include the intensities measured directly from the crystal along with "phases" that go into a Fourier summation. The crystal is a periodic function of matter in three dimensions, and a Fourier summation can represent any periodic function as the sum of a large number of wave functions. The amplitude of each wave in the summation is derived from a particular diffraction maximum, and each wave is shifted by a phase displacement. A much more thorough discussion can be found in Ref 1. While this is not intuitive, unlike the correspondence of a particular atom to a peak in NMR, the net effect is an electron density map that carries quite detailed information about the protein in the crystal. The "detail" depends on the resolution limit, which sets how fine the waves included in the Fourier summation can be. With single-counter diffractometers, "high" resolution usually meant 2.8 Å; with the appearance of area detectors capable of collecting data much faster and with greater sensitivity, high resolution now means 2.0-1.6 Å or better. Even at 2.0-1.6 Å resolution it is not possible to differentiate covalently bonded atoms, but side chain and main chain moieties can often be recognized, as shown in Figure 2.
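The role of amplitudes, phases and the resolution cutoff in a Fourier summation can be illustrated with a one-dimensional toy model. The sketch below is not taken from any crystallographic program: the 20 Å repeat, the two point-like atoms 1.5 Å apart, and the function names `structure_factor` and `density` are assumptions chosen purely for illustration. Each wave in the sum takes its height from an amplitude and its position from a phase, and the sum is truncated at index `h_max`, which corresponds to a resolution limit of cell / h_max.

```python
import cmath
import math


def structure_factor(h, atoms, cell):
    """Structure factor F(h) of point-like atoms given as (position, weight) pairs."""
    return sum(w * cmath.exp(2j * math.pi * h * x / cell) for x, w in atoms)


def density(x, atoms, cell, h_max):
    """Density at position x from a Fourier summation truncated at |h| <= h_max."""
    total = structure_factor(0, atoms, cell).real
    for h in range(1, h_max + 1):
        f = structure_factor(h, atoms, cell)
        amplitude, phase = abs(f), cmath.phase(f)
        # The +h and -h terms combine into a single cosine wave whose height is
        # set by the amplitude and whose position is set by the phase.
        total += 2.0 * amplitude * math.cos(2.0 * math.pi * h * x / cell - phase)
    return total / cell


# Hypothetical toy crystal: 20 Å repeat, two equal atoms 1.5 Å apart.
CELL = 20.0
ATOMS = [(9.25, 6.0), (10.75, 6.0)]

if __name__ == "__main__":
    for h_max in (7, 20):
        on_atom = density(9.25, ATOMS, CELL, h_max)
        between = density(10.0, ATOMS, CELL, h_max)
        print(f"resolution {CELL / h_max:.2f} Å: "
              f"density on atom {on_atom:.2f}, between atoms {between:.2f}")
```

In this toy model, truncating the sum at about 2.9 Å leaves more density between the two atoms than on either of them, so the pair merges into a single peak, whereas at 1.0 Å the density is highest on the atoms themselves. Real atoms are not points and their scattering falls off with angle, so the cutoffs here should not be read as quantitative.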
Figure 2 shows the electron density of a tyrosine side chain. The ring of this side chain appears as a doughnut-shaped density with the expected vacant central cavity. Resolution limits provide useful information regarding the relative rigidity of a particular molecule. Structures determined from crystals that diffract to high resolution will have better overall reliability. The diffraction data represent the ensemble diffraction from all molecules in the crystal. It follows that the electron density map calculated from these data will represent an "averaged" electron density of all molecules in the crystal. If some residues vary from molecule to molecule within the crystal, then these residues or portions thereof may have an average electron density falling below the cutoff level for "noise". These segments will not be easily fitted and will remain ambiguous. In fact there is a continual gradation of relative rigidity within the molecule; in general the atoms in the interior are more rigid than those on the surface.
The electron density map serves as a guide for fitting an atomic model, having the expected amino acid sequence, to best match this electron density. Once a model has been fitted, it is possible to use the coordinates of the model to calculate the diffraction pattern arising from this model. This calculated diffraction pattern, with a calculated intensity for each diffraction maximum, can be compared with the observed diffraction pattern from the crystal. The model is imperfect, a single
model representing an averaged structure, and incomplete, lacking the disordered bulk solvent molecules and hydrogen atoms (due to the restricted resolution). With the exception of some of the highest resolution structures, corresponding to 1.0 Å or better, it is usually not possible to "resolve" hydrogen atoms. Despite these limitations there is good overall agreement between the relative magnitudes of the calculated and observed diffraction intensities, suggesting that the model still fairly represents the molecule in the crystal.
It is possible to refine the model by adjusting the atomic coordinates to minimize the difference between the calculated and observed diffraction patterns. For reasonably complete data sets at 2.0-1.6 Å resolution, the ratio of observations to variables can be 2-3:1. Added to these are the quasi-observations, or additional restraints, placed on the molecule by the known bond lengths, angles and planarity and by the stereochemistry of amino acids and polypeptide linkages. The refinement procedure not only gives better agreement between the data calculated from the model and the observed data; when the refined model is used to calculate a new electron density map, the resulting map usually reveals errors in the model and additional features such as tightly bound solvent molecules. There are algorithms to estimate the overall mean error (5,6), and these rely on the same agreement between the observed and calculated structure factors that is optimized in refinement. The structure factors are proportional to the square root of the intensity of the diffraction maxima. The agreement between the observed and calculated diffraction intensities is measured by an R factor defined by equation 1.
$$R = \frac{\sum_{h} \bigl|\, |F_o(h)| - |F_c(h)| \,\bigr|}{\sum_{h} |F_o(h)|} \tag{1}$$
In this equation, |F_o(h)| and |F_c(h)| are the absolute values of the observed and calculated structure factors for the spot having indices (h), corresponding to h, k, l, each index in turn representing the integral number of spacings along the axial repeats from the center of the diffraction pattern in Figure 1. The estimated mean error is directly related to the value of the R-factor for the higher resolution data: the lower the R-factor, the lower the estimated mean error will be. If more than one crystallographic model is available, a researcher would usually be better off choosing the one determined at high resolution and giving the lowest R-factor for high resolution data. For high resolution (1.8-1.6 Å) X-ray crystallographic structures giving R-factors in the range of 0.18-0.14, the methods give estimates of the mean error on the order of 0.2 Å.
In the course of the refinement an atomic temperature factor is also refined for each atom in the model. This atomic temperature factor is a measure of the relative vibrational motion of the different atoms in the molecule. Internal atoms in the rigid center of the molecule will have lower temperature factors than atoms on the surface. It would be expected that, in any coordinate set, the electron density for atoms having high temperature factors will be more diffuse and fitted with lower confidence than that of the well ordered atoms having low temperature factors.
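As a concrete illustration of the R factor of equation 1, the following minimal sketch computes it for a handful of reflections. It assumes the conventional form, in which the summed absolute differences between observed and calculated amplitudes are divided by the summed observed amplitudes, and it assumes both sets are already on a common scale. The function names and the numbers are invented for illustration and are not data from any crystal.

```python
import math


def amplitudes_from_intensities(intensities):
    """Structure factor amplitudes, taken as the square root of each intensity."""
    return [math.sqrt(i) for i in intensities]


def r_factor(f_obs, f_calc):
    """Conventional R factor: sum of | |Fo| - |Fc| | divided by sum of |Fo|."""
    if len(f_obs) != len(f_calc):
        raise ValueError("observed and calculated lists must have equal length")
    numerator = sum(abs(abs(fo) - abs(fc)) for fo, fc in zip(f_obs, f_calc))
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("observed amplitudes sum to zero")
    return numerator / denominator


# Made-up numbers for illustration only; not data from any crystal.
observed_intensities = [100.0, 400.0, 225.0, 64.0]
calculated_amplitudes = [9.0, 21.0, 15.0, 9.0]

f_obs = amplitudes_from_intensities(observed_intensities)
example_r = r_factor(f_obs, calculated_amplitudes)
```

Because every reflection contributes to a single ratio, the R factor is a global measure; it says nothing about which parts of the model are responsible for the residual disagreement.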
Accuracy and Reliability of Crystallographic Models: An Example
It is important to consider how well a crystal structure matches the same molecule in solution. To address this question one would ideally need crystal structures of the same molecule determined independently, from crystals grown under different conditions. By comparing the crystal structures of the same enzyme determined from different crystal forms, grown under different conditions, we can infer how each might differ from the solution structure of the enzyme. Because the different crystal lattice interactions would be expected to distort the structure in different ways, structures determined in different crystal forms should diverge more from each other than either does from the solution structure. The results from subtilisin meet most of these conditions and offer a chance to answer this question. In part because of the high commercial interest in subtilisin engineering, the number of independent crystallographic models is large, with most now available in the Protein Data Bank. The coordinate sets have been determined from crystallographic data collected from crystals that differ in space group and in the conditions for crystal growth, ranging in pH from 6.0 to 9.5 and in precipitant from ammonium sulfate to acetone. Compared here are the structure of the native enzyme, crystallized at pH 6.0 from ammonium sulfate, and the three-dimensional structure of a variant enzyme having six site-specific substitutions: Met 50 replaced by Phe, Asn 76 replaced by Asp, Gly 169 replaced by Ala, Gln 206 replaced by Cys, Tyr 217 replaced by Lys and Asn 218 replaced by Ser (7). We have seen that site-specific substitutions produce limited and often very subtle conformational changes that are localized at the site of substitution (8), and thus we expect the perturbations resulting from these additional differences to be minimal.
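A comparison of two coordinate sets of the same enzyme is commonly summarized as a root mean square (rms) distance between equivalent atoms. The sketch below shows only that final arithmetic. It assumes, as a simplification, that the two coordinate sets have already been superposed in a common frame and that atoms are listed in the same order; the function names and the toy coordinates are hypothetical.

```python
import math


def atom_distances(coords_a, coords_b):
    """Distance between each pair of equivalent atoms, in the units of the input."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must list the same atoms")
    return [math.dist(a, b) for a, b in zip(coords_a, coords_b)]


def rms_deviation(coords_a, coords_b):
    """Root mean square distance over all equivalent atom pairs."""
    distances = atom_distances(coords_a, coords_b)
    if not distances:
        raise ValueError("no atoms to compare")
    return math.sqrt(sum(d * d for d in distances) / len(distances))


# Toy coordinates in Å, invented for illustration (already superposed).
model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.3), (1.0, 0.4, 0.0)]
example_rms = rms_deviation(model_a, model_b)
```

The individual distances returned by `atom_distances` are the quantities that matter when the variation is later examined atom by atom rather than as one overall figure.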
The structures of native and variant subtilisin BPN', determined under the different extremes of pH 6.0 versus 9.0 and with different precipitating agents, ammonium sulfate versus acetone, are still very similar in their overall folding (Figure 3) and in the conformation of the side chain atoms. An example of this is presented for atoms in the active site (Figure 4). The overall rms variation for Cα atoms is 0.38 Å. The algorithms to estimate the overall mean error give estimates of 0.2 Å. While knowing the mean error is quite useful, it would be more useful to establish the variation about this mean, with the aim of establishing criteria for what constitutes a statistically significant difference between these structures. These criteria would also provide a confidence level that scientists using these models could use to set thresholds for docking algorithms that take into account the overall deformation a model might be allowed to undergo before a significant disruption in the structure occurs.
We have employed an empirical method for estimating the mean error for atoms having the same degree of thermal mobility, as measured by the refined crystallographic temperature factor. This method relies on the finding that a linear relationship exists between the logarithm of the distance between equivalent atoms and the temperature factors of those atoms (9). The crystallographic temperature factors reflect the relative reliability of any particular
Figure 2. Stereographic representation of electron density map at 1.6 Å resolution. The side chain of Tyr 208 of subtilisin from Bacillus lentus is superposed on the electron density.
Figure 3. Stereographic view comparing the Cα trace of native subtilisin BPN' determined at pH 6.0, 40% sat. ammonium sulfate (thick lines) and subtilisin BPN' variant (M50F/N76D/G169A/Q206C/Y217K/N218S) at pH 9.0, 55% acetone (thin lines).
segment of a protein. An equation can then be determined for the line (Figure 5) representing the mean error, along with the root mean squared deviation from this line. Residues in the two structures compared that depart significantly from the mean error can then be identified. This method determines, by linear least squares fit, the equations for the mean error as a function of the crystallographic temperature factor B and the root mean squared deviation. This is analogous to the standard analysis of variance conducted for the mean error between equivalent atoms in two crystallographic coordinate sets. Using this method we have identified regions that represent potentially significant differences between different site-specific variants and the native enzyme (10). In this paper the interest is not in the particular differences but rather in the threshold of difference between equivalent atoms that represents a significant departure from random variation. This method provides an empirical estimate of the error that would be found in the structure in the same or different crystal forms. The error boundaries should highlight the confidence levels appropriate under these circumstances.
The values obtained from the equations derived by comparing crystal structures in the same and in different crystal forms are given in Table I. In this table, we compare the values of the mean error and the variation of the error about the mean in two pairwise comparisons. The first comparison is between the native enzyme and a variant having a single site-specific substitution, phenylalanine replacing methionine at position 222 (M222F). The second is between the native enzyme determined from a crystal grown at pH 6.0 from ammonium sulfate and a variant with six site-specific substitutions (M50F/N76D/G169A/Q206C/Y217K/N218S) determined from a crystal grown at pH 9.0 from acetone. All structures in these comparisons have a mean temperature factor of 10.
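The fitting step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of Ref 9: it assumes the natural logarithm, an ordinary least squares line through ln(distance) against B, a root mean squared residual taken about that line, and a significance threshold formed by adding that residual in logarithmic space. The function names and the synthetic data are hypothetical.

```python
import math


def fit_log_distance(b_factors, distances):
    """Least squares fit of ln(distance) = intercept + slope * B.

    Returns (intercept, slope, rmse), where rmse is the root mean squared
    residual about the fitted line, in natural-log units.
    """
    n = len(b_factors)
    if n != len(distances) or n < 3:
        raise ValueError("need at least three matched (B, distance) pairs")
    logs = [math.log(d) for d in distances]  # distances must be positive
    mean_b = sum(b_factors) / n
    mean_log = sum(logs) / n
    sxx = sum((b - mean_b) ** 2 for b in b_factors)
    if sxx == 0:
        raise ValueError("all temperature factors are identical")
    sxy = sum((b - mean_b) * (y - mean_log) for b, y in zip(b_factors, logs))
    slope = sxy / sxx
    intercept = mean_log - slope * mean_b
    ss_res = sum((y - (intercept + slope * b)) ** 2
                 for b, y in zip(b_factors, logs))
    rmse = math.sqrt(ss_res / (n - 2))
    return intercept, slope, rmse


def mean_error(b, intercept, slope):
    """Mean error (same units as the distances) predicted at temperature factor b."""
    return math.exp(intercept + slope * b)


def threshold(b, intercept, slope, rmse):
    """Mean error plus one RMSE, formed in log space (an assumption of this sketch)."""
    return math.exp(intercept + slope * b + rmse)


# Synthetic (B, distance in Å) pairs, invented for illustration only.
synthetic_b = [4.0, 6.0, 9.0, 11.0, 14.0, 18.0, 22.0, 27.0]
synthetic_d = [0.10, 0.22, 0.18, 0.35, 0.30, 0.75, 0.60, 1.40]
fit = fit_log_distance(synthetic_b, synthetic_d)
```

With a fit in hand, `mean_error(10, fit[0], fit[1])` and `threshold(10, *fit)` give the two kinds of value tabulated for a chosen temperature factor; an atom pair whose distance exceeds the threshold for its temperature factor would be a candidate for a significant difference.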
The table presents the estimates of the mean error, along with the standard deviation from the mean, for atoms having temperature factors of 5, 10, 15 and 20. These values represent, respectively, atoms that are more rigid than average, average atoms, and atoms that are more, or very much more, disordered than the "average" atoms in these pairwise comparisons. The error between structures in the same crystal lattice might represent the error expected if the structures were re-determined using independent data sets, and so would be representative of experimental error. The error between structures determined in different crystal lattices and grown under different conditions, by contrast, might reflect structures that independently diverge from the solution structure. The divergence between these structures would in fact overestimate the deviation that either structure might have from the structure in solution. The threshold for a significant difference between equivalent atoms having the mean temperature factor of 10.0 ranges from 0.23 to 0.59 Å, depending on whether the coordinate sets were determined in the same crystal form, under similar conditions of pH and precipitating agent, or under significantly different crystallization conditions. In the second case, when comparing subtilisin structures determined at different extremes of pH and precipitant, coordinates for "average", well ordered atoms can vary by up to 0.6 Å; only differences above this threshold may be regarded as significant. This also defines a confidence level for these structures relative to a "solution" structure. It should be noted that for the most
Figure 4. Stereographic comparison of the active site of native subtilisin BPN' determined at pH 6.0, 40% sat. ammonium sulfate (thick lines) and subtilisin BPN' variant (M50F/N76D/G169A/Q206C/Y217K/N218S) at pH 9.0, 55% acetone (thin lines).
Figure 5. Semi-logarithmic plot of distances between equivalent atoms in native subtilisin BPN' from crystals grown at pH 6.0 from ammonium sulfate and a variant subtilisin BPN' (M50F/N76D/G169A/Q206C/Y217K/N218S) grown at pH 9.0 from acetone. Horizontal axis: temperature factor.
Table I. Values of the estimated error and variance for selected temperature factors (in Å)
                      BPN' (pH 6) v BPN' M222S (pH 6)      BPN' (pH 6) v BPN' hextuple* (pH 9)
Temperature factor    mean error    mean error + RMSE      mean error    mean error + RMSE
B = 5                 0.09          0.17                   0.19          0.35
B = 10                0.13          0.23                   0.31          0.59
B = 15                0.17          0.32                   0.53          1.00
B = 20                0.24          0.44                   0.90          1.68
* hextuple variant (M50F/N76D/G169A/Q206C/Y217K/N218S)
flexible regions, having higher temperature factors, the threshold for the confidence level can be expected to be considerably higher. The possibility of this range of variation points to the importance of considering the crystallographic temperature factors in any modeling experiment.
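One way a modeler might apply these thresholds is sketched below. The numbers are the "mean error + RMSE" values as read from Table I; everything else is an assumption of this sketch rather than part of the published method: the dictionary keys, the function names, the log-linear interpolation between tabulated temperature factors (chosen because the underlying fit is linear in the logarithm of the distance), and the refusal to extrapolate outside B = 5 to 20.

```python
import math

# "Mean error + RMSE" thresholds in Å as read from Table I,
# keyed by crystallographic temperature factor.
THRESHOLDS = {
    "same_crystal_form": {5: 0.17, 10: 0.23, 15: 0.32, 20: 0.44},
    "different_crystal_form": {5: 0.35, 10: 0.59, 15: 1.00, 20: 1.68},
}


def significance_threshold(b_factor, comparison="different_crystal_form"):
    """Threshold in Å at b_factor, interpolated log-linearly between table rows."""
    table = THRESHOLDS[comparison]
    knots = sorted(table)
    if not knots[0] <= b_factor <= knots[-1]:
        raise ValueError("temperature factor outside the tabulated range 5-20")
    for low, high in zip(knots, knots[1:]):
        if low <= b_factor <= high:
            fraction = (b_factor - low) / (high - low)
            log_value = ((1.0 - fraction) * math.log(table[low])
                         + fraction * math.log(table[high]))
            return math.exp(log_value)


def is_significant(distance, b_factor, comparison="different_crystal_form"):
    """True if a distance between equivalent atoms exceeds the threshold."""
    return distance > significance_threshold(b_factor, comparison)
```

For example, a 0.5 Å shift of an atom with a temperature factor of 10 would be flagged when the two structures come from the same crystal form but not when they come from different crystal forms, which is the distinction the text draws between experimental error and lattice-dependent divergence.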
Conclusions
The methods of crystallographic analysis are providing higher resolution structures at an accelerating rate as a consequence of improved methodologies and equipment for the acquisition of crystallographic data, faster computers, which have stimulated development of refinement software, and faster molecular graphics to analyze the data. The same technological advances have placed the tools to manipulate these structures in the hands of every biochemist.
For this reason it is equally necessary to disseminate information concerning the intrinsic sources of error and the methods for estimating this error. In addition to this overall error, it should be remembered that the crystallographic temperature factors hold important clues regarding the reliability of specific segments of the structure. Overall, the estimates for the mean error of structures determined at high resolution (2.0 Å or better) range from 0.13 to 0.31 Å, in relatively good agreement with estimates derived by other methods. R-factors, with their associated estimates of the mean error, can be useful guides for selecting model structures if more than one is available. However, the mean error in the example of subtilisin would not be representative of the criterion for significance or of the confidence bounds to be used for evaluating feasibility in modeling. The empirical estimate of the confidence levels outlined in this paper sets 0.6 Å as the threshold for a significant difference between relatively well ordered segments in closely related structures. This threshold varies as a function of the crystallographic temperature factors and can exceed 1.5 Å for segments with high crystallographic temperature factors. Current protein engineering experience suggests that the location and relative flexibility of a segment are of great importance in modeling and relating structure to function.
While this paper has focused on the deficiencies of X-ray crystallographic models, it would be remiss not to note that overall these X-ray crystallographic structures from quite different crystal environments and conditions do share very high similarity. Any of the crystallographic models would constitute an excellent model for subtilisin under the differing conditions and thus would also serve as a reliable model for the molecule in solution.
Literature Cited
1. Eisenberg, D. In The Enzymes; Boyer, P. D., Ed.; Academic Press: New York, NY, 1970; Vol. 1, pp 1-89.
2. Glusker, J. P. and Trueblood, K. N. Crystal Structure Analysis; Oxford University Press: London, 1972.
3. Blundell, T. L. and Johnson, L. N. Protein Crystallography; Academic Press: New York, NY, 1976.
4. Smith, S. O., Farr-Jones, S., Griffin, R. G. and Bachovchin, W. W. Science 1989, 244, 961-964.
5. Cruickshank, D. W. J. Acta Crystallogr. 1949, 2, 65-82.
6. Luzzati, V. Acta Crystallogr. 1952, 5, 802-810.
7. Pantoliano, M. W., Whitlow, M., Wood, J. F., Dodd, S. W., Hardman, K. D., Rollence and Bryan, P. N. Biochemistry 1989, 28, 7205-7213.
8. Bott, R. and Ultsch, M. In Fifth International Symposium on the Genetics of Industrial Microorganisms; Alacevic, M., Hranueli, D. and Toman, Z., Eds.; Pliva Press: Zagreb, 1987; pp 375-385.
9. Bott, R. and Frane, J. Protein Engineering 1990, 3, 649-657.
10. Bott, R., Dauberman, J., Caldwell, R., Mitchinson, C., Wilson, L., Schmidt, B., Simpson, C., Power, S., Lad, P., Sagar, H., Garycar, T. and Estell, D. In Annals of the New York Academy of Sciences, 1992, Vol 672, pp 10-19.
Received August 26, 1994