Using Graphs to Represent Crystallization Conditions - Crystal Growth

Jan 30, 2013 - We have been working toward building experimental protocols and software tools to help the C3 users and others venturing into protein c...
Article pubs.acs.org/crystal

Using Graphs to Represent Crystallization Conditions

Published as part of the Crystal Growth & Design virtual special issue on the 14th International Conference on the Crystallization of Biological Macromolecules (ICCBM14)

Michelle Chan, Vincent J. Fazio, and Janet Newman*

CSIRO Materials Science and Engineering, 343 Royal Parade, Parkville, VIC, Australia 3052

ABSTRACT: We describe a novel graphical representation of a crystallization condition that provides an intuitive guide to setting the upper and lower concentration boundaries in a fine screen based around that condition.



INTRODUCTION

Having the structure of an interesting protein or complex provides a framework for understanding the biological function of that system. Of all the ways of producing an atomic level three-dimensional (3D) structure, X-ray diffraction techniques are by far the most used; almost eight times as many structures have been solved by X-ray methods as by NMR methods, as seen from the deposition statistics of the Protein Data Bank (PDB, www.rcsb.org). The limitation of X-ray methods is the requirement for highly ordered crystals of the protein or complex of interest.1 The production of protein crystals is by no means straightforward and generally involves an initial random search through crystallization space followed by one or more cycles of optimization.2,3 The initial screening is often done with commercially available screens, but there is no “industry standard” on how best to screen (i.e., what techniques or what screens or even how much one should screen).4 The situation with optimization is even harder, as there are no facile commercial solutions that can be applied.

Initial screening and optimization experiments are generally set up in the same manner: an aqueous mixture of three to five chemicals (often called the crystallization “cocktail”5 or “condition”) is mixed with a concentrated solution of the protein of interest, and the results are monitored over time. The experiments in the suite set up to determine how a given protein will crystallize differ in which chemicals are used in the crystallization cocktail, and at what concentrations and pH, and in other details of the experimental setup, such as the temperature and the ratio of protein solution to crystallization cocktail.6 What makes crystallization all the more challenging is that there is no guarantee that any given protein construct is able to crystallize,7,8 so that there is no clear end point to the crystallization assay. Furthermore, there exist well-documented cliffs in crystallization space, so that a very small transition in one parameter (often pH) can result in crystals or the absence of them.9

It should also be recognized that the production of crystals is not an end point in itself; it is the production of biological insight from a crystal structure that is the desired result. Crystals are a necessary intermediate, but the pressing demands for the resulting structural information (for publication, for example) often limit the time and thought that the crystallization process can be afforded. As protein structure becomes just one more biophysical analysis that is done in tandem with others, more structural work is being performed by researchers with less familiarity with the field and who thus lack a wealth of experience with crystallization.

The Collaborative Crystallisation Centre (C3) is a medium throughput crystallization laboratory that has specialized in providing crystallization services to external users.10 Over the 6.5 year lifetime of the Centre, there have been over 250 users, and approximately 80 users are active at any one time. The Centre receives 2500−3000 samples per year, and the users range from crystallographers with decades of experience in protein crystallization to undergraduate students with no prior exposure to crystallization at all. Our experience suggests that many of the users at the Centre rely on the expertise available to them through the Centre to guide them through the crystallization process. Although there are some excellent resources available, in the literature11,12 and online (e.g., http://hamptonresearch.com/growth_101_lit.aspx), there is no solid set of tools which will help to answer questions about problems encountered along the crystallization path for any given target.

Received: November 29, 2012
Revised: January 24, 2013

dx.doi.org/10.1021/cg301755a | Cryst. Growth Des. XXXX, XXX, XXX−XXX

Figure 1. The blue trace shows the frequency distribution of the chemical polyethylene glycol 400 extracted from the REMARK 280 field of the PDB. There were 2354 valid entries for this chemical. The average concentration over all valid instances of PEG 400 is 23.5% (w/v). The red trace shows the same data after a smoothing and baseline offset process. This reduces the complexity of the data and shows that these data are trimodal, with the chemical being used predominantly at low [≈2% (w/v)], medium [≈18% (w/v)], or high [≈26% (w/v)] concentrations.

We have been working toward building experimental protocols and software tools to help the C3 users and others venturing into protein crystallization. Our first major software tool was a web tool that allows crystallization conditions to be compared and commercial screens to be analyzed and compared.4 The tool is built on an empirical distance calculator that assigns a measure of similarity between two crystallization cocktails; from this we can extend the tool to provide a similarity measure between two sets of cocktails or screens. This tool helps in the selection of conditions for screening but does not help users with the problem of what to use for optimizing any hit discovered through screening.

Optimization is the process of refining the chemistry (and/or other parameters) of the initial hit to produce more and/or better crystals. There are a few standard ways of doing this: fine screening and additive screening are two of the more common optimization tools.13 With additive screening, a small amount of another screen is added to the hit condition, generating a second generation screen in which all the conditions are minor variations of the same starting condition. Another common type of optimization takes each of the chemicals within the hit and varies each between two arbitrarily chosen limits, and the resultant fine screen is laid out as a grid or in a more random arrangement. A variation of this type of fine screening is to include chemicals which are similar to the chemicals in the initial hit. Sometimes the chemicals chosen for a fine screening experiment are grouped, and only one member of each group is chosen for any condition in the screen. This concept was first implemented widely in the program CRYSTOOLS.14 Fine screening always requires decisions to be made by the user (e.g., what the limits of sampling should be for each chemical in the screen, the step size for each chemical, the relative weighting for the selection of a member chemical from any chemical group, and which chemicals can be sensibly clustered into a chemical group).

In order to publish an X-ray structure, the coordinates must be deposited in a community repository, the PDB. Along with the deposition of the actual atomic coordinates, other information about the experiment is required, such as the crystal parameters (e.g., unit cell, space group).15 Information about crystallization may be deposited but is not mandatory at this stage. Any crystallization data deposited in the PDB is found in the REMARK 280 field of a deposition, and the information is free format. These freeform data are (within limits) able to be parsed and are the basis of more specific, crystallization-oriented data repositories.16,17 The information available from the PDB, and thus the derivative databases, focuses almost entirely on the nonprotein component of the crystallization experiment (i.e., the crystallization condition), and this paper discusses only this part of the experiment. Clearly, a successful optimization requires information not only about the crystallization cocktail but also about the protein formulation. Unfortunately, information about the protein formulation has not been captured to anywhere near the same extent as the condition information, but we hope that the community will see the value of collecting these data for future data mining exercises.

We use crystallization information from the PDB to create a “mash up” of any condition in which a user may be interested and provide guides as to the upper and lower limits found in a
successful crystallization space for each of the chemicals in the condition of interest. This information is displayed as a spiderplot with axes corresponding to each chemical factor. The web lines of the spiderplot are upper and lower limits of concentration, obtained from an analysis of the concentration distributions from the PDB.



METHODS

The PDB crystallization data is regularized into a standard format, where each crystallization condition is defined as having one or more chemical factors and where each chemical factor has an associated concentration, concentration unit, chemical name, and potentially a pH value.16 These data are the basis for all subsequent analyses. The current work was done with a set of data extracted from the PDB in June of 2012, which contained just over 100000 valid records (that is, records that contain a concentration, unit, chemical name, and possibly a pH value).

Defining a Spiderplot. A crystallization condition may be represented by a spiderplot, where each chemical is mapped to an axis of the spiderplot. The angular spacing of the axes on the plot is (360/number of chemical factors) degrees, unless there are only two chemical factors, in which case they are drawn 90 degrees apart for aesthetic reasons. We translate the concentration of each chemical factor in the crystallization cocktail into a common unit (% w/v) and order the axes such that the chemical factor with the highest weight percent is horizontal, and the remaining chemical factors are ordered by relative abundance in the cocktail clockwise from the horizontal axis. We call the chemical factor with the highest relative abundance the “primary factor”. Chemical factors are considered “buffer factors” where the concentration of the chemical is low (0.05−0.2 M), an associated pH value exists, and the pH value is within one pH unit of a pKa for that chemical. Buffer factor axes are added to the spiderplot after the nonbuffer factors have been arranged and thus would be the first axes encountered by moving counter-clockwise from the primary factor. If there is more than one buffer factor in a cocktail, the buffer factor with the highest concentration is added first. Buffer factor axes are drawn with a heavier line, to help distinguish them visually, and the outermost point on a buffer factor axis is the value of the relevant pKa of the buffer factor chemical ± 1.5 pH units. The concentration axes are plotted using a logarithmic scale to allow details of the nonprimary factors to be shown without being swamped.

Defining a Concentration Histogram. All the values for the concentration of a chemical factor were extracted from the clean PDB data and were plotted on a frequency histogram, which contains an arbitrary 100 bins. The frequency histograms are quite coarse, as they reflect the data from the PDB, which in turn is an indication of what is done in the laboratory. Optimization concentrations tend to be integer values: a hit obtained from a commercial condition with a chemical at (say) 30% w/v would lead to an optimization around that condition that uses steps of 1−5% around this value, perhaps at 26%, 28%, 30%, and 32%. The raw frequency distribution histograms are smoothed using a sliding box (moving average) algorithm. The smoothed frequency distribution curves are used to find an appropriate average and standard deviation for each chemical in the spiderplot representation.

For some chemicals, an overall mean value and standard deviation may not be a good choice for this. Consider chemicals such as the liquid polyethylene glycols (PEGs), which are used at both low concentrations (as additives) and high concentrations (as precipitants).16 For these chemicals, the concentration distribution is bimodal (or even trimodal), and the appropriate mean and standard deviation which might guide subsequent optimizations will depend on the concentration of the initial hit. To determine the overall shape of the frequency distribution [that is, whether the frequency distribution is best described as (for example) unimodal or bimodal], the sliding window (moving average) technique is combined with baseline offsetting to show only the peaks of the histogram. This process of smoothing and baseline offsetting is repeated until an unambiguous number of peaks is obtained; see Figure 1. For each peak found via the smoothing/offsetting process, a mean and standard deviation for that peak is found by analyzing the corresponding region of the smoothed frequency distribution curve, with no baseline offset. For buffer factors, this same process is performed but using pH values obtained from the cleaned PDB data rather than concentrations.

Defining a pH Histogram. Initially, a frequency histogram of some common buffers was calculated to determine the concentrations used for buffering chemicals in crystallization space. Table 1 shows the distribution of concentrations for three common buffer chemicals. From this, we confirmed our initial assumption that most of the variation in buffer factors is in the pH, rather than in the concentration of the factor. The upper and lower pH values for display on the spiderplot are found by locating the nearest pKa value to the pH value at which the buffer factor was used and returning this pKa value ± 1.5 pH units. The pH range over which a buffer may be considered useful is likely to be less than this and will certainly be somewhat buffer specific; however, the value of 1.5 was chosen to ensure that the full pH range would be included.

Table 1. Summary of Concentrations Found in the PDB for Three Commonly Used Buffer Chemicals (a)

Buffer           Total instances   Count ≤0.05 M   Count ≤0.1 M   Count ≤0.2 M   Count >0.2 M
Tris             7010              1526            5415           14             55
HEPES            5625              981             4593           23             28
Sodium acetate   2897              362             2428           76             31

(a) Tris includes chemical factors which are likely to be Tris buffers: Tris, Tris chloride, Tris sulfate, and Tris malate. HEPES includes HEPES and sodium HEPES. Sodium acetate only includes instances where the pH is between 3.6 and 5.6. The count columns are successive, non-overlapping concentration ranges, so each row sums to its total. The concentration is greater than 0.1 M in fewer than 1% of instances of Tris and HEPES buffers and less than 4% in the case of sodium acetate.

Adding Limit Data to the Spiderplot. For each axis in a spiderplot, a value for the mean and standard deviation (either concentration values or pH values) for that chemical is obtained from the smoothed frequency distribution curve. If the chemical factor has a bimodal or higher distribution (from the smoothed frequency histogram analysis), the appropriate peak is used to find a mean and standard deviation. The value of the appropriate mean ± 1 standard deviation is plotted along the same axis as the value for the chemical factor. The radial path for the chemical factors in the crystallization cocktail is colored blue; the radial path for the lower limit of each chemical is colored green, and the radial path for the upper limit of each chemical factor is colored red (see Figure 2).
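The limit-finding procedure in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, the baseline is assumed to be a fixed fraction of the tallest smoothed bin (the text does not say how the offset is chosen), smoothing and offsetting are done in a single pass rather than repeated, and the "appropriate" peak is assumed to be the one whose mean lies nearest the hit. The 100 bins, the moving average (window of 8, two iterations, as quoted in the Results), the mean ± 1 standard deviation limits, and the pKa ± 1.5 pH range are taken from the text.

```python
import math

def histogram(values, bins=100, lo=0.0, hi=None):
    """Count values (assumed to lie in [lo, hi]) into equal-width bins."""
    hi = max(values) if hi is None else hi
    if hi <= lo:
        raise ValueError("need hi > lo")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    centres = [lo + (i + 0.5) * width for i in range(bins)]
    return counts, centres

def smooth(counts, window=8, iterations=2):
    """Sliding-box (moving average) smoothing, applied `iterations` times."""
    out = [float(c) for c in counts]
    for _ in range(iterations):
        prev, out = out, []
        for i in range(len(prev)):
            a = max(0, i - window // 2)
            b = min(len(prev), i - window // 2 + window)
            out.append(sum(prev[a:b]) / (b - a))
    return out

def find_peaks(smoothed, baseline_fraction=0.05):
    """Assumed baseline offset: each run of bins above the baseline is a peak."""
    baseline = baseline_fraction * max(smoothed)
    spans, start = [], None
    for i, y in enumerate(smoothed):
        if y > baseline and start is None:
            start = i
        elif y <= baseline and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(smoothed)))
    return spans

def peak_stats(smoothed, centres, span):
    """Weighted mean and standard deviation of the smoothed curve over a peak."""
    a, b = span
    pairs = list(zip(smoothed[a:b], centres[a:b]))
    total = sum(w for w, _ in pairs)
    mean = sum(w * x for w, x in pairs) / total
    var = sum(w * (x - mean) ** 2 for w, x in pairs) / total
    return mean, math.sqrt(var)

def concentration_limits(hit, values, bins=100, hi=None):
    """Mean ± 1 sd of the peak nearest the hit concentration (assumed rule)."""
    counts, centres = histogram(values, bins=bins, hi=hi)
    curve = smooth(counts)
    stats = [peak_stats(curve, centres, s) for s in find_peaks(curve)]
    mean, sd = min(stats, key=lambda ms: abs(ms[0] - hit))
    return max(0.0, mean - sd), mean + sd

def ph_limits(ph, pkas, spread=1.5):
    """Nearest pKa to the pH used, returned as pKa ± spread."""
    pka = min(pkas, key=lambda p: abs(p - ph))
    return pka - spread, pka + spread

# Made-up bimodal concentrations (% w/v), standing in for parsed PDB values.
values = ([1.0] * 20 + [2.0] * 50 + [3.0] * 20 + [22.0] * 10
          + [24.0] * 30 + [25.0] * 60 + [26.0] * 30 + [28.0] * 10)
print(concentration_limits(4.0, values, hi=50.0))   # limits from the low peak
print(concentration_limits(30.0, values, hi=50.0))  # limits from the high peak
print(ph_limits(7.5, [7.5]))
```

With real data the baseline fraction and the number of smoothing passes would need tuning per chemical, which is the same caveat the text raises about determining the smoothing parameters more dynamically.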



RESULTS AND DISCUSSION

We have described an intuitive graph-based visualization method to display a crystallization cocktail. The format of the spiderplot allows the easy identification of the most abundant chemical factor found in the cocktail (the primary factor), as it is found along the horizontal axis. The remaining chemical factors (excluding the buffer factors) are arranged clockwise in order of relative abundance. Before the graph is drawn, all of the chemical factors within the crystallization cocktail are converted from the original units to percent (weight/volume), as this is the only unit to which all others can be converted. A chemical factor in the cocktail that is a potential buffering chemical (i.e., it is found at low concentrations, at a defined pH which is close to a pKa for that chemical) is displayed on the graph using a heavier linetype, and the value plotted on the axis is the pH, rather than the concentration, as it is usual to vary the pH rather than the concentration of the buffer factors.
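The axis ordering and buffer rule can be made concrete with a short Python sketch. Everything here is illustrative rather than the authors' code: `Factor`, `molar_to_percent_wv`, `is_buffer`, and `layout_axes` are hypothetical names, the buffer test assumes the "0.2 M or less, within 1 pH unit of a pKa" form of the rule, and the molar mass, density, and pKa values in the example are approximate textbook figures supplied only to make it run.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Factor:
    name: str
    percent_wv: float               # concentration already converted to % (w/v)
    molar: Optional[float] = None   # molar concentration, if known
    ph: Optional[float] = None
    pkas: Sequence[float] = ()

def molar_to_percent_wv(molar, molar_mass):
    """mol/L x g/mol gives g/L; dividing by 10 gives g per 100 mL, i.e. % (w/v)."""
    return molar * molar_mass / 10.0

def is_buffer(f, max_molar=0.2, ph_window=1.0):
    """Low concentration, a pH is recorded, and that pH is near one of the pKa values."""
    if f.molar is None or f.ph is None:
        return False
    return f.molar <= max_molar and any(abs(f.ph - p) <= ph_window for p in f.pkas)

def layout_axes(factors):
    """Return (name, clockwise angle in degrees from horizontal, is_buffer) per axis."""
    # Nonbuffer factors first, most abundant (the primary factor) on the horizontal.
    others = sorted((f for f in factors if not is_buffer(f)),
                    key=lambda f: f.percent_wv, reverse=True)
    # Buffer factors are added afterwards, highest concentration first.
    buffers = sorted((f for f in factors if is_buffer(f)),
                     key=lambda f: f.molar, reverse=True)
    ordered = others + buffers
    if not ordered:
        return []
    # Two axes are drawn 90 degrees apart; otherwise the circle is divided evenly.
    step = 90.0 if len(ordered) == 2 else 360.0 / len(ordered)
    return [(f.name, i * step, is_buffer(f)) for i, f in enumerate(ordered)]

# The cocktail of Figure 2a, with approximate conversions to % (w/v).
cocktail = [
    Factor("sodium HEPES", molar_to_percent_wv(0.1, 260.3),
           molar=0.1, ph=7.5, pkas=(7.5,)),
    Factor("ethylene glycol", 8.9),   # 8% (v/v) x ~1.11 g/mL
    Factor("PEG 8K", 10.0),
]
for axis in layout_axes(cocktail):
    print(axis)
```

Because the last axis placed is a buffer axis, it is also the first axis met when moving counter-clockwise from the primary factor, which matches the description in the Methods.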

Figure 2. Two representative spiderplots. (a) 10% w/v PEG 8K, 8% v/v ethylene glycol, and 0.1 M sodium HEPES, pH 7.5. (b) 50% w/v PEG 400, 0.2 M lithium sulfate, 0.1 M sodium acetate, pH 4.5; 0.05 M MES, pH 6.5. The factor with the highest concentration (after translating units to w/v) is drawn along the horizontal axis, and factors are arranged counter clockwise according to concentration. If a chemical factor has a low concentration (0.2 M or less) and has an associated pH which is within 1 pH unit of a pKa for that chemical, it is assumed to be a buffer, and the pH rather than the concentration is considered to be the salient attribute. Buffer factors are drawn after the other factors, with a heavier line type. The concentration (or pH) values for each factor are joined by a blue line, to give a spiderplot. Along each axis, a value for the upper limit of that factor (either concentration or pH) is plotted, and the upper values (as determined from the smoothed frequency distribution) are joined with a red line. Along each axis, a value for the lower limit of that factor (either concentration or pH) is plotted, and the lower values (as determined from the smoothed frequency distribution) are joined with a green line. The dashed line provides a boundary for the plot.

Along with the value of the chemical factor, two other values are plotted on each axis: a value for the upper limit of the normal range for that chemical, found over all instances of the chemical in the PDB, and a value for the lower limit of the normal range for that chemical. The upper and lower limits are chosen from the appropriate peak in the frequency distribution graph for that chemical from the PDB. For example, the frequency distribution histogram and smoothed frequency distribution graph for polyethylene glycol 400 (PEG 400) show three peaks (the first centered at a concentration of 1.8% (w/v), the second centered at 17.8% (w/v), and the third at 26.6% (w/v); see Figure 1). If a chemical cocktail contained PEG 400 at 4% (w/v), then the upper and lower bounds shown on the spiderplot would represent the mean ± 1 standard deviation of the first peak. If the value for the concentration of PEG 400 in a chemical cocktail falls exactly between two peaks from the frequency distribution graph, then the spread of the larger peak is used, around a center point defined by the concentration of the PEG 400 in the cocktail.

There are two parameters which can be tuned to produce a smoothed plot: the extent of the sliding window and the number of iterations of the smoothing process. Currently, we use a window size of 8 and do two iterations of the smoothing process; however, in future work, the smoothing process (and baseline offsetting) should be more dynamically determined, to ensure that every smoothed plot gives a clean result.

Using Class to Define Optimization Limits. The concept of a class, or a role, for a chemical used in crystallization seems intuitive and has been used as the basis of optimization design in some commercial software packages. For example, in the Rigaku CrystalTrak application (Rigaku Automation, Carlsbad, CA), a chemical belonging to a class “precipitant” is optimized by taking the lower bound of the concentration range of that chemical to be the current value × 0.8 and the upper bound of the optimization range to be the current value × 1.1. Nonbuffer chemicals that are not associated with the class “precipitant” are given the default concentration range of 0 → current value × 2. Given robust class association, this optimization strategy would provide a sensible (if not particularly creative) heuristic for fine screening optimizations.

However, there are some fundamental problems in the way the classes are associated with a chemical. There is no recognized standard for the association of any given chemical into a class. Most often, a class is the column header on a specification sheet for the chemical factors in the set of crystallization cocktails in a commercial screen, and the arrangement of chemical factors into columns is dictated by convenience as much as anything else. As a result, the association of chemicals into classes is somewhat arbitrary. Furthermore, classes are generally of two types: those that describe the action of the chemical and those that describe the type of chemical. For example, the class “precipitant” is generally used to mean the chemical factor that is present in the greatest amount in the crystallization condition and which has the role of bringing the protein out of solution. The class “salt” defines the chemical factor as being a salt. Of course, these two classes are not exclusive: in a condition that contains 2 M ammonium sulfate and 0.1 M Tris-HCl, pH 8, the ammonium sulfate could validly be classed as both a salt and a precipitant. Without additional input from the user, the Rigaku automatic optimization strategy would set limits of either 0−4 M (if the class were salt) or 1.6−2.2 M (if the class were precipitant) for the ammonium sulfate in the example above.
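The class-based rule described for CrystalTrak is simple enough to write down directly, which makes its sensitivity to the class label easy to see. The sketch below encodes only the rule as stated in the text for nonbuffer chemicals; the function name is hypothetical, and this is not code from, or an interface to, the Rigaku software.

```python
import math

def class_based_range(value, chem_class):
    """Return (lower, upper) limits for a nonbuffer chemical under the class rule."""
    if chem_class == "precipitant":
        return 0.8 * value, 1.1 * value
    # Any other nonbuffer class gets the default range of 0 to twice the value.
    return 0.0, 2.0 * value

# 2 M ammonium sulfate, labeled first as a precipitant and then as a salt.
for cls in ("precipitant", "salt"):
    lo, hi = class_based_range(2.0, cls)
    print(f"{cls}: {lo:.1f}-{hi:.1f} M")
```

The same chemical at the same concentration gets a range of 1.6 to 2.2 M under one label and 0 to 4 M under the other, so the outcome is decided entirely by the label chosen.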

For a novice venturing into crystallization, the very concept of “class” is foreign, and our experience in C3 suggests that it is unlikely that a novice user will reliably select the appropriate class in order to obtain a reasonable optimization design.

Spiderplot Ranges Are Set from Data on Successful Crystallizations. In the spiderplot representation of a crystallization cocktail, the chemical factors in a crystallization cocktail are not labeled with a class. There is the concept of the primary factor, but that is to provide an anchor point for the plot and does not suggest a possible mode of action of that chemical factor. The goal of associating each chemical factor with an upper and lower limit gleaned from an analysis of successful crystallization conditions is to provide the crystallization optimizer, human or machine, with guides as to realistic values for the factor. These upper and lower values can guide future optimizations if no other information is available as to the range of values that would be appropriate for subsequent fine screening.

We do embrace the concept of class for buffer factors, but we define the class automatically from data, rather than from an arbitrary column heading. If a chemical has a low concentration (0.2 M or less) and has an associated pH value within 1 pH unit of a pKa for that chemical, it is assigned the class buffer, and subsequent optimizations are performed by altering the pH of the factor, rather than the concentration. This follows the convention that most crystallographers would use. This definition of buffer thus excludes pH optimization of chemicals that are used both as buffers and salts when they are too far away from a pKa to provide any kind of pH buffering. For example, sodium acetate used at 0.2 M at pH 7 would be optimized on concentration, rather than on pH, whereas sodium acetate used at 0.2 M at pH 5 would be optimized on pH.

The upper and lower limits for each of the chemical factors found on the spiderplot axes are the overall limits found for that chemical factor, rather than the limits found for that factor in combination with other chemical factors. Although it would be more appropriate to be able to set upper and lower limits of a chemical factor taking into account the other chemical factors in the cocktail, the data mined from the PDB are generally not rich enough to do this. As of June 2012, the crystallization data from the PDB contained 107701 records of individual chemical factors from 44495 PDB entries. Many PDB entries contain some incomplete crystallization data, but unless a (reasonable) concentration, unit, and recognizable chemical name could be parsed, the data were ignored. Of the 598 distinct chemicals found as chemical factors in the PDB, only 22 chemicals were found in more than 1000 records. Looking only at these very popular chemical factors, we counted the number of times the 231 pair combinations (i.e., 22C2) of these 22 popular chemical factors arose. Only 137 pairings of the 22 “rich” chemical factors were found over 100 times. If we were to look at triplets, the number of possible combinations explodes to 1540, and it becomes clear that the data from the PDB are not extensive enough to provide valid estimates of the limits.

Any general tool may not be appropriate in specific cases; just because two proteins show initial crystallization hits in the same crystallization cocktail does not ensure that the refined conditions that produce crystals of the two proteins will be the same. We cannot guarantee that the optimization limits suggested by this data mining approach will be the best (or even a mediocre) starting point for any given fine-screening strategy. However, it does provide a reasonable approach and, as such, may well be a better starting point for optimization than one from a complete naïf.
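The pair and triplet counts quoted for the 22 most popular chemicals are binomial coefficients, and the co-occurrence tally is a short loop. The Python sketch below checks the two numbers and shows one way such a tally could be made; `count_pairs` is a hypothetical helper, and the toy records are made up for illustration rather than taken from the PDB.

```python
from collections import Counter
from itertools import combinations
from math import comb

# Number of distinct pairs and triplets that can be drawn from 22 chemicals.
print(comb(22, 2), comb(22, 3))

def count_pairs(conditions, popular):
    """Count how often each unordered pair of popular chemicals co-occurs."""
    counts = Counter()
    for chemicals in conditions:
        # Sort so that each pair is always stored in the same order.
        present = sorted(set(chemicals) & popular)
        counts.update(combinations(present, 2))
    return counts

# Toy records standing in for parsed crystallization conditions.
toy = [
    ["PEG 400", "sodium acetate", "lithium sulfate"],
    ["PEG 400", "sodium acetate"],
    ["ammonium sulfate", "Tris"],
]
popular = {"PEG 400", "sodium acetate", "Tris", "ammonium sulfate"}
pairs = count_pairs(toy, popular)
for pair, n in pairs.most_common():
    print(pair, n)
```

With roughly 45 000 usable entries spread over 231 pairs, and far more thinly over 1540 triplets, most combinations have too few instances to support a per-combination histogram, which is the point the text makes.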



CONCLUSION

We have developed a novel representation of a crystallization condition that captures many of the salient features of a crystallization experiment: which chemical factor is most abundant and the relative abundances of the other factors. We overlay values for likely upper and lower bounds for each of the chemical factors onto this plot, which then acts as a sensible starting point for subsequent optimizations using fine screening. These plots are available through the C6 Web site (http://c6.csiro.au).



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank Dr. Tom Peat for numerous discussions about the spiderplot representation and the VLSCI program and CSIRO for providing funding for M.C. during this project.



REFERENCES

(1) McPherson, A. Eur. J. Biochem. 1990, 189, 1−23.
(2) Cudney, R.; Patel, S.; Weisgraber, K.; Newhouse, Y.; McPherson, A. Acta Crystallogr., Sect. D: Biol. Crystallogr. 1994, 50, 414−423.
(3) McPherson, A. Methods 2004, 34, 254−265.
(4) Newman, J.; Fazio, V. J.; Lawson, B.; Peat, T. S. Cryst. Growth Des. 2010, 10, 2785−2792.
(5) Luft, J. R.; Wolfley, J. R.; Snell, E. H. Cryst. Growth Des. 2011, 11, 651−663.
(6) Luft, J. R.; Wolfley, J. R.; Said, M. I.; Nagel, R. M.; Lauricella, A. M.; Smith, J. L.; Thayer, M. H.; Veatch, C. K.; Snell, E. H.; Malkowski, M. G.; DeTitta, G. T. Protein Sci. 2007, 16, 715−722.
(7) Smialowski, P.; Schmidt, T.; Cox, J.; Kirschner, A.; Frishman, D. Proteins: Struct., Funct., Bioinf. 2005, 62, 343−355.
(8) Doye, J. P.; Louis, A. A.; Vendruscolo, M. Phys. Biol. 2004, 1, P9.
(9) McPherson, A. J. Appl. Crystallogr. 1995, 28, 362−365.
(10) Newman, J. Methods 2011, 55, 73−80.
(11) Bergfors, T. Protein Crystallization, 2nd ed.; IUL Biotechnology Series; International University Line: La Jolla, CA, 2009.
(12) Chayen, N. E. Protein Crystallization Strategies for Structural Genomics; International University Line: La Jolla, CA, 2007.
(13) Newman, J.; Pham, T. M.; Peat, T. S. Acta Crystallogr., Sect. F: Struct. Biol. Cryst. Commun. 2008, 64, 991−996.
(14) Segelke, B. W. J. Cryst. Growth 2001, 232, 553−562.
(15) Berman, H.; Henrick, K.; Nakamura, H.; Markley, J. L. Nucleic Acids Res. 2007, 35, D301−D303.
(16) Peat, T. S.; Christopher, J. A.; Newman, J. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2005, 61, 1662−1669.
(17) Gilliland, G. L.; Tung, M.; Blakeslee, D. M.; Ladner, J. E. Acta Crystallogr., Sect. D: Biol. Crystallogr. 1994, 50, 408−413.
