Implementation of an Algorithm for Modeling Disulfide Bond Patterns Using Mass Spectrometry

R. Craig,† O. Krokhin,† J. Wilkins,† and R. C. Beavis†,‡

Manitoba Centre for Proteomics, University of Manitoba, Winnipeg, MB, Canada, and Institute for Biophysical Dynamics, University of Chicago, Chicago, Illinois

Received February 27, 2003

Abstract: This paper describes the implementation of a software system based on the Fenyö disulfide bond assignment algorithm. The system allows an investigator to enter data derived from mass spectrum peak assignments, a target protein sequence, and other experimental conditions. The output of the system is the set of disulfide bonding pattern models that are consistent with the experimental evidence. The software and source code are available through a public web site, which also hosts a functioning, publicly accessible version of the disulfide bond modeler. This implementation was tested as part of a project to check homology-based assignments of disulfide bonding patterns in human integrins.

Keywords: disulfide • crosslinking • mass spectrometry • Fenyö algorithm • MALDI • sequence modeling

† Manitoba Centre for Proteomics, University of Manitoba. ‡ Institute for Biophysical Dynamics, University of Chicago.

10.1021/pr034016a CCC: $25.00 © 2003 American Chemical Society

Introduction

The formation of nonrandom disulfide bonds that stabilize the structures of exported proteins has been studied extensively in both prokaryotes and eukaryotes.1,2 In eukaryotes, these bonds are formed in a multiple step process, catalyzed by specific enzymes and chaperonins in the endoplasmic reticulum.3 Determining the pattern of disulfide bonding in a particular sequence has remained a technically difficult experimental problem.

Two experimental methods are commonly used to determine the disulfide bond cross-linking pattern in a protein. The first method is to obtain a high-resolution three-dimensional structure for the protein in question and directly observe the disulfide bonds between cysteine residues. This method may be the only possibility for very highly disulfide cross-linked proteins, such as wheat germ agglutinin.4 The second method is to digest an intact protein with a known amino acid sequence with a sequence-specific protease,5 e.g., trypsin. The cleaved peptides are then predicted theoretically, based on the enzyme specificity and the protein sequence. The mixture of peptides derived experimentally is then compared with the theoretical cleavage pattern. Any set of peptides that are covalently linked together can be used to produce a model of the cystine structure of the original sequence.

Both of these methods have serious practical difficulties. Determining the three-dimensional structure of a native, intact protein is often impossible. To obtain a structure, a subdomain of the protein that is sufficiently soluble and will form crystals may be expressed and studied. The full disulfide bonding pattern of a large protein is usually not available from the subdomain: only the subset of cystines contained in the subdomain can be inferred. In proteins with repetitive sequences, it is possible to use the information obtained from one subdomain to infer the disulfide bonding pattern in other, similar subdomains. This process of inference to other domains is often extended to other protein sequences with sufficient sequence similarity. Modeling by inference is widely used in the biological community, and these models are often the only type available for many proteins.

The major problem with the second method is the analysis of the peptide mixtures that result from the enzymatic cleavage of the protein's peptide backbone. The peptides present in the mixture are difficult to predict, because proteins with intact cystines often have domains that are very resistant to proteolysis, and it is these domains that contain the cystines. Therefore, it is necessary to separate the resulting peptides chromatographically and perform a detailed analysis of each chemical species to determine which backbone bonds have been cleaved and then infer a consistent disulfide bond model.

The process of analyzing the peptides present has been considerably simplified by the availability of modern mass spectrometers with high sensitivity and mass accuracy.6-8 Naively, it would appear that simply cleaving the protein and measuring the masses of the resulting peptides would quickly produce a viable model, based on a known amino acid sequence for the target protein. In practice, the prediction of potential disulfide bonding models and comparison of these models to mass spectra is straightforward only in cases where there are only a few potential disulfide bonds.
The task quickly becomes difficult to perform by inspection as the number of cysteines becomes large. The number of disulfide bonding models, m, that can be constructed from a sequence with a given number of cysteines, N, and an expected number of disulfide bonds, a, can be calculated using the simple recursion relationship

$$m(a, N + 1) = m(a, N) + N\,m(a - 1, N - 1) \tag{1}$$

Technical Notes. Journal of Proteome Research 2003, 2, 657-661. Published on Web 08/28/2003.

The total number of models can be calculated by tabulating the matrix generated by application of eq 1, starting with the trivial conditions m(0,N) = 1 for N ≥ 0 and m(a,0) = 0 for a > 0. Equation 1 can be proven by considering how many models are added to the collection m(a,N) when one more cysteine is added: the new cysteine either remains free, which leaves the m(a,N) existing models, or bonds to one of the N other cysteines, which leaves a − 1 bonds to be placed among the remaining N − 1 cysteines. Manually enumerating this case for a few values of a and N demonstrates the validity of eq 1. The total number of models for any value of N, M(N), can then be calculated by summing the appropriate row of the matrix

$$M(N) = \sum_{a=0}^{\lfloor N/2 \rfloor} m(a, N) \tag{2}$$
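
As a minimal sketch (not the authors' C++ implementation), eqs 1 and 2 can be tabulated directly in Python. The function names `m` and `total_models` are chosen here purely for illustration, and the recursion is rewritten with n = N + 1 so that it can be evaluated top-down.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def m(a: int, n: int) -> int:
    """Number of models with exactly `a` disulfide bonds among `n` cysteines (eq 1)."""
    if a == 0:
        return 1  # m(0, N) = 1: every cysteine is free
    if n < 2:
        return 0  # m(a, 0) = 0 for a > 0; a single cysteine cannot form a bond
    # eq 1 with N + 1 = n: the last cysteine is free, or bonded to one of n - 1 others
    return m(a, n - 1) + (n - 1) * m(a - 1, n - 2)


def total_models(n: int) -> int:
    """M(N) from eq 2: sum over every feasible number of bonds."""
    return sum(m(a, n) for a in range(n // 2 + 1))


if __name__ == "__main__":
    for n in range(9):
        print(n, [m(a, n) for a in range(5)], total_models(n))
```

Memoizing `m` keeps the cost proportional to the number of distinct (a, N) cells rather than to the number of models, which is what makes tabulating the matrix cheap even though enumerating the models themselves is not.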

It should be noted that M(N) is significantly larger than the numbers quoted in some standard discussions (e.g., Anfinsen9). The order of the calculation is directly proportional to M; therefore, the calculation is of order $O(M) \approx 10^{0.88N}$ for N > 20, which makes exhaustive enumeration computationally intractable for large N.10

The Fenyö algorithm11 for calculating disulfide bonding patterns (which we implement here) reduces the complexity of the enumeration problem by breaking up the calculation into a set of subproblems. It does so by determining the maximum number of cysteines, $n_i$, that would be necessary to produce a particular peptide mass, $m_i$, for any given protein sequence and cleavage specificity. By arranging the cleavage chemistry to minimize n, the algorithm allows an exhaustive calculation to be done on cleavage peptide data sets from proteins with large numbers of cysteine residues.

In an unknown protein, one cannot assume that all of the cysteines are used to form cystines, for many reasons. Proteins often have odd numbers of cysteines, with the extra cysteine being used to form intermolecular bonds. Cysteines may be used to chelate metals (e.g., in zinc-finger motifs), to attach heme groups (e.g., cytochrome c), or as nucleophiles in the active sites of enzymes (e.g., cysteine proteases), or they can simply be free cysteines. Integral membrane proteins often have both intra- and extracellular cysteines, and the former will not be available for disulfide bonding because of endogenous glutathione.
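
To make the role of free cysteines concrete, the following sketch enumerates every bonding pattern for a small set of cysteine positions, leaving unpaired cysteines free. It is a brute-force illustration of the enumeration problem, not the Fenyö algorithm itself, and the names `cysteine_positions` and `bonding_patterns` are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

Bond = Tuple[int, int]


def cysteine_positions(sequence: str) -> List[int]:
    """1-based positions of cysteine residues in a single letter code sequence."""
    return [i + 1 for i, residue in enumerate(sequence) if residue == "C"]


def bonding_patterns(cysteines: Sequence[int]) -> Iterator[List[Bond]]:
    """Yield every set of disjoint cysteine pairs; unpaired cysteines stay free."""
    if len(cysteines) < 2:
        yield []
        return
    first, rest = cysteines[0], list(cysteines[1:])
    # Case 1: the first cysteine is free.
    yield from bonding_patterns(rest)
    # Case 2: the first cysteine is bonded to one of the others.
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for pattern in bonding_patterns(remaining):
            yield [(first, partner)] + pattern


if __name__ == "__main__":
    # Hypothetical sequence with four cysteines, used only as an example.
    for pattern in bonding_patterns(cysteine_positions("ACDCKCGC")):
        print(pattern)
```

The two cases in `bonding_patterns` mirror the two terms of eq 1, so the number of patterns produced for N cysteines equals M(N); the rapid growth of that count is why this direct approach is only practical for small N.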

Experimental Section

Protein Analysis. This section is provided as a reference for the experimental protocol that has been used to test this implementation of the modeling algorithm. This protocol was used to generate the example data, available on the user interface by pressing the button described in the note at the bottom of the page.12 It was also used to generate the data for testing the modeler for unanticipated behavior resulting from programming errors. αVβ1 and αVβ3 human integrins were immunoaffinity purified from human placenta and αIIbβ3 integrin from outdated human platelets.13 Bovine serum albumin (Sigma-Aldrich) was used as supplied. Purified integrins were deglycosylated with PNGaseF (Roche) and digested with excision grade trypsin (Calbiochem). MS spectra were acquired for nonreduced digests or for digests after reduction/alkylation (5 mM DTT/30 mM iodoacetamide). Chromatographic separations of integrin digests were performed using an Agilent 1100 Series µLC system. Samples (5 µl in 0.1% TFA) were loaded onto capillary columns (180 µm × 150 mm) packed with Vydac 218 TP C18 (5 µm diameter) and eluted (4 µl/min) with a linear gradient of 1-90% acetonitrile (0.1% TFA) in 90 min. Column effluent was mixed on-line with MALDI matrix solution, and fractions were deposited on the MALDI target at 1 min intervals. The fractions were air-dried on the target and subjected to MALDI QqTOF MS (MS/MS) analysis. MS measurements were performed using a MALDI tandem quadrupole/time-of-flight (QqTOF) instrument14 with a mass resolving power of about 10,000 (fwhm) and accuracy in the range of 10 ppm in both single MS mode and MS/MS mode.

Journal of Proteome Research • Vol. 2, No. 6, 2003

Figure 1. Total number of disulfide bonding models possible as a function of the number of cysteines present in a peptide, calculated using eqs 1 and 2.

Software Development. This implementation of the Fenyö algorithm was developed using Visual Studio C++.NET (Microsoft). It was written in a fully object-oriented, portable style that allowed it to be compiled either with the Microsoft Windows compiler (Microsoft, Redmond WA) or the GNU compiler gcc15 for Linux (Red Hat, Raleigh NC), with built-in multi-thread support. The Standard Template Library was used extensively to simplify the implementation. The component that performs the modeling was completely separated from the user interface and designed to be neutral to the communication interface, so that it can be inserted into larger systems. The component was written to be activated from the command line, with a single command line parameter, which is the full path name of an Extensible Markup Language16 (XML) file that contains all of the information necessary to calculate the models. In addition to this information, the input XML file contains the full path name of an output XML file that will contain all of the model information when the calculation is complete, including the information that was contained in the input XML file. All of the source code for this component and a complete specification for the XML used for the input and output files (BIOML17) are made available as an open source project at a publicly available web site,18 with its use governed by the Artistic License provided by the Open Source Initiative.19
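
The calling convention described above (one command line argument naming an input XML file, which in turn names the output XML file) can be sketched as follows. This is only an illustration of the pattern: the element names, attribute names, parameter names, file paths, and executable name are all placeholders invented here and do not reproduce the BIOML schema or the real program name, which are documented at the project site.18

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_input_xml(parameters: Dict[str, str], output_path: str) -> str:
    """Serialize the run parameters and the output file path to an XML string.

    The element and attribute names are placeholders, NOT the BIOML schema.
    """
    root = ET.Element("modeler_input")
    ET.SubElement(root, "output", {"path": output_path})
    for name, value in parameters.items():
        ET.SubElement(root, "parameter", {"name": name}).text = value
    return ET.tostring(root, encoding="unicode")


def modeler_command(executable: str, input_xml_path: str) -> List[str]:
    """Argument list for subprocess.run: the program plus a single XML path."""
    return [executable, input_xml_path]


if __name__ == "__main__":
    # Hypothetical parameter names and paths, used only to show the shape of a run.
    xml_text = build_input_xml(
        {"cleavage reagent": "trypsin", "sequence": "ACDCKCGC"},
        "/tmp/models_out.xml",
    )
    print(xml_text)
    print(modeler_command("./disulfide_modeler", "/tmp/models_in.xml"))
```

Keeping every setting in a file, rather than in command line flags, is what lets the same component sit behind a web form, a script, or a larger pipeline without modification.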

Results and Discussion

Table 1 shows the tabulated matrix of the number of possible disulfide bond models for N ≤ 8, and Figure 1 shows the rate of growth of M(N) over a wider range of N values, as derived from eqs 1 and 2. It should be noted that this calculation is valid for any cross-linking reagent that covalently bonds residue side chains. The values of M(N) are significantly larger than those usually quoted for disulfide bonding models; for example, the number of models for N = 8 is quoted as being 105 by Anfinsen,9 whereas M(8) = 764 in Table 1. The reason for this difference is that for small exported proteins, it is assumed that the folded protein contains only cystines, i.e., there are no free cysteines. Therefore, Anfinsen's value of 105 represents only the m(4,8) cell in Table 1.


Table 1. Tabulation of the Number of Disulfide Bonding Models Calculated Using eqs 1 and 2

N    a = 0   a = 1   a = 2   a = 3   a = 4      M
0        1       0       0       0       0      1
1        1       0       0       0       0      1
2        1       1       0       0       0      2
3        1       3       0       0       0      4
4        1       6       3       0       0     10
5        1      10      15       0       0     26
6        1      15      45      15       0     76
7        1      21     105     105       0    232
8        1      28     210     420     105    764
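
The entries of Table 1 can be cross-checked against a closed form that is not given in the paper but follows from a standard counting argument: choose which 2a of the N cysteines are bonded, then pair them up, giving $m(a, N) = \binom{N}{2a}\,(2a - 1)!! = N! / (a!\,(N - 2a)!\,2^a)$. The sketch below, with the hypothetical name `m_closed`, evaluates this expression; for the fully bonded case N = 2a it reduces to (2a − 1)!!, which is the count that excludes free cysteines.

```python
from math import comb, factorial


def m_closed(a: int, n: int) -> int:
    """Closed-form count of models with `a` bonds among `n` cysteines."""
    if 2 * a > n:
        return 0  # not enough cysteines to form `a` bonds
    pairings = factorial(2 * a) // (factorial(a) * 2 ** a)  # (2a - 1)!!
    return comb(n, 2 * a) * pairings


def total_closed(n: int) -> int:
    """Row total M(N), summing the closed form over all feasible bond counts."""
    return sum(m_closed(a, n) for a in range(n // 2 + 1))


if __name__ == "__main__":
    for n in range(9):
        print(n, [m_closed(a, n) for a in range(5)], total_closed(n))
```

For N = 8 and a = 4, the expression gives 7 × 5 × 3 × 1 = 105, in agreement with the m(4,8) cell of the table.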

Figure 1 demonstrates that it is not practical to match mass spectrometry data derived from enzymatic or chemical cleavage mixtures by simply enumerating every possible disulfide bonding pattern for a protein with N > 20. It also demonstrates that the order of the calculation is very significantly reduced if the algorithm effectively reduces the values of N considered: for values of N < 20, the order of the calculation is much less than the log-linear relationship that holds for values of N > 20.

The experimental results obtained using the experimental protocol described above produce models that have varying degrees of completeness. In the case of simple disulfide bonding structures, this approach will produce a complete model for the disulfide bonding pattern, assuming that the mass spectrometer's mass accuracy is sufficient to distinguish between potential models and that there are appropriate cleavage sites in the protein's sequence to provide the necessary peptides. In practice, complex disulfide bonding patterns may produce peptide masses that cannot be definitively interpreted: either the mass accuracy or the availability of cleavage sites may make the unambiguous determination of all disulfide bonds impractical using this method. In such cases, more involved chemical protocols20,21 or tandem mass spectrometry22-24 may be necessary to distinguish between competing models. The method described here is not meant to be exhaustive: it is a relatively simple survey method that can rapidly determine if a proposed sequence similarity model is consistent with the experimentally determined masses and propose potential models for peptides that cannot be fitted with a hypothetical model. The availability of the modeler makes it possible to plan experiments and determine the mass accuracy and cleavage chemistry necessary to confirm a hypothetical model.
By calculating the theoretical masses of the disulfide bonded peptides from a given model, the modeler can be run iteratively to determine what range of experimental parameters may be used to test it. The results of the modeling run will show all of the models that cannot be distinguished with a given set of experimental conditions, allowing the investigator to refine the experimental design to suit a given sequence/model combination. Table 2 lists the available input parameters that can be used for planning an experiment, as well as a short description of the relevant output model information. The available parameters are sufficient for examining the results of more complex chemistry, such as that proposed by Schnaible et al.21

Figure 2. Activity diagram for the disulfide modeling system, using Unified Modeling Language conventions. The method names shown are simply to explain their purpose and may correspond to different names in the actual implementation.

Assessing the validity of any model derived from this type of software system is an important consideration in determining its usefulness. The implementation has been tested with data obtained from MALDI spectra of tryptic digests of disulfide bonded proteins. Proteins with well-known disulfide patterns, such as bovine serum albumin, were digested, and the models obtained from the disulfide modeler were checked against the peptide masses that would be predicted from the known disulfide patterns. The models obtained agreed with the known models in all cases. The modeler was also used to check data obtained from ongoing studies of proteins with predicted disulfide bond patterns obtained by sequence similarity predictions. In all cases, the results from the modeler agreed with models that had been generated manually by exhaustive comparison of possible bonded peptides with the experimental data. The merit of the system was demonstrated by its ability to reduce the time required for manual analysis from weeks to seconds. One set of experimental data obtained for human α-integrin has been included in the publicly available interface to demonstrate the function of the system. Once a model has been obtained, the data and sequence that the model is based on can be tested iteratively to determine the limits of validity. A simple method for assessing the range of validity is to systematically vary model parameters, to see how stable the model is to the experimental conditions. Varying the mass accuracy is an obvious choice to determine the robustness of the model. Three interface parameters were added specifically for this purpose (see Table 2): the maximum numbers of free sulfhydryls, peptides, and cross links in a model. Particularly for large peptide masses that could represent several different bonding patterns, varying these parameters allows the investigator to determine the minimum and maximum complexity necessary to explain the observed data. The effect of contaminating protein sequences can be simulated by adding a potential contaminant protein to the sequence of the protein of interest, with a simple cleavage site linking the two, e.g., adding a lysine between the two sequences for a tryptic cleavage. Unanticipated cleavages, protein mixtures, residue modifications, and systematic experimental errors can always result in unreliable models, and these factors should be carefully controlled for in any experiment.
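
The mass comparison that underlies this kind of testing can be sketched in a few lines. The code below is an illustrative approximation rather than the modeler's actual logic: it assumes the common tryptic rule (cleave after K or R, but not before P), uses standard monoisotopic residue masses, treats each disulfide bond as the loss of two hydrogen atoms, and uses hypothetical function names throughout.

```python
from typing import List, Sequence

# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
HYDROGEN = 1.007825
PROTON = 1.007276


def tryptic_peptides(sequence: str, max_missed: int = 0) -> List[str]:
    """Cleave after K or R (not before P), allowing up to `max_missed` missed sites."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        last = i == len(sequence) - 1
        if last or (residue in "KR" and sequence[i + 1] != "P"):
            pieces.append(sequence[start:i + 1])
            start = i + 1
    peptides = []
    for i in range(len(pieces)):
        for j in range(i, min(i + max_missed, len(pieces) - 1) + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass (M) of a linear peptide with free termini."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def linked_mass(peptides: Sequence[str], bonds: int, protonated: bool = True) -> float:
    """Mass of peptides joined by `bonds` disulfides; each bond removes two H atoms."""
    mass = sum(peptide_mass(p) for p in peptides) - 2 * HYDROGEN * bonds
    return mass + PROTON if protonated else mass


def within_ppm(measured: float, computed: float, ppm: float) -> bool:
    """True if the measured mass matches the computed mass within `ppm`."""
    return abs(measured - computed) <= computed * ppm * 1e-6


if __name__ == "__main__":
    # Hypothetical sequence, used only to show the calculation.
    peptides = tryptic_peptides("ACKPGRCDK")
    print(peptides, linked_mass(peptides, bonds=1))
```

Widening the `ppm` argument or raising `max_missed` increases the number of candidate peptide combinations that match a measured mass, which is the trade-off the iterative testing described above is meant to expose.
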
In addition to the modeling component, a demonstration system that allows the use of the modeling component through a hypertext transfer protocol server, such as the Apache World Wide Web server, is available at the same site. The system is illustrated in Figure 2. The user communicates with the server through a standard common gateway interface form. The form was composed using the parameter names suggested in a draft proposal for standardized communication with proteomic software, ProMSML, composed by researchers from Bruker Daltonics, Protagen, Proteometrics, Genomic Solutions, and Amersham Biosciences. The server then starts a simple XML interface (written in Perl) that converts the form parameters into a BIOML file, which is submitted to the modeling component. When the modeling component is finished, the resulting BIOML output file is rendered into HTML using an Extensible Stylesheet Language Transformation25 (XSLT) style sheet by the XML interface layer. The server then returns the HTML to the client's browser for inspection. This system can be configured for many types of architecture: the complete separation of the input and output interface from the modeling component and the use of published standards make altering the user interface a matter of simple scripting. All of the necessary implementation files and copies of the draft standards for BIOML and ProMSML are publicly available.18 A working version of this implementation is also freely available at the same site.

Table 2. Input and Output Parameters for the Disulfide Modeler^a

inputs                                          comments
cleavage reagent                                The enzyme used to create peptides.
maximum missed cleavage sites                   This parameter is used to limit the scope of the calculation: the order of the calculation increases linearly with this parameter.
mass type                                       Enables the use of either M + H or M.
experimental masses: average and monoisotopic   A white space separated list of measured peptide masses. Either the average mass or the monoisotopic mass of a peptide can be entered.
mass tolerance: average and monoisotopic        The desired mass tolerance for the measurement, in Daltons, percentage, or parts-per-million. The order of the calculation increases linearly with this parameter.
complete modifications                          The chemical modifications to all free cysteines, including any alkylation reagent used in the experiment.
partial modifications                           The chemical modifications that may have occurred to some unknown fraction of free cysteines, including any alkylation reagent used in the experiment.
sequence                                        The protein sequence to be modeled, in single letter amino acid code.
maximum free sulfhydryls                        The maximum number of free sulfhydryls allowed in a result model.
maximum peptides                                The maximum number of peptide fragments combined to create a result model.
maximum cross links                             The maximum number of cross link attachments in a result model.

outputs                                         comments
measured mass                                   The measurement corresponding to each model is listed along with the model.
computed mass                                   The calculated masses for the disulfide bonded complex and the individual peptides are shown.
error                                           The mass difference between the measured and calculated masses.
residue start/end                               The positions of the start and end residues for each peptide in a disulfide bonded model.
missed cuts                                     The number of missed cleavage sites in the model.
free cysteines                                  The number of free cysteines in the model.
linked cysteines                                The number of disulfide bonded pairs in a model.
peptide sequences                               The sequences of the peptides in a model, with the cysteine residues highlighted.

^a Each of these parameters is explained in detail in the help documentation associated with the project.18

As mentioned above, this algorithm was tested and used in studies to determine the disulfide bonding patterns in human integrins; the results have been reported separately26 but are summarized here. The α chains examined in this study undergo posttranslational cleavages near the C terminus to generate disulfide linked heavy (H) and light (L) chains.27 The L chains contain transmembrane and cytoplasmic sequences, which serve to anchor the integrins to the cell surface. Differences in disulfide patterns compared to homology models were observed in this region. In the case of α5, an additional disulfide bond was observed between the H and L chains. It is unclear what the functional significance of these differences might be, although it has recently been demonstrated that both fibronectin and integrins display endogenous thiol isomerase activity.28,29 These observations raise the possibility of disulfide exchange during fibronectin fibril generation.

Acknowledgment. This work was supported by a grant from the Canadian Foundation for Innovation, the Government of Manitoba, the Canadian Institutes for Health Research, and a Yen Foundation Fellowship. The authors would also like to thank the Manitoba Centre for Proteomics for access to computers and mass spectrometers.

References

(1) Collet, J. F.; Bardwell, J. C. Mol. Microbiol. 2002, 44, 1-8.
(2) Fassio, A.; Sitia, R. Histochem. Cell Biol. 2002, 117, 151-157.
(3) Frand, A. R.; Cuozzo, J. W.; Kaiser, C. A. Trends Cell Biol. 2000, 10, 203-209.
(4) Wright, C. S. J. Mol. Biol. 1977, 111, 439-457.
(5) Sun, Y.; Bauer, M. D.; Keough, T. W.; Lacey, M. P. Methods Mol. Biol. 1996, 61, 655-664.
(6) Gray, W. R. Protein Sci. 1993, 2, 1732-1748.
(7) Bean, M. F.; Carr, S. A. Anal. Biochem. 1992, 201, 216-226.
(8) Smith, D. L.; Zhou, Z. Methods in Enzymology 1990, 193, 374-389.
(9) Anfinsen, C. B. Science 1973, 181, 223-230.
(10) Garey, M.; Johnson, D. Computers and Intractability - A Guide to the Theory of NP-Completeness; W. H. Freeman & Co.: New York, 1979.
(11) Fenyö, D. Comput. Appl. Biosci. 1997, 13, 617-618.
(12) http://www.proteome.ca/xml/DisulphideModeler/DisulphideModeler.html.
(13) Wilkins, J. A.; Krokhin, O. V.; Cheng, K.; Sousa, S. L.; Krokhina, T. G.; Ens, W.; Standing, K. G. Characterization of Disulphide Bond Pattern of Integrin Alpha Chains. Presented at the 50th Annual Meeting of the American Society for Mass Spectrometry, 2002.
(14) Loboda, A. V.; Krutchinsky, A. N.; Bromirski, M.; Ens, W.; Standing, K. G. Rapid Commun. Mass Spectrom. 2000, 14, 1047-1057.
(15) The compiler can be obtained free-of-charge from http://gcc.gnu.org/.
(16) Holzner, S. Inside XML; New Riders, 2000.
(17) Fenyö, D. Bioinformatics 1999, 15, 339-40.
(18) All source code, documentation, and standards are available at http://www.proteome.ca/opensource.html.
(19) The full text of the Artistic License is available at http://www.opensource.org/licenses/artistic-license.php.
(20) Wu, J.; Watson, J. T. Protein Science 1997, 6, 391-398.
(21) Schnaible, V.; Wefing, S.; Bucker, A.; Wolf-Kummeth, S.; Hoffmann, D. Anal. Chem. 2002, 74, 2386-2393.
(22) Jones, M. D.; Patterson, S. D.; Lu, H. S. Anal. Chem. 1998, 70, 136-143.
(23) Schnaible, V.; Wefing, S.; Resemann, A.; Suckau, D.; Bucker, A.; Wolf-Kummeth, S.; Hoffmann, D. Anal. Chem. 2002, 74, 4980-4988.
(24) Qin, J.; Chait, B. T. Anal. Chem. 1997, 69, 4002-4009.
(25) Holzner, S. Inside XSLT; New Riders, 2002.
(26) Krokhin, O. V.; Cheng, K.; Sousa, S. L.; Ens, W.; Standing, K. G. Biochemistry 2003 (submitted).
(27) Hemler, M. E. Annu. Rev. Immunol. 1990, 8, 365-400.
(28) Langenbach, K. J.; Sottile, J. J. Biol. Chem. 1999, 274, 7032-7038.
(29) O'Neill, S.; Robinson, A.; Deering, A.; Ryan, M.; Fitzgerald, D. J.; Moran, N. J. Biol. Chem. 2000, 275, 36984-36990.
