Anal. Chem. 2009, 81, 8354–8364
Large-Scale Identification and Quantification of Covalent Modifications in Therapeutic Proteins

Zhongqi Zhang*

Process and Product Development, Amgen, M/S 30E-1-B, One Amgen Center Drive, Thousand Oaks, California 91320

Covalent modifications on therapeutic proteins are traditionally monitored by chromatographic techniques, which quantify only a limited number of protein modifications at a time. In this report, computer algorithms for automated analyses of liquid chromatography/tandem mass spectrometry (LC/MS/MS) data for large-scale identification and quantification of known and unknown modifications are described. Peptide identification is achieved by comparing the experimental fragmentation spectrum to the predicted spectrum of each native or modified peptide. Peak areas of related peptide ions under their selected-ion chromatograms (SIC) are used for relative quantification of modified peptides. A matched window function is used to generate SIC for more reliable quantification. In an LC/MS/MS analysis of a tryptic digestion of an IgG2 monoclonal antibody, 1712 peptide ions were identified with a false-discovery rate of ∼0.4%, and 227 modifications were identified and quantified. The accuracy of the mass spectrometry-based quantification is evaluated by comparing the abundance of different glycoforms determined by mass spectrometry to that determined by a fluorescence-based chromatography method. This large-scale method may potentially replace many chromatographic methods for assessing the quality attributes of therapeutic proteins.

Assessing the quality attributes of a therapeutic protein calls for identification and quantification of all covalent modifications on each residue of the protein, including post-translational modifications during cell culture as well as chemical modifications during purification and storage. Traditionally this task is accomplished by a battery of chromatographic and electrophoretic techniques.
These techniques are used in virtually all aspects of the development cycle of a therapeutic protein, from selection of the protein construct to the release of the final drug product. During the development cycle, a great amount of effort is spent on the development, characterization, validation, and transfer of these chromatographic/electrophoretic methods. These techniques, usually separating different species based on their differences in charge, size, or hydrophobicity, are not ideal because elution or migration time of a protein variant is a poor identifier of the underlying modification. As a result, additional experiments are usually performed to isolate the peak of interest and characterize it with a structurally informative technique, typically proteolytic
digestion followed by liquid chromatography/tandem mass spectrometry (LC/MS/MS) analysis. In addition, most chromatographic peaks are not pure due to the often poor resolution of these techniques; therefore, it is often impossible to fully characterize a minor chromatographic peak. Because a single method is far from resolving all modified species, only a small number of modifications are monitored by each method. To get a more complete modification profile of a protein, a series of orthogonal chromatographic methods is usually required, and even this is still not sufficient to detect most covalent modifications in a therapeutic protein. Time- and cost-effective development of high-quality protein therapeutics requires a large-scale method that identifies and quantifies all major modifications in a protein in a single experiment. Mass spectrometry has been used widely in characterizing therapeutic proteins and their modified forms.1,2 LC/MS-based techniques have the inherent advantage that modified species may be identified and quantified in the same analysis. Additionally, the ultrahigh resolution of modern mass spectrometers makes it possible to resolve virtually all peptide species in an LC/MS run of a protein digest. These virtues give the MS-based technique the potential to be an ideal technique for assessing all covalent modifications in a therapeutic protein, i.e., identification and quantification of all covalent modifications on each residue in a single experiment. Recent advances in proteomics provide users with a suite of computer programs for automated peptide identification3-6 and quantification.7-9 Most of these computer programs, however, are designed to identify/quantify the proteins of interest in a complex protein mixture through protein sequence database searching; they are not ideal for the complete sequence coverage necessary for the full characterization of a therapeutic protein.
For example, if the MS/MS of a peptide ion does not contain enough sequence information, then most programs will ignore the spectrum, which one cannot afford to do when characterizing a therapeutic protein. Fortunately, the protein sequence search space is rather small
* To whom correspondence should be addressed. Phone: 805-447-7783. Fax: 805-376-2354. E-mail:
[email protected].
Analytical Chemistry, Vol. 81, No. 20, October 15, 2009
(1) Barnes, C. A. S.; Lim, A. Mass Spectrom. Rev. 2007, 26, 370–388.
(2) Zhang, Z.; Pan, H.; Chen, X. Mass Spectrom. Rev. 2009, 28, 147–176.
(3) Sadygov, R. G.; Cociorva, D.; Yates, J. R. Nat. Methods 2004, 1, 195–202.
(4) Forner, F.; Foster, L. J.; Toppo, S. Curr. Bioinf. 2007, 2, 63–93.
(5) Zhang, X.; Oh, C.; Riley, C. P.; Buck, C. Curr. Proteomics 2007, 4, 121–130.
(6) McHugh, L.; Arthur, J. W. PLoS Comput. Biol. 2008, 4, e12.
(7) Lau, K. W.; Jones, A. R.; Swainston, N.; Siepen, J. A.; Hubbard, S. J. Proteomics 2007, 7, 2787–2799.
(8) Mueller, L. N.; Brusniak, M. Y.; Mani, D. R.; Aebersold, R. J. Proteome Res. 2008, 7, 51–61.
(9) America, A. H. P.; Cordewener, J. H. G. Proteomics 2008, 8, 731–749.

10.1021/ac901193n CCC: $40.75 © 2009 American Chemical Society. Published on Web 09/18/2009
Figure 1. Flow diagram illustrating the relationship of algorithms used in MassAnalyzer.
for purified proteins. With the recently developed mathematical model to predict peptide collision-induced dissociation (CID) spectra,10,11 most ions can be confidently identified, even when their MS/MS spectra contain little sequence information. This article describes the computer algorithms and the program (MassAnalyzer) for automated identification and quantification of covalent modifications based on data acquired from LC/MS/MS analysis of the proteolytic digestion of a protein. The program is different from most proteomics-oriented programs in that it is designed for characterizing one or a few purified proteins, with a goal of identifying and quantifying all ions detected by the mass spectrometer. Results obtained when the program was used for analysis of monoclonal antibodies will be discussed.

ALGORITHMS

MassAnalyzer is developed as a platform for structural characterization of one or a few relatively purified proteins, including both top-down12 and bottom-up analysis. This section describes the algorithms developed for analyzing a data-dependent LC/MS/MS run of a proteolytic digest of one or a few relatively pure proteins, including the detection, identification, and quantification of peptide ions in the run. Figure 1 shows a flow diagram describing the relationship of these algorithms. MassAnalyzer is developed in Microsoft Visual C++ and runs on Windows computers. It currently reads Thermo Scientific XCalibur raw files directly through XCalibur OCX controls provided with the XCalibur development kit. To test MassAnalyzer for noncommercial research purposes, please contact the author directly.

Detection of Ions of Interest. Before performing ion detection, nearby full MS scans in the LC/MS/MS run are averaged by applying a moving Gaussian function to improve the signal-to-noise ratio (S/N) of each scan. The width of the Gaussian function is important for optimizing the S/N of each scan, while preventing overaveraging the data.
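The moving Gaussian average over neighboring scans can be sketched as follows. This is a minimal illustration, not MassAnalyzer's implementation: it assumes every scan has already been binned onto a common m/z grid, interprets the "width" of the Gaussian as its full width at half-maximum expressed in scans, and clamps at the ends of the run. The function names `gaussian_kernel` and `smooth_scans` are hypothetical.

```python
import math

def gaussian_kernel(fwhm_scans):
    """Normalized Gaussian weights; fwhm_scans is the assumed width in scans (> 0)."""
    sigma = fwhm_scans / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    half = max(1, math.ceil(3.0 * sigma))  # truncate the kernel at about 3 sigma
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth_scans(scans, fwhm_scans):
    """Average each scan with its neighbors in time using Gaussian weights.

    scans: list of equal-length intensity lists, one per full MS scan
    (assumption: all scans share the same m/z bins).
    """
    kernel = gaussian_kernel(fwhm_scans)
    half = len(kernel) // 2
    n = len(scans)
    smoothed = []
    for t in range(n):
        acc = [0.0] * len(scans[t])
        for k, w in enumerate(kernel):
            src = min(max(t + k - half, 0), n - 1)  # clamp at the run edges
            for j, value in enumerate(scans[src]):
                acc[j] += w * value
        smoothed.append(acc)
    return smoothed
```

A wider kernel averages more scans and so suppresses more noise, but it also broadens chromatographic peaks, which is the trade-off the text describes.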
The width of the Gaussian function (user-defined) is usually set at one-third of a typical chromatographic peak width, which is ideal for most of our data for maintaining both S/N and chromatographic resolution. For data with poor overall S/N, the width of the Gaussian function can be increased to the typical chromatographic peak width for maximized S/N.13 After averaging, MS ion detection is performed on each scan. For low-resolution data, ion detection is performed using an algorithm similar to that described previously.14 For high-resolution data where all isotopic peaks are resolved, ion detection is achieved by examining the isotopic pattern of each ion; a successful determination of the charge state of the ion (based on the isotopic pattern) indicates positive ion detection. After ion detection is completed for all the scans, the selected-ion chromatogram (SIC) is plotted for each detected ion, starting from the most abundant ion and working down the list to the least abundant ion above a user-defined threshold. A detectable chromatographic peak in the SIC indicates a positive detection of a sample ion. This ion-detection procedure ensures that all background ions are excluded from the final ion list. For data acquired on a high-resolution instrument in which all isotopic peaks are resolved in the full-scan spectrum, the SIC is generated using the matched window function, as will be described later, for reducing any potential interferences.

(10) Zhang, Z. Anal. Chem. 2004, 76, 3908–3922.
(11) Zhang, Z. Anal. Chem. 2005, 77, 6364–6373.
(12) Zhang, Z.; Shah, B. Anal. Chem. 2007, 79, 5723–5729.

Determination of the Charge and Mass of a Peptide Ion. After an ion of interest is positively detected by observing a peak in its SIC, mass spectra across this chromatographic peak are combined by using a matched filter,13 with background subtraction, to optimize the S/N of the combined spectrum. The charge state of the ion of interest is determined using an algorithm similar to ZScore,14 based on isotope-resolved high-resolution MS scan, zoom scan, or charge distribution in the full scan, depending on the resolving power of the instrument used in the analysis.
The average mass of the ion is determined by calculating the centroid of its isotope envelope. The monoisotopic mass is calculated by fitting the determined isotope pattern, either from high-resolution full scan or from zoom scan, to the isotope pattern predicted for the determined average mass as described previously.11 Before performing peptide identification on all detected ions, peptide identification of a few abundant ions is performed, based on their MS/MS data, as described below, using a relatively large mass tolerance for the search. With theoretical masses of confidently identified peptides, the determined masses of all detected ions are recalibrated with a two-parameter linear curve.

(13) Zhang, Z.; McElvain, J. S. Anal. Chem. 1999, 71, 39–45.
(14) Zhang, Z.; Marshall, A. G. J. Am. Soc. Mass Spectrom. 1998, 9, 225–233.

Identification of Native Proteolytic Peptides. In order to quantify all modifications inside a protein, it is crucial to identify the large majority of ions that are either native or modified peptides from the protein of interest. As will be discussed later, a key to achieving this goal is the algorithm to accurately predict the fragmentation pattern of any peptide ion. The small search space of one or a few proteins is also important, as compared to a proteomics experiment, in which the entire proteome of an organism is usually searched. To assign an ion to a specific proteolytic peptide, the determined and recalibrated mass of the ion is searched against the known protein sequence, and a list of potential peptide candidates is obtained. The mass search is performed so that at least one of
the two cleavage sites must meet the user-defined specificity criteria of the protease. No limitation is applied to the maximum number of missed cleavages inside a peptide. The experimental MS/MS is then compared to the theoretically predicted MS/MS10,11 of all peptide candidates. The match between the experimental spectrum and predicted spectrum is evaluated by the similarity score10 between the two spectra. In this work, to put more emphasis on the sequence ions for more reliable peptide identification, the average value of two scores is used instead of the original similarity score. The first score is the original similarity score, and the second score is the similarity score between the two spectra after removing ions in the molecular ion and neutral loss regions. This averaged similarity score between the experimental and predicted spectra is converted to a probability value p (0 < p < 1, the probability to be the correct identification) by analyzing the similarity score distribution for different peptide masses and charge states in a data set containing about 10 000 spectra of known peptides, acquired on the same instrument type (Thermo Scientific LTQ in this work), as compared to the similarity score distribution for peptides with random sequences. The probability values calculated this way serve the purpose at this stage because the spectra in the data set were acquired under very similar conditions. The confidence levels of peptide identification of the top two hits are then calculated from the probabilities of the two hits ($p_1$ and $p_2$) based on the following two equations.

$$\text{confidence(1st)} = \frac{p_1(1 - p_2)}{p_1(1 - p_2) + p_2(1 - p_1) + (1 - p_1)(1 - p_2)} \tag{1}$$

$$\text{confidence(2nd)} = \frac{p_2(1 - p_1)}{p_1(1 - p_2) + p_2(1 - p_1) + (1 - p_1)(1 - p_2)} \tag{2}$$
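As a quick numerical check of eqs 1 and 2, the short sketch below computes both confidence levels from the two probabilities. The function name `identification_confidence` is a hypothetical label chosen for illustration; only the arithmetic comes from the equations.

```python
def identification_confidence(p1, p2):
    """Confidence levels of the top two hits per eqs 1 and 2 (0 < p1, p2 < 1)."""
    only_first = p1 * (1.0 - p2)       # first hit correct, second hit wrong
    only_second = p2 * (1.0 - p1)      # second hit correct, first hit wrong
    neither = (1.0 - p1) * (1.0 - p2)  # both hits wrong
    denom = only_first + only_second + neither  # algebraically equal to 1 - p1 * p2
    return only_first / denom, only_second / denom

# Example inputs: a strong top hit with a weak runner-up, then two equally strong hits.
strong_vs_weak = identification_confidence(0.9, 0.1)
tie = identification_confidence(0.9, 0.9)
```

The denominator excludes the case in which both candidates are correct at once, so two equally strong candidates split the confidence between them, while a weak second candidate leaves the first confidence close to p1.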
Equations 1 and 2 are used to ensure the top hit will only have a confidence level close to $p_1$ when $p_2$ is small. Please note that the confidence level calculated here is based on the assumption that the peptide to be identified can take any random sequence. In real-world situations when only a few known proteins are present in the sample, the confidence levels calculated this way generally underestimate the confidence to some degree. If the charge state of an ion is not confidently determined, a range of charge states (typically from 1 to 5) is assumed for the ion before performing MS/MS spectral matching. A good match between the experimental MS/MS and the predicted MS/MS of a peptide indirectly determines the charge state of the ion. The accurate prediction of peptide MS/MS used in MassAnalyzer is crucial in the mission to identify all ions of interest. If the MS/MS contains a large amount of sequence information, identification of the peptide is straightforward, and most algorithms are able to identify the peptide. However, when the MS/MS contains very little sequence information, such as shown in Figure 2, many algorithms will often fail to identify the peptide. As observed in Figure 2, the theoretically predicted spectrum nevertheless matches the experimental spectrum very closely, so that the ion can be confidently identified although the spectrum contains little sequence information. In this case, the lack of sequence information in the tandem mass spectrum is used as the characteristic of the spectrum for the identification of the peptide. In other words, if the experimental spectrum did contain a large amount of sequence information, then it would be determined that the peptide was not responsible for the spectrum because the peptide should generate a spectrum with little sequence information. The small search space of the purified protein(s) also helps to confidently identify the ion.

Figure 2. Comparison of experimental (bottom) and predicted (top) CID spectrum of a doubly charged tryptic peptide (from CH3 domain of an antibody). The peptide is confidently assigned due to the close match between the predicted and experimental spectrum, although the experimental spectrum contains little sequence information.

Peptide identification based on comparison of experimental MS/MS and predicted MS/MS is potentially problematic for large peptides due to the difficulty of accurately predicting the fragmentation pattern of large peptides, for which gas-phase conformation plays an important role in the fragmentation process.15,16 Fortunately, large peptides often generate ions with several different charge states. The confidence of peptide identification of these large peptides is greatly improved by examining MS/MS of several of these charge states.

Identification of Peptides with Specified Modifications. MassAnalyzer allows the users to define and select types of modification to search for, such as deamidation on asparagines and glutamines, oxidation on methionines and tryptophans, glycation on lysines, amino acid substitutions, and N-glycans on asparagines. If an ion is not identified as a native proteolytic peptide, a search will be performed to check if its determined mass matches any of the selected modifications of any proteolytic peptide, again, with at least one of the cleavage sites meeting the specificity requirement of the protease. Similar to identification of native peptides, the modified peptide, as well as the modification site, is identified by comparing the experimental MS/MS to the predicted MS/MS of all possible modified peptides.
When predicting theoretical MS/MS, amino acid residues with labile modifications such as oxidized methionine, glycated lysine, N-glycosylated asparagine, and some other common modifications such as carboxymethylated cysteine and carbamidomethylated cysteine11 are considered by the model as different residues with their own distinct properties. More modified residues, including phosphorylated residues, will be added to the model in the future for more accurate spectral prediction of modified peptides. For most other modifications found in our laboratory, a change in a single residue usually does not change the overall fragmentation pattern significantly. Therefore, residues with other modifications are considered to have the average properties of all common amino acid residues. Again, the similarity scores of the top two modified peptides are converted to confidence levels. If the peptide contains more than one possible modification site, the modification site is considered identified if the similarity score of the top candidate is significantly higher than that of the second candidate (by more than 0.04). If the exact modification site cannot be determined, an “∼” sign is placed in front of the residue to indicate that it is an approximate location. Two possible scenarios may cause the failure to identify the exact location of the modification. One is the lack of enough sequence information in the MS/MS, and the other is a frequently observed phenomenon that peptides modified at different locations enter the mass spectrometer at the same time, therefore generating a mixed spectrum.

(15) Zhai, H.; Han, X.; Breuker, K.; McLafferty, F. W. Anal. Chem. 2005, 77, 5777–5784.
(16) Zhang, Z.; Bordas-Nagy, J. J. Am. Soc. Mass Spectrom. 2006, 17, 786–794.

Figure 3. Comparison of the experimental (bottom) and predicted (top) MS/MS of an oxidized peptide (from Lys-C digestion of an antibody). The comparison indicates that the methionine, instead of the tryptophan residue, is oxidized.

Figure 3 shows an example of identification of modification site using this approach. The experimental MS/MS shown in Figure 3 contains very limited sequence information. The determined mass indicates that it is likely an oxidized form of the peptide SRWQQGNVFSCSVMHEALHNHYTQK from the CH3 domain of a monoclonal antibody. However, the peptide contains one methionine residue and one tryptophan residue, and the limited b- and y-ions present in the spectrum do not reveal the location of the oxidation site. To determine which of these two residues is oxidized, the theoretical CID spectra were predicted for both cases and compared to the experimental spectrum. In the prediction model, oxidized methionine was treated as a specific residue with its neutral loss of CH3SOH (-64 Da) built into the model.
The predicted spectrum of the peptide with the methionine oxidized matches the experimental spectrum closely, indicating that the ion is from the peptide with the methionine oxidized. The dominant fragment ion is due to the neutral loss of CH3SOH from the oxidized methionine side chain. The identification is supported by the electron-transfer dissociation (ETD) spectrum of the same peptide in its quadruply charged form (Supporting Information Figure S-1).

Identification of glycated and glycosylated peptides needs some special treatment because the fragmentation of glycans usually dominates the CID spectrum of a glycopeptide. Similar to identification of other peptides, a glycopeptide is identified by comparing the experimental MS/MS to its theoretically predicted spectrum. Figure 4 shows the CID spectrum of a triply charged tryptic peptide from a monoclonal antibody with the most abundant glycoform (A2G0F), as compared to its predicted CID spectrum. The details of the mathematical model to predict CID spectrum of glycopeptides will be discussed in a separate communication. At this stage, glycoforms anticipated by MassAnalyzer include N-glycans with 1–4 antennae, each antenna terminating with sialic acid, galactose, or N-acetylglucosamine, with and without core-fucose, plus hybrid type and high-mannose type, a total of 147 possible N-glycans. These N-glycans cover most, if not all, possible glycans observed in IgG monoclonal antibodies. Only CID spectra are used for the identification of the glycopeptides at this stage. The lack of fragment ions in the peptide region increases the risk of misidentification of these glycopeptides. Work is in progress to utilize ETD17 to assess the peptide sequence information for more reliable identification of glycopeptides.

Figure 4. Experimental (bottom) and predicted (top) CID spectrum of a tryptic peptide (3+) from a monoclonal antibody containing glycoform A2G0F on the asparagine residue. All labeled fragment ions are doubly charged except for a few triply charged fragments as indicated.

Identification of Unspecified Modifications. Detection and identification of unknown modifications, i.e., modifications not specified by the user, are a crucial part of full characterization of a therapeutic protein. Due to the vast number of possible unknown modifications, it is impractical to perform an exhaustive search for all possible mass changes. Fortunately, modifications of therapeutic proteins usually exist in small amounts, and therefore, it is safe to assume that the unmodified native peptide must exist, and be identified, in the same run. For dominant modifications such as alkylation of cysteine residues during sample preparation, the user can define the sequence in a way so that these residues are permanently modified. To identify a peptide with unspecified modification, MassAnalyzer first compares the determined mass of the unknown peptide against the mass of all identified peptides.
If the mass difference is within a user-specified range, the identified peptide is then considered as a potential candidate as the unmodified form of the unknown peptide. To reduce the computational burden, before attempting to determine the location of the modification site, MassAnalyzer first determines whether there is a correlation between the experimental CID spectrum of the unknown peptide and that of the identified peptide. To do that, MassAnalyzer predicts two CID spectra of the peptide with its N-terminal and C-terminal residues modified with the corresponding mass changes, respectively, and then adds the two spectra together as the predicted spectrum. By doing this, each fragment ion (b or y, for example) will have both unmodified and modified forms present in the predicted spectrum. If there is significant similarity between this predicted spectrum and the experimental spectrum, this unknown peptide is determined to be a modified form of the identified peptide. For singly charged precursor ions, since all fragment ions are singly charged, MassAnalyzer uses the experimental spectrum of the unmodified peptide to predict the spectrum of the modified peptide. Specifically, MassAnalyzer generates a new spectrum from the spectrum of the unmodified peptide by shifting the m/z values of all ions by the corresponding mass change, followed by adding this new spectrum to the original spectrum.

(17) Syka, J. E. P.; Coon, J. J.; Schroeder, M. J.; Shabanowitz, J.; Hunt, D. F. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 9528–9533.

Once the unknown peptide is determined to be a modified form of another peptide, each residue on the peptide is modified by the corresponding mass change, and then the CID spectrum of each of these modified forms is predicted and compared to the experimental spectrum. The modification site with the highest similarity score between its predicted spectrum and the experimental spectrum indicates the location of the modification. The exact location of the modification, however, may not be determined confidently all the time because there is frequently not enough sequence information in the spectrum to have it unambiguously determined. Therefore, MassAnalyzer puts an “∼” sign in front of the modification site to indicate the approximate location of the unspecified modification, unless there is a very large difference (>0.1) in similarity scores between the most probable site and the second most probable site.
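Two steps of this procedure lend themselves to a short sketch: building the shift-and-add spectrum for a singly charged precursor, and deciding whether a site is reported as exact or approximate. The code below is an illustration under stated assumptions, not MassAnalyzer's code: spectra are plain lists of (m/z, intensity) pairs, the similarity scores are supplied by the caller, the names `shift_and_add_spectrum` and `label_modification_site` are hypothetical, and an ASCII "~" stands in for the "∼" sign.

```python
def shift_and_add_spectrum(peaks, mass_change):
    """Combine a singly charged fragment spectrum with a mass-shifted copy of itself.

    peaks: list of (mz, intensity) pairs from the unmodified peptide.
    Every fragment then appears in both unmodified and modified form.
    """
    shifted = [(mz + mass_change, intensity) for mz, intensity in peaks]
    return sorted(peaks + shifted)

def label_modification_site(site_scores, min_gap=0.1):
    """Return the best-scoring site, prefixed with '~' when the site is ambiguous.

    site_scores: dict mapping a site label to its similarity score.
    min_gap: required score gap between the top two sites (0.1 for
    unspecified modifications in the text; 0.04 for specified ones).
    """
    ranked = sorted(site_scores.items(), key=lambda item: item[1], reverse=True)
    best_site, best_score = ranked[0]
    if len(ranked) == 1 or best_score - ranked[1][1] > min_gap:
        return best_site
    return "~" + best_site

# Hypothetical example values, for illustration only.
combined = shift_and_add_spectrum([(100.0, 1.0), (200.0, 2.0)], 16.0)
confident = label_modification_site({"M14": 0.82, "W3": 0.60})
ambiguous = label_modification_site({"N7": 0.70, "Q8": 0.66})
```

The shifted copy only makes sense for singly charged fragments, where a mass change equals the m/z change; multiply charged fragments would need the shift divided by their charge, which is why the text restricts this shortcut to singly charged precursors.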
The exact locations of these unspecified modifications can be determined manually offline, if necessary, after applying knowledge regarding the chemistry of the modification, with the help of routines provided with MassAnalyzer, including automated de novo sequencing.18 In many cases, the modification levels in different samples of the same molecule need to be compared. When two or more LC/MS/MS runs are compared, MassAnalyzer automatically aligns the retention time of all detected ions, allowing peptide identification using available MS/MS data in other runs.

Quantification of Modifications. Modified peptides are quantified using the peak areas under the SIC of the modified peptides and other related peptides. Therefore, it is important to calculate the peak area of each ion reliably and accurately, especially in the presence of interferences. This section describes the algorithms used in MassAnalyzer for calculating peak areas of peptide ions, as well as calculating the relative abundance of each modification using these peak areas. As one will see, these algorithms take full advantage of the high-resolution data of modern mass spectrometers, which has been explored in only a limited number of studies.19

Generating the SIC Using a Matched Window Function. Traditionally, the SIC of an ion is constructed by adding intensities of all ions within a user-specified m/z window. Selection of this m/z window is important for the quality of the constructed SIC, and hence the determination of the peak area. If the window is set too wide, too many interfering ions will be included in the SIC, introducing unnecessary noise and interferences. Too narrow a window will reduce the S/N of the SIC because many isotopic peaks, which contain useful intensity information, are discarded. If an ion is located near the edge of the window, a slight shift in determined m/z value may shift the ion in and out of the window, causing large variations in the SIC.

Figure 5. Different window functions used for generating SIC for a triply charged peptide CCVECPPCPAPPVAGPSVFLFPPKPK.

The above traditional approach of generating SIC can be viewed as a box-shaped window function applied to the spectrum (Figure 5B). An ion has a weight of zero when it is located outside the box and a weight of one when it is inside the box. The m/z range of this box-shaped window function must be determined carefully based on the isotopic distribution of the peptide (Figure 5A). Even with a carefully determined m/z range, the problems described above for the box-shaped window function remain. The window function, however, does not need to be of rectangular shape. The following describes the derivation of an ideal window function for a maximum S/N in the SIC. Assume the window function is $w_i$, in which i stands for the ith isotope (i ≥ 0). Ion intensity at the ith isotope is denoted as $I_i$. Therefore, the total signal $S_{\text{total}}$, after applying the window function, is calculated as
$$S_{\text{total}} = \sum_i w_i I_i \tag{3}$$

Assuming that each m/z value has an equal probability of having an interfering ion, we can therefore assume that the noise level (standard deviation of the noise signals) at the location of each isotope has a constant value of N. When random noises are added together, the variances (square of standard deviation) are additive. Therefore, the total noise $N_{\text{total}}$ after applying the window function $w_i$ is calculated by

$$N_{\text{total}} = \left[\sum_i (w_i N)^2\right]^{1/2} = N\left(\sum_i w_i^2\right)^{1/2} \tag{4}$$

Therefore, the S/N is calculated by

$$S/N = \frac{S_{\text{total}}}{N_{\text{total}}} = \frac{1}{N}\,\frac{\sum_i w_i I_i}{\left(\sum_i w_i^2\right)^{1/2}} \tag{5}$$

(18) Zhang, Z. Anal. Chem. 2004, 76, 6374–6383.
(19) Cox, J.; Mann, M. Nat. Biotechnol. 2008, 26, 1367–1372.
to ensure that the sum of weighted ion intensities equals the sum of unweighted ion intensities. That is,
∑I
i
∑wI
)
It has been shown previously13 that the S/N expression described in eq 5 has its maximum value of (1/N)(∑iIi2)1/2 when the window function wi is proportional to the signal intensity Ii. Therefore, the S/N is maximized when the window function has the same shape as the real isotope distribution of the peptide (Figure 5C). This type of window function is called a matched window function in this article. The matched window function is similar to matched filtering used for signal processing for optimizing S/N. A matched filter function is also used in MassAnalyzer to optimize S/N of mass spectra for mass determination.13 By applying a matched window function, a lower weight is applied to ions with lower theoretical abundance, thus reducing interferences from ions farther away from the center of the isotope envelop. Because this new window function does not have a straight edge, problems associated with m/z shift are also eliminated. For high-resolution centroid data as described in this work, each isotope peak is well-separated and their masses accurately determined. Therefore, the window function shown in Figure 5D is used. That is, only isotopes with their m/z within a narrow window of the expected m/z are added to the intensity. This way any interference resolved from the isotopic peaks is eliminated. In MassAnalyzer, the width of each isotope window in the window function is set at one-fourth of the peak width (at half height) as calculated from the instrument resolution at that m/z value, plus a constant value of 0.02 u/charge to account for the inaccuracy in calculating the m/z of each isotopic peak (e.g., the first heavy isotope is a mixture of 13C and 2H) as well as mass shift caused by deamidation (0.984 u instead of 1.000 u). 
The location of each isotope peak in the filter is calculated from the determined m/z value instead of the theoretical m/z value to reduce systematic errors caused by instrument miscalibration, and a value of 1.000 u is used as the mass difference between nearby isotopic peaks. In the work described here, full-scan data were acquired on an LTQ-Orbitrap in highresolution centroid mode. If the data are acquired in profile mode, a window function that matches the real peak shape of the ion (Figure 5E) may be preferred. Theoretically, the window function shown in Figure 5E is also ideal for high-resolution centroid data. However, the window function shown in Figure 5D is used for faster computation speed. A proper window function, combined with narrow window for each isotope peak, minimizes interferences and optimizes the reliability and accuracy in peak area calculation. For quantification purposes, we need to make sure that the peak areas calculated by the matched window function reflect the true peak areas calculated by adding intensities of all isotope peaks together. Below shows the derivation of this matched window function. Assume the theoretical isotope distribution of the peptide ion is Ai (the abundance of the ith isotope), the matched window function is wi, and the experimentally determined isotopic distribution for the ion is Ii (without interferences). We need
$$\sum_i I_i = \sum_i w_i I_i \qquad (6)$$
Since $w_i$ should have the same shape as $A_i$, we have
$$w_i = kA_i \qquad (7)$$
where $k$ is a factor relating the theoretical isotope abundance and the window function. Without interferences, the experimentally determined $I_i$ should also have the same shape as $A_i$, thus
$$I_i = k'A_i \qquad (8)$$
where $k'$ is a factor relating the theoretical isotope abundance and the ion intensity. Substituting eqs 7 and 8 into eq 6, we have
$$k' \sum_i A_i = kk' \sum_i A_i^2 \qquad (9)$$
Therefore,
$$k = \frac{\sum_i A_i}{\sum_i A_i^2} \qquad (10)$$
We have the final matched window function by combining eqs 7 and 10.
$$w_i = \frac{\sum_i A_i}{\sum_i A_i^2} A_i \qquad (11)$$
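The normalization derived above can be checked numerically. The following is a minimal Python sketch, not the MassAnalyzer implementation: the function names and the five-isotope pattern are hypothetical values chosen only for illustration. It builds the weights of eq 11 and confirms that, for an interference-free ion whose isotope pattern has the theoretical shape, the weighted sum equals the plain sum of isotope intensities (eq 6).

```python
from typing import List, Sequence


def matched_window(theoretical: Sequence[float]) -> List[float]:
    """Weights w_i = (sum_i A_i / sum_i A_i^2) * A_i, as in eq 11."""
    total = sum(theoretical)
    total_sq = sum(a * a for a in theoretical)
    if total_sq == 0:
        raise ValueError("theoretical isotope distribution is all zeros")
    k = total / total_sq            # eq 10
    return [k * a for a in theoretical]


def windowed_intensity(weights: Sequence[float],
                       observed: Sequence[float]) -> float:
    """Weighted sum of observed isotope intensities (left side of eq 6)."""
    return sum(w * i for w, i in zip(weights, observed))


# Hypothetical theoretical isotope distribution A_i (illustrative values only).
A = [0.45, 0.33, 0.15, 0.05, 0.02]
w = matched_window(A)

# Interference-free ion: I_i = k' * A_i with k' = 1000.
clean = [1000.0 * a for a in A]

# Same ion with 100 counts of interference on the least abundant isotope.
contaminated = list(clean)
contaminated[-1] += 100.0

clean_area = windowed_intensity(w, clean)
extra_from_interference = windowed_intensity(w, contaminated) - sum(clean)
```

In this example the interference on the weakest isotope contributes only a small fraction of its 100 counts to the windowed intensity, which is the down-weighting effect the text attributes to the matched window.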
Since not all ions are identifiable at all times, we prefer to calculate the isotope distribution without knowledge of the peptide sequence for the ion of interest. Fortunately, peptides with similar masses have very similar isotopic distributions. Therefore, the isotopic distribution of a peptide is rapidly estimated from its mass, based on the empirical equations described previously (ref 11). This approach also significantly reduces the computation burden.

Calculating the Peak Area in the Selected-Ion Chromatogram. Calculation of the peak area is straightforward once the SIC is generated. The only required user-defined parameter is the estimated width of the most intense chromatographic peak, and the default value, which MassAnalyzer determines automatically from each data file, works virtually all the time. MassAnalyzer starts peak detection from the most intense ion in the run and works down according to the intensity of each detected ion. For the most intense ion, time points within one-sixth of the user-defined peak width in an SIC are bunched together (by taking the average value) to generate a smoothed SIC for the purpose of chromatographic peak detection. A local minimum in the bunched SIC indicates a valley between two resolved chromatographic peaks. After the start and end points of a peak are determined from the bunched SIC, the peak area is then calculated from the original SIC. Peak areas of less intense ions are calculated the same way, except that their peak widths are estimated more accurately from the determined widths of their more intense neighboring (in retention time) peaks, instead of the user-defined peak width.

Quantification of Modifications. For each identified modified peptide, its relative abundance can be calculated by dividing the peak areas of the modified peptide by the total peak areas of all related peptides, including the native peptide and all modified forms of the native peptide. Problems arise when attempting to quantify modified peptides with extremely low abundance. For peptides with extremely low abundance, it is often the case that only one charge state is detected. However, for the unmodified counterpart, more than one charge state is usually detected due to its much higher abundance. Therefore, adding peak areas of all the detected charge states of the unmodified peptide into the equation will underestimate the low-abundance peptide. To solve this problem, for each modified peptide, MassAnalyzer first determines the number of charge states n that have an abundance of at least one-third of the abundance of the most abundant charge state. When calculating the abundance of this modified peptide, only the top n most abundant charge states of each related peptide are used in the calculation. Because different denominators are used for calculating different modifications on the same peptide, the percentages of all modifications and the unmodified peptide may not add up to exactly 100%. When a peptide is found to have a very small total peak area (total peak area includes peak areas of the unmodified and modified peptides), quantification of any modification on this peptide will not be performed because the same modification will be quantified from a more abundant peptide.
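The charge-state rule described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not MassAnalyzer code: the function names, the data layout (a dictionary of peak areas keyed by charge state for each peptide form), and the peak areas are all hypothetical.

```python
from typing import Dict

ChargeAreas = Dict[int, float]   # charge state -> SIC peak area


def count_major_charge_states(areas: ChargeAreas) -> int:
    """Number of charge states with area >= one-third of the largest one."""
    top = max(areas.values())
    return sum(1 for a in areas.values() if a >= top / 3.0)


def top_n_area(areas: ChargeAreas, n: int) -> float:
    """Summed area of the n most abundant charge states."""
    return sum(sorted(areas.values(), reverse=True)[:n])


def relative_abundance(target: str, related: Dict[str, ChargeAreas]) -> float:
    """Relative abundance of one modified form among all related peptides.

    `related` holds the native peptide and every modified form,
    including `target`. n is determined from the target peptide only.
    """
    n = count_major_charge_states(related[target])
    numerator = top_n_area(related[target], n)
    denominator = sum(top_n_area(areas, n) for areas in related.values())
    return numerator / denominator


# Hypothetical peak areas: the oxidized form is seen in one charge state only.
peptides = {
    "native": {2: 9000.0, 3: 6000.0, 4: 1000.0},
    "oxidized": {2: 90.0},
}
with_rule = relative_abundance("oxidized", peptides)   # 90 / (9000 + 90)
naive = 90.0 / (9000.0 + 6000.0 + 1000.0 + 90.0)       # all charge states
```

In this invented example the naive ratio is roughly half the value obtained with the rule, which shows the underestimation that the restriction to the top n charge states is meant to avoid.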
If a more abundant peptide with this modification is not observed, then the modification is most likely a misidentification or digestion artifact and therefore should not be included in the final modification list. In order to be quantified, the total peak area of the peptide must be at least 1% of the total peak area of the most abundant peptide (from the same protein). When a modification is represented by more than one peptide, e.g., due to missed cleavages or nonspecific digestion, peak areas of modified peptides are added together as the numerator in the abundance calculation, and peak areas of all related peptides are added together as the total peak area for the denominator. To avoid adding unnecessary interferences in the calculation, only peptides with a total peak area above one-sixth of that of the most abundant peptide (containing the modification site) are used in the calculation. When the modified residue matches the specificity of the protease used for the digestion, the protease generally does not cleave at the site near the modified residue. Take glycation on lysine, for example; trypsin does not cleave at the C-terminus of a glycated lysine. However, trypsin does cleave at the C-terminus of this lysine in the unmodified peptide, generating two shorter unmodified peptides. In this case, the peak areas of the longer one of the two unmodified peptides, together with the peak areas of the full-length peptide (if present), are added into the total peak area when calculating the abundance of the modified residue. Accuracy in quantification of these types of modification is compromised in this case. A different protease (such as endoproteinase Glu-C or Asp-N) may be preferred for quantification of lysine glycations.

EXPERIMENTAL SECTION

Two immunoglobulin G2 (IgG2) molecules produced at Amgen (Thousand Oaks, CA) were digested with trypsin and Lys-C, respectively. Tryptic digestion of IgG2 molecule 1 was performed at 37 °C for 2 h, after reduction and alkylation with iodoacetamide, using a procedure similar to the method described by Ren et al. (ref 20). For Lys-C digestion of IgG2 molecule 2, the antibody was first incubated at 37 °C for 2 h in 6 M guanidinium hydrochloride to denature the antibody molecule, followed by 10-fold dilution with a phosphate buffer (0.1 M, pH 7.1) containing 4 M urea (with 20 mM hydroxylamine to scavenge the cyanate molecules that cause carbamylation), to a concentration of ∼1 mg/mL. Lys-C (Wako Chemicals USA, Virginia) was added to achieve an enzyme-to-substrate ratio of 1:20, followed by incubation at 37 °C for 18 h. After digestion, an appropriate amount of DTT was added to achieve a final concentration of 10 mM, and the sample was incubated at 37 °C for 1 h for reduction of disulfide bonds.

The mass spectrometer used in this work was a Thermo Scientific LTQ-Orbitrap high-resolution mass spectrometer directly connected to an Agilent 1200 SL system. Although MassAnalyzer is able to analyze data acquired from all Thermo Scientific ion-trap instruments, the LTQ-Orbitrap is greatly preferred because more ions can be confidently identified and quantified due to its ultrahigh resolution and mass accuracy in the MS scan. Four LC/MS/MS analyses were performed for the tryptic digest of IgG2 molecule 1. An Agilent 1.8 µm particle rapid-resolution reversed-phase column (SB C18, 2.1 mm × 150 mm) was used for the first analysis, and a Waters 1.7 µm particle UPLC reversed-phase column (BEH300 C18, 2.1 mm × 150 mm) was used for the other three analyses.
Peptides were eluted with a gradient of 1-20% acetonitrile in 38 min, followed by 20-40% acetonitrile in 60 min, with 0.02% trifluoroacetic acid (TFA) in each mobile phase, at a flow rate of 0.2 mL/min. For the Lys-C digest of IgG2 molecule 2, a Phenomenex Jupiter C-5 reversed-phase column (5 µm particle, 300 Å pore, 2.0 mm × 250 mm) was used. Peptides were eluted with a gradient of 0.5-22% acetonitrile in 80 min, followed by 22-50% acetonitrile in 80 min, with 0.2% formic acid in each mobile phase, at a flow rate of 0.2 mL/min.

The mass spectrometer was set up to acquire one high-resolution full scan at 60 000 resolution (at m/z 400), followed by three concurrent data-dependent MS/MS scans of the top three most abundant ions, with dynamic exclusion, using CID (normalized collision energy 35%). The dynamic exclusion duration was set at 10 and 36 s for the tryptic digest and the Lys-C digest, respectively. These exclusion durations were slightly shorter than the width of a typical chromatographic peak to ensure at least one high-quality MS/MS scan for each major ion. Singly charged ions were excluded from MS/MS for the Lys-C digestion of the antibody. About 30 µg of each digest was injected into the LC/MS/MS system for analysis.

(20) Ren, D.; Pipes, G. D.; Liu, D.; Shih, L.-Y.; Nichols, A. C.; Treuheit, M. J.; Brems, D. N.; Bondarenko, P. V. Anal. Biochem. 2009, 392, 12–21.
(21) Chen, X.; Flynn, G. C. Anal. Biochem. 2007, 370, 147–161.
(22) Chen, X.; Flynn, G. C. J. Am. Soc. Mass Spectrom., in press.
RESULTS

The LC/MS/MS data (collected on a Thermo Scientific LTQ-Orbitrap) of proteolytic digests of the monoclonal antibodies were analyzed with MassAnalyzer for large-scale identification and relative quantification of covalent modifications. Some commonly observed modifications were specified for MassAnalyzer to search for, including addition of lysine or arginine to either terminus of a peptide (a common artifact during trypsin digestion), NH3 loss and deamidation from asparagine or glutamine, oxidation of methionine or tryptophan, double oxidation of methionine, tryptophan, or cysteine, triple oxidation of tryptophan and cysteine, glycation of lysine, and H2O loss from serine, threonine, aspartic acid, and glutamic acid. For the search of unspecified modifications, a modification mass range of -129 to 163 u was used. In an analysis of the tryptic digest of an IgG2 antibody, with a precursor mass tolerance of ±8 ppm and a sequence search space including the heavy chain and light chain of the IgG2 molecule as well as bovine trypsin, a total of 1712 ions, with masses ranging from 450 to 7200 Da, were identified at a confidence level above 80%, among which 1154 ions were from the heavy chain, 488 ions from the light chain, and 70 ions from trypsin. The identified peptides covered the entire sequences of both the heavy chain and the light chain. Identified peptides in the constant domains of the IgG2 molecule are shown in the Supporting Information Table S-1. Due to the small sequence search space, accurate mass measurement on a high-resolution mass spectrometer, and the mathematical model for accurate prediction of peptide fragmentation spectra, the peptide identification false-discovery rate is rather small.
For example, in a separate search of the same data with the reversed sequences of the IgG2 molecule and trypsin appended to the original target sequences, 3 out of 1703 identified peptides (at a confidence level above 80%) were from the reversed sequences, representing a false-discovery rate of ∼0.4%. Please note that the calculation of confidence levels is based on the similarity score distribution of correct peptides as compared to the similarity score distribution of random peptides, corresponding to an extremely large sequence search space. A confidence of 80% when no limitation is applied to the peptide sequence is actually a quite reliable identification when the search space contains only three proteins. When identified peptides, instead of ions, were counted, i.e., when ions of the same peptide with different charge states were collapsed into a single peptide identification, the above search results represent a total of 793 identified peptides, among which 518 peptides are from the heavy chain, 232 peptides from the light chain, 41 peptides from trypsin, and 2 peptides from the reversed sequences. The same raw MS data were submitted to a search with Mascot (Matrix Science) using similar criteria (except for the number of possible modifications). Specifically, the sequence search space included the three proteins and their reversed sequences. Variable modifications included carbamidomethylation of cysteine (better results were obtained than when setting it as a fixed modification), sodium adduct, deamidation of asparagine and glutamine, oxidation of methionine, tryptophan, and histidine, double oxidation of methionine, as well as pyroglutamine formation from N-terminal glutamine or glutamic acid. Semitrypsin was selected as the enzyme. Peptide tolerance was set as ±8 ppm.
The maximum number of missed cleavages was set as 9. The peptide mass range was from 450 to 7200 Da. For comparison purposes, the ions score threshold was set at 26.5 so that exactly two peptides were identified from the reversed sequences. For an ions score threshold of 26.5, a total of 347 peptides were identified by Mascot, among which 214 peptides were from the heavy chain, 107 peptides from the light chain, 24 peptides from trypsin, and 2 peptides from the reversed sequences. For the same number of false positives, MassAnalyzer identified more than double the number of peptides identified by Mascot. One of the primary reasons for the difference is that MassAnalyzer searches a much larger number of possible modifications, including different glycoforms and an unrestricted search of modifications not specified by the user. Another reason is the advantage of the fragmentation model used in MassAnalyzer. For example, although the peptide shown in Figure 2 was identified by Mascot, the peptide with oxidized methionine shown in Figure 3 was not identified. MassAnalyzer is advantageous over Mascot for the full characterization of a few proteins because Mascot is not designed for that purpose. Among the peptides identified by MassAnalyzer, modified peptides represent a total of 227 modifications in the IgG2 molecule. As examples, Table 1 shows 86 modifications detected in the heavy-chain CH2/CH3 domains of the IgG2 molecule, and their relative abundances calculated based on their MS peak areas. In Table 1, specified modifications are shown by their names and unspecified modifications are indicated by their mass changes. For example, “∼C318-57.0212” stands for a loss of 57.0212 u near Cys-318. A loss of about 57.02 u was observed for many carbamidomethylated cysteine residues, indicating an incomplete alkylation (theoretical -57.0215 u) on these residues.
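The theoretical value quoted above can be reproduced from elemental composition. The following minimal Python sketch assumes that the missing group is the carbamidomethyl moiety, C2H3NO, added by iodoacetamide, and uses standard monoisotopic atomic masses; the helper name is hypothetical and the code is only an arithmetic check, not part of the search software.

```python
from typing import Dict

# Standard monoisotopic atomic masses in u.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.9949146196,
}


def monoisotopic_mass(formula: Dict[str, int]) -> float:
    """Monoisotopic mass of an elemental composition such as {'C': 2, 'H': 3}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


# Assumption: the unalkylated cysteine lacks one carbamidomethyl group, C2H3NO.
carbamidomethyl = {"C": 2, "H": 3, "N": 1, "O": 1}
theoretical_loss = monoisotopic_mass(carbamidomethyl)

observed_loss = 57.0212                       # mass change reported near Cys-318
difference = observed_loss - theoretical_loss  # in u
```

The computed mass rounds to the 57.0215 u given in the text, and the observed mass change near Cys-318 differs from it by a fraction of a millimass unit.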
Although a loss of 57.0215 u was not specified in the search, such losses were detected with high confidence by MassAnalyzer. Many other unknown modifications were detected and quantified, many of which are likely artifacts introduced during sample preparation and analysis. See the footnotes of Table 1 for explanations of some modifications, which were obtained by manually inspecting the MS/MS data for some of the ions. The 165 modifications in the entire constant domains of the molecule are shown in the Supporting Information Table S-2. For a test of the accuracy of MS-based quantification, the relative abundances of different glycoforms, as determined by MS-based quantification of trypsin or Lys-C digested antibodies, are compared to those determined by a fluorescence-based reversed-phase method for 2AB-labeled glycans released from the antibodies (refs 21 and 22; Table 2). The match between the two methods is surprisingly good for high-abundance species. For low-abundance species (