
Anal. Chem. 2010, 82, 10194–10202

Prediction of Collision-Induced Dissociation Spectra of Common N-Glycopeptides for Glycoform Identification

Zhongqi Zhang* and Bhavana Shah

Process and Product Development, Amgen Inc., One Amgen Center Drive, Thousand Oaks, California 91320, United States

* Corresponding author. E-mail: [email protected]. Fax: (805) 376-2354.

Confident identification of the glycan moieties in glycopeptides by collision-induced dissociation (CID) requires accurate prediction of the CID spectrum of the glycopeptides. In this Article, the kinetic model for the prediction of peptide CID spectra is extended to predict the CID spectra of N-glycopeptides. The model was trained with 1831 ion-trap CID spectra of N-glycopeptides and is able to predict ion-trap CID spectra with excellent accuracy in ion intensities for N-glycopeptides up to 8000 u in mass. A total of 524 common glycoforms including complex N-glycans with 2-4 antennas, plus high-mannose type and hybrid type, can be predicted.

Recombinant therapeutic proteins, especially monoclonal antibodies (mAb), have emerged as the most exciting new drug class in the pharmaceutical industry. Many of these therapeutic proteins are glycoproteins. For example, human antibodies, or immunoglobulins (IgG), are glycoproteins with N-linked glycans attached to Asn-297 on the Fc region of both heavy chains. Additionally, 15-20% of human IgG molecules contain N-linked glycans in the Fab region.1,2 The Fc glycans are mainly biantennary fucosylated complex type, differing in the level of terminal galactose. Other minor forms of Fc glycans include bisecting complex types, sialylated complex types,3 as well as other less mature structures such as high-mannose or hybrid types.4-6 In contrast, the Fab glycans are more fully galactosylated and contain higher levels of sialic acid1 and, for antibodies derived from murine cell lines, N-glycolylneuraminic acid (NGNA) as well.7,8 Antibodies derived from Chinese hamster ovary (CHO) cells, however, contain a very small amount of NGNA.9

Proper glycosylation plays important roles in the safety and efficacy of antibody therapeutics. Having glycosylation patterns similar to those found in human serum-derived antibodies could be important for activity and potential immunogenic effects. Mammalian tissue culture cells such as Chinese hamster ovary (CHO) cells or murine cells are typically used in the biopharmaceutical industry to produce antibodies with appropriately processed carbohydrates. Cumulative evidence indicates that the presence of glycan on an antibody correlates with certain effector functions of the molecule. Furthermore, the efficacy might also be influenced by specific glycoforms. For example, the presence of terminal galactose (Gal)10 and bisecting N-acetylglucosamine (GlcNAc),11,12 and the absence of core fucose,13,14 have all been shown to improve antibody effector functions. Some evidence also suggests that mAbs containing high-mannose species clear faster from mouse serum.15,16 Therefore, the glycoforms of the antibody product, just like the protein moiety, must be characterized and monitored in great detail.

The glycan moieties of a glycoprotein are commonly characterized by mass spectrometry on the released glycans.17-19 However, the process of releasing the glycans from the glycoprotein, labeling them, and analyzing them chromatographically is often labor intensive and time-consuming. Additionally, the microheterogeneities regarding the site distribution of glycans are lost after the glycans are released from the glycoprotein. Characterizing glycan microheterogeneities requires that the glycan be characterized at the glycopeptide level.

A peptide mapping experiment, i.e., LC/MS/MS analysis of a proteolytic digestion of the protein, is routinely used in the biopharmaceutical industry for structural characterization of therapeutic proteins.19 The glycan profiles are often contained in these types of data.7,20,21 Achieving a glycan profile from these data requires an automated algorithm to identify glycoforms at the glycopeptide level.

Peptide tandem mass spectra generated from collision-induced dissociation (CID) have been widely used in the identification of peptides and post-translational modifications. The glycan moiety of glycopeptides, however, is difficult to identify by CID because of its complicated branched structure, as compared to a linear peptide sequence, and the existence of isobaric structures. The limited number of algorithms available for identification of glycopeptides or released glycans do not take advantage of the intensity information of CID fragments. If the tandem spectrum of a glycopeptide can be predicted with reasonable accuracy, more reliable identification may be achieved.

An empirical kinetic model has been reported previously for quantitative prediction of CID spectra of peptides acquired on ion-trap instruments,22,23 and has been used for full characterization of therapeutic proteins.19,24 This article describes the extension of the kinetic model to the prediction of CID spectra of N-glycopeptides with 524 different common N-glycoforms, including complex N-glycans with 2-4 antennas, plus high-mannose type and hybrid type.

Figure 1. Glycan structures (total 524) considered by the kinetic model, including complex type, hybrid type, high-mannose type, and the trimannosylated core structure. Shown here are the largest structures of each type.

(1) Abel, C. A.; Spiegelberg, H. L.; Grey, H. M. Biochemistry 1968, 7, 1271–1278.
(2) Jefferis, R. Adv. Exp. Med. Biol. 2005, 564, 143–148.
(3) Mimura, Y.; Ashton, P. R.; Takahashi, N.; Harvey, D. J.; Jefferis, R. J. Immunol. Methods 2007, 326, 116–126.
(4) Bailey, M. J.; Hooker, A. D.; Adams, C. S.; Zhang, S. H.; James, D. C. J. Chromatogr., B 2005, 826, 177–187.
(5) Kamoda, S.; Ishikawa, R.; Kakehi, K. J. Chromatogr., A 2006, 1133, 332–339.
(6) Flynn, G. C.; Chen, X.; Liu, Y. D.; Shah, B.; Zhang, Z. Mol. Immunol. 2010, 47, 2074–2082.
(7) Huang, L.; Biolsi, S.; Bales, K. R.; Kuchibhotla, U. Anal. Biochem. 2006, 349, 197–207.
(8) Qian, J.; Liu, T.; Yang, L.; Daus, A.; Crowley, R.; Zhou, Q. Anal. Biochem. 2007, 364, 8–18.
(9) Hokke, C. H.; Bergwerff, A. A.; van Dedem, G. W.; van Oostrum, J.; Kamerling, J. P.; Vliegenthart, J. F. FEBS Lett. 1990, 275, 9–14.
(10) Tsuchiya, N.; Endo, T.; Matsuta, K.; Yoshinoya, S.; Aikawa, T.; Kosuge, E.; Takeuchi, F.; Miyamoto, T.; Kobata, A. J. Rheumatol. 1989, 16, 285–290.
(11) Umaña, P.; Jean-Mairet, J.; Moudry, R.; Amstutz, H.; Bailey, J. E. Nat. Biotechnol. 1999, 17, 176–180.
(12) Davies, J.; Jiang, L.; Pan, L. Z.; LaBarre, M. J.; Anderson, D.; Reff, M. Biotechnol. Bioeng. 2001, 74, 288–294.
(13) Shields, R. L.; Lai, J.; Keck, R.; O’Connell, L. Y.; Hong, K.; Meng, Y. G.; Weikert, S. H.; Presta, L. G. J. Biol. Chem. 2002, 277, 26733–26740.
(14) Shinkawa, T.; Nakamura, K.; Yamane, N.; Shoji-Hosaka, E.; Kanda, Y.; Sakurada, M.; Uchida, K.; Anazawa, H.; Satoh, M.; Yamasaki, M.; Hanai, N.; Shitara, K. J. Biol. Chem. 2003, 278, 3466–3473.
(15) Wright, A.; Morrison, S. L. J. Exp. Med. 1994, 180, 1087–1096.
(16) Kanda, Y.; Yamada, T.; Mori, K.; Okazaki, A.; Inoue, M.; Kitajima-Miyama, K.; Kuni-Kamochi, R.; Nakano, R.; Yano, K.; Kakita, S.; Shitara, K.; Satoh, M. Glycobiology 2007, 17, 104–118.
(17) Wuhrer, M.; Deelder, A. M.; Hokke, C. H. J. Chromatogr., B 2005, 825, 124–133.
(18) Morelle, W.; Canis, K.; Chirat, F.; Faid, V.; Michalski, J. C. Proteomics 2006, 6, 3993–4015.
(19) Zhang, Z.; Pan, H.; Chen, X. Mass Spectrom. Rev. 2009, 28, 147–176.

10.1021/ac102359u © 2010 American Chemical Society. Published on Web 11/23/2010. Analytical Chemistry, Vol. 82, No. 24, December 15, 2010.

THEORY AND METHOD

N-Linked Glycans. N-Linked glycans considered by the kinetic model include complex N-glycans with 2-4 antennas, each antenna terminating with N-acetylglucosamine (GlcNAc), galactose (Gal), or sialic acid, plus hybrid type and high-mannose type, with or without core fucose (Fuc) and bisecting GlcNAc, a total of 524 possible glycoforms.
Different types of glycosidic linkages are not distinguished. Sialic acid can be either N-acetyl neuraminic acid (NANA) or N-glycolyl neuraminic acid (NGNA). See Figure 1 for structures of these glycans and Table 1 for sugar residues and their abbreviations. The entire list of the 524 glycans and their fragmentation products is shown in the Supporting Information Table S-1. These glycans represent most N-glycans present in recombinant and endogenous antibodies and a majority of other recombinant glycoproteins expressed in mammalian cells.

Table 1. Glycan Residues and Their Abbreviations Used in the Article

| glycan residue | abbreviation |
|---|---|
| N-acetylglucosamine (GlcNAc) | Gn |
| core fucose (Fuc) | F |
| mannose (Man) | M |
| bisecting N-acetylglucosamine | B |
| galactose (Gal) | G |
| N-acetyl neuraminic acid (NANA) | S |
| N-glycolyl neuraminic acid (NGNA) | Sg |

(20) Wagner-Rousset, E.; Bednarczyk, A.; Bussat, M. C.; Colas, O.; Corvaia, N.; Schaeffer, C.; Van Dorsselaer, A.; Beck, A. J. Chromatogr., B 2008, 872, 23–37.
(21) Sinha, S.; Pipes, G.; Topp, E. M.; Bondarenko, P. V.; Treuheit, M. J.; Gadgil, H. S. J. Am. Soc. Mass Spectrom. 2008, 19, 1643–1654.
(22) Zhang, Z. Anal. Chem. 2004, 76, 3908–3922.
(23) Zhang, Z. Anal. Chem. 2005, 77, 6364–6373.
(24) Zhang, Z. Anal. Chem. 2009, 81, 8354–8364.

In this Article, complex N-glycans are represented in the form of AaSgs2Ss1GgFB. Here, a represents the total number of antennas, s1 represents the number of antennas terminating with NANA, s2 represents the number of antennas terminating with NGNA, g represents the number of antennas terminating with galactose, F represents the presence of core fucose, and B represents the presence of bisecting GlcNAc. High-mannose N-glycans are represented as Mm, where m represents the number of mannose residues. Hybrid N-glycans are represented as AaSgs2Ss1GgMmFB. For example, A2G0F represents a glycan with two antennas, both terminating with GlcNAc (zero galactose), and with a core fucose.

Fragment ion nomenclature used in this Article is as follows. For peptide bond cleavages, the fragment ions are labeled as y or b. For glycosidic bond cleavages, the reducing end fragments (together with the peptide moiety) are labeled either with the abbreviation of the remaining glycan or with the loss of the nonreducing end (i.e., “-M” represents loss of a mannose from the nonreducing end), depending on which one is more concise. Fragments of the nonreducing end are labeled with their residue composition enclosed in parentheses. Cleavages of the chitobiose core generate Y1, Y2, Bn, and Bn-1 ions. See Figure 2 for an illustration of fragment ion nomenclature used in this article.

Figure 2. Illustration of fragment ion nomenclature used in this Article with a hypothetical glycan A1S1M4. Fragments of the nonreducing end are labeled with their residue composition enclosed in parentheses. The reducing end fragments (together with the peptide moiety) are labeled either with the abbreviation of the remaining glycan or with the loss of the nonreducing end, whichever is more concise. For example, “M4” is used instead of “-SGGn”.

Kinetics. For a unimolecular fragmentation reaction involving many competing pathways, the reaction kinetics can be described as follows. Assuming a precursor ion P has n competing fragmentation pathways to form fragments F1, F2, ..., Fn, with rate constants k1, k2, ..., kn, respectively, the kinetics of fragmentation can be described as

$$[P]_t = [P]_0 \exp(-k_{\mathrm{total}} t) \tag{1}$$

where $[P]_0$ and $[P]_t$ are abundances of the precursor ion at time zero and time $t$, respectively, and $k_{\mathrm{total}}$ is the sum of rate constants for all pathways, $k_{\mathrm{total}} = \sum_{i=1}^{n} k_i$. The abundance of each fragment ion at time $t$ can be expressed as

$$[F_i]_t = \int_0^t k_i [P]_t \, dt \tag{2}$$

Combining eqs 1 and 2, we have

$$[F_i]_t = k_i [P]_0 \int_0^t \exp(-k_{\mathrm{total}} t)\, dt = \frac{k_i [P]_0 [1 - \exp(-k_{\mathrm{total}} t)]}{k_{\mathrm{total}}} \tag{3}$$
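As a numerical illustration of eqs 1 and 3, the following minimal Python sketch computes the precursor and fragment abundances for a set of competing pathways. The function name, the three rate constants, and the reaction time are hypothetical values chosen only for illustration; they are not parameters of the model.

```python
import math


def fragment_abundances(rate_constants, t, p0=1.0):
    """Return (precursor, fragments) abundances at time t for competing pathways."""
    k_total = sum(rate_constants)
    if k_total == 0.0:
        # No open pathway: the precursor is unchanged.
        return p0, [0.0 for _ in rate_constants]
    decay = math.exp(-k_total * t)
    precursor = p0 * decay  # eq 1
    # eq 3: each pathway receives a share k_i / k_total of the consumed precursor
    fragments = [k * p0 * (1.0 - decay) / k_total for k in rate_constants]
    return precursor, fragments


# Hypothetical rate constants (s^-1) and reaction time (s), for illustration only.
precursor, fragments = fragment_abundances([2.0, 1.0, 0.5], t=0.3)
```

Because every pathway draws from the same precursor pool, the fragment abundances stand in the ratio of their rate constants and, together with the remaining precursor, sum to the initial abundance.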

When rate constants of all fragmentation pathways $k_i$ are known, the abundance of each fragment ion at time $t$ can be derived from eq 3. Calculation of rate constants for the peptide part has been described previously.22,23 This Article describes the rate constant calculation of the glycan part of the molecule.

Major Assumptions. To simplify the glycan fragmentation model, the following major assumptions are made: (1) Each glycan carries no more than one charge. The probability of having more than one charge on the glycan is negligible due to Coulomb repulsion and the lack of a basic site in the common N-glycans described in this Article. (2) Each glycosidic bond can be cleaved through either a charge-remote pathway or a charge-directed pathway. The fragmentation rate of a charge-directed pathway is proportional to the charge density of the cleavage site. (3) Each glycosidic bond cleavage pathway has an activation energy ($E_a$) and an A factor. The rate constant for the cleavage can be calculated by the following Arrhenius equation

$$k = A \exp(-E_a / RT_{\mathrm{eff}}) \tag{4}$$

where $R$ is the gas constant, and $T_{\mathrm{eff}}$ is the effective temperature of the ion.

Fragmentation Pathways. Fragmentation pathways of the peptide moiety have been described previously.22,23 For the glycan moiety, the kinetic model considers cleavages of all glycosidic bonds. It is assumed that both charge-remote and charge-directed processes exist for the cleavage of each glycosidic bond.

Parameters Used in the Model. In addition to the parameters used to simulate peptide CID spectra,22,23 an additional 79 parameters are used for modeling the fragmentation of the glycan moiety. During refinement of the model, these parameters were varied until a best match was obtained between the predicted spectra and the experimental spectra in the training data sets. Details of these parameters are described as follows: (1) two parameters for estimating the gas-phase basicity (GB) of the glycan group; (2) seven activation energies for charge-remote glycosidic bond cleavages and seven activation energies for charge-directed glycosidic bond cleavages (it is assumed that the activation energies depend on the nature of the two residues near the glycosidic bond); (3) 28 A factors for charge-remote processes and 28 A factors for charge-directed processes; (4) two parameters for estimating the A factors of charge-remote and charge-directed processes, respectively, for free nonreducing end glycan fragments; (5) two heat capacity parameters for estimating the decrease of effective temperature for glycopeptides during charge-remote and charge-directed fragmentation events, respectively; (6) two heat capacity parameters for estimating the decrease of effective temperature for free nonreducing end glycan fragments during charge-remote and charge-directed fragmentation events, respectively.

Mathematical Model. The details of the kinetic model have been described previously.22,23 The procedure starts by calculating the proton distribution in the precursor ion, followed by calculating the rate constant for each competing pathway, then by calculating the abundance of each product ion (from eq 3). The procedure is an iterative process in that any “product ion” will become a “precursor ion” and undergo further fragmentation if the reaction time allows ($t_{\mathrm{next}} > 0$). Details of calculating rate constants for the peptide moiety have been described previously,22,23 and the following describes the calculation regarding the glycan moiety.

Proton Distribution. Each protonation site, including backbone amides, side chains, and terminal functional groups, has a GB value.
The apparent GB of the glycan moiety is contributed primarily by the amide groups in GlcNAc and sialic acids. Assuming the GB of an amide group is $GB^{0}_{\mathrm{amide}}$ and there are $N$ amide groups in the glycan, the apparent GB of the glycan is calculated by

$$GB^{\mathrm{app}}_{\mathrm{glycan}} = RT_{\mathrm{eff}} \ln(K_{\mathrm{app}}) = RT_{\mathrm{eff}} \ln\left[ N \exp\left( \frac{GB^{0}_{\mathrm{amide}}}{RT_{\mathrm{eff}}} \right) \right] \tag{5}$$

Therefore

$$GB^{\mathrm{app}}_{\mathrm{glycan}} = GB^{0}_{\mathrm{amide}} + RT_{\mathrm{eff}} \ln(N) \tag{6}$$

Considering the contribution of all other nonamide atoms to the GB, let

$$N = N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}} \tag{7}$$

we have

$$GB^{\mathrm{app}}_{\mathrm{glycan}} = GB^{0}_{\mathrm{amide}} + RT_{\mathrm{eff}} \ln\left( N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}} \right) \tag{8}$$

where $N^{\mathrm{amide}}_{\mathrm{glycan}}$ is the actual number of amide groups in the glycan (including GlcNAc and sialic acid residues), $M_{\mathrm{glycan}}$ is the mass of the glycan, and $F^{GB}_{\mathrm{mass}}$ is a parameter representing the contribution of all other nonamide atoms to the GB of the glycan. It is later demonstrated that the mass term is negligible when $N^{\mathrm{amide}}_{\mathrm{glycan}} > 0$.

Proton distribution across the peptide backbone can be calculated using the Boltzmann distribution,22 treating the N-glycan moiety as one of the side chains. Coulomb repulsion is considered when calculating charge distribution. The distance from the glycan charge site to the peptide backbone, used for calculating Coulomb repulsion in the charge distribution calculation, is estimated by the following equation in units of backbone residue length:

$$L_{\mathrm{glycan}} = 1.5 + 0.2 \sqrt{M_{\mathrm{glycan}}} \tag{9}$$

Rate Constants for Each Competing Fragmentation Pathway. Rate constants of all pathways involving peptide fragmentation are calculated the same way as described previously.22,23 The rate constants for glycosidic bond cleavages are calculated by the Arrhenius equation as shown in eq 4. For charge-directed pathways, the rate constant is also proportional to the charge density ($p_{\mathrm{glycan}}$) on the glycan (considered as the asparagine side chain), which is spread over the different possible protonation sites.

$$k_{\mathrm{charge\ directed}} = \frac{A_{\mathrm{charge\ directed}} \cdot p_{\mathrm{glycan}}}{N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}}} \cdot \exp(-E_a / RT_{\mathrm{eff}}) \tag{10}$$

$$k_{\mathrm{charge\ remote}} = A_{\mathrm{charge\ remote}} \cdot \exp(-E_a / RT_{\mathrm{eff}}) \tag{11}$$

The $E_a$ values (activation energies) are parameters in the model, which depend on the fragmentation pathway as well as the neighboring sugar residues near the cleavage site. For charge-directed cleavages, it is assumed that the proton at the cleavage site stays with the leaving group (nonreducing end). This assumption was made because the resulting model gave better prediction than alternative assumptions. Therefore, the rates of ion loss ($k_{\mathrm{ion\ loss}}$) and neutral loss ($k_{\mathrm{neutral\ loss}}$) are calculated by the following equations

$$k_{\mathrm{ion\ loss}} = k_{\mathrm{charge\ directed}} + k_{\mathrm{charge\ remote}} \, p_{\mathrm{glycan}} \, \frac{N^{\mathrm{amide}}_{\mathrm{nonred}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{nonred}}}{N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}}} \tag{12}$$

$$k_{\mathrm{neutral\ loss}} = k_{\mathrm{charge\ remote}} \left( 1 - p_{\mathrm{glycan}} \, \frac{N^{\mathrm{amide}}_{\mathrm{nonred}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{nonred}}}{N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}}} \right) \tag{13}$$

Note that all proton distributions are calculated by assuming the proton is distributed proportionally according to $N^{\mathrm{amide}} + F^{GB}_{\mathrm{mass}} \cdot M$, where $N^{\mathrm{amide}}$ is the number of amide groups in the fragment and $M$ is the mass of the fragment. Also note that “nonred” stands for the nonreducing end of the glycan. For a free glycan ion (a charge is retained on the glycan moiety after a glycosidic bond cleavage), $p_{\mathrm{glycan}}$ is always 1, and two extra parameters, $f^{\mathrm{free\ glycan}}_{\mathrm{charge\ directed}}$ and $f^{\mathrm{free\ glycan}}_{\mathrm{charge\ remote}}$, are applied to eqs 10 and 11 for calculating rate constants of glycosidic bond cleavages in free glycans.

$$k^{\mathrm{free\ glycan}}_{\mathrm{charge\ directed}} = f^{\mathrm{free\ glycan}}_{\mathrm{charge\ directed}} \cdot \frac{A_{\mathrm{charge\ directed}}}{N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}}} \cdot \exp(-E_a / RT_{\mathrm{eff}}) \tag{14}$$

$$k^{\mathrm{free\ glycan}}_{\mathrm{charge\ remote}} = f^{\mathrm{free\ glycan}}_{\mathrm{charge\ remote}} \cdot A_{\mathrm{charge\ remote}} \cdot \exp(-E_a / RT_{\mathrm{eff}}) \tag{15}$$

Rate constants for ion loss and neutral loss from a free glycan are calculated the same way as in eqs 12 and 13 for a $p_{\mathrm{glycan}}$ value of 1. That is

$$k^{\mathrm{free\ glycan}}_{\mathrm{ion\ loss}} = k^{\mathrm{free\ glycan}}_{\mathrm{charge\ directed}} + k^{\mathrm{free\ glycan}}_{\mathrm{charge\ remote}} \, \frac{N^{\mathrm{amide}}_{\mathrm{nonred}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{nonred}}}{N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}}} \tag{16}$$

$$k^{\mathrm{free\ glycan}}_{\mathrm{neutral\ loss}} = k^{\mathrm{free\ glycan}}_{\mathrm{charge\ remote}} \left( 1 - \frac{N^{\mathrm{amide}}_{\mathrm{nonred}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{nonred}}}{N^{\mathrm{amide}}_{\mathrm{glycan}} + F^{GB}_{\mathrm{mass}} \cdot M_{\mathrm{glycan}}} \right) \tag{17}$$

Because glycosidic bonds cleave readily during CID, cleavages of more than one glycosidic bond are observed far more frequently than cleavages of more than one peptide bond. Direct application of the above rate constants to the previously developed peptide CID model produced inaccurate results due to severe underestimation of multiple bond cleavages. The reason is that, in the previously developed peptide model, a five-step temperature approximation was used to save computation time. Due to the abrupt decrease of effective temperature between steps, cleavage of more than one bond is underestimated. To solve this problem for glycan fragmentation as well as to save computation time, it is assumed that two charge-remote glycosidic bond cleavages can happen within the same temperature step. The rate constant of the charge-remote two-bond cleavage is calculated by

$$k^{\mathrm{2\text{-}bond}}_{\mathrm{charge\ remote}} = f_{\mathrm{2\text{-}bond}} \cdot k^{1}_{\mathrm{charge\ remote}} \cdot k^{2}_{\mathrm{charge\ remote}} \tag{18}$$

where $f_{\mathrm{2\text{-}bond}}$ is a parameter relating to the efficiency of a two-bond cleavage, $k^{1}_{\mathrm{charge\ remote}}$ is the rate constant of the first charge-remote process, and $k^{2}_{\mathrm{charge\ remote}}$ is the rate constant of the second charge-remote process from the product ion of the first process (calculated from eq 11). Distribution of charges for double cleavages is calculated similarly according to the $N^{\mathrm{amide}} + F^{GB}_{\mathrm{mass}} \cdot M$ value of each fragment.

Further Fragmentation. With rate constants for all pathways calculated, the abundance of each ion at a certain time can be calculated. Each ion is then submitted to further calculation at lower temperature until the fragmentation time runs out, and the isotope pattern of each ion at that time is added to the predicted spectrum.23 The temperatures of the later steps are calculated in the same way as described previously for quadrupole ion-trap instruments.22
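The rate expressions for glycosidic bond cleavage in a glycopeptide (eqs 4 and 10-13) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, activation energies are assumed to be in kJ/mol (so the gas constant is taken in kJ/(mol K)), and the effective temperature, charge density, and glycan composition in the example are invented inputs. The A factors and activation energies are of the magnitude listed in Table 2, but their pairing with a specific bond here is only illustrative.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K); assumes Ea is given in kJ/mol


def arrhenius(a_factor, ea, t_eff):
    """eq 4: k = A exp(-Ea / (R Teff))."""
    return a_factor * math.exp(-ea / (R_KJ * t_eff))


def charge_weight(n_amide, mass, f_mass):
    """Proton-distribution weight N^amide + F^GB_mass * M of a glycan or fragment."""
    return n_amide + f_mass * mass


def glycosidic_rates(a_cd, ea_cd, a_cr, ea_cr, t_eff, p_glycan,
                     n_glycan, m_glycan, n_nonred, m_nonred, f_mass):
    """Return (k_cd, k_cr, k_ion_loss, k_neutral_loss) per eqs 10-13."""
    w_glycan = charge_weight(n_glycan, m_glycan, f_mass)
    w_nonred = charge_weight(n_nonred, m_nonred, f_mass)
    k_cd = arrhenius(a_cd, ea_cd, t_eff) * p_glycan / w_glycan   # eq 10
    k_cr = arrhenius(a_cr, ea_cr, t_eff)                         # eq 11
    share = p_glycan * w_nonred / w_glycan  # chance the charge sits on the leaving group
    k_ion_loss = k_cd + k_cr * share                             # eq 12
    k_neutral_loss = k_cr * (1.0 - share)                        # eq 13
    return k_cd, k_cr, k_ion_loss, k_neutral_loss


# Hypothetical example: loss of a terminal GlcNAc (1 amide, 203.079 u) from a
# glycan with 4 amide groups and a mass of 1444.533 u, at an assumed Teff of
# 900 K and an assumed charge density of 0.2 on the glycan.
k_cd, k_cr, k_ion, k_neutral = glycosidic_rates(
    a_cd=2.32e23, ea_cd=315.0, a_cr=7.88e6, ea_cr=77.9,
    t_eff=900.0, p_glycan=0.2,
    n_glycan=4, m_glycan=1444.533, n_nonred=1, m_nonred=203.079,
    f_mass=3.7e-12,
)
```

Eqs 12 and 13 only partition the two cleavage channels between charged and neutral leaving groups, so the ion-loss and neutral-loss rates always add up to the sum of the charge-directed and charge-remote rates.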


The temperature for the next round of fragmentation in an ion-trap instrument is calculated by

$$T^{\mathrm{next}}_{\mathrm{eff}} = T_{\mathrm{eff}} - \Delta T - \frac{\Delta E}{C \cdot M_{\mathrm{precursor}}} \tag{19}$$

$\Delta T$ is 1/10 of $T_{\mathrm{eff}} - T_0$ for the first step and 1/5 for the later steps. $C$ is the heat capacity of the precursor ion. The activation energy of each cleavage process is used as the value of $\Delta E$. To compensate for the inaccuracies in $\Delta E$ estimation, four different values of $C$ are used for calculating $T^{\mathrm{next}}_{\mathrm{eff}}$, one each for charge-directed and charge-remote processes of glycopeptides and of free glycans. Note that the cooling model described here applies only to the ion-trap type of collision cells. For prediction of CID spectra on the quadrupole type of collision cells, the cooling model needs to be modified appropriately.

After losing a leaving group (nonreducing end), the remaining glycan is still one of the 524 glycans, whose fragmentation behavior can also be predicted (see Supporting Information Table S-1). Therefore, the fragmentation process can be further simulated using the same procedure at a lower temperature. The process stops when the remaining fragmentation time is shorter than 10% of the total fragmentation time (to save computation time), and all ions at that time are added to the simulated spectrum as described previously.23

Training of the Model. The model is trained using a large number of ion-trap CID spectra of known N-glycopeptides by examining the similarity scores between the predicted and experimental spectra. The similarity score between two spectra is defined as22

$$\mathrm{similarity} = \frac{\sum \sqrt{I^{1}_{m} I^{2}_{m}}}{\sqrt{\left( \sum I^{1}_{m} \right) \left( \sum I^{2}_{m} \right)}} \tag{20}$$

where $I$ stands for the signal intensity at a certain m/z value ($m$). The above definition of spectral similarity represents the dot product of the two spectra (cosine of the spectral angle),25 after a square-root transformation of the intensity of each signal. Equation 20 is preferred over the conventional dot product because the conventional dot product does not put enough emphasis on low-intensity ions,26 which often contain large amounts of structural information.

Due to the limited number of peptide sequence ions observed in the training spectra, the parameters related to the peptide moiety were not changed during the training. Instead, parameters previously derived using a large number of nonglycosylated peptides were used.23 The best match between the simulated and experimental spectra was obtained when the average similarity of all spectra in the training data set was maximized. To avoid overfitting the model, the deviations of the different activation energies were minimized at the same time as the spectral similarities were maximized. Specifically, the best set of parameters is obtained when the function f defined in eq 21 is maximized.
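Two of the formulas above, the cooling step of eq 19 and the similarity score of eq 20, are small enough to sketch directly in Python. The sketch below rests on several assumptions that are not part of the source: spectra are represented as dictionaries mapping an already-binned m/z value to an intensity, peaks are matched only when their keys are identical, the step fractions in eq 19 are taken as 1/10 and 1/5, and all numerical inputs (temperatures, heat capacity, mass, intensities) are hypothetical. A conventional dot product is included only for comparison.

```python
import math


def similarity(spec1, spec2):
    """eq 20: cosine of the two spectra after square-root intensity transformation."""
    shared = spec1.keys() & spec2.keys()
    numerator = sum(math.sqrt(spec1[m] * spec2[m]) for m in shared)
    denominator = math.sqrt(sum(spec1.values()) * sum(spec2.values()))
    return numerator / denominator if denominator > 0.0 else 0.0


def conventional_dot(spec1, spec2):
    """Cosine of the untransformed intensities, shown for comparison only."""
    shared = spec1.keys() & spec2.keys()
    numerator = sum(spec1[m] * spec2[m] for m in shared)
    norm1 = math.sqrt(sum(v * v for v in spec1.values()))
    norm2 = math.sqrt(sum(v * v for v in spec2.values()))
    return numerator / (norm1 * norm2) if norm1 > 0.0 and norm2 > 0.0 else 0.0


def next_effective_temperature(t_eff, t0, delta_e, heat_capacity,
                               precursor_mass, first_step):
    """eq 19, assuming step fractions of 1/10 (first step) and 1/5 (later steps)."""
    delta_t = (t_eff - t0) / (10.0 if first_step else 5.0)
    return t_eff - delta_t - delta_e / (heat_capacity * precursor_mass)


# Hypothetical spectra: the prediction misses one weak ion and underestimates another.
experimental = {500: 100.0, 600: 4.0, 700: 1.0}
predicted = {500: 100.0, 600: 1.0}
score = similarity(experimental, predicted)
plain = conventional_dot(experimental, predicted)

# Hypothetical cooling step; units of delta_e and heat_capacity follow the model.
cooled = next_effective_temperature(900.0, 300.0, 77.9, 0.037, 4000.0, first_step=True)
```

In this example the square-root score is lower than the conventional dot product, because mismatches in the weak ions count for more once the dominant peak is compressed.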

Table 2. Optimized Values for A Factor (A) and Activation Energy (Ea) A (s-1) glycosidic bonds M-M

Gn-M

G-Gn S-G Sg-G F-Gn Gn-Gn M-Gn


leaving group

mass of leaving group

no. of amide in the leaving group

M M2 M3 M4 M5 GnM GGnM SGGnM SgGGnM Gn2M GGn2M G2Gn2M SGGn2M SgGGn2M SG2Gn2M SgG2Gn2M S2G2Gn2M SgSG2Gn2M Sg2G2Gn2M Gn GGn SGGn SgGGn bisecting Gn G SG SgG S Sg F Bn Bn-1

162.053 324.106 486.158 648.211 810.264 365.132 527.185 818.280 834.275 568.212 730.264 892.317 1021.360 1037.355 1183.413 1199.408 1474.508 1490.503 1506.498 203.079 365.132 656.228 672.222 203.079 162.053 453.148 469.143 291.095 307.090 146.058 varies varies

0 0 0 0 0 1 1 2 2 2 2 2 3 3 3 3 4 4 4 1 1 2 2 1 0 1 1 1 1 0 varies varies


Ea (kJ/mol)

charge directed

charge remote

charge directed

charge remote

1.83 × 10^21 8.60 × 10^21 5.20 × 10^21 1.93 × 10^22 6.70 × 10^21 6.51 × 10^22 3.87 × 10^22 3.52 × 10^22 1.26 × 10^22 1.23 × 10^22 2.49 × 10^21 5.04 × 10^21 6.35 × 10^19

5.39 × 10^6 1.80 × 10^6 2.93 × 10^6 4.24 × 10^6 2.40 × 10^6 1.80 × 10^6 1.31 × 10^6 1.81 × 10^3 1.48 × 10^3 1.29 × 10^6 3.24 × 10^5 2.01 × 10^3 1.66 × 10^6

312.5

73.5

1.74 × 10^22

2.00 × 10^2

7.49 × 10^22

3.99 × 10^4

2.32 × 10^23 2.72 × 10^23 2.26 × 10^23 8.19 × 10^22 1.66 × 10^23 4.76 × 10^20 8.55 × 10^21 1.72 × 10^21 1.18 × 10^22 1.04 × 10^21 8.61 × 10^18 1.37 × 10^20 1.25 × 10^19

7.88 × 10^6 5.78 × 10^6 2.99 × 10^6 8.19 × 10^3 8.99 × 10^6 6.00 × 10^7 7.40 × 10^7 3.30 × 10^4 1.80 × 10^10 1.27 × 10^10 1.91 × 10^5 4.90 × 10^4 1.80 × 10^5

315.0

77.9

312.7

101.7

303.4

107.8

273.9 269.2 271.7

49.2 59.4 69.5
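The leaving-group masses and amide counts listed in Table 2 follow directly from the residue composition encoded in each leaving-group label. The Python sketch below recomputes them from the single-residue rows of the table (M, G, Gn, S, Sg, and F). The parser, its name, and the tokenizing rule are assumptions made for illustration; labels such as "bisecting Gn", "Bn", and "Bn-1" are outside its scope.

```python
import re

# Residue abbreviation -> (residue mass in u, number of amide groups),
# taken from the single-residue rows of Table 2.
RESIDUES = {
    "M": (162.053, 0),
    "G": (162.053, 0),
    "Gn": (203.079, 1),
    "S": (291.095, 1),
    "Sg": (307.090, 1),
    "F": (146.058, 0),
}

# Two-letter abbreviations are listed first so "Sg" and "Gn" are not split.
TOKEN = re.compile(r"(Sg|S|Gn|G|M|F)(\d*)")
LABEL = re.compile(r"(?:(?:Sg|S|Gn|G|M|F)\d*)+")


def leaving_group(label):
    """Return (mass, amide count) for a label such as 'SGGnM' or 'G2Gn2M'."""
    if not LABEL.fullmatch(label):
        raise ValueError(f"unrecognized leaving-group label: {label!r}")
    mass, amides = 0.0, 0
    for abbreviation, count in TOKEN.findall(label):
        n = int(count) if count else 1  # a missing count means one residue
        residue_mass, residue_amides = RESIDUES[abbreviation]
        mass += n * residue_mass
        amides += n * residue_amides
    return round(mass, 3), amides


mass_sggnm, amides_sggnm = leaving_group("SGGnM")
mass_big, amides_big = leaving_group("Sg2G2Gn2M")
```

Summing three-decimal residue masses can differ from the tabulated value in the last digit (for example, for the largest leaving groups), so comparisons against Table 2 are best made with a small tolerance.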

Table 3. Optimized Values for Other Miscellaneous Parameters

| parameter | symbol | optimized value | equation |
|---|---|---|---|
| GB of glycan amide | $GB^{0}_{\mathrm{amide}}$ | 877.9 | 5-8 |
| contribution of glycan mass to GB | $F^{GB}_{\mathrm{mass}}$ | 3.7 × 10^-12 | 7-8 |
| free glycan charge directed | $f^{\mathrm{free\ glycan}}_{\mathrm{charge\ directed}}$ | 56.8 | 14 |
| free glycan charge remote | $f^{\mathrm{free\ glycan}}_{\mathrm{charge\ remote}}$ | 2.83 | 15 |
| two-bond cleavage | $f_{\mathrm{2\text{-}bond}}$ | 0.52 | 18 |
| heat capacity, charge-directed, glycopeptide | $C$ | 0.162 | 19 |
| heat capacity, charge-remote, glycopeptide | $C$ | 0.037 | 19 |
| heat capacity, charge-directed, free glycan | $C$ | 0.196 | 19 |
| heat capacity, charge-remote, free glycan | $C$ | 1.6 × 10^5 | 19 |
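With the optimized values of Table 3, eq 8 and eq 9 can be evaluated numerically, which also makes it easy to see why the mass term is negligible whenever the glycan contains at least one amide group. The Python sketch below is illustrative only. It assumes that GB is expressed in kJ/mol (the table gives no unit), uses a hypothetical effective temperature of 900 K, and takes a glycan of 4 amide groups and 1444.533 u as an example input; the function names are not from the source.

```python
import math

GB0_AMIDE = 877.9      # Table 3; unit assumed to be kJ/mol
F_GB_MASS = 3.7e-12    # Table 3
R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K), matching the assumed GB unit


def apparent_gb(n_amide, mass, t_eff, f_mass=F_GB_MASS):
    """eq 8: apparent GB of a glycan with n_amide amide groups and the given mass."""
    return GB0_AMIDE + R_KJ * t_eff * math.log(n_amide + f_mass * mass)


def glycan_distance(mass):
    """eq 9: distance from the glycan charge site to the backbone, in residue lengths."""
    return 1.5 + 0.2 * math.sqrt(mass)


T_EFF = 900.0  # hypothetical effective temperature in K

# Glycan with amide groups: the mass term changes the result only marginally.
with_mass = apparent_gb(4, 1444.533, T_EFF)
without_mass = apparent_gb(4, 1444.533, T_EFF, f_mass=0.0)

# Fragment without amide groups (e.g., five mannose residues): only the mass term remains.
no_amide = apparent_gb(0, 810.264, T_EFF)

distance = glycan_distance(1444.533)
```

For the amide-containing glycan the two results agree to well below 10^-6 in the assumed unit, whereas the amide-free fragment is assigned a markedly lower apparent GB, since only the small mass term contributes.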

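$$f = \bar{s} - \frac{0.001}{N} \sqrt{\sum \left( E_a^{\mathrm{charge\ directed}} - \bar{E}_a^{\mathrm{charge\ directed}} \right)^2} - \frac{0.001}{N} \sqrt{\sum \left( E_a^{\mathrm{charge\ remote}} - \bar{E}_a^{\mathrm{charge\ remote}} \right)^2} \tag{21}$$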

Here, $\bar{s}$ is the average similarity score for all spectra in the training data set, $N$ is the total number of spectra in the training data set, and 0.001 is a weight factor, which is selected to be the largest number that does not decrease the optimized average similarity $\bar{s}$ significantly.

EXPERIMENTAL SECTION

To train the described mathematical model for predicting CID spectra of N-glycopeptides, a data set containing 1831 CID spectra was generated from glycopeptides with 143 different peptide sequences and 89 different N-glycoforms, including 60 complex type glycans with 2-4 antennas, 7 high-mannose type glycans, and 22 hybrid type glycans. The mass of these glycopeptides varies from 1900 to 9900 Da, with peptides from 5 to 72 residues in length. CID spectra of the same ion acquired at different collision energies and isolation widths were treated as different spectra. All CID spectra were collected in centroid mode on three Thermo-Scientific LTQ-Orbitrap mass spectrometers. Most spectra were collected in low resolution on the linear ion trap, and 17 spectra in high resolution on the orbitrap, with an isolation width of 2-4 u and a relative collision energy of 30-35%.

To ensure the quality of spectra in the training data set, most glycopeptides were generated from proteolytic digestion of well-characterized recombinant glycoproteins, including primarily monoclonal antibodies produced in Amgen (Thousand Oaks, CA), and well-characterized commercial glycoproteins purchased from Sigma-Aldrich (Saint Louis, MO). Each glycopeptide was identified from its tandem mass spectra, including the accurate mass and its CID fragmentation pattern, using the custom-written software MassAnalyzer,24 and validated manually. Proteases used for digestion include trypsin, endoproteinases Lys-C, Glu-C, and Asp-N, pepsin, and chymotrypsin. Most spectra were collected with reversed-phase liquid chromatography/tandem mass spectrometry (LC/MS/MS) at ∼200 µL/min flow rate using an acetonitrile gradient and either 0.02-0.1% TFA or 0.1-0.2% formic acid in the mobile phase.

Glycopeptides generated from Glu-C digestion of IgG1 and IgG2 monoclonal antibodies were used to construct the testing data set. The testing data set contains 196 CID spectra with 6 unique peptide sequences (24-35 residues in length) and 28 different N-glycoforms. None of these glycopeptides was present in the training data set. The mass of these glycopeptides varies from 3700 to 6500 Da. Spectra were collected under conditions similar to those of the training data set, in low-resolution mode.

A computer program, written in Microsoft Visual C++, was developed for simulating CID spectra and refining the model. The program is incorporated into MassAnalyzer,24 a program for fully automated protein and peptide LC/MS/MS data analyses, through a dynamically linked library (DLL). CID spectra in the training data set were simulated with varied parameters until function f described in eq 21 was maximized. When simulation was performed on a desktop computer with a 3.33 GHz Intel Core2Duo processor (only one CPU was used), for all the spectra with precursor mass less than 8000 u in the training data sets, an average simulation speed of 28 spectra/s was achieved, with larger peptides much slower than smaller peptides.

A function optimization routine was developed and was used to optimize parameters in the model. The routine is an iterative process in which each parameter in the model was varied until the function described in eq 21 was maximized. The process repeats until no further optimization can be achieved.

Figure 3. Distribution of similarity scores for spectra in the training data set (top) and testing data set (bottom). A similar distribution was obtained for the two data sets, demonstrating the validity of the prediction model.

(25) Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859–866.
(26) Zhang, Z. Anal. Chem. 2010, 82, 1990–2005.

RESULTS

Many parameters were tested for their effects on the fragmentation patterns. Those parameters determined to have significant effects were included in the model, and their optimized values are presented in Tables 2 and 3.
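The parameter refinement described in the Experimental Section, in which one parameter at a time is varied until the function of eq 21 stops improving, can be sketched in Python as below. This is a hypothetical stand-in rather than the authors' routine: the source does not specify the search strategy, so a fixed-step, one-parameter-at-a-time search is assumed, and the penalty term is written under the assumption that eq 21 has the form of the mean similarity minus (0.001/N) times the root of the summed squared deviations of the activation energies from their mean. The toy score function stands in for a full spectrum simulation.

```python
import math


def spread(values):
    """Square root of the summed squared deviations from the mean."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values))


def objective(similarities, ea_directed, ea_remote, weight=0.001):
    """Assumed reading of eq 21: mean similarity minus two weighted Ea spreads."""
    n = len(similarities)
    mean_similarity = sum(similarities) / n
    penalty = (weight / n) * (spread(ea_directed) + spread(ea_remote))
    return mean_similarity - penalty


def refine(params, score, steps):
    """Vary one parameter at a time by a fixed step, keeping strict improvements."""
    params = list(params)
    best = score(params)
    improved = True
    while improved:  # repeat full passes until no parameter can be improved
        improved = False
        for i, step in enumerate(steps):
            for delta in (step, -step):
                while True:
                    trial = list(params)
                    trial[i] += delta
                    value = score(trial)
                    if value <= best:
                        break
                    params, best, improved = trial, value, True
    return params, best


def toy_score(p):
    """Hypothetical smooth objective with its maximum at (1, -2)."""
    return -(p[0] - 1.0) ** 2 - (p[1] + 2.0) ** 2


fitted, best = refine([0.0, 0.0], toy_score, steps=[0.25, 0.25])
```

A search of this kind terminates only if the objective is bounded from above, which holds for eq 21 because the similarity score cannot exceed 1 and the penalty terms are nonnegative.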


Figure 4. Representative experimental and predicted spectra of protonated glycopeptides in the training data set. The ion intensities are square-root transformed to show the low-intensity signals.

The N-glycopeptide CID model, with its respective set of parameters, was used to simulate the spectra in the training and testing data sets. The predicted spectra were compared to the experimental spectra in each data set, and the distributions of similarity scores are shown in Figure 3. For a fair comparison, peptides larger than 8000 u are not included in the calculation of the similarity score distributions. Figure 3 shows that the similarity score distribution for the testing set is very similar to that for the training set, demonstrating the validity of the CID model.

Figures 4 and 5 show some representative simulated CID spectra of protonated glycopeptides in the training data set and testing data set, respectively, compared with their experimental spectra. To make weak fragment ions more visible, ion intensities in all spectra are plotted on a square-root scale. The glycopeptides shown in Figures 4 and 5 were selected to represent most varieties of glycans, peptides, and precursor charge states. Importantly, all selected experimental spectra have a base-peak intensity above 10^4 to ensure spectral quality. As a result, the similarity scores between the predicted and experimental spectra shown in Figures 4 and 5 are above average, owing to the better signal-to-noise ratios of the experimental spectra. As demonstrated in Figures 4 and 5, the general fragmentation patterns of the glycan moiety are accurately predicted. The occasionally observed peptide-backbone b and y ions are also predicted well. More examples, including a few low-quality spectra, are shown in Supporting Information Figure S-1.

Figure 5. Representative experimental and predicted spectra of protonated glycopeptides in the testing data set. The ion intensities are square-root transformed to show the low-intensity signals.

DISCUSSION

Because of the model’s empirical nature and the large number of parameters used, the absolute value of each parameter may be far from its real physical value. However, analyzing the relative values of the optimized parameters may yield useful information that advances our understanding of the N-glycopeptide CID process. The most striking characteristic of the optimized parameters shown in Table 2 is the high A factor for charge-directed cleavage of the Gn-M linkage and charge-remote cleavage of the S-G (and Sg-G) linkage. As a result, sialylated glycans lose their sialic acid residues very readily at low charge states (see Figures 4E and 5F for examples). At high charge states, however, charge-directed Gn-M cleavages dominate. For example, for the 3+ sialylated glycopeptide shown in Figure 4E, the most abundant fragment ions arise from the loss of sialic acid residues, whereas for the similar glycopeptide at 4+, the most abundant fragment ion arises from the loss of SGGn+ caused by Gn-M cleavage. After the loss of SGGn+, the charge state of the precursor ion is reduced to 3+, and further loss of neutral sialic acid residues therefore follows (Figure 4F).

CID provides rich structural information on the glycan moiety of a glycopeptide, while providing little sequence information on the peptide part of the molecule. In most cases, because of the simplicity of N-glycans on mAb molecules as well as the specificity of the N-glycosylation site (NXS or NXT, where X can be any residue except proline), CID fragmentation patterns combined with accurate mass are enough to identify the glycopeptides in mAb molecules. For a more complex system, such as a complex protein mixture or a protein with heavily modified amino acid residues, electron-transfer dissociation (ETD) or electron-capture dissociation (ECD) may be necessary to obtain peptide sequence information and to identify the site of


glycosylation. A similar kinetic model for ETD/ECD is described in a separate paper.26

Although peptide sequence ions usually have low abundances in a glycopeptide CID spectrum, the sequence of the peptide does contribute to the overall fragmentation pattern of the glycan moiety. The peptide sequence affects the glycan fragmentation pattern primarily through charge distribution. A peptide with higher gas-phase basicity will retain more charges on the peptide moiety, resulting in fewer charges on the glycan moiety. As a result, charge-directed glycan fragmentation pathways will be suppressed.

The described kinetic model makes it possible to identify glycopeptides automatically, facilitating routine profiling of glycan structures of therapeutic glycoproteins. However, because some isobaric glycan structures are structurally similar, it is sometimes difficult to distinguish these isobaric glycoforms on the basis of CID alone. For example, it is frequently difficult to distinguish confidently a bisecting GlcNAc from a GlcNAc attached to an antenna (e.g., A2G0FB and A3G0F), or galactose from mannose (e.g., A1G0M4F and A1G1F). Therefore, confident identification of a glycan structure often requires the combination of mass information, MS/MS fragmentation pattern, and biosynthetic restrictions known for the expression system. Fortunately, for recombinant therapeutic proteins, glycan structures are often well characterized by other, more structurally informative techniques, and the possible glycoforms produced by a specific expression system are usually known. As additional examples, it is known that human glycans, unlike those of other mammals, do not contain NGNA;27 endogenous human IgG Fc glycans usually contain no more than two antennas;6 and CHO-expressed glycoproteins, unlike human endogenous IgGs, do not contain bisecting GlcNAc.28 Therefore, after these biosynthetic restrictions are applied, glycan profiling based on CID of glycopeptides meets the need in most cases for routine glycan profiling in therapeutic mAb development. For example, the described model has been used to identify and quantify glycan profiles in several mAbs, and the profiles agreed well with those determined by conventional approaches.24 It was also used to identify minor N-glycoforms present in endogenous human IgGs.6

The described model makes possible fully automated identification and quantification of glycopeptides from an LC/MS/MS peptide mapping experiment.24 A major advantage of this approach is that microheterogeneities of glycoforms are distinguished: glycans attached to different regions of a protein, or to different proteins, can be conveniently distinguished. Additionally, because an LC/MS/MS peptide mapping experiment is routinely performed for protein structural characterization, no additional experiment needs to be performed for glycan profiling.

ACKNOWLEDGMENT

The authors would like to thank Gang Xiao, Jason Richardson, Drew Nichols, Da Ren, and Diana Liu for their help in collecting data for training the model, and Gary Rogers, Greg Flynn, and Pavel Bondarenko for helpful discussions during development of the model.

SUPPORTING INFORMATION AVAILABLE

Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review September 4, 2010. Accepted November 11, 2010.

AC102359U

(27) Varki, A. Biochimie 2001, 83, 615–622.
(28) Stanley, P.; Raju, T. S.; Bhaumik, M. Glycobiology 1996, 6, 695–699.