Automated N-Glycopeptide Identification Using a Combination of Single- and Tandem-MS

David Goldberg,*,† Marshall Bern,† Simon Parry,‡ Mark Sutton-Smith,‡ Maria Panico,‡ Howard R. Morris,§ and Anne Dell‡

Palo Alto Research Center, Palo Alto, California 94301, Division of Molecular Biosciences, Faculty of Natural Sciences, South Kensington Campus, Biochemistry Building, Imperial College, London SW7 2AZ, United Kingdom, and M-SCAN Ltd Wokingham, Berkshire, RG41 2TZ, United Kingdom

Received April 26, 2007

We describe Peptoonist, a program that can automatically identify the glycans (sugars) present at each N-glycosylation site of a protein. The input to Peptoonist is a series of mass spectra, both MS and MS/MS, obtained from a liquid chromatography (LC) run of proteolytically digested purified glycoproteins. The program uses MS/MS to identify glycosylated peptides and single-MS to identify the N-glycans present on each of these peptides, at least to the level of monosaccharide composition. We validate the program on an LC run of mouse zona pellucida proteins that had been intensively hand annotated by a human expert. Our program doubled the number of glycopeptide identifications, and also found several possible errors in the hand annotation. In addition, it automatically made most of the same glycan isomer identifications as the expert annotator.

Keywords: glycosylation • N-glycan • glycomics • proteomics • mass spectrometry

Introduction

Glycosylation is the post-translational addition of oligosaccharides (short chains of simple sugars called monosaccharides) to proteins or lipids. Protein glycosylation consists of two broad classes, called N- and O-glycosylation, and is estimated to be present in over half of all proteins.1 Glycobiology textbooks2,3 give a diverse list of biological processes that depend on protein glycosylation; moreover, differential glycosylation may offer new biomarkers for various diseases.4,5

In this paper, we focus on N-glycosylation, in which glycans are attached to asparagine residues in the consensus sequences Asn-X-Thr and Asn-X-Ser, where X is any residue except Pro. A detailed analysis of N-glycosylation can further functional understanding, as shown by the recent (May 2005) FDA approval of a blood biomarker test for liver cancer that measures the percentage of N-glycoforms of alpha-fetoprotein (AFP) that carry a fucose near the reducing end of the glycan, linked by an α1-6 bond.6 Glycosylated asparagine is almost always observed with a biosynthetically related family of glycans, rather than a single type.

We have developed a program, Peptoonist, that uses both single- and tandem-MS spectra to determine the glycoforms present at each N-glycosylation site within a purified sample of glycoproteins. By using both MS/MS and high-resolution single-MS, Peptoonist can analyze more complex samples, with a more thorough analysis, than previous algorithms.

* To whom correspondence should be addressed. Palo Alto Research Center. Fax: 650-812-4471. E-mail: [email protected].
† Palo Alto Research Center.
‡ Imperial College.
§ M-SCAN Ltd Wokingham.

10.1021/pr070239f CCC: $37.00 © 2007 American Chemical Society

Tandem-MS alone misses many glycopeptides because, as in proteomics, low-intensity precursor peaks are often not selected for MS/MS. Moreover, glycopeptides generally give incomplete fragmentation and poor MS/MS spectra. The mouse zona pellucida (ZP) sample studied here included many glycopeptides identified by a human expert that either were not selected for MS/MS or had ambiguous MS/MS spectra.

Single-MS alone also falls short, but the reason is less apparent, since high-resolution single-MS analysis of detached N-glycans, as performed by our previous tool, Cartoonist, is often quite successful.7 Mammalian glycans are typically composed of 5 types of monosaccharides, each with a distinct residue mass: Hexose (162.05), HexNAc (203.08), Fucose (146.06), NeuAc (291.10), and NeuGc (307.09). Because there are only 5 masses (compared with 19 distinct amino acid residue masses), the composition of modest-sized glycans can usually be determined by mass alone. So at first glance, glycopeptide analysis appears tractable: in a purified sample, there are a modest number of tryptic peptides with potential N-glycosylation sites, and hence it should be possible to match each single-MS peak with the mass of a tryptic peptide plus a glycan.

Unfortunately, because peptides and glycans are composed of the same types of atoms, there turn out to be many coincidences in the masses of glycopeptides. For example, in the zona pellucida sample, there are peaks at m/z 1820.67, 1365.75, and 1092.80, all different charge states of a molecule of mass 5460.25. This molecule could be either the tryptic peptide PRPETLQFTVDVFHFANSSR from ZP3 + NeuGc2FucHex5HexNAc6, or the tryptic peptide IGNGTRAHILPLK from ZP2 + NeuGcNeuAcFucHex7HexNAc9. Both glycans are plausible and have been observed biologically. (As we shall see, MS/

Journal of Proteome Research 2007, 6, 3995-4005

Published on Web 08/30/2007

research articles

Goldberg et al.

Table 1. Examples of Related Glycopeptides Whose Mass Agrees to within 0.001 Daltons (a)

mass (Da)  HexNAc  Hex  Fuc  NeuAc  NeuGc  amino acid residues
146.058       0     0    1     0      0    S S K -R     A P D -H     G P E -H
203.079       1     0    0     0      0    T T -N D     T T -Q E     P E Y -W     P D Q -H
 15.995       0     0    0    -1      1    -A S     -F Y     G S -Q     -G S D -E
              0     1   -1     0      0
 41.026       1    -1    0     0      0    -S Q     A A -T     -S -S T N     -T -T D K
 57.021       1     0   -1     0      0    G     -G N     -A Q     A D -E
 88.016      -1     0    0     1      0    T D -Q     S -N D     S -Q E     -G -G S D
104.011      -1     0    0     0      1    -G -T M M     -A -S M M
129.043       0    -1    0     1      0    E     -G A D     -S T D     -V L D     -N D Q
145.037       0    -1    0     0      1    -G S D     -A T D     -A S E     E -F Y
161.032       0     0   -1     0      1    -P E E     -T M M     S T T -K     S S D -Q
162.053       0     1    0     0      0    S P D -H

(a) The first line of the third row shows that adding a NeuGc and subtracting a NeuAc has the same effect as adding a serine and deleting an alanine: they both increase the mass by 15.995 Daltons. So any peak annotated as a glycopeptide containing an alanine and a NeuAc will have a second possible annotation that substitutes an S for the A and a NeuGc for the NeuAc.
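The coincidences in Table 1 can be checked numerically. The sketch below is an illustration only and is not part of Peptoonist: it assumes standard monoisotopic residue masses (the text quotes the monosaccharide masses to two decimals only), and the helper names `glycan_delta`, `residue_delta`, and `coincides` are hypothetical.

```python
# Monoisotopic residue masses in Daltons. These are standard reference values
# assumed for this illustration; they are not taken from the paper.
MONOSACCHARIDE = {
    "HexNAc": 203.07937,
    "Hex": 162.05282,
    "Fuc": 146.05791,
    "NeuAc": 291.09542,
    "NeuGc": 307.09033,
}
AMINO_ACID = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}


def glycan_delta(changes):
    """Mass change for added/removed monosaccharides, e.g. {"NeuGc": 1, "NeuAc": -1}."""
    return sum(MONOSACCHARIDE[name] * count for name, count in changes.items())


def residue_delta(spec):
    """Mass change for a Table 1 style entry such as "S S K -R" (minus = residue removed)."""
    total = 0.0
    for token in spec.split():
        sign = -1.0 if token.startswith("-") else 1.0
        total += sign * AMINO_ACID[token.lstrip("-")]
    return total


def coincides(changes, spec, tolerance=0.001):
    """True when the glycan change and the residue change agree within the tolerance (Da)."""
    return abs(glycan_delta(changes) - residue_delta(spec)) <= tolerance
```

For example, `coincides({"Fuc": 1}, "S S K -R")` checks the first row of the table, and `coincides({"NeuGc": 1, "NeuAc": -1}, "-A S")` checks the case described in the table footnote.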

MS analysis reveals that the first interpretation is much more likely.) Such coincidences are very common, because there are many ways for a sum of monosaccharide residues to equal a sum of amino acid residues. Table 1 lists a few examples. The first row shows that the mass of a fucose equals the mass of two serines plus a lysine minus an arginine. So if a peak matches a glycopeptide containing at least one fucose and one arginine, it also matches a glycopeptide without the fucose and with the arginine replaced by two serines and a lysine. Of the $5^3 - 1 = 124$ ways of adding or subtracting at most one of each of the 5 common mammalian monosaccharides, 54 match a peptide fragment within 0.001 Dalton, and 92 match within 0.01 Dalton. Table 1 shows 12 of these matches.

Mass coincidences present the largest obstacle in automated analysis of samples containing 1000s of possible glycopeptides, the cross-product of 100s of potential glycans and 10s or 100s of tryptic peptides containing potential N-glycosylation sites (NXS and NXT sequences). Peptoonist handles this difficulty by using MS/MS to identify which few peptides are actually glycosylated; for this, it needs only a single good MS/MS spectrum (of any glycoform) for each site. But then, unlike pure MS/MS approaches, Peptoonist expands the initial identifications into many more identifications by matching the now reduced set of possible glycopeptides with the peaks in the single-MS profile. Thus, we believe that Peptoonist gives a more thorough glycopeptide analysis than was previously possible. We think that it will become an important tool for characterizing glycopeptides, either as the primary method or combined with other techniques such as MALDI screening and MS$^n$ experiments. A high-level overview of the single-MS part of Peptoonist appears in a recent survey.8

Related Work. Automated peptide identification by mass spectrometry is now routine, available via numerous commercial and free software packages.9-11 Most of these programs can handle simple post-translational modifications and can be used for tentatively identifying attachment sites of N-glycans12,13 because enzymatic removal of an N-glycan converts an Asn to Asp. (H$_2$$^{18}$O can be used to distinguish this conversion from chance deamidations.) Since the sugars added by N-glycosylation have masses ranging from about 1200 Daltons to over 3500 Daltons, the identification of glycopeptides cannot be done by treating them as simple post-translational modifications. Algorithms have been described for identifying detached glycans from single MS7,14 or by MS/MS.15-21 But there are very few techniques and tools that give the complete picture by automatically identifying families of glycoforms at glycosylation sites.

One such technique has been described by Lebrilla et al.22 This method digests the sample with the nonspecific enzyme mixture Pronase and then performs single-FTMS (Fourier Transform Mass Spectrometry) without LC. One of the advantages of this method is that it digests naked peptides down to single residues that cannot be confused with glycopeptides. On the other hand, Pronase digestion gives glycopeptides with very small peptides and nonspecific cleavage, and hence even more mass coincidences. The most complex glycoprotein analyzed by Lebrilla et al. has 3 potential glycosylation sites and 20 glycans. The mixture of ZP2 and ZP3, analyzed here, has 13 potential sites and over 100 distinct glycans.

Wu et al.23 have described an approach using LC-MS/MS that takes advantage of the observation that glycopeptides with the same base peptide tend to elute together. This approach identifies likely glycopeptides with the same base peptide, based upon MS/MS signatures, and then deduces the likeliest base peptide mass. The peptide mass and elution time are then used to identify the most likely glycosylation sites, and in turn, the likely set of glycoforms (from 22 possibilities). As argued above, MS/MS approaches limit attention to glycopeptides that both trigger MS/MS and give reasonable fragmentation, and hence they are likely to be less sensitive than Peptoonist’s combined approach.

Overview of Peptoonist. Peptoonist processes a collection of LC-MS and MS/MS spectra from a digest of a simple glycoprotein sample, either a purified protein or a small number of proteins. There are four basic steps:

1. Recalibration. Software recalibration of m/z measurements is essential for achieving the optimal mass accuracy from a QTOF (Quadrupole Time-of-Flight) instrument. Peptoonist uses peaks from unmodified peptides, which fragment well and hence can be reliably identified by MS/MS, as “ground truth” for the recalibration of both the single-MS and MS/MS spectra. We found that single-MS and MS/MS scans often require different recalibrations.

2. Quality Filtering of Single-MS Peaks. The single-MS spectra are inspected for series of peaks that match the isotope ratios of glycopeptides. This is done by fitting a theoretically generated set of Gaussians to the peak envelope. The fit determines the +0 (monoisotopic) mass (containing only $^{1}$H, $^{12}$C, $^{14}$N, etc.), along with the charge. The closeness of the fit gives the quality of the envelope.

3. MS/MS Scoring. Each MS/MS spectrum from a high-quality single-MS precursor peak is scored against all possible glycopeptides, generated from all the potentially N-glycosylated peptides in the sample (not necessarily tryptic or even semitryptic), and an N-glycan library, generated by production rules as in Cartoonist.7 The base peptides of the high-scoring glycopeptides give a small set of peptides with glycosylation.

4. Glycopeptide Annotation of Single-MS Peaks. In the final step, we score peaks from step (2) that have masses matching glycopeptides with base peptides from step (3). The score depends upon both local evidence, such as the closeness of the matches of theoretical and observed masses and isotope envelopes, and global information, such as single-MS peaks representing other charge states or other members of the same glycopeptide family (same peptide and similar glycan). We also allow for special-case global information. For example, murine glycopeptides containing sialic acids would be expected to contain variants with NeuAc substituted for NeuGc.
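Step 3 starts from the peptides that could carry an N-glycan, that is, those containing the consensus sequence Asn-X-Ser or Asn-X-Thr with X not Pro. A minimal sketch of that test is given below; it is an illustration rather than Peptoonist's own code, and the names `SEQUON` and `sequon_positions` are hypothetical.

```python
import re

# N-glycosylation sequon: Asn, then any residue except Pro, then Ser or Thr.
# The lookahead keeps the match one residue long, so overlapping sequons
# (for example the two Asn in "NNSS") are both reported.
SEQUON = re.compile(r"N(?=[^P][ST])")


def sequon_positions(peptide):
    """Return the 0-based indices of Asn residues that sit in an N-X-S/T sequon."""
    return [match.start() for match in SEQUON.finditer(peptide)]


def is_candidate(peptide):
    """True when the peptide contains at least one potential N-glycosylation site."""
    return bool(sequon_positions(peptide))
```

Applied to the ZP3 peptide PRPETLQFTVDVFHFANSSR, `sequon_positions` reports the Asn in the NSS motif; a peptide such as one containing only N-P-S is rejected because Pro occupies the X position.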

Methods

Experimental. Sample Preparation. The sample contained mouse zona pellucida proteins ZP2 and ZP3. Purified ZP was isolated from flash-frozen mouse ovaries obtained from 12-14-week-old wild-type mice. Tryptic digests were analyzed by nanoLC-ES-MS/MS using a reverse-phase nanoHPLC system connected to a QTOF mass spectrometer. The results were analyzed by hand and have been described previously8 along with more details of sample preparation and a preliminary description of automated data analysis.

Algorithms. Recalibration. The MS/MS spectra are processed by the publicly available program X!Tandem11 to identify spectra of unmodified peptides. For all identified spectra, the peptide and its retention time are recorded in a list of validated peptides. Next, the single-MS spectra are scanned to find single-MS peaks of any charge that have a mass and retention time matching a validated peptide on the list. Each match is used to generate a point (x, y) where x is the m/z of the single-MS peak, and y is the difference between the actual m/z and the theoretical m/z of the peptide. Recalibration of the single-MS spectra is done by performing a robust line-fitting algorithm as described previously.24 This recalibration line is used to recalibrate all the single-MS spectra. For the MS/MS spectra, Peptoonist uses a custom program that performs a recalibration for each identified MS/MS spectrum, based on its observed b- and y-ions, and then averages these recalibration lines to obtain a single overall recalibration line that is used to recalibrate all the MS/MS spectra. For the mouse ZP3 spectra the slopes of the individual MS/MS recalibration lines were very consistent, averaging -0.000223 with a standard deviation of 0.000008.

Envelope Recognition. Each single-MS spectrum is centroided, and the resulting peaks are considered one at a time starting with the most intense one. The charge C of a peak is estimated by looking at the distance to neighboring peaks, and the resulting region of the (uncentroided) spectrum is fitted to a sum of Gaussians spaced 1/C apart. Fitting uses the nonlinear minimization routine dfpmin from Numerical Recipes.25 The optimization is over three scalar parameters: the position of the series of Gaussians, the width (variance) of the Gaussians, and an overall scale factor. The Gaussians all have the same width, and a single scale factor determines all their heights, because their relative heights are precomputed based on isotope abundances. The relative heights are computed using the total mass corresponding to the peak, and assuming a generic glycopeptide has an atomic formula containing 50% Hydrogen, 30% Carbon, 5% Nitrogen, and 15% Oxygen atoms. These percentages were computed from the 58 hand-annotated glycopeptides in Table 2, and we verified that other glycopeptides had similar compositions.

The starting point for the dfpmin iteration sets the position parameter to match the Gaussian of highest theoretical abundance to the tallest observed peak in the region. Then dfpmin is called two additional times, with the starting point shifted 1/C to the left, and to the right. The objective function minimized by dfpmin is the sum of squares of the distances between the intensity of points of the spectrum and the corresponding value of the sum of Gaussians. The range of spectrum over which the norm is computed is chosen based on isotope abundances. The range is selected to include all isotope peaks with an intensity of 10% or more of the most intense isotope. A baseline correction is performed on the spectrum before calling dfpmin. The current version of Peptoonist does not explicitly consider overlapping envelopes, and so may fail for the (rare) case of two glycopeptides whose peaks overlap and have similar intensities.

MS/MS Analysis. MS/MS analysis examines each MS/MS spectrum that corresponds to a single-MS envelope found in the previous step, scores it against all possible theoretical glycopeptides, and returns the best score. Each score is computed by searching the spectrum for three types of theoretically generated fragment peaks: (1) b- or y-ions from the base peptide, (2) glycan fragments, and (3) the entire peptide plus a glycan fragment. The third type of fragment is much more likely to be multiply charged than the first two types. Only MS/MS spectra with a strong peak of type (2) are considered,26 and then their score is based on types (1) and (3). The reason for this choice is that the score is used to determine the base peptide, and fragments of type (2) are of no help in distinguishing between peptides. In detail, there must be a peak of type (2) whose intensity ranks it as one of the top 10 peaks. Then the ranks of the peaks of types (1) and (3) are converted to scores and summed. The rank to score function is $1/(1 + (\mathrm{rank}/100)^2)$. This simple scorer suffices, because the database is relatively small.

Single-MS Envelope Quality. The quality of a candidate peak envelope depends on two local factors, the difference between the theoretical and observed m/z’s, and the quality of the fit to the sum of Gaussians. The table of theoretical glycopeptide masses is computed using Cartoonist.7 Cartoonist generates about 145,000 cartoons using biosynthetic rules, which build up each glycan antenna as a series of lactosamine units followed by one of nine capping units. Corresponding to these cartoons are approximately 6000 different glycan compositions. The table of theoretical glycopeptide masses consists of all sums of a glycan composition and a peptide. The set of peptides used will typically be either all tryptic peptides, or the peptides found in the MS/MS analysis. The tolerance used in matching the

Table 2. Fifty-Five Glycopeptides Found by Both Hand Annotation and by Peptoonist (a)

 #  score  ret. time (mins)  theor. mass  obs. m/z  chg  delta  qual  NAc  Hex  Fuc  Ac  Gc
 1  12.73             53.65      5079.13   1693.66   +3  -0.01   2.6    5    6    1   1   1
 2  12.57             53.75      5444.26   1361.75   +4  -0.01   1.4    6    7    1   1   1
 3  12.39             53.75      5460.25   1820.67   +3  -0.01   2.7    6    7    1   0   2
 4  12.08             53.85      5095.12   1698.98   +3  -0.00   2.3    5    6    1   0   2
 5  12.06             59.28      5589.30   1863.67   +3   0.00   3.7    6    6    1   1   2
 6  11.86             59.28      5605.29   1869.00   +3  -0.00   3.1    6    6    1   0   3
 7  11.72             53.96      5298.20   1766.66   +3  -0.00   2.7    6    6    1   0   2
 8  11.33             53.03      4625.98   1542.64   +3  -0.01   1.5    5    5    1   0   1
 9  10.87             58.25      4788.03   1596.64   +3   0.00   0.9    5    6    1   0   1
10  10.70             58.87      5282.21   1321.24   +4  -0.00   1.0    6    6    1   1   1
11  10.59             58.15      5751.35   1438.50   +4   0.00   1.2    6    7    1   1   2
12  10.17             57.95      5153.16   1718.32   +3   0.00   2.8    6    7    1   0   1
13  10.15             53.96      4772.04   1591.32   +3  -0.01   2.1    5    6    1   1   0
14   9.69             59.07      5808.37   1936.67   +3   0.00   3.1    7    6    1   0   3
15   9.65             58.97      5386.22   1795.99   +3   0.00   1.8    5    6    1   1   2
16   9.56             57.54      4991.11   1664.32   +3  -0.00   2.2    6    6    1   0   1
17   9.53             59.48      4422.90   1474.96   +3  -0.00   0.9    4    5    1   0   1
18   9.37             58.25      5735.35   1434.50   +4   0.00   1.8    6    7    1   2   1
19   9.29             58.87      5767.34   1442.50   +4   0.00   1.0    6    7    1   0   3
20   9.11             59.58      5370.22   1790.66   +3  -0.00   3.7    5    6    1   2   1
21   9.07             59.38      5792.38   1448.76   +4  -0.00   1.0    7    6    1   1   2
22   9.01             54.36      5063.13   1688.32   +3  -0.00   4.2    5    6    1   2   0
23   8.93             59.17      5573.30   1394.01   +4  -0.00   1.5    6    6    1   2   1
24   8.92             59.48      5402.21   1801.32   +3  -0.00   2.1    5    6    1   0   3
25   8.50             54.16      5241.18   1311.00   +4  -0.01   1.8    5    7    1   1   1
26   8.03             53.96      4406.90   1469.63   +3  -0.01   1.2    4    5    1   1   0
27   8.02             59.38      4609.98   1537.31   +3  -0.00   3.5    5    5    1   1   0
28   8.00             53.03      5137.17   1284.99   +4   0.00   3.1    6    7    1   1   0
29   7.99             53.65      5428.26   1357.75   +4  -0.00   2.7    6    7    1   2   0
30   7.78             53.55      4480.94   1494.30   +3  -0.00   1.1    5    6    1   0   0
31   7.66             49.36      4115.81   1372.62   +3   0.00   2.0    4    5    1   0   0
32   7.52             53.65      5257.17   1752.99   +3  -0.01   3.4    5    7    1   0   2
33   7.50             59.48      3588.62   1196.94   +3  -0.01   1.4    3    3    1   0   0
34   7.18             58.87      4260.85   1420.95   +3  -0.00   1.6    4    4    1   0   1
35   7.06             59.48      5354.23   1339.24   +4  -0.00   4.8    5    6    1   3   0
36   6.98             58.87      4713.99   1571.97   +3  -0.00   1.5    4    5    1   1   1
37   6.79             59.48      6011.45   1503.51   +4  -0.00   1.3    8    6    1   0   3
38   6.46             58.56      4318.89   1440.30   +3  -0.00   1.3    5    5    1   0   0
39   6.45             59.38      3953.76   1318.62   +3  -0.01   1.3    4    4    1   0   0
40   6.39             59.48      5776.38   1444.75   +4   0.01   2.3    7    6    1   2   1
41   6.32             58.66      4729.99   1577.30   +3   0.00   1.5    4    5    1   0   2
42   6.12             52.73      4846.07   1615.98   +3   0.00   2.0    6    7    1   0   0
43   6.10             54.26      4975.12   1244.50   +4  -0.01   4.4    6    6    1   1   0
44   6.09             53.85      4584.95   1528.97   +3  -0.01   1.8    4    6    1   0   1
45   5.99             59.38      5995.45   1499.51   +4   0.00   1.4    8    6    1   1   2
46   5.88             53.96      3750.68   1250.94   +3  -0.01   0.6    3    4    1   0   0
47   5.68             49.26      4277.86   1426.62   +3   0.00   2.2    4    6    1   0   0
48   5.65             53.85      4568.96   1523.64   +3  -0.00   4.3    4    6    1   1   0
49   5.21             59.07      3895.71   1299.28   +3  -0.01   1.3    3    3    1   0   1
50   4.81             54.16      4057.77   1353.29   +3  -0.01   1.8    3    4    1   0   1
51   4.60             59.17      3791.70   1264.62   +3  -0.01   1.9    4    3    1   0   0
52   4.41             58.77      4156.84   1386.30   +3  -0.01   3.0    5    4    1   0   0
53   4.00             53.75      3912.73   1304.95   +3  -0.01   2.3    3    5    1   0   0
54   3.97             59.58      4244.85   1415.63   +3  -0.01   4.6    4    4    1   1   0
55   3.91             54.16      3385.54   1129.27   +3  -0.02   2.7    2    3    1   0   0

(a) Of the 58 hand annotated spectra, the 55 found by the program are listed in this table in order of decreasing score. Each row describes only the highest-scoring envelope (over different charge states and spectra) for one glycopeptide, with the retention time referring to this best spectrum. Only the preferred isomer (monosaccharide composition) is given for each glycopeptide mass. The Delta column is the difference between the theoretical m/z and the recalibrated m/z. The columns labeled NAc/Hex/Fuc/Ac/Gc give the composition, that is, the number of HexNAc, Hexose, Fucose, NeuAc, and NeuGc monosaccharides in the glycopeptide. Note that the quality ratings, for which lower is better, are all less than 5.0.
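The theoretical masses in Table 2 are sums of a peptide mass and a glycan composition. The sketch below illustrates that sum; it is not Peptoonist's code. It assumes standard monoisotopic residue masses and the peptide mass 2347.18 quoted in the Results for ZP3.257-276, and the names `RESIDUE_MASS`, `COLUMNS`, and `glycopeptide_mass` are hypothetical.

```python
# Monoisotopic residue masses in Daltons (standard reference values, assumed here).
RESIDUE_MASS = {
    "HexNAc": 203.07937,
    "Hex": 162.05282,
    "Fuc": 146.05791,
    "NeuAc": 291.09542,
    "NeuGc": 307.09033,
}
# Column order of the composition in Table 2: NAc, Hex, Fuc, Ac, Gc.
COLUMNS = ("HexNAc", "Hex", "Fuc", "NeuAc", "NeuGc")
# Mass of the peptide PRPETLQFTVDVFHFANSSR as quoted in the Results.
ZP3_PEPTIDE_MASS = 2347.18


def glycopeptide_mass(composition, peptide_mass=ZP3_PEPTIDE_MASS):
    """Peptide mass plus the residue masses of the glycan composition (Da)."""
    glycan = sum(RESIDUE_MASS[name] * count for name, count in zip(COLUMNS, composition))
    return peptide_mass + glycan
```

With these assumed masses, `glycopeptide_mass((5, 6, 1, 1, 1))` reproduces the first row of Table 2 to within about 0.01 Da. The same function gives identical masses for the compositions (5, 8, 0, 1, 1) and (5, 7, 1, 0, 2), because NeuAc plus Hex and NeuGc plus Fuc have the same mass; this is why such isomers cannot be separated by mass alone.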

recalibrated m/z (the position of the +0 Gaussian) to the table of theoretical masses (converted to m/z) is based on the scatter of points in the recalibration step described above. For the spectra used here, that tolerance is 0.05 Th (Daltons/charge).

The other local factor is the quality of fit, which uses a more sophisticated distance than the sum of squares used for dfpmin. The sum of squares is $\sum_i (y_i - G(x_i))^2$, where G is the sum-of-Gaussians function, and the $(x_i, y_i)$’s give the m/z’s and intensities of points in the spectrum. Even when G tracks the spectrum closely, $(y - G(x))$ can be very large if the derivative $G'(x)$ is large. So instead we use $d(G, (x, y))$, the distance between (x, y) and the curve G, which is estimated by dividing $(y - G(x))$ by the square root of $1 + G'(x)^2$. The first component of the fit quality is the total distance $D = \sum_i d(G, (x_i, y_i))$ summed over the points of the range of the spectrum corresponding to the envelope. In this computation, the Gaussians are positioned using the offset found earlier from dfpmin using the sum of squares. Unlike the dfpmin step, in this computation we need to ensure that for low mass peptides the range extends below the +0 isotope peak. So whenever the +0 abundance is more

Table 3. Glycopeptides Found by the Program that Are Not on the Hand-Annotated List (a)

 #  score  ret. time (mins)  theor. mass  obs. m/z  chg  delta  qual  NAc  Hex  Fuc  Ac  Gc
 1   9.93             53.14      5825.39   1457.01   +4  -0.00   1.2    7    8    1   0   2
 2   9.66             58.15      6132.48   1533.76   +4   0.01   1.1    7    8    1   0   3
 3   9.62             52.93      5809.39   1453.01   +4  -0.00   1.6    7    8    1   1   1
 4   9.45             58.05      6116.48   1529.76   +4   0.00   1.3    7    8    1   1   2
 5   8.90             58.87      4933.07   1644.98   +3  -0.00   1.9    5    5    1   0   2
 6   8.60             52.52      5518.30   1380.25   +4  -0.00   1.3    7    8    1   0   1
 7   8.51             53.44      5663.33   1416.50   +4   0.00   1.0    7    7    1   0   2
 8   8.23             52.63      6190.52   1548.26   +4   0.01   2.2    8    9    1   0   2
 9   7.70             53.24      4950.08   1650.65   +3  -0.00   3.0    5    7    1   0   1
10   7.64             59.38      4917.07   1229.99   +4  -0.01   1.1    5    5    1   1   1
11   7.26             54.36      5501.28   1834.34   +3  -0.00   3.3    7    6    1   0   2
12   7.22             56.92      6028.46   1507.77   +4   0.00   3.3    8    8    1   0   2
13   7.11             64.60      6423.57   1606.51   +4   0.01   1.6    7    8    1   1   3
14   7.10             58.35      6335.55   1584.51   +4   0.01   1.8    8    8    1   0   3
15   6.99             52.83      5606.31   1402.25   +4   0.00   2.6    6    8    1   1   1
16   6.95             65.92      6058.44   1515.25   +4   0.01   1.9    6    7    1   1   3
17   6.91             57.95      6100.49   1525.77   +4   0.00   2.4    7    8    1   2   1
18   6.82             57.95      6319.56   1580.52   +4   0.00   3.1    8    8    1   1   2
19   6.70             56.62      6497.61   1625.02   +4   0.01   1.4    8    9    1   0   3
20   6.65             53.34      5647.34   1412.50   +4   0.00   1.4    7    7    1   1   1
21   6.54             53.14      5622.31   1406.25   +4   0.00   1.4    6    8    1   0   2
22   6.42             53.55      4934.09   1234.24   +4  -0.01   1.9    5    7    1   1   0
23   6.38             53.03      5315.22   1329.49   +4  -0.00   2.3    6    8    1   0   1
24   6.37             57.23      6481.61   1621.02   +4   0.01   1.4    8    9    1   1   2
25   6.32             66.03      6042.44   1511.25   +4   0.01   3.2    6    7    1   2   2
26   6.29             59.58      5136.15   1284.74   +4   0.00   1.5    6    5    1   0   2
27   6.26             64.80      6407.58   1602.52   +4   0.01   3.3    7    8    1   2   2
28   6.25             64.19      6439.57   1610.51   +4   0.01   1.8    7    8    1   0   4
29   6.18             48.85      5356.24   1339.74   +4   0.01   1.8    7    7    1   0   1
30   5.91             52.52      6174.52   1544.27   +4   0.00   2.5    8    9    1   1   1
31   5.90             52.93      6012.47   1503.76   +4   0.00   3.3    8    8    1   1   1
32   5.79             65.82      6074.43   1519.25   +4   0.01   2.0    6    7    1   0   4
33   5.64             52.83      5987.44   1497.51   +4   0.00   1.6    7    9    1   0   2
34   5.56             57.33      6173.50   1544.01   +4   0.01   4.4    8    7    1   0   3
35   5.43             53.96      4463.93   1488.64   +3  -0.01   1.7    5    4    1   0   1
36   5.28             58.97      6157.51   1540.01   +4   0.01   3.2    8    7    1   1   2
37   5.27             62.86      4684.02   1561.98   +3   0.00   3.5    6    6    1   0   0
38   5.21             52.12      5883.43   1471.51   +4   0.01   1.7    8    9    1   0   1
39   5.18             49.56      4829.06   1207.97   +4   0.01   2.2    6    5    1   0   1
40   5.13             52.42      5502.30   1376.26   +4  -0.00   2.8    7    8    1   1   0
41   5.03             59.79      5120.15   1280.73   +4   0.01   4.3    6    5    1   1   1
42   4.77             48.74      5721.37   1431.00   +4   0.01   3.1    8    8    1   0   1
43   4.77             58.15      5866.41   1467.26   +4   0.01   2.3    8    7    1   0   2
44   4.74             57.95      6294.53   1574.26   +4   0.01   2.2    7    9    1   0   3
45   4.74             66.03      6464.60   1616.76   +4   0.02   3.5    8    7    1   1   3
46   4.73             59.17      3426.57   1142.93   +3  -0.01   1.3    3    2    1   0   0
47   4.65             66.54      6480.59   1620.76   +4   0.01   2.7    8    7    1   0   4
48   4.65             52.73      5299.22   1325.50   +4   0.00   4.4    6    8    1   1   0
49   4.64             48.34      5680.35   1420.75   +4   0.01   2.6    7    9    1   0   1
50   4.43             53.24      4642.99   1548.31   +3  -0.01   4.6    5    7    1   0   0
51   4.17             64.39      6642.64   1661.26   +4   0.02   3.8    8    8    1   0   4
52   4.15             59.38      4771.02   1590.99   +3  -0.02   3.9    5    4    1   0   2
53   4.10             60.61      3223.49   1075.25   +3  -0.01   0.8    2    2    1   0   0
54   4.10             52.42      6352.57   1588.77   +4   0.00   1.9    8   10    1   0   2

(a) We derived the criteria for inclusion in this table from the hand-annotated list of Table 2, specifically a peak envelope quality of at most 5.0, and a score of at least 3.9. The column meanings are the same as in Table 2.
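The sum-of-Gaussians envelope from Envelope Recognition and the point-to-curve distance used for the fit quality can be sketched together. This is an illustrative sketch under stated assumptions, not Peptoonist's implementation: the width parameter is taken to be a standard deviation, the relative isotope heights are passed in rather than computed, the distance is taken as an absolute value, the minimization itself (dfpmin) is omitted, and all function names are hypothetical.

```python
import math


def gaussians(x, position, width, scale, charge, rel_heights):
    """Yield (value, slope) at m/z x for each Gaussian of the envelope.

    position    m/z of the +0 Gaussian
    width       common standard deviation of the Gaussians (assumption)
    scale       single overall height factor
    rel_heights precomputed relative isotope heights, +0 first
    """
    for k, height in enumerate(rel_heights):
        u = x - (position + k / charge)  # Gaussians are spaced 1/charge apart
        value = scale * height * math.exp(-u * u / (2.0 * width ** 2))
        yield value, -u / width ** 2 * value


def envelope(x, *params):
    """G(x): the sum of Gaussians."""
    return sum(value for value, _ in gaussians(x, *params))


def envelope_slope(x, *params):
    """G'(x): the derivative of the sum of Gaussians."""
    return sum(slope for _, slope in gaussians(x, *params))


def sum_of_squares(points, *params):
    """Objective of the fit: sum over points of (y - G(x))^2."""
    return sum((y - envelope(x, *params)) ** 2 for x, y in points)


def total_distance(points, *params):
    """D: sum over points of |y - G(x)| / sqrt(1 + G'(x)^2)."""
    return sum(
        abs(y - envelope(x, *params)) / math.sqrt(1.0 + envelope_slope(x, *params) ** 2)
        for x, y in points
    )
```

On a steep flank of a peak, a small horizontal offset produces a large vertical residual; dividing by $\sqrt{1 + G'(x)^2}$ shrinks that residual to roughly the perpendicular distance to the curve, which is the motivation given in the text for preferring D over the plain sum of squares when judging quality.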

than 2/5 the max abundance, the range is extended 1/C below the m/z of the +0 isotope.

Even when G has a good fit to the envelope, G shifted by 1/C might match nearly as well, making the determination of the +0 isotope uncertain. Thus the quality of fit also incorporates the difference between the best and second best fit. If $D_1$ is the distance of the best fit and $D_2$ the second best, we boost the distance by $4 \times (D_1/D_2)^2$, making the final quality score $Q = D_1 + 4 \times (D_1/D_2)^2$. Recall that we started the nonlinear minimization routine dfpmin with three different starting positions, and so we get a $D_2$ value (if any) from one of the alternate starts.

Finding and computing the quality of envelopes is the slowest part of the algorithm. It takes about 12 h of CPU time (Sun-Fire-V440) to process an LC run of 1000 spectra. The processing of a single-MS spectrum stops when either 150 envelopes have been found, or else 40 consecutive envelopes of quality less than 2.0 are processed.

Final Scoring. A score is assigned to each envelope, and the score of a glycopeptide identification is the (weighted) sum of scores from envelopes assigned not only to that glycopeptide but also to related glycopeptides, thus taking global information into account. Ideally, the score would be a principled function based on the statistics of spectra from known glycopeptides. However, given the difficulties in producing pure glycopeptide standards and in definitively identifying ambiguous peaks in a

Table 4. Glycopeptide Interpretations with the Highest MS/MS Support (a)

MS/MS score  chg  m/z      ret time  MS qual  N  H  F  NA  NG  peptide
10.04        3    1353.27  62.0      1.8      3  5  0  1   0   ZP3: R.PRPETLQFTVDVFHFANSSR
10.03        3    1318.62  66.4      1.3      4  4  1  0   0   ZP3: R.PRPETLQFTVDVFHFANSSR
 9.76        3    1372.62  53.1      2.0      4  5  1  0   0   ZP3: R.PRPETLQFTVDVFHFANSSR
 9.30        3    1299.27  65.9      1.3      3  4  0  1   0   ZP3: R.PRPETLQFTVDVFHFANSSR
 8.02        3    1474.95  74.1      0.9      4  6  0  1   0   ZP3: R.PRPETLQFTVDVFHFANSSR
 7.94        4    1351.23  66.6      1.0      5  6  1  0   3   ZP3: R.PRPETLQFTVDVFHFANSSR
 5.92        7    1006.09  79.8      1.2      6  5  0  3   1   ZP2: S.HAICAPDLSVACNATHMTLTIPEFPGKLESVDFGQ
 5.51        7    1010.65  77.1      1.8      8  8  1  0   0   ZP2: F.SSHAICAPDLSVACNATHMTLTIPEFPGKLESVDFGQ
 5.49        6    1184.25  79.1      2.9      7  7  0  1   1   ZP3: P.RPETLQFTVDVFHFANSSRNTLYITCHLKVAPAN
 5.48        3    1142.93  65.7      1.3      3  2  1  0   0   ZP3: R.PPRPETLQFTVDVFHFANSSR
 5.45        6     962.75  69.5      1.4      4  4  1  1   0   ZP2: F.SSHAICAPDLSVACNATHMTLTIPEFPGKLESVDFG
 5.32        5    1112.89  63.0      1.6      4  9  2  1   1   ZP2: A.PDLSVACNATHMTLTIPEFPGK
 5.23        4    1120.98  54.4      0.6      8  5  0  0   0   ZP3: A.PANQIPDKLNKACSFNKT
 5.20        5    1121.79  60.4      1.1      7  6  2  3   0   ZP3: A.PANQIPDKLNKACSFNKT

(a) The “MS qual” column is quality based on the parent-ion (single-MS) spectrum; lower numbers are better. This is the same measure which is labeled “Qual” in Table 2 and Table 3. The “MS/MS score” column is computed from the best MS/MS spectrum for that parent ion and measures how many fragments (weighted by intensity rank) support the glycopeptide interpretation; higher numbers are better. The glycopeptide is given by composition: column N gives the number of HexNAcs, H gives Hexoses, F Fucoses, NA NeuAcs, and NG NeuGcs.
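The rank-weighted MS/MS score reported in Table 4 follows the procedure given under MS/MS Analysis. The sketch below illustrates it under several assumptions that the text does not specify: ranks start at 1 for the most intense peak, a fragment matches a peak when their m/z values differ by at most a hypothetical tolerance `tol`, unmatched fragments contribute nothing, and a spectrum without a top-10 glycan fragment scores 0. The function and parameter names are hypothetical.

```python
def rank_to_score(rank):
    """The rank-to-score function 1 / (1 + (rank/100)^2)."""
    return 1.0 / (1.0 + (rank / 100.0) ** 2)


def msms_score(peaks, glycan_fragment_mzs, scoring_fragment_mzs, tol=0.05, top_n=10):
    """Score one candidate glycopeptide against one MS/MS spectrum.

    peaks                list of (mz, intensity) pairs
    glycan_fragment_mzs  theoretical m/z of type (2) fragments (glycan fragments)
    scoring_fragment_mzs theoretical m/z of type (1) and type (3) fragments
    """
    ranked = sorted(peaks, key=lambda peak: peak[1], reverse=True)

    def best_rank(target):
        # Rank (1 = most intense) of the strongest peak within tol of the target.
        for rank, (mz, _) in enumerate(ranked, start=1):
            if abs(mz - target) <= tol:
                return rank
        return None

    glycan_ranks = [r for r in map(best_rank, glycan_fragment_mzs) if r is not None]
    if not glycan_ranks or min(glycan_ranks) > top_n:
        return 0.0  # no strong glycan fragment: the spectrum is not considered
    ranks = [r for r in map(best_rank, scoring_fragment_mzs) if r is not None]
    return sum(rank_to_score(r) for r in ranks)
```

Because the rank-to-score function falls off slowly (it reaches 1/2 only at rank 100), the score behaves much like a count of supported fragments with a mild preference for intense peaks.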

Figure 1. The typical quality of the additional glycans found by Peptoonist is illustrated in this spectrum. The top view shows a region of the spectrum containing peak families corresponding to three different glycans with composition Sialic4FucHex8HexNAc7, corresponding to entries 13, 27, and 28 in Table 3. The bottom view shows the same spectrum zoomed in on the three glycans. The labels give the m/z values for the theoretically most intense isotope (+3) of these quadruply charged glycans.
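The relative isotope heights used in Envelope Recognition, and the "+3" label in the caption of Figure 1, can be illustrated with a small calculation. The sketch below assumes the generic composition given in the text (50% H, 30% C, 5% N, 15% O by atom count) together with standard natural isotope abundances. The convolution approach and all names are assumptions made for illustration; the paper does not state how Peptoonist computes the heights.

```python
# Generic glycopeptide composition by atom count, as stated under Envelope Recognition.
ATOM_FRACTION = {"H": 0.50, "C": 0.30, "N": 0.05, "O": 0.15}
# Monoisotopic atomic masses (Da); standard reference values.
MONO_MASS = {"H": 1.007825, "C": 12.0, "N": 14.003074, "O": 15.994915}
# Natural abundance of heavier isotopes as {extra neutrons: probability};
# standard approximate values, assumed for this illustration.
HEAVY = {
    "H": {1: 0.000115},
    "C": {1: 0.0107},
    "N": {1: 0.00364},
    "O": {1: 0.00038, 2: 0.00205},
}


def convolve(a, b, n_peaks):
    """Combine two isotope distributions, keeping only the first n_peaks entries."""
    out = [0.0] * n_peaks
    for i, pa in enumerate(a):
        if pa == 0.0:
            continue
        for j, pb in enumerate(b):
            if pb and i + j < n_peaks:
                out[i + j] += pa * pb
    return out


def isotope_heights(mass, n_peaks=10):
    """Relative heights of the +0, +1, ... isotope peaks, tallest scaled to 1."""
    mean_atom_mass = sum(ATOM_FRACTION[e] * MONO_MASS[e] for e in ATOM_FRACTION)
    n_atoms = mass / mean_atom_mass
    dist = [1.0] + [0.0] * (n_peaks - 1)
    for element, fraction in ATOM_FRACTION.items():
        single = [0.0] * n_peaks
        single[0] = 1.0 - sum(HEAVY[element].values())
        for shift, probability in HEAVY[element].items():
            single[shift] = probability
        # One convolution per atom of this element in the generic formula.
        for _ in range(round(n_atoms * fraction)):
            dist = convolve(dist, single, n_peaks)
    top = max(dist)
    return [p / top for p in dist]
```

For a small peptide the +0 peak is the tallest, while for glycopeptides above roughly 6000 Da the maximum moves several peaks to the right, which is why the fit has to determine the +0 position rather than assume it is the tallest peak.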

spectrum, it seems unlikely that large amounts of truly validated spectra will be available any time soon. 4000

Journal of Proteome Research • Vol. 6, No. 10, 2007

Instead, we use the following informal procedure. We transform envelope quality and mass match (both distance

research articles

Automated N-Glycopeptide Identification

Figure 2. Most plausible cartoons for the first 2 glycopeptides of Table 3. The label below each cartoon gives the glycopeptide mass and glycan composition. For example, the first two rows of cartoons are for a glycan with composition “7 8 1 0 2” meaning 7 HexNAc, 8 Hexose, 1 Fucose, 0 NeuAc, and 1 NeuGc.

functions) into a score (with a larger number meaning a better match) using a logistic function. Envelopes for the hand-annotated glycopeptides had qualities ranging from 0.6 to 4.8, so we used the logistic function $1/(1 + 0.2\,e^{Q})$ that drops to 1/2 its maximum at $Q = 4$. Similarly, if $\epsilon$ is the actual error and $E$ is the maximum expected error ($E = 0.05$ Th for the mouse zona pellucida spectra), we used a logistic function $1/(1 + 0.25\,e^{1.8(\epsilon/E)})$ that drops to 1/2 its maximum at $\epsilon/E = 1$. We combined these and used the function $1/(1 + 0.2\,e^{Q} + 0.25\,e^{1.8(\epsilon/E)})$ as the score of an envelope.

The final score of a glycopeptide is the weighted sum of the scores of envelopes from three different classes. The first class consists of envelopes of any charge state that are assigned to the glycopeptide. Only the best envelope from each charge state is used. The score of the best of these is multiplied (weighted) by 4; those from other charge states are multiplied by 0.5. In other words, a glycopeptide that is assigned to envelopes of multiple charge states gets a small boost in score. The second class consists of envelopes that map to other members of the glycopeptide family, that is, those whose glycan has either one more or one fewer monosaccharide than the candidate being scored. The score of the best envelope of any charge state from each family member is multiplied by 1.0. The third (optionally used) class consists of envelopes assigned to glycopeptides whose glycans have the same total number of sialic acids as the glycopeptide being scored (i.e., with NeuAc/NeuGc substitutions). The best envelope of any charge state for each of these is multiplied by 2.0. The sum of the scores of the envelopes from all three classes is used as the final score for the glycopeptide.

Cartoons. For conversion to cartoons (tree topologies), the demerits system of Cartoonist was used. Demerits were applied to cartoons containing anything from the following list: Sialyl Lewis X, LacdiNAc, Sialyl LacdiNAc, two linked sialic acids, an antenna with exactly 2 mannose, bisecting GlcNAc, an antenna with multiple fucose, an absence of base fucosylation, five antennae, a hybrid structure, uncapped GlcNAc, a single long antenna, or lack of a full trimannosyl core.
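As a rough illustration of the envelope and glycopeptide scoring described above, the following Python sketch combines peak quality and m/z error into an envelope score and then forms the weighted sum over the three envelope classes. The function and argument names are hypothetical, the constants are copied as they are printed in the text, and the way Peptoonist actually stores and groups envelopes is not specified in the paper, so this should be read as one possible interpretation rather than the program's implementation.

```python
import math

def envelope_score(quality, mz_error, max_error=0.05):
    """Score one isotope envelope; larger is better.

    quality   -- peak quality Q (lower is better)
    mz_error  -- absolute m/z error in Th
    max_error -- maximum expected error E (0.05 Th for the zona pellucida run)
    The constants 0.2, 0.25 and 1.8 are taken as printed in the text.
    """
    return 1.0 / (1.0 + 0.2 * math.exp(quality)
                  + 0.25 * math.exp(1.8 * mz_error / max_error))

def glycopeptide_score(own_by_charge, family_best, sialic_variant_best=()):
    """Weighted sum over the three envelope classes.

    own_by_charge       -- {charge: best envelope score at that charge}
    family_best         -- best envelope score for each family member
                           (glycan with one monosaccharide more or fewer)
    sialic_variant_best -- optional: best envelope score for each
                           NeuAc/NeuGc-substituted variant
    """
    own = sorted(own_by_charge.values(), reverse=True)
    total = 0.0
    if own:
        total += 4.0 * own[0]        # best charge state, weight 4
        total += 0.5 * sum(own[1:])  # remaining charge states, weight 0.5
    total += 1.0 * sum(family_best)
    total += 2.0 * sum(sialic_variant_best)
    return total

# Hypothetical envelopes, for illustration only.
own = {3: envelope_score(1.0, 0.01), 4: envelope_score(2.5, 0.02)}
family = [envelope_score(1.5, 0.01), envelope_score(3.0, 0.03)]
variants = [envelope_score(2.0, 0.02)]
print(round(glycopeptide_score(own, family, variants), 3))
```

Because family members and sialic acid variants contribute to the sum, a composition with many well-supported biosynthetic neighbors scores higher than an isolated one, which is the property used later to choose between isomers of equal mass.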

Results

For experimental evaluation, we used a mixture of mouse zona pellucida proteins ZP2 and ZP3. Hand analysis (ref 8) reported 58 different glycans (glycoforms) attached to Asn273 of the ZP3 sequence (257) PRPETLQFTVDVFHFANSSR (276) of mass 2347.18, hereafter referred to as ZP3(257-276). Peptoonist confirmed that there is no evidence that any of the other 12 glycosylation sites are occupied. It found 55 of the 58 hand-annotated spectra (Table 2), and it found an additional 54 glycoforms (Table 3). Our examination of the spectra shows that 2 of the 3 glycoforms missed by Peptoonist have very weak support.

For some of the peaks, there are two different explanations using glycans with different compositions but the same mass. For example, line 32 of Table 2 is the +3 peak at m/z 1752.99 of a glycopeptide of mass 5257.17. This can be explained as NeuGcNeuAcHex8HexNAc5 or as NeuGc2FucHex7HexNAc5. We can use the scoring function of Peptoonist to choose: we select the isomer with the larger score. This is a reasonable strategy, because the scores differ in the number of biosynthetic neighbors for each interpretation. Of the 36 ambiguous glycoforms, Peptoonist gave a higher score to the isomer selected by the human expert 32/36 ≈ 89% of the time. The other 4 times the expert’s choice tied with another isomer. We now discuss these results in more detail.


Figure 3. There are clearly defined isotope envelopes for the +4 ions of the NeuGcNeuAc and NeuGc2 members of the family. The evidence for the claimed NeuAc2 member is much less compelling. Arrows mark the position of the +0 isotope.

Table 5. List of Peaks with Two Alternative Glycan Compositions of the Same Mass that Were Not Distinguishable by Peptoonist

| score | mass | first isomer (HexNAc Hex Fuc NeuAc NeuGc) | second isomer (HexNAc Hex Fuc NeuAc NeuGc) |
|---|---|---|---|
| 9.69 | 5808.37 | 7 6 1 0 3 | 7 7 0 1 2 |
| 7.52 | 5257.17 | 5 7 1 0 2 | 5 8 0 1 1 |
| 6.39 | 5776.38 | 7 7 0 3 0 | 7 6 1 2 1 |
| 5.99 | 5995.45 | 8 6 1 1 2 | 8 7 0 2 1 |
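The pairs in Table 5 can be checked with a short calculation. In each row the two isomers differ by exchanging Fuc + NeuGc for Hex + NeuAc, which have the same elemental composition and therefore exactly the same mass. The Python sketch below recomputes the masses under two assumptions that are not stated in the paper: the monosaccharide values are standard monoisotopic residue masses, and the glycopeptide mass is modeled as the quoted peptide mass (2347.18) plus the sum of the residue masses.

```python
# Monoisotopic residue masses in Da (standard values, not taken from the paper).
RESIDUE_MASS = {
    "HexNAc": 203.07937,
    "Hex": 162.05282,
    "Fuc": 146.05791,
    "NeuAc": 291.09542,
    "NeuGc": 307.09033,
}
ORDER = ("HexNAc", "Hex", "Fuc", "NeuAc", "NeuGc")
PEPTIDE_MASS = 2347.18  # ZP3 residues 257-276, as quoted in the Results

def glycopeptide_mass(composition, peptide_mass=PEPTIDE_MASS):
    """Peptide mass plus the residue mass of each monosaccharide (assumed model)."""
    glycan = sum(n * RESIDUE_MASS[name] for n, name in zip(composition, ORDER))
    return peptide_mass + glycan

# (reported mass, first isomer, second isomer), copied from Table 5
TABLE5 = [
    (5808.37, (7, 6, 1, 0, 3), (7, 7, 0, 1, 2)),
    (5257.17, (5, 7, 1, 0, 2), (5, 8, 0, 1, 1)),
    (5776.38, (7, 7, 0, 3, 0), (7, 6, 1, 2, 1)),
    (5995.45, (8, 6, 1, 1, 2), (8, 7, 0, 2, 1)),
]

for reported, first, second in TABLE5:
    m1, m2 = glycopeptide_mass(first), glycopeptide_mass(second)
    print(f"{reported:8.2f}  first={m1:.3f}  second={m2:.3f}  diff={abs(m1 - m2):.6f}")
```

Under these assumptions the two isomers in every row agree in mass, and both fall within a few hundredths of a dalton of the tabulated value, so mass alone cannot separate them.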

Table 6. Effect of Varying the Averaging Window^a

| time period (min) | number of glycopeptides found | minimum score |
|---|---|---|
| 0 | 48 | 2.6 |
| 1 | 55 | 3.9 |
| 2 | 51 | 3.8 |
| 4 | 45 | 3.4 |
| pooled | 55 | 4.0 |

^a The left column gives the interval over which spectra are averaged (0 means no averaging was done). The right two columns show how many of the 58 hand-annotated glycopeptides were found (and the worst score any of them achieved) when Peptoonist was run using those spectra.
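The “pooled” row of Table 6 corresponds to running Peptoonist once per averaging window and keeping, for each glycopeptide, the best score from any window. A minimal Python sketch of that pooling step follows. The glycopeptide identifiers and scores are invented for illustration and are not data from the paper; the function name `pool_best` is likewise hypothetical.

```python
def pool_best(runs):
    """Keep the best (largest) score per glycopeptide across several runs.

    runs -- {window label: {glycopeptide id: score}}
    """
    pooled = {}
    for scores in runs.values():
        for glycopeptide, score in scores.items():
            if score > pooled.get(glycopeptide, float("-inf")):
                pooled[glycopeptide] = score
    return pooled

# Hypothetical scores from three averaging windows (not from the paper).
runs = {
    "1 min": {"gp_A": 9.1, "gp_B": 3.9},
    "2 min": {"gp_A": 8.7, "gp_B": 4.0},
    "4 min": {"gp_A": 7.5},
}
pooled = pool_best(runs)
print(len(pooled), min(pooled.values()))
```

In this toy input pooling finds no glycopeptide that the 1-min run missed and only nudges the minimum score upward, which mirrors the pattern reported in Table 6.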

Base Peptides. Step (3) of Peptoonist finds the glycosylated base peptides from the (quite rare) MS/MS glycopeptide spectra that can be identified. Results are shown in Table 4. The large gap in score between the 6 top-scoring peaks from ZP3(257-276) and the first peak from a different peptide suggests that ZP3(257-276) is the only peptide that is glycosylated, in agreement with the hand analysis. Further evidence for this conclusion is that the sample was trypsin digested, and none of the other top scorers are tryptic. Note that ZP3(257-276), with sequence R.PRPETLQFTVDVFHFANSSR.N, contains an internal R and is cleaved at an R/P bond, and so it might not have been considered the likeliest product of tryptic digestion.

Additional Glycoforms. The hand annotations from Table 2 have peak qualities ranging from 0.6 to 4.8 (lower is better), and glycopeptide assignment scores ranging from 12.7 down to 3.9 (higher is better). Based on this, we accepted additional Peptoonist annotations that fell into a similar range, specifically with peak quality less than 5.0 and glycopeptide assignment scores greater than 3.9. There were 54 such glycopeptides, as listed in Table 3. Figure 1 gives a sense of the quality of these additional glycoforms. It gives two views of a spectrum containing three glycopeptides from Table 3. Ideally, each putative glycopeptide precursor would be subjected to MS/MS at an energy that would elucidate the topology of the glycan. However, even without this information, our previous tool, Cartoonist, can generate plausible cartoons for each peptide. Examples for the first two glycopeptides in Table 3 are shown in Figure 2. Thus the final output of Peptoonist can be (as the name suggests) an annotation of spectral peaks by base peptides and cartoons, typically with far fewer cartoons than shown for the high-mass glycans of Figure 2.

Possible Human Annotation Errors. As mentioned earlier, of the 58 hand-annotated peaks, Peptoonist found 55, as displayed in Table 2. We believe that two of the missing ones may be overinterpretations of the data, namely NeuAc2FucHex6HexNAc6 and NeuAc2FucHex5HexNAc4. The third missing one is probably correct, but its peak envelope has quality too low for Peptoonist to find. Figure 3 shows the peak envelopes for two clearly present glycoforms with NeuAc/NeuGc substitutions (on the right), and the region where the hand-annotated glycoform should be (on the left). Of course, this putative glycoform might have a more compelling presence with a different charge or in a different spectrum, but the program systematically searched for these possibilities without finding them.

Table 7. Results of Running Peptoonist without First Restricting the Set of Base Peptides^a

| best score | #glycopeps | #score ≥ 4.0 | protein:peptide |
|---|---|---|---|
| 12.73 | 235 | 117 | zp3:PRPETLQFTVDVFHFANSSR |
| 12.45 | 210 | 128 | zp2:IGNGTRAHILPLK |
| 11.10 | 212 | 93 | zp3:QGNVSSHPIQPTWVPFR |
| 10.92 | 221 | 98 | zp3:PETLQFTVDVFHFANSSR |
| 9.53 | 210 | 74 | zp2:LADENQNVSEMGWIVK |
| 8.33 | 263 | 53 | zp2:IGNGTR |
| 8.23 | 173 | 41 | zp2:WNPSVVDTLGSEILNCTYALDLER |
| 8.21 | 151 | 32 | zp2:FDMEKWNPSVVDTLGSEILNCTYALDLER |
| 6.01 | 133 | 16 | zp2:WNPSVVDTLGSEILNCTYALDLERFVLK |
| 4.58 | 98 | 6 | zp2:DLISFSFPQLFSRLADENQNVSEMGWIVK |

^a The table shows the results of skipping step (3) of the algorithm, so that base peptides are no longer restricted to those with MS/MS support. Each line gives all the results for one possible base peptide. The first row shows that there were 235 different m/z values that matched a glycopeptide with the base zp3:PRPETLQFTVDVFHFANSSR.

Figure 4. Chromatogram for the ZP2/ZP3 sample labeled with the highest scoring glycopeptides of ZP3(257-276). The graph shows total ion current. Each glycopeptide is labeled with its score and most plausible cartoon. The horizontal line underneath a cartoon gives the range of time over which it eluted.

Isomer Determination. All the hand annotations of ZP3(257-276) have single fucoses. We asked whether Peptoonist’s global scoring function, which takes into account other charge states, NeuAc/NeuGc substitution, and other family members, could reproduce this finding. Initially, 36 of the 55 glycopeptides in Table 2 have more than one biologically plausible monosaccharide composition. For example, the first entry, HexNAc5Hex6FucNeuAcNeuGc, could also be HexNAc5Hex5Fuc2NeuGc2 or HexNAc5Hex7NeuAc2, since they all have the same molecular weight. Of these 36 ambiguous cases, Peptoonist gave the single-fucose explanation a higher score 32/36 ≈ 89% of the time, and a tied-for-first score the other four times. Thus the restriction to isomers with a single fucose was automatically discovered by the program. The four remaining ambiguous cases are shown in Table 5. Changing the details of the scoring function will not resolve these cases, because the global set of envelopes used for scoring is precisely the same for the two isomers.

Experimental Variations of Peptoonist

Time Averaging. The spectra used in the analyses above were “summed spectra”, meaning they represent sums of the recorded single-MS spectra over an elution time period of 1 min. This period contains between 10 and 30 spectra (with an average of about 14), depending upon the timeouts for MS/MS. We investigated whether we would get better results by running Peptoonist multiple times, each time using spectra averaged over a different time period, and pooling the results. Table 6 contains the data. It shows that 1-min periods work better than other fixed periods. For each time period, we observed how many of the 58 hand-annotated spectra were detected. The “minimum score” column gives the smallest (worst) score of all the matches. The last row shows the results of pooling the results of averages over 1, 2, and 4 min, so that for each glycopeptide, Peptoonist used only the best of the three time averages. Pooling found the same 55 glycopeptides as 1-min averaging, although the minimum score was raised from 3.9 to 4.0. Thus, there appears to be little benefit to pooling.

Base Peptides. A natural question to ask is whether the MS/MS of Step (3) is necessary, that is, whether base peptides could be determined simply from single-MS scores. We ran Peptoonist using all possible tryptic peptides, to see which base peptides gave the highest quality matches to envelopes in the spectra. The results for the zona pellucida LC run are in Table 7.
Each row gives the results for one base: the best score of a glycopeptide with that base, the total number of peak envelopes that matched glycopeptides, and the total number of envelopes with score better than 4.0. The known peptide, ZP3(257-276), gave the highest best score, but it is only slightly greater than that of the next best base peptide, and it actually gave fewer matches with scores over 4.0 (see Figure 4). Thus a single-MS analysis would conclude that ZP3(257-276) and ZP2:IGNGTRAHILPLK are almost equally likely, but using MS/MS we can conclude that IGNGTRAHILPLK is an unlikely interpretation. Many of the peaks that could be assigned to a glycopeptide with IGNGTRAHILPLK have an alternate interpretation as a glycopeptide of ZP3(257-276), which is much more likely to be correct. An example was given in the introduction.

Retention Time. We investigated whether adding retention time to Peptoonist would improve its results. For ZP3 there are several very abundant glycopeptides that elute over a wide range of time, limiting the usefulness of this type of analysis (see Figure 4). In fact, all the high-scoring ZP3(257-276) glycopeptides had plausible retention times, so adding a retention time term would not help to disambiguate isomers.
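For the base-peptide experiment above, the candidate bases are tryptic peptides that contain an N-glycosylation sequon (Asn-X-Ser/Thr with X not Pro). The Python sketch below shows one way such a candidate list could be enumerated. It is an illustration under stated assumptions, not Peptoonist's digestion code: the function names, the single missed cleavage, and the optional "no cleavage before proline" rule are all assumptions, and the two test sequences are short fragments assembled from peptides quoted in the text and Table 7 rather than full ZP2 or ZP3 sequences.

```python
import re

SEQUON = re.compile(r"N[^P][ST]")  # Asn-X-Ser/Thr with X != Pro

def tryptic_peptides(protein, missed_cleavages=1, cleave_before_proline=False):
    """Enumerate peptides from cleavage C-terminal to K or R."""
    cuts = [0]
    for i in range(len(protein) - 1):
        if protein[i] in "KR" and (cleave_before_proline or protein[i + 1] != "P"):
            cuts.append(i + 1)
    cuts.append(len(protein))
    peptides = []
    for i in range(len(cuts) - 1):
        # allow up to `missed_cleavages` internal cleavage sites per peptide
        last = min(i + 1 + missed_cleavages, len(cuts) - 1)
        for j in range(i + 1, last + 1):
            peptides.append(protein[cuts[i]:cuts[j]])
    return peptides

def candidate_bases(protein, **kwargs):
    """Tryptic peptides that contain at least one N-glycosylation sequon."""
    return [p for p in tryptic_peptides(protein, **kwargs) if SEQUON.search(p)]

# Fragments built from peptides quoted in the paper (not full protein sequences).
ZP2_FRAGMENT = "IGNGTRAHILPLK"
ZP3_FRAGMENT = "RPRPETLQFTVDVFHFANSSRN"

print(candidate_bases(ZP2_FRAGMENT))
print(candidate_bases(ZP3_FRAGMENT))
print(candidate_bases(ZP3_FRAGMENT, cleave_before_proline=True))
```

With the proline rule enforced, the fragment never yields PRPETLQFTVDVFHFANSSR, which is consistent with the remark in Results that this peptide would not have been considered the likeliest tryptic product. Allowing cleavage before proline produces both ZP3 bases that appear in Table 7.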

Conclusions

Peptoonist demonstrates that combining high-resolution single-MS and tandem-MS data can provide automatic analysis of glycopeptide samples superior to that obtained by an expert human annotator. We base this conclusion on results obtained using a biological sample, which we believe gives more realistic results than using a commercially available glycoprotein. Our sample of mouse ZP2/ZP3 is also more complex than the glycoproteins used in previous work, with 13 potential glycosylation sites and over 100 distinct glycans.

We do not yet know the limits of the combined single/tandem technique, but we believe that this technique promises more thorough assays of samples of greater complexity than pure single-MS or pure tandem-MS techniques. A simple measure of glycopeptide sample complexity is the product of the number of glycosylation sites and the number of glycoforms. Here we searched a glycopeptide space of complexity in the low 1000s (13 possible sites and 100s of possible glycan compositions), and found results of complexity about 100 (1 site and over 100 glycan compositions). We anticipate searching glycopeptide spaces of complexity up to about 10,000 before the technique is overwhelmed by mass coincidences. Even in a sample of modest complexity, peak envelopes of the more massive glycopeptides often correspond to several distinct compositions, each of which in turn can be represented by many cartoons. Using only a small amount of biosynthetic “smarts”, Peptoonist’s global scoring scheme appears to do a good job of disambiguating isomers, and then Cartoonist can in turn propose plausible cartoons for those isomers.

Acknowledgment. D.G. was supported by NIH Grant R01GM074128 from the NIGMS and resources from the Consortium for Functional Glycomics funded by grants from the NIGMS (GM62116) and the NCRR. A.D. and H.R.M. are supported by The Wellcome Trust and The Biotechnology and Biological Sciences Research Council (BBSRC). A.D. is a BBSRC Professorial Research Fellow.

References

(1) Apweiler, R.; Hermjakob, H.; Sharon, N. On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochim. Biophys. Acta 1999, 1473, 4-8.
(2) Taylor, M.; Drickamer, K. Introduction to Glycobiology, 2nd ed.; Oxford University Press: New York, 2006; p 255.
(3) Brooks, S. A.; Dwek, M. V.; Schumacher, U. Functional & Molecular Glycobiology; BIOS Scientific Publishers: Oxford, 2003; p 354.
(4) An, H. J.; Ninonuevo, M.; Aguilan, J.; Liu, H.; Lebrilla, C. B.; et al. Glycomics analyses of tear fluid for the diagnostic detection of ocular rosacea. J. Proteome Res. 2005, 4, 1981-1987.
(5) Kirmiz, C.; Li, B.; An, H. J.; Clowers, B. H.; Chew, H. K.; et al. A Serum Glycomics Approach to Breast Cancer Biomarkers. Mol. Cell. Proteomics 2007, 6, 43-55.
(6) Li, D.; Mallory, T.; Satomura, S. AFP-L3: a new generation of tumor marker for hepatocellular carcinoma. Clin. Chim. Acta 2001, 313, 15-19.
(7) Goldberg, D.; Sutton-Smith, M.; Paulson, J.; Dell, A. Automatic Annotation of MALDI N-Glycan Spectra. Proteomics 2005, 5, 865-875.
(8) Morris, H. R.; Chalabi, S.; Panico, M.; Sutton-Smith, M.; Clark, G. F.; et al. Glycoproteomics: Past, present and future. Int. J. Mass Spectrom. 2007, 259, 16-31.
(9) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551-3567.
(10) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(11) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466-1467.
(12) Kaji, H.; Saito, H.; Yamauchi, Y.; Shinkawa, T.; Taoka, M.; et al. Lectin affinity capture, isotope-coded tagging and mass spectrometry to identify N-linked glycoproteins. Nat. Biotechnol. 2003, 21, 667-672.
(13) Zhang, H.; Li, X. J.; Martin, D. B.; Aebersold, R. Identification and quantification of N-linked glycoproteins using hydrazide chemistry, stable isotope labeling and mass spectrometry. Nat. Biotechnol. 2003, 21, 660-666.
(14) Cooper, C. A.; Gasteiger, E.; Packer, N. H. GlycoMod - A software tool for determining glycosylation compositions from mass spectrometric data. Proteomics 2001, 1, 340-349.
(15) Lapadula, A. J.; Hatcher, P. J.; Hanneman, A. J.; Ashline, D. J.; Zhang, H.; et al. Congruent strategies for carbohydrate sequencing. 3. OSCAR: an algorithm for assigning oligosaccharide topology from MSn data. Anal. Chem. 2005, 77, 6271-6279.
(16) Joshi, H. J.; Harrison, M. J.; Schulz, B. L.; Cooper, C. A.; Packer, N. H.; et al. Development of a mass fingerprinting tool for automated interpretation of oligosaccharide fragmentation data. Proteomics 2004, 4, 1650-1664.
(17) Goldberg, D.; Bern, M.; Li, B.; Lebrilla, C. B. Automatic determination of O-glycan structure from fragmentation spectra. J. Proteome Res. 2006, 5, 1429-1434.
(18) Ethier, M.; Saba, J. A.; Ens, W.; Standing, K. G.; Perreault, H. Automated structural assignment of derivatized complex N-linked oligosaccharides from tandem mass spectra. Rapid Commun. Mass Spectrom. 2002, 16, 1743-1754.
(19) Ethier, M.; Saba, J. A.; Spearman, M.; Krokhin, O.; Butler, M.; et al. Application of the StrOligo algorithm for the automated structure assignment of complex N-linked glycans from glycoproteins using tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17, 2713-2720.
(20) Gaucher, S. P.; Morrow, J.; Leary, J. A. STAT: A Saccharide Topology Analysis Tool Used in Combination with Tandem Mass Spectrometry. Anal. Chem. 2000, 72, 2331-2336.
(21) Tang, H.; Mechref, Y.; Novotny, M. V. Automated Interpretation of MS/MS spectra of oligosaccharides; 13th International Conference on Intelligent Systems for Molecular Biology (ISMB): Detroit, Michigan, 2005; pp i431-i439.
(22) An, H. J.; Peavy, T. R.; Hedrick, J. L.; Lebrilla, C. B. Determination of N-glycosylation sites and site heterogeneity in glycoproteins. Anal. Chem. 2003, 75, 5628-5637.
(23) Wu, Y.; Mechref, Y.; Klouckova, I.; Novotny, M. V.; Tang, H. A Computational Approach for the Identification of Site-specific Protein Glycosylations through Ion-Trap Mass Spectrometry; RECOMB Satellite Conferences on Systems Biology and Computational Proteomics, 2006.
(24) Bern, M.; Goldberg, D. De novo analysis of peptide tandem mass spectra by spectral graph partitioning. J. Comput. Biol. 2006, 13, 364-378.
(25) Press, W. H.; Flannery, B. P.; Teukolsky, S. A.; Vetterling, W. T. Numerical Recipes in C: The Art of Scientific Computing, 2nd ed.; Cambridge University Press: Cambridge, UK, 1992; p 1020.
(26) Huddleston, M. J.; Bean, M. F.; Carr, S. A. Collisional fragmentation of glycopeptides by electrospray ionization LC/MS and LC/MS/MS: methods for selective detection of glycopeptides in protein digests. Anal. Chem. 1993, 65, 877-884.
