Article pubs.acs.org/ac

MAGIC: An Automated N‑Linked Glycoprotein Identification Tool Using a Y1-Ion Pattern Matching Algorithm and in Silico MS2 Approach

Ke-Shiuan Lynn,†,‡ Chen-Chun Chen,¶,⊥,‡ T. Mamie Lih,∥,#,‡ Cheng-Wei Cheng,† Wan-Chih Su,§ Chun-Hao Chang,† Chia-Ying Cheng,† Wen-Lian Hsu,† Yu-Ju Chen,*,§,⊥ and Ting-Yi Sung*,†

† Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
¶ Genomics Research Center, Academia Sinica, Taipei 11529, Taiwan
§ Institute of Chemistry, Academia Sinica, Taipei 11529, Taiwan
∥ Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
⊥ Department of Chemistry, National Taiwan University, Taipei 10617, Taiwan
# Institute of Biomedical Informatics, National Yang-Ming University, Taipei 11221, Taiwan

Supporting Information

ABSTRACT: Glycosylation is a highly complex modification influencing the functions and activities of proteins. Interpretation of intact glycopeptide spectra is crucial but challenging. In this paper, we present a mass spectrometry-based automated glycopeptide identification platform (MAGIC) to identify peptide sequences and glycan compositions directly from intact N-linked glycopeptide collision-induced-dissociation spectra. The identification of the Y1 (peptide Y0 + GlcNAc) ion is critical for the correct analysis of unknown glycoproteins, especially without prior knowledge of the proteins and glycans present in the sample. To ensure accurate Y1-ion assignment, we propose a novel algorithm called Trident that detects a triplet pattern corresponding to [Y0, Y1, Y2] or [Y0−NH3, Y0, Y1] from the fragmentation of the common trimannosyl core of N-linked glycopeptides. To facilitate the subsequent peptide sequence identification by common database search engines, MAGIC generates in silico spectra by overwriting the original precursor with the naked peptide m/z and removing all of the glycan-related ions. Finally, MAGIC computes the glycan compositions and ranks them. For the model glycoprotein horseradish peroxidase (HRP) and a 5-glycoprotein mixture, a 2- to 31-fold increase in the relative intensities of the peptide fragments was achieved, which led to the identification of 7 tryptic glycopeptides from HRP and 16 glycopeptides from the mixture via Mascot. In the HeLa cell proteome data set, MAGIC processed over a thousand MS2 spectra in 3 min on a PC and reported 36 glycopeptides from 26 glycoproteins. Finally, a remarkable false discovery rate of 0 was achieved on the N-glycosylation-free Escherichia coli data set. MAGIC is available at http://ms.iis.sinica.edu.tw/COmics/Software_MAGIC.html.

Protein glycosylation is a prevalent and highly complex posttranslational modification that influences a variety of biological processes, including protein folding,1 enzymatic properties,2 and intermolecular interactions.3 Aberrant glycosylation has been reported to be strongly associated with the progression of diseases such as cancers,4 neurodegenerative disorders,5 and other genetic abnormalities.6,7 Moreover, glycoprotein analysis is also important in drug development8 and biomarker discovery.9 For instance, an antibody with fucose removed from its oligosaccharides has demonstrated enhanced therapeutic effects in a preliminary study for breast cancer.8 AFP-L3, an α-1,6 core-fucosylated glycoform of alphafetoprotein, has been used as the tumor biomarker for hepatocellular carcinoma.10−12 Hence, there is a pressing need to develop analytical methods for identifying glycoproteins and determining the connectivity of monosaccharides. Because of the variance in branching arrangements and anomeric linkages of monosaccharides as well as the site-specific heterogeneity of glycan occupancy, the study of glycoproteomics remains technically challenging.13 Currently, mass spectrometry (MS) has become an indispensable technology for such applications because of its high speed

© XXXX American Chemical Society

Received: December 2, 2014 Accepted: January 16, 2015


DOI: 10.1021/ac5044829 Anal. Chem. XXXX, XXX, XXX−XXX


and high sensitivity.14,15 Compared to the conventional approach of glycopeptide identification following deglycosylation, the analysis of intact glycopeptides by tandem mass spectrometry is becoming more popular because it allows inferences of glycosylation sites and glycan compositions. However, MS2 data obtained by commonly applied collision-induced dissociation (CID) often contain abundant B-type ions (or oxonium ions) as well as Y-type ions (or glycopeptide fragment ions)13,16 that arise from sequential neutral losses of terminal monosaccharides. Therefore, the comparatively low-abundance peptide backbone fragments, namely, sequence-characteristic b- and y-ions, are usually masked by these B- and Y-ions in the spectra, which interferes with peptide sequence identification.13 Furthermore, the challenge of isolating glycopeptides with high purity on the proteome scale adds another level of complexity to differentiating glycopeptide and nonglycopeptide spectra. Therefore, glycobioinformatics plays an essential role in facilitating MS-based glycoproteomics analysis.17 In the past decade, various automated software tools have been developed for glycopeptide analysis. Many of these tools are designed to identify glycopeptides or glycan structures from CID, HCD, and/or ETD data using prior knowledge of proteins and/or glycans to deduce possible combinations of peptide and glycan or to match potential glycopeptides from databases. These software tools include GlycoMod,18 GlycoPep DB,19 GlyDB,20 GlycoSpectrumScan,21 GlycoPep ID,22 GlycoPep grader,23 GlycoPeptide Search,24 Peptoonist,25 Byonic,26 and GlycoMaster DB.27 Such tools therefore cannot analyze glycopeptide MS2 data when the protein or glycan information is unknown. Conversely, several bioinformatics tools have been developed for intact glycopeptide analysis without prior protein or glycan information.
GlycoMiner28 relies on tracing six Y-ions arising from the fragmentation of the trimannosyl core in MS2 spectra to determine the naked peptide mass and glycan mass. The tool GlyPID29 attempts to determine additional peptide−glycan combinations by integrating the MS and MS2 spectra. By using Pronase digestion to generate short glycopeptides, GP Finder30,31 was developed to determine glycosites and identify microheterogeneity in nonspecifically cleaved glycopeptides. Nonetheless, these tools are not suitable for large-scale glycoproteome data sets. Sweet-Heart32 is a recently developed tool that requires an MS3 spectrum for each predicted Y1 (peptide + GlcNAc) ion to determine the underlying peptide backbone and has been used to analyze glycoproteomics in mouse serum. The advent of automated glycopeptide analysis has been recently reviewed by Dallas et al.33 and Woodin et al.34 Dallas et al. noted that there is room to introduce software programs that can automatically analyze high-throughput MS2 spectral data to identify glycoproteins from complex biological samples on the proteome scale. For large-scale glycoproteomics computation, these reviews also suggested incorporating several important features into future software programs, such as eliminating the requirement for prior knowledge of peptide sequences, processing tandem mass spectra in batch, and determining glycan compositions. Given the existing challenges in analyzing intact glycopeptide MS2 spectra, freely available automated glycopeptide analysis tools that incorporate the features suggested above are still needed to facilitate glycoproteomics research.

To address the above issues, we present a software system called MAGIC (Mass spectrometry-based Automated Glycopeptide IdentifiCation platform) for the automated identification of N-linked glycoproteomics beam-type CID data sets without prior known protein and glycan information. To ensure that glycopeptide spectra are processed in typical large-scale data sets that very likely also include nonglycopeptides, MAGIC first detects B-ions to eliminate nonglycopeptide spectra. For unknown glycopeptide identification, the accurate determination of the stable and abundant Y1 ions13,14,35 is the most vital step for deducing peptide mass for further glycan mass calculation and glycopeptide sequence identification. We propose a novel pattern-matching method called Trident, so named for its use of triplet patterns, for accurate Y1-ion identification. Trident detects unique triplets of peaks with m/z differences of two consecutive GlcNAc’s or an NH3 followed by a GlcNAc, which correspond to [Y0, Y1, Y2] and [Y0−NH3, Y0, Y1] patterns, respectively, in the trimannosyl core structure, where Y0 denotes the naked peptide and Y2 = Y1 + GlcNAc. A detailed analysis of the seven dissociation fragment ions from the trimannosyl core structure revealed that Trident’s Y1-ion patterns represent the minimum peaks required to ensure efficient and accurate Y1-ion detection. To the best of our knowledge, no existing software package has reported using such triplet patterns. By adopting the Trident algorithm, N-linked glycopeptides with truncated core structures1,36 can be detected as well. On the basis of the detected peptide mass, a module called in silico peptide spectrum generator was developed to produce new spectra containing primarily b- and y-ions as input for database searching methods such as Mascot and X!Tandem. Therefore, MAGIC can provide the intact glycopeptide sequence identification by a conventional database sequence search without prior input of a protein sequence or glycan database.
Finally, MAGIC processes and annotates MS2 spectra containing multiply charged ions, constructs glycopeptide fragmentation paths to evaluate glycan composition assignments, and reports glycopeptide identification results. To demonstrate MAGIC’s performance on glycoprotein samples of different complexity, we used four data sets, namely, horseradish peroxidase (HRP), a mixture of five known glycoproteins, large-scale HeLa cells, and Escherichia coli (E. coli). In the former two standard protein data sets, all of the known peptide and glycan compositions were accurately determined by MAGIC. In the HeLa cell data set, the peptide−glycan combinations of 36 glycopeptides from 26 glycoproteins were automatically identified by MAGIC, whereas in the E. coli data set, which was used as a negative control because it lacks glycosylation machinery, MAGIC achieved a remarkable false discovery rate (FDR) of 0. MAGIC is implemented in Windows environments and provides user-friendly interfaces and visualizations (Supporting Information, Figure S-1). Its efficiency was demonstrated by taking only 3 min on a PC to generate identification results from the HeLa cell data set (1133 spectra). MAGIC is available at http://ms.iis.sinica.edu.tw/COmics/Software_MAGIC.html.



EXPERIMENTAL SECTION

All of the standard proteins and other chemicals were obtained at the highest available grade from commercial suppliers and were used as received. Purified membrane proteins from HeLa cells were digested by trypsin following the procedures described in Han et al.37 Further details of the materials and reagents, along with the experimental protocols for sample preparation, MS analysis of glycopeptides, and settings of data


with the conventional approaches that attempt to identify as many peaks associated with the trimannosyl core as possible, we use a minimal triplet of peaks matched to three ions in the core to detect Y1 ions accurately without a list of protein sequences. To further enhance the sensitivity of Y1-ion identification, we evaluate the occurrence frequency of the seven peaks and propose a novel Trident Y1-ion pattern matching algorithm that detects two sets of triplet peaks matched to consecutive mass losses of either two GlcNAc’s (i.e., [Y0, Y1, Y2]) or the dissociation of an NH3 followed by a GlcNAc (i.e., [Y0−NH3, Y0, Y1]) (Supporting Information Figure S-2B). This algorithm also detects modified Trident Y1-ion patterns, such as a sodium adduct or water loss, on the Y0, Y1, and Y2 ions. These triplet patterns can also be applied to a truncated N-linked core since MAGIC does not rely on a complete trimannosyl core. Once the Y1 ion has been determined, other Y-ions are detected by dynamic programming, which searches for successive peaks above the Y0 ion whose mass differences correspond to a monosaccharide.

Step 3: Peptide Sequence Identification. Database sequence searching is generally preferable to de novo sequencing because it provides an FDR and is more efficient in large-scale analyses. To the best of our knowledge, no existing software provides a direct protein database search to identify peptide sequences from intact glycopeptide spectra unless given a list of potentially glycosylated protein sequences. Failure of glycopeptide sequence identification in conventional database sequence searches is partially caused by unknown naked peptide precursors and the interference of dominant B- and Y-ions.
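The Trident triplet search of Step 2 can be illustrated with a minimal Python sketch. It is not MAGIC's implementation: the function names, the 0.05 m/z tolerance, and the assumption that all three ions share one charge state are hypothetical choices made for illustration, and the sodium-adduct and water-loss variants are omitted. The residue masses are standard monoisotopic values.

```python
from bisect import bisect_left

HEXNAC = 203.0794  # monoisotopic mass of a GlcNAc (HexNAc) residue, Da
NH3 = 17.0265      # monoisotopic mass of ammonia, Da


def has_peak(sorted_mz, target, tol):
    """Return True if any m/z in the sorted list lies within +/- tol of target."""
    i = bisect_left(sorted_mz, target - tol)
    return i < len(sorted_mz) and sorted_mz[i] <= target + tol


def find_y1_candidates(mz_values, charge=1, tol=0.05):
    """List peaks that complete a [Y0, Y1, Y2] or [Y0-NH3, Y0, Y1] triplet.

    Assumption: every ion of a triplet carries the same charge, so the
    expected m/z spacing is the neutral mass difference divided by charge.
    """
    mzs = sorted(mz_values)
    d_glcnac = HEXNAC / charge
    d_nh3 = NH3 / charge
    hits = []
    for y1 in mzs:
        y0 = y1 - d_glcnac
        if not has_peak(mzs, y0, tol):
            continue  # both triplets need the naked peptide ion Y0
        patterns = []
        if has_peak(mzs, y1 + d_glcnac, tol):
            patterns.append("[Y0, Y1, Y2]")
        if has_peak(mzs, y0 - d_nh3, tol):
            patterns.append("[Y0-NH3, Y0, Y1]")
        if patterns:
            hits.append((y1, patterns))
    return hits
```

Requiring only three peaks keeps the test cheap (one binary search per expected neighbor), while the fixed spacing still rules out most chance matches; a production tool would also have to weigh peak intensities and competing charge states.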
To resolve these issues, MAGIC performs an ion removal process that generates in silico peptide MS2 spectra through two essential operations: (i) reassigning the precursor mass to the mass of the Y0 ion and (ii) removing all of the B- and Y-ions to eliminate their interference with peptide identification and enhance the relative intensity of the b- and y-ions. A new MGF file is then generated as input to the database search engine.

Step 4: Glycan Composition Determination. From a computational perspective, the glycan mass is a linear combination of the masses of different monosaccharide types. Determining the glycan composition is equivalent to solving the linear equation defined by the glycan mass, which is the well-known knapsack problem in combinatorial optimization. To solve this computational problem, a look-up table for potential glycan compositions is constructed to list the exhaustive combinations of up to 29 different monosaccharides (nearly the detection limit of conventional mass spectrometers), with consideration of the general biosynthesis rules for glycan formation.41 The glycan composition is determined by matching the glycan mass, i.e., the difference between the intact glycopeptide precursor mass and naked peptide mass, to the look-up table. Furthermore, on the basis of all of the detected B- and Y-ions and several rules mentioned in the literature,18,32 MAGIC can eliminate false compositions and define a scoring function to rank the feasible compositions. Finally, MAGIC provides the glycopeptide identification results, which include the peptide sequence and glycan compositions with scores.
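The mass-matching idea in Step 4 can be sketched as a brute-force enumeration. This is an illustrative stand-in for MAGIC's look-up table, not a reproduction of it: the four-residue alphabet, the 0.05 Da tolerance, the `max_units` cap, and the function name are assumptions, and neither the biosynthesis rules nor the scoring function is modeled. The residue masses are standard monoisotopic values.

```python
from itertools import product

# Monoisotopic residue masses in Da (each monosaccharide minus one water).
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "NeuAc": 291.0954,
}


def candidate_compositions(glycan_mass, tol=0.05, max_units=12):
    """Enumerate residue counts whose summed mass matches glycan_mass.

    glycan_mass is the intact glycopeptide mass minus the naked peptide
    mass, which equals the sum of the residue masses of the glycan.
    """
    names = list(RESIDUE_MASS)
    # No residue can occur more often than the target mass allows.
    ranges = [
        range(min(max_units, int((glycan_mass + tol) // RESIDUE_MASS[n])) + 1)
        for n in names
    ]
    matches = []
    for counts in product(*ranges):
        if sum(counts) > max_units:
            continue
        mass = sum(c * RESIDUE_MASS[n] for c, n in zip(counts, names))
        if abs(mass - glycan_mass) <= tol:
            matches.append({n: c for n, c in zip(names, counts) if c})
    return matches
```

For example, a glycan mass of about 1622.58 Da (5 × 162.0528 + 4 × 203.0794) yields Hex5HexNAc4 among the candidates. Several compositions can fall within the tolerance of one mass, which is why ranking by the detected B- and Y-ions is needed afterwards.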

preprocessing and database searches, can be found in the Supporting Information File S-1 (PDF).

MAGIC: A Bioinformatics Tool for Automated Glycopeptide Identification. MAGIC is designed to automate the analysis of N-linked glycopeptides from high-throughput beam-type CID MS2 spectra acquired from intact glycoproteomics experiments. MAGIC requires QTOF-CID MS2 data sets in standard MGF files as input, which can be converted from raw data files generated by a variety of mass spectrometers; by default, it requires neither potentially glycosylated protein sequences nor a glycan database. Essentially, automated glycopeptide analysis involves the following two major components: (i) spectrum filtering to select glycopeptide spectra based on detected B-ions and Y1 ions and (ii) utilizing the assigned Y1 ions to decouple peptide backbone b- and y-ions from glycan-related B- and Y-ions for identifying peptide sequences and determining glycan compositions (Figure 1). The workflow of MAGIC consists of four major steps summarized below. A detailed description of the workflow is given in the Supporting Information File S-1 (PDF).
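The B-ion part of the spectrum-filtering component can be sketched in a few lines of Python. The two reference values are the HexNAc+ (m/z 204.08) and HexHexNAc+ (m/z 366.14) oxonium ions named in this article; the function names, the peak representation, and the 0.02 m/z tolerance are hypothetical choices for illustration, not MAGIC's actual interface.

```python
# A peak is an (m/z, intensity) tuple; a spectrum is a list of peaks.
DEFAULT_B_IONS = (204.08, 366.14)  # HexNAc+ and HexHexNAc+


def count_b_ions(peaks, b_ions=DEFAULT_B_IONS, tol=0.02):
    """Count the reference B-ions that have at least one peak within tol."""
    return sum(
        any(abs(mz - ref) <= tol for mz, _ in peaks)
        for ref in b_ions
    )


def is_candidate_glycopeptide(peaks, b_ions=DEFAULT_B_IONS,
                              min_hits=None, tol=0.02):
    """Keep a spectrum if it shows at least min_hits of the listed B-ions.

    With min_hits=None every listed ion must be present; passing a smaller
    number mimics filtering on a minimum count of B-ions.
    """
    required = len(b_ions) if min_hits is None else min_hits
    return count_b_ions(peaks, b_ions, tol) >= required
```

A spectrum containing peaks near both 204.08 and 366.14 passes the default filter, whereas a spectrum showing only one of them passes only when `min_hits=1` is requested.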

Figure 1. Workflow of MAGIC for automated intact glycopeptide identification. Four major steps are designed to accomplish intact glycopeptide identification.

Step 1: Glycopeptide Spectrum Filtering. The first step is designed to filter glycopeptide MS2 spectra. Because of the abundant B-ions in the low m/z region,13 such as HexNAc+ (m/z 204.08) and HexHexNAc+ (m/z 366.14), MAGIC can detect the B-ions in the input file to retrieve the potential glycopeptide spectra. In addition to the built-in list of 22 B-ions reported in the literature (Supporting Information Table S1),28,38−40 MAGIC also allows users to upload their B-ion lists. On the basis of the ion list, the user can choose particular forms, such as the most abundant HexNAc+ and HexHexNAc+ (MAGIC’s default setting), to retain possible glycopeptide spectra. Alternatively, users can specify a minimum number of B-ions for filtering.

Step 2: Trident Pattern Matching for Y1-Ion Detection. The trimannosyl core structure is commonly present in N-linked glycopeptides. Complete dissociation of the core structure in CID may generate seven peaks, including peptide ions [Y0, peptide] and [Y0−NH3] and trimannosyl core ions of [Y1, Y0 + GlcNAc], [Y2, Y0 + 2GlcNAc], [Y3, Y0 + 2GlcNAc + Man], [Y4, Y0 + 2GlcNAc + 2Man], and [Y5, Y0 + 2GlcNAc + 3Man] (Supporting Information Figure S-2A). Compared



RESULTS AND DISCUSSION

The performance of MAGIC was evaluated on four data sets that were generated from a standard HRP protein, a mixture of five proteins, and large-scale glycoproteomics analysis from the


human HeLa cells and E. coli. HRP was selected as the model glycoprotein for the initial development of the MAGIC algorithm because its glycosylation sites and glycoform microheterogeneity have been well characterized.42,43 Next, five well-studied glycoproteins were mixed to evaluate the performance of MAGIC on a complex glycoprotein mixture. On the proteome scale, proteins extracted from human HeLa cells were used for the large-scale performance evaluation. The analytical workflow for the above-mentioned data sets consisted of three experimental steps: (i) tryptic digestion, (ii) glycopeptide enrichment by ZIC-HILIC, and (iii) LC-MS2 analysis by Q-TOF MS. To evaluate the FDR, tryptic digests from E. coli were directly subjected to LC-MS2 analysis and used as a negative control because of E. coli’s lack of glycosylation machinery. A summary of the peptide identification for HRP, protein mixture, and HeLa cells is provided in the Supporting Information File S-1 (PDF).

Performance Evaluation on the HRP Data Set. The performance of the proposed algorithms on the automated identification of peptide sequences and glycan compositions was first demonstrated on HRP, a glycoprotein with seven known tryptic peptides and paucimannose-type N-glycans at eight glycosylation sites.42,43 The raw data were first converted into an MGF file by Mascot Distiller, which produced a total of 734 MS2 spectra. After detection of B-ions at 204.08 and 366.12 Da, 628 of the 734 spectra were filtered as potential glycopeptide spectra. The stepwise evaluation of MAGIC’s algorithms is described in the following subsections.

Y1-Ion Detection by Trident Pattern Matching. To evaluate the sufficiency of the patterns used by the Trident approach for Y1-ion identification, the occurrence frequencies of the seven N-linked glycan core fragment ions in the 628 spectra were plotted (Figure 2A).
Among the 628 potential glycopeptide spectra, Y0−NH3, Y0, Y1, and Y2 were the top four observed ions. The Y1 ion was observed with the highest frequency (478 spectra, 76.1%), followed by the Y0 ion (380 spectra), Y2 ion (298 spectra), and Y0−NH3 ion (267 spectra). In contrast, the number of spectra containing fragment ions larger than the Y2

ion gradually decreased to only 71 spectra observed for the Y5 ion. Thus, if all seven peaks were required for detecting Y1 ions, only 20 spectra would be detected with the complete trimannosyl core pattern. On the basis of the high frequency of the four peaks Y0−NH3, Y0, Y1, and Y2, four combinations, including [Y0−NH3, Y0, Y1], [Y0, Y1, Y2], [Y0−NH3, Y0, Y1, Y2], and the union of the first two pattern sets, were evaluated. As shown in Figure 2B, only 184 spectra possess Trident Y1-ion patterns of both [Y0−NH3, Y0, Y1] and [Y0, Y1, Y2]. Without prior information on the peptide sequences and based on the use of the [Y0−NH3, Y0, Y1] or [Y0, Y1, Y2] pattern, the Trident pattern matching approach successfully identified Y1 ions in 346 (72.4%) of the 478 Y1-containing spectra. Notably, our Trident strategy also offers advantages for the identification of multiple glycosylation sites or modified glycopeptides. First, it is capable of detecting the occurrence of two glycosylation sites on the same peptide backbone by detecting two different sets of [Y0, Y1, Y2] patterns that differ by a single GlcNAc in a spectrum. Trident then assigns the Y0 ion with the smaller m/z as the naked peptide precursor, assuming two glycosites, for subsequent peptide sequence identification. This idea is demonstrated in Figure 3A by a

Figure 3. Examples of detecting multiple glycosylation sites and a modified glycopeptide using Trident. (A) Trident detects two overlapping [Y0, Y1, Y2] patterns in a glycopeptide with two glycosylation sites and deduces the correct naked peptide mass. (B) Trident detects the sodiated pattern [Y0 + Na, Y1 + Na, Y2 + Na] as well as the unmodified Y0 and Y1 ions to deduce the correct naked peptide mass.

spectrum of the glycopeptide LYNFSNTGLPDPTLNTTYLQTLR, which is known to exhibit two glycosylation sites. As shown in Figure 3A, two [Y0, Y1, Y2] patterns are detected that differ by a GlcNAc, suggesting two glycosylation sites on the peptide rather than one. Second, Trident is capable of detecting patterns from adducted ions, such as the commonly observed sodiated glycopeptides, even in the absence of the unadducted form. The spectrum of GLIQSDQELFSSPNATDTIPLVR exhibits glycopeptide fragments of Y0 + Na, Y1 + Na, and Y2 + Na (Figure 3B). In this case, although the Y2 ion is missing in the original [Y0, Y1, Y2] pattern, Trident successfully detects the pattern of the sodium adduct, [Y0 + Na, Y1 + Na, Y2 + Na], by a mass increase of +22 Da and then deduces the protonated Y1 and Y0 ions for peptide identification. By considering the sodium adduct or water loss for Y1-ion identification, 44 additional glycopeptide spectra were detected with the Y1 ion,

Figure 2. Analysis of the occurrence frequency of seven trimannosyl core-related fragment ions in the 628 potential glycopeptide spectra of the HRP data set. (A) Number of spectra with one of the seven ions. (B) Number of triplet-peak patterns found in the 628 spectra.


Figure 4. Illustration of the performance of in silico peptide MS2 spectrum generation. (A) A glycopeptide spectrum with peptide backbone of SFANSTQTFFNAFVEAMDR was unidentifiable by Mascot. (B) After ion removal, the spectrum was identified with a high Mascot ion score of 100.98 (>46). (C) A glycopeptide spectrum of GLIQSDQELFSSPNATDTIPLVR was falsely identified by Mascot. (D) After ion removal, the spectrum was identified with a high Mascot ion score of 79.35 (>46).
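The two operations behind in silico peptide MS2 spectrum generation, reassigning the precursor to the Y0 ion and removing the glycan-related B- and Y-ions, can be sketched as follows. This is a hedged illustration rather than MAGIC's code: the function names, the dictionary layout, and the 0.05 m/z tolerance are assumptions, and the caller is assumed to supply the m/z values of the ions already annotated as B- or Y-ions. The output uses the usual MGF block keywords (BEGIN IONS, TITLE, PEPMASS, CHARGE, END IONS).

```python
def make_in_silico_spectrum(peaks, y0_mz, y0_charge, glycan_ion_mzs, tol=0.05):
    """Drop glycan-related peaks and set the Y0 ion as the new precursor.

    peaks: list of (m/z, intensity) tuples from the original spectrum.
    glycan_ion_mzs: m/z values previously annotated as B- or Y-ions.
    """
    kept = [
        (mz, inten) for mz, inten in peaks
        if all(abs(mz - g) > tol for g in glycan_ion_mzs)
    ]
    return {"pepmass": y0_mz, "charge": y0_charge, "peaks": kept}


def to_mgf(title, spectrum):
    """Serialize one spectrum as an MGF block for a database search engine."""
    lines = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={spectrum['pepmass']:.4f}",
        f"CHARGE={spectrum['charge']}+",
    ]
    lines += [f"{mz:.4f} {inten:.1f}" for mz, inten in spectrum["peaks"]]
    lines.append("END IONS")
    return "\n".join(lines)
```

Because the search engine sees only the rewritten precursor and the remaining peaks, an unmodified peptide search can be run without declaring any glycan as a variable modification.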

To explore why the ion removal strategy could improve peptide identification, we first investigated whether the ion removal strategy increased the relative intensity of the b- and y-ions using the matched ion ratio, defined as the total intensity of matched b- and y-ions divided by the total peak signal intensity in the spectrum. Note that a matched ion ratio of 1.0 indicates perfect ion removal, where no peaks other than the b- and y-ions remain in the spectrum. After the B- and Y-ions were removed from the peak list, the matched ion ratios exhibited a 2- to 31-fold increase for the seven peptides (Figure 5A). On average, the ratio increased 2-fold, from 0.30 ± 0.14 to 0.60 ± 0.15 (p = 1.0 × 10^−26). The improved matched ion ratio together with the assignment of naked peptide precursor mass subsequently contributed to the accuracy and confidence of peptide identification. Without the ion removal strategy, no scans could be confidently identified by Mascot at the 95% confidence level. However, with the newly generated B- and Y-ion-free spectra, Mascot reported results for 81 spectra, including 61 confidently identified spectra (ion score >46, FDR 48, FDR 50% B- and Y-ions. In addition, 58 of the 67 calculated compositions fit the proposed formula of (Xyl)xManm(Fuc)f GlcNAc2 (m = 2−6; f = 0 or 1; x = 0 or 1) in Yang et al. for HRP.42 Furthermore, all of the 58 literature-validated glycan compositions were reported as top ranked compositions by MAGIC. In addition to the spectra that were consistent with the literature, five spectra were assigned with one additional HexNAc, which may be worth further investigation. Among the 14 (81−67) unsolvable spectra, nine were approximately 1.0 Da less than the theoretical peptide mass and three resulted from an incorrect precursor charge provided in the input file. In summary, MAGIC demonstrated automated glycopeptide identification on a single glycoprotein without the input of prior protein or peptide information.
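The matched ion ratio defined above, the summed intensity of matched b- and y-ions divided by the summed intensity of all peaks, can be computed as in the following sketch. The function name, the 0.05 m/z matching tolerance, and the convention of returning 0.0 for an empty spectrum are assumptions made for illustration.

```python
def matched_ion_ratio(peaks, theoretical_mzs, tol=0.05):
    """Fraction of the total intensity carried by matched b- and y-ions.

    peaks: list of (m/z, intensity) tuples.
    theoretical_mzs: expected b- and y-ion m/z values of the peptide.
    """
    total = sum(inten for _, inten in peaks)
    if total == 0:
        return 0.0  # empty or all-zero spectrum: nothing can be matched
    matched = sum(
        inten for mz, inten in peaks
        if any(abs(mz - t) <= tol for t in theoretical_mzs)
    )
    return matched / total
```

In a toy spectrum where the b- and y-ions carry 200 of 2800 intensity units, the ratio is about 0.07; once the 2600 units of B- and Y-ion signal are removed, it rises to 1.0, the value that corresponds to perfect ion removal.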
In this HRP data set, both the peptide sequence and its corresponding glycan composition for all seven tryptic glycopeptides can be reliably determined from


hyaluronan (HA).53,54 CD44 is a major receptor for HA, with five potential N-glycosylation sites located in the HA-binding domain. The CD44-HA binding affinity is most likely modulated by these N-glycosylation sites, especially the first and fifth Asn.53,55,56 Although their exact glycoforms are unknown, glycan compositions of HexmHexNAcydHexNeuAcr (where m = 4, 6, 7, or 8, y = 3−5, and r = 0 or 1) have been reported for recombinant human CD44 expressed in a mouse myeloma cell line.57 One of the five glycosylation sites in CD44, AFNSTLPTMAQMEK, was identified by MAGIC with glycan compositions Hex5HexNAc4, Hex5HexNAc4NeuAc, and Hex5HexNAc4dHexNeuAc. The result indicates that MAGIC has the ability to provide site-specific glycan compositions, which permits additional investigations. The other glycoprotein, TfR, is known for mediating the cellular uptake of iron.58 Its glycosylation is important for TfR to form a fully functional receptor.59 Its most critical N-glycosite, reported on the peptide KQNNGAFNETLFR, was identified via the workflow of MAGIC with a high-mannose glycan composition (Hex8HexNAc2) that was consistent with the report by Hayes et al.59 The overall results on HeLa cells showed that MAGIC would be an effective platform to process hundreds of tandem mass spectra simultaneously with high reproducibility and accuracy.

Performance Evaluation on a Negative Control Escherichia coli Data Set. Finally, the E. coli data set was selected as a negative control for the evaluation of false-positive identifications by MAGIC because E. coli lacks N-linked glycosylation in its biosynthetic machinery. From the 1880 spectra, ten were retained after B-ion filtration and only two were further considered as glycopeptide spectra because of the existence of Trident Y1-ion patterns. After assigning the m/z of the Y0 ion as the precursor m/z and performing ion removal, the resulting two new spectra were searched by Mascot against Swiss-Prot.
None of the spectra were identified as glycopeptides. This result showed that MAGIC achieved a remarkable FDR of 0 for glycopeptide identification in this large data set.

Further Improvement on Glycopeptide Identification. Compared with the large number of predicted glycoproteins in the human proteome,48,60 the glycopeptides identified in this study remain under-represented. However, this representation can be further improved by using a more advanced mass spectrometer with higher sensitivity and acquisition speed and by incorporating subcellular or chromatographic fractionation strategies. In addition, we performed detailed analyses of the factors that resulted in unidentified spectra in our data set. Although MAGIC identified seven known glycopeptide sequences from the HRP data set, 265 out of 346 glycopeptide spectra in the HRP data set were unidentified by Mascot. Further manual inspection revealed two major reasons for this issue: (i) an insufficient number or intensity of b- and y-ions was observed in 39 spectra (14.7%), which were all associated with the short peptide NVGLNR; (ii) shifted peptide masses were observed in 202 spectra (76.2%). Semitryptic peptides have been reported to account for a substantial share of tandem mass spectra (3−40%).61−64 In 128 of 202 spectra, we found that the deduced peptide m/z values and their corresponding b- and y-ions might have been derived from truncated peptides (semitryptic peptides) of the seven known glycopeptide backbones in HRP (see examples in the Supporting Information Figure S-3). The high proportion (48.3%) of semitryptic peptides suggested their frequent occurrence in the typical glycoproteomics analytical workflow. Therefore, to improve glycopeptide sequencing, we propose

Intel Core2 Quad CPU 2.83 GHz processor, 1TB hard disk drive, and 8 GB RAM) for the entire workflow, which included glycopeptide spectra filtering, Trident pattern matching for Y1-ion identification, in silico MS2 spectra generation, and Mascot search as well as the output of peptide sequences and their glycan compositions. After loading a total of 1133 spectra, MAGIC obtained 591 spectra that showed the presence of B-ions at m/z 204.08 and 366.12. Among them, 384 spectra, i.e., 65% of the 591 spectra, contained at least one adequate Y1-ion pattern detected by Trident and were further processed by MAGIC. Notably, in this HeLa cell data set and in the previous two data sets, approximately one-third of the entire spectra were regarded as glycopeptide spectra, which shows that filtering glycopeptide spectra is an essential step for ensuring the efficiency and accuracy of intact glycopeptide analysis. After performing in silico peptide MS2 spectra generation on the 384 glycopeptide spectra, Mascot reported peptide sequences of 112 spectra (36 glycopeptides of 26 glycoproteins), including 21 confidently identified spectra (18 glycopeptides of 17 glycoproteins) with an ion score >35 and FDR