GlycoMaster DB: Software To Assist the Automated Identification of N


Article

GlycoMaster DB: Software to Assist the Automated Identification of N-Linked Glycopeptides by Tandem Mass Spectrometry Lin He, Lei Xin, Baozhen Shan, Gilles A. Lajoie, and Bin Ma J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/pr401115y • Publication Date (Web): 12 Aug 2014 Downloaded from http://pubs.acs.org on August 23, 2014

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Journal of Proteome Research

GlycoMaster DB: Software to Assist the Automated Identification of N-Linked Glycopeptides by Tandem Mass Spectrometry

Lin He1, Lei Xin2, Baozhen Shan2, Gilles A. Lajoie3, Bin Ma1,*

1 David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada

2 Bioinformatics Solutions Inc., Waterloo, ON, Canada

3 Department of Biochemistry, Western University, London, ON, Canada

Email: {l22he, binma}@uwaterloo.ca, {lxin, bshan}@bioinfor.com, and [email protected]

* To whom correspondence should be addressed. Tel: +1 (519) 888-4567 x32747. Fax: +1 (519) 885-1208. Email: [email protected].

ABSTRACT: Glycosylation is one of the most commonly observed posttranslational modifications (PTMs) in eukaryotes. It is believed that more than 50% of eukaryotic proteins are glycosylated. To reveal the biological functions of protein-linked glycans involved in numerous biological processes, the high-throughput identification of both glycoproteins and the attached glycan structures becomes fundamentally important. Tandem mass spectrometry (MS/MS) is an effective method for glycoproteomic analysis because of its high sensitivity and selectivity. Two experimental approaches exist to obtain MS/MS spectral data of glycopeptides. One consists of isolating glycans from glycopeptides and generating MS/MS spectra of the glycans and peptides separately. The other approach produces spectra directly from intact glycopeptides. The latter approach has the advantage of retaining the glycosylation site information. However, the spectral data cannot be readily analyzed because of the lack of software specifically designed for the identification of intact glycopeptides. To address this need, we developed a novel software tool, GlycoMaster DB, to assist the automated and high-throughput identification of intact N-linked glycopeptides from MS/MS spectra. The software simultaneously searches a protein sequence database and a glycan structure database to find the best pair of peptide and glycan for each input spectrum. GlycoMaster DB can analyze mass spectral data produced with HCD/ETD mixed fragmentation, where HCD spectra are used to identify glycans and ETD spectra are used to determine peptide sequences. When only HCD spectra are available, GlycoMaster DB can still help to identify the glycans, and a list of possible peptide sequences is reported according to the accurate precursor mass and the N-linked glycopeptide sequon. GlycoMaster DB is freely accessible at http://www-novo.cs.uwaterloo.ca:8080/GlycoMasterDB.

KEYWORDS: Bioinformatics, Glycoproteins, Protein Identification, Protein Modification, Tandem Mass Spectrometry, Glycopeptides, Glycosylation, N-linked Glycans.

1. Introduction

Glycosylation is an enzymatic process that attaches glycans to proteins, lipids, or other organic molecules. It is one of the most frequently observed PTMs, and more than 50% of eukaryotic proteins are predicted to be glycosylated.1 Glycosylation of proteins provides either a specific structural function induced by conformational changes, or specific recognition sites that are vital to cell-cell interactions.2,3 Additionally, increasing evidence suggests that some abnormal glycosylation is strongly correlated with many diseases, such as cancer4 and congenital disorders.5 It is commonly believed that these glycan-involved biological processes are closely related to specific glycan structures.6-8 Thus, the characterization of glycoproteins, including the amino acid sequences, glycan structures, and glycosylation sites, is of great interest in the emerging glycoproteomics field.9,10

Glycoproteomic analysis is more challenging than conventional proteomic analysis because of the variety of glycan structures and the different linkages to proteins. Three types of glycans have been reported: N-linked, O-linked and C-linked. Among the three types, N-linked and O-linked are the most commonly observed. N-linked glycans are predominantly found on the Asn residue within a consensus peptide sequence, -Asn-Xxx-Ser/Thr-, where Xxx is any amino acid residue except Pro.11 This consensus sequence is known as a glycosylation sequon.11 Occasionally, Cys can take the place of Ser and Thr to generate another acceptable N-linked glycosylation sequon, -Asn-Xxx-Cys-, which is less frequently observed.12 Furthermore, N-linked glycans share a single core structure, (GlcNAc)2Man3, derived from the same precursor (GlcNAc)2Man9Glc3.13 In contrast, O-linked glycans have more varied core structures, and O-linked glycopeptides have less well-defined consensus peptide sequences (sequons). This paper focuses on the analysis of N-linked glycopeptides, including the identification of both glycan composition and peptide sequences.

Tandem mass spectrometry (MS/MS) is a powerful tool for the analysis of the glycoproteome because of its high sensitivity and selectivity. In one approach, glycopeptides are deglycosylated partially or totally using a specific glycosidase, such as peptide N-endoglycosidase F (PNGase-F), and the resultant peptides
and glycan moieties are then analyzed separately by mass spectrometry.14-17 Although this method simplifies the data interpretation, it is nontrivial to locate the glycosylation sites for the identified glycans. In another approach, intact glycopeptides are characterized directly by MS/MS experiments.9,18,19 Earlier experiments used only one type of fragmentation method, typically collision induced dissociation (CID), to produce MS/MS spectra. Later, strategies that combine different fragmentation methods for intact glycopeptide analysis emerged. CID and higher energy collision dissociation (HCD) mainly produce fragment ions by breaking the glycosidic bonds, while electron transfer dissociation (ETD) and electron capture dissociation (ECD) predominantly produce fragment ions by breaking the peptide backbone while leaving the attached glycan intact. Combining two complementary fragmentation techniques in MS/MS analysis enables the identification of peptide sequences, glycan structures, and glycosylation sites.20-23

Algorithm development for interpreting the MS/MS data of glycopeptides has been reported, but is still at a rudimentary stage.24,25 GlycoMod26 is a web-based tool to calculate all possible glycan compositions for a given mass. It does not use the MS/MS data in the analysis. Glyco-Peakfinder27 and GlycoFragments28,29 can calculate the theoretical fragment ions of a given glycan structure, and use them to annotate an MS/MS spectrum. However, these tools cannot identify glycopeptides automatically and require a human expert to specify the glycan structure. GlycosidIQ30, GlycoSearchMS29 and GlycoWorkBench31 accept a spectrum, search glycan databases, and annotate the spectrum using glycan fragments. A peptide sequence has to be provided to these software tools in advance, which rules out large-scale data analysis. Peptoonist32, which is an extension of Cartoonist33, and GlypID 2.034 search theoretical glycan structure databases instead of a real database that comprises experimentally validated N-linked glycans. Selected rules are used for the generation of the theoretical glycan structures. However, not all glycan structures reported in existing glycan databases are covered by those rules. GlycoPep Grader35 and GlycoPep Detector36 are web-based tools for assigning the compositions of N-linked glycopeptides. However, only one MS/MS spectrum can be processed at a time and users have to
input the possible candidate compositions for glycopeptides and glycans. A more recent tool, GlycoPeptideSearch (GPS)37, can identify intact N-linked glycopeptides from CID MS/MS spectral data. The GPS algorithm first computes a short list of peptides with the N-linked glycosylation motif. It then filters the MS/MS spectra for the signature ions at m/z 204 and 366, and for peaks matching the mass of one of the peptides, alone or with up to three monosaccharide residues attached. Glycans from GlycomeDB that match the putative glycan masses are grouped according to mass and reported as a single match. Other software tools, such as STAT38, Oscar39, StrOligo40, GlycoMaster41, and GLYCH42, attempt to deduce the glycan structures directly from MS/MS spectra using de novo sequencing approaches. These de novo sequencing tools typically require high-quality MS/MS spectra for a reliable analysis. This may leave many spectra of medium quality uninterpreted in a high-throughput experiment. A recent review by Dallas et al. discussed the current state of glycopeptide assignment software in detail and also pointed out the lack of software that can analyze MS/MS data in batch for the unambiguous characterization of N-linked glycopeptides.43

The protein sequence databases commonly used in proteomics research seldom record the glycan structure information for glycosylated proteins. By our count, only 8,482 (21.4%) of the 39,641 human proteins in the UniProt database (Release 2013_10) contain glycosylation site information. The percentage of glycoproteins is much lower than expected. Thus, searching a protein database alone is not a viable approach to glycopeptide identification. On the other hand, databases for isolated glycan structures have recently become available, such as CCSD/CarbBank44,45, CFG database46, EUROCarbDB47, GLYCOSCIENCES.de48, KEGG49,50 and GlycomeDB51. Therefore, it is theoretically possible to search a protein sequence database and a glycan structure database simultaneously to characterize the glycopeptides from the mass spectral data.

In this paper, we describe a new software tool, GlycoMaster DB, to assist the automated and high-throughput characterization of intact N-linked glycopeptides from MS/MS data generated by HCD/ETD or HCD-only fragmentation. The software takes MS/MS spectra as input, searches a given protein
sequence database and an integrated glycan structure database simultaneously, and reports the peptide-glycan pair that best matches each spectrum. The performance of the software was evaluated with four data sets, and the results show the promising utility of the software.

Three key requirements were kept in mind in the development of GlycoMaster DB:
(1) High throughput: It can process hundreds of thousands of MS/MS spectra, thousands of glycan structures, and hundreds of glycoproteins on a desktop computer within hours.
(2) Semi-statistical score: A score is assigned to each identified glycopeptide. A human can therefore sort the identifications according to the score and focus on the high-scoring identifications first.
(3) Flexible ETD utilization: When ETD spectra are available, the software attempts to use them to confirm the peptide sequence and glycosylation site. When ETD is not available, the software can still identify the glycan with HCD and report the peptide mass.

In pursuing these requirements, GlycoMaster DB integrates both newly invented algorithms and many search strategies from the existing literature. GlycoMaster DB differs from each of the existing software tools in at least one of these three characteristics.

2. Methods

GlycoMaster DB processes MS/MS data from intact glycopeptides. Glycopeptides can be fragmented by either HCD/ETD or HCD-only fragmentation. The HCD/ETD protocol is preferred since the ETD spectra can be used to identify glycopeptide sequences more confidently. A short list of protein sequences needs to be specified by the user in a FASTA file. If the glycoproteins are not enriched, or are enriched at the protein level, a large number of non-glycosylated peptides will be fragmented. Conventional database search tools, such as PEAKS52, Mascot53 or Sequest54, can identify the possible proteins from these non-glycosylated peptides. If the enrichment is performed at the peptide level, the proteins can be identified through separate experiments. The list of proteins provided to GlycoMaster DB can be a mixture of glycosylated and non-glycosylated proteins. GlycoMaster DB incorporates an N-linked glycan database that was extracted from GlycomeDB. If required, users can also easily append their own glycan data to this database.

The GlycoMaster DB software analyzes the data in the following three steps: (1) filtration of glycopeptide spectra, (2) glycan assignment, and (3) peptide identification. HCD spectra are used in the first two steps for glycan identification. The third step determines the peptide sequences using either ETD data (if available) or the calculated mass values of the peptides that bear the glycans and contain the consensus sequence (sequon) of N-linked glycopeptides.

2.1 Filtration of Glycopeptide Spectra

The input MS/MS data contains a mixture of spectra from both glycosylated and non-glycosylated peptides. GlycoMaster DB first selects the spectra of glycosylated peptides. Analyzing only these selected spectra improves the search speed and reduces false positives in later steps. HCD spectra generated from N-linked glycopeptides have two characteristics that are not frequently observed in the spectra of non-glycosylated peptides. First, most spectra of N-linked glycopeptides have two diagnostic peaks at m/z 204.09 and 366.14, corresponding to oxonium ions formed by a HexNAc and a disaccharide Hex-HexNAc, respectively.55,56 Second, peaks of a glycopeptide form ion ladders in the high m/z region. The m/z values of two adjacent singly charged peaks in a ladder differ by the mass of a monosaccharide residue, rather than the mass of an amino acid. Both characteristics are used in the algorithm to select the probable glycopeptide spectra. For each spectrum, the diagnostic peaks are first checked.
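The diagnostic-peak check can be illustrated with a short Python sketch. It is not the GlycoMaster DB implementation: the function name `has_diagnostic_peaks`, the `(m/z, intensity)` peak representation, and the 0.02 Th matching tolerance are assumptions made for this example. Only the two oxonium m/z values come from the text.

```python
# Oxonium ion m/z values quoted in the text: HexNAc and Hex-HexNAc.
DIAGNOSTIC_MZ = (204.09, 366.14)


def has_diagnostic_peaks(peaks, tol=0.02):
    """Return True if every diagnostic m/z is matched by at least one peak.

    peaks: iterable of (mz, intensity) tuples (assumed representation).
    tol:   matching tolerance in Th (hypothetical default, not from the paper).
    """
    mz_values = [mz for mz, _ in peaks]
    return all(
        any(abs(mz - target) <= tol for mz in mz_values)
        for target in DIAGNOSTIC_MZ
    )


# Toy spectrum: both oxonium ions present, plus one unrelated peak.
spectrum = [(204.087, 1500.0), (366.139, 900.0), (528.19, 300.0)]
print(has_diagnostic_peaks(spectrum))      # both diagnostic peaks found
print(has_diagnostic_peaks(spectrum[:1]))  # the 366.14 peak is missing
```

A spectrum that fails this inexpensive test can be discarded before the more costly ion-ladder search is attempted.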

The presence of these two peaks triggers the subsequent examination for ion ladders. By default, GlycoMaster DB regards a spectrum as a glycopeptide spectrum only if it has both diagnostic peaks and an ion ladder of length at least four (corresponding to a sequence of three monosaccharide residues). Users can also specify the minimum length of the ion ladder. A shorter minimum length lets more spectra pass the filter and be examined in later steps of the algorithm, which slows the search but potentially improves sensitivity. The choice of this parameter does not affect the scoring function calculation in later steps, as long as the spectrum passes the filtration. Examination of GlycomeDB reveals that all recorded glycans classified as N-linked glycans have the core structure (GlcNAc)2Man3. Therefore
GlycoMaster DB was designed to characterize only glycopeptides with long-chain glycans, i.e., glycans containing at least the five monosaccharide residues of the core structure. For the identification of these glycans, the default minimum tag length (three monosaccharide residues) should be sufficient.

Note that at this filtration stage, the peptide masses are unknown. Therefore, the ion ladders can start from any m/z in the spectrum. It would be too slow either to search the spectrum exhaustively or to enumerate every possible peptide mass. Thus, a dynamic programming algorithm is designed to compute the longest sequence of monosaccharide residues that matches a series of high-intensity peaks. In the algorithm, all mass values are converted to the equivalent nominal masses by multiplying by a factor of 0.9995 and then rounding to the nearest integer.57,58 We pick the 50 most intense peaks of the spectrum to calculate the longest sequence. In a preprocessed spectrum, a monosaccharide residue sequence of length k is represented by a series of peaks at m/z values m_1, …, m_{k+1}, where (m_{i+1} - m_i) is equal to the mass of a monosaccharide residue. The longest sequence of monosaccharide residues (LSMR) problem is to find the maximum length k for a given spectrum. The three most frequently observed monosaccharide residues, Hex, HexNAc, and Fuc, are considered in this algorithm as the residue set. Let m1, m2 and m3 denote the mass values of these three monosaccharide residues. Let L[m] be the length of the longest sequence that ends at mass m. Then L[m] can be computed by the MaxTagLength algorithm as shown in Figure 1.

[Figure 1]

2.2 Glycan Assignment

If an MS/MS spectrum is detected as a possible glycopeptide spectrum, the N-linked glycan database is searched for the glycan that best matches the glycan fragment ions in the MS/MS spectrum. Only glycans that have a smaller mass than the precursor ion are considered.
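To make the LSMR recurrence of Section 2.1 concrete, the following Python sketch computes the longest residue sequence over nominal masses. Figure 1 is not reproduced here, so this is a reconstruction from the prose rather than the authors' MaxTagLength pseudocode. The function names, the `(m/z, intensity)` peak representation, and the convention that L[m] counts residues (so a peak with no predecessor has length 0) are assumptions. The 0.9995 factor, the top-50 peak selection, and the residue set come from the text; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {"Hex": 162.0528, "HexNAc": 203.0794, "Fuc": 146.0579}


def nominal(mass):
    """Convert an accurate mass to its nominal (integer) mass."""
    return round(mass * 0.9995)


def max_tag_length(peaks, top_n=50):
    """Longest run of monosaccharide residues linking high-intensity peaks.

    peaks: iterable of (mz, intensity) for singly charged, deisotoped peaks.
    Returns the number of residues k; the ladder then has k + 1 peaks.
    """
    strongest = sorted(peaks, key=lambda p: p[1], reverse=True)[:top_n]
    masses = sorted({nominal(mz) for mz, _ in strongest})
    steps = [nominal(m) for m in RESIDUE_MASS.values()]

    longest = {}  # longest[m] plays the role of L[m]: residues ending at m
    for m in masses:  # ascending order, so predecessors are already filled in
        longest[m] = max(
            (longest[m - s] + 1 for s in steps if (m - s) in longest),
            default=0,
        )
    return max(longest.values(), default=0)


# Toy ladder: start peak, then +Hex, +HexNAc, +Fuc, plus one unrelated peak.
ladder = [
    (1000.40, 50.0),
    (1162.4528, 80.0),
    (1365.5322, 60.0),
    (1511.5901, 40.0),
    (500.30, 10.0),
]
print(max_tag_length(ladder))
```

Working on integer nominal masses lets each predecessor lookup be a dictionary access, so the cost is proportional to the number of selected peaks times the number of residue types. Under the default setting described above, a spectrum would pass the ladder test when the returned value is at least three.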
Each glycan-spectrum match (GSM) is evaluated and the glycan with the highest score is reported. The GSM scoring scheme is designed similarly to those commonly used for peptide identification: (1) the theoretical m/z values of the possible fragment ions are calculated, (2) for each fragment ion, a reward or a penalty is added to the
score depending on whether its m/z value matches a peak in the spectrum or not. These two components are described in detail in the following two subsections.

2.2.1 Glycan Structure Fragmentation

HCD strongly favors the fragmentation of glycosidic bonds over that of peptide bonds and produces B-, Y-, C-, and Z-ions.59 In theory, a breakage can also occur across the ring of a monosaccharide to produce A- and X-ions. In practice, however, Y-ions are the most commonly observed ions in HCD spectra. Peaks representing oxonium ions can be observed in the low m/z region,56 and in most cases, only those product ions with at most three monosaccharide residues generate significant peaks according to our observation. Therefore, oxonium ions with at most three monosaccharide residues and Y-ions are considered in our scoring scheme.

GlycoMaster DB takes a condensed GlycoCT file from GlycomeDB as input, parses it into a tree structure, and enumerates all the expected oxonium and Y-ions as discussed above. The theoretical m/z values of the singly charged ions are calculated during the ion enumeration. For singly charged ions, the m/z value of an oxonium ion is equal to the total mass of the monosaccharide residues plus an additional proton, and the m/z value of a Y-ion is equal to the singly charged precursor m/z value minus the mass of the removed monosaccharide residues. The computation of the Y-ions includes the implicit mass (m/z) of the peptide substrate, which has not yet been identified at this stage but is derived by subtracting the glycan mass from the precursor mass. The list of theoretical m/z values and the corresponding fragment ion types is provided to our scoring scheme for GSM evaluation.

2.2.2 Glycan-Spectrum Matching Score

In contrast with the development of PSM scores in proteomics, the main challenge in developing the scoring scheme for GSMs is the lack of large-scale training data.
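As a worked illustration of the fragment m/z rules in Section 2.2.1, the sketch below computes oxonium and Y-ion m/z values from residue lists. It covers only the arithmetic, not the GlycoCT parsing or the tree enumeration, and the function names are hypothetical. The residue and proton masses are standard monoisotopic values rather than constants taken from the paper.

```python
PROTON = 1.00728  # proton mass in Da

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {"Hex": 162.05282, "HexNAc": 203.07937, "Fuc": 146.05791}


def residue_sum(residues):
    """Total mass of a list of monosaccharide residue names."""
    return sum(RESIDUE_MASS[r] for r in residues)


def oxonium_mz(residues):
    """Singly charged oxonium ion: residue masses plus one proton."""
    return residue_sum(residues) + PROTON


def y_ion_mz(precursor_mz_1plus, removed_residues):
    """Singly charged Y-ion: singly charged precursor m/z minus lost residues."""
    return precursor_mz_1plus - residue_sum(removed_residues)


print(round(oxonium_mz(["HexNAc"]), 2))         # HexNAc oxonium ion
print(round(oxonium_mz(["Hex", "HexNAc"]), 2))  # Hex-HexNAc oxonium ion
print(round(y_ion_mz(1692.66917, ["Hex"]), 4))  # Y-ion after losing one Hex
```

The two oxonium values reproduce the diagnostic peaks at m/z 204.09 and 366.14 used in the filtration step, which is a convenient sanity check on the residue masses.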
The proper values of the reward and penalty for a matched or unmatched fragment ion may depend on many factors such as the fragment ion type, the intensity of the matching peak, and the mass error. In proteomics, these values are usually statistically learned from a large number of training spectra annotated with known results. Unfortunately,
in the glycoproteomics field, such a large-scale training data set is not yet available. Therefore, a much simpler scoring function is used.

The scoring scheme in GlycoMaster DB calculates raw scores of GSMs first. Given a glycan structure and a spectrum, the theoretical m/z values of the glycan fragment ions are searched in the spectrum. GlycoMaster DB requires the spectrum to be deisotoped and charge deconvoluted in a preprocessing step before the search. Therefore, only singly charged fragment ions are considered. In our experiment, the PEAKS software (Bioinformatics Solutions Inc., Waterloo, Canada) was used for this data preprocessing. The score S for a fragment ion matched by a peak with relative intensity I is calculated using the following equation:

$$
S =
\begin{cases}
\log_{10}(100 \times I), & \text{if a peak with relative intensity } I \ge 0.5\% \text{ is matched} \\
\log_{10} 0.5, & \text{otherwise}
\end{cases}
\qquad (1)
$$
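Equation 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the GlycoMaster DB code: the relative intensity I is taken to be a fraction of the base peak in [0, 1], so that 100 × I is a percentage and the reward equals the penalty exactly at the 0.5% threshold. The function names, and the use of `None` to mark a fragment ion with no matching peak, are hypothetical.

```python
import math

MIN_REL_INTENSITY = 0.005     # 0.5% expressed as a fraction (assumed scale)
PENALTY = math.log10(0.5)     # score of an unmatched fragment ion


def fragment_score(rel_intensity):
    """Eq. 1: reward for a sufficiently intense matched peak, else a penalty.

    rel_intensity: intensity of the matched peak as a fraction of the base
    peak, or None when no peak matches the theoretical m/z.
    """
    if rel_intensity is not None and rel_intensity >= MIN_REL_INTENSITY:
        return math.log10(100.0 * rel_intensity)
    return PENALTY


def raw_score(matched_intensities):
    """Sum of fragment scores over all theoretical fragment ions."""
    return sum(fragment_score(i) for i in matched_intensities)


# Three theoretical ions: base-peak match, no match, and a 10% match.
print(raw_score([1.0, None, 0.1]))
```

Under this reading, a match to the base peak contributes 2, every unmatched ion costs about 0.3, and weak matches near the threshold contribute almost nothing beyond avoiding the penalty.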

The GSM raw score is the sum of all fragment ion scores. The glycan structures with the highest GSM raw score are reported as the best matches for the given spectrum. Every glycan in the database whose mass is smaller than the spectrum precursor mass minus the mass of an asparagine (N) is matched against the spectrum and assigned a GSM raw score. Note that this algorithm only aims to identify the best matching glycan according to the spectrum; the validity of the implicit peptide mass used for the Y-ion m/z calculation is not yet considered at this stage.

The GSM raw score serves the purpose of selecting the best matching glycan structure for any specific spectrum, since a correct structure often produces more high-intensity matches and generates a higher score than a false structure. However, the raw scores of two different spectra cannot be compared directly, because random matching can give an incorrect GSM of a spectrum with many peaks a higher raw score than a correct GSM of a spectrum with few peaks. To compare the GSMs of different spectra, the raw score is further normalized to a -10log10P score as follows, where P denotes the p-value.

Given a spectrum, the raw scores of all glycans in the database are used to fit a normal distribution N(μ, σ²), where μ and σ are the mean and the standard deviation of the GSM raw scores, respectively.
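The normalization described in this subsection can be sketched as follows: fit the mean and standard deviation of the raw scores for one spectrum, take the upper-tail probability of each raw score under the fitted normal distribution, and report -10log10 of that probability. The function names are hypothetical, and the use of the population standard deviation and the floor that guards against log10(0) are assumptions that the text does not specify.

```python
import math
import statistics


def minus10_log10_p(x, mu, sigma):
    """-10*log10 of the probability that N(mu, sigma^2) exceeds x."""
    z = (x - mu) / sigma
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail probability
    p = max(p, 1e-300)                       # assumed floor to avoid log10(0)
    return -10.0 * math.log10(p)


def normalized_scores(raw_scores):
    """Normalize all raw scores of one spectrum against their own fit."""
    mu = statistics.fmean(raw_scores)
    sigma = statistics.pstdev(raw_scores)    # population SD (assumption)
    if sigma == 0:
        raise ValueError("all raw scores are identical; cannot fit sigma")
    return [minus10_log10_p(x, mu, sigma) for x in raw_scores]


# A raw score about 2.33 standard deviations above the mean has p = 0.01.
print(round(minus10_log10_p(2.3263478740408408, 0.0, 1.0), 3))
print([round(s, 2) for s in normalized_scores([1.0, 2.0, 3.0, 4.0, 5.0])])
```

Because the fit is made per spectrum, the resulting score measures how far a glycan stands out from the other candidates for that spectrum, which is what makes scores from different spectra comparable.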

Each raw score x is used to compute a p-value P, the probability that a random variable distributed as N(μ, σ²) exceeds x. The final GSM score is -10log10P and is reported by GlycoMaster DB. According to this definition, a GSM score of 20 corresponds to a p-value of 0.01. The identification results are sorted according to GSM scores.

2.3 Glycopeptide Identification

Two different approaches for glycopeptide identification were implemented separately, depending on whether ETD spectral data is available. Very few ions from the peptide backbone fragmentation are observed in an HCD spectrum. In addition to calculating the peptide y- and b-ions in the regular way, we also checked the peptide backbone fragment ions that include the putative glycan attachment site in two different ways: (1) including the implicit mass of the glycan, and (2) including the mass of a HexNAc residue. None of these cases generated fragment ions that convincingly matched signals in an HCD spectrum. Therefore, the peptide sequence cannot be derived confidently from the HCD spectrum. In contrast, ETD predominantly produces fragment ions by breaking the peptide backbone while leaving the attached glycan intact, and can therefore be used to identify the peptide.

Thus, when the ETD spectrum is available, the peptide identification is primarily determined by the ETD spectrum. First, the proteins provided by the user are digested in silico with the user-specified enzyme, and the peptides containing the N-linked glycopeptide sequons are selected. Since the peptide mass is unknown, each peptide with a mass smaller than the precursor mass is considered as a candidate sequence for the given ETD spectrum. For such a peptide candidate, the mass difference between the theoretical peptide mass and the spectrum's precursor mass is calculated and attributed to a glycan. The m/z values of the peptide backbone fragment ions are calculated and matched to the ETD spectrum to calculate a peptide-spectrum match (PSM) raw score. The same scoring function in Eq. 1 is used in the PSM raw score calculation with consideration of c-, c-H, z-,
z'- and z'-ions. This raw score is then converted into a -10log10P score as the final PSM score, using the same procedure as the GSM score calculation.

Thus, for a pair of HCD and ETD spectra, a list of glycans and a list of peptides are independently obtained from the HCD spectrum and the ETD spectrum, respectively. These two lists are then combined to build intact glycopeptides as follows. All pairs of glycan and peptide are examined. A pair of glycan and peptide is regarded as a possible identification for a spectrum if (1) the sum of the peptide and glycan masses is within the allowed mass error tolerance of the spectrum's precursor ion mass, and (2) either the GSM score or the PSM score is greater than or equal to a user-specified threshold. Human examination of the data sets presented in this paper indicates that -10log10P ≥ 20 (corresponding to a p-value of 0.01) is a good empirical cut-off for both the GSM and PSM scores. The majority of the software's results above this score threshold are plausible identifications. The GSM and PSM scores enable the sorting of the software's results for easier human examination. However, these two scores alone should not be regarded as proof that the identification of any specific spectrum is correct. If multiple glycopeptides satisfy these two criteria, the one with the highest GSM score is kept in the main report, and the others are stored in a secondary table that can be further examined by users. If no peptide sequence is found for an HCD/ETD spectrum pair, the glycans with the same top GSM score from the HCD spectrum and a calculated peptide mass are reported.

When ETD data is not available, the peptide sequences cannot be identified confidently. Instead, the mass of the peptide is determined by subtracting the mass of the top-scoring glycan structure from the precursor mass of the HCD spectrum.
Peptides that both contain the glycopeptide sequon and match the calculated mass within the user-specified error tolerance are reported as possible sequences. If no such peptide is found, only the calculated peptide mass is reported. Note that multiple glycans sharing the same top GSM score and the same composition may be identified from an HCD spectrum. These glycans yield the same peptide mass, because glycans with the same composition share the same mass. All of these glycans are considered as possible identifications.
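The HCD-only peptide lookup can be sketched as follows. This is a simplified illustration, not the GlycoMaster DB implementation. It assumes trypsin as the enzyme with no missed cleavages, neutral monoisotopic masses, a glycan mass equal to the sum of its residue masses, the common -Asn-Xxx-Ser/Thr- sequon only, and a hypothetical 0.02 Da tolerance. All function names are invented for the example, and a sequon that spans a cleavage site is not detected.

```python
import re

# Monoisotopic amino acid residue masses in Da (standard values).
AA_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
SEQUON = re.compile(r"N[^P][ST]")  # Asn-Xxx-Ser/Thr with Xxx != Pro


def tryptic_peptides(protein):
    """Cleave after K or R unless followed by P (no missed cleavages)."""
    return re.findall(r".*?[KR](?!P)|.+", protein)


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(AA_MASS[aa] for aa in peptide) + WATER


def candidate_peptides(protein, precursor_mass, glycan_mass, tol=0.02):
    """Sequon-containing peptides whose mass equals precursor minus glycan."""
    target = precursor_mass - glycan_mass
    return [
        p for p in tryptic_peptides(protein)
        if SEQUON.search(p) and abs(peptide_mass(p) - target) <= tol
    ]


# Toy example: a HexNAc2Hex5 glycan attached to the peptide NGSAK.
glycan = 2 * 203.07937 + 5 * 162.05282
precursor = peptide_mass("NGSAK") + glycan
print(candidate_peptides("MKNGSAKNPTLR", precursor, glycan))
```

In the toy protein, NPTLR is rejected because Pro occupies the Xxx position, and MK contains no Asn. With a real protein list, several peptides can fall within the tolerance, which is the source of the ambiguity in HCD-only data.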

3. Results Four previously published data sets (Ribonuclease B, Human Immunoglobin G, Lectin-Enriched Human Urinary Proteome, and Human Urinary Proteome) by Singh et al.23 and Marimuthu et al.18 were used to evaluate the performance of GlycoMaster DB. The first two data sets were obtained only with HCD/ETD fragmentation and thus glycopeptides could be characterized both on glycan composition and peptide sequences. For the human urinary proteome data sets obtained with HCD fragmentation, GlycoMaster DB identified the glycan structures, while the peptide sequences were reported only according to the calculated masses and the sequon. Clearly, several peptides may share the same mass value, resulting in peptide identification ambiguities if HCD-only data is used. To study the severity of this ambiguity, statistical analysis by computational simulation was conducted and its results are illustrated at the end of this section. Experimental procedures for the sample preparation, glycoprotein enrichment and LC-MS/MS analysis were described in details in Singh et al.23 and Marimuthu et al.18 Here, we briefly describe the four data sets in the following: Ribonuclease B (RNase-B) Data Set: This data set was from the study of HCD product ion-triggered ETD (HCD PI ETD) analysis for characterization of glycoproteins proposed by Singh et al.23 Ribonuclease B (RNase B) from bovine pancreas was digested using Lys-C. The digested peptides were separated using a zwitterionic hydrophilic interaction liquid chromatography nano-column, and then analyzed using the LTQ-Orbitrap Velos (Thermo Fisher Scientific, Bremen, Germany). The mass spectrometer performed a full survey scan with Orbitrap and subsequent HCD MS/MS scans of the 40 most abundant ions. If peaks at m/z 204.09 (HexNAc) or 366.14 (Hex-HexNAc) (±0.05 Th) were within the top 20 most abundant peaks, a supplemental activation ETD MS/MS scan of the precursor ion in the linear ion trap was triggered. 
This data set contained 3,111 MS spectra and 774 MS/MS spectra (632 HCD and 142 ETD spectra).

Human Immunoglobulin G (Human-IgG) Data Set: This data set was from HCD PI ETD analysis for characterization of glycopeptides in human IgG proteins.23 Human IgG is an antibody isotype and its


fragment crystallizable region bears a highly conserved N-linked glycosylation site. Four subclasses of human IgG (IgG1, IgG2, IgG3, and IgG4) were present in this analysis. These proteins were digested by trypsin and analyzed with the same HCD PI ETD strategy used for the acquisition of the RNase-B data set. This data set comprised 952 MS spectra and 5,710 MS/MS spectra (5,436 HCD and 274 ETD spectra).

Lectin-Enriched Human Urinary Proteome (Enriched-HUP) Data Set: This data set was from a comprehensive analysis of the human urinary proteome by Marimuthu et al.18 and contained 24 raw data files. The sample was incubated with a mixture of three agarose-conjugated lectins (concanavalin A, wheat germ agglutinin, and jacalin; Amersham BioSciences) for glycoprotein enrichment. The concentrated protein was then resolved by SDS-PAGE and visualized using colloidal Coomassie staining. Twenty-four bands were excised, subjected to in-gel trypsin digestion, and then analyzed using the LTQ-Orbitrap Velos (Thermo Fisher Scientific, Bremen, Germany) interfaced with an Agilent 1200 Series nanoflow LC system. The mass spectrometry analysis was carried out in data-dependent mode with survey scans acquired using the Orbitrap mass analyzer, and the 20 most abundant precursor ions from a survey scan were selected for HCD MS/MS scans. This data set contained 22,886 MS spectra and 199,890 MS/MS spectra in total.

Human Urinary Proteome (HUP) Data Set: This data set was also from the comprehensive analysis of the human urinary proteome and included 30 raw data files. The sample was separated by SDS-PAGE without lectin enrichment of glycoproteins. Thirty gel bands were excised and subjected to in-gel trypsin digestion. The sample analysis was carried out as described for the Enriched-HUP data set. This data set included 35,788 MS spectra and 170,215 MS/MS spectra in total.

All four data sets were analyzed using PEAKS to identify the lists of proteins with FDR ≤ 1%.
The resultant proteins were exported as FASTA files for GlycoMaster DB analyses. The RNase-B data set was searched against the UniProt bovine database (5,973 entries), and the Human-IgG data set was searched against the UniProt human database (39,641 entries, Release 2013_10). Oxidation of Met was set as a variable PTM and carbamidomethylation of Cys as a fixed PTM. The maximum allowed number


of missed cleavages was set to two. The precursor and fragment error tolerances were 10 ppm and 0.1 Da, respectively. The two human urinary proteome data sets were searched against the UniProt human database. Oxidation of Met, deamidation at Asn and Gln, and protein N-terminal acetylation were selected as variable PTMs and carbamidomethylation of Cys as a fixed PTM. One missed cleavage was allowed for tryptic peptides. The precursor and fragment error tolerances were 20 ppm and 0.1 Da, respectively. The error tolerance values and PTMs were chosen according to the original papers18,23 of these data sets.

In subsequent GlycoMaster DB analyses, these four data sets were searched against our integrated N-linked glycan database. This integrated glycan database is extracted from the GlycomeDB database.51 The GlycomeDB database records 41,114 glycans, including both N- and O-linked ones. Only the N-linked glycans were extracted. Additionally, the structures with the same topology (a two-dimensional monosaccharide sequence) but different linkages (the carbon sites involved in glycosidic bonds that connect monosaccharides) were grouped as a single group. After this preprocessing, there are 2,927 N-linked glycan groups in our glycan database. The glycans in each group may have different linkages, but must have the same composition and topology. Two groups may have the same composition, but must have different topologies. For each data set, the PTMs, the mass error tolerances of precursor and fragment ions, and the maximum number of missed cleavages were set to the same values as those used in the PEAKS analysis. Sodium and potassium adducts60 were also considered in the search.

3.1 RNase-B Data Set

This data set was obtained using the HCD PI ETD strategy. HCD spectra were preprocessed by the Data Refine module of PEAKS and thereafter used to identify the short list of proteins.
Among the 774 MS/MS spectra, 31 were identified as non-glycosylated peptides with high confidence (-10log10P ≥ 34.4 and FDR ≤ 1%) and nine proteins were reported by PEAKS. GlycoMaster DB then analyzed HCD/ETD spectrum-pairs for the identification of glycopeptides. In our experiment, we set the threshold for both the GSM and PSM scores to 20. According to this threshold, 40 HCD/ETD spectrum-pairs, out of 142 pairs, were identified by GlycoMaster DB. The 40 spectrum-pairs have 11 unique precursor m/z and charge combinations as


listed in Table 1. All these identifications share the single peptide sequence SRNLTK. Our results cover all the identifications from manual interpretation of the spectrum-pairs by Singh et al. In addition, GlycoMaster DB also reported a putative glycan with the composition (HexNAc)3Hex6, as shown in Figure 2. Manual validation of the result reveals that the glycan and the corresponding glycopeptide match the HCD/ETD spectrum-pair well.

[Figure 2] [Table 1] [Figure 3]

Figure 3 illustrates an example of a glycopeptide reported from an HCD/ETD spectrum-pair by GlycoMaster DB. Both the HCD spectrum and the triggered ETD spectrum have the same precursor m/z value and retention time. In the HCD spectrum (Figure 3(a)), the peak ladder starting at m/z 921.5 clearly arises from the Y-ions of the glycopeptide. A glycan with the composition (HexNAc)2Hex8 was reported by GlycoMaster DB as the best matching glycan, with a GSM score of 50.65. Among the nine proteins identified by PEAKS, only one peptide, SRNLTK, matches the calculated peptide mass within the given mass error tolerance. The same peptide was also independently identified from the ETD spectrum, as shown in Figure 3(b), with a PSM score of 75.25. Therefore, GlycoMaster DB reported the glycopeptide SRN((HexNAc)2Hex8)LTK as the identification of this HCD/ETD spectrum-pair. The theoretical triply charged precursor m/z is 807.6717, and it differs from the experimental precursor m/z by only 0.38 ppm.

As an optional step to further validate the peptide sequence reported by GlycoMaster DB, the PEAKS database search software was used to analyze the ETD spectra. All glycans reported by GlycoMaster DB were provided to PEAKS as user-specified variable PTMs. PEAKS checked all the in silico digested peptides, rather than only the peptides with N-linked glycopeptide sequons. Therefore, if the best matching peptide for an ETD spectrum turns out to have a sequon and a high score, the identification is of high confidence.
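The precursor m/z and ppm figures quoted for this example can be reproduced approximately with a short calculation. The sketch below is not GlycoMaster DB code; it assumes standard monoisotopic residue and monosaccharide masses (which are not given in the text), and the function names are chosen only for illustration.

```python
# Standard monoisotopic residue masses in Da (assumed values, not from the paper).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
# Monoisotopic monosaccharide residue masses in Da (assumed values).
SUGAR = {"HexNAc": 203.079373, "Hex": 162.052824, "Fuc": 146.057909}
WATER = 18.010565
PROTON = 1.007276


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE[aa] for aa in seq) + WATER


def glycan_mass(composition):
    """Mass added by a glycan given as {monosaccharide: count}."""
    return sum(SUGAR[name] * count for name, count in composition.items())


def glycopeptide_mz(seq, composition, charge):
    """Theoretical m/z of the protonated intact glycopeptide."""
    neutral = peptide_mass(seq) + glycan_mass(composition)
    return (neutral + charge * PROTON) / charge


def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6
```

For SRNLTK carrying (HexNAc)2Hex8 at charge 3, this gives an m/z of about 807.67, and the singly charged [peptide+HexNAc] ion falls at about m/z 921.5, where the Y-ion ladder in Figure 3(a) begins. Differences in the last digits depend on the mass table used.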
Out of the 40 spectra for which GlycoMaster DB reported a peptide, PEAKS was able to


identify the same peptide SRNLTK for 38 of them. This indicates that GlycoMaster DB's peptide identification using ETD spectra is reliable. PEAKS did not identify any peptide for the other two low-quality spectra. However, because GlycoMaster DB could report the glycan from the HCD counterpart with high confidence and also used the sequon information, the peptides with low PSM scores were still reported for users' consideration.

3.2 Human-IgG Data Set

Similarly to the analysis of the RNase-B data set, HCD spectra in the Human-IgG data set were used to identify a short list of proteins. Among the 5,710 MS/MS spectra, 306 were identified as non-glycosylated peptides and 36 proteins were reported. HCD/ETD spectrum-pairs were then extracted for glycopeptide analysis using GlycoMaster DB. Out of the 274 HCD/ETD spectrum-pairs, 10 spectrum-pairs were reported with either the GSM score or the PSM score higher than 20. The reported glycopeptides are listed in Table 2.

[Table 2] [Figure 4]

Figure 4 illustrates a glycopeptide identified from an HCD/ETD spectrum-pair by GlycoMaster DB. Figure 4(a) shows the HCD spectrum recorded at RT 18.49 minutes. The glycan reported by GlycoMaster DB has the composition (HexNAc)4Hex3Fuc1, which forms a clear Y-ion ladder in the high m/z region. From the calculated mass of the peptide, TKPREEQFNSTFR is selected as the possible peptide sequence. The same peptide is also identified by GlycoMaster DB independently from the ETD spectrum shown in Figure 4(b). The difference between the theoretical and the experimental m/z of this glycopeptide is 1.42 ppm.

A PEAKS database search on the ETD mass spectra was further carried out to validate the peptide identification reported by GlycoMaster DB. We set all the glycans reported by GlycoMaster DB as variable PTMs for the PEAKS database search.
Among 274 ETD mass spectra, PEAKS reported 20 PSMs with FDR ≤ 1% and all the peptides identified by GlycoMaster DB with GSM scores higher than 20


were included. PEAKS also identified nine non-glycosylated peptides, which matched both HCD and ETD mass spectra with high PSM scores. Manual checking revealed that their HCD spectra had low peaks at m/z 204.09, which falsely triggered the generation of the corresponding ETD spectra. However, because of the filtration using the ion ladder, as well as the use of the -10log10P score threshold, these spectrum-pairs for the non-glycosylated peptides did not result in false positives in GlycoMaster DB's results.

3.3 Enriched-HUP Data Set

The 24 spectral data files were searched separately against the UniProt human protein database for the short lists of proteins. We then used GlycoMaster DB for intact glycopeptide identification. The results from the 24 spectral files are listed in Table 3.

[Table 3]

In total, 5,455 MS/MS spectra passed both the diagnostic peak and ion ladder filters. These spectra were searched against the N-linked glycan database. Of these, 1,052 spectra matched glycans with high confidence (-10log10P ≥ 20), and 989 of them matched some peptides by accurate mass values. Possible reasons for not reporting peptide sequences for the other 63 spectra include: (1) the peptides may not be in the proteins identified by PEAKS, and (2) the peptides may be the result of non-specific trypsin digestion, have more missed cleavages, or have variable PTMs other than those considered. In total, 56 proteins were reported as glycoproteins, each with at least one glycopeptide identified by GlycoMaster DB.

[Figure 5]

Figure 5 shows three example glycopeptides identified by GlycoMaster DB. Figure 5(a) shows an MS/MS spectrum recorded at RT 29.87 minutes. The precursor m/z 1247.005 corresponds to a doubly charged glycopeptide N((HexNAc)2Hex8)WTITR (m/zcalc = 1247.007 and Δm/z = -1.27 ppm).
The peak at m/z 993.5 corresponds to the singly charged [peptide+HexNAc] ion, which is the Y1 fragment according to the Domon and Costello nomenclature.59 Figure 5(b) and Figure 5(c) show two other HCD spectra identified as having the same peptide sequence but slightly different glycans. The differences between


their precursor masses are the masses of monosaccharide residues. As the retention time is mainly determined by the hydrophobicity of the amino acids rather than by the glycans attached to the glycopeptides, the precursor ions of these spectra have similar retention times.

[Figure 6]

Figure 6 illustrates another example of two similar glycans on the same peptide. The retention times of these two spectra differ by 3.1 minutes. The identified glycans are very similar and SLHVPGLNK is the only glycopeptide that has the calculated peptide mass. Consequently, the two spectra are very similar to each other, except that Figure 6(b) contains two intense peaks at m/z 292.10 and 274.09, which are missing from Figure 6(a). These peaks demonstrate the existence of sialic acid residues. Sialic acids have been reported to influence the retention time of glycopeptides.61 This is consistent with the retention time difference of 3.1 minutes between the two spectra.

3.4 HUP Data Set

This data set was obtained from the same human urine sample as the Enriched-HUP data set. The only difference was that the glycoproteins were not enriched. GlycoMaster DB was used to process this data set because we noticed that many spectra contained high-intensity diagnostic peaks of N-linked glycopeptides. Similarly to the analysis of the Enriched-HUP data set, the 30 spectral data files were searched separately against the UniProt human protein database for the short lists of proteins. The GlycoMaster DB results for those 30 spectral data files are listed in Table 4. In total, 339 spectra matched glycans with high confidence (-10log10P ≥ 20), and 319 of them were assigned corresponding peptide sequences from 68 proteins.

[Table 4] [Figure 7]

Figure 7 illustrates three example glycans identified by GlycoMaster DB from this data set.

3.5 Comparison of Identified Glycans between the Enriched-HUP and HUP Data Sets


GlycoMaster DB identified 1,052 and 339 GSMs from the Enriched-HUP and HUP data sets, respectively. Since the peptide sequences were searched only according to the calculated mass values, there might be ambiguities in the sequence identification. Moreover, the best matching structure from GlycoMaster DB might not be the real one because a spectrum could be matched equally well by several glycan structures sharing the same composition. Thus, the identified glycopeptides were grouped according to the combination of glycan composition and peptide mass for each human urinary proteome data set. These groups, instead of individual glycopeptides, were compared to analyze the relationship of identified glycans between the two data sets. In total, 124 and 133 such glycopeptide groups were discovered from the Enriched-HUP and HUP data sets, respectively. The Venn diagram in Figure 8 illustrates the overlap between these two sets of glycopeptide groups.

[Figure 8]

The comparison shows that GlycoMaster DB could identify more GSMs from the Enriched-HUP data set. That is, more spectra from the enriched data set were assigned glycopeptides, which was expected. However, after merging the identified GSMs according to glycan composition and peptide mass, fewer glycopeptide groups were formed from this enriched data set. This phenomenon is likely related to the different mass spectrometry settings in the two data sets. The enriched data set contains a large number of MS/MS scans that have the same precursor mass at similar retention times. More precisely, the average retention time difference between consecutive fragmentations of the same precursor ion is around 0.15 minutes in the Enriched-HUP data set, but around 0.6 minutes in the HUP data set. The repetitive fragmentation of the same glycopeptide increased the number of spectra assigned, but not necessarily the number of glycopeptide groups.

To examine this, the same data analysis was also performed after merging the repetitive scans together. The PEAKS software was used to combine the MS/MS scans with the same precursor m/z (within an error tolerance of 20 ppm) and similar retention time (within 0.2 minutes difference). The peak intensities at the same m/z values of the merged spectra are added together, and the resulting peak list is processed with the same peak centroiding, noise reduction,


charge deconvolution, and deisotoping as before. Such a merging process was recommended for the identification of regular peptides. For both the Enriched-HUP and HUP data sets, however, we found that the glycopeptide identification performance (in terms of the number of identified glycopeptide groups) decreased slightly after merging. This suggests that a different merging algorithm may be needed for glycopeptide spectra. Nevertheless, the relative performance on the two data sets (Enriched-HUP vs. HUP) was not changed.

3.6 Glycopeptides with Same Mass

If only HCD data are available, the peptide is reported only according to the accurate mass and the presence of the N-linked glycopeptide sequon. This may result in ambiguous identification of the actual peptide when the size of the protein list is large or the mass accuracy is low. A computer simulation was carried out to study the severity of such ambiguity.

[Figure 9]

For each combination of mass accuracy (δ) and number (n) of proteins, n proteins were randomly selected from the UniProt human database (39,641 entries, Release 2013_10). The tryptic peptides containing the N-linked glycopeptide sequon were generated in silico. The percentage of such peptides with unique mass (mass error ≤ δ ppm) was calculated. The random selection was repeated 1,000 times for each δ and n, and the average percentage was plotted in Figure 9. Notably, when the protein list contains no more than 100 entries and the mass accuracy is better than 5 ppm, 99% of the tryptic peptides with the N-linked glycopeptide sequon can be unambiguously identified from the mass.

4. Discussion

N-linked glycosylation is a special type of PTM. The attached glycan on a glycopeptide is relatively labile when collision-based fragmentation approaches are used. Therefore, compared with modified peptides with other commonly observed PTMs, intact glycopeptides can generate quite different MS/MS spectra.
Consequently, the algorithms used to identify other PTMs cannot be readily used to identify N-linked glycopeptides.


Our experiments demonstrate the feasibility of using GlycoMaster DB to assist the identification of N-linked glycopeptides as well as the glycan structures from high-throughput HCD/ETD and HCD-only MS/MS data. The software is designed for the analysis of MS/MS data acquired from intact glycopeptides, rather than deglycosylated glycopeptides. Therefore, it can simultaneously report glycans and peptide sequences. Such an application is important for large-scale glycoproteomic analysis since the connection between glycans and their peptides can be readily determined. Figures 5, 6, and 7 illustrate multiple glycan forms on the same glycosylation sites. This is useful information to facilitate the study of the glycan synthesis and degradation process. In Figure 5, the different glycopeptides with the same peptide sequence have slightly different retention times. This excludes the possibility that the different forms are due to post-source fragmentation in the mass spectrometer.

Most peaks in the HCD spectrum of a glycopeptide are from the fragmentation of the glycan rather than the peptide. This makes it difficult to confidently identify the peptide sequence. Thus, a list of peptide sequences matching the calculated peptide mass and containing the N-linked glycopeptide sequon is reported. However, such identification may be ambiguous when there are a large number of proteins, especially when missed cleavages and non-specific digestion are considered. The high mass accuracy of instruments such as the LTQ-Orbitrap can greatly help to determine the accurate precursor mass. In addition, the retention time of glycopeptides is another piece of potentially useful information. However, when the retention time predicted by published software, ELUDE,62 was used, it was found to be inaccurate for glycopeptides. Future versions of GlycoMaster DB will consider including the retention time information when a glycopeptide retention time predictor becomes available.
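The mass ambiguity of sequon-containing peptides, discussed here and simulated in Section 3.6, can be illustrated with a small sketch. It is not the simulation code used for Figure 9: the cleavage rule (after K or R, not before P), the sequon pattern N-X-S/T with X not P, the use of unmodified monoisotopic residue masses, and all function names are assumptions made for this example.

```python
import re

# Standard monoisotopic residue masses in Da (assumed; modifications ignored).
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
SEQUON = re.compile(r"N[^P][ST]")  # N-X-S/T with X != P


def peptide_mass(seq):
    return sum(RESIDUE[aa] for aa in seq) + WATER


def tryptic_peptides(protein, max_missed=0):
    """In silico digestion: cleave after K/R unless the next residue is P."""
    sites = [i + 1 for i, aa in enumerate(protein[:-1])
             if aa in "KR" and protein[i + 1] != "P"]
    bounds = [0] + sites + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 2 + max_missed, len(bounds))
        for j in range(i + 1, last):
            peptides.append(protein[bounds[i]:bounds[j]])
    return peptides


def sequon_peptides(proteins, max_missed=0):
    """Distinct tryptic peptides that contain an N-linked sequon."""
    return {pep for protein in proteins
            for pep in tryptic_peptides(protein, max_missed)
            if SEQUON.search(pep)}


def fraction_unique(peptides, ppm):
    """Fraction of distinct peptides with no other peptide within ppm."""
    masses = sorted(peptide_mass(p) for p in set(peptides))
    if not masses:
        return 1.0
    unique = 0
    for i, m in enumerate(masses):
        tol = m * ppm * 1e-6
        left_ok = i == 0 or m - masses[i - 1] > tol
        right_ok = i == len(masses) - 1 or masses[i + 1] - m > tol
        unique += left_ok and right_ok
    return unique / len(masses)
```

Applying `fraction_unique(sequon_peptides(sample), delta)` to random samples of n protein sequences, and averaging over many repetitions, would mirror the procedure described in Section 3.6. Peptides that differ only by Leu/Ile always count as ambiguous in this sketch because their masses are identical.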
If the glycopeptides are fragmented with both HCD and ETD, GlycoMaster DB can use the spectrum-pairs simultaneously to report both the glycans and the peptide sequences more confidently.

In our experiments, GlycoMaster DB ran efficiently and required only a moderate PC. For example, the Enriched-HUP data set, which contains 199,890 MS/MS spectra, was analyzed in around one hour on a laptop computer with an Intel® Core™ i5-3360M CPU (2.80 GHz).


Partly due to the complexity of glycosylation, there are a few known limitations of the GlycoMaster DB software in the data analysis. Understanding these limitations is important for the correct use of the software.

First, GlycoMaster DB is designed to assist the identification of N-linked glycopeptides, and at this stage cannot deal with O-linked glycopeptides. There is a small but non-negligible chance that an O-linked spectrum may be mistakenly identified as an N-linked glycopeptide.

Secondly, the current version of GlycoMaster DB will only attach one glycan to a peptide at a time. Thus, if a peptide has more than one glycan attached to different sites, GlycoMaster DB will make a mistake. According to our in silico digestion of the human UniProt database, approximately 6.3% of the tryptic glycopeptides (with up to two missed cleavages) have two or more N-linked glycosylation sites. Thus, we suspect that even fewer glycopeptides in real samples actually have multiple glycans attached. This makes it difficult to collect enough data for algorithm development.

Thirdly, the GSM and PSM scores are designed mainly for the purpose of sorting the results to facilitate easier human examination. However, these scores are not fully qualified statistical scores and should not be used alone as a proof of the correctness of any specific identification. Users should be aware of the potential false positives in the final report. Before a reliable statistical method is developed for the automated validation of the results, the following procedure is a practical way to deal with a large number of results: (1) sort the results according to the GSM score; (2) manually examine a few results around a preset score threshold; and (3) adjust the score threshold according to the users' confidence about the results at the score threshold.
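The first two steps of this triage procedure can be sketched as follows. The `Identification` record, its field names, and the `window` parameter are hypothetical and do not reflect GlycoMaster DB's actual report format; the third step (adjusting the threshold) remains a manual judgment.

```python
from dataclasses import dataclass


@dataclass
class Identification:
    # Hypothetical record for one reported glycopeptide-spectrum match.
    spectrum_id: str
    glycopeptide: str
    gsm_score: float


def rank_by_gsm(results):
    """Step (1): sort identifications by GSM score, best first."""
    return sorted(results, key=lambda r: r.gsm_score, reverse=True)


def near_threshold(ranked, threshold, window=3):
    """Step (2): pick up to `window` results on each side of the threshold.

    `ranked` must already be sorted by descending GSM score.
    """
    above = [r for r in ranked if r.gsm_score >= threshold]
    below = [r for r in ranked if r.gsm_score < threshold]
    # Lowest-scoring accepted results, then highest-scoring rejected ones.
    return above[max(0, len(above) - window):] + below[:window]
```

After inspecting the spectra returned by `near_threshold`, a user would raise or lower `threshold` and repeat until the results at the boundary look convincing.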
Since the HCD spectra in our data sets are of higher quality than the ETD spectra, the GSM score is more reliable than the PSM score for such sorting purposes.

Fourthly, there are many glycans in the database that have the same mass and produce very similar theoretical spectra. Thus, unless the complete fragment ions are observed in the experimental spectrum, there may be multiple glycans receiving the same highest score. Very often the glycans receiving the same highest score also have the same composition. In fact, in the Enriched-HUP data set, approximately 85%


of the identified spectra match multiple glycans with the same highest score. Moreover, for each of these spectra, all of the top-scoring glycans have the same composition. In such a case, GlycoMaster DB will report one of them in the main report, while keeping all the others in a secondary table that can be further examined by users. For the same reason, GlycoMaster DB does not distinguish two glycan structures with the same topology but different linkages. It simply groups them together and treats them as the same structure. To overcome these last limitations, improved mass spectrometry techniques are needed to produce more complete fragment ions and fragment ions that contain the linkage information.

Despite these limitations, the software can efficiently analyze a high-throughput MS/MS data set, select most of the glycopeptides' spectra, and provide pairs of peptide sequences and glycan structures as the plausible identification. It works very well with HCD/ETD spectral data, and also works well with HCD data generated from a small number of proteins (e.g., 200 proteins). GlycoMaster DB is strongest in the analysis of glycopeptide fractions that contain few components as a result of fractionation (200 or fewer proteins). Furthermore, the results are sorted so that the more confident identifications are more likely to be at the top of the list. These functions cannot replace all of the manual interpretation of MS/MS data generated from glycopeptides, but should be useful enough to greatly reduce the workload of human experts in such analyses.

Acknowledgement

The authors thank Dr. Helen J. Cooper and Dr. Akhilesh Pandey for providing their experimental data. This work is supported by NSERC (RGPIN 238748-2006) and Bioinformatics Solutions Inc.

References

(1) Apweiler, R.; Hermjakob, H.; Sharon, N. On the frequency of protein glycosylation, as deduced from analysis of the SWISS-PROT database. Biochimica et Biophysica Acta (BBA)–General Subjects 1999, 1473, 4–8.

(2) Varki, A.; Cummings, R. D.; Esko, J. D.; Freeze, H. H.; Stanley, P.; Bertozzi, C. R.; Hart, G. W.; Etzler, M. E. Essentials of Glycobiology. 2nd edition. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 2009.

(3) Wormald, M. R.; Dwek, R. A. Glycoproteins: glycan presentation and protein-fold stability. Structure 1999, 7, R155–R160.

(4) Kim, Y. J.; Varki, A. Perspectives on the significance of altered glycosylation of glycoproteins in cancer. Glycoconjugate Journal 1997, 14, 569–576.

(5) Freeze, H. H.; Aebi, M. Altered glycan structures: the molecular basis of congenital disorders of glycosylation. Current Opinion in Structural Biology 2005, 15, 490–498.

(6) Cummings, R. D. The repertoire of glycan determinants in the human glycome. Molecular BioSystems 2009, 5, 1087–1104.

(7) Ohtsubo, K.; Marth, J. D. Glycosylation in cellular mechanisms of health and disease. Cell 2006, 126, 855–867.

(8) Varki, A. Biological roles of oligosaccharides: all of the theories are correct. Glycobiology 1993, 3, 97–130.

(9) Pan, S.; Chen, R.; Aebersold, R.; Brentnall, T. A. Mass spectrometry based glycoproteomics from a proteomics perspective. Molecular & Cellular Proteomics 2011, 10, R110.003251.

(10) Srivastava, S. Move over proteomics, here comes glycomics. Journal of Proteome Research 2008, 7, 1799–1799.

(11) Mellquist, J. L.; Kasturi, L.; Spitalnik, S. L.; Shakin-Eshleman, S. H. The amino acid following an Asn-X-Ser/Thr sequon is an important determinant of N-linked core glycosylation efficiency. Biochemistry 1998, 37, 6833–6837.


(12) Brooks, S. A.; Dwek, M. V.; Schumacher, U. Functional and molecular glycobiology; Bios Scientific: Oxford, 2002.

(13) Imperiali, B.; O’Connor, S. E. Effect of N-linked glycosylation on glycopeptide and glycoprotein structure. Current Opinion in Chemical Biology 1999, 3, 643.

(14) Hägglund, P.; Bunkenborg, J.; Elortza, F.; Jensen, O. N.; Roepstorff, P. A new strategy for identification of N-glycosylated proteins and unambiguous assignment of their glycosylation sites using HILIC enrichment and partial deglycosylation. Journal of Proteome Research 2004, 3, 556–566.

(15) Hägglund, P.; Matthiesen, R.; Elortza, F.; Højrup, P.; Roepstorff, P.; Jensen, O. N.; Bunkenborg, J. An enzymatic deglycosylation scheme enabling identification of core fucosylated N-glycans and O-glycosylation site mapping of human plasma proteins. Journal of Proteome Research 2007, 6, 3021–3031.

(16) Tarentino, A. L.; Gomez, C. M.; Plummer Jr, T. H. Deglycosylation of asparagine-linked glycans by peptide: N-glycosidase F. Biochemistry 1985, 24, 4665–4671.

(17) Zhang, W.; Wang, H.; Zhang, L.; Yao, J.; Yang, P. Large-scale assignment of N-glycosylation sites using complementary enzymatic deglycosylation. Talanta 2011, 85, 499–505.

(18) Marimuthu, A. et al. A comprehensive map of the human urinary proteome. Journal of Proteome Research 2011, 10, 2734–2743.

(19) Wuhrer, M.; Catalina, M. I.; Deelder, A. M.; Hokke, C. H. Glycoproteomics based on tandem mass spectrometry of glycopeptides. Journal of Chromatography B: Analytical Technologies in the Biomedical and Life Sciences 2007, 849, 115–128.


(20) Alley, W. R.; Mechref, Y.; Novotny, M. V. Characterization of glycopeptides by combining collision-induced dissociation and electron-transfer dissociation mass spectrometry data. Rapid Communications in Mass Spectrometry 2009, 23, 161–170.

(21) Saba, J.; Dutta, S.; Hemenway, E.; Viner, R. Increasing the productivity of glycopeptides analysis by using higher-energy collision dissociation-accurate mass-product-dependent electron transfer dissociation. International Journal of Proteomics 2012, 2012, 7 pages.

(22) Scott, N. E.; Parker, B. L.; Connolly, A. M.; Paulech, J.; Edwards, A. V. G.; Crossett, B.; Falconer, L.; Kolarich, D.; Djordjevic, S. P.; Højrup, P.; Packer, N. H.; Larsen, M. R.; Cordwell, S. J. Simultaneous glycan-peptide characterization using hydrophilic interaction chromatography and parallel fragmentation by CID, higher energy collisional dissociation, and electron transfer dissociation MS applied to the N-linked glycoproteome of Campylobacter jejuni. Molecular & Cellular Proteomics 2011, 10, M000031MCP201.

(23) Singh, C.; Zampronio, C. G.; Creese, A. J.; Cooper, H. J. Higher Energy Collision Dissociation (HCD) Product Ion-Triggered Electron Transfer Dissociation (ETD) Mass Spectrometry for the Analysis of N-Linked Glycoproteins. Journal of Proteome Research 2012, 11, 4517–4525.

(24) Aoki-Kinoshita, K. F. An introduction to bioinformatics for glycomics research. PLoS Computational Biology 2008, 4, e1000075.

(25) Ranzinger, R.; Maaß, K.; Lütteke, T. Functional and Structural Proteomics of Glycoproteins; Springer, 2011; pp 59–90.

(26) Cooper, C. A.; Gasteiger, E.; Packer, N. H. GlycoMod – a software tool for determining glycosylation compositions from mass spectrometric data. Proteomics 2001, 1, 340–349.

(27) Maass, K.; Ranzinger, R.; Geyer, H.; von der Lieth, C.-W.; Geyer, R. “Glyco-peakfinder” – de novo composition analysis of glycoconjugates. Proteomics 2007, 7, 4435–4444.

(28) Lohmann, K. K.; von der Lieth, C.-W. GLYCO-FRAGMENT: a web tool to support the interpretation of mass spectra of complex carbohydrates. Proteomics 2003, 3, 2028–2035.

(29) Lohmann, K. K.; von der Lieth, C.-W. GlycoFragment and GlycoSearchMS: web tools to support the interpretation of mass spectra of complex carbohydrates. Nucleic Acids Research 2004, 32, W261–W266.

(30) Joshi, H. J.; Harrison, M. J.; Schulz, B. L.; Cooper, C. A.; Packer, N. H.; Karlsson, N. G. Development of a mass fingerprinting tool for automated interpretation of oligosaccharide fragmentation data. Proteomics 2004, 4, 1650–1664.

(31) Ceroni, A.; Maass, K.; Geyer, H.; Geyer, R.; Dell, A.; Haslam, S. M. GlycoWorkbench: a tool for the computer-assisted annotation of mass spectra of glycans. Journal of Proteome Research 2008, 7, 1650–1659.

(32) Goldberg, D.; Bern, M.; Parry, S.; Sutton-Smith, M.; Panico, M.; Morris, H. R.; Dell, A. Automated N-glycopeptide identification using a combination of single- and tandem-MS. Journal of Proteome Research 2007, 6, 3995–4005.

(33) Goldberg, D.; Sutton-Smith, M.; Paulson, J.; Dell, A. Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra. Proteomics 2005, 5, 865–875.

(34) Mayampurath, A. M.; Wu, Y.; Segu, Z. M.; Mechref, Y.; Tang, H. Improving confidence in detection and characterization of protein N-glycosylation sites and microheterogeneity. Rapid Communications in Mass Spectrometry 2011, 25, 2007–2019.

(35) Woodin, C. L.; Hua, D.; Maxon, M.; Rebecchi, K. R.; Go, E. P.; Desaire, H. GlycoPep Grader: a web-based utility for assigning the composition of N-linked glycopeptides. Analytical Chemistry 2012, 84, 4821–4829.

(36) Zhu, Z.; Hua, D.; Clark, D. F.; Go, E. P.; Desaire, H. GlycoPep Detector: a tool for assigning mass spectrometry data of N-Linked glycopeptides on the basis of their electron transfer dissociation spectra. Analytical Chemistry 2013, 85, 5023–5032.

(37) Chandler, K. B.; Pompach, P.; Goldman, R.; Edwards, N. Exploring site-specific N-glycosylation microheterogeneity of haptoglobin using glycopeptide CID tandem mass spectra and glycan database search. Journal of Proteome Research 2013, 12, 3652–3666.

(38) Gaucher, S. P.; Morrow, J.; Leary, J. A. STAT: a saccharide topology analysis tool used in combination with tandem mass spectrometry. Analytical Chemistry 2000, 72, 2331–2336.

(39) Lapadula, A. J.; Hatcher, P. J.; Hanneman, A. J.; Ashline, D. J.; Zhang, H.; Reinhold, V. N. Congruent strategies for carbohydrate sequencing. 3. OSCAR: an algorithm for assigning oligosaccharide topology from MSn data. Analytical Chemistry 2005, 77, 6271–6279.

(40) Ethier, M.; Saba, J. A.; Spearman, M.; Krokhin, O.; Butler, M.; Ens, W.; Standing, K. G.; Perreault, H. Application of the StrOligo algorithm for the automated structure assignment of complex N-linked glycans from glycoproteins using tandem mass spectrometry. Rapid Communications in Mass Spectrometry 2003, 17, 2713–2720.

(41) Shan, B.; Zhang, K.; Ma, B.; Zhang, C.; Lajoie, G. A. GlycoMaster – A software for interpretation of glycopeptides from MS/MS spectra. Proceedings of the 52nd ASMS Conference on Mass Spectrometry and Allied Topics, 2004.

(42) Tang, H.; Mechref, Y.; Novotny, M. V. Automated interpretation of MS/MS spectra of oligosaccharides. Bioinformatics 2005, 21, i431–i439.

(43) Dallas, D. C.; Martin, W. F.; Hua, S.; German, J. B. Automated glycopeptide analysis – review of current state and future directions. Briefings in Bioinformatics 2012.

(44) Doubet, S.; Bock, K.; Smith, D.; Darvill, A.; Albersheim, P. The complex carbohydrate structure database. Trends in Biochemical Sciences 1989, 14, 475–477.

(45) Doubet, S.; Albersheim, P. CarbBank. Glycobiology 1992, 2, 505.

(46) Raman, R.; Venkataraman, M.; Ramakrishnan, S.; Lang, W.; Raguram, S.; Sasisekharan, R. Advancing glycomics: implementation strategies at the consortium for functional glycomics. Glycobiology 2006, 16, 82R–90R.

(47) von der Lieth, C.-W.; Freire, A. A.; Blank, D.; Campbell, M. P.; Ceroni, A.; Damerel, D. R.; Dell, A.; Dwek, R. A.; Ernst, B.; Fogh, R.; Frank, M.; Geyer, H.; Geyer, R.; Harrison, M. J.; Henrick, K.; Herget, S.; Hull, W. E.; Ionides, J.; Joshi, H. J.; Kamerling, J. P.; Leeflang, B. R.; Lütteke, T.; Lundborg, M.; Maass, K.; Merry, A.; Ranzinger, R.; Rosen, J.; Royle, L.; Rudd, P. M.; Schloissnig, S.; Stenutz, R.; Vranken, W. F.; Widmalm, G.; Haslam, S. M. EUROCarbDB: an open-access platform for glycoinformatics. Glycobiology 2011, 21, 493–502.

(48) Lütteke, T.; Bohne-Lang, A.; Loss, A.; Goetz, T.; Frank, M.; von der Lieth, C.-W. GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research. Glycobiology 2006, 16, 71R–81R.

(49) Kanehisa, M.; Goto, S.; Kawashima, S.; Okuno, Y.; Hattori, M. The KEGG resource for deciphering the genome. Nucleic Acids Research 2004, 32, D277–D280.

(50) Hashimoto, K.; Kanehisa, M. KEGG GLYCAN for integrated analysis of pathways, genes, and structures. Experimental Glycoscience 2008, 441–444.

(51) Ranzinger, R.; Herget, S.; Wetter, T.; von der Lieth, C.-W. GlycomeDB – integration of open-access carbohydrate structure databases. BMC Bioinformatics 2008, 9, 384.

(52) Zhang, J.; Xin, L.; Shan, B.; Chen, W.; Xie, M.; Yuen, D.; Zhang, W.; Zhang, Z.; Lajoie, G. A.; Ma, B. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Molecular & Cellular Proteomics 2012, 11, M111.010587.

(53) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-Based Protein Identification by Searching Sequence Databases using Mass Spectrometry Data. Electrophoresis 1999, 20, 3551–3567.

(54) Eng, J. K.; McCormack, A. L.; Yates III, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. Journal of the American Society for Mass Spectrometry 1994, 5, 976–989.

(55) Conboy, J. J.; Henion, J. D. The determination of glycopeptides by liquid chromatography/mass spectrometry with collision-induced dissociation. Journal of the American Society for Mass Spectrometry 1992, 3, 804–814.

(56) Huddleston, M. J.; Bean, M. F.; Carr, S. A. Collisional fragmentation of glycopeptides by electrospray ionization LC/MS and LC/MS/MS: methods for selective detection of glycopeptides in protein digests. Analytical Chemistry 1993, 65, 877–884.

(57) Bern, M.; Cai, Y.; Goldberg, D. Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Analytical Chemistry 2007, 79, 1393–1400.

(58) Hines, W. M.; Falick, A. M.; Burlingame, A. L.; Gibson, B. W. Pattern-based algorithm for peptide sequencing from tandem high energy collision-induced dissociation mass spectra. Journal of the American Society for Mass Spectrometry 1992, 3, 326–336.

(59) Domon, B.; Costello, C. E. A systematic nomenclature for carbohydrate fragmentations in FAB-MS/MS spectra of glycoconjugates. Glycoconjugate Journal 1988, 5, 397–409.

(60) Harvey, D. J. Collision-induced fragmentation of underivatized N-linked carbohydrates ionized by electrospray. Journal of Mass Spectrometry 2000, 35, 1178–1190.

(61) Guile, G. R.; Rudd, P. M.; Wing, D. R.; Prime, S. B.; Dwek, R. A. A rapid high-resolution high-performance liquid chromatographic method for separating glycan mixtures and analyzing oligosaccharide profiles. Analytical Biochemistry 1996, 240, 210–226.

(62) Moruz, L.; Tomazela, D.; Käll, L. Training, selection, and robust calibration of retention time models for targeted proteomics. Journal of Proteome Research 2010, 9, 5209–5216.

Table 1. The grouped results of GlycoMaster DB from the RNase-B data set. Spectra with the same precursor m/z and charge are grouped in a row if they have the same glycan composition. The GSM score, PSM score, and mass error in a row are from the HCD/ETD spectrum pair with the highest GSM score.

| Precursor m/z | Precursor Charge | RT Range | Glycan Composition | GSM Score | Glycan Mass | PSM Score | Error (ppm) |
|---|---|---|---|---|---|---|---|
| 886.9 | 2 | 24.41-29.12 | (HexNAc)2Hex4 | 33 | 1054.37 | 113.34 | -0.89 |
| 886.9 | 2 | 24.41-29.12 | (HexNAc)2Hex4 | 33 | 1054.37 | 113.34 | -0.89 |
| 967.93 | 2 | 20.13-29.67 | (HexNAc)2Hex5 | 40.87 | 1216.43 | 82.14 | -0.44 |
| 645.62 | 3 | 19.79-32.45 | (HexNAc)2Hex5 | 39.27 | 1216.43 | 51.7 | -0.09 |
| 699.64 | 3 | 24.74-30.90 | (HexNAc)2Hex6 | 49.81 | 1378.49 | 43.95 | -0.96 |
| 1048.95 | 2 | 24.68-29.35 | (HexNAc)2Hex6 | 46.74 | 1378.49 | 71.28 | -2.91 |
| 753.65 | 3 | 27.56-29.59 | (HexNAc)2Hex7 | 42.79 | 1540.54 | 58.02 | 0.65 |
| 1129.98 | 2 | 29.27 | (HexNAc)2Hex7 | 38.62 | 1540.54 | 130.28 | -0.76 |
| 767.33 | 3 | 26.82-28.31 | (HexNAc)3Hex6 | 32.11 | 1581.57 | 48.32 | -2.31 |
| 807.67 | 3 | 26.86-30.52 | (HexNAc)2Hex8 | 50.65 | 1702.59 | 75.25 | -0.38 |
| 1211.01 | 2 | 29.64 | (HexNAc)2Hex8 | 30.79 | 1702.59 | 127.78 | -1.81 |
| 861.69 | 3 | 28.02-30.28 | (HexNAc)2Hex9 | 62.83 | 1864.64 | 37.52 | 0.14 |
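The grouping rule stated in the Table 1 caption can be sketched in a few lines of Python. This is only an illustration of the rule, not GlycoMaster DB code: the record fields, the rounding of m/z to two decimals, and the example values are all assumptions made for the sketch.

```python
from collections import defaultdict

# Hypothetical HCD/ETD spectrum-pair records (illustrative values, not data from the paper).
pairs = [
    {"mz": 967.93, "charge": 2, "glycan": "(HexNAc)2Hex5", "rt": 20.13, "gsm": 35.2, "psm": 60.1, "ppm": -1.10},
    {"mz": 967.93, "charge": 2, "glycan": "(HexNAc)2Hex5", "rt": 29.67, "gsm": 40.9, "psm": 82.1, "ppm": -0.44},
    {"mz": 645.62, "charge": 3, "glycan": "(HexNAc)2Hex5", "rt": 19.79, "gsm": 39.3, "psm": 51.7, "ppm": -0.09},
]

def group_pairs(records, mz_decimals=2):
    """Group records by (precursor m/z, charge, glycan composition).

    Each output row reports the RT range of the group and takes its
    GSM score, PSM score and mass error from the member with the
    highest GSM score.
    """
    groups = defaultdict(list)
    for rec in records:
        key = (round(rec["mz"], mz_decimals), rec["charge"], rec["glycan"])
        groups[key].append(rec)

    rows = []
    for (mz, charge, glycan), members in groups.items():
        best = max(members, key=lambda rec: rec["gsm"])
        rts = [rec["rt"] for rec in members]
        rows.append({
            "mz": mz, "charge": charge, "glycan": glycan,
            "rt_range": (min(rts), max(rts)),
            "gsm": best["gsm"], "psm": best["psm"], "ppm": best["ppm"],
        })
    return rows

rows = group_pairs(pairs)
for row in rows:
    print(row)
```

Rounding the precursor m/z before grouping is one simple way to treat near-identical values as equal; a tolerance-based comparison would be an alternative.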

Table 2. Glycopeptides reported by GlycoMaster DB from the Human-IgG data set.

| Precursor m/z | Precursor Charge | RT | Glycan | GSM Score | Peptide | PSM Score | Error (ppm) | IgG Subclass |
|---|---|---|---|---|---|---|---|---|
| 1301.53 | 2 | 12.55 | (HexNAc)4Hex3Fuc1 | 27.51 | EEQFN(+1444.54)STFR | 10 | -0.28 | IgG 2/3 |
| 1317.53 | 2 | 12.52 | (HexNAc)4Hex3Fuc1 | 25.59 | EEQYN(+1444.54)STYR | 13.93 | -0.56 | IgG 1 |
| 1301.53 | 2 | 11.89 | (HexNAc)4Hex3Fuc1 | 20.43 | EEQFN(+1444.54)STFR | 16.62 | -0.75 | IgG 2/3 |
| 1028.79 | 3 | 18.49 | (HexNAc)4Hex3Fuc1 | 20.14 | TKPREEQFN(+1444.54)STFR | 69.07 | -1.42 | IgG 2/3 |
| 1039.45 | 3 | 19.56 | (HexNAc)4Hex3Fuc1 | 14.36 | TKPREEQYN(+1444.54)STYR | 47.2 | -1.41 | IgG 1 |
| 1082.8 | 3 | 18.69 | (HexNAc)4Hex4Fuc1 | 13.79 | TKPREEQFN(+1606.59)STFR | 89.89 | 3.83 | IgG 2/3 |
| 1093.47 | 3 | 19.32 | (HexNAc)4Hex4Fuc1 | 12.27 | TKPREEQYN(+1606.59)STYR | 70.55 | -2.57 | IgG 1 |
| 1082.81 | 3 | 18.17 | (HexNAc)4Hex4Fuc1 | 11.15 | TKPREEQFN(+1606.59)STFR | 34.58 | -2.25 | IgG 2/3 |
| 1093.47 | 3 | 19.9 | (HexNAc)4Hex4Fuc1 | 10.91 | TKPREEQYN(+1606.59)STYR | 56 | -2.01 | IgG 1 |
| 1093.47 | 3 | 18.74 | (HexNAc)4Hex5 | 10.65 | TKPREEQFN(+1622.59)STYR | 41.99 | -2.46 | IgG 1 |
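The peptide strings in Table 2 mark the glycosylated residue with a mass shift in parentheses, for example EEQFN(+1444.54)STFR. The following sketch parses that notation and checks that the marked residue starts an N-X-S/T motif. The helper names are hypothetical, and the sequon definition used (X is any residue except proline) is the commonly quoted one, assumed here rather than taken from this section.

```python
import re

# Matches a single "(+mass)" annotation such as "(+1444.54)".
MOD_PATTERN = re.compile(r"\(\+(\d+(?:\.\d+)?)\)")

def parse_glycopeptide(annotated):
    """Split 'EEQFN(+1444.54)STFR' into (sequence, 0-based site index, mass shift)."""
    match = MOD_PATTERN.search(annotated)
    if match is None:
        raise ValueError("no '(+mass)' annotation found")
    site = match.start() - 1  # the residue immediately before '('
    sequence = annotated[:match.start()] + annotated[match.end():]
    return sequence, site, float(match.group(1))

def is_n_glyco_sequon(sequence, site):
    """True if sequence[site] starts an N-X-S/T motif with X != P (assumed definition)."""
    if site < 0 or site + 2 >= len(sequence):
        return False
    return (sequence[site] == "N"
            and sequence[site + 1] != "P"
            and sequence[site + 2] in "ST")

sequence, site, shift = parse_glycopeptide("EEQFN(+1444.54)STFR")
print(sequence, site, shift, is_n_glyco_sequon(sequence, site))
```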

Table 3. Results for each spectral file in the Enriched-HUP data set. The spectral files named “gpe12” and “gpe23” are not listed because GlycoMaster DB reported no glycans for them. The first column lists the names of the spectral files. The second column gives the number of MS/MS spectra in each file after data preprocessing, and the third column gives the number of proteins reported for each file. The next two columns give the numbers of spectra that passed the two filters, respectively. The numbers of identified GSMs (−10log10P ≥ 20) and PSMs are listed in the last two columns. The last row shows the total of each column.

| Data Name | MS/MS Number | Protein Number | Pass Filter-1 (Diagnostic peaks) | Pass Filter-2 (Ion ladder) | GSM Number | PSM Number |
|---|---|---|---|---|---|---|
| gpe01 | 8,850 | 53 | 348 | 84 | 63 | 63 |
| gpe02 | 9,272 | 63 | 252 | 62 | 33 | 33 |
| gpe03 | 9,246 | 65 | 177 | 56 | 19 | 19 |
| gpe04 | 9,350 | 53 | 272 | 89 | 22 | 20 |
| gpe05 | 9,013 | 59 | 188 | 60 | 49 | 45 |
| gpe06 | 8,842 | 68 | 292 | 72 | 34 | 27 |
| gpe07 | 8,206 | 100 | 227 | 34 | 25 | 25 |
| gpe08 | 9,325 | 83 | 179 | 32 | 15 | 14 |
| gpe09 | 9,000 | 77 | 161 | 32 | 25 | 25 |
| gpe10 | 9,254 | 87 | 135 | 32 | 10 | 5 |
| gpe11 | 9,391 | 86 | 231 | 24 | 7 | 6 |
| gpe13 | 7,977 | 71 | 340 | 109 | 59 | 58 |
| gpe14 | 7,671 | 65 | 664 | 316 | 146 | 133 |
| gpe15 | 8,218 | 68 | 563 | 272 | 148 | 139 |
| gpe16 | 8,134 | 91 | 264 | 127 | 75 | 75 |
| gpe17 | 8,140 | 94 | 182 | 87 | 63 | 63 |
| gpe18 | 8,256 | 88 | 245 | 153 | 90 | 90 |
| gpe19 | 8,431 | 84 | 245 | 80 | 57 | 56 |
| gpe20 | 8,485 | 81 | 190 | 92 | 39 | 31 |
| gpe21 | 8,981 | 44 | 122 | 69 | 54 | 46 |
| gpe22 | 8,334 | 79 | 113 | 27 | 13 | 12 |
| gpe24 | 8,320 | 65 | 65 | 24 | 6 | 4 |
| Total | 190,696 | 497ᵃ | 5,455 | 1,933 | 1,052 | 989 |

ᵃ This is the total number of unique proteins, rather than the sum of the protein numbers reported for each spectral file.
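The GSM threshold in the caption is expressed on a −10 log10 P scale. Assuming P is a probability-like value in (0, 1], a score of 20 corresponds to P = 0.01, and higher scores correspond to smaller P. The short sketch below converts between the two scales; the function names are illustrative only.

```python
import math

def score_from_p(p_value):
    """Convert a P value in (0, 1] to a -10*log10(P) score."""
    return -10.0 * math.log10(p_value)

def p_from_score(score):
    """Convert a -10*log10(P) score back to the P value."""
    return 10.0 ** (-score / 10.0)

def passes_threshold(p_value, min_score=20.0):
    """True if the score reaches the threshold used in the table captions."""
    return score_from_p(p_value) >= min_score

print(p_from_score(20.0))       # P value at the threshold
print(passes_threshold(0.001))  # a smaller P gives a higher score
```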

Table 4. Results for each spectral file in the HUP data set. Only the 24 spectral files in which GlycoMaster DB identified glycans are listed. The first column lists the names of the spectral files. The second column gives the number of MS/MS spectra in each file after data preprocessing, and the third column gives the number of proteins reported for each file. The next two columns give the numbers of spectra that passed the two filters, respectively. The numbers of identified GSMs (−10lgP ≥ 20) and PSMs are listed in the last two columns. The last row shows the total of each column.

| Data Name | MS/MS Number | Protein Number | Pass Filter-1 (Diagnostic peaks) | Pass Filter-2 (Ion ladder) | GSM Number | PSM Number |
|---|---|---|---|---|---|---|
| ig06 | 8,255 | 71 | 165 | 13 | 3 | 3 |
| ig07 | 8,555 | 61 | 188 | 42 | 17 | 17 |
| ig08 | 8,370 | 70 | 142 | 10 | 11 | 11 |
| ig09 | 6,044 | 87 | 89 | 22 | 15 | 15 |
| ig12 | 8,982 | 78 | 226 | 30 | 16 | 16 |
| ig13 | 8,999 | 79 | 334 | 56 | 42 | 38 |
| ig14 | 5,827 | 274 | 330 | 74 | 25 | 23 |
| ig15 | 5,724 | 340 | 460 | 103 | 36 | 35 |
| ig16 | 5,376 | 289 | 361 | 63 | 34 | 32 |
| ig17 | 5,046 | 312 | 323 | 41 | 16 | 16 |
| ig18 | 4,530 | 325 | 346 | 39 | 20 | 19 |
| ig19 | 4,464 | 309 | 275 | 32 | 14 | 14 |
| ig20 | 5,034 | 325 | 342 | 36 | 18 | 18 |
| ig21 | 5,037 | 278 | 417 | 44 | 8 | 7 |
| ig22 | 4,884 | 287 | 342 | 33 | 12 | 11 |
| ig23 | 5,325 | 303 | 464 | 61 | 11 | 9 |
| ig24 | 4,888 | 308 | 294 | 19 | 7 | 7 |
| ig25 | 4,210 | 326 | 204 | 22 | 6 | 6 |
| ig26 | 4,110 | 310 | 215 | 25 | 8 | 6 |
| ig27 | 3,961 | 300 | 240 | 13 | 2 | 2 |
| ig28 | 3,829 | 292 | 104 | 12 | 3 | 1 |
| ig29 | 4,140 | 362 | 192 | 30 | 4 | 3 |
| iga2 | 8,860 | 68 | 134 | 20 | 1 | 0 |
| iga3 | 9,398 | 87 | 125 | 20 | 10 | 10 |
| Total | 143,848 | 1,485ᵃ | 6,312 | 860 | 339 | 319 |

ᵃ This is the total number of unique proteins, rather than the sum of the protein numbers reported for each spectral file.

Figure 1. The MaxTagLength algorithm for solving the longest sequence of monosaccharide residues (LSMR) problem.
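The pseudocode of Figure 1 is not reproduced in this text, so the following is not the MaxTagLength algorithm itself. It is a generic dynamic-programming sketch for a closely related task: finding the longest chain of peaks in which consecutive peaks differ by the mass of one monosaccharide residue. Singly charged peaks, the 0.02 Da tolerance, the residue table, and the function name are all assumptions.

```python
# Monoisotopic residue masses (Da) of common monosaccharides.
RESIDUE_MASSES = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "Fuc": 146.0579,
    "NeuAc": 291.0954,
}

def longest_residue_ladder(peaks, tol=0.02):
    """Return the residue names along the longest peak ladder, in order of increasing m/z."""
    peaks = sorted(peaks)
    n = len(peaks)
    if n == 0:
        return []
    best = [0] * n     # best[i]: number of residues in the longest ladder ending at peaks[i]
    prev = [None] * n  # back-pointer: (index of previous peak, residue name)
    for i in range(n):
        for j in range(i):
            delta = peaks[i] - peaks[j]
            for name, mass in RESIDUE_MASSES.items():
                if abs(delta - mass) <= tol and best[j] + 1 > best[i]:
                    best[i] = best[j] + 1
                    prev[i] = (j, name)
    # Trace back from the peak that ends the longest ladder.
    end = max(range(n), key=best.__getitem__)
    tag = []
    while prev[end] is not None:
        end, name = prev[end]
        tag.append(name)
    tag.reverse()
    return tag

# Hypothetical peak list: a Hex step, then a HexNAc step, plus one unrelated peak.
print(longest_residue_ladder([1000.0, 1162.0528, 1365.1322, 1500.0]))
```

The double loop makes this quadratic in the number of peaks, which is usually acceptable for a single MS/MS spectrum after peak picking.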

Figure 2. A glycan with the composition (HexNAc)3Hex6 reported from the RNase-B data set generated by the HCD PI ETD strategy. (a) The annotated HCD spectrum of precursor ions with m/z 767.33167. GlycoMaster DB reported the best matching glycan with the composition (HexNAc)3Hex6. SRNLTK is the only potential glycopeptide with a mass similar to the calculated mass of 699.404. (b) The annotated ETD spectrum triggered by product ions in the HCD spectrum shown in (a).
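The caption does not state how the calculated mass of 699.404 is obtained. One reading that reproduces it, assumed here, is to take the neutral precursor mass, subtract the glycan residue mass, and subtract one water, which leaves the sum of the peptide residue masses. The sketch below applies that reading to the precursor in Figure 2 (charge 3 is taken from Table 1) and compares the result with the residue-mass sum of SRNLTK. The constants are standard monoisotopic masses; the function names are illustrative.

```python
PROTON = 1.007276   # Da
WATER = 18.010565   # Da

# Monoisotopic residue masses (Da).
MONOSACCHARIDE = {"HexNAc": 203.079373, "Hex": 162.052824}
AMINO_ACID = {"S": 87.03203, "R": 156.10111, "N": 114.04293,
              "L": 113.08406, "T": 101.04768, "K": 128.09496}

def glycan_residue_mass(composition):
    """Sum of monosaccharide residue masses for a composition like {'HexNAc': 3, 'Hex': 6}."""
    return sum(MONOSACCHARIDE[name] * count for name, count in composition.items())

def residual_peptide_mass(precursor_mz, charge, composition):
    """Neutral precursor mass minus glycan residues minus one water (assumed convention)."""
    neutral = (precursor_mz - PROTON) * charge
    return neutral - glycan_residue_mass(composition) - WATER

def peptide_residue_mass(sequence):
    """Sum of amino acid residue masses, without the terminal water."""
    return sum(AMINO_ACID[aa] for aa in sequence)

observed = residual_peptide_mass(767.33167, 3, {"HexNAc": 3, "Hex": 6})
expected = peptide_residue_mass("SRNLTK")
print(round(observed, 3), round(expected, 3))
```

Both values land within about 0.01 Da of the 699.404 quoted in the caption, which is consistent with the few-ppm precursor mass error listed for this entry in Table 1.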

Figure 3. An example of a glycopeptide reported from the RNase-B data set generated by the HCD PI ETD strategy. (a) The annotated HCD spectrum of precursor ions with m/z 807.672. GlycoMaster DB reported the best matching glycan with the composition (HexNAc)2Hex8. SRNLTK is the only potential glycopeptide with a mass similar to the calculated mass of 699.404. (b) The annotated ETD spectrum triggered by product ions in the HCD spectrum shown in (a). It provides positive support for the identification of the peptide SRNLTK and the glycosylation site.

Figure 4. An example of a glycopeptide identified from the Human-IgG data set generated by the HCD PI ETD strategy. (a) The annotated HCD spectrum of precursor ions with m/z 1028.79. The best matching glycan reported by GlycoMaster DB has the composition (HexNAc)4Hex3Fuc1. (b) The annotated ETD spectrum triggered by product ions in the HCD spectrum shown in (a). It provides positive support for the identification of the peptide TKPREEQFNSTFR and the glycosylation site.

Figure 5. An example of glycans identified from three HCD spectra, shown in panels (a), (b), and (c), by GlycoMaster DB in the Enriched-HUP data set. The three HCD spectra have similar retention times but different precursor mass values. GlycoMaster DB identified three glycans. The calculated peptide mass is approximately 771.41. NWTITR is the only tryptic glycopeptide matching this mass value from the protein short list provided to GlycoMaster DB. The mass errors of the identifications of these three spectra are -1.27 ppm, -1.29 ppm, and -1.99 ppm, respectively.

Figure 6. Illustration of two HCD mass spectra that are interpreted as the same peptide but two slightly different glycans in the Enriched-HUP data set. (a) The oxonium ions from sialic acids are not present, which indicates the absence of sialic acids in the glycan. (b) The peaks at m/z 292.10 and 274.09 indicate the existence of B-ions of sialic acid residues.
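The check described in the Figure 6 caption, whether the sialic acid peaks at m/z 292.10 and 274.09 are present, can be sketched as a simple tolerance lookup. The tolerance of 0.02, the function names, and the example peak lists are assumptions made for this illustration.

```python
# Diagnostic m/z values for sialic acid quoted in the Figure 6 caption.
SIALIC_ACID_IONS = (292.10, 274.09)

def has_peak(peaks_mz, target, tol=0.02):
    """True if any observed peak lies within tol of the target m/z."""
    return any(abs(mz - target) <= tol for mz in peaks_mz)

def sialic_acid_evidence(peaks_mz, tol=0.02):
    """Return the diagnostic sialic acid m/z values that are found in the peak list."""
    return [target for target in SIALIC_ACID_IONS if has_peak(peaks_mz, target, tol)]

# Hypothetical peak lists (illustrative values only).
with_sialic = [204.087, 274.092, 292.103, 366.140]
without_sialic = [204.087, 366.140, 528.192]

print(sialic_acid_evidence(with_sialic))
print(sialic_acid_evidence(without_sialic))
```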

Figure 7. Examples of glycans identified from three HCD spectra, shown in panels (a), (b), and (c), by GlycoMaster DB in the HUP data set. The three HCD spectra have similar retention times but different precursor mass values. GlycoMaster DB identified three glycans from them, and these glycans differ slightly from each other. The calculated peptide mass is approximately 1449.74. VYKPSAGNNSLYR is one of the two peptides matching this mass value in the proteins provided to GlycoMaster DB, but the other one has a potassium adduct and a much larger mass error of around 10 ppm. Therefore, VYKPSAGNNSLYR is selected as the glycopeptide, and the precursor mass errors are 0.15 ppm, 0.07 ppm, and -0.15 ppm, respectively.
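The choice between the two candidate peptides in Figure 7 rests on comparing mass errors in ppm. A minimal sketch of that comparison follows; the sign convention (observed minus theoretical), the function names, and the candidate masses are assumptions chosen for illustration and are not values from the paper.

```python
def ppm_error(observed, theoretical):
    """Relative mass error in parts per million (observed minus theoretical)."""
    return (observed - theoretical) / theoretical * 1e6

def best_candidate(observed_mass, candidates):
    """Pick the candidate with the smallest absolute ppm error.

    candidates maps a label to a theoretical mass; returns (label, ppm error).
    """
    scored = ((label, ppm_error(observed_mass, mass)) for label, mass in candidates.items())
    return min(scored, key=lambda item: abs(item[1]))

# Hypothetical candidates: one close match and one roughly 10 ppm away.
candidates = {"candidate_A": 1449.7365, "candidate_B": 1449.7510}
label, error = best_candidate(1449.7367, candidates)
print(label, round(error, 2))
```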

Figure 8. Venn diagram showing the overlap between the two sets of glycopeptide groups identified from the Enriched-HUP and HUP data sets, respectively.

Figure 9. The average percentage of tryptic peptides containing the N-linked glycopeptide sequon that have a unique mass.
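The quantity plotted in Figure 9 can be illustrated with a small in-silico digest. The exact procedure behind the figure is not described in this part of the text, so the sketch below makes several assumptions: trypsin cleaves after K or R unless the next residue is P, no missed cleavages are allowed, the sequon is N-X-S/T with X not P, and two masses count as the same if they differ by no more than 10 ppm. The protein sequences are hypothetical.

```python
import re

WATER = 18.010565
# Monoisotopic amino acid residue masses (Da).
AA_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
SEQUON = re.compile(r"N[^P][ST]")

def tryptic_peptides(protein):
    """Cleave after K or R unless followed by P; no missed cleavages."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic peptide mass."""
    return sum(AA_MASS[aa] for aa in peptide) + WATER

def percent_unique_mass(proteins, tol_ppm=10.0):
    """Percentage of distinct sequon-containing tryptic peptides with no mass neighbour within tol_ppm."""
    peptides = {p for prot in proteins for p in tryptic_peptides(prot) if SEQUON.search(p)}
    masses = sorted(peptide_mass(p) for p in peptides)
    if not masses:
        return 0.0
    unique = 0
    for i, mass in enumerate(masses):
        tol = mass * tol_ppm * 1e-6
        left_ok = i == 0 or mass - masses[i - 1] > tol
        right_ok = i == len(masses) - 1 or masses[i + 1] - mass > tol
        if left_ok and right_ok:
            unique += 1
    return 100.0 * unique / len(masses)

# Hypothetical proteins: NLTK and NITK are isobaric, AANGSK is not.
proteins = ["SRNLTKAANGSK", "GGKNITK"]
print(percent_unique_mass(proteins))
```

With a larger protein list, mass collisions become more frequent, so the percentage would be expected to drop as the list grows; the sketch only shows how such a percentage could be computed.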

Graphical TOC Entry
