Article pubs.acs.org/ac
Computational Framework for Identification of Intact Glycopeptides in Complex Samples Anoop Mayampurath,† Chuan-Yih Yu,† Ehwang Song,‡ Jagadheshwar Balan,† Yehia Mechref,*,‡ and Haixu Tang*,† †
School of Informatics & Computing, Indiana University, Bloomington, Indiana 47408, United States Department of Chemistry and Biochemistry, Texas Tech University, Lubbock, Texas 79409, United States
‡
S Supporting Information *
ABSTRACT: Glycosylation is an important protein modification that involves enzymatic attachment of sugars to amino acid residues. Understanding the structure of these sugars and the effects of glycosylation are vital for developing indicators of disease development and progression. Although computational methods based on mass spectrometric data have proven to be effective in monitoring changes in the glycome, developing such methods for the glycoproteome are challenging, largely due to the inherent complexity in simultaneously studying glycan structures with their corresponding glycosylation sites. This paper introduces a computational framework for identifying intact N-linked glycopeptides, i.e. glycopeptides with N-linked glycans attached to their glycosylation sites, in complex proteome samples. Scoring algorithms are presented for tandem mass spectra of glycopeptides resulting from collision-induced dissociation (CID), higher-energy C-trap dissociation (HCD), and electron transfer dissociation (ETD) fragmentation modes. An empirical false-discovery rate estimation method, based on a target-decoy search approach, is derived for assigning confidence. The power of our method is further enhanced when multiple data sets are pooled together to increase identification confidence. Using this framework, 103 highly confident N-linked glycopeptides from 53 sites across 33 glycoproteins were identified in complex human serum proteome samples using conventional proteomic platforms with standard depletion of the 7-most abundant proteins. These results indicate that our method is ready to be used for characterizing sitespecific protein glycosylation in complex samples.
G
physiological and pathological responses, it is unsurprising that many glycosylation events have been connected to human diseases. Glycan recognition is fundamental to host-microbe interaction5 of which the infection of influenza viruses is the most studied.1 Glycosylation has also been associated with various cancers where variations in glycan and glycoprotein abundance have been observed in cancer patients in comparison with healthy individuals.6−9 This association warrants the study of glycosylation to develop potential disease biomarkers, especially from human serum samples.10−12 Within glycobiology, the use of mass spectrometry (MS) has been gaining popularity as a means of analyzing both glycans and glycoproteins. Glycans and their peptides can be studied as individual units. Alternatively, they can also be considered as a single entity thereby providing a holistic view of protein glycosylation. Using multiple fragmentation techniques such as collision-induced dissociation (CID), higher-energy C-trap dissociation (HCD) and electron transfer dissociation (ETD)
lycans are involved in a variety of biological processes such as intercellular communication and immune response to pathogen infections.1,2 As a common protein post-translational modification, glycosylation occurs when a glycan is linked to specific amino acid residues within a protein. N-Linked glycosylation (or N-glycosylation) usually involves the attachment of a sugar chain to any Asn in the common AsnXaa-Ser/Thr and the atypical Asn-Xaa-Cys motifs (where Xaa is any amino acid except proline). The attachment of similar oligosaccharides on Gln or nonconsensus Asn has also been recently reported in human recombinant antibodies.3 The modification occurs prior to protein folding, implying that Nglycosylation affects the tertiary structure and stability of a glycoprotein. O-Linked glycosylation (or O-glycosylation) commonly involves the attachment of a glycan to a Ser/Thr residue and occurs mainly after protein-folding. Glycans present enormous structural diversity through the presence of branching structures, stereomeric configurations and flexible glycosidic bonds. A majority of human proteins have been reported to be glycosylated,4 making glycosylation the most prevalent and heterogeneous post-translational modification of human proteins. Given their role in © 2013 American Chemical Society
Received: July 26, 2013 Accepted: November 26, 2013 Published: November 26, 2013 453
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
glycoproteins and glycopeptides. We show that a significant number of N-linked glycopeptides are identified in human serum samples at high confidence, implying the applicability of our framework in studying complex glycoproteomes. Our framework is implemented in a software tool called GlycoFragwork. The tool provided a platform for simultaneous scoring of fragmentation spectra of glycopeptides using different methods. GlycoFragwork analyzes multiple prealigned LC-MS/MS data sets, and reports a list (termed as a glycomap) of identified intact glycopeptides with their mass, elution time and abundance. Out of all three fragmentation modes, the availability of CID is the only mandatory requirement, but the availability of HCD and/or ETD will lead to better characterization of glycopeptides. The GlycoFragwork software is available for free download at http://darwin.informatics. indiana.edu/col/GlycoFragwork/. For details on system requirements and data set formats, see Supporting Information.
on enzymatically (e.g., trypsin) digested glycopeptides, information on both the glycosylation site and the glycan can be gleaned at the same time. Details on different fragmentation techniques for both N-linked and O-linked glycosylation are available in the literature.13−21 This paper focuses primarily on developing algorithms to identify and analyze tryptically digested N-linked glycosylations. All references to glycans and glycopeptides henceforth are of N-linked type. CID of N-glycopeptides using IT-MS/MS (ion-trap MS/ MS) is dominated by B and Y ions22 resulted from glycosidic fragmentation of the sugar chain, where the B ions correspond to fragmentation from the nonreducing end oxonium ions and the Y ions originate from the reducing-end fragmentation. There may also be b and y ions from the peptide fragmentation, but these are usually of minimal abundance since glycosidic bonds are weaker than peptide bonds. Hitherto developed bioinformatic tools utilize CID fragmentation information along with precursor ion mass to reduce spurious hits. StrOligo,23 OSCAR,24 Peptoonist,25 GlycoWorkBench,26 GlycoX,27 and GlypID28 are a few examples. A recent web-based tool, GlycoPepGrader,29 uses a set of devised fragmentation rules to assign glycopeptide composition to MS/MS data by looking for characteristic patterns. In HCD, oxonium ions of monosaccharide, disaccharide or even trisaccharide originating from the glycan strcutures are observed with high mass accuracy. These oxonium ions are commonly used to detect glycopeptides. 30 In earlier work, we have reported a corresponding scoring scheme for HCD.31 Finally, ETD of glycopeptides involves the fragmentation of the peptide backbone producing primarily c and z product ions with intact glycan attached to the residue although exceptions have been observed.32 These properties are commonly utilized by peptide search engines that improved identification by combining ETD/CID spectra of the same precursor ion.33−36 A recently introduced scoring scheme37−39 for ETD used various ad hoc preprocessing to boost signals in regions of the spectrum depending on the location of the glycan in the glycopeptide. The efficacy of this, however, is not well-understood and remains to be explored. In this paper, we present an informatics framework to characterize glycopeptides in biological proteome samples using the scoring of complementary fragmentation techniques, including CID, HCD, and ETD. In addition to filtering glycopeptide ions based on HCD scoring, the framework utilizes scoring schemes of ETD spectrum (to determine the peptide sequence) and a glycan sequencing algorithm from CID spectrum (to characterize the glycan). The term “glycan sequence” is used here to indicate a cartoon-graph representation of glycans that contain monosaccharide composition and topology. The algorithm of constructing the topological arrangement of monosaccharides from CID spectra is referred to as “glycan sequencing”. Combining multiple modes of fragmentation allows the accurate characterization of both the glycan and the peptide part of an intact glycopeptide. A novel target-decoy search approach that integrates these complementary scores is introduced to estimate false discovery rate (FDR) of glycopeptide identification. The performance of the framework is demonstrated using simple model glycoproteins, standard glycoprotein mixtures and complex human serum samples. Notably, these samples were analyzed using LC-MS/MS protocols commonly used in proteomics (including the depletion of the 7-most abundant proteins) without specifically designed procedures for the enrichment of
■
EXPERIMENTAL PROCEDURES Data Sets and Databases. Three separate analyses were conducted, namely, a simple glycoprotein (Fetuin) study, a mixture of standard glycoproteins (5SG) study and a complex serum (Serum) study. All experimental details pertaining to sample preparation, tryptic digestion and LC-MS/MS are given in Supporting Information. A summary of the data sets and databases used in this paper is given in Table 1. The Fetuin Table 1. Data Sets (a) and Databases (b) Used in This Papera (a) analysis name
no. HCD/ CID data sets
no. HCD/ CID/ETD data sets
total
comments
Fetuin 5SG
2 0
1 4
3 4
Serum
10
2
12
fetuin sample on 1 h LC 3 data sets on 1 h; one was on 5 h LC column all 5 h, two sample types, cancer and control
(b) database name
no. glycoproteins
Fetuin_dB 5SG_dB Test_dB SerumMascot_dB SerumCombined_dB
2 5 71 105 566
a
Fetuin glycomaps were created using the fetuin databases both the Fetion_dB and Test_dB. 5SG glycomaps were created using the 5SG data sets and 5SG_dB and Test_dB. Serum glycomaps were created using the SerumMascot_dB and SerumCombined_dB databases.
study involved three data sets, one of which had all three modes of fragmentation while the remaining 2 had only HCD/CID. All four data sets were HCD/CID/ETD in the 5SG study which involved the analysis of samples containing 5 model glycoproteins, namely, bovine fetuin, human α1-acid glycoprotein (AGP), bovine pancreatic ribonuclease B (RNase B), porcine thyroglobulin (PTG), and human fibronectin, as described in supplementary methods. One of these 5SG data sets was subjected to a separate 5-h LC analysis as opposed to the rest that were subjected to a 1-h LC analysis. The Serum study involved the pooling of serum samples into two groups (esophageal cancer and control) of 6 each. One in each group 454
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
Figure 1. ETD fragmentation of example glycopeptide from fetuin sample. The original spectrum (shown at the bottom) is zoomed (shown at the top) to illustrate the region between 0 and parent m/z (1470.64). Peaks corresponding to c and z fragmentation can be observed although at very low intensities.
of a sialic acid termination. To score an ETD fragmentation spectrum, a set of candidate (i.e., target) glycopeptides must be constructed. For this purpose, a library of intact N-linked glycopeptides is built by attaching each N-glycan from our constructed glycan list to peptides containing sequons obtained from in silico tryptic digest (allowing up to two miss-cleavages) of all glycoproteins in the target database (in fasta format). To estimate FDR of glycopeptide identification, a library of decoy glycopeptides is also constructed by attaching each of the above N-glycans to the reversed sequence of each tryptic peptide. The position of the sequon is kept intact while reversing the original peptide sequence. The objective here is to emulate occurrences of naturally occurring glycopeptides. Keeping the sequon intact in the decoy sequence will construct a similar peptide sequence as the target glycopeptide sequence, thereby assuring a better estimate of FDR. If the peptides were simply reversed, the FDR estimates are likely to be underestimated because the sequon property is not captured in the random glycopeptide space. The target and decoy glycopeptides are combined to form a database against which the ETD fragmentation spectra are searched. This was done for each of the databases listed in Table 1b. As shown in Figure 1, ETD fragmentation of a glycopeptide results in low abundant fragments that are mostly observed at
was a HCD/CID/ETD data set whereas the remaining 5 were just HCD/CID. Databases were built according to individual study type. The Fetuin study used a database (Fetuin_dB) constructed from a fetuin A and B glycoproteins. The 5SG study used a database (5SG_dB) constructed from a fasta file containing the 5 model glycoproteins. For the Serum study, the CID fragmentation spectra of each of the 10 HCD/CID data sets were searched using MASCOT40 against the IPI database (version 3.79). Only identified proteins containing the glycosylation sequon (NXS/ T, X!=P) were retained as putative glycoproteins. A list of 116 glycoproteins was obtained which was trimmed down to a list of 105 unique and reviewed UniProtKB/SwissProt IDs in the UniProtKB database.41,42 The list is henceforth referred to as SerumMascot_dB. To test the robustness of all identifications, we also built larger databases, namely, the Test_dB (with 71 glycoproteins) and the SerumCombined_dB (with 566 glycoproteins, see supplementary methods). A list of 319 glycans was also compiled to a glycan database (Supporting Information). Scoring CID/HCD/ETD Spectra of Glycopeptides. We use the same p-value-based HCD scoring along with accurate deisotoping as described in our previous work31 to determine the presence of a glycopeptide ion and also to test the presence 455
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
Figure 2. Illustration of glycan sequencing algorithm. Starting from the y1 ion the core is first completed. Then, using N-glycan synthesis rules, a set of candidates is built at each round of matching. At termination (marked with *), the m/z difference between the sequenced glycopeptide and the precursor m/z is used to derive the remaining sequence parts. In this case, this was unsuccessful, so the charge state was incremented and the remaining glycan sequence components could be derived (marked with ** and ***).
m/z values below that of the parent ion. After testing a variety of preprocessing options43 on manually validated spectra, a dual strategy was adopted here on all peaks within the specified range. In the first pass, the top peak in a 5 m/z bin is selected, thus ensuring the use of a local noise threshold. Next, the top 20 overall observed peaks, excluding the precursor peak, are chosen for peak matching to theoretical fragments of glycopeptides. For each ETD spectrum, a list of target and decoy glycopeptides within 10 ppm of the precursor mass is considered from the database. For each candidate, a theoretical fragmentation spectrum is constructed using the peptide backbone with the attached glycan as a fixed Asn (N) modification. Carbamidomethylation of cysteine residues was considered as a fixed modification for each glycopeptide. Theoretical ion intensities were assigned as follows: c and z ions were considered with an intensity of 100, b and y ions were considered with an intensity of 50, and b and y ions from neutral loss were considered with an intensity of 25 each. These values were empirically assigned after manual validation of several annotated experimental spectra. While matching, both theoretical and observed spectra are binned into 1 Thompson bins. A bin is considered to be a “match” if both the corresponding theoretical and observed values are matched. Consider an observed spectrum of bin size n out of which m bins are found to be matched against a theoretical spectrum. The ETD score can then be calculated as
m
m
ETD_score =
∑ Ij ,theoretical + j=1
∑ j = 1 Ij ,observed m
∑i = 1 Ii ,observed
Essentially, the ETD score is the summation of all matched theoretical intensities with a decimal component as the percentage of matched observed intensity. Glycopeptide candidates with two or more sequons were considered separately for each sequon position. The match with the highest ETD score is retained as the best match with a TRUE or FALSE designation depending on whether the highest scoring match corresponds to a target or decoy glycopeptide, respectively. Glycan Sequencing Using CID Spectra. We implemented a de novo sequencing algorithm using the CID spectrum for automated annotation of the glycan component of an N-linked glycopeptide. In simple glycoprotein samples, deciphering the glycan topology is relatively straightforward. However, for complex samples, this becomes tedious and thus automated glycan sequencing with the derived score presented here becomes very useful. Additionally, the scores from the sequencing algorithm are combined with ETD scores to offer a single score that characterizes the whole glycopeptide (see below). Complex glycoproteome samples have a large number of glycopeptide candidates; thus combining ETD and CID will help characterize the confidence of the best hit. The heuristic glycan sequencing algorithm only requires the parent mass (or peptide sequence) and the parent charge state as input. Further 456
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
Figure 3. Pipeline on which GlycoFragwork is based. Data sets from LC-MS/MS platforms are aligned using tools such as MultiAlign. Fragmentation information of every common ion or LC-MS cluster (each blue circle) is collected. HCD scoring indicates the presence of a glycopeptide ion. A set of glycopeptide candidates is built that match the precursor mass based on mass. The ETD spectrum is scored against each of these glycopeptide candidates (shown in blue cylinders). Each fragmentation event is compiled as a glycopeptide-spectra match (GSM, shown in blue boxes) that consists of the highest scored glycopeptide along with individual HCD, CID and ETD scores. The top ranked candidate across all GSMs (shown in red box) is selected as the representative GSM (rep-GSM) and is given to sequencing the corresponding CID spectra. Finally, the ETD scores and the glycan sequencing scores are retained in the rep-GSM for FDR estimation.
made allowances for a fucosylated core GlcNAc, thereby expanding the search pool of glycans. The algorithm proceeds to grow the glycan outside the core in a nonlinear fashion based upon the peaks observed in the CID spectrum. Peak matching is done with an empirically chosen tolerance of 0.8 m/z. To limit the size of the candidate search space at every step, we used a set of glycan synthesis rules emulating the N-glycan synthesis process to guide the sequencing procedure. For example, fucose can only be attached to a GlcNAc and not a Man or a Gal. The complete set of rules is listed in Supporting Information Table S-1a and b. Given a glycan sequence from the candidate pool, the algorithm will add each putative monosaccharide to this
details with an example are provided in supplementary methods. Within GlycoFragwork, candidate glycopeptides (made up of peptide sequence and glycan compositions) are chosen based on ETD scoring. Peptide masses are used to calculate the mass of Y1 ion, which along with the glycan composition is given as input to the sequencing algorithm. The algorithm starts from the position of the Y1 ion derived from the peptide mass plus one GlcNAc mass and incorporating an appropriate charge, which usually carries one less charge than the precursor ion charge. Since all N-linked glycans have the same pentasaccharide core (GlcNAc2Man3), the algorithm predicts the next m/z values continuously until the core is completed and accounted for in the spectrum. We 457
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
sequence, calculate the total glycan m/z and match it to observed peaks in the CID spectrum. If there is a match, the new monosaccharide is attached to the partial glycan sequence based on the glycotransferase rules and the updated sequence is inserted into the candidate pool for the next round. This attaching and matching procedure begins with the core and continues until the maximum observed m/z is reached in the CID spectrum. Consider the example illustrated in Figure 2. Starting with the Y1 ion at 972.72 m/z, the core was grown to completion at 1316.43 m/z. Then, each monosaccharide was continuously added and peak matched until the m/z limit reached beyond the observed m/z range in the CID spectrum (as illustrated by the glycan marked with the * at 1865.80 m/z). Usually, at this point, the parent m/z is utilized to select the monosaccharides that filled the missing segments. If successful, the algorithm is terminated and the completed sequence or the closest glycopeptide candidate to the precursor mass within a user-defined tolerance (80 ppm by default determined empirically) is returned. Most CID spectra, however, have one or more missing peaks that hamper the sequencing. While the algorithm has the functionality to add a di and trisaccharide to a partial glycan sequence (thereby accounting for up to 2 missing fragments), the algorithm still sometimes returns incomplete sequences. We observed that this happened primarily because of charge state discrepancy between fragment ions. While large components of the glycopeptides fragments carried a charge that was one less than the precursor ion, sometimes fragment ions were observed to carry the same charge as the precursor ion. In order to account for this, the algorithm increments the charge state by 1 and repeats the sequencing procedure starting from the beginning of CID spectrum. This procedure is illustrated in Figure 2. The first pass of the glycan sequencing algorithm was done with charge state +2 and terminated at 1865.80 m/z (marked with *). On incrementing the charge state to +3, we immediately see matching candidate, marked with ** (the peak at 1340.91 m/z) and *** (the peak at 1438.29 m/z). The terminal NeuAc monosaccharide was added based the m/z difference between the sequenced glycopeptide and the parent as mentioned above. The final output of the sequencing algorithm is the complete N-glycan sequence or the top-ranked partial N-glycan sequence encoded as an IUPAC string along with a matching score calculated as the log-2 of the summed intensities of matched peaks within the CID spectrum. The top ranking glycan sequence and corresponding score are reported and in tandem with the ETD score, is used for estimating the FDR of the glycopeptide assignment (see below). The algorithm is currently being implemented in C# as a stand-alone tool that will be released in a future publication. Here in GlycoFragwork, the sequencing algorithm was incorporated as a dynamic linked library (.dll) for N-linked glycopeptide analysis Analyzing Features Across LC-MS/MS data Sets. Only recently have instrument techniques been developed which are capable of multiple sources of fragmentation in a single run.19 More often, runs have individual or paired (CID-HCD or CIDETD) fragmentation. In order to combine fragmentation information for a single glycopeptides, individual data sets must be aligned so that MS/MS information associated with different fragmentation types can be pooled together. Additionally, using replicates (or repeated measurements of ion intensity) also facilitates glycopeptide quantification across
different samples. GlycoFragwork implements a functionality to analyze multiple glycoproteomic data sets that are previously aligned from tools such as MultiAlign (http://omics.pnl.gov/ software/MultiAlign.php). The pipeline of the workflow within the software is illustrated in Figure 3. GlycoFragwork utilizes the aligned map generated from MultiAlign, the individual data sets, the target glycoprotein database (as a fasta file) and a list of glycans (either default or user-defined) as input. An aligned map of data sets contains a bunch of LC-MS cluster records (denoted as blue circles in Figure 3) with each one pertaining to a common ion. We only consider those clusters that are indicative of being a glycopeptide of interest. This is done by matching the mass of each LC-MS cluster record using a parametrized tolerance of 10 ppm against a target plus decoy glycopeptide database created from the fasta file and the glycan list. Once a set of candidate glycopeptides is created for a LC-MS record, GlycoFragwork proceeds to gather “fragmentation event” information by pooling together MS/MS spectra. Fragmentation events are compiled based on clustering of CID spectra, following which different scoring schemes are applied for each event (see Supporting Information for details). GlycoFragwork was optimally designed to incorporate all three fragmentation methods, that is, CID, HCD, and ETD for the accurate identification of glycopeptides. These fragmentation methods can be acquired from either a single data set or from different LC-MS/MS runs. For example, a CID-HCD data set can be combined with a CID-ETD data set to acquire data from all three fragmentation channels. GlycoFragwork can also support input sets that contain only CID-HCD in which case, glycopeptide assignments are based on precursor mass. It can also support input sets that contain only CID-ETD with a few Boolean parameter changes (e.g., setting HCD scoring and HCD-based glycopeptide candidate filtering to FALSE). However, since CID-based clustering is a crucial step in generating cluster information, GlycoFragwork currently cannot support an input set of only HCD-ETD data sets, and thus cannot be directly used to analyze the data acquired by using the HCD-product-dependent-ETD mode on Thermo Orbitrap instrument. We will address this limitation in future software release. In this particular scenario at least one CID only, CID-HCD, or CID-ETD data set must be used along with the HCD-ETD data sets. It must be pointed here that for complex samples, such as serum, having all three fragmentation data leads to better identifications through combining the sensitivity of the HCD scoring and accurate assessment of confidence in assignment from CID-ETD (see below). Assignment of Glycopeptides in Maps. In traditional proteomics, the concept of peptide spectrum matches44 (PSMs) is prevalent, wherein a PSM within a data set typically contain a one-to-one mapping between peptide and spectrum and receives a single score. In glycoproteomics, we introduce the notion of glycopeptide-spectra match (GSM) that contains a one-to-many mapping between a glycopeptide candidate and multiple spectra (HCD, CID, and ETD) and thus receives multiple scores (HCD score, CID path score, ETD score, CID glycan sequencing score). While GSM could be technically used to indicate glycopeptide-spectrum match (i.e., a one-to-one mapping), here in this study and in the context of GlycoFragwork, we imply the association between glycopeptide and spectra (and thus should be interpreted as the glycopeptide−spectra match). When we use both target and decoy glycopeptides in the spectrum scoring of a single data set, 458
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
Figure 4. Reassessing FDR using ETD and CID sequencing scores. Above plots are for the serum map analyzed against the SerumMascot_dB database created using the klaR R package. Each point is a rep-GSM with a corresponding ETD (x-axis) and CID sequencing score(y-axis). Both target rep-GSM (denoted by T) and decoy rep-GSM (denoted by F) are plotted. These rep-GSMs are clearly clustered together in different regions of the 2D map. (a) LDA clustering results in generation of projection weights which can be used to mark a decision boundary (b). Instead of clustering however, the weights are used to calculate a new ES score that is used for estimating FDR. The dark dots indicate center of regions, but are unused here.
we obtain a collection of GSMs resulting from matching to target (or target-GSM) glycopeptides or from matching to decoy (or decoy-GSM) glycopeptide matching. In GlycoFragwork, each fragmentation event is compiled into a GSM with a corresponding HCD score, CID score, the highest ETD scoring glycopeptide, and associated ETD score. The GSM with the highest ETD score is chosen as the representative (or repGSM) that either matches target (target rep-GSM) glycopeptides or decoy (decoy rep-GSM) glycopeptides. Finally, the glycopeptide candidate in each rep-GSM is scored against the corresponding CID spectra through glycan sequencing. The sequencing scores are retained as part of the rep-GSM. Our goal is to select a set of target rep-GSMs that are highly confident, based on the score distribution of decoy rep-GSMs that are known to be false, which follows the concept of the target-decoy search approach for estimating FDR in proteomics.45 Estimating FDRs for GSMs. We implemented a targetdecoy search approach for estimating FDR in glycopeptide identification by combining ETD and CID scores of the rep-
GSM, as reported by GlycoFragwork. We assume that the behavior of glycopeptides mirror that of traditional peptides there are multiple hits per spectrum and that in the absence of the true hit, the false target and decoy are equally likely. In complex samples, with a large glycoprotein database, there are multiple candidates that match a precursor mass even with a tolerance of 10 ppm. Figure 4a illustrates the principle of our approach on a glycomap created using twelve data sets from the Serum analysis and the SerumMascot_dB as the target glycoprotein database. Both target (denoted as T) and decoy (denoted as F) rep-GSMs are plotted based on their ETD (x-axis) and CID sequencing (y-axis) scores. Most of the target rep-GSMs are located in the top right corner with a high ETD score and a high CID sequencing score, where most of the decoy repGSMs are located in the bottom left corner of the plot. Additionally, points on the x axis with midrange ETD score and zero CID sequencing score are rep-GSMs with the wrong glycopeptide match, since the glycan sequencing score was 0 because of the wrong choice of Y1. The clustering behavior of 459
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
Table 2. Total Numbers of Glycoproteins, Sites, and Glycopeptides Identified with FDR < 0.05 for Different Samplesa glycan class distribution for completely sequenced glycans GlycoMap analysis Fetuin (against Fetuin_dB) 5SG (against 5SG_dbB) serum (against SerumMascot_dB)
no. protein ids
no. sites detected
no. glycopeptides
no. glycopeptides with glycan sequences completely sequenced
no. complex
no. high mannose
no. hybrid
2 4 33
5 5 53
22 11 103
8 6 94
8 6 84
0 0 4
0 0 6
a
Assignments were made against corresponding databases. For each analysis, the number of glycopeptides with completely sequenced glycans is noted along with the glycan class distribution among these identifications.
can be searched against a glycomap xml to identify glycoforms (two glycopeptides that contain same peptide but different glycans) within a sample. Additionally, a glycoproteomic data set run can be simply aligned to the glycomap and glycopeptides identified.
rep-GSMs in Figure 4a can be exploited by using Linear Discriminant Analysis. Using the LDA functions in the MASS package within the statistical platform R, clustering of target and decoy rep-GSMs was done using the CID sequencing score and the ETD score as the two dimensions. The model returns projection weights for the dimensions that will ensure maximum separation of classes. A new score, called an ES (ETD-Sequencing) score, was computed for every rep-GSM using these weights as coefficients. FDR measures for glycomaps were computed for each rep-GSM (one for each LC-MS record) using the ES score. This is illustrated in Figure 4b which is a 2-D plot for nonzero ETD and CID sequencing scores from Figure 4a with the LDA decision region marked. The choice of incorporating a linear separation (as opposed to quadratic or planar) between target and decoy classes is motivated by the desire to avoid overfitting. The discriminant line created using the projection weights, marked by the decision boundary is not used for prediction or reclassification purposes. Rather, it is used for the recalculation of FDR. Starting from the origin, a decision list is calculated and moved across different values of ETD and CID scores. At each stage, the numbers of False and True hits above the line (in effect, above a particular ES score) were noted to receive an FDR estimation, similar to target-decoy search in proteomics.45 The LDA analysis is dependent on the complexity of the data sets and databases involved, and hence is glycomap specific. The rep-GSMs with FDR < 0.05 were deemed to be confidently identified (i.e., verified) and assigned glycopeptides. The repGSMs with ≥0.05 FDR were deemed “tentative” The fragmentation event with the highest ETD score was chosen to be the rep-GSM fragmentation event for both these labels. Those rep-GSMs without any ETD spectra were labeled as “unverified” and were not involved in FDR estimation. The GSM with the lowest HCD score (that are p-value based) was chosen as the rep-GSM for these records. Glycopeptide assignments for unverified rep-GSMs were made based on mass and glycan type. The ES score will be implemented as part of future releases of GlycoFragwork; the scripts used to perform the LDA on a .csv glycomap format are attached to the installer package. Final Output of GlycoFragwork. The collective information obtained for each LC-MS rep-GSM is printed into an output glycopeptide map called as a glycomap. The map can be either in a CSV (comma separated value) format in which case only the mass, NET, abundance in data sets and scores are printed. Alternatively, the map can also be output in XML format (Supporting Information Figure S-4) that contains additional information such as data set specific information and representative spectra in a base 64 format. This output format opens up other applications that are currently being researched. For example, an experimental CID spectrum of a glycopeptide
■
RESULTS Prior to analyzing the data sets in Table 1, we confirmed the efficacy of ES_Scoring as compared to simple ETD scoring on both an individual data set and on a map of aligned data sets (see Supporting Information). This illustrates that the use of additional information from complementary methods of fragmentation can be used to increase both the number and the accuracy of glycopeptide identifications. We then built several glycomaps by analyzing data sets from samples of low to high complexity using GlycoFragwork and the ES scoring model with glycoprotein databases as target databases. The summary of all identifications in these glycomaps is given in Table 2, which lists the total numbers of glycoproteins, sites and glycopeptides with an estimated FDR < 0.05. Also noted are the number of glycopeptides for which a complete sequence was derived, and the glycan type distribution for these glycopeptide identifications. Each glycomap analysis is explored individually in subsequent sections. We also report identification numbers for each analysis using a larger size database as target database. Our goal was to evaluate if the GlycoFragwork framework gives robust glycopeptide identification when using a larger target database (containing many more proteins that are not present in the sample) in comparison to using an appropriate database. In this paper, we indicate robustness as the consistency of the glycopeptides identifications regardless of database size. Simple Glycoprotein Study. We built fetuin glycomaps from the fetuin LC-MS/MS data sets using either Fetuin_dB or Test_dB as target databases. The total numbers of verified, tentative and unverified rep-GSMs are reported in Table 3. As expected, the number of verified rep-GSMs drops on using the larger database, but on comparing the identifications within these rep-GSMs, we still see that the most highly scoring proteins are fetuin in both cases. When Test_dB was used, one glycopeptide of complex N-glycan type is identified in the human Complement 5 protein with high ETD and CID glycan sequencing scores, which may correspond to a contaminant glycoprotein in the sample. Nevertheless, 10 out 11 glycopeptides identified from the Test_dB analysis were attributed to fetuin and matched the top 10 out of 22 glycopeptides identified from matching using the Fetuin_dB database. Out of these 22 glycopeptides, 8 were completely sequenced (see Supporting Information Table S-4) and were all sialylated complex structures (Table 2), which is consistent 460
dx.doi.org/10.1021/ac402338u | Anal. Chem. 2014, 86, 453−463
Analytical Chemistry
Article
However, out of 11 glycopeptides identified using the 5SG_dB as the target database, 7 were found using the Test_dB without any additional glycopeptides being identified (Supporting Information Figure S-5a). This again indicates the robustness of the ES score model. The complete list of glycopeptides is given in Supporting Information Table S-5 and summarized in Table 2. Serum Glycoproteins Study. The results of building a serum map are shown in Table 5. We used the default N-glycan
Table 3. Fetuin Glycomap Assignment Comparisons Using Fetuin_dB and Test_dB Databasesa Fetuin_Glycomap all rep-GSMs
IDs from verified rep-GSMs
Fetuin_dB (2 glycoproteins)
Test_dB (71 glycoproteins)
no. verified rep-GSMs no. tentative rep-GSMs no. unverified rep-GSMs no. proteins
23
12
1
78
10
30
2
no. sites
5
no. glycopeptides
22
3 (2 fetuin +1 CO5_HUMAN) 5 (4 fetuin +1 CO5_HUMAN) 11 (10 fetuin +1 CO5_HUMAN)
Table 5. Serum Glycomaps Assignment Comparison Using the SerumMascot_dB (105 Glycoproteins) and the Much Larger SerumCombined_dBa
a
Verified rep-GSMs have an FDR < 0.05. Both fetuin proteins and associated 5 sites are found among the verified rep-GSMs in the Fetuin_dB analysis. While analyzing using Test_dB as target database, a contaminant was found from CO5_HUMAN (complement 5 protein). The 10 fetuin glycopeptides found using the Test database was among the 22 glycopeptides found using the Fetuin_dB database indicating robustness even in context of a larger database. Details on the list of glycopeptides for the Fetuin_dB study are given in Supporting Information Table S-4.
Serum_Glycomap all rep-GSMs
IDs from verified repGSMs
with previous studies.46 By “completely sequenced”, we mean that all the monosaccharides in the glycan composition were accounted for in the topology derived from the CID spectrum. All 8 completely sequenced glycans matched observed structures in GlycosuiteDB47 and the CFG database.48 Mixture Glycoproteins Study. When the standard mixture glycoprotein data sets were analyzed using the 5SG_dB and the Test_dB databases as target databases, GlycoFragwork identified 4 out of 5 glycoproteins in sample with high confidence (FDR < 0.05). RNAS1_BOVIN was missed in both cases (Table 4). Out of a total of 11 glycopeptides, 6 were completely sequenced and matched with observed structures in external databases (Supporting Information Table S-5), and all of them were found to be of complex type with sialic-acid terminations. The number of verified rep-GSMs dropped on using the larger database.
all rep-GSMs
IDs from verified rep-GSMs
no. verified repGSMs no. tentative rep-GSMs no. unverified rep-GSMs no. proteins no. sites no. glycopeptides
5SG_dB (5 glycoproteins)
8
205
429
41
53
4 5 11
4 4 7
119
102
558
771
111
133
33 53 103
36 50 89
Verified rep-GSMs have a FDR < 0.05. The overlap in identifications in both these analyses was 68 (Supporting Information Figure S-5b) indicating robustness. A complete list of identifications from SerumMascot_dB is given in Supporting Information Table S-6.
list that was described earlier, but several other glycan lists are available for serum analysis.49 When the SerumMascot_dB was used as target database, 103 glycopeptides across 53 Nglycosylation sites in 33 glycoproteins were identified at FDR < 0.05. The detailed list of these 103 identifications is given in Supporting Information Table S-6. All but one reported sites in the glycoproteins identified were annotated in UniprotKB. The glycoprotein HPTR_HUMAN (Haptoglobin-related protein) with a reported glycosylation at site N-126 was not annotated in current UniprotKB. Since, HPTR_HUMAN is rarely observed and the identifications were only found in the two data sets containing ETD, it is likely that this corresponds to a different intact glycopeptide possibly associated with the N-184 from standard Haptoglobin. 94 out of 103 glycopeptides were completely sequenced, and 84 of these (i.e., about 87%) were found to be of complex-type. Four high-mannose glycans were identified to be from CO3_HUMAN (complement C-3) and CO7_HUMAN (complement C-7) glycoproteins. Out of the 6 identified glycopeptides with hybrid-type glycans, two were from attachments to N-869 in A2MG_HUMAN (alpha-2macroglobulin), two were attached to N-169 in VTNC_HUMAN (vitronectin), and the remaining two were attached to N85 in CO3_HUMAN (complement C-3) and to N-271 in A1AT_HUMAN (alpha-1-antitrysin). Out of the 84 complextype glycopeptides, 4 contained nonsialylated (or a-sialylated) glycans, 26 contained monosialylated glycans, 46 contained disialylated glycans and 8 contained trisialylated glycans. When the SerumCombined_dB was used as target database, 34 glycoproteins were found with 50 sites and 89 glycopeptides at FDR < 0.05. We observed an overlap of 68 glycopeptides between these two analyses (Supporting Information Figure S5b).
Test_dB (71 glycoproteins)
12
SerumCombined_dB (566 glycoproteins)
a
Table 4. 5SG_glycomap Assignment Comparisons Using Both 5SG_dB and the Test_dBa 5SG_Glycomap
no. verified rep-GSMs no. tentative rep-GSMs no. unverified rep-GSMs no. proteins no. sites no. glycopeptides
SerumMascot_dB (105 glycoproteins)
a
Verified rep-GSMs have FDR