MS de Novo

Mar 3, 2005 - Keywords: proteomics • mass spectrometry • protein identification • bioinformatics • de novo sequencing • mass-based alignment...
Identification of Protein Modifications Using MS/MS de Novo Sequencing and the OpenSea Alignment Algorithm

Brian C. Searle,† Surendra Dasari,† Phillip A. Wilmarth,‡ Mark Turner,† Ashok P. Reddy,† Larry L. David,‡ and Srinivasa R. Nagalla*,†

Department of Pediatrics and School of Dentistry, Oregon Health & Sciences University, 3181 SW Sam Jackson Park Road, Portland, Oregon 97239-3098

Received November 24, 2004

Algorithms that can robustly identify post-translational protein modifications from mass spectrometry data are needed for data mining and furthering biological interpretations. In this study, we determined that a mass-based alignment algorithm (OpenSea) for de novo sequencing results could identify post-translationally modified peptides in a high-throughput environment. A complex digest of proteins from human cataractous lens, a tissue containing a high abundance of modified proteins, was analyzed using two-dimensional liquid chromatography, and data were collected on both high and low mass accuracy instruments. The data were analyzed using automated de novo sequencing followed by OpenSea mass-based sequence alignment. A total of 80 modifications were detected, 36 of which were previously unreported in the lens. This demonstrates the potential to identify large numbers of known and previously unknown protein modifications in a given tissue using automated data processing algorithms such as OpenSea.

Keywords: proteomics • mass spectrometry • protein identification • bioinformatics • de novo sequencing • mass-based alignment • post-translational modification • human lens • crystallin • cataract

Journal of Proteome Research 2005, 4, 546-554. Published on Web 03/03/2005. 10.1021/pr049781j CCC: $30.25 © 2005 American Chemical Society

* To whom correspondence should be addressed. Tel: (503) 494-1928. Fax: (503) 494-4821. E-mail: [email protected]. † Department of Pediatrics. ‡ School of Dentistry.

Introduction

A major challenge in proteomic experiments is the identification of co- and post-translational protein modifications using tandem mass spectrometry (ref 1). Many computer algorithms that were originally designed to identify proteins from amino acid sequence information in tandem mass spectra (refs 2-5) have also been extended to identify protein modifications. For example, the database-searching program SEQUEST (ref 2) identifies proteins by computing the cross correlation (ref 6) between uninterpreted tandem mass spectra and hypothetical spectra generated from peptide sequences in protein databases. The number of peptide sequences that must be examined is limited by selecting only sequences with a calculated mass that falls within a parent ion mass tolerance. To identify modified proteins, SEQUEST effectively searches against an enlarged database of protein sequences that contain all possible combinations of anticipated user-specified modifications (ref 7). This technique has been validated in numerous experimental settings (refs 8-10), but it can be extremely computationally expensive. A common technique to increase search efficiency is to look for modifications only in proteins that have been identified from unmodified peptide matches (refs 11-13).

A second identification technique for protein modifications is to search for peptides without a parent ion mass tolerance (ref 14). In this case, the difference between the parent ion mass of the spectrum and the calculated mass can suggest possible modifications. To narrow down the search space, a short sequence tag can be derived from a spectrum and that tag can be used to screen for peptides in a protein database (refs 15-17). Conversely, short peptide tags in a protein database can be used to generate hypothetical sections of a tandem mass spectrum and the spectra can be matched using auto correlation (ref 18). Unfortunately, these techniques have difficulty locating specific modification sites, and usually only report mass shifts. Additionally, they cannot reliably detect multiple modifications to the same peptide.

Another approach is to use de novo sequencing followed by a mass-based alignment program, such as OpenSea (ref 19). In this technique, complete or partial de novo sequences are computed by identifying short series of fragment ions where the mass differences between peaks appear to match the masses of amino acids (refs 20-24). Mass- or sequence-based (refs 25-28) alignment algorithms attempt to match these sequences to proteins in databases. Modifications and substitutions can be identified from the mass difference between the measured parent ion mass and the calculated peptide mass. The differences between measured and calculated fragment ion masses can then be used to localize the modification or substitution. Another program, Popitam (ref 29), combines de novo sequencing and homology searching by traversing a spectrum graph (ref 25) with protein database sequences rather than with sequences derived de novo.

In the present study, the ability of OpenSea to identify modifications was examined using a complex digest of proteins from a 93-year-old human cataractous lens. This tissue was used because it is a rich source of modified proteins. Unlike proteins in most tissues, lens proteins do not turn over and remain throughout life. Tryptic digests from both the water-soluble and water-insoluble fractions of lens were separated by two-dimensional liquid chromatography and analyzed on both quadrupole TOF and ion trap instruments. The four large datasets were then analyzed using automated de novo sequencing followed by OpenSea mass-based sequence alignment. This strategy was able to identify 44 previously reported sites of co- and post-translational modifications in lens crystallins, and 36 new crystallin modification sites. The results demonstrate the benefits of de novo sequencing and mass-based alignments for large-scale analysis of protein modifications, and show that the method is useful for analyzing data generated on both low and high mass accuracy instruments.
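The two search strategies contrasted in the Introduction, restricting candidates to a parent ion mass tolerance versus reading the parent mass difference as a possible modification, can be illustrated with a short sketch. This is not SEQUEST or OpenSea code. The function names, the use of singly protonated monoisotopic masses, and the example peptides are assumptions made for illustration only.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of the charging proton


def peptide_mh(sequence):
    """Return the singly protonated monoisotopic mass (MH+) of a peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


def candidates_within_tolerance(measured_mh, peptides, tolerance):
    """Keep only peptides whose calculated MH+ lies within the tolerance."""
    return [p for p in peptides if abs(peptide_mh(p) - measured_mh) <= tolerance]


def mass_shifts(measured_mh, peptides):
    """Search without a tolerance: report measured minus calculated mass."""
    return {p: measured_mh - peptide_mh(p) for p in peptides}
```

With a tolerance applied, a peptide carrying an unanticipated +16 Da shift is simply excluded from the candidate list. Without the tolerance, the same peptide is retained and the shift itself becomes the evidence for a modification, although its site is not yet known.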

Experimental

Modification Detection and Localization Using OpenSea. The OpenSea mass-based alignment algorithm (ref 19) was used to identify and assign post-translational modifications. In short, OpenSea identifies peptides by aligning de novo sequences derived by other programs (refs 21, 22) to protein sequences in databases. Candidate database sequences are determined from a two to four amino acid sequence tag search. OpenSea converts all amino acid characters in candidate sequences and de novo sequences into series of masses, and these masses are compared using a dynamic programming approach. Multiple consecutive masses can be aligned in groups to resolve isobaric mistakes in the de novo sequence. Regions of the de novo sequence that cannot be matched to the database sequence are assumed to be either amino acid substitutions or modifications, as described below. Finally, peptide identifications from multiple mass-based alignments are used to assign peptide identifications to proteins.

Accurate detection of post-translational and chemical modifications is based on two new methods. First, an auto-interpretation routine is used after a mass-based alignment to interpret mass ambiguities between the database and de novo sequences automatically as modifications or amino acid substitutions using a lookup table. This table contains a list of known post-translational and chemical modifications in the extensive FindMod (ref 30) database. These modifications are indexed by their mass difference, the amino acids they are found on, and an estimated log odds score for modification, similar to the log odds score for substitution in the Blosum (ref 31) and PAM (ref 32) matrixes. The modification log odds scores are adjustable and were chosen based on frequency of common sample processing artifacts and previously known lens modifications (see Supporting Information). Amino acid substitutions are similarly indexed based on their Blosum-90 log odds substitution score.
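A lookup table of the kind described above, indexed by mass difference and by the residues a modification occurs on, might be sketched as follows. This is an illustration only and not the OpenSea table. The mass shifts are standard monoisotopic values, but the residue sets are abbreviated, the log odds scores are hypothetical placeholders (the actual values are in the Supporting Information), and the names `MOD_TABLE` and `interpret_mass_shift` are invented for this example.

```python
from typing import NamedTuple, Optional


class Modification(NamedTuple):
    name: str
    mass_shift: float      # monoisotopic mass difference in AMU
    residues: frozenset    # residues the modification may occur on (abbreviated)
    log_odds: float        # hypothetical placeholder score


MOD_TABLE = [
    Modification("deamidation", 0.98402, frozenset("NQ"), 1.0),
    Modification("methylation", 14.01565, frozenset("C"), 0.5),
    Modification("oxidation", 15.99491, frozenset("MW"), 1.0),
    Modification("acetylation", 42.01057, frozenset("K"), 0.5),
    Modification("carbamylation", 43.00581, frozenset("K"), 0.5),
    Modification("phosphorylation", 79.96633, frozenset("STY"), 0.5),
]


def interpret_mass_shift(mass_shift, unmatched_residues,
                         tolerance=0.25) -> Optional[Modification]:
    """Return the best-scoring table entry that explains the mass shift.

    An entry qualifies when its mass lies within the tolerance of the
    observed shift and at least one unmatched database residue can carry
    it. None is returned when nothing qualifies, so the shift would be
    reported as an unknown mass.
    """
    hits = [
        mod for mod in MOD_TABLE
        if abs(mod.mass_shift - mass_shift) <= tolerance
        and mod.residues & set(unmatched_residues)
    ]
    return max(hits, key=lambda mod: mod.log_odds) if hits else None
```

For example, `interpret_mass_shift(15.99, "M")` returns the oxidation entry, whereas a +28 AMU shift on serine finds no entry in this abbreviated table and is left uninterpreted.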
It is assumed that consecutive amino acid mass mismatches are the result of a single modification, as multiple modifications or amino acid substitutions on consecutive residues are deemed unlikely. The mass shift of a suspected modification and the consecutive unmatched database amino acids are used as a multidimensional key in the table lookup. If the table can suggest a modification based on the supporting evidence, then that modification is inserted into the alignment. The mass-based alignment is scored as if a new amino acid were identified using the estimated log odds score. Modifications that cannot be interpreted via table lookup are reported as unknown mass shifts and scored according to the original algorithm (ref 19). Figure 1 contains pseudocode outlining the steps in the auto-interpretation routine.

Figure 1. Pseudocode representation of the auto-interpretation routine used in OpenSea. SRS denotes smoothed rank scoring algorithm. Log odds scores for substitutions are discussed in ref 19, and a table of modifications, their respective mass shifts, associated residues, and log odds scores can be found in the Supporting Information.

Second, OpenSea uses a simplified rank-based scoring mechanism (ref 33) to independently validate peptide identifications made by the OpenSea Alignment Score (OSAS) (ref 19). To determine the ranks of the peaks in the tandem mass spectrum, the precursor ion is removed, and each peak in the tandem mass spectrum is normalized by dividing the peak intensity by the mean of all peaks (intensity greater than 0.0) in a ±100 AMU range. The top 50 normalized peaks of the tandem mass spectrum are retained and ranked by their intensities. For each of the top 50 peaks, the intensity is replaced by the rank, where the most intense peak is assigned a rank of 50 and the least intense peak is assigned a rank of 1. A vector X corresponding to integer m/z values from 1 to MH+ is created where the value is zero except for the m/z values of the 50 ranked peaks, which have values equal to the assigned ranks (if more than one ranked peak rounds to the same m/z value, the value is set to the largest rank). This vector is then normalized so that the largest value is one. The hypothetical spectrum vector, Y, for the candidate peptide sequence uses the estimated fragment ion ranks listed in Table 1. Again, vector values are zero except at the predicted m/z values of the ion types in Table 1, where the value is equal to the specified rank, and the vector is also normalized. The Smoothed Rank Score (SRS) is given by:

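$$
\mathrm{SRS} = \left[ \sum_{i=1}^{MH^{+}} \left[ X_i - \bar{X} \right] \left[ Y_i - \bar{Y} \right] \right] +
\begin{cases}
1 & \text{for } n = 0 \\
1 - \dfrac{n - 1}{2} & \text{for } n > 0
\end{cases}
\qquad (1)
$$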

where $X_i$ and $Y_i$ are the normalized tandem mass spectrum rank value and hypothetical spectrum rank value, respectively, $\bar{X}$ and $\bar{Y}$ are the mean rank values in the respective rank vectors, and n is the total number of modifications and substitutions. The first term is somewhat analogous to covariance, and related to the overlap between peaks in the two spectra. The second term in the equation adjusts the SRS score to penalize peptides having a large number of modifications and substitutions. Using more than 50 peaks did not significantly alter SRS scores, and 50 peaks were chosen to minimize computation time. The charge-independent, nonparametric correlation used by the SRS score has an advantage over traditional Pearson’s cross correlation (ref 34) in that it is outlier independent and more robust for non-normally distributed data, such as the intensities of MS/MS peaks. Rank-based correlation has also been suggested by Bern et al. (ref 35)

Table 1. Peptide Fragmentation Model Ion Types and Ranks

ion typeᵃ                        rankᵇ
b- or y-type                     50
b- or y-type +1 AMU              25
b- or y-type -1 AMUᶜ             25
b- or y-type -H2O                10
b- or y-type -NH3                10
a-type                           10
b- or y-type +2Hᵈ                25
neutral loss of modificationᵉ    50

ᵃ Ion types used in the SRS peptide fragmentation model. ᵇ The rank assigned to each ion type based on the expected frequency of occurrence. The ranks are similar to the theoretical fragmentation model used in ref 2. ᶜ b- or y-type -1 AMU are only incorporated into the model when average masses are used to calculate the mass of ions. ᵈ +2 charged b- and y-type ions are incorporated only for peptides of +3 charge or higher. ᵉ If a protein modification is expected and that modification has a known labile bond, the neutral loss of the modification from the parent ion is also included in the fragmentation model.

The OSAS and SRS scores are derived using different assumptions: namely that the OSAS is a measure of mass-based sequence homology, while the SRS is a measure of similarity between a normalized MS/MS spectrum and a predicted fragmentation model. We have empirically adopted a linear combination of these two scoring systems, called the Combined Alignment Score (CAS), to improve the separation of correct and incorrect results based on an analysis of a known protein mixture:

$$
\mathrm{CAS} = \frac{0.9 \cdot \mathrm{OSAS} + 6.0 \cdot \mathrm{SRS}}{100} \qquad (2)
$$
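The rank-vector construction, the covariance-like first term of the SRS, and the combined score of eq 2 can be sketched in a few lines. This is an illustrative reading of the description above and not the OpenSea implementation. The function names are invented, the precursor peak is assumed to have been removed already, MH+ is assumed to be supplied as an integer, and taking the largest rank when two predicted ions share an m/z bin is an assumption carried over from the rule stated for the measured spectrum. The modification penalty term of eq 1 is deliberately left out.

```python
def scale_to_unit_max(vector):
    """Normalize a rank vector so that its largest value is one."""
    largest = max(vector)
    return [v / largest for v in vector] if largest > 0 else list(vector)


def rank_vector(peaks, mh, top_n=50, window=100.0):
    """Build the measured rank vector X from (m/z, intensity) pairs.

    Index i of the result corresponds to integer m/z i (index 0 is unused).
    """
    positive = [(mz, inten) for mz, inten in peaks if inten > 0.0]
    normalized = []
    for mz, inten in positive:
        # Local mean over all peaks within +/- window AMU of this peak.
        local = [i for m, i in positive if abs(m - mz) <= window]
        normalized.append((mz, inten / (sum(local) / len(local))))
    top = sorted(normalized, key=lambda p: p[1], reverse=True)[:top_n]
    vector = [0.0] * (mh + 1)
    for offset, (mz, _) in enumerate(top):
        rank = top_n - offset          # most intense peak gets rank top_n
        idx = int(mz + 0.5)            # round to the nearest integer m/z
        if 1 <= idx <= mh:
            vector[idx] = max(vector[idx], rank)
    return scale_to_unit_max(vector)


def model_vector(predicted_ions, mh):
    """Build the hypothetical vector Y from (m/z, rank) pairs (see Table 1)."""
    vector = [0.0] * (mh + 1)
    for mz, rank in predicted_ions:
        idx = int(mz + 0.5)
        if 1 <= idx <= mh:
            vector[idx] = max(vector[idx], rank)
    return scale_to_unit_max(vector)


def overlap_term(x, y):
    """Covariance-like sum over integer m/z values 1..MH+."""
    xs, ys = x[1:], y[1:]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))


def combined_alignment_score(osas, srs):
    """Combined Alignment Score as given in eq 2."""
    return (0.9 * osas + 6.0 * srs) / 100.0
```

A predicted spectrum whose ions coincide with highly ranked measured peaks yields a positive overlap term, while one whose ions fall on empty m/z bins yields a slightly negative value, which is the behavior the covariance analogy suggests.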

OpenSea now uses a workflow where three sequential searches of an MS/MS data set are executed against each protein sequence database. Initially, a relatively fast scan of the database is completed, with stringent scoring parameters, to find all of the identifiable proteins in that sample. Then, a second search against the much smaller database of identified proteins is performed using an enlarged search space (due to a shorter sequence tag length and an increased “breadth-first” (ref 19) search depth) to identify lower quality de novo sequenced tandem mass spectra. The larger number of candidate sequences from this search increases the likelihood of identifying modifications and substitutions. Finally, the specific post-translational and chemical modifications that were identified by the auto-interpretation table lookup during the second search are fed back into the OpenSea algorithm, and any additional peptides of identified proteins that have those modifications are found. This last round of searching is analogous to specifying variable modifications in SEQUEST, and allows additional ambiguous mass gaps to be aligned that could not be matched in the second round search. Although the scoring system for all three searches remains the same, the enlarged search space for the second and third searches leads to the identification of new peptides. On a 1.7 GHz Pentium 4 processor with 64 MB of RAM, OpenSea takes approximately 14 s to search a single de novo sequence against the SwissProt database (version 41.11 containing 127,863 entries) and interpret any modifications or substitutions that may be present.

Sample Preparation, LC Separations, and MS/MS Spectra Acquisition. Two types of samples were used to test OpenSea. The known protein control mixture was obtained by combining 10 purified proteins of varying molecular weight and physicochemical properties, including Bos taurus insulin, ubiquitin, cytochrome c, superoxide dismutase, beta-lactoglobulin A, serum albumin, and immunoglobulin G, as well as Equus caballus myoglobin, Armoracia rusticana peroxidase, and Gallus gallus conalbumin (obtained from Ciphergen, Fremont, CA). The sample preparation and digestion protocols have been described elsewhere (ref 19).

Lenses from a 93-year-old human male with nuclear cataracts were obtained from the Oregon Lyons Eye Bank with Institutional Review Board approval from the Oregon Health & Sciences University. Following decapsulation, the lenses were homogenized in 1.0 mL of 20 mM phosphate buffer (pH 7.0), and 0.1 mM EGTA. Water-soluble proteins were isolated by centrifugation at 20 000 × g for 30 min. The water-insoluble proteins were suspended from the pellet by brief sonication. Protein content was assayed by the BCA assay (Pierce Biotechnology, Rockford, IL) according to the manufacturer’s protocol using bovine serum albumin as a standard. Protein aliquots (2.5 mg) were dried by vacuum centrifugation and stored at -70 °C until use.

The two-dimensional liquid chromatography separation protocol we used has been previously demonstrated (ref 36). Initially 2.5 mg of protein (either water-soluble fraction or water-insoluble fraction) was dissolved in 250 µL of 8 M urea, 0.8 M Tris, 80 mM methylamine, and 8 mM CaCl2.
The proteins were then reduced with 37.5 µL of 0.9 M dithiothreitol by incubation at 50 °C for 15 min, alkylated with 62.5 µL of 1 M iodoacetamide at room temperature for 15 min, and 37.5 µL of 0.9 M dithiothreitol was again added. The proteins were digested for 18 h at 37 °C with 100 µg trypsin (Trypsin Gold, Promega) dissolved in 612 µL of water. Formic acid was added to a final 5% concentration and the peptide mixture was purified using a Sep-Pak Light (Waters, Milford, MA) solid-phase extraction cartridge. The peptide mixtures were fractionated using a 60-min gradient through a 100 × 2.1 mm polysulfoethyl A cation exchange column (The Nest Group, Inc., Southborough, MA). Some of the two-minute fractions were pooled to produce 23 soluble and 20 insoluble peptide samples.

Q-TOF MS/MS spectra were acquired with a Micromass QTOF-2 (Waters) quadrupole/time-of-flight hybrid mass spectrometer with an online capillary LC (Waters). Samples comprising 5% of each ion exchange fraction were desalted with an in-line C18 trap cartridge (LC Packings, San Francisco, CA) and separated on a 75 µm × 15 cm C18 IntegraFrit column (Waters). Peptides were injected into the online mass spectrometer through a nanospray source.

Ion trap MS/MS analysis was performed using an Agilent 1100 series capillary LC system (Agilent Technologies, Palo Alto, CA) and injection by a standard electrospray source (modified with a 34 G metal needle) into an LCQ Classic ion trap mass spectrometer (ThermoFinnigan, San Jose, CA). Samples comprising 12.5% of each ion exchange fraction were desalted using a 180 µm × 3.0 cm trap cartridge containing 5 µm Zorbax SB-C18 packing material (Agilent Technologies). Separation was performed using a 180 µm × 10 cm column containing the same packing material.

De Novo Sequencing and Database Searching. All Q-TOF MS/MS spectra were de novo sequenced using Peaks Batch Version 2.2 (ref 22) (Bioinformatics Solutions Inc., Waterloo, ON, Canada) using a mass accuracy of 0.1 AMU. Peaks reports full amino acid sequences without unknown mass regions, but assigns each amino acid in the sequence a confidence score. Sequence regions where amino acids had confidence scores below 50% were replaced by the combined mass of those amino acids. If the entire sequence had an average confidence below 50%, then only amino acids that had confidence below the average confidence were combined. All sequences were analyzed with OpenSea using monoisotopic masses for calculating hypothetical parent and fragment masses, and were matched with a mass accuracy of 0.25 AMU.

All ion trap MS/MS spectra were de novo sequenced with LutefiskXP (refs 21, 37) using parent ion and fragment ion mass accuracies of 1.2 AMU and 0.4 AMU, respectively. Lutefisk was configured to consider at maximum 500 subsequences and 2000 final sequences for speed considerations. Lutefisk was further configured to report any de novo sequence with a score above 0.01 Pr(c), but to report only the top three sequences for each MS/MS spectrum. All three sequences were used to produce a database match. OpenSea was configured to use average masses and assumed a mass accuracy of 0.5 AMU for fragment masses.

All samples generated from the control mixture were searched against the Swiss-Prot (ref 38) database (release 41.11, 128 055 entries) that was modified to include PIR-NREF (ref 39) (release 1.25) sequences for the control proteins. OpenSea analysis of the control mixture was compared to SEQUEST 2.7 (ThermoFinnigan) results.
SEQUEST was configured to search for variable oxidation and carbamylation, given our previous history of identifying these modifications in the control sample (ref 19). Peptide identifications with cross correlation scores (XCorr) of greater than 1.8, 2.5, and 3.5 for singly, doubly, and triply charged peptides, respectively, and DeltaCN greater than 0.08 were accepted as correct SEQUEST identifications.

OpenSea analysis of data from the 93-year-old lens digests used the Swiss-Prot database selected for human proteins (9615 entries). In OpenSea, peptide identifications with CAS Scores above 0.9 and protein matches with Protein Scores above 1.5 were accepted. The Protein Score is the sum of the independent peptide identification scores (ref 19). OpenSea was not configured to look for any particular protein or peptide modifications. In addition to passing these thresholds, all modifications reported here were identified in at least two MS/MS spectra, all MS/MS spectra were manually validated, and all were further confirmed using a series of SEQUEST searches configured to look for the same modifications. Spectra matched to modified peptides by OpenSea were compared to the SEQUEST output file for that particular MS/MS spectrum in a properly configured SEQUEST search, and, for all validated spectra, the modification reported by OpenSea was identical to the top scoring peptide match by SEQUEST. To avoid 1 AMU mass accuracy mistakes that can occur in ion trap data, identifications of deamidation were only accepted if at least one Q-TOF spectrum verified the site of deamidation. Similarly, acetylations (+42 AMU) were only accepted if at least one Q-TOF spectrum confirmed that they were not carbamylations (+43 AMU). These conservative requirements significantly reduced the chance of reporting false positive modifications.
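The automatable part of these acceptance rules can be summarized in a short filter. The sketch below is only an illustration of the stated criteria: the record layout (dictionaries with `site`, `modification`, `instrument`, and `cas` keys) and the function name are assumptions, and the manual validation, the Protein Score threshold, and the SEQUEST confirmation step are not modeled.

```python
from collections import defaultdict

# Modifications that needed at least one confirming Q-TOF spectrum.
NEEDS_QTOF = {"deamidation", "acetylation"}


def accepted_modifications(identifications, cas_threshold=0.9, min_spectra=2):
    """Return sorted (site, modification) pairs that pass the filters.

    Each identification is a dict with the hypothetical keys 'site',
    'modification', 'instrument' ('qtof' or 'iontrap'), and 'cas'.
    """
    grouped = defaultdict(list)
    for ident in identifications:
        if ident["cas"] > cas_threshold:
            key = (ident["site"], ident["modification"])
            grouped[key].append(ident["instrument"])

    accepted = []
    for (site, modification), instruments in grouped.items():
        if len(instruments) < min_spectra:
            continue  # seen in fewer than two passing MS/MS spectra
        if modification in NEEDS_QTOF and "qtof" not in instruments:
            continue  # ion trap data alone cannot rule out a 1 AMU error
        accepted.append((site, modification))
    return sorted(accepted)
```

Under these rules a deamidation seen only in ion trap spectra is dropped even when it passes the score threshold twice, which mirrors the conservative handling of 1 AMU ambiguities described above.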

Results and Discussion

OpenSea Analysis of Control Mixture Proteins. Thirty-five technical replicate analyses of a 10-protein mixture were performed using a QTOF-2 quadrupole time-of-flight mass spectrometer as previously described (ref 19). The OpenSea Alignment Score (OSAS) was computed to separate correct peptide identifications from incorrect, and is primarily concerned with sequence homology between a de novo sequence (corrected for sequencing errors by OpenSea) and a database protein sequence. The current version of OpenSea now rescores all potential peptide identifications using the Smoothed Rank Score (SRS). The SRS score is primarily a measure of correlation between the actual spectrum and a theoretical spectrum generated by a fragmentation model, and therefore makes different assumptions about the MS/MS data than the OSAS. Figure 2a shows a scatter plot of peptide identifications made across all 35 technical replicates, as scored by both the OSAS and the SRS scoring systems. Further tuning of the mass-based alignment algorithm and the OSAS scoring results in most incorrect peptides having OSAS scores below -100.0. This creates a bimodal score distribution of the incorrect peptide identifications (data not shown) and the majority of the incorrect peptide scores are not visible in Figure 2a. As all correct peptide identifications have OSAS scores above 0.0, the lower distribution can automatically be disqualified, thereby eliminating approximately 80% of the incorrect results when using a Q-TOF instrument. The distribution of correct peptide identifications in Figure 2 had OSAS scores that ranged from about 50 to 150, and SRS scores from 5 to 15. The separation of correct from incorrect peptide identifications using either individual scoring system was inferior to linearly combining the OSAS and the SRS scores using the empirically derived weighting factors in eq 2.
The combined score (CAS) had a sensitivity (percentage of correct peptides exceeding threshold) of 97% and a specificity (percentage of incorrect peptides below threshold) of 98%, as compared to the 89% sensitivity and 93% specificity of the OSAS alone. For comparison, analysis of the same data sets using SEQUEST scoring thresholds listed in the Experimental section resulted in 69% sensitivity and 95% specificity. Peaks (version 2.2) identified a larger number of control protein peptides than LutefiskXP (see Supporting Information) and was used in the subsequent lens protein analysis.

Twenty technical replicate analyses of this 10-protein mixture were also analyzed using an LCQ Classic ion trap mass spectrometer. As shown in Figure 2b, the same scoring parameters can be directly applied to the analysis of ion trap MS/MS data without any changes other than the increased mass tolerances mentioned in the Experimental section. Since the sensitivity (89%) and specificity (97%) values were reasonable, no attempt was made to independently optimize scoring for ion traps. The lower sensitivity and specificity values for the ion trap data compared to Q-TOF data were not unexpected, and are primarily due to decreased mass accuracy, the interspersion of b-type and y-type ions, and the lack of both low and high m/z peaks. Sensitivity is decreased because LutefiskXP produces lower quality de novo sequences from ion trap data than from Q-TOF data, thereby lowering OpenSea scores for correct peptides. Also, the sequences contain more ambiguous regions that increase the likelihood that OpenSea can make


Figure 2. Scatter plots showing the distribution of correct and incorrect peptide identifications made by the OpenSea Alignment Score (OSAS) and the Smoothed Ranked Score (SRS) when analyzing Q-TOF (a) and ion trap (b) MS/MS data. Correct identifications are represented by blue “X”s, while incorrect identifications are shown as red “O”s. The combined alignment score threshold of 1.0 is drawn as a diagonal line, showing the improved separation of incorrect versus correct matches when combining the OSAS and the SRS.


Table 2. Summary of Identified Modifications and Sites in 93-Year-Old Cataractous Human Lens

Proteins (accession no.ᵃ): Crystallin, αA chain (P02489); Crystallin, αB chain (P02511); Crystallin, βA3 (P05813); Crystallin, βA4 (P53673); Crystallin, βB1 (P53674); Crystallin, βB2 (P43320); Crystallin, βB3 (P26998); Crystallin, γB (P07316); Crystallin, γC (P07315); Crystallin, γD (P07320); Crystallin, γS (P22914)

A: confirmation of previously reported protein modifications (sites grouped by protein; groups separated by semicolons)

deamidation: Q6, Q90, Q147; N146; Q42, N54, N103, N120, Q164; N157, N161; N14, Q16, Q63, N76, Q120
oxidation: M1; M1, M68ᵇ; M126ᵇ; M13; M112, M136, W192, M225ᵇ; M121, W150; M58, M73
methylation: C82, C117, C185; C22; C110; C24, C26
phosphorylation: S122; S59
acetylation: n1; n1; n1; n1ᵇ; n1; n1ᵇ,ᵈ

B: newly identified protein modifications (sites grouped by protein; groups separated by semicolons)

deamidation: N123; N40, N62, N133; N82, N113; N57, N67, Q69, N124; N115, Q162; N155; N24, Q66; Q12, N49ᶜ, N160
oxidation: W9ᵇ; W9; M46, W96, W99ᵇ, W168; W100ᶜ; M192ᵇ; M102; M101 or M102ᶠ; W156; W162
methylation: C114
+28 AMUᵉ: S20, H79; H83ᵇ; S151, H214ᵇ

ᵃ Protein sequences are referenced to their SwissProt database accession number. ᵇ Modification observed only in ion trap data. ᶜ Modification observed only in Q-TOF data. ᵈ N-terminally acetylated without initial methionine residue. ᵉ The +28 AMU mass shift modification does not agree with any previously reported lens modifications. ᶠ Modification could be either M101 or M102.

matches to incorrect peptides, which lowers the specificity. A SEQUEST search of the ion trap data resulted in 62% sensitivity and 97% specificity using the scoring thresholds given in the Experimental section. The lower sensitivity in SEQUEST searches is probably a consequence of SEQUEST being better able to consider noisy MS/MS spectra, many of which do not pass the scoring thresholds. The current versions of ion trap de novo sequencing programs do not provide OpenSea with candidate sequences for these poor quality spectra. Confirming previous reports (ref 37), OpenSea was able to identify more peptides from ion trap de novo sequencing results when using LutefiskXP than when using Peaks 2.2 (see Supporting Information), and LutefiskXP was therefore used in further analysis of ion trap data.

Analysis of 93-Year-Old Human Cataractous Lens Tissue. Eleven crystallins represent the majority of the total human lens tissue protein by mass. Crystallin proteins in human lens tissue do not turn over, and as the tissue ages they often become substantially modified. Over time, some of these crystallins become more insoluble and the tissue itself becomes increasingly opaque. It has been suggested that the accumulation of protein modifications plays a role in these aging changes and could lead to the formation of cataracts (ref 40). In particular, deamidations have been linked with aging and cataractogenesis (ref 41). It is suspected that accumulated age-related protein modifications might cause instability and tertiary conformational changes, resulting in aggregation, insolubility, and increased light scatter. An extensive catalog of known protein modifications in human crystallins includes phosphorylations (refs 9, 42), acetylations (ref 43), oxidations (refs 9, 42, 44-46), methylations (refs 19, 47-49), and deamidations (refs 41, 42, 44-46).

As human lens tissue is of relatively low protein complexity, and many modifications to lens proteins have been well-characterized, the tissue is an ideal choice to test techniques that identify protein modifications. For example, human lens tissue has been previously used as a model to demonstrate MS/MS based protein modification detection using SEQUEST and various SEQUEST utilities (such as SEQUEST-PHOS) (ref 9). In this study, proteins from a 93-year-old human male lens containing an age-related nuclear cataract were separated into water-soluble and water-insoluble fractions. Both of these fractions were digested and separated via cation exchange HPLC. Approximately twenty peptide fractions were collected from both the soluble and insoluble peptides and analyzed using both QTOF-2 and LCQ Classic tandem mass spectrometers with online capillary LC using C18 columns.

The two new methods detailed in the Experimental section, namely auto-interpretation of OpenSea alignments coupled with SRS rescoring, allowed OpenSea to localize amino acid modifications to specific residues. Figure 3 shows a de novo sequence (a) derived using LutefiskXP from an ion trap MS/MS spectrum. Although over 40% of the sequence is in error, OpenSea can identify the correct peptide from the γS crystallin as the top-scoring match (Figure 3b). The auto-interpretation subroutine (Figure 3c) groups consecutive mass mismatches and attributes the overall delta mass to a methylated cysteine at either site C24 or C26. As shown in Figure 3d, the peptide with the highest SRS score is reported as the best interpretation, and can be used, in this case, to localize the methylation to C24. Additional scoring details can be found in the Supporting Information.

Using this system, OpenSea was able to identify 80 sites of modification in eleven crystallins, as cataloged in Table 2. All modified residues were observed in at least two MS/MS spectra and deamidations were only reported if observed in the Q-TOF data. Table 2a confirms 44 sites of phosphorylation, acetylation,


Figure 3. Identification of a methylated peptide YDCD[Cm]DCADFHTYLSR from an ion trap tandem mass spectrum using OpenSea. Initially a de novo sequence is derived from the spectrum (a) and that sequence is used to generate a mass-based alignment against the γS Crystallin chain (b) where the database sequence is represented on top and the de novo sequence on bottom. Regions of the sequences that match by mass within 0.5 AMU are signified by “|”, whereas mass mismatches with local alignment scores 0, which is signified by “:”. Modification localization using SRS scoring (d) suggested that the methylation of C24 corresponded to more of the intense peaks in the MS/MS spectrum than methylation of C26. As a result, methylation of C24 was reported as the best scoring interpretation. B- and y-type fragment ion peaks in the tandem mass spectrum (a) that correspond to the C24 methylated peptide sequence are labeled in red and blue, respectively. Other peaks that can be interpreted as less common ions derived from neutral loss are shown in green.

oxidation, methylation, and deamidation that have been previously reported in the literature.9,19,41-49 Table 2b also reports 36 new identifications discovered in this study. Although our method cannot differentiate in vivo oxidations from those that occurred during sample preparation, 13 of the 25 oxidations reported have been previously observed in lens tissue.9,42,44-46 Additionally, 8 of the new identifications are at tryptophan residues, which in general is an uncommon target for in vitro oxidation. Despite significant differences between ion trap and 552

Journal of Proteome Research • Vol. 4, No. 2, 2005

Q-TOF fragmentation patterns, 85% of the modifications reported in Table 2 were identified by both mass spectrometers lending credibility to the experimental procedure. Also included in Table 2b are unanticipated +28 AMU mass shifts to two serines and three histidines. The +28 AMU shift could possibly be dimethylation or in vitro formylation as postulated by Hanson et al.44 in a previous lens crystallin study. Although 5% formic acid was used to deactivate trypsin in this experiment, formylation was not seen in the 10-protein control

research articles

Protein Modifications Using de Novo and OpenSea

mixture, which used a similar digestion protocol. Residues T13 (RA), S200 (βA3), and S174 (βB2) had an apparent loss of 18 AMU which could be due to β-elimination of a phosphate group. However, no spectra with a mass increase of 80 AMU for those three peptides were observed. Further studies are planned to validate the chemical composition of these modifications and to verify the new sites of modifications. In addition to the modifications reported in Table 2b, N67 in βB1 crystallin chain had an apparent loss of ammonia to form succinimide, which is a likely intermediate in nonenzymatic deamidation.50 Also, four aspartic acid residues and one glutamic acid residue lost water. From the Q-TOF and ion trap data, 210 MS/MS spectra were assigned to deamidated peptides in the soluble samples (4.6% of all spectra identified from the soluble fraction), whereas 587 deamidated peptides were assigned from the insoluble samples (13.8% of the total in the insoluble fraction). Using total spectrum count to estimate relative abundance,51 this suggests a 3-fold increase in overall extent of deamidation in the insoluble samples, which is consistent with previously reported values44,45 for individual crystallins. Seventy-five sites of N-terminal peptide carbamylation were found, represented by 526 Q-TOF and iontrap spectra. Nterminal carbamylation is a common in vitro modification when using urea as a denaturant.52 Curiously, no sites of lysine carbamylation were identified, suggesting a strong N-terminal preference. Twenty-three sites of pyroglutamic acid53 and N-terminal S carbamoyl methyl cysteine cyclization54 (369 spectra) were also identified. These entropically driven cyclized forms are also common in vitro modifications. Seventeen generally acidic sites were found to contain an unanticipated +38 AMU mass shift (80 spectra). 
A common feature of the spectra in question was a strong +1 charged loss of 38 AMU from the parent ion, implying that the modification is generally charged. As 350 mM KCl was used during the strong cation exchange separation, it is possible that a potassium cation was bound to acidic residues in some peptides.

SEQUEST searches of the data were performed to verify modifications found by OpenSea (data not shown) and produced an interesting result. A mass shift of +38 AMU was not specified in the SEQUEST searches since it was not clear which residues were affected; however, phosphorylation (+80 AMU) of serine and threonine was an allowed modification. The sample contains many N-terminally acetylated proteins (+42 AMU) and a large number of N-terminally carbamylated peptides (+43 AMU), and the sum of either shift and +38 AMU is essentially the same mass as phosphorylation. Consequently, there were several peptide identifications with erroneous phosphorylation sites, especially when S or T was in the N-terminal region of the peptide. These false positives had XCorr and DeltaCN scores that met generally established criteria for correct peptide identifications. This demonstrates the need for careful validation and caution when assigning modifications using database searching programs such as SEQUEST.
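The mass coincidence behind these false phosphorylation assignments can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the +38 AMU shift is a potassium adduct replacing one proton, as the text tentatively suggests, and it uses standard monoisotopic mass shifts that are not taken from this paper. The constant and function names are hypothetical.

```python
# Standard monoisotopic mass shifts in AMU (assumed values, not from the paper).
ACETYL = 42.010565            # N-terminal acetylation, C2H2O
CARBAMYL = 43.005814          # N-terminal carbamylation, CHNO
PHOSPHO = 79.966331           # phosphorylation, HPO3
POTASSIUM_ADDUCT = 37.955882  # assumed K+ replacing H+: 38.963707 - 1.007825


def mass_gap(shifts, target=PHOSPHO):
    """Difference between the summed shifts and the target modification mass."""
    return sum(shifts) - target


# How far is each N-terminal modification plus the +38 shift from phosphorylation?
acetyl_gap = mass_gap([ACETYL, POTASSIUM_ADDUCT])
carbamyl_gap = mass_gap([CARBAMYL, POTASSIUM_ADDUCT])

print(f"acetyl + K adduct vs phospho:   {acetyl_gap:+.4f} AMU")
print(f"carbamyl + K adduct vs phospho: {carbamyl_gap:+.4f} AMU")
```

Under these assumptions the acetylated case differs from phosphorylation by far less than any ion trap mass tolerance, while the carbamylated case differs by roughly 1 AMU, which a wide parent ion tolerance could still accept. This is consistent with the erroneous phosphorylation sites clustering near the peptide N-terminus.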
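The localization step discussed for Figure 3d, where two candidate sites (C24 and C26) explain the same delta mass and the better-supported one is reported, can be illustrated with a simplified sketch. This is not the SRS scoring function, which is defined in the Experimental section and Supporting Information; as a stand-in, it sums the intensity of peaks explained by singly charged b- and y-ions for each candidate site. The function names, the 0.5 AMU fragment tolerance (borrowed from the alignment tolerance in the Figure 3 caption), the unalkylated cysteine masses, and the synthetic spectrum are all assumptions for illustration.

```python
# Monoisotopic residue masses (AMU) for the residues in the example peptide.
RESIDUE_MASS = {
    "A": 71.03711, "C": 103.00919, "D": 115.02694, "F": 147.06841,
    "H": 137.05891, "L": 113.08406, "R": 156.10111, "S": 87.03203,
    "T": 101.04768, "Y": 163.06333,
}
METHYL = 14.01565   # CH2 added by methylation
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide, mod_index, mod_mass):
    """Singly charged b- and y-ion m/z values with one modified residue (0-based)."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    masses[mod_index] += mod_mass
    ions = []
    for cut in range(1, len(masses)):
        ions.append(sum(masses[:cut]) + PROTON)          # b ion
        ions.append(sum(masses[cut:]) + WATER + PROTON)  # y ion
    return ions


def matched_intensity(peaks, ions, tol=0.5):
    """Total intensity of peaks lying within tol AMU of any theoretical ion."""
    return sum(
        intensity
        for mz, intensity in peaks
        if any(abs(mz - ion) <= tol for ion in ions)
    )


def localize(peptide, candidate_sites, mod_mass, peaks, tol=0.5):
    """Score every candidate site and return (best_site, scores)."""
    scores = {
        site: matched_intensity(peaks, fragment_ions(peptide, site, mod_mass), tol)
        for site in candidate_sites
    }
    return max(scores, key=scores.get), scores


peptide = "YDCDCDCADFHTYLSR"
# Peptide indices 4 and 6 are inferred from the caption to be C24 and C26.
candidates = {4: "C24", 6: "C26"}
# Synthetic spectrum (NOT the paper's data): unit-intensity peaks at the
# theoretical ions of the index-4 form, so that form is expected to win.
toy_peaks = [(mz, 1.0) for mz in fragment_ions(peptide, 4, METHYL)]

best, scores = localize(peptide, list(candidates), METHYL, toy_peaks)
print(candidates[best], scores)
```

Only the fragments whose cleavage falls between the two candidate cysteines distinguish the localizations, which is why a score based on matched peaks can separate them even when the parent mass cannot.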

Conclusion

Here we have shown that OpenSea, a program that combines de novo sequencing results with a mass-based alignment tool, could identify posttranslationally modified peptides in a high-throughput environment. Without any user-specified modification mass shifts, the program was able to confirm a large number of previously known lens modifications, identify numerous new modification sites, and detect a wide variety of in vitro modifications. OpenSea could process de novo sequences from high mass accuracy Q-TOF data and, for the first time, low mass accuracy ion trap MS/MS data. In a single analysis, OpenSea was able to consider over 75 eukaryote modifications listed in the FindMod database, search for amino acid substitutions, and identify nontryptic cleavage sites. An equivalent search would generally be a computationally intractable problem for most database searching programs. OpenSea was written in Java, and binaries are available for noncommercial academic use via a Material Transfer Agreement (see Supporting Information). DTA files of the 10-protein control mixture generated from Q-TOF or ion trap instruments are available upon request (see Supporting Information for details).

Acknowledgment. The authors would like to thank Ashley McCormack and D. Leif Rustvold for helpful discussions. This work was supported by National Institutes of Health Grant Nos. U19ES11384 and U24DK5870 to Srinivasa Nagalla and EY07755 to Larry David.

Supporting Information Available: Program and data availability, scoring algorithm details, modification log odds values, and Peaks 2.2 versus LutefiskXP performance (3 sets of figures and 1 table) are available in the publication section of our website at http://medir.ohsu.edu/~geneview. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Mann, M.; Jensen, O. N. Nat. Biotechnol. 2003, 21, 255-261.
(2) Eng, J. K.; McCormack, A. L.; Yates, J. R., III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(3) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(4) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.
(5) Denny, R.; Neeson, K.; Rennie, C.; Richardson, K.; Leicester, S.; Swainston, N.; Worroll, J.; Young, P. “The Use of Search Workflows in Peptide Assignment From MS/MS Data”, Association of Biomolecular Resource Facilities, ABRF ’02: Biomolecular Technologies: Tools for Discovery in Proteomics and Genomics, Austin, Texas, March 9-12, 2002.
(6) Owens, K. G. Appl. Spectrosc. Rev. 1992, 27, 1-49.
(7) Yates, J. R., III; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-1436.
(8) Zhou, H.; Watts, J. D.; Aebersold, R. Nat. Biotechnol. 2001, 19, 375-378.
(9) MacCoss, M. J.; McDonald, W. H.; Saraf, A.; Sadygov, R.; Clark, J. M.; Tasto, J. J.; Gould, K. L.; Wolters, D.; Washburn, M.; Weiss, A.; Clark, J. I.; Yates, J. R., III Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 7900-7905.
(10) Liebler, D. C. Introduction to Proteomics: Tools for the New Biology; Humana Press: Totowa, NJ, 2002; pp 167-184.
(11) Gatlin, C. L.; Eng, J. K.; Cross, S. T.; Detter, J. C.; Yates, J. R., III Anal. Chem. 2000, 72, 757-763.
(12) Pevzner, P. A.; Mulyukov, Z.; Dancik, V.; Tang, C. L. Genome Res. 2001, 11, 290-299.
(13) Creasy, D. M.; Cottrell, J. S. Proteomics 2002, 2, 1426-1434.
(14) Clauser, K. R.; Baker, P.; Burlingame, A. L. “Peptide Fragment-Ion Tags from MALDI/PSD for Error-tolerant Searching of Genomic Databases”, Proceedings of the 44th ASMS Conference on Mass Spectrometry and Allied Topics, Portland, Oregon, May 12-16, 1996.
(15) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399.
(16) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A.; Shevchenko, A. Anal. Chem. 2003, 75, 1307-1315.
(17) Tabb, D. L.; Saraf, A.; Yates, J. R., III Anal. Chem. 2003, 75, 6415-6421.
(18) Liebler, D. C.; Hansen, B. T.; Davey, S. W.; Tiscareno, L.; Mason, D. E. Anal. Chem. 2002, 74, 203-210.
(19) Searle, B. C.; Dasari, S.; Turner, M.; Reddy, A. P.; Choi, D.; Wilmarth, P. A.; McCormack, A. L.; David, L. L.; Nagalla, S. R. Anal. Chem. 2004, 76, 2220-2230.

Journal of Proteome Research • Vol. 4, No. 2, 2005 553

(20) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. J. Comput. Biol. 1999, 6, 327-342.
(21) Taylor, J. A.; Johnson, R. S. Anal. Chem. 2001, 73, 2594-2604.
(22) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337-2342.
(23) Lu, B.; Chen, T. J. Comput. Biol. 2003, 10, 1-12.
(24) Skilling, J.; Denny, R.; Richardson, K.; Young, P.; McKenna, T.; Campuzano, I.; Ritchie, M. Comput. Funct. Genom. 2004, 5, 61-68.
(25) Taylor, J. A.; Johnson, R. S. Rapid Commun. Mass Spectrom. 1997, 11, 1067-1075.
(26) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Shevchenko, A.; Bork, P.; Ens, W.; Standing, K. G. Anal. Chem. 2001, 73, 1917-1926.
(27) Huang, L.; Jacob, R. J.; Pegg, S. C.; Baldwin, M. A.; Wang, C. C.; Burlingame, A. L.; Babbitt, P. C. J. Biol. Chem. 2001, 276, 28327-28339.
(28) Mackey, A. J.; Haystead, T. A. J.; Pearson, W. R. Mol. Cell. Proteomics 2002, 1, 139-147.
(29) Hernandez, P.; Gras, R.; Frey, J.; Appel, R. D. Proteomics 2003, 3, 870-878.
(30) Wilkins, M. R.; Gasteiger, E.; Gooley, A. A.; Herbert, B. R.; Molloy, M. P.; Binz, P. A.; Ou, K.; Sanchez, J. C.; Bairoch, A.; Williams, K. L.; Hochstrasser, D. F. J. Mol. Biol. 1999, 289, 645-657.
(31) Henikoff, S.; Henikoff, J. G. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 10915-10919.
(32) Dayhoff, M. O.; Schwartz, R. M.; Orcutt, B. C. In Atlas of Protein Sequence and Structure; Dayhoff, M. O., Ed.; Natl. Biomed. Res. Found., Washington, DC, 1978; vol. 5 suppl. 3, pp 345-352.
(33) Searle, B. C.; Turner, M.; Dasari, S.; Rodland, M. J.; Lapidus, J.; Khatra, G.; Nagalla, S. R. “Using Mass-Based Alignment of MS/MS De Novo Sequencing Results to Mine For Peptides In Unmatched MS/MS Spectra”, Peptide Fragmentation and Identification Workshop, Gaithersburg, Maryland, May 3-4, 2004.
(34) Pagano, M.; Gauvreau, K.; Pagano, R. Principles of Biostatistics, 2nd ed.; Duxbury Press: Pacific Grove, CA, 2000; pp 400-407.
(35) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R., III Bioinformatics 2004, 20, Suppl. 1, i49-i54.
(36) Wilmarth, P. A.; Riviere, M. A.; Rustvold, D. L.; Lauten, J. D.; Madden, T. E.; David, L. L. J. Proteome Res. 2004, 5, 1017-1023.
(37) Johnson, R. S. “Lutefisk1900 vs Peaks: A comparison of automated de novo sequencing programs”, ABRF ’04: Integrating Technologies in Proteomics and Genomics, Portland, Oregon, February 28-March 2, 2004.

(38) Apweiler, R.; Bairoch, A.; Wu, C. H.; Barker, W. C.; Boeckmann, B.; Ferro, S.; Gasteiger, E.; Huang, H.; Lopez, R.; Magrane, M.; Martin, M. J.; Natale, D. A.; O'Donovan, C.; Redaschi, N.; Yeh, L. S. Nucleic Acids Res. 2004, 32, D115-D119.
(39) Wu, C. H.; Huang, H.; Arminski, L.; Castro-Alvear, J.; Chen, Y.; Hu, Z.; Ledley, R. S.; Lewis, K. C.; Mewes, H. W.; Orcutt, B. C.; Suzek, B. E.; Tsugita, A.; Vinayaka, C. R.; Yeh, L. S.; Zhang, J.; Barker, W. C. Nucleic Acids Res. 2002, 30, 35-37.
(40) National Advisory Eye Council, Report of the Retinal Diseases Panel: Vision Research: A National Plan, 1994-1998. United States Department of Health and Human Services, Bethesda, MD, 1993; Publication NIH 93-3186; pp 175-183.
(41) Lapko, V. N.; Purkiss, A. G.; Smith, D. L.; Smith, J. B. Biochemistry 2002, 41, 8638-8648.
(42) Lund, A. L.; Smith, J. B.; Smith, D. L. Exp. Eye Res. 1996, 63, 661-672.
(43) Lapko, V. N.; Smith, D. L.; Smith, J. B. Protein Sci. 2001, 10, 1130-1136.
(44) Hanson, S. R.; Hasan, A.; Smith, D. L.; Smith, J. B. Exp. Eye Res. 2000, 71, 195-207.
(45) Zhang, Z.; Smith, D. L.; Smith, J. B. Exp. Eye Res. 2003, 77, 259-272.
(46) Harrington, V.; McCall, S.; Huynh, S.; Srivastava, K.; Srivastava, O. P. Mol. Vis. 2004, 10, 476-489.
(47) Lapko, V. N.; Smith, D. L.; Smith, J. B. Biochemistry 2002, 41, 14645-14651.
(48) Lapko, V. N.; Smith, D. L.; Smith, J. B. Protein Sci. 2003, 12, 1762-1774.
(49) Lapko, V. N.; Cerny, R. L.; Smith, D. L.; Smith, J. B. Protein Sci. 2005, 14, 45-54.
(50) Wright, H. T. CRC Crit. Rev. Biochem. 1991, 26, 1-52.
(51) Liu, H.; Sadygov, R. G.; Yates, J. R., III Anal. Chem. 2004, 76, 4193-4201.
(52) Stark, G. R.; Stein, W. H.; Moore, S. J. Biol. Chem. 1960, 235, 3177-3181.
(53) Khandke, K. M.; Fairwell, T.; Chait, B. T.; Manjula, B. N. Int. J. Peptide Protein Res. 1989, 34, 118-123.
(54) Geoghegan, K. F.; Hoth, L. R.; Tan, D. H.; Borzilleri, K. A.; Withka, J. M.; Boyd, J. G. J. Proteome Res. 2002, 1, 181-187.

PR049781J