Isoform Analysis of LC-MS/MS Data from Multidimensional Fractionation of the Serum Proteome†
Alexei L. Krasnoselsky,* Vitor M. Faca, Sharon J. Pitteri, Qing Zhang, and Samir M. Hanash
Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington 98109
Received November 8, 2007
Abstract: We developed a visualization approach for the identification of protein isoforms, precursor/mature protein combinations, and fragments from LC-MS/MS analysis of multidimensional fractionation of serum and plasma proteins. We also describe a pattern recognition algorithm that automatically detects and flags potentially heterogeneous species of proteins in proteomic experiments that involve extensive fractionation and result in a large number of identified serum or plasma proteins. We give examples of proteins with known isoforms that validate our approach and present a subset of precursor/mature protein pairs that were detected with it. Potential applications include identification of differentially expressed isoforms in disease states.
Keywords: Protein fractionation • visualization • LC-MS/MS • isoforms
Journal of Proteome Research 2008, 7, 2546–2552. Published on Web 04/18/2008.
* To whom correspondence should be addressed. Tel: (206) 667-1250, fax: (206) 667-2537, E-mail: [email protected].
† “Graphics reveals data”. Edward R. Tufte in “The Visual Display of Quantitative Information”.
Introduction
With the rapid proliferation of proteomic data, there is a need for tools that allow computational data mining and visualization of complex data sets. Many software packages are available for processing proteomics data and displaying results (for a recent review, see Palagi et al.).1 However, there is a paucity of visualization tools that are simple and easily adaptable to evolving proteomic data formats. Visualization tools combine several sources of information for intelligent data mining. The human eye is particularly suited to identifying complex patterns and features, provided that the information is presented in a structured visual way and limited to a few patterns at a time. The red-green heat maps used for gene expression data are an example of a simple yet effective method of representing complex data.2
Proteins exist in plasma and tissue sources in multiple forms that result from alternative splicing (isoforms), precursor/mature protein combinations, or different patterns of glycosylation. Most proteins are secreted as precursor proteins from which biologically active forms are generated upon proteolytic cleavage (e.g., see Khatib and Geraldine).3 For biomarker discovery, it is important to assess the presence of isoforms that may differ in their levels in a disease-related manner, as in the case of phosphorylation and glycosylation, among numerous post-translational modifications. We present here a visualization approach for multidimensional proteomic data to assist in the search for protein isoforms, precursor/mature protein combinations, and fragments. Along with the visualization tool, we also describe a simple pattern recognition algorithm that we developed to automatically detect and flag potentially heterogeneous species of proteins in proteomic experiments that involve extensive fractionation and result in a large number of identified proteins in one experiment.
Methods
Protein Separation and Mass Spectrometry Analysis. Serum and plasma protein samples were subjected to fractionation followed by LC-MS/MS analysis of tryptic digests from individual fractions. The full procedure, designated Intact Protein Analysis System (IPAS), has been previously described by Faça et al.4 Briefly, after immunodepletion, acrylamide-labeled samples5 were fractionated by anion-exchange chromatography into 12 fractions and subsequently by reversed-phase chromatography into 12 fractions, for a total of 144 fractions that were analyzed individually by shotgun LC-MS/MS. In-solution tryptic digestion was performed overnight with lyophilized aliquots from the reversed-phase (second dimension) fractionation step. The resulting peptide mixtures were analyzed by an LTQ-FTICR mass spectrometer (Thermo-Finnigan) coupled with a NanoAcquity nanoflow chromatography system (Waters). Spectra were acquired in a data-dependent mode in the m/z range of 400-1800, including selection of the 5 most abundant +2 or +3 ions of each MS spectrum for MS/MS analysis. Acquired data were automatically processed by the Computational Proteomics Analysis System (CPAS)6 pipeline. This pipeline includes the X!Tandem search algorithm7 with the comet score module plugin,8 PeptideProphet9 peptide validation, and the ProteinProphet10 protein inference tool. The tandem mass spectra were searched against version 3.12 of the human IPI database.11 All identifications with a PeptideProphet probability greater than 0.75 were selected, and the subsequent protein identifications were filtered at a 5% error rate.
Heterogeneity Detection Algorithm. The concept behind cluster detection is as follows.
For each protein (a single IPI entry or a protein group of multiple IPI numbers considered to represent the same protein), the data were assembled into an n × m grid of fractions, where n is the number of fractions derived from ion-exchange chromatography (represented on the X-axis) and m is the number of fractions derived from RP-HPLC (represented on the Y-axis). The dimensions for the two data sets used in this article are 12 × 12 for one data set and 12 × 11 for the other. The pattern detection is performed at the peptide level. For each peptide, a binary separation map is derived by assigning 1 to a fraction where the peptide was identified and 0 where it was not. The map serves as input to the protein heterogeneity detection algorithm, which consists of two steps. First, the fractionation pattern is smoothed by a 2 × 2 kernel, whereby each fraction $x_{i,j}$ is assigned the sum of the values in the kernel: $S_{i,j} = x_{i,j} + x_{i+1,j} + x_{i,j+1} + x_{i+1,j+1}$. The rationale for smoothing is to reduce the MS sampling effect, which might otherwise result in overestimation of the number of clusters. Second, the clusters are defined by selecting the nodes with values equal to or exceeding k ($k_{\max} = 4$) that are separated by a gap of at least g fractions (g = 2 for this fractionation experiment, based on the chromatographic resolution of the system). The number of identified peptide clusters is then averaged across all peptides for a given protein to give the cluster score assigned to this protein. The output consists of all proteins ranked by cluster score, with the cluster statistics described at the peptide level.
Data Visualization. The visualization application requires several input data matrices. For each protein (a single IPI entry or a protein group of multiple IPI numbers considered to represent the same protein), the n × m data matrix of fractions (as described above), along with the ratio vector and a vector of the number of spectral events for each peptide in each fraction, is passed to the software. A vector of sequence position information scaled to 0-1 is passed to the application as well. All preprocessing of the data is completed before the data are passed to the visualization tool. The outputs of the application are three figures, saved as picture files (JPEG format): a figure that combines fractionation, ratio, and peptide sequence information (such as Figure 1A), a histogram of the ratios (if available, see Figure 1B), and a figure of total spectral events for each fraction (such as Figure 1C). The Matlab code for the application is available upon request from the author.
DOI: 10.1021/pr7007219. © 2008 American Chemical Society.
Results
Figure 1. Visualization of proteomic data in 2-D fractionation experiments with differential sample labeling. The data shown are for the protein HGFA (hepatocyte growth factor activator). (A) The peptide and ratio map of the 2-D chromatography fractionation. The grid represents the 2-D chromatography fractionation (12 × 11 fractions). The X-axis represents 12 fractions of ion-exchange chromatography and the Y-axis, 11 fractions of RP-HPLC. Each node of the grid shows the fraction location. The peptides are shown as concentric circles of different colors (the full list of identified peptides is shown in the inset), whereby the size of the circle indicates the relative distance of the peptide from the N-terminus of the full protein sequence; that is, circle size follows the sequential order of the peptides starting from the N-terminus. The range for each peptide represents its starting and ending positions in the protein sequence, scaled to 0-1. Values are provided for ratios between the two samples being compared based on differential acrylamide labeling,5 where case samples are labeled with C13 acrylamide and control samples with C12 acrylamide. (B) Histogram of the ratios obtained for this protein in an experiment in which two samples are compared (across all 132 fractions). (C) Total MS events map for the 2-D separation. Each node of the grid shows the number of MS events summed across all peptides, and the size of the circle reflects that number.
Figure 2. Relationship between the number of peptides and the average number of clusters per protein. (A) The average cluster score (number of identified chromatographic clusters averaged across all the peptides per protein) is plotted against the total number of unique peptides for the corresponding protein. (B) Histogram of average cluster scores across all proteins with two or more unique peptides.
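The cluster score used in Figure 2 comes from the heterogeneity detection algorithm described in Methods. The following Python sketch illustrates one possible reading of that description; it is not the authors' Matlab code. Several details are assumptions made here for illustration: fractions beyond the grid edge are treated as 0 during smoothing, the default threshold k = 1 is arbitrary (the text states only that the maximum is 4), and "a gap of at least g fractions" is interpreted as at least g fractions without a selected node lying between two clusters along the grid, so that nodes within an index distance of g on both axes are merged. The function names and the example peptide maps are hypothetical.

```python
from collections import deque
from statistics import mean


def binary_map(n, m, hits):
    """Build an n x m 0/1 grid; hits are (i, j) fractions where the peptide was identified."""
    grid = [[0] * m for _ in range(n)]
    for i, j in hits:
        grid[i][j] = 1
    return grid


def smooth(grid):
    """2 x 2 kernel sum: S[i][j] = x[i][j] + x[i+1][j] + x[i][j+1] + x[i+1][j+1].

    Assumption: fractions beyond the grid edge count as 0.
    """
    n, m = len(grid), len(grid[0])

    def x(i, j):
        return grid[i][j] if i < n and j < m else 0

    return [[x(i, j) + x(i + 1, j) + x(i, j + 1) + x(i + 1, j + 1)
             for j in range(m)] for i in range(n)]


def count_clusters(grid, k=1, g=2):
    """Count groups of smoothed nodes with value >= k.

    Assumption: two nodes belong to the same cluster when their index
    distance is at most g on both axes (fewer than g fractions between them).
    """
    smoothed = smooth(grid)
    nodes = {(i, j) for i, row in enumerate(smoothed)
             for j, value in enumerate(row) if value >= k}
    clusters = 0
    while nodes:
        clusters += 1
        queue = deque([nodes.pop()])
        while queue:  # breadth-first walk over nodes close enough to merge
            ci, cj = queue.popleft()
            near = {(i, j) for (i, j) in nodes
                    if max(abs(i - ci), abs(j - cj)) <= g}
            nodes -= near
            queue.extend(near)
    return clusters


def cluster_score(peptide_maps, k=1, g=2):
    """Average the per-peptide cluster counts to obtain the protein-level score."""
    return mean(count_clusters(grid, k, g) for grid in peptide_maps.values())


# Hypothetical peptide maps on a 12 x 11 grid, for illustration only.
example = {
    "peptide_a": binary_map(12, 11, [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9)]),
    "peptide_b": binary_map(12, 11, [(1, 1)]),
}
print(cluster_score(example))
```

In this made-up example, peptide_a is found in two well-separated groups of fractions and peptide_b in one, so the protein would receive a score between 1 and 2. Raising k makes the selection stricter, because a node then needs several identifications within its 2 × 2 neighborhood to be kept.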
Visualization of the IPAS Proteomics Data. The data generated in comparative proteomics experiments that use extensive protein fractionation contain information related to isoforms that could be mined but is generally not analyzed systematically. Such information is intrinsic to the locations (fractions) in which proteins were identified. Thus, chromatographic properties contain information that can be used to make inferences about subspecies/isoforms of proteins that elute differently but may be products of the same gene. In this study, we analyzed data from 132 serum fractions that resulted from 2-D fractionation of intact (undigested) proteins. Figure 1A shows a representation of the 2-D fractionation as a grid with the nodes denoting the fractions. The particular peptides identified in a protein can be used to infer cleavages, as in the case of surface proteins that shed their extracellular domains. We have devised a way of capturing this information on the fractionation grid, whereby a set of concentric circles represents the sequentially organized peptides. The circles are scaled so that the size of the circle indicates relative distance from the N-terminus of the protein, with the smallest circle representing the peptide closest to the N-terminus and the largest circle denoting the peptide closest to the C-terminus. Such visualization aids in immediately discerning a fragment: if a set of peptides appears doughnut-shaped in one or more fractions (such as the fraction with coordinates [x = 2, y = 5] in Figure 1A), that set of peptides is derived from the C-terminal portion of the protein. If the peptides in a given fraction are represented by a set of small circles (relative to all the peptides identified across the fractions, as shown in the figure inset), such as in the fraction with coordinates [x = 7, y = 3], then the fragment is derived from the N-terminal portion of the protein.
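As a rough illustration of how the 0-1 sequence scale and the circle ordering relate to fragment assignment, the following Python sketch scales a residue number by protein length, ranks peptides by their scaled start position, and labels the peptides seen in one fraction relative to a cleavage site. The cleavage site used (R407-I408 in the 655-residue HGFA) is the one reported later in the Results; the peptide names, their scaled ranges, and the function names are hypothetical and serve only to make the example run.

```python
def scaled_position(residue, protein_length):
    """Map a 1-based residue number to the 0-1 scale (N-terminus near 0, C-terminus at 1)."""
    return residue / protein_length


def circle_ranks(peptide_ranges):
    """Rank peptides by scaled start; rank 1 (smallest circle) is nearest the N-terminus."""
    ordered = sorted(peptide_ranges, key=lambda name: peptide_ranges[name][0])
    return {name: rank for rank, name in enumerate(ordered, start=1)}


def fragment_side(fraction_peptides, peptide_ranges, site):
    """Label the peptides identified in one fraction relative to a cleavage site."""
    if not fraction_peptides:
        return "no peptides"
    if all(peptide_ranges[p][1] <= site for p in fraction_peptides):
        return "N-terminal"
    if all(peptide_ranges[p][0] >= site for p in fraction_peptides):
        return "C-terminal"
    return "spans site"


# Cleavage site taken from the text: R407-I408 in the 655-residue HGFA.
site = scaled_position(407, 655)

# Hypothetical peptide ranges (scaled start, scaled end), for illustration only.
ranges = {"pep1": (0.10, 0.12), "pep2": (0.30, 0.33),
          "pep3": (0.70, 0.73), "pep4": (0.90, 0.93)}

print(round(site, 2))
print(circle_ranks(ranges))
print(fragment_side(["pep3", "pep4"], ranges, site))
```

A fraction that contains only the highest-ranked peptides corresponds to the doughnut-shaped pattern described above, and a fraction that contains only the lowest-ranked peptides corresponds to the set of small circles.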
Thus, visualization allows an immediate grasp of four characteristics for each protein: the two chromatographic properties, the distribution of peptides along the sequence, and, in comparative quantitative studies, the differential ratio. Furthermore, the same visualization approach can be used to represent the number of MS events for a given protein in a given fraction (Figure 1C). Additional information is provided in an accompanying histogram of all ratios for a given protein in the experiment (Figure 1B).
Automated Detection of Chromatographic Clusters. We developed a simple pattern recognition algorithm (see Methods) to identify and flag proteins that show distinct chromatographic clusters, such as shown in Figure 2A. The cluster identification occurs at the peptide level, and the number of clusters is then averaged across all the peptides for a single protein to derive a protein score. Figure 2A shows that there is no correlation between the number of unique peptides and the average number of identified clusters. The increase in the number of clusters for proteins identified with a single peptide in multiple fractions is most likely due to incorrect identifications. The single-peptide hits were not included in subsequent analysis. The analysis shows that, of 1224 proteins covered by more than one unique peptide, 295 showed chromatographic heterogeneity at the peptide level. Such heterogeneity could be due to multiple factors, including MS sampling, precursor/mature protein forms, multichain proteins connected by S-S bridges, splice isoforms, post-translational modifications, and proteolytic fragments. The algorithm flags all these instances as long as they are manifested in a discontinuous elution profile for a given protein. The histogram in Figure 2B shows that the majority of heterogeneous proteins show fewer than two clusters per protein (averaged number of identified peptide clusters). This is reasonable given the limited resolution of the system (12 × 11 fraction grid).
Figure 3. Hepatocyte growth factor activator protein. The sequence of the precursor is shown with the signal peptide in black letters, the prepropeptide removed in the mature protein in red letters, the short chain in blue letters, and the long chain in green letters. The underlined peptides are those identified by mass spectrometry in the 132 fractions.
Identification of Proteins and Their Cleavage Products. Most proteins are synthesized in vivo as inactive precursors that are cleaved upon a physiological event, either locally or with their extracellular release. We analyzed human plasma for the presence of such precursor/mature protein pairs, using our pattern detection algorithm to flag potential isoforms. Of the 295 proteins flagged as heterogeneous, 176 (60%) were consistent with precursors. Figure 1A shows an example of the detection of the full-length precursor and the mature form of hepatocyte growth factor activator (HGFA), identified in the IPAS experiment with 14 peptides. As can be seen in Figure 1A, the protein species elute as separate clusters that correspond to the mature protein as well as the corresponding precursor part removed upon cleavage (see Figure 3 for explanation).
The detection algorithm flags this protein as heterogeneous, and fragments may be discerned upon inspection of the plot. The precursor of HGFA does not convert single-chain HGF to its biologically active form.12 However, cleavage of pre-HGFA at R407-I408 and R372-V373 converts it to its active two-chain form. Figure 1A shows that we detect several forms. R407-I408 corresponds to position 0.62 on the 0-1 scale from the N- to the C-terminus of the 655-amino-acid HGFA, and R372-V373 to 0.57. Indeed, we identified two sets of fractions that correspond to the precursor part that is removed in the mature form (sequence 36:372 or 0.06:0.57) as well as the two chains of the mature protein itself (0.57-0.6 and 0.72-0.98, peptides 9-14). Interestingly, the short chain of the mature protein yielded only a single identified peptide (peptide 8), which elutes separately from the long chain of HGFA.
Analysis of Protein Isoforms. Alternative splicing can produce isoforms that are distinguishable in IPAS experiments. Here, we show one such example, fibulin-1 (FBLN1). Fibulin-1 is an extracellular matrix protein that is known to have four different isoforms (for a recent review, see Gallagher et al.13). In the IPAS experiment described here, we identified peptides that map to FBLN1 and distinguish at least two groups of isoforms: isoform C and isoforms B and D. The latter are indistinguishable by the identified peptides and are referred to here as isoform B/D. Figure 4 shows the fractionation pattern of FBLN1. The differences between the isoforms lie in the C-terminal portion of FBLN1. Figure 4A shows the fractions in which isoform B/D was identified by unique peptides (peptides 14 and 15), whereas isoform C was identified by its corresponding C-terminal peptides (peptides 11 and 12 in Figure 4B). Isoform B/D elutes in the earlier ion-exchange and reverse-phase HPLC fractions. There is also some, albeit incomplete, separation of isoforms by reverse-phase HPLC for the late-eluting ion-exchange fractions. Analysis of the peptide composition shows no evidence that the earlier-eluting fractions result from fragmentation of the later-eluting full-length protein. Such differences might be due to variation in the glycosylation pattern of FBLN1.
The contribution of each isoform to the overall FBLN1 ratio could not be assessed in this study because the Cys-containing peptides originate from the region of the FBLN1 sequence common to all known isoforms. However, the presence of several isoforms that are partially resolved chromatographically is demonstrated.
Figure 4. Fibulin-1 isoforms. (A) Total MS events map for the 2-D separation. The grid represents the 2-D chromatography fractionation (12 × 12 fractions). The X-axis represents 12 fractions of ion-exchange chromatography and the Y-axis, 12 fractions of RP-HPLC. Each node of the grid shows the number of MS events corresponding to FBLN1, and the size of the circle reflects that number. (B) The peptide and ratio map of the 2-D chromatography fractionation. Each node of the grid shows the fraction location as in (A). Information is provided regarding the fractions in which FBLN1 was found and the related peptides that were identified (the full list is displayed in the figure inset). Peptides are shown as concentric circles of different colors, whereby the size of the circle indicates the relative distance of the peptide from the N-terminus of the full protein sequence.
The utility of the visualization approach is illustrated in Figure 5A, where two subspecies of coagulation factor F11 are shown. The detection algorithm flags F11 (IPI00008556) as a chromatographically heterogeneous protein with two distinct species (Figure 5A). The Swiss-Prot annotation (P03951) indicates that two splice isoforms have been identified for this protein. However, the visualization plot suggests that the two chromatographic species are unlikely to be splice isoforms, because the sequence that is missing in isoform 2, and that distinguishes it from isoform 1, is present in both clusters (the sequence maps to the range 0.17-0.30). An alternative explanation is a difference in glycosylation pattern (F11 is heavily glycosylated).
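The reasoning used for F11 can be written as a simple check: if every chromatographic cluster contains at least one peptide from the region that a splice isoform lacks, then splicing alone does not explain the separate clusters. The Python sketch below assumes peptides are given as scaled (start, end) ranges on the 0-1 sequence scale. The region 0.17-0.30 is taken from the text; the cluster contents and function names are hypothetical placeholders, not the measured F11 peptides.

```python
def overlaps(peptide_range, region):
    """True if a peptide's scaled (start, end) range overlaps the given region."""
    start, end = peptide_range
    return start < region[1] and end > region[0]


def region_seen_in_every_cluster(clusters, region):
    """True if each cluster contains at least one peptide overlapping the region."""
    return all(any(overlaps(r, region) for r in ranges)
               for ranges in clusters.values())


# Region reported in the text as missing from F11 isoform 2.
spliced_out = (0.17, 0.30)

# Hypothetical peptide ranges per chromatographic cluster, for illustration only.
clusters = {
    "cluster_1": [(0.05, 0.07), (0.20, 0.22), (0.60, 0.63)],
    "cluster_2": [(0.24, 0.27), (0.80, 0.83)],
}

# True here means the splice-isoform explanation is not supported by the peptides.
print(region_seen_in_every_cluster(clusters, spliced_out))
```

A result of False would leave the splice-isoform hypothesis open, although the absence of a peptide from a cluster could also reflect MS sampling rather than a truly missing sequence.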
Figure 5. Protein heterogeneity for LCAT and F11. The peptide map of the 2-D chromatography fractionation. Each node of the grid shows the fraction location as in Figure 4A. (A) F11; peptides 6-9 are present in both chromatographically distinct clusters. Region 0.1-0.30 of the sequence of the F11 protein is missing in alternatively spliced isoform 2 (see text for details). (B) LCAT protein; N-glycosylation of LCAT has been shown by mass spectrometry.14
Discussion
Fractionation based on chromatographic properties yields a fingerprint of a protein that is determined by structural variations in the protein. High-resolution HPLC systems, such as modern reverse-phase and ion-exchange HPLC, yield 2-D fractionation patterns that allow inferences to be made regarding single-protein heterogeneity. We have used this chromatographic pattern information, along with sequence mapping of identified peptides, to gain insight into potential fragmentation patterns, splice isoforms, or other sources of protein heterogeneity that might be found in a sample. To reduce data complexity and allow an easier grasp of multidimensional proteomic data, we developed a visualization method that combines three sources of information (four dimensions of data) in one two-dimensional plot. Along with the visualization tool, we also developed a simple pattern recognition algorithm to automatically detect and flag potentially heterogeneous species of proteins in experiments such as IPAS, which involve extensive fractionation and identify more than a thousand serum or plasma proteins in one experiment.4
Given that proteins are identified by matching their peptide mass spectra to sequence databases, the isoform identification process depends on accurate peptide identifications. The goal of the automated detection algorithm we have developed is to reduce data complexity by eliminating proteins that do not show heterogeneity, leaving it to the researcher, aided by the visualization tool, to make final decisions about the flagged proteins. It is desirable to estimate a false-discovery rate (FDR) for the list of proteins deemed heterogeneous by the algorithm. To that end, a benchmark set of known heterogeneous proteins that are resolved by chromatography would be useful for developing an algorithm for FDR estimation. In this publication, we provide two examples in which the observed heterogeneity of a protein (HGFA and FBLN1) could indicate that a true precursor/mature protein pair (in the case of HGFA) or different splice isoforms (in the case of FBLN1) are present in the samples. However, the definitive assessment requires
biochemical evidence to validate the finding of distinct species for the same protein. Nevertheless, as shown in this paper with the example of coagulation factor F11, our visualization software tool enables the researcher to rule out a hypothesis, such as the presence of alternatively spliced isoforms. Our approach allows us to start compiling a list of proteins that could serve as a benchmark set for performance evaluation of future isoform detection algorithms. Figure 5 shows two such proteins, F11 and LCAT. Compiling a comprehensive data set for benchmarking isoform detection algorithms is beyond the scope of this paper and will be addressed in future publications. Such a protein set should satisfy at least the following criteria: the species of a protein should (a) be well-defined and characterized biochemically; (b) be detectable in normal plasma in quantities that allow good peptide coverage in MS; and (c) have large enough differences to be separable by common methods of protein fractionation.
In conclusion, we have developed a visualization tool to aid in making inferences about the heterogeneity of proteins identified in proteomics experiments that use extensive fractionation. We also provide a simple algorithm to detect and flag potential splice isoforms, mature/precursor protein combinations, and other types of protein structural variation.
References
(1) Palagi, P. M.; Hernandez, P.; Walther, D.; Appel, R. D. Proteome informatics I: bioinformatics tools for processing experimental data. Proteomics 2006, 6 (20), 5435–5444.
(2) Spellman, P. T.; Sherlock, G.; Zhang, M. Q.; Iyer, V. R.; Anders, K.; Eisen, M. B.; Brown, P. O.; Botstein, D.; Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998, 9, 3273–3297.
(3) Khatib, A.-M.; Geraldine, S. Growth Factors: To Cleave or not To Cleave. In Regulation of Carcinogenesis, Angiogenesis and Metastasis by the Proprotein Convertases (PCs), A New Potential in Cancer Therapy; Khatib, A.-M., Ed.; Springer: The Netherlands, 2006; pp 121–135.
(4) Faca, V.; Pitteri, S.; Newcomb, L.; Glukhova, V.; Phanstiel, D.; Krasnoselsky, A.; Zhang, Q.; Struthers, J.; Wang, H.; Eng, J.; Fitzgibbon, M.; M, M.; Hanash, S. Contribution of protein fractionation to depth of analysis of the serum and plasma proteomes. J. Proteome Res. 2007, 6 (9), 3558–3565.
(5) Faca, V.; Coram, M.; Phanstiel, D.; Glukhova, V.; Zhang, Q.; Fitzgibbon, M.; McIntosh, M.; Hanash, S. Quantitative analysis of acrylamide labeled serum proteins by LC-MS/MS. J. Proteome Res. 2006, 5 (8), 2009–2018.
(6) Rauch, A.; Bellew, M.; Eng, J.; Fitzgibbon, M.; Holzman, T.; Hussey, P.; Igra, M.; Maclean, B.; Lin, C. W.; Detter, A.; Fang, R.; Faca, V.; Gafken, P.; Zhang, H.; Whitaker, J.; States, D.; Hanash, S.; Paulovich, A.; McIntosh, M. W. Computational Proteomics Analysis System (CPAS): an extensible, open-source analytic system for evaluating and publishing proteomic data and high throughput biological experiments. J. Proteome Res. 2006, 5 (1), 112–121.
(7) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467.
(8) Maclean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22 (July 28), 2830–2832.
(9) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
(10) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75 (17), 4646–4658.
(11) Kersey, P.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 7, 1985–1988.
(12) Miyazawa, K.; Shimomura, T.; Naka, D.; Kitamura, N. Proteolytic activation of hepatocyte growth factor in response to tissue injury. J. Biol. Chem. 1994, 269 (12), 8966–8970.
(13) Gallagher, W. M.; Currid, C. A.; Whelan, L. C. Fibulins and cancer: friend or foe. Trends Mol. Med. 2005, 11 (7), 336–340.
(14) Liu, T.; Qian, W. J.; Gritsenko, M. A.; Camp, D. G., 2nd.; Monroe, M. E.; Moore, R. J.; Smith, R. D. Human plasma N-glycoproteome analysis by immunoaffinity subtraction, hydrazide chemistry, and mass spectrometry. J. Proteome Res. 2005, 4 (6), 2070–2080.