Protein Identification and Peptide Expression Resolver: Harmonizing Protein Identification with Protein Expression Data Paul Kearney,* Heather Butler, Kevin Eng, and Patrice Hugo Caprion Proteomics, Montreal, QC, Canada, H4S 2C8 Received August 17, 2007
Proteomic discovery platforms generate both peptide expression information and protein identification information. Peptide expression data are used to determine which peptides are differentially expressed between study cohorts, and then these peptides are targeted for protein identification. In this paper, we demonstrate that peptide expression information is also a powerful tool for enhancing confidence in protein identification results. Specifically, we evaluate the following hypothesis: tryptic peptides originating from the same protein have similar expression profiles across samples in the discovery study. Evidence supporting this hypothesis is provided. This hypothesis is integrated into a protein identification tool, PIPER (Protein Identification and Peptide Expression Resolver), that reduces erroneous protein identifications below 5%. PIPER’s utility is illustrated by application to a 72-sample biomarker discovery study where it is demonstrated that false positive protein identifications can be reduced below 5%. Consequently, it is recommended that PIPER methodology be incorporated into proteomic studies where both protein expression and identification data are collected. Keywords: biomarkers • expression • fingerprinting • proteomics • protein identification
* Corresponding author. E-mail: [email protected].
Journal of Proteome Research 2008, 7, 234–244 (research articles). Published on Web 12/07/2007. 10.1021/pr0705439 CCC: $40.75 © 2008 American Chemical Society
Introduction
Proteomic workflows for biomarker discovery typically generate both peptide expression and protein identification information during the course of sample analysis.1–6 Peptide expression information, derived from LC-MS analysis of study samples, is used to select anonymous peptides that are significantly differentially expressed between study cohorts (e.g., diseased vs healthy). These peptides are then targeted for protein identification. See Figure 1 for a schematic of a typical proteomic workflow for biomarker discovery.
Figure 1. Typical proteomic discovery workflow. Samples are acquired, processed to enrich for proteins or peptides of interest, and then profiled by LC-MS. The raw expression data are analyzed by software for peptide detection, normalization, and tracking across samples in the study. Differentially expressed peptides are then selected and targeted for identification by tandem mass spectrometry (LC-MS/MS) or fingerprinting. This results in putative peptide-to-protein assignments. The peptide-to-protein assignments along with the peptide expression profiles are submitted to PIPER for processing. The output is a list of differentially expressed proteins along with an estimate of the false positive error rate.
To date, the expression and identification phases of the proteomic discovery workflow have been largely disconnected: expression data are used to select peptides, and identification data are used to sequence peptides, independently. However, as illustrated in Figure 2, an essential link between peptide expression and protein identification information exists. Five peptides (a-e) are depicted along with their expression profiles across ten samples. An expression profile depicts the intensity of a peptide in each sample as measured by LC-MS. During protein identification, observed peptides are assigned to theoretical peptides in a protein database. This putative peptide-to-protein assignment, as discussed below, often has many false positive and false negative assignments. The reader is encouraged to examine Figures 12 and 13 below, which depict real peptide-to-protein assignments. The PIPER hypothesis is that peptides derived from the same protein have similar expression profiles. Consequently, peptides assigned to the same protein that have dissimilar expression profiles may be erroneous peptide-to-protein assignments. Returning to Figure 2, peptides a-d have similar expression profiles, whereas peptide e does not. Consequently, either peptide e is an erroneous assignment or peptides a-d are erroneous assignments. Note that the hypothesis is based on intact proteins within the sample. Proteolysis, splice variation, and other factors could result in peptides with dissimilar expression profiles being assigned to the same database protein. In fact, dissimilar expression profiles for peptides assigned to the same database protein may be indicative of splice variation across the sample population.
Figure 2. Essential link between peptide expression and protein identification.
This paper makes three contributions. First, evidence supporting the PIPER hypothesis is presented. That is, peptides derived from the same protein have significantly correlated expression profiles. Second, a software tool called PIPER that makes use of the above hypothesis is presented. PIPER can be used in proteomics workflows using tandem mass spectrometry or fingerprinting protein identification methodologies to increase confidence in the results. PIPER includes a rigorous false positive rate estimation based on false discovery rate (FDR) techniques.7–9 Third, PIPER is applied to a 72-sample biomarker discovery study, where it is shown that the false positive rate of protein identification can be reduced below 5%.
Motivation. If targeted protein identification could be performed without error, then there would be no need for a tool such as PIPER. Unfortunately, the literature indicates that confident protein identification is still a destination to be reached.10 There are two prevalent targeted protein identification methodologies. Tandem mass spectrometry utilizes LC-MS/MS technology to obtain fragmentation spectra for the target peptides. These spectra are then submitted to a database search or to de novo sequencing engines to derive a list of identified proteins.11–13 Fingerprinting uses peptide information such as mass, LC retention time, and other peptide information derived during the LC-MS analysis of the samples to identify proteins directly.14–20 Since both methodologies generate a peptide-to-protein assignment, both can be integrated with PIPER, as depicted in Figure 1. There are several sources of error in the targeted protein identification workflow, including the following.
Coverage. The emerging standard in the community is a minimum of two peptide assignments per protein for tandem mass spectrometry21 and even higher for fingerprinting.
Specificity. Peptides frequently coelute, resulting in convoluted MS/MS spectra that give ambiguous sequencing results. In samples with a large dynamic range such as plasma, high-abundance proteins disproportionately populate the peptide space.
Comprehensiveness. There are multiple comprehensiveness errors. Protein databases are not complete in terms of genes, splice variants, SNPs, and other sources of sequence diversity. Software models for sequencing spectra are incomplete in terms of post-translational modifications, dealing with convoluted spectra, nonspecific cleavages, and scoring spectra-to-sequence alignments. Finally, LC-MS/MS analyses often cannot acquire all targeted peptides due to resource limitations.
Misalignment. In targeted protein identification, the peptide target is first observed in LC-MS analysis of the sample and then reacquired during subsequent targeted LC-MS/MS analyses of the sample. Unfortunately, the target is specified by the mass and LC retention time, which can vary between the LC-MS and LC-MS/MS analyses. As a result, a peptide can be correctly sequenced but obtained from the wrong target. Fingerprinting methods derive their information from LC-MS analysis directly and, so, do not suffer from misalignment errors.
Clearly, there is the need for protein identification tools that improve the quality of protein identification. In particular, in biomarker or drug target discovery studies, the future costs of an error in protein identification can be significant. PIPER reduces the number of false protein identifications significantly. To substantiate this claim, we focus on an important class of errors called off-target identifications prevalent in plasma analysis.
This is a misalignment error where a peptide is targeted for protein identification, but LC-MS/MS sequences the wrong peptide. PIPER detects these errors. Figure 3 illustrates the example. Importantly, PIPER uses peptide intensity information (i.e., expression profiles), whereas protein identification uses peptide coordinate information (i.e., mass and LC retention time) along with fragmentation data. Peptide intensity and peptide coordinates are independent information, so PIPER provides an independent/orthogonal validation method of the results of protein identification. Again, the off-target protein identification illustrated in Figure 3 is an excellent example of this. Concrete examples are provided in the Results section (see Figure 14). Previous Work. The most relevant related work is the construction of correlation networks of coregulated endogenous peptides.22 These are peptides correlated because they are coregulated, not because they originate from the same protein. The innovations presented in this paper address the issue of how expression profiles can be used to improve protein identification as opposed to the detection of coregulated peptides or proteins.
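To make the use of expression profiles for identification concrete, the following is a minimal sketch of the Figure 2 scenario: five peptides assigned to one protein, four with similar profiles and one without. The intensity values and the helper `mean_correlation_to_others` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical LC-MS intensities across ten samples for peptides a-e,
# all putatively assigned to the same protein.
profiles = {
    "a": [10, 12, 30, 28, 11, 9, 31, 29, 10, 30],
    "b": [21, 25, 61, 55, 20, 19, 60, 58, 22, 59],
    "c": [5, 6, 14, 15, 5, 4, 16, 14, 6, 15],
    "d": [8, 9, 25, 22, 9, 8, 24, 23, 8, 24],
    "e": [30, 11, 12, 29, 31, 10, 9, 30, 12, 11],
}


def mean_correlation_to_others(name, profiles):
    """Average correlation of one peptide against every other assigned peptide."""
    others = [p for key, p in profiles.items() if key != name]
    return sum(pearson(profiles[name], p) for p in others) / len(others)


scores = {name: mean_correlation_to_others(name, profiles) for name in profiles}
# The peptide least correlated with the rest is the candidate erroneous assignment.
outlier = min(scores, key=scores.get)
print(outlier, round(scores[outlier], 2))
```

In this toy data, peptides a-d rise and fall together while e does not, so e is flagged. As noted in the Introduction, such a peptide may also reflect proteolysis or splice variation rather than a sequencing error.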
Experimental Section
Methods. The proteomic workflow is depicted in Figure 1. In green are the generic steps which will be detailed briefly below; in yellow are the PIPER specific steps; and in orange are the expression and protein databases. Described below is the processing of a 72-sample biomarker discovery study that incorporates PIPER analysis. Note that fingerprinting was not applied in this particular analysis.
Figure 3. On the left, peptides x and y are the target peptides which are located near high-abundance peptides A and B from the same ubiquitous protein. During targeted LC-MS/MS acquisition, spectra for A and B are acquired mistakenly due to either low specificity of the mass spectrometer or retention time drift between the LC-MS and LC-MS/MS analyses. On the right, A and B are assigned to their parent protein with high confidence. By comparing the dissimilar expression profiles of the intended targets x and y, the off-target protein identification can be detected. This scenario is very common in plasma analysis since high-abundance proteins produce a disproportionate number of high-intensity peptides throughout the LC-MS mass and retention time space.
Figure 4. Study peptide. Once LC-MS expression profiling is completed, peptides are detected and tracked across the 72 samples. A peptide tracked in this way is called a study peptide; it is assigned a unique ID and is associated with its SCX fraction, median m/z, median retention time, consensus charge, and, most importantly, the 72 intensity values across the 72 samples.
Figure 5. Illustration of hierarchical clustering of peptide expression correlation values for 10 peptides assigned to a protein. Bright green indicates high positive correlation, and bright red indicates high negative correlation. Peptides D-J are all well-correlated to each other, whereas peptides A-C are likely false positive assignments.
Clinical Human Plasma. The study consisted of 24 prostate cancer plasma samples, 24 lung cancer plasma samples, and 24 age- and gender-matched healthy samples. Twenty-two prostate samples were acquired from the Centre Hospitalier Universitaire de Montreal, and two were acquired from the McGill University Health Centre. Twenty-three lung samples
were acquired from the Centre Hospitalier Universitaire de Montreal, and one was acquired from the McGill University Health Centre. All 24 healthy samples were acquired from SeraCare Life Sciences. Study Design. Several general rules were employed to construct an unbiased study design. First, prostate, lung, and healthy samples were interleaved during sample processing and mass spectrometric analysis to ensure that biases due to order of processing are minimized. Second, samples were processed and analyzed on the same instruments with the same lot of reagents. Sample Processing. Stringent sample processing, analysis procedures, and quality control checks were controlled by a series of standard operating procedures (SOPs). Upon reception, frozen plasma samples were bar-coded, entered into the Laboratory Information Management System (Nautilus LIMS, Thermo Electron, Woburn, MA), and stored at -80 °C. To begin, samples were thawed, passed through 0.22 µm filters, and then transferred to 24-well plates. Plasma samples were depleted of high-abundance proteins using the Multiple Affinity Removal System (MARS, Agilent Technologies, Palo Alto, CA) on an Agilent 1100 HPLC fitted with a refrigerated (4 °C) autosampler and fraction collector (Bjorhall). The depletion method was a modified version of the Agilent MARS protocol. Plasma samples were loaded onto the column in 150 mM ammonium bicarbonate (pH 7.8) for 10 min, and the unbound proteins were collected. The column was then washed for 3 min in Agilent buffer A. Bound proteins were eluted over 8 min in Agilent buffer B. The column was then re-equilibrated in 150 mM ammonium bicarbonate (pH 7.8). 
Depleted plasma samples were proteolyzed under denaturing conditions (8 M urea/400 mM ammonium bicarbonate, pH = 8.0) with endo-LysC (Princeton Separations, Adelphia, NJ) (1:50, enzyme/total protein) for 2 h and then diluted (4:1) and proteolyzed with trypsin (Promega, Madison, WI) (1:50, enzyme/total protein) for an additional 16 h. Following proteolysis, the peptides were desalted on a 10 × 10 mm C18 HPLC guard column (Phenomenex, Torrance, CA). Buffer A was water/0.1% TFA, and buffer B was acetonitrile/0.1% TFA. After a 2 min wash in 2% B, the samples were eluted by a 1 min ramp up to 90% B. The column was then re-equilibrated in 2% B.
Figure 6. Distribution of peptide intensity CVs for the prostate cohort.
Figure 7. On the left, the MDS plot of the 72 samples colored by cohort. On the right, an MDS plot of the 48 prostate and lung samples illustrating segregation with minimal overlap.
Following desalting, the samples were fractionated by SCX chromatography using a 4.6 × 150 mm BioBasic column (Thermo Electron, Bellefonte, PA). The Agilent 1100 HPLC was operated at a flow rate of 800 µL/min. The mobile phase A was 5 mM ammonium formate/15% acetonitrile, and mobile phase B was 1 M ammonium formate/15% acetonitrile. The gradient was developed by moving from 2.5% to 75% B over the course of 20 min. Prior to injecting plasma samples, the system was verified by separating a mixture of peptides. Eight fractions were collected from the separated peptides. The fractionated samples were then freeze-dried in bar-coded 24-well plates and stored at -80 °C. The distribution of fractions into 96-well plates for mass spectrometry analysis was accomplished on a Multiprobe II HT Plus (Packard, Meriden, CT) four-channel liquid handler. Sample plates were then lyophilized and stored at -80 °C. Expression Profiling. The LC-MS system consisted of a CapLC (Waters, Milford, MA) with a cooled autosampler and a QTOF Ultima (Waters, Milford, MA) controlled by MassLynx version 4.0 software. Samples were reconstituted in 15 µL of
water/10% acetonitrile/0.1% formic acid solution and injected onto a reversed-phase (Jupiter C18, Phenomenex, Torrance, CA) column. For the reversed-phase HPLC separation, buffer A was water/0.2% formic acid, and buffer B was acetonitrile/0.2% formic acid. The gradient started at 10% B and was ramped up to 60% B in 55 min. After holding at 60% B for 2 min, B was decreased to 10% for column re-equilibration before the next injection. For LC-MS survey scans, the mass spectra were acquired over 400–1600 Da at a rate of 1 spectrum/s. Instrument performance was verified by injecting 5 µL of a peptide standards mixture. Performance characteristics were automatically generated by the platform. The sensitivity was recorded in terms of the number of multiply charged ions. The retention time and mass accuracy of two peptides in the standard samples were also recorded. Sample lists were generated by the LIMS and imported into MassLynx. Samples were injected sequentially by fraction. As data were acquired from the mass spectrometer, they were automatically retrieved from the instrument computer to a central database where they were registered. Registration includes study name, sample number, fraction, and condition (e.g., healthy, prostate, lung). The raw data were then converted into a three-dimensional isotope map format containing m/z, retention time, and intensity information.
Figure 8. Comparison of the null and observed expression correlation distributions.
Figure 9. False positive rate of peptide-to-protein assignments derived from Figure 8 distributions.
Peptide Processing. The first step in the LC-MS data analysis is peak detection. Savitzky-Golay smoothing in both the m/z and retention time dimensions is performed followed by peak fitting to a four-dimensional (m/z, retention time, charge, and intensity) peptide isotope model. This model utilizes the difference in mass between peptide isotope peaks, retention time coincidence of peptide isotopes, and the expected intensity profile of a peptide’s isotopes as a function of peptide mass. This results in a peptide map consisting of a listing of the m/z, charge, retention time, and intensity of all monoisotopic peptide peaks. The peptide maps undergo mass correction, retention time normalization, and normalization of intensity to account for platform variability in these three dimensions. Following normalization, peptides are matched across all samples in a study. Peptides are clustered according to SCX
fraction, mass, retention time, and charge using hierarchical clustering techniques adapted to the proteomics context. The clustering, or tracking, of the same peptide across the 72 samples in the study makes it possible to determine which peptides are differentially expressed among the prostate, lung, and healthy cohorts. Once peptide clusters have been formed, a representative median mass and median retention time are calculated to represent the peptide cluster. We refer to these peptide clusters as study peptides. Figure 4 depicts the acquisition of LC-MS data and the derivation of study peptides. Each study peptide is associated with the following information: unique identifier, SCX fraction, m/z (median over 72 samples), retention time (median over 72 samples), charge (consensus over 72 samples), and an intensity profile across 72 samples. This information is stored in the Expression Database.
Peptide Selection. Peptides are selected by an FDR-adjusted p-value.23 Specifically, a paired t test is applied to determine peptides differentially expressed among prostate, lung, and healthy samples. A set of 1000 permutation tests on the expression profiles is performed to derive an FDR-adjusted p-value. All peptides with an FDR-adjusted p-value below 5% are kept for protein identification.
Figure 10. Distribution of expression correlation scores for peptides detected in different charge states or in different fractions.
Table 1. Overlap between the Mascot Filter and the Correlation Filter (protein overlap)
Mascot filter ; correlation filter + ; correlation filter -
+ ; 37 ; 18
- ; 2 ; 2
Targeted Tandem MS. The LC-MS/MS injections were performed with the same parameters as the LC-MS injections described above with the following exceptions. For MS/MS scans, the mass range was 50–2000 Da, and each spectrum was acquired in 2 s. For LC-MS/MS, the duty cycle was one survey scan followed by one product ion scan (MS/MS). Inclusion MS/MS spectra were acquired for the target peptides selected (described above) and placed in inclusion lists. Tolerances for inclusion MS/MS acquisition of target peptides were ±1 min retention time and ±0.2 Da. The collision energy varied depending on the m/z as well as on the precursor ion charge state. Instrument performance was verified by injecting 5 µL of the peptide standards mixture. Performance characteristics were automatically generated by the platform. The sensitivity was recorded in terms of the number of multiply charged ions. The retention time and mass accuracy of two peptides in the standard samples were also recorded.
Peptide-to-Protein Assignment. Database searching of LC-MS/MS spectra for peptide identification was accomplished using Mascot 1.8 (MatrixScience, Boston, MA) and the Human International Protein Index v3.31 (IPI, European Bioinformatics Institute).24 Mascot parameters specify trypsin proteolysis with one allowed missed cleavage and with variable modification of methionine (oxidation) and glutamine (deamidation). Mass tolerances were 0.25 Da for both precursor and fragment ions.
PIPER. The peptide-to-protein assignments achieved by the Mascot search were submitted to PIPER for expression correlation filtering. All proteins with fewer than two assigned peptides were filtered out. For each of the remaining proteins,
all assigned peptides were submitted to hierarchical clustering using average linkage where the distance metric between pairs of peptide expression profiles was the Pearson correlation score. The Pearson correlation score ranges from -1 to 1, where scores near -1 indicate negative correlation, scores near 1 indicate positive correlation, and scores near 0 indicate no correlation. For each protein, the largest cluster where all peptides have pairwise expression correlation scores above a dynamic threshold is determined. The dynamic threshold is obtained from a distribution of the pairwise Pearson correlation scores between all pairs of 3000 randomly selected study peptide expression profiles. The absolute Pearson score of the 95th percentile of this distribution is selected as the threshold to ensure that false positive correlations are no more than 5%. This technique enables the threshold to be determined on a study-by-study basis. In the case of a tie (i.e., two clusters with the same size), the cluster with the minimum maximum correlation score is kept; however, other tie-breaking criteria could be used. A protein is designated as a confident identification if it has at least two assigned peptides with a Pearson correlation score above the dynamic threshold. A visualization of the PIPER analysis appears in Figure 5. False Positive Rate Estimation. The PIPER process has two false positive rates (FPRs). First is the FPR of the peptide-toprotein assignment. Specifically, given two peptides assigned to a protein with a specific expression correlation score, what is the probability that at least one of these two peptides did not originate from the same protein? This FPR is determined as described above. The second determines the FPR of protein identification. The rate of false positive protein identifications is achieved by comparing the number of proteins identified by chance alone to the actual number of proteins identified. 
To determine the number of proteins identified by chance alone, random database search techniques25 and multiple testing adjustment techniques have been applied to PIPER.23 The Mascot database search is performed on a randomized version of the IPI database. Finally, expression correlation is performed using randomized expression profiles. These randomizations ensure that the peptide-to-protein assignments and expression correlations result in purely random protein identifications. Consequently, if 25 confident proteins are identified using the randomized search and correlation and 250 confident proteins are identified using the regular (nonrandomized) search and correlation, then the FPR is 25/250 = 10%.
Table 2. Proteins Identified by at Least Two Peptides Using the Mascot or Correlation Filter (a)
overlap category ; accession ; description ; best peptide ; best peptide Mascot score ; best correlation score ; correlation FPR
++ ; IPI00011261.2 ; complement component C8 γ chain precursor ; SLPVSDSVLSGFEQR ; 80.09 ; 0.41 ; 5.0%
++ ; IPI00021841.1 ; apolipoprotein A-I precursor ; LLDNWDSVTSTFSK ; 123.21 ; 0.71 ; 0.3%
++ ; IPI00021842.1 ; apolipoprotein E precursor ; AYKSELEEQLTPVAEETR ; 117.36 ; 0.75 ; 0.3%
++ ; IPI00021854.1 ; apolipoprotein A-II precursor ; EPCVESLVSQYFQTVTDYGKDLMEK ; 66.22 ; 0.63 ; 0.8%
++ ; IPI00022229.1 ; apolipoprotein B-100 precursor ; SVSDGIAALDLNAVANK ; 99.01 ; 0.88 ; 0.1%
++ ; IPI00025864.5 ; cholinesterase precursor ; AILQSGSFNAPWAVTSLYEAR ; 32.5 ; 0.67 ; 0.3%
++ ; IPI00029717.1 ; splice isoform 2 of fibrinogen α chain precursor ; GGSTSYGTGSETESPR ; 85.35 ; 0.78 ; 0.1%
++ ; IPI00029863.4 ; α-2-antiplasmin precursor ; HQMDLVATLSQLGLQELFQAPDLR ; 84.6 ; 0.82 ; 0.1%
++ ; IPI00032179.2 ; antithrombin III variant ; AFLEVNEEGSEAAASTAVVIAGR ; 143 ; 0.79 ; 0.1%
++ ; IPI00032220.3 ; angiotensinogen precursor ; ALQDQLVLVAAK ; 50.25 ; 0.72 ; 0.3%
++ ; IPI00292530.1 ; interalpha-trypsin inhibitor heavy chain H1 precursor ; GFSLDEATNLNGGLLR + m11|1 deamidation (N) ; 78.93 ; 0.77 ; 0.1%
++ ; IPI00298497.3 ; fibrinogen beta chain precursor ; QVKDNENVVNEYSSELEK ; 128.92 ; 0.85 ; 0.1%
++ ; IPI00383338.1 ; PRO2769 ; ERGHMLENHVER ; 32.61 ; 0.69 ; 0.3%
++ ; IPI00411626.2 ; hypothetical protein DKFZp779N0926 ; MLEEIMKYEASILTHDSSIR ; 117.43 ; 0.81 ; 0.1%
++ ; IPI00478003.1 ; α-2-macroglobulin precursor ; SSSNEEVMFLTVQVK ; 104.87 ; 0.78 ; 0.1%
++ ; IPI00478671.2 ; 152 kDa protein ; DKIYMYGGK ; 50.54 ; 0.57 ; 0.8%
++ ; IPI00480192.1 ; retinol binding protein 4, plasma ; LLNLDGTCADSYSFVFSR ; 100.44 ; 0.79 ; 0.1%
++ ; IPI00550991.3 ; isoform 1 of α-1-antichymotrypsin precursor ; AVLDVFEEGTEASAATAVK ; 106.07 ; 0.75 ; 0.1%
++ ; IPI00639937.1 ; B-factor, properdin ; DFHINLFQVLPWLK ; 40.75 ; 0.56 ; 0.8%
++ ; IPI00645038.1 ; interalpha (Globulin) inhibitor, H2 polypeptide ; KLWAYLTINQLLAER ; 100.61 ; 0.79 ; 0.1%
++ ; IPI00645849.1 ; extracellular matrix protein 1 ; EVGPPLPQEAVPLQK ; 41.06 ; 0.68 ; 0.3%
++ ; IPI00654875.1 ; complement C4-B precursor ; ALEILQEEDLIDEDDIPVR ; 107.8 ; 0.78 ; 0.1%
++ ; IPI00742696.2 ; vitamin D-binding protein precursor ; HLSLLTTLSNRVCSQYAAYGEK ; 75.31 ; 0.54 ; 1.9%
++ ; IPI00744362.1 ; hypothetical protein DKFZp686K08164 ; TGLDSPTGIDFSDITANSFTVHWIAPR ; 85.81 ; 0.93 ; 0.1%
++ ; IPI00745089.2 ; α-1B-glycoprotein ; LRCLAPLEGAR ; 46.64 ; 0.44 ; 5.0%
++ ; IPI00783987.2 ; complement C3 precursor (fragment) ; ILLQGTPVAQMTEDAVDAER ; 122.58 ; 0.67 ; 0.3%
++ ; IPI00784338.1 ; similar to APOA4 protein ; DKVNSFFSTFK + m4|1 deamidation (N) ; 51.67 ; 0.66 ; 0.3%
++ ; IPI00784409.1 ; 70 kDa protein ; TATSEYQTFFNPR ; 92.73 ; 0.52 ; 1.9%
++ ; IPI00789547.1 ; 19 kDa protein ; KLSFYYLIMAK ; 81.75 ; 0.58 ; 0.8%
++ ; IPI00790993.1 ; 104 kDa protein ; ANTVQEATFQMELPK ; 118.96 ; 0.72 ; 0.3%
++ ; IPI00791901.1 ; 26 kDa protein ; MGNFPWQVFTNIHGR ; 58.76 ; 0.77 ; 0.1%
++ ; IPI00794184.1 ; 97 kDa protein ; GVYSSDVFDIFPGTYQTLEMFPR + m20|1 oxidation (M) ; 47.92 ; 0.71 ; 0.3%
++ ; IPI00795633.1 ; 52 kDa protein ; LFDSDPITVTVPVEVSR ; 77.96 ; 0.51 ; 1.9%
++ ; IPI00796316.1 ; gelsolin ; AGALNSNDAFVLK + m7|1 deamidation (N) ; 95.08 ; 0.66 ; 0.3%
++ ; IPI00815663.1 ; FGA protein (Fragment) ; GLIDEVNQDFTNR ; 96.98 ; 0.75 ; 0.1%
++ ; IPI00815692.1 ; KNG1 protein ; KIYPTVNCQPLGMISLMK + m13m17|2 oxidation (M) ; 43.66 ; 0.61 ; 0.8%
++ ; IPI00816741.1 ; complement component 5 variant (Fragment) ; LNLVATPLFLKPGIPYPIK ; 68.14 ; 0.72 ; 0.3%
+- ; IPI00006662.1 ; apolipoprotein D precursor ; KMTVTDQVNCPK + m2|1 oxidation (M) ; 62.69 ; -0.28 ; 100.0%
+- ; IPI00011694.1 ; trypsin I precursor ; TLNNDIMLIK + m3|1 deamidation (N) ; 35.4 ; 0.27 ; 12.6%
+- ; IPI00017601.1 ; ceruloplasmin precursor ; KALYLQYTDETFR ; 78.65 ; 0.07 ; 49.7%
+- ; IPI00021727.1 ; C4b-binding protein α chain precursor ; EDVYVVGTVLR ; 48.85 ; 0.16 ; 27.5%
+- ; IPI00022395.1 ; complement component C9 precursor ; AIEDYINEFSVR ; 71.84 ; 0.08 ; 49.7%
+- ; IPI00022488.1 ; hemopexin precursor ; LYLVQGTQVYVFLTK ; 80.73 ; 0.29 ; 12.6%
+- ; IPI00025327.2 ; plasminogen ; KVYLSECK ; 35.41 ; 0.04 ; 72.9%
+- ; IPI00032291.1 ; complement C5 precursor ; GGSASTWLTAFALR ; 60.68 ; 0.35 ; 12.6%
+- ; IPI00218732.3 ; serum paraoxonase/arylesterase 1 ; EVQPVELPNCNLVK ; 33.55 ; 0.28 ; 12.6%
+- ; IPI00292950.4 ; heparin cofactor II precursor ; FAFNLYR ; 34.13 ; -0.07 ; 92.7%
+- ; IPI00298971.1 ; vitronectin precursor ; DVWGIEGPIDAAFTR ; 59.12 ; 0.19 ; 27.5%
+- ; IPI00329775.7 ; isoform 1 of Carboxypeptidase B2 precursor ; AYISMHSYSQHIVFPYSYTR + m5|1 oxidation (M) ; 62.37 ; 0.22 ; 27.5%
+- ; IPI00394992.1 ; splice isoform 2 of N-acetylmuramoyl-L-alanine amidase precursor ; EGKEYGVVLAPDGSTVAVEPLLAGLEAGLQGR ; 61.08 ; 0.29 ; 12.6%
+- ; IPI00556459.1 ; serine/cysteine proteinase inhibitor clade G member 1 splice variant 2 ; TNLESILSYPKDFTCVHQALK ; 40.33 ; -0.31 ; 100.0%
+- ; IPI00643525.1 ; complement component 4A ; AEMADQAAAWLTR ; 53.52 ; -0.03 ; 72.9%
+- ; IPI00793618.1 ; 13 kDa protein ; VPVAVQGEDTVQSLTQGDGVAK ; 119.87 ; 0.30 ; 12.6%
+- ; IPI00794874.1 ; protein ; SYTVAIAGYALAQMGR ; 56.5 ; 0.29 ; 12.6%
+- ; IPI00796232.1 ; prothrombin B-chain ; ENLDRDIALMK ; 42.5 ; -0.01 ; 72.9%
-+ ; IPI00796279.1 ; 25 kDa protein ; DTDTGALLFIGK ; 63.58 ; 0.60 ; 0.8%
-+ ; IPI00828131.1 ; mutS homologue 3 ; FHSPFIVENYR + m9|1 deamidation (N) ; 16.27 ; 0.49 ; 1.9%
-- ; IPI00515041.3 ; complement factor H ; IIYKENER ; 39.3 ; 0.38 ; 5.0%
-- ; IPI00647556.1 ; gelsolin ; EVQGFESATFLGYFK ; 83.08 ; 0.36 ; 5.0%
(a) Symbol ++ indicates the protein was identified by both filters. Symbol -- indicates that the protein was rejected by both filters. Symbol +- indicates that the protein was accepted by the Mascot filter but not the correlation filter. Symbol -+ indicates that the protein was accepted by the correlation filter but not the Mascot filter. The highest scoring (Mascot) peptide sequence is shown. The highest correlation score among all pairs of assigned peptides is given along with the associated false positive rate.
Figure 11. Example of a protein that passes both the Mascot and correlation filter.
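The PIPER filtering and protein-level false positive rate estimation described in the Experimental Section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the null profiles are simulated rather than drawn from study peptides, the threshold uses one literal reading of "the absolute Pearson score of the 95th percentile", and the paper's average-linkage hierarchical clustering is replaced by a brute-force subset search that is only practical for proteins with a handful of peptides. All function names and the example profiles are hypothetical.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def dynamic_threshold(null_profiles, quantile=0.95):
    """Absolute score at the given quantile of all pairwise null correlations."""
    scores = sorted(pearson(p, q) for p, q in combinations(null_profiles, 2))
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return abs(scores[index])


def largest_correlated_group(profiles, threshold):
    """Largest peptide set whose pairwise correlations all exceed the threshold.

    Brute-force stand-in for the clustering step; returns [] if no pair passes.
    """
    names = sorted(profiles)
    for size in range(len(names), 1, -1):
        for group in combinations(names, size):
            if all(pearson(profiles[a], profiles[b]) > threshold
                   for a, b in combinations(group, 2)):
                return list(group)
    return []


def protein_fpr(n_random, n_observed):
    """Proteins passing on randomized data divided by proteins passing on real data."""
    return n_random / n_observed


# Study-specific threshold from simulated "random peptide" profiles (72 samples).
rng = random.Random(0)
null_profiles = [[rng.gauss(0, 1) for _ in range(72)] for _ in range(60)]
threshold = dynamic_threshold(null_profiles)

# Hypothetical protein with four assigned peptides; pep4 does not track the others.
example_protein = {
    "pep1": [1, 2, 3, 4, 5, 6],
    "pep2": [2, 4, 6, 8, 10, 13],
    "pep3": [3, 3, 5, 6, 8, 8],
    "pep4": [5, 1, 6, 2, 4, 3],
}
group = largest_correlated_group(example_protein, 0.4)
confident = len(group) >= 2  # at least two mutually correlated peptides

print(round(threshold, 2), group, confident, protein_fpr(25, 250))
```

The fixed value 0.4 in the example call mirrors the threshold reported for this study; in practice the value returned by `dynamic_threshold` on real study peptides would be used instead.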
Results and Discussion The performance of PIPER is predicated upon quality expression profiling, which is established first. The number of study peptides appearing in at least 75% of the samples in any single cohort is 37 448. For these study peptides, the median
coefficient of intensity variation is 23.3%, 18.8%, and 22.9% for the healthy, prostate, and lung cohorts, respectively. These values include both platform and biological variability. The distribution of intensity CVs for the prostate cohort appears in Figure 6 where missing values have been replaced using KNN impute.26 The ability to segregate healthy, prostate, and lung samples was assessed by performing a multidimensional scaling (MDS) analysis.27 MDS plots locate samples in three-dimensional space based on proteomic similarity; samples that are close in 3D space are similar at the proteomic level and vice versa. The MDS analysis of all samples is displayed on the left in Figure 7. As expected, the difference between healthy and cancer plasma samples is greater than the difference between the two cancer cohorts. On the right is the MDS analysis of the prostate and lung samples alone to provide a clearer view of the cohort segregation.
Figure 12. Example of a protein that passes both the Mascot and correlation filter.
Figure 13. Example of a protein that passes the correlation filter but not the Mascot filter.
Study peptides that were differentially expressed among the three study cohorts with adjusted p-values below 0.05 were submitted for sequencing by tandem mass spectrometry and for protein identification by Mascot using the IPI database. The results below illustrate how peptide expression profiles can be used to enhance confidence in the protein identification. PIPER is based upon the following hypothesis: Peptides originating from the same protein have significantly correlated expression profiles. To establish this hypothesis, the distribution of expression correlation scores for random peptides is compared to the distribution of expression correlation scores for peptides assigned to the same protein. Essentially, the null
distribution of expression correlation scores is compared to the observed distribution of expression correlation scores. This comparison appears in Figure 8. The null (random) distribution was generated from 3000 randomly selected peptides from which all pairwise expression correlation scores were obtained. The observed distribution was generated from all pairs of peptides assigned to the same protein by Mascot. This distribution will contain false positive peptide-to-protein assignments; however, assuming the majority of protein identifications are correct, it serves as a reasonable empirical proxy. The null distribution is Gaussian, centered at 0, and not skewed. These properties are important because if, for example, sample processing introduced variation that overpowered biological signal, then expression profiles might tend to be similar and not distinguish proteins. The null distribution illustrates that this is not the case, otherwise, it would be skewed to the right.
Figure 14. Example of a protein that passes the Mascot filter but not the correlation filter. A likely example of an off-target protein identification.
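The comparison of null and observed correlation distributions described above can be sketched in a few lines. This is a minimal illustration, not the PIPER implementation: it assumes the expression correlation score is a Pearson correlation between intensity profiles, and the function names (`pairwise_correlations`, `null_scores`, `observed_scores`) and the simulated data are hypothetical.

```python
import numpy as np


def pairwise_correlations(profiles):
    """Pearson correlations for all distinct pairs of rows in `profiles`."""
    corr = np.corrcoef(profiles)
    upper = np.triu_indices(corr.shape[0], k=1)
    return corr[upper]


def null_scores(profiles, n_peptides, rng):
    """Null distribution: all pairwise scores among randomly chosen peptides."""
    n = min(n_peptides, profiles.shape[0])
    chosen = rng.choice(profiles.shape[0], size=n, replace=False)
    return pairwise_correlations(profiles[chosen])


def observed_scores(profiles, protein_ids):
    """Observed distribution: pairwise scores within each putative protein."""
    protein_ids = np.asarray(protein_ids)
    scores = []
    for protein in np.unique(protein_ids):
        members = profiles[protein_ids == protein]
        if members.shape[0] >= 2:
            scores.append(pairwise_correlations(members))
    return np.concatenate(scores) if scores else np.array([])


# Simulated data (an assumption for illustration only): each protein has a
# latent abundance profile across 72 samples, and each of its peptides
# measures that profile with independent noise.
rng = np.random.default_rng(0)
n_proteins, peptides_per_protein, n_samples = 50, 3, 72
protein_signal = rng.normal(size=(n_proteins, n_samples))
profiles = np.repeat(protein_signal, peptides_per_protein, axis=0)
profiles = profiles + 0.5 * rng.normal(size=profiles.shape)
protein_ids = np.repeat(np.arange(n_proteins), peptides_per_protein)

null = null_scores(profiles, n_peptides=100, rng=rng)
observed = observed_scores(profiles, protein_ids)
```

In this simulation the null scores cluster around zero while the within-protein scores are shifted toward high correlation, which is the qualitative pattern the text attributes to Figure 8.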
Second, there is a clear difference between the null and observed distributions. Peptides assigned to the same protein by LC-MS/MS and Mascot have highly correlated expression profiles. Of course, there are errors in the assignment of peptides to proteins in the protein identification process, and so there is overlap between these two distributions. To quantify the false positive rate (FPR) of peptide assignment to proteins, the cumulative ratio of the null distribution to the observed distribution is determined and displayed in Figure 9. The estimated false positive rate decreases quickly from 26.9% at correlation score 0.2 to 4.6% at correlation score 0.4. This implies that a correlation score threshold of 0.4 is required to achieve a false positive rate below 5%. Higher or lower FPRs can be selected as required by the application.
Finally, further evidence to support the PIPER hypothesis is obtained by analyzing those peptides sequenced in different charge states or in different SCX fractions. If the PIPER hypothesis is true, then these peptides should have highly correlated expression profiles. All such peptides were selected, and the distribution of their expression correlation scores appears in Figure 10. Consistent with the PIPER hypothesis, their expression correlation scores are very high.
Having established the PIPER hypothesis, our attention now turns to the impact of using PIPER as an enhancer of protein identification. We compare the list of confident protein identifications using PIPER and Mascot together to that obtained using Mascot alone. The total number of proteins assigned at least two peptides is 59. The Mascot filter (at least two peptides with a Mascot peptide score of at least 30) results in 55 proteins. The correlation filter (at least two peptides having a correlation score of at least 0.40 and a Mascot peptide score of at least 15) results in a list of 39 proteins.
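The false positive rate estimate described above can be sketched as follows. The exact form of the "cumulative ratio" is not spelled out here, so this sketch assumes one plausible reading: at a threshold t, the FPR is the fraction of null scores at or above t divided by the fraction of observed scores at or above t. The function names and the toy scores are hypothetical.

```python
import numpy as np


def estimated_fpr(null_scores, observed_scores, threshold):
    """Ratio of the null tail fraction to the observed tail fraction at `threshold`.

    Assumed reading of the cumulative ratio; returns NaN when no observed
    score reaches the threshold, and caps the ratio at 1.
    """
    null_tail = np.mean(np.asarray(null_scores) >= threshold)
    observed_tail = np.mean(np.asarray(observed_scores) >= threshold)
    if observed_tail == 0:
        return float("nan")
    return min(1.0, float(null_tail / observed_tail))


def smallest_threshold(null_scores, observed_scores, target_fpr, grid):
    """First threshold in `grid` (ascending) whose estimated FPR meets the target."""
    for t in grid:
        fpr = estimated_fpr(null_scores, observed_scores, t)
        if fpr <= target_fpr:  # NaN compares False, so empty tails are skipped
            return t
    return None


# Toy scores, for illustration only.
toy_null = [-0.5, -0.1, 0.0, 0.1, 0.5]
toy_observed = [0.1, 0.5, 0.6, 0.7, 0.8]
fpr_at_04 = estimated_fpr(toy_null, toy_observed, 0.4)
chosen = smallest_threshold(toy_null, toy_observed, 0.05, [0.0, 0.2, 0.4, 0.6])
```

Scanning a grid of thresholds in this way mirrors the selection of a correlation score cutoff for a desired FPR; a stricter or looser target simply moves the chosen cutoff.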
Table 1 shows the overlap between those proteins passing the Mascot filter (+) and those passing the correlation filter (+). Overall, the two filters agree on 39/59 = 69% of proteins. The main discrepancy between the two filters is that the correlation filter is stricter than the Mascot filter, with 18/55 = 33% of the proteins identified by the Mascot filter not having a pair of highly correlated peptides. The list of proteins, Mascot scores, and correlation scores appears in Table 2. The key result here is that confident protein identification can be augmented by using peptide expression profiles to better segregate confident and less confident protein identifications.
To illustrate, examples of proteins from Table 1 are depicted in Figures 11-14. Figures 11 and 12 depict proteins that pass both the Mascot and correlation filters. The protein in Figure 13 passes the correlation filter but not the Mascot filter. Finally, the protein in Figure 14 passes the Mascot filter but not the correlation filter.
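The two filters and the overlap table can be sketched as below. This is an illustration under stated assumptions rather than the PIPER code: the correlation score is taken to be a Pearson correlation, the correlation filter is read as requiring at least one pair of peptides that each reach the Mascot score floor and whose profiles correlate at or above the cutoff, and all names and example data are hypothetical.

```python
from itertools import combinations

import numpy as np


def passes_mascot_filter(mascot_scores, min_score=30.0, min_peptides=2):
    """At least `min_peptides` peptides with a Mascot score of at least `min_score`."""
    return sum(s >= min_score for s in mascot_scores) >= min_peptides


def passes_correlation_filter(profiles, mascot_scores, min_corr=0.40, min_score=15.0):
    """At least one pair of eligible peptides with correlation >= `min_corr`."""
    eligible = [i for i, s in enumerate(mascot_scores) if s >= min_score]
    for i, j in combinations(eligible, 2):
        r = np.corrcoef(profiles[i], profiles[j])[0, 1]
        if r >= min_corr:
            return True
    return False


def overlap_table(proteins):
    """Count proteins by (passes Mascot filter, passes correlation filter)."""
    counts = {(m, c): 0 for m in (True, False) for c in (True, False)}
    for profiles, scores in proteins.values():
        key = (
            passes_mascot_filter(scores),
            passes_correlation_filter(profiles, scores),
        )
        counts[key] += 1
    return counts


# Hypothetical proteins: (peptide expression profiles, Mascot peptide scores).
example = {
    "protein_A": ([[1, 2, 3, 4], [2, 4, 6, 8.5]], [40, 35]),  # passes both
    "protein_B": ([[1, 2, 3, 4], [4, 3, 2, 1]], [50, 45]),    # Mascot only
    "protein_C": ([[1, 2, 3, 4], [1, 2, 3, 5]], [20, 18]),    # correlation only
}
table = overlap_table(example)
```

The (True, False) cell of the resulting table corresponds to proteins with good Mascot scores but no pair of correlated peptides, the category illustrated by Figure 14.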
Conclusions
The primary contribution of this paper is the demonstration that peptides originating from the same protein have similar expression profiles. This observation is implemented in a software tool called PIPER that enables the confident ranking of protein identifications. Our experience using PIPER on more than a dozen biomarker discovery projects is that it applies equally well to preclinical studies with murine samples and to clinical studies with human samples. Ironically, human patient diversity is one of the reasons that protein expression profiles can distinguish peptides from different proteins. Nonetheless, in the case of murine studies involving genetically identical animals, plasma variability across animals is still sufficiently high for PIPER to discriminate peptide expression profiles from different proteins. PIPER is also robust to variations in the proteomic platform technology. PIPER has been successfully applied to tissue-based studies, studies using gel-based separation of proteins, and studies where the RPLC gradient has been varied.
Further analysis is required to establish, a priori, the minimum number of samples and platform variability required for PIPER to be applied successfully. However, these questions are relatively straightforward to address. A more interesting question is how effective PIPER is at improving the performance of fingerprinting methods for protein identification, as fingerprinting is becoming increasingly popular due to high mass accuracy instrumentation and the availability of improved fingerprinting methods.
Finally, not quantified in this paper is the impact of PIPER on apparent protein coverage vs true protein coverage. PIPER eliminates false peptide-to-protein assignments based on expression correlation. This can reduce the number of peptides assigned to a protein from, for example, 5 to 2. This results in a more accurate estimate of protein coverage and, so, a more accurate estimate of confidence in the protein identification.
Acknowledgment. The authors would like to thank the reviewers for their insightful suggestions for improving the manuscript. This work was partially funded by the NIAID/NIH contract HHSN266200400056C.
References
(1) Roy, S. M.; Anderle, M.; Lim, H.; Becker, C. H. Differential expression profiling of serum proteins and metabolites for biomarker discovery. Int. J. Mass Spectrom. 2004, 238, 163–171.
(2) Wiener, M. C.; Sachs, J. R.; Deyanova, E. G.; Yates, N. A. Differential mass spectrometry: A label-free LC-MS method for finding significant differences in complex peptide and protein mixtures. Anal. Chem. 2004, 76, 6085–6096.
(3) Silva, J. C.; Denny, R.; Dorschel, C. A.; Gorenstein, M. Quantitative proteomic analysis by accurate mass retention time pairs. Anal. Chem. 2005, 77 (7), 2187–2200.
(4) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F. Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 1999, 17, 994–999.
(5) Lamontagne, J.; Butler, H.; Chaves-Olarte, E.; Hunter, J. Extensive cell envelope modulation is associated with virulence in Brucella abortus. J. Proteome Res. 2007, 6 (4), 1519–1529.
(6) Bellew, M.; Coram, M.; Fitzgibbon, M.; Igra, M. A suite of algorithms for the comprehensive analysis of complex protein mixtures using high-resolution LC-MS. Bioinformatics 2006, 22 (15), 1902–1909.
(7) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome. J. Proteome Res. 2003, 2, 43–50.
(8) Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Royal Stat. Soc., Ser. B 1995, 57, 289–300.
(9) Weatherly, D. B.; Astwood, J. A., III; Minning, T. A.; Cavola, C. A heuristic method for assigning a false-discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics 2005, 4 (6), 762–772.
(10) Anderson, N. L.; Polanski, M.; Pieper, R.; Gatlin, T., et al. The human plasma proteome: A non-redundant list developed by combination of four separate sources. Mol. Cell. Proteomics 2004, 3, 311–326.
(11) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. JASMS 1994, 5 (11), 976–989.
(12) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.
(13) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C. PEAKS: Powerful software for peptide de novo sequencing by MS/MS. Rapid Commun. Mass Spectrom. 2003, 17 (20), 2337–2342.
(14) Lekpor, K.; Benoit, M.-J.; Butler, H.; Schirm, M. An evaluation of multidimensional fingerprinting in the context of clinical proteomics. Proteomics: Clin. Appl. 2007, 1 (5), 457–466.
(15) Pappin, D. J.; Hojrup, P.; Bleasby, A. J. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 1993, 3 (6), 327–32.
(16) Conrads, T. P.; Anderson, G. A.; Veenstra, T. D.; Pasa-Tolic, L.; Smith, R. D. Utility of accurate mass tags for proteome-wide protein identification. Anal. Chem. 2000, 72, 3349–3354.
(17) Palmblad, M.; Ramstrom, M.; Markides, K. E.; Hakansson, P.; Bergquist, J. Prediction of chromatographic retention and protein identification in liquid chromatography/mass spectrometry. Anal. Chem. 2002, 74 (21), 5826–5830.
(18) Adkins, J. N.; Monroe, M. E.; Auberry, K. J.; Yufeng, S. A proteomic study of the HUPO Plasma Proteome Project’s pilot samples using an accurate mass and time tag strategy. Proteomics 2005, 5, 3454–3466.
(19) Smith, R. D.; Anderson, G. A.; Lipton, M. S.; Pasa-Tolic, L.; Shen, Y. An accurate mass tag strategy for quantitative and high-throughput proteome measurements. Proteomics 2002, 2, 513–523.
(20) Gay, S.; Binz, P. A.; Hochstrasser, D. F.; Appel, R. D. Peptide mass fingerprinting peak intensity prediction: extracting knowledge from spectra. Proteomics 2002, 2 (10), 1374–1391.
(21) Carr, S.; Aebersold, R.; Baldwin, M.; Burlingame, A. The need for guidelines in publication of peptide and protein identification data. Mol. Cell. Proteomics 2004, 3 (6), 531–533.
(22) Lamerz, J.; Selle, H.; Scapozza, L.; Crameri, R. Correlation-associated peptide networks of human cerebrospinal fluid. Proteomics 2005, 5 (11), 2789–2798.
(23) Xie, Y.; Pan, W.; Khodursky, A. B. A note on using permutation-based false discovery rate estimates to compare different analysis methods for microarray data. Bioinformatics 2005, 21 (23), 4280–4288.
(24) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y. The International Protein Index: An integrated database for proteomics experiments. Proteomics 2004, 4 (7), 1985–1988.
(25) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome. J. Proteome Res. 2003, 2, 43–50.
(26) Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17 (6), 520–525.
(27) Cox, M. F.; Cox, M. A. A. Multidimensional Scaling; Chapman and Hall: New York, 2001.
PR0705439