Anal. Chem. 2010, 82, 2272–2281
Refinements to Label Free Proteome Quantitation: How to Deal with Peptides Shared by Multiple Proteins

Ying Zhang,† Zhihui Wen,† Michael P. Washburn,†,‡ and Laurence Florens*,†

Stowers Institute for Medical Research, 1000 East 50th Street, Kansas City, Missouri 64110, and Department of Pathology and Laboratory Medicine, The University of Kansas Medical Center, 3901 Rainbow Boulevard, Kansas City, Kansas 66160

Quantitative shotgun proteomics is dependent on the detection, identification, and quantitative analysis of peptides. An issue arises with peptides that are shared between multiple proteins. What protein did they originate from and how should these shared peptides be used in a quantitative proteomics workflow? To systematically evaluate shared peptides in label-free quantitative proteomics, we devised a well-defined protein sample consisting of known concentrations of six albumins from different species, which we added to a highly complex yeast lysate. We used the spectral counts based normalized spectral abundance factor (NSAF) as the starting point for our analysis and compared an exhaustive list of possible combinations of parameters to determine what was the optimal approach for dealing with shared peptides and shared spectral counts. We showed that distributing shared spectral counts based on the number of unique spectral counts led to the most accurate and reproducible results.

Quantitative proteomics by many methods has generated significant biological insights. In a “bottom up” or “shotgun” based approach, peptides are the species which are analyzed by a mass spectrometer. The peptides are then reassembled into proteins, and biological conclusions are drawn.
An issue with this process has been called the “protein inference problem” since once proteins are digested into peptides the connectivity between protein and peptide is lost.1 When unique peptides are detected and identified, the solution is relatively simple since these peptides arose from one and only one protein. However, when peptides are shared between multiple proteins, determining which protein these peptides arose from is a challenge.1 For protein identification purposes, shared peptides can be condensed into a protein group, which can contain multiple proteins, so as to not exaggerate the number of proteins detected and identified in a proteomics analysis.1,2 It is possible to have proteins that share very few peptide sequences and proteins that share extensive peptide sequences, up to 100% identity. However, when single unique peptide identifiers are detected, there is then evidence to support the presence of the corresponding protein in the original sample.

* To whom correspondence should be addressed. E-mail: [email protected]. Phone: (816) 926-4458. Fax: (816) 926-4685.
† Stowers Institute for Medical Research.
‡ The University of Kansas Medical Center.
(1) Nesvizhskii, A. I.; Aebersold, R. Mol. Cell. Proteomics 2005, 4, 1419–1440.
(2) Zhang, B.; Chambers, M. C.; Tabb, D. L. J. Proteome Res. 2007, 6, 3549–3557.
Analytical Chemistry, Vol. 82, No. 6, March 15, 2010

In a quantitative proteomic workflow, similar issues arise. It is inappropriate to exaggerate the abundance of proteins by counting shared peptides multiple times by assigning them to multiple distinct proteins. This is relevant to any quantitative proteomics workflow, using isotopically labeled or label-free data generated from ion chromatograms or spectral counting. However, it is possible and common to have more than one detection and identification of a shared peptide in a given quantitative proteomic analysis. Each of these events is distinct and contributes distinct and valuable data. In a spectral counting based quantitative proteomics workflow, the number of total spectra corresponding to a protein is used for quantitative purposes. Any given peptide, unique or shared, can have one to hundreds to thousands of spectral counts, depending on the experimental design. The question is how a shared peptide that has one to thousands of spectral counts should be used in a quantitative proteomics study. One approach was to simply discard these peptides and only use unique peptides for quantitative proteomics analysis.3 The disadvantage of this approach was that data were being discarded. Another approach was to group these peptides into a protein group where the quantitative data on multiple proteins was compressed.4 This approach retained data and counted data only once, but if there are differences between proteins with significant sharing of peptides, it might be challenging to discern this.
The final approach was to distribute these shared peptides based on certain criteria, like the abundance of the unique peptides for the proteins that shared sequences.5,6 We used the normalized spectral abundance factor (NSAF) for spectral counting based quantitative proteomics.7-9 In this approach, the spectral counts of a protein were divided by its length and normalized to the total sum of spectral counts/length in a given analysis. Because a spectral counting based approach like NSAF uses the length of a protein, one must consider that length carefully when deciding how to deal with shared peptides. In the current study, we have systematically analyzed and compared an exhaustive list of possible mechanisms for determining the NSAF value of a protein taking into account unique and shared peptides along with unique and shared length. We used six purified albumins from different species spiked into a complex mixture of yeast proteins to develop a data set to test many different approaches for calculating NSAF values. We found that distributing shared spectral counts based on the presence of unique peptides generates the best results.

(3) Usaite, R.; Wohlschlegel, J.; Venable, J. D.; Park, S. K.; Nielsen, J.; Olsson, L.; Yates, J. R., III J. Proteome Res. 2008, 7, 266–275.
(4) Jin, S.; Daly, D. S.; Springer, D. L.; Miller, J. H. J. Proteome Res. 2008, 7, 164–169.
(5) Liu, W. L.; Coleman, R. A.; Grob, P.; King, D. S.; Florens, L.; Washburn, M. P.; Geles, K. G.; Yang, J. L.; Ramey, V.; Nogales, E.; Tjian, R. Mol. Cell 2008, 29, 81–91.
(6) Zybailov, B.; Rutschow, H.; Friso, G.; Rudella, A.; Emanuelsson, O.; Sun, Q.; van Wijk, K. J. PLoS One 2008, 3, e1994.
(7) Florens, L.; Carozza, M. J.; Swanson, S. K.; Fournier, M.; Coleman, M. K.; Workman, J. L.; Washburn, M. P. Methods 2006, 40, 303–311.
10.1021/ac9023999 2010 American Chemical Society. Published on Web 02/18/2010

EXPERIMENTAL PROCEDURES

Materials. Six purified albumins from mouse, rat, rabbit, pig, bovine, and human serum (see Supporting Figure S1 in the Supporting Information for multiple sequence alignments) were purchased from Sigma (St. Louis, MO). Urea, Tris, ammonium acetate, iodoacetamide (IAM), and trichloroacetic acid (TCA) were also obtained from Sigma (St. Louis, MO). Tris(2-carboxyethyl)phosphine hydrochloride (TCEP) was obtained from Pierce (Rockford, IL). Endoproteinases Lys-C and Glu-C were products of Roche Diagnostics Corp. (Indianapolis, IN). Modified trypsin, sequencing grade, was obtained from Promega (Madison, WI). HPLC grade water was from EMD Chemicals Inc. (Gibbstown, NJ). HPLC grade formic acid and acetonitrile were purchased from Mallinckrodt Baker, Inc. (Phillipsburg, NJ). Bacto peptone, dextrose, and Bacto yeast extract were acquired from BD Diagnostics (Sparks, MD).

Sample Preparation. Saccharomyces cerevisiae strain BY4741 was grown to midlog phase (OD at 600 nm ∼1.5) in YPD/rich media (10 g of Bacto yeast extract, 20 g of Bacto peptone, and 20 g of dextrose/L).
Cells were collected and washed in cold water by centrifugation for 20 min at 4 000g at 4 °C. Cells were lysed with silica glass beads in lysis buffer (40 mM HEPES-KOH pH 7.5, 350 mM NaCl, 10% glycerol, 0.1% Tween-20), using 10 cycles consisting of 1 min vortexing at 2 500 rpm followed by 30 s incubation at 4 °C. Unbroken cell material and glass beads were pelleted at 4 000g at 4 °C for 20 min. Pooled supernatants were centrifuged for 1 h at 22 000g at 4 °C. The supernatants were removed, and the protein concentration was determined by the BCA protein assay (Pierce). The proteins were precipitated by the addition of TCA to 20%, incubated 3 h at 4 °C, pelleted at 14 000g at 4 °C, and washed twice with 500 µL of acetone. The final pellet was dried via a SPD111V speed vacuum system (Thermo Electron, Milford, MA).

Protein Digestions. The precipitated yeast proteins and the six albumin standards were each dissolved in 100 mM Tris-HCl, pH 8.5 (or pH 7.8 when endoproteinase Glu-C was used), 8 M urea, reduced in 5 mM TCEP, incubated at room temperature for 30 min, and carboxamidomethylated by adding IAM to 10 mM and incubating at room temperature for 30 min in the dark. Proteins were then digested with one of the following enzyme treatments: (1) endoproteinase Lys-C followed by trypsin, at 37 °C overnight; (2) endoproteinase Glu-C for 4 h at 25 °C; (3) endoproteinase Lys-C at 37 °C, overnight, followed by Glu-C, incubated at 25 °C overnight. All reactions were quenched by adding formic acid to 5%. The peptide mixtures were aliquoted and stored at -80 °C prior to use.

(8) Paoletti, A. C.; Parmely, T. J.; Tomomori-Sato, C.; Sato, S.; Zhu, D.; Conaway, R. C.; Conaway, J. W.; Florens, L.; Washburn, M. P. Proc. Natl. Acad. Sci. U.S.A. 2006, 103, 18928–18933.
(9) Zybailov, B.; Mosley, A. L.; Sardiu, M. E.; Coleman, M. K.; Florens, L.; Washburn, M. P. J. Proteome Res. 2006, 5, 2339–2347.

MudPIT. MudPIT was carried out as described previously10 with the following modifications. Peptide mixtures of the albumin isoforms at varying concentrations from 0.1 to 10 pmol, and 8 µg of the yeast proteins were loaded onto a 250 µm i.d. capillary packed first with 3.5 cm of 5 µm strong cation exchange material (Partisphere SCX, Whatman), followed by 2.5 cm of 5 µm C18 reverse phase (RP) particles (Aqua, Phenomenex), and the biphasic column was washed with buffer A for more than 20 column volumes. The buffer solutions used were as follows: water/acetonitrile/formic acid (95:5:0.1, v/v/v) as buffer A (pH 2.6), water/acetonitrile/formic acid (20:80:0.1, v/v/v) as buffer B, and buffer A with 500 mM ammonium acetate as buffer C. After desalting, the biphasic column was connected via a 2 µm filtered union (UpChurch Scientific) to a 100 µm i.d. column, which had been pulled to a 5 µm i.d. tip, then packed with 9.5 cm of 5 µm C18 RP particles. The split three-phase column was placed inline with an Agilent 1100 quaternary HPLC pump (Palo Alto, CA) and an LTQ mass spectrometer (Thermo Fisher Scientific). Each full MS scan (400-1600 m/z) was followed by five data-dependent tandem mass spectrometry (MS/MS) scans, and the number of microscans was 1 for MS and MS/MS scans. Application of mass spectrometer scan functions and gradient generation was controlled by the Xcalibur data system (Thermo Fisher Scientific).

Data Analysis. Collected MS/MS spectra were searched with the SEQUEST algorithm11 against a database of 14 182 protein sequences combining 6 911 S.
cerevisiae nonredundant proteins and 6 albumin isoform sequences, 177 common contaminants, and their corresponding 7 088 randomized amino acid sequences. All cysteines were considered as fully carboxamidomethylated (+57 Da statically added), while methionine oxidation was searched as a differential modification. SEQUEST results from multiple runs were then filtered and compared using DTASelect 1.9/CONTRAST12 with the following criteria set: DeltCn at least 0.1, minimum XCorr of 1.5 for +1, 2.5 for +2, and 3.5 for +3 spectra; peptides had to be at least 7 amino acids long and fully tryptic for trypsin digestions or had to start and end with glutamic acid or lysine for Glu-C and Lys-C digestions, respectively. The minimum number of peptides to identify proteins was two. Peptide hits from replicate analyses were merged to establish a master list of proteins. On the basis of the merged detected peptides, proteins could fall into three categories following the parsimony principle (see Supporting Figure S2 in the Supporting Information, which illustrates a theoretical set of proteins sharing peptides): (i) proteins detected by the exact same set of peptides were grouped together since they could not be distinguished based on the available peptide data (e.g., proteins A and A′ in Supporting Figure S2 in the Supporting Information); only one arbitrarily selected representative protein entry is reported for such a group of proteins; (ii) proteins with at least one peptide uniquely mapping to them were considered unique entries (e.g., protein B in Supporting Figure S2 in the Supporting Information); note that all six albumin isoforms spiked in the yeast lysate were detected by at least one uniquely mapping peptide in each of the acquired runs; (iii) subset proteins, for which no unique peptides were detected, were removed from the final list of identified proteins since the detection of “their” peptides could be explained more simply by other proteins with additional peptides (e.g., protein C′ in Supporting Figure S2 in the Supporting Information).

(10) Florens, L.; Washburn, M. P. Methods Mol. Biol. 2006, 328, 159–175.
(11) Eng, J.; McCormack, A. L.; Yates, J. R., III J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(12) Tabb, D. L.; McDonald, W. H.; Yates, J. R. J. Proteome Res. 2002, 1, 21–26.

The definition of “unique” peptides in our analysis included (i) peptides whose sequence matched only one protein in the sequence database used to interpret the MS/MS spectra (e.g., peptide B in Supporting Figure S2 in the Supporting Information); (ii) peptides unique to a protein group (e.g., peptide A); and (iii) peptides unique to a protein after removal of a subset of proteins (e.g., peptide C). Such peptides were the ones we defined as “unique” in our analysis. The set of peptides defining a protein group was not considered “shared” between the proteins within the group (e.g., peptide A for protein group A + A′ in Supporting Figure S2 in the Supporting Information), and only one spectral-count based abundance value was calculated using the total spectral counts for a protein group. On the other hand, some of the peptides mapping to a protein group or to unique proteins might be shared with other proteins or protein groups (e.g., peptide 1 was shared between protein group A + A′ and unique protein B, while peptide 2 was shared between protein group A + A′ and unique proteins B and C in Supporting Figure S2 in the Supporting Information).
Such peptides were the ones we defined as “shared” in our analysis. Spectral counts (SpC) of each protein or group of proteins were used to estimate protein abundance. Software developed in-house, NSAFv7, used the DTASelect/CONTRAST outputs to calculate the normalized spectral abundance factor (NSAF):
$$(\mathrm{NSAF})_i = \frac{\mathrm{SAF}_i}{\sum_{i=1}^{N} \mathrm{SAF}_i} \qquad (1)$$
where subscript i denotes a protein identity and N is the total number of proteins, while SAF is a protein’s spectral abundance factor, defined as the protein’s spectral counts divided by its length.

Stabilizing NSAF Variance. For NSAF and other SpC based quantitation methods, proteins of higher abundance generally tend to show greater variation across repeated measurements. This heteroskedasticity poses a challenge when applying canonical statistical analyses. We tested various transformation methods.13 We applied the well-established cubic root and log2-based transformations and the recently developed variance-stabilizing normalization transformation (VST) method14 to NSAF data from 12 MudPIT technical replicates of trypsin digestion.

(13) Durbin, B. P.; Hardin, J. S.; Hawkins, D. M.; Rocke, D. M. Bioinformatics 2002, 18, S105–110.
(14) Lin, S. M.; Du, P.; Huber, W.; Kibbe, W. A. Nucleic Acids Res. 2008, 36, e11.
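As an illustration of eq 1 and of the log2 transformation mentioned above, the following is a minimal Python sketch. It is not the in-house NSAFv7 software; the function names and the three example proteins with their spectral counts and lengths are hypothetical.

```python
import math


def nsaf(spectral_counts, lengths):
    """Return NSAF values (eq 1): each SAF divided by the sum of all SAFs."""
    # SAF is a protein's spectral counts divided by its length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total_saf = sum(saf.values())
    return {p: value / total_saf for p, value in saf.items()}


def log2_transform(nsaf_values):
    """log2-transform NSAF values; zeros are skipped because log2(0) is undefined."""
    return {p: math.log2(v) for p, v in nsaf_values.items() if v > 0}


# Hypothetical spectral counts and lengths (amino acid residues).
spc = {"protein_1": 120, "protein_2": 30, "protein_3": 6}
length = {"protein_1": 600, "protein_2": 300, "protein_3": 200}

values = nsaf(spc, length)
log_values = log2_transform(values)
```

By construction the NSAF values of one analysis sum to 1, so they are relative abundances within a run rather than absolute amounts.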
Both log2 and VST transformation outperformed cubic root transformation in stabilizing variance (data not shown). In this study, we hence chose the simpler and effective log2 transformation to stabilize NSAF variance.

RESULTS AND DISCUSSION

Variations on Length, Spectral Counts, and Distribution Factor May Be Combined to Define Several NSAF Strategies. Equation 1 gives us a general format to calculate NSAF that we have described and used previously.7–9,15,16 In eq 1, SAF is defined as a protein’s spectral counts divided by its length. However, what constitutes a protein’s spectral counts and length may be variable, and several strategies may be used to calculate the NSAF value of a protein isoform (Table 1). In particular, while NSAF combined with MudPIT is a robust and simple label free method, its current implementation could be improved because spectral counts from peptides shared among isoforms are counted multiple times. To specifically deal with shared peptides among protein isoforms, two strategies could be implemented: (i) a dismissive strategy, in which only peptides unique to a particular protein are counted (defined hereafter as “uNSAF”) and (ii) a distributive strategy, in which shared spectral counts are divided among protein isoforms (referred to hereafter as “dNSAF”). To begin, the spectral counts of a protein isoform could be calculated in one of three primary ways. First, its total spectral counts could simply be the sum of spectra matched to peptides uniquely mapping to this protein (uSpC) and spectra from peptides shared with its isoforms (sSpC). In such a case, however, the shared spectral counts were counted multiple times among the isoforms, which was likely to result in inaccurate NSAFs for protein isoforms. Second, in a dismissive strategy previously described,3,17 the total spectral counts could be restricted to its unique spectral counts (uSpC).
Such a dismissive approach has been shown to improve quantitation accuracy but is likely to result in loss of information, in particular for low abundance proteins identified by a limited number of detected peptides. Third, its total spectral counts could be uSpC + [(d)(sSpC)], where a distribution factor (d) determines what percentage or fraction of the shared spectral counts is assigned to a particular protein isoform. In such a distributive strategy, all sSpC were counted once and once only, provided that the distribution factors of the M proteins sharing spectra from a specific peptide k (a protein i and the other M - 1 proteins) were normalized to meet the following condition:

$$\sum_{i=1}^{M} d_{i,k} = 1 \qquad (2)$$
That is, the distribution factors assigned to all proteins sharing a given peptide must add up to 1. In other words, once the shared spectral counts were allotted to each of the protein isoforms, the distributed spectral counts (dSpC, which may end up being noninteger numbers) would add up to the initial spectral counts shared among these isoforms.

(15) Pavelka, N.; Fournier, M. L.; Swanson, S. K.; Pelizzola, M.; Ricciardi-Castagnoli, P.; Florens, L.; Washburn, M. P. Mol. Cell. Proteomics 2008, 7, 631–644.
(16) Sardiu, M. E.; Cai, Y.; Jin, J.; Swanson, S. K.; Conaway, R. C.; Conaway, J. W.; Florens, L.; Washburn, M. P. Proc. Natl. Acad. Sci. U.S.A. 2008, 105, 1454–1459.
(17) Skaar, J. R.; Florens, L.; Tsutsumi, T.; Arai, T.; Tron, A.; Swanson, S. K.; Washburn, M. P.; DeCaprio, J. A. Cancer Res. 2007, 67, 2006–2014.

Table 1. Twelve Variations on the NSAF Theme

| NSAF strategy^a | variation | SAF calculation^b | distribution factor (d)^a,b |
|---|---|---|---|
| NSAF | 1-a | (uSpC + sSpC)/(uL + sL) | |
| NSAF | 1-a′ | (uSpC + sSpC)/(uDL + sDL) | |
| uNSAF | 2-a | uSpC/(uL + sL) | |
| uNSAF | 2-a′ | uSpC/(uDL + sDL) | |
| uNSAF | 2-b | uSpC/uL | |
| uNSAF | 2-b′ | uSpC/uDL | |
| dNSAF | 3-a | (uSpC + [(d)(sSpC)])/(uL + sL) | uSpC/∑uSpC |
| dNSAF | 3-a′ | (uSpC + [(d)(sSpC)])/(uDL + sDL) | uSpC/∑uSpC |
| dNSAF | 3-b | (uSpC + [(d)(sSpC)])/(uL + sL) | (uSpC/uL)/∑(uSpC/uL) |
| dNSAF | 3-b′ | (uSpC + [(d)(sSpC)])/(uDL + sDL) | (uSpC/uDL)/∑(uSpC/uDL) |
| dNSAF | 3-c | (uSpC + [(d)(sSpC)])/(uL + sL) | (uSpC)(sL/uL)/∑[(uSpC)(sL/uL)] |
| dNSAF | 3-c′ | (uSpC + [(d)(sSpC)])/(uDL + sDL) | (uSpC)(sDL/uDL)/∑[(uSpC)(sDL/uDL)] |

a The three main NSAF strategies are defined as NSAF, where spectral counts from peptides shared between multiple proteins may be counted multiple times; uNSAF, where spectral counts derived from unique peptides are the only ones considered, and the shared spectral counts are dismissed; dNSAF, where spectral counts from shared peptides are distributed among protein isoforms based on a distribution factor, d. b Spectral counts from peptides uniquely mapping to a protein are denoted as “uSpC”, while spectral counts from peptides shared between isoforms are labeled “sSpC”. Protein amino acid lengths mapping to unique and shared peptides are denoted as “uL” and “sL”, respectively, while unique and shared lengths from peptides detected in a MudPIT experiment are labeled as “uDL” and “sDL”, respectively.

Next, the length of a protein isoform could be calculated in four possible ways based on the use of spectral counts. First, its length could be simply defined as its total number of amino acid residues (L). Second, its length could also be defined as its total identified residue length (DL, which is calculated as sequence coverage times length L), summing up the protein length covered by its detected unique peptides (uDL) and the length covered by its detected shared peptides (sDL). Third, its length could be calculated as its unique length (uL) only, removing the length mapped to shared peptides (sL). In such a case, the unique and shared lengths are defined in theory by multiple sequence alignments (e.g., Figure S1 in the Supporting Information) and by imposing specific peptide end requirements based on the type of enzymatic digestion used to generate the peptides. Lastly, its length could be defined very narrowly as the length of its uniquely detected peptides (uDL). The second and fourth approaches for calculating length only used regions of the protein that were
covered by peptides detected and identified in the proteomics analysis. The third and fourth approaches were only applied in cases where only unique spectral counts (uSpC) were being considered as a way to compensate for the fact that shared peptides were dismissed. How spectral counts were defined dictated three main NSAF strategies (Table 1), while how length and distribution factor were defined dictated a total of 12 NSAF variations (Table 1). The detail of each of the 12 SAF calculations is illustrated in Figure 1 using two theoretical proteins sharing two peptides. The first strategy for NSAF calculation was to count both shared and unique spectral counts as mentioned above, uSpC + sSpC. When we used this approach it was reasonable to choose either the total amino acid residue length (L) or detected length (DL), but not unique length (uL) or detected unique length (uDL) only, as a protein’s nominal length because both unique and shared spectra were counted. The “original” NSAF (eq 1) that we have used so far for proteomics quantitation corresponded to NSAF (1-a), while variation (1-a′) used detected length (DL) as protein length (Table 1 and Figure 1).

Figure 1. Theoretical illustration of the 12 NSAF equations. As an example, two theoretical proteins sharing two peptides were used to illustrate what parameters are input in each of the SAF calculations reported in Table 1.
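The spectral count definitions in Table 1 can be illustrated with a short Python sketch. It is not the authors' implementation: it assumes a single set of proteins that all share the same peptides, covers only variations 1-a, 2-a, 2-b, and 3-a, and uses hypothetical names, counts, and lengths. When no unique spectral counts are available at all, it falls back to an equal split, which is the rule the text gives for runs in which no unique peptides are detected.

```python
def saf_variants(u_spc, s_spc, u_len, s_len):
    """SAF for variations 1-a, 2-a, 2-b and 3-a of Table 1.

    u_spc: unique spectral counts per protein (uSpC)
    s_spc: spectral counts of peptides shared by all listed proteins (sSpC)
    u_len, s_len: unique and shared lengths per protein (uL, sL)
    Assumes every protein has a nonzero unique length.
    """
    total_unique = sum(u_spc.values())
    results = {}
    for protein in u_spc:
        full_length = u_len[protein] + s_len[protein]
        if total_unique > 0:
            # Variation 3-a: d = uSpC / sum of uSpC over the sharing proteins.
            d = u_spc[protein] / total_unique
        else:
            # No unique counts anywhere: split shared counts equally.
            d = 1.0 / len(u_spc)
        results[protein] = {
            "1-a": (u_spc[protein] + s_spc) / full_length,
            "2-a": u_spc[protein] / full_length,
            "2-b": u_spc[protein] / u_len[protein],
            "3-a": (u_spc[protein] + d * s_spc) / full_length,
        }
    return results


# Hypothetical example: two proteins sharing peptides with 40 spectra in total.
u_spc = {"protein_1": 30, "protein_2": 10}
u_len = {"protein_1": 400, "protein_2": 300}
s_len = {"protein_1": 200, "protein_2": 200}
saf = saf_variants(u_spc, 40, u_len, s_len)
```

In this example d is 0.75 for protein 1 and 0.25 for protein 2, so the 40 shared spectra are split 30/10 and each is counted exactly once, whereas variation 1-a counts all 40 for both proteins.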
Within the uNSAF strategy that used only unique spectral counts (uSpC), we defined four possible but not exhaustive variations to calculate uSAF when combined with the definition of nominal length (Table 1): 2-a, a strictly dismissive approach that removed shared spectral counts (sSpC) and only used unique spectral counts (uSpC), yet the total amino acid residue length (L) was still considered as nominal length; 2-a′ only used unique spectral counts (uSpC) as well, but uSpC were normalized against detected length (DL) instead of total amino acid residue length (L); 2-b also discarded shared spectral counts (sSpC) but compensated the nominal length by subtracting the shared length (sL) from the total amino acid residue length (L) to account for the fact that shared spectral counts (sSpC) were not being considered; 2-b′ was a variation of 2-b, in which detected shared residue length (sDL) was subtracted from detected length (DL), i.e., DL - sDL as nominal length instead of L - sL (Table 1 and Figure 1). Within the dNSAF strategy, which distributed shared spectral counts among isoforms, i.e., uSpC + [(d)(sSpC)], there were several possible but not exhaustive variations (Table 1) to define the distribution factor d that determined what fraction of sSpC should be assigned to each isoform: 3-a, a protein’s uSpC; 3-b, a protein’s (uSpC/uL); 3-b′, a protein’s (uSpC/uDL); 3-c, a protein’s (uSpC)(sL/uL); 3-c′, a protein’s (uSpC)(sDL/uDL). With every distributive approach, we normalized d similarly to eq 1. With uNSAF approaches, only unique spectral counts (uSpC) were
counted, while with dNSAF approaches, all spectral counts (SpC) were counted once and once only. Because both unique spectral counts (uSpC) and shared spectral counts (sSpC) were counted for dNSAF, we could reasonably choose total amino acid residue length (L) or detected length (DL) instead of unique length (uL) or detected unique length (uDL) as the protein’s nominal length for dSAF calculation. When we chose total amino acid residue length (L) to calculate dSAF, unique length (uL) and shared length (sL) were used to calculate d; while when we chose detected length (DL) to calculate dSAF, detected unique length (uDL) and detected shared length (sDL) were used in the definition of d (Table 1 and Figure 1). As illustrated in Figure 1, in which two theoretical proteins shared two peptides, one of them (“A”) being detected by MudPIT, the fractions of shared spectral counts “a” allotted to each protein were calculated based on two distribution factors: d1 for protein 1 was the sum of unique spectral counts for protein 1 (“b + c”) divided by the sum of unique spectral counts for both proteins 1 and 2 (“b + c + f”); while d2 for protein 2 was the sum of unique spectral counts for protein 2 (“f”) divided by the sum of unique spectral counts for both proteins 1 and 2 (“b + c + f”). Under these conditions, the sum of d1 and d2 added up to 1, hence complying with eq 2; in other words, the sum of the distributed spectral counts (“[(a)(d1)] + [(a)(d2)]”) was equal to the initial shared spectral counts. In runs where no unique peptides are detected for proteins sharing peptides (e.g., run no. 2 in Supporting Figure S2 in the Supporting Information), the spectral counts of the shared peptide(s) would be equally distributed among the proteins or protein groups (e.g., d would be 0.33 for protein group A + A′, protein B, and protein C, in Supporting Figure S2 in the Supporting Information).

Table 2. Linear Correlation between Known Quantities of Albumin Isoforms and Their Normalized Spectral Counts

| NSAF strategy | variation | trypsin no. 1 (dynamic range^a 1-100) AVG^b | SDV | Lys-C + Glu-C (dynamic range 1-100) AVG | SDV | trypsin no. 2 (dynamic range 1-1000) AVG | SDV |
|---|---|---|---|---|---|---|---|
| NSAF | 1-a | 0.752 | 0.079 | 0.879 | 0.018 | 0.251 | 0.016 |
| NSAF | 1-a′ | 0.471 | 0.256 | 0.836 | 0.025 | 0.173 | 0.032 |
| uNSAF | 2-a | 0.854 | 0.076 | 0.929 | 0.012 | 0.866 | 0.030 |
| uNSAF | 2-a′ | 0.812 | 0.074 | 0.897 | 0.006 | 0.83 | 0.039 |
| uNSAF | 2-b | 0.902 | 0.063 | 0.888 | 0.012 | 0.911 | 0.027 |
| uNSAF | 2-b′ | 0.874 | 0.057 | 0.877 | 0.011 | 0.902 | 0.044 |
| dNSAF | 3-a | 0.941 | 0.029 | 0.919 | 0.012 | 0.965 | 0.022 |
| dNSAF | 3-a′ | 0.944 | 0.027 | 0.888 | 0.008 | 0.962 | 0.017 |
| dNSAF | 3-b | 0.942 | 0.029 | 0.919 | 0.012 | 0.967 | 0.021 |
| dNSAF | 3-b′ | 0.928 | 0.028 | 0.88 | 0.009 | 0.941 | 0.022 |
| dNSAF | 3-c | 0.943 | 0.028 | 0.918 | 0.012 | 0.969 | 0.019 |
| dNSAF | 3-c′ | 0.938 | 0.025 | 0.874 | 0.009 | 0.936 | 0.022 |
| spectral counts | | 17 615.3 | 2 129.9 | 14 360.2 | 2 543.6 | 13 185.5 | 796.8 |
| peptide counts | | 3 300.6 | 333.4 | 2 935.6 | 447.9 | 2 347.5 | 239 |
| protein counts | | 646.6 | 41.8 | 572.6 | 35.7 | 502.5 | 34.7 |

a For the “trypsin no. 1” and “Lys-C + Glu-C” sets of digestions, the amounts of protein isoforms spiked into 8 µg of yeast soluble protein extract were 0.25, 4, 0.63, 0.1, 10, and 1.6 pmol of mouse, rat, rabbit, pig, bovine, and human albumins, respectively. For the set of digestions performed with a wider dynamic range of isoform standards (“trypsin no. 2”), the amounts of proteins spiked into 8 µg of yeast soluble protein extract were 0.1, 0.4, 1.6, 6.3, 25, and 100 pmol of mouse, human, rabbit, bovine, pig, and rat albumins, respectively. b The squares of the Pearson product moment correlation coefficients were obtained with the RSQ function in Microsoft Excel. Average (“AVG”) and standard deviation (“SDV”) were calculated for RSQ values measured across 12, 4, and 4 technical replicates for the “trypsin no. 1”, “Lys-C + Glu-C”, and “trypsin no. 2” sets of digestions, respectively (see Supporting Tables 1-3 in the Supporting Information for detailed values).

Linearity Is Maintained between Protein Amounts and dNSAF Values over a Dynamic Range of at Least 3 Orders of Magnitude. To test whether there was a linear correlation between the known amount of protein isoforms and their measured NSAF values, we used two sets of albumin mixtures digested with trypsin as well as alternate digestion protocols. In the first mixture, the amounts of the six standard proteins were distributed evenly over 2 orders of magnitude in logarithmic scale from 0.1 to 10 pmol. To estimate the dynamic range of each NSAF strategy, we also used a second trypsin-digested mixture in which the amounts of albumin isoforms were logarithmically distributed evenly over 3 orders of magnitude, from 0.1 to 100 pmol. The square of the Pearson product moment correlation coefficient (RSQ) between log2 of NSAF and log2 of albumin amount was then calculated (Table 2 and Supporting Tables 1-3 in the Supporting Information). To evaluate deviations from the linear relationship, log2(NSAF) was plotted as a function of log2(amount) and linear regressions through the data sets were performed for each NSAF strategy (Figure 2). NSAF calculations yielded the worst linear correlations between the amount of protein isoforms and NSAF responses (Table 2). One obvious reason was that because shared spectral counts (sSpC) were counted multiple times, NSAF values for isoforms with quantities on the low end of the range were overestimated, lying well above the linear regression lines (Figure 2). The linear correlation was improved by uNSAF approaches, in which shared spectral counts (sSpC) were completely ignored (Table 2).
However, uNSAF values for isoforms of lower abundance appeared to be underestimated, lying below the linear regression lines
(Figure 2). Compared to the strict uNSAF approaches 2-a and 2-a′, the compensated uNSAF approaches 2-b and 2-b′ had better performance (RSQ for the first trypsin digest improved from 0.85 to over 0.90, Table 2). The difference between the strict uNSAF and the compensated uNSAF was that the protein’s shared length (sL) was removed from its total amino acid residue length (L) in the compensated uNSAF. The results illustrated that when shared spectral counts (sSpC) were removed, trimming off shared length (sL) from the protein’s total amino acid residue length (L) or detected shared length (sDL) from detected length (DL) was a reasonable step. In other words, the compensated uNSAF approaches treated the nominal residue length more fairly than the strict uNSAF. The best correlations were measured for the dNSAF approaches, where shared spectral counts (sSpC) were distributed based on a distribution factor, d. With the distributive dNSAF strategies, the averaged RSQ for the 12 trypsin replicates improved to over 0.94 when total amino acid residue length (L) or detected length (DL) were used to calculate the protein nominal length (Table 2). There was no significant difference among the different distributive approaches (Figure 2). While within the NSAF approaches using total amino acid residue length (L) for a protein’s nominal length 1-a had a significantly better performance than using the detected length (DL) 1-a′, these differences were not significant in either uNSAF or dNSAF (Table 2). Considering that the detected shared length (sDL) and detected unique length (uDL) are more easily computed than shared length (sL) and unique length (uL) (Supporting Figure S1 in the Supporting Information), the results suggested that we could simply choose sDL or uDL as a protein’s nominal length, especially when the uNSAF (2-b) or dNSAF (3-b or 3-c) strategies were implemented. We next used the six albumin isoform standards to estimate the effective dynamic range of each strategy. 
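The RSQ values discussed above were obtained with the RSQ function in Microsoft Excel (Table 2). An equivalent calculation can be sketched in Python as follows. The function name and the NSAF values are hypothetical (the values are made exactly proportional to the amounts, so the correlation should be essentially perfect), while the spiked amounts are those listed for the 1-100 mixtures in Table 2.

```python
import math


def rsq(x, y):
    """Square of the Pearson product moment correlation coefficient."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


# Spiked amounts (pmol) of mouse, rat, rabbit, pig, bovine, and human albumin.
amounts = [0.25, 4, 0.63, 0.1, 10, 1.6]
# Hypothetical NSAF values, exactly proportional to the amounts.
nsaf_values = [0.001 * a for a in amounts]

# Both axes are log2-transformed before the correlation is computed.
log_amounts = [math.log2(a) for a in amounts]
log_nsaf = [math.log2(v) for v in nsaf_values]
r2 = rsq(log_amounts, log_nsaf)
```

With real data, overestimated values at the low end of the range (as described for NSAF 1-a) would pull the points off the line and lower the RSQ.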
Again, the linearity
Figure 2. Linear regression between NSAF values and known protein amounts. Log2-transformed NSAF, uNSAF, or dNSAF values were plotted as a function of log2-transformed protein amounts in picomoles. Upper panels report NSAF (1-a) (gray squares) and NSAF (1-a′) (open squares) for the trypsin 1-100 (A), LysC + GluC 1-100 (B), and trypsin 1-1000 (C) data sets. Middle panels report uNSAF (2-a) (gray squares) and uNSAF (2-b) (open triangles) for the trypsin 1-100 (D), LysC + GluC 1-100 (E), and trypsin 1-1000 (F) data sets. Lower panels report dNSAF (3-a) (gray squares), dNSAF (3-b) (open triangles), and dNSAF (3-c) (small closed circles) for the trypsin 1-100 (G), LysC + GluC 1-100 (H), and trypsin 1-1000 (I) data sets. Averages and standard deviations were calculated across the technical replicates. Linear regressions were fitted through the 6 data points of each data set, with the exception of the NSAF panels A, B, and C, for which the two data points corresponding to the lowest amounts were excluded from the linear fit.
between NSAF and the amount of isoform proteins was very poor (Table 2), widely deviating from linearity on the low amount side (Figure 2). Because of the greater dynamic range of the standard proteins, quantitation of proteins of lower abundance was more affected by the shared peptides than in the case of the protein mixture covering 2 orders of magnitude. Mouse albumin was spiked into the mixture at the lowest amount, and its total spectra averaged 927 across four replicates, while its unique spectra averaged only 13 (Supporting Table 1 in the Supporting Information). All 927 spectra were considered to calculate NSAF, leading to an estimate far from its real quantity. On the other hand, linearity with respect to the amount of isoform proteins improved with the uNSAF (2-b) strategies (RSQ > 0.9), and RSQs measured with
any of the dNSAF variations were very good (Table 2). Comparing trypsin digestions of both sets of protein mixtures showed that the differences in performance between NSAF, uNSAF, and dNSAF were accentuated as the dynamic range of the amount of isoform proteins became larger. In particular, while uNSAF values measured for the two extreme isoforms (mouse and rat proteins of lowest and highest abundance, respectively) were clearly underestimated (Figure 2), this was not the case for dNSAF values. Overall, linearity was maintained between protein amounts and dNSAF over a dynamic range of at least 3 orders of magnitude (Figure 2), which showed that dNSAF values are suitable for estimating protein levels in biological samples.
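The linearity assessment described above (RSQ of a linear fit between log2-transformed abundance values and log2-transformed protein amounts) can be sketched as follows. This is a plain least-squares illustration, not the authors' analysis code; the function name, the amounts, and both value series are hypothetical and chosen only to show how a constant contribution from shared spectra flattens the low-amount end of the curve.

```python
import math


def log2_linear_fit(amounts, values):
    """Least-squares fit of log2(values) against log2(amounts).

    Returns (slope, intercept, rsq), where rsq is the squared
    Pearson correlation of the log2-transformed data.
    """
    xs = [math.log2(a) for a in amounts]
    ys = [math.log2(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rsq = (sxy * sxy) / (sxx * syy)
    return slope, intercept, rsq


# Hypothetical amounts spanning about 3 orders of magnitude.
amounts = [1, 4, 16, 64, 256, 1024]

# Series 1: abundance values strictly proportional to the amounts.
proportional = [a * 1e-4 for a in amounts]

# Series 2: the same values plus a constant offset, mimicking shared
# spectra credited in full to every isoform regardless of its amount.
inflated = [a * 1e-4 + 0.01 for a in amounts]

slope_p, _, rsq_p = log2_linear_fit(amounts, proportional)
slope_i, _, rsq_i = log2_linear_fit(amounts, inflated)

print("proportional: slope=%.3f rsq=%.3f" % (slope_p, rsq_p))
print("inflated:     slope=%.3f rsq=%.3f" % (slope_i, rsq_i))
```

The offset matters most for the least abundant protein, which matches the situation reported for mouse albumin, where only 13 of 927 spectra on average were unique.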
Normalized Spectral Counting Yields Accurate Quantitation on Nontryptic Data Sets When Protein Digestion Is Complete. When we investigated the correlation between protein quantity and NSAF on Glu-C digested data sets, all RSQ values were very low (below 0.65) (data not shown). Although dNSAF still showed the best linear correlation, the results indicated that spectral counting based quantitation may not be appropriate for Glu-C digested samples. We then checked the sequences of the identified peptides and found significant numbers of missed cleavage sites, suggesting that the digestion was not complete. We therefore changed the digestion protocol by incubating first with endoproteinase Lys-C in 8 M urea and extending the Glu-C incubation time from 4 h to overnight at 25 °C. With these efforts to generate completely digested proteins, the linear correlation between the amounts of the six standard isoforms and NSAF, uNSAF, and dNSAF improved significantly. All RSQ values increased to >0.84 in every strategy (Table 2). Differences in RSQ values observed for the various NSAF strategies were not as pronounced for the Lys-C + Glu-C digested samples, although uNSAF and dNSAF still outperformed NSAF (Table 2). The linear correlations derived from uNSAF and dNSAF values were very close (Figure 2). These results demonstrate that in spectral counting based shotgun proteomics quantitation, a complete digestion is important for accurate quantitation.
What Factors Are Responsible for the Observed Differences between NSAF Strategies? In the digestion of a real biological sample, isoforms would normally only occupy a small fraction of the total proteins, and proteins without any shared peptides always kept the relationship SAF = uSAF = dSAF since their shared length (sL), detected shared length (sDL), shared spectral counts (sSpC), and distribution factor (d) are all null. Consequently, the denominators in eq 1 for calculating NSAF, uNSAF, and dNSAF should be close to each other.
Differences between SAF, uSAF, and dSAF were then primarily responsible for differences between NSAF, uNSAF, and dNSAF values. There were some obvious differences in SAF responses depending on whether NSAF, uNSAF, or dNSAF strategies were applied to estimate protein isoform levels: by removing shared spectral counts (sSpC), a protein isoform’s strict uSAF was always smaller than its SAF and was always smaller than dSAF; dNSAF proportionally distributed shared spectral counts (sSpC) rather than simply adding shared spectral counts (sSpC) into each isoform’s total spectral counts, therefore dSAF was always smaller than SAF; a protein isoform’s strict uSAF (2-a or 2-a′) was also always smaller than its compensated uSAF (2-b or 2-b′), in which shared length (sL) or detected shared length (sDL) was removed, hence spectral counts were divided by a smaller nominal length value. The performance difference between NSAF, uNSAF, and dNSAF also depended on the ratio between shared and unique lengths: when (sL/uL) (or (sDL/uDL)) became bigger, larger differences in SAF, uSAF, and dSAF values were observed. Let us assume a theoretical digestion of two protein isoforms i and k; protein i was a lot more abundant than protein k; proteins i and k had the same length and shared one and only one peptide; proteins i and k did not share any peptide with any other proteins; after liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis, we identified exactly one unique peptide with unique spectral counts (uSpCi) for protein i, exactly one unique peptide with unique spectral counts (uSpCk) for protein k, and one shared peptide with shared spectral counts (sSpC) for proteins i and k; the three identified peptides had the same length. If all the experimental efficiency constants from digesting proteins i and k to LC-MS/MS of the three identified peptides were the same, then stoichiometrically unique spectral counts (uSpCi) should be much larger than unique spectral counts (uSpCk), and shared spectral counts (sSpC) should be larger than but close to unique spectral counts (uSpCi), because in stoichiometry sSpC = uSpCi + uSpCk and uSpCi ≫ uSpCk. Therefore for protein k, (sSpC/uSpCk) (which should be ≫ 1) should be much greater than (sLk/uLk) (which should be