
J. Phys. Chem. C 2010, 114, 5360–5366

Physicochemical/Thermodynamic Framework for the Interpretation of Peptide Tandem Mass Spectra†

William R. Cannon* and Mitchell M. Rawlins

Computational Biology and Bioinformatics Group, Fundamental and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington 99352

Received: May 29, 2009; Revised Manuscript Received: September 16, 2009

The interpretation of tandem mass spectra of peptides from collision-induced dissociation is discussed from the perspective of a multinomial data analysis problem and from the perspective of statistical mechanics. Both approaches use the same statistical likelihood function, but the free energy differs from the statistical likelihood by a term that additionally accounts for the size of the system. In addition, it is shown that the statistical likelihood is equivalent to the information theory entropy when scaled appropriately. The likelihood function provides a physically and chemically principled way to incorporate intensity information into the interpretation of tandem mass spectra. As a result, we demonstrate that incorporating the intensity information in this manner leads to an increased identification rate for database searches and allows for the simultaneous use of spectral libraries with standard database searching. The use of spectral libraries further increases the identification rate relative to database searching by 25-32%. Furthermore, the convergence of statistical and physical perspectives enables the future use of predictive modeling and simulation to inform the peptide identification process.

Introduction

In large-scale proteomics studies of cell populations, one method for the identification of proteins is tandem mass spectrometry (MS/MS) of the component peptides. The process begins with the isolation of proteins from cells of interest, typically followed by digestion of the proteins with a peptidase such as trypsin. The resulting peptides are then partially separated by chromatography and introduced into the gas phase by electrospray ionization. In collision-induced dissociation (CID) mass spectrometry, the gas phase peptides are then vibrationally excited by collisions with an inert gas. A peptide is identified when the observed fragmentation pattern matches an expected fragmentation pattern, or model spectrum, for a particular peptide.
Typical model spectra are shown below an experimental spectrum in Figure 1. The expected fragmentation pattern may be derived from heuristic rules based on past experience,1 from statistical models of spectra based on training or averaging over many peptides of diverse sequences,2-4 or from libraries of spectra from known peptides.5-7 The difference between the latter two is that a model spectrum from a spectral library is derived from the fragmentation pattern of a specific peptide, while a model spectrum derived from statistical training uses a variety of peptides having diverse sequences. Many measures of similarity can be used to quantify the match between the model spectrum and the observed spectrum. However, it is a difficult task to determine whether any particular similarity measure performs best and whether the performance is due to the inherent appropriateness of the statistical model or to implementation issues. Most data analysis schemes used for interpreting MS/MS spectra are ad hoc, that is, they are set up for the single purpose of attempting to select the best match between an observed spectrum and a predicted spectrum. A case in point: currently, identification methods that use spectral libraries are run independently of identification tools that use model spectra derived from statistical training. The simultaneous use of spectral libraries with model spectra derived from training would require that the similarity measure used to evaluate each be the same. The goal of using model spectra from multiple sources is important; although spectral libraries are highly accurate, they will not be available for all peptides. The difference between a model spectrum from a spectral library and one generated from training over spectra from a variety of different peptides is in the intensities; both approaches get mass-to-charge (m/z) ratios equally correct. Many methods for incorporating intensity information have been proposed, but they are as diverse as the similarity methods for comparing spectra. Ultimately, the intensity information contained in a spectral library is due to the fragment populations generated from the underlying physics of collision-induced dissociation. In fact, there is interest in using statistical physics to aid in the identification of peptides. Two previous studies reported on the effectiveness of using entropy as a measure of similarity,4,8 and one report used a similarity measure inspired by the partition function of spin glasses.9 In this paper we discuss the relationship between the statistical analysis of data from peptide CID MS/MS and the statistical mechanical formulation of free energy changes in chemical reactions. We show not only that a likelihood ratio statistic is a natural similarity measure for comparing fragmentation patterns from different peptides but also that the statistical mechanical characterization of a chemical reaction uses in part the same likelihood ratio.

† Part of the “Barbara J. Garrison Festschrift”.
* To whom correspondence should be addressed. Phone: (509) 375-6732. Fax: (509) 372-4720. E-mail: [email protected].

There are several advantages of using the same statistical formulation for data analysis as is used for describing the underlying physical phenomena. An analysis based on the scientific principles of the underlying phenomena is less likely to employ incorrect assumptions about the data or models and, as a result, will lead to a more robust analysis. For example, we show how this approach provides a principled solution to the problem of how to incorporate intensities into the scoring of a match between a model spectrum and an experimental spectrum. Using this approach, peptide identifications from spectral libraries can be directly compared to those obtained using training models, which allows for the simultaneous use of standard database searching with spectral library searching. We demonstrate that the combined analysis produces 25-32% more correct identifications relative to a standard database search. Second, we discuss how molecular theory and simulations can be used as a predictive tool for informing the data analysis. In addition to potentially providing model spectra, molecular models and simulation can help evaluate hypotheses about fragmentation pathways and mechanisms. Finally, the framework reported here could be used as a primer on thermodynamics for statisticians.

Figure 1. (A) Experimental MS/MS spectrum for des-Arg9-bradykinin (PPGFSPFR). A model spectrum from a spectral library will look similar to the experimental spectrum. (B) Heuristic model spectrum similar to that used by SEQUEST.1 (C) Model spectrum developed through the use of statistical training methods in which the probability of observing a peak at a specific location was learned from a sequence-averaged training set similar to that proposed by Dancik, et al.2

10.1021/jp905049d © 2010 American Chemical Society. Published on Web 11/25/2009

J. Phys. Chem. C, Vol. 114, No. 12, 2010

Experimental Methods

Peptide tandem mass spectrometry (MS/MS) was performed in the laboratory of Richard Smith at the Pacific Northwest National Laboratory and was part of a routine quality assessment analysis. Briefly, the 31 476 CID spectra used in this report are from the analysis of a mixture of known peptides10 analyzed using electrospray ionization feeding Finnigan LTQ ion traps. The peptides consisted of a trypsin digest of a mixture of 34 independently purified peptides and 31 proteins. The mixture was independently analyzed with the LTQ multiple times on multiple days.
For the purpose of comparing the relative performance of the likelihood ratio methods discussed in this paper, these 65 protein and peptide sequences were supplemented with 88 268 protein sequences.11 A file containing the sequences of the 31 proteins and 34 peptides and the 88 268 additional protein sequences was used in a database search with the likelihood scoring functions implemented in the MSPolygraph software.4 In the database search, tryptic enzyme rules were used, and peptide sequences within ±3 mass-to-charge units of the measured mass-to-charge of the parent peptide were used as candidate peptides. Spectral libraries for 395 peptides from the 64 known proteins and peptides were developed from an independent set of 28 888 spectra. The spectra matching each peptide sequence were then evaluated manually and compared for consistency.
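The candidate-selection step described above can be illustrated with a minimal Python sketch. It is not the MSPolygraph implementation. The cleavage rule (cut after K or R unless the next residue is P), the absence of missed cleavages and modifications, and the function names `tryptic_peptides`, `precursor_mz`, and `candidates` are all assumptions made for illustration. The residue masses are standard monoisotopic values.

```python
import re

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of the charge-carrying proton


def tryptic_peptides(protein):
    """Assumed rule: cleave after K or R unless followed by P (Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def precursor_mz(peptide, charge=1):
    """Mass-to-charge ratio of the protonated, unmodified peptide."""
    neutral = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral + charge * PROTON) / charge


def candidates(proteins, measured_mz, charge=1, tol=3.0):
    """Peptides whose m/z lies within +/- tol of the measured parent m/z."""
    hits = []
    for protein in proteins:
        for pep in tryptic_peptides(protein):
            if abs(precursor_mz(pep, charge) - measured_mz) <= tol:
                hits.append(pep)
    return hits
```

Calling `candidates(["AAKPGRCDEK"], 495.0)` keeps only the peptides from the hypothetical sequence that fall inside the ±3 window. Each surviving candidate would then be scored against the spectrum.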

Theoretical Development

Statistical Model of Peptide Fragmentation. First, we consider the characterization of peptide fragmentation from a purely descriptive perspective. Consider ntot identical peptides having k labile peptide bonds that may fragment due to collision-induced dissociation. For now, we consider only b-ion and y-ion products formed by fragmentation at the peptide bond. In the collision chamber of a mass spectrometer, each peptide is vibrationally excited and undergoes fragmentation at each peptide bond i, and n = {ni} is the vector count of the number of fragments formed at each peptide bond, ni ≤ ntot. The probability of a fragmentation at peptide bond i depends on the molecular energy (ε) required for the chemical reaction and the temperature of the system; we denote this probability as θi(ε,T). Each of the ni products is independent and indistinguishable, but each of the fragments produced at bond i is distinguishable from the fragments produced at other bonds. For any fragmentation product of the peptide, the probability θi and the data ni can be used to form a multinomial model of fragmentation,

\[
p(\mathbf{n}\mid\boldsymbol{\theta}) = n_{\mathrm{tot}}! \prod_{j}^{k} \frac{1}{n_j!}\,\theta_j(\varepsilon,T)^{n_j} \tag{1}
\]

The term on the left-hand side is the likelihood, p(n|θ), for the data n given the model parameters θ. There are no terms (1 - θi(ε,T)) when fragment i is missing from a spectrum in the equation above because this is accounted for by the probabilities for other fragments: (1 - θi(ε,T)) = ∑_{j≠i} θj(ε,T). Assuming that one can determine counts of peptide fragments under these conditions, the calculation of this distribution for any peptide is relatively straightforward in principle. However, this formulation is difficult to use in practice because the parameters θi refer to the microscopic state and are not known. In fact, in experimental measurements such as surface-induced dissociation the energy levels of the individual molecules are not observed; instead, the average energy of a collection of molecules is measured. However, the analysis can instead be treated on the macroscopic level such that

\[
p(\mathbf{n}\mid\boldsymbol{\theta}) = n_{\mathrm{tot}}! \prod_{j}^{k} \frac{1}{n_j!}\,\theta_j(T)^{n_j} \tag{2}
\]

In this case, the probabilities θj(T) are measured as a function of temperature only and energy levels are not resolved. The θj(T) can be estimated from counts such that the model parameters for any fragment i are the number of observed fragments ni out of ntot fragments,

\[
\theta_i(T) = \frac{n_i}{n_{\mathrm{tot}}}
\]

That is, θi(T) is the estimated probability of observing ni fragmentations at bond i out of ntot molecules when the measurement is done at temperature T. In order to obtain a distribution of values, θi is evaluated from multiple spectra of the same peptide. The development of a scoring scheme using the likelihood function as outlined above is achievable when using spectral libraries.5,7,12,13 However, the development of a complete spectral library for the peptides derived from any organism is a formidable task. For instance, to develop a spectral library for tryptic peptides of Salmonella typhi we estimate that 400 000 distinct peptides would have to be isolated, purified, and analyzed repeatedly by tandem mass spectrometry. Clearly, in the case where spectral libraries are not available it would be desirable to be able to use less accurate model spectra derived from nonspecific training over a diverse set of peptides (generic model spectra). However, compared to spectral libraries, the generic models usually do not fit the data particularly well. This is illustrated for various model spectra in Figure 1. Assuming that fragment mass-to-charge ratios are accurately accounted for and the same set of peaks is used for identification (e.g., y-ions, b-ions), the major difference between accurate model spectra derived from spectral libraries and less accurate models is the expected counts of each fragment, which can vary from ni = 0 to a significant proportion of the observed fragment abundances. Alternatively, with the availability of large compute clusters, it may now be feasible to develop peptide-specific model spectra from molecular simulations. In order to obtain such probabilities, the relationship between the analysis scheme outlined above and molecular simulation models must be understood. A first step in this direction is to understand the relationship between observable experimental data and molecular level models.
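The multinomial likelihood of eq 2 and the count-based estimate of θi can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `estimate_theta` and `multinomial_log_likelihood` are hypothetical, and the counts are assumed to be non-negative integers with at least one nonzero entry.

```python
from math import exp, lgamma, log


def estimate_theta(counts):
    """Estimate theta_i = n_i / n_tot from observed fragment counts."""
    n_tot = sum(counts)
    return [n / n_tot for n in counts]


def multinomial_log_likelihood(counts, theta):
    """log p(n | theta) for eq 2.

    Computes log n_tot! + sum_j (n_j * log theta_j - log n_j!),
    using lgamma(n + 1) = log n! for numerical stability.
    """
    log_p = lgamma(sum(counts) + 1)
    for n, t in zip(counts, theta):
        log_p -= lgamma(n + 1)
        if n > 0:
            if t <= 0.0:
                # An observed fragment with zero model probability
                # makes the data impossible under the model.
                return float("-inf")
            log_p += n * log(t)
    return log_p


if __name__ == "__main__":
    observed = [2, 1, 1]                  # made-up fragment counts
    theta_hat = estimate_theta(observed)  # library-like, peptide-specific model
    generic = [1 / 3, 1 / 3, 1 / 3]       # generic model with equal probabilities
    print(multinomial_log_likelihood(observed, theta_hat))
    print(multinomial_log_likelihood(observed, generic))
```

In this toy case the probabilities estimated from the counts themselves give a higher likelihood than the equal-probability model, which mirrors the contrast drawn above between library spectra and generic model spectra.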
Relationship between Data Analysis and Thermodynamics. Statistical mechanics is the foundation for thermodynamics, which consists of the statistical description of the physics of the molecular processes and interactions. Here we elaborate on the relationship between the statistical analysis discussed above and the statistical thermodynamics of the same process. First, we note that the microscopic probabilities θi(ε,T) used above are the Boltzmann probabilities,

\[
\theta_i(\varepsilon_{il},T) = \frac{e^{-\varepsilon_{il}/RT}}{\sum_{j}^{k}\sum_{l}^{\infty} e^{-\varepsilon_{jl}/RT}}
\]

in which εil is the lth energy level for the fragment produced at bond i of the original peptide.14 The denominator is the multifragment partition function that accounts for fragmentation at all peptide bonds j, q = ∑_j^k ∑_l^∞ e^{-εjl/RT}. The partition function is the cumulative distribution of the likelihood, which measures the extent of the available state space. Marginalizing this probability over the available energy levels for a specific fragment gives

\[
\theta_i(T) = \frac{\sum_{l} e^{-\varepsilon_{il}/RT}}{q} = \frac{q_i}{q}
\]

where qi is the molecular partition function that accounts for energy levels for the fragment produced at bond i of the original peptide. Also, it can be verified that the likelihood of eq 2 is proportional to the product of the molecular partition functions defined above,

\[
\begin{aligned}
p(\mathbf{n}\mid\boldsymbol{\theta}) &= n_{\mathrm{tot}}! \prod_{j}^{k} \frac{1}{n_j!}\,\theta_j(T)^{n_j} \\
&= n_{\mathrm{tot}}! \prod_{j}^{k} \frac{1}{n_j!}\left(\frac{q_j}{q}\right)^{n_j} \\
&= \frac{n_{\mathrm{tot}}!}{q^{n_{\mathrm{tot}}}} \prod_{j}^{k} \frac{1}{n_j!}\,q_j^{n_j}
\end{aligned}
\]

As before, ntot = ∑nj. Since the peptide fragmentation occurs in the closed collision chamber of a mass spectrometer, we will additionally use the ensemble parameters V and T (volume and temperature, respectively) to characterize the system. The appropriate free energy is then the Helmholtz free energy,

\[
\begin{aligned}
-a/k_BT &= \log\left( n_{\mathrm{tot}}! \prod_{j}^{k} \frac{1}{n_j!}\,q_j^{n_j} \right) \\
&= \log\left( n_{\mathrm{tot}}! \prod_{j}^{k} \frac{1}{n_j!}\,q_j^{n_j} \right) + \log\frac{1}{q^{n_{\mathrm{tot}}}} - \log\frac{1}{q^{n_{\mathrm{tot}}}} \\
&= \log\left( \frac{n_{\mathrm{tot}}!}{q^{n_{\mathrm{tot}}}} \prod_{j}^{k} \frac{1}{n_j!}\,q_j^{n_j} \right) - \log\frac{1}{q^{n_{\mathrm{tot}}}} \\
&= \log p(\mathbf{n},V,T\mid\boldsymbol{\theta}) + n_{\mathrm{tot}} \log q
\end{aligned}
\]

Here, kB is Boltzmann’s constant and T is the absolute temperature in K. This says that the free energy of a system with n = {ni} species, volume V and temperature T depends on the likelihood p(n,V,T|θ) and the total state space, or likelihood distribution, available to the molecules. We use a lower case a to symbolize this as the free energy density because the vector quantities n = {ni} are fixed. The only assumption about equilibrium regarding θi is that the degrees of freedom in the physical system are coupled such that energy can be transferred among the species; that is, it is assumed that the θi reflect the true Boltzmann probabilities. Restating,

\[
-\log p(\mathbf{n},V,T\mid\boldsymbol{\theta}) = a/k_BT + n_{\mathrm{tot}} \log q
\]

That is, the likelihood is the free energy without the normalization over the available state space. The reason why free energy is formulated this way is that the state space (distributions) of reactants and products of a chemical reaction will in general be different distances away from absolute zero and perfect order; hence, normalization of each of the free energies of the reactants and products by their respective state spaces incorrectly places reactants and products on equivalent scales. For a general chemical reaction given by

\[
a_1 + a_2 + \cdots + a_m \rightleftharpoons b_1 + b_2 + \cdots + b_l \tag{3}
\]

the relative free energy that determines whether the products or reactants will be observed is given by the statistical mechanical relationship14

\[
\Delta a/k_BT = -\log \frac{\, n_{b,\mathrm{tot}}! \prod_{j} \frac{1}{n_{b,j}!}\, q_{b,j}^{\,n_{b,j}} \,}{\, n_{a,\mathrm{tot}}! \prod_{j} \frac{1}{n_{a,j}!}\, q_{a,j}^{\,n_{a,j}} \,}
\]

Here, qa,i is the partition function of the reactant ai, qb,j is the partition function of the product bj, na,i is the count of the reactant ai, and nb,j is the count of the product bj. The numerator is related to the likelihood distribution for observing all the products while the denominator is related to the likelihood distribution for observing all the reactants. (The partition functions above differ from those presented by McQuarrie14 because each product bi can be produced in any amount, resulting in an nb,tot choose nb,i problem, $\binom{n_{b,\mathrm{tot}}}{n_{b,i}}$, while for the chemical reactions discussed by McQuarrie the products are constrained according to the stoichiometry of the specified reaction such that one only needs to correct for the indistinguishability of each product bi, 1/(nb,i!).) The statistical mechanical formulation of the free energy is related to the log-likelihood ratio of reactants and products,

\[
\Delta a/k_BT = -\log \frac{p_b(\mathbf{n}_b,V,T\mid\boldsymbol{\theta})}{p_a(\mathbf{n}_a,V,T\mid\boldsymbol{\theta})} - \log \frac{q_b^{\,n_{b,\mathrm{tot}}}}{q_a^{\,n_{a,\mathrm{tot}}}} \tag{4}
\]

While the first term on the right-hand side compares the likelihood of each observed distribution, the second term compares the extent of each state space including the distance of each distribution from perfect order at absolute zero.

The log-likelihood provides a physically and chemically principled approach that motivates the use of a similar log likelihood for inference in which the task is to choose the best peptide match to an experimental spectrum. The inference is analogous to deciding from reaction conditions whether equilibrium favors products or reactants. In the inference case, the “chemical reaction” is a hypothetical reaction in which fragments from peptide A are transformed into fragments from a different peptide B (eq 3). Alternatively, peptide B can be a hypothetical, null-hypothesis “peptide” represented by probabilities due to the chance of randomly matching a peak at that location. Then the interpretation is that of a hypothetical reaction in which fragments from peptide A are transformed into fragments from a null model peptide B.

Importantly, the formulation of the free energy-inspired log-likelihood ratio provides a chemically principled solution to an outstanding problem in peptide identification regarding how to incorporate intensity information into scoring. The formulation suggests that the intensity information be used as an exponent to the probability of observing a given fragment. In principle, the statistical log likelihood used for inference would have the form of a multinomial model as in eq 2. As before, θi is the probability of observing a specific fragment, derived either from training, spectral libraries, or simulation (discussed below). The determination of whether peptide A or peptide B is a better match to the spectrum can be made using a log likelihood ratio,

\[
LR = -\log \frac{p_b(\mathbf{n}_b,V,T\mid\boldsymbol{\theta})}{p_a(\mathbf{n}_a,V,T\mid\boldsymbol{\theta})}
= -\log \frac{\, n_{b,\mathrm{tot}}! \prod_{j} \frac{1}{n_{b,j}!}\, \theta_{b,j}^{\,n_{b,j}} \,}{\, n_{a,\mathrm{tot}}! \prod_{j} \frac{1}{n_{a,j}!}\, \theta_{a,j}^{\,n_{a,j}} \,} \tag{5}
\]

In comparison to the relative free energy (eq 4), the log-likelihood ratio above does not include the term -log(q_b^{n_b,tot}/q_a^{n_a,tot}) that accounts for the distances between the two distributions. This term is important when the two distributions are not equally likely, such as in collision-induced dissociation when greater collision energy is required to fragment one peptide relative to another. For example, this is the case when comparing peptides of different sequences, lengths and charge states. This demonstrates how using a scientifically principled approach to data analysis can illuminate assumptions present in the models used for inference. If i in eq 5 refers to a specific peak in the spectrum such that ni is the same for peptide A and peptide B, then the log-likelihood ratio simplifies to

\[
LR = \log \prod_{j}^{N} \left( \frac{\theta_{a,j}}{\theta_{b,j}} \right)^{n_j} = \sum_{j}^{N} n_j \log\left( \frac{\theta_{a,j}}{\theta_{b,j}} \right) \tag{6}
\]
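The relationship between the relative free energy (eq 4) and the log-likelihood ratio (eqs 5 and 6) can be checked numerically. The Python sketch below rests on stated assumptions: energy levels are supplied already divided by RT, both models are scored against the same counts nj, and the names `partition_functions`, `log_weight`, `log_likelihood_ratio`, and `decompose` are hypothetical. The energy levels in the usage example are made-up numbers, not fitted values.

```python
from math import exp, lgamma, log


def partition_functions(levels):
    """levels[j] lists the reduced energies eps_jl / RT of fragment j.

    Returns the per-fragment partition functions q_j and their sum q.
    """
    q_j = [sum(exp(-e) for e in frag) for frag in levels]
    return q_j, sum(q_j)


def log_weight(counts, q_j):
    """log( n_tot! * prod_j q_j**n_j / n_j! ), i.e. -a / k_B T."""
    out = lgamma(sum(counts) + 1)
    for n, q in zip(counts, q_j):
        out += n * log(q) - lgamma(n + 1)
    return out


def log_likelihood_ratio(counts, theta_a, theta_b):
    """Eq 6: sum_j n_j * log(theta_a_j / theta_b_j)."""
    return sum(n * log(ta / tb)
               for n, ta, tb in zip(counts, theta_a, theta_b) if n > 0)


def decompose(levels_a, levels_b, counts):
    """Return (delta_a, lr, q_term) so that delta_a == lr + q_term (eq 4)."""
    qa_j, qa = partition_functions(levels_a)
    qb_j, qb = partition_functions(levels_b)
    theta_a = [q / qa for q in qa_j]
    theta_b = [q / qb for q in qb_j]
    delta_a = -(log_weight(counts, qb_j) - log_weight(counts, qa_j))
    lr = log_likelihood_ratio(counts, theta_a, theta_b)
    q_term = -sum(counts) * (log(qb) - log(qa))   # -log(q_b^n / q_a^n)
    return delta_a, lr, q_term


if __name__ == "__main__":
    a = [[0.0, 1.0], [0.5], [2.0]]     # hypothetical levels for peptide A
    b = [[1.0], [0.2, 0.9], [0.1]]     # hypothetical levels for peptide B
    print(decompose(a, b, [5, 3, 2]))
```

The first returned value equals the sum of the other two, so the likelihood ratio alone omits exactly the state-space term of eq 4.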

Strictly speaking, the ni refer to counts of molecules. A separate issue is whether counts of fragments can be derived from the abundances obtained from mass spectra. For mass detectors that use a multiplier, it may be more appropriate to use log-transformed values. Furthermore, using raw values measured from mass spectrometry for ni can be problematic if the values of ni are quite large. For example, if the estimated counts are in the femtomolar range (10^8 molecules), then this can lead to quite large values of the likelihood. The counts could be converted to molar values, but with the opposite effect that the likelihood values become quite small. An intermediate solution is to scale the estimated counts by the total counts, ni/ntot, which is then an empirical estimate of the probability for observing these fragments, Fi. Scaling the counts in such a manner leads to the information theory entropy,15

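\[
\frac{1}{n_{\mathrm{tot}}} \sum_{j}^{k} n_j \log\left( \frac{\theta_{a,j}}{\theta_{b,j}} \right) = \sum_{j}^{k} F_j \log\left( \frac{\theta_{a,j}}{\theta_{b,j}} \right) \tag{7}
\]

The comparable equation derived from eq 4 would give

\[
\sum_{j}^{k} F_j \log\left( \frac{\theta_{a,j}}{\theta_{b,j}} \right) + 1 \cdot \log \frac{\theta_a}{\theta_b}
\]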
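Because eq 7 is eq 6 divided by ntot, multiplying every count by a constant changes the unscaled likelihood ratio but leaves the scaled score unchanged. A minimal Python sketch of this follows. The function names `relative_entropy_score` and `likelihood_ratio` and the numbers are illustrative assumptions, not values from the data set.

```python
from math import log


def relative_entropy_score(counts, theta_a, theta_b):
    """Eq 7: sum_j F_j * log(theta_a_j / theta_b_j), with F_j = n_j / n_tot."""
    n_tot = sum(counts)
    return sum((n / n_tot) * log(ta / tb)
               for n, ta, tb in zip(counts, theta_a, theta_b) if n > 0)


def likelihood_ratio(counts, theta_a, theta_b):
    """Eq 6, recovered as n_tot times the scaled score of eq 7."""
    return sum(counts) * relative_entropy_score(counts, theta_a, theta_b)


if __name__ == "__main__":
    model_a = [0.6, 0.3, 0.1]          # hypothetical peptide-specific model
    model_b = [1 / 3, 1 / 3, 1 / 3]    # hypothetical null model
    weak = [6, 3, 1]                   # low-abundance spectrum
    strong = [600, 300, 100]           # same pattern, 100x the counts
    for counts in (weak, strong):
        print(likelihood_ratio(counts, model_a, model_b),
              relative_entropy_score(counts, model_a, model_b))
```

The two spectra receive the same eq 7 score, while their eq 6 values differ by the factor of 100 in total counts.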

Equation 7 can be interpreted as the relative entropy between the probability space for peptide A (alternate hypothesis) and the probability space for peptide B (null hypothesis), as averaged over the observed space. The difference between the unscaled likelihood ratio statistic (eq 6) and the information theory entropy (eq 7) is that the likelihood ratio is an extensive measure in that the value of the likelihood depends on the number of particles observed, while the entropy is an intensive measure that depends only on the relative number of particles observed. Consequently, in principle, the likelihood ratio provides a measure that can be used to evaluate the quality of different spectra, while the information theory entropy does not. In order for either statistic to be used for inference, the significance of the score must be determined in an appropriate manner, such as that suggested by Klammer, et al.16

To demonstrate the usefulness of incorporating both (1) intensity information and (2) the subsequent combined analysis of spectral libraries with standard database searching, we compared the identification of peptides using eq 7 to our previous likelihood model, which was based on the presence or absence of key peaks.4 Since, as we discuss above, it is not clear whether intensities should be used directly or whether log-transformed intensities should be used, we compared both approaches to the previous presence/absence likelihood model. The results are presented in Table 1. To summarize, the use of eq 7 with log intensities leads to an 11.9% increase in the correct identification of peptides relative to the presence/absence likelihood model (4852 spectra correctly identified compared to 4338 spectra). Furthermore, the inclusion of spectral libraries in the analysis leads to a further increase to 25.5% above the base rate obtained using the presence/absence likelihood model (5445 spectra compared to 4338 spectra). Considering only peptides for which a spectral library was available, the increase in the identification rate was 31.6% (3848 spectra identified compared to 2923 spectra). Increasing the number of peptides for which a spectral library is available is expected to increase this rate further due to a reduction in false positives when the fragmentation patterns of these false positives are more accurately represented. The use of direct but scaled (normalized) peak intensities actually showed an initial decrease in the identification rate, which was alleviated by the inclusion of spectral libraries. This implies that the use of direct intensity information works well when the model spectrum is an accurate representation of the experimental spectrum but degrades the performance when the model spectrum does not faithfully represent the experimental spectrum. For the presence/absence model, the inclusion of spectral libraries in the analysis resulted in a modest improvement (8%) in the number of spectra correctly identified with a peptide (4684 spectra compared to 4338 spectra). One additional advantage to using a likelihood-based scoring function for the identification of peptides instead of correlation, dot products, or other approaches is that it allows for the direct use of model spectra derived from theory and simulation, as we discuss next.

TABLE 1: Comparison of the Identification of Peptides Using Eq 7 to the Likelihood Model Based on the Presence or Absence of Key Peaks for Both Standard Database Searching and Database Searching Combined with Spectral Libraries

log-likelihood method          presence/absence      intensities           log intensities
spectral library included?     no        yes         no         yes        no        yes
library peptides correct       2923      3184        2483       3249       3399      3848
total correct                  4338      4684        3538       4562       4852      5445
% library peptides correct     8.39%     9.14%       7.13%      9.33%      9.76%     11.05%
% total correct                12.45%    13.45%      10.16%     13.10%     13.93%    15.63%
% increase                     0%        7.98%       -18.44%    5.16%      11.85%    25.52%

Fragmentation Probabilities from Simulations. A statistical scoring function used for data analysis that is consistent with statistical mechanical formulations can allow for the use of probability distributions obtained from simulations in cases when spectral libraries are not available. To demonstrate how these probabilities can be obtained, we will break down peptide fragmentation into two conceptual processes, shown in Figure 2, consistent with the mobile proton model of fragmentation.17 The first process is the organization of the vibrationally excited gas phase peptide from random configurations into a configuration from which a proton can be (but is not yet) transferred from one of the proton donor groups (N-terminal amine or protonated side chain such as lysine or arginine) to a backbone carbonyl oxygen. This process can involve large-scale structural reorganization of both backbone and side chain groups. We will refer to this organized state, primed for proton transfer, as the prereactive state. The second process is the transfer of the proton to the carbonyl oxygen and the stretching and bending of bonds required to cross the transition state energy barrier and form products.
Figure 2. Configurational reorganization of a peptide required for fragmentation at a peptide bond. Work must be done in the form of free energy to move the peptide from random configurations (left) into one in which the peptide is organized (center) so that the chemical bond-breaking and bond-making steps (right) can occur. The free energy is proportional to the statistical likelihood of reaching this configuration. The work required to reach the transition state from the prereactive state is more difficult to characterize.

Characterization of the transition state energies is exceedingly difficult even with the use of large compute resources because of the multitude of reaction channels that can lead to products.18 If one reaction channel dominates, then calculation of the transition state energy is much easier, but still a time-consuming and potentially labor-intensive calculation. However, in the special case that fragmentation at each peptide bond occurs through reaction channels whose relative transition state energies do not change as a function of the position of the fragmenting bond along the peptide backbone or the specific amino acids involved, it may be reasonable to estimate an average likelihood of crossing the transition state θi(T)TS from the prereactive state that can be applied to all labile bonds. In this case, the likelihood of fragmenting a bond can then be estimated by combining the likelihood of reaching the preorganized state with the likelihood of crossing the transition state,

\[
\theta_i(T)_{\mathrm{fragmentation}} = \theta_i(T)_{\mathrm{pre\text{-}react}} \cdot \theta_i(T)_{\mathrm{TS}}
\]

Next, we discuss how the likelihood of reaching the prereactive state from random configurations can be estimated from molecular simulations. The likelihood that a peptide will organize itself into the prereactive state depends on the free energy of this process through the Boltzmann relation,

\[
\theta_{i,\mathrm{pre\text{-}react}}(T) = e^{-(A_{\mathrm{pre\text{-}react}} - A_{\mathrm{random}})/RT}
\]

Here, R is the gas constant (R = NAkB, where NA is Avogadro’s number), T is the absolute temperature, Apre-react is the Helmholtz free energy per mole of the prereactive complex, and Arandom is the Helmholtz free energy per mole of the peptide as it explores random configurations in the gas phase. This probability can also be determined by counting the occupancy of the random and prereactive states,

\[
\theta_{i,\mathrm{pre\text{-}react}}(T) = \langle N_{\mathrm{pre\text{-}react}} / N_{\mathrm{random}} \rangle
\]

Alternatively, configurational distribution functions can be determined from molecular simulations as a function of a generalized coordinate, r. The configurational distribution function formed in this manner is referred to as the radial distribution function, g(r).14 The generalized coordinate that is relevant here is the distance between the proton donor groups and the backbone carbonyl oxygen involved in the reaction,

\[
r_{\mathrm{donor,acceptor}} = \left[ (x_{\mathrm{donor}} - x_{\mathrm{acceptor}})^2 + (y_{\mathrm{donor}} - y_{\mathrm{acceptor}})^2 + (z_{\mathrm{donor}} - z_{\mathrm{acceptor}})^2 \right]^{1/2}
\]

Integration of the radial distribution function results in the probability of the prereactive complex,

\[
\theta_{i,\mathrm{pre\text{-}react}}(T) = \frac{\int_{r=0}^{r=r_{\mathrm{pre\text{-}react}}} g(r)\,dr}{\int_{r=0}^{r=\infty} g(r)\,dr}
\]

Figure 3. Distribution of configurations, g(r), from which a proton can be donated at a distance, r, from the N-terminal amine to a specific backbone carbonyl oxygen of the 9mer polyalanine. Integration of each distribution from r = 0 to r = r gives the cumulative probability of finding the respective carbonyl oxygen within r angstroms from the N-terminal nitrogen.
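The estimate of the prereactive probability from sampled configurations can be sketched in Python as follows. This is a minimal illustration under stated assumptions: g(r) is treated as a plain histogram of sampled donor-acceptor distances, so the ratio of integrals reduces to the fraction of samples within a cutoff. The cutoff value, the sample distances, and the function names are hypothetical.

```python
from math import exp, log, sqrt

R_GAS = 8.314462618e-3  # molar gas constant in kJ/(mol K)


def donor_acceptor_distance(donor, acceptor):
    """Euclidean distance between donor and acceptor (x, y, z) coordinates."""
    return sqrt(sum((d - a) ** 2 for d, a in zip(donor, acceptor)))


def prereactive_probability(distances, r_cut):
    """Fraction of sampled configurations with distance <= r_cut.

    Equivalent to integrating a histogram of the sampled distances
    from 0 to r_cut and dividing by its integral over all distances.
    """
    return sum(1 for r in distances if r <= r_cut) / len(distances)


def free_energy_difference(theta, temperature):
    """Invert theta = exp(-(A_pre - A_random) / RT); result in kJ/mol."""
    return -R_GAS * temperature * log(theta)


if __name__ == "__main__":
    sampled = [2.0, 2.5, 3.0, 6.0]    # made-up distances in angstroms
    theta = prereactive_probability(sampled, r_cut=3.0)
    print(theta, free_energy_difference(theta, 300.0))
```

In a real application the distances would come from a molecular simulation trajectory, and the cutoff defining the prereactive state would have to be justified for the specific donor and acceptor groups.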

Radial distribution functions g(r) for the model peptide AAAAAAAAA (A9) are shown in Figure 3. In this peptide, the source of the proton is the N-terminal amine. The only groups available to compensate electrostatically for the charged group are the backbone carbonyl oxygens. At any given time, a subset of the backbone carbonyls orient their dipoles toward the amine group in order to satisfy the electrostatic attraction between charged moieties. The dynamic nature of these interactions results in the radial distribution functions shown in Figure 3. In other peptides, the distance between the positively charged donor groups and the electronegative carbonyl oxygen will depend on how well the positive charge can be solvated by the carbonyl oxygen, relative to how well the donor group can be solvated by similar groups in the peptide. Since all amino acids contain the backbone carbonyl group, amino acid side chains are the largest differential modulators of these interactions. Dipolar amino acid side chain groups can provide alternate charge solvation and weaken the interaction between the donor group and the backbone carbonyl oxygen. As a result, the free energy required to organize a peptide from random configurations into a prereactive configuration is very sensitive to the amino acid sequence.19

Figure 4. Thermodynamic cycle comparing the electrospray ionization and CID fragmentation of two peptides. The upper horizontal path describes the thermodynamic process of charging, desolvation, and fragmentation of peptide A, while the lower horizontal path describes the same process for peptide B. The vertical paths labeled $\Delta G_{\text{ion},B \to A}$ and $\Delta G_{\text{frag},B \to A}$ describe the alchemical free energy differences25 between charging peptide A relative to peptide B, and fragmenting peptide A relative to peptide B, respectively.

Frequently, the only evidence used to differentiate which peptide out of many is responsible for the observed fragmentation pattern is the fragmentation pattern itself. However, as shown in Figure 4, the process of electrospray ionization followed by fragmentation requires charging of the peptide at a specified pH, desolvation of the peptide as it is introduced into the gas phase, and then subsequent fragmentation. The free energy for the total process is the sum of the free energies for each step, $\Delta G_{\text{ion}} + \Delta G_{\text{desolv}} + \Delta G_{\text{frag}}$, and the identification of a peptide would ideally consider each process. The differentiation between two peptides would then consider the total free energy difference for the two processes,

$$\Delta\Delta G_{\text{total}} = \Delta G_{\text{ion},A} + \Delta G_{\text{desolv},A} + \Delta G_{\text{frag},A} - \Delta G_{\text{ion},B} - \Delta G_{\text{desolv},B} - \Delta G_{\text{frag},B} = \Delta G_{\text{ion},B \to A} - \Delta G_{\text{frag},B \to A}$$

The use of only fragmentation patterns to differentiate between peptides is tantamount to only considering the fragmentation step $\Delta G_{\text{frag},B \to A}$. Recent work characterizing identifiable peptides20-24 implicitly seeks to account for the ionization and desolvation steps as well as additional processes involved in sample preparation.

Conclusion

A statistical scoring function used for data analysis that is consistent with thermodynamic formulations can potentially be used to bring in additional information regarding the thermodynamics of the fragmentation process. The use of simulations to provide fragmentation probabilities discussed above is one example. In addition, the work required to desolvate each candidate peptide prior to fragmentation could also be taken into account.

Other benefits include a physically and chemically principled solution to the outstanding problem of how to account for peak intensities when evaluating mass spectra. The incorporation of peak intensity information in this manner has two advantages. First, it provides a consistent way to evaluate the match of an experimental spectrum to a set of peptides when each match may be evaluated with model spectra developed from different sources. Generally, empirical model spectra are derived from either spectral libraries or from training over sets of peptides having diverse sequences. These two types of model spectra are in common use but are generally used in separate analyses. Using the formulations above, we have extended our previous code4 to have the capability to use both model spectra from training and spectral libraries in the same analysis, with the result being a significantly improved rate (25-32%) of identifying those peptides for which spectral libraries are available.
Second, as discussed above, a chemically principled scoring function opens the door for the development of sequence-specific model spectra from theory and simulation. Additionally, relating data analysis to physical principles of fragmentation may ultimately provide another conceptual tool for understanding the complex mechanisms involved in barrier crossing during collisional dissociation. Finally, although we used peptide fragmentation as the topic for this report, the concepts discussed herein are generally applicable to many areas where it would be advantageous to have data analysis methods that are congruent with the underlying physical processes.

Acknowledgment. This work was supported through the U.S. Department of Energy (DOE) Office of Advanced Scientific Computing Research under Contracts No. 47901 and 54976 and the Office of Biological and Environmental Research under Contract No. 54976. We thank researchers in the laboratory of Richard Smith at PNNL who generated the data sets herein, as well as Dr. Smith, for providing the data. A portion of the research was performed at the Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the Department of Energy’s Office of Biological and Environmental Research and located at Pacific Northwest National Laboratory. The Pacific Northwest National Laboratory is a multiprogram laboratory operated by Battelle for the U.S. DOE under Contract No. DE-AC06-76L01830.

References and Notes

(1) Eng, K.; McCormack, A. L.; Yates III, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(2) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. J. Comput. Biol. 1999, 6, 327–342.
(3) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435–444.
(4) Cannon, W. R.; Jarman, K. H.; Webb-Robertson, B.-J.; Baxter, D. J.; Oehmen, C. S.; Jarman, K. D.; Heredia-Langner, A.; Auberry, K. J.; Anderson, D. C. J. Proteome Res. 2005, 4, 1687–1698.
(5) Beer, I.; Barnea, E.; Ziv, T.; Admon, A. PROTEOMICS 2004, 4, 950–960.
(6) Craig, R.; Cortens, J. C.; Fenyo, D.; Beavis, R. C. J. Proteome Res. 2006, 5, 1843–1849.
(7) Yates, J. R., 3rd; Morgan, S. F.; Gatlin, C. L.; Griffin, P. R.; Eng, J. K. Anal. Chem. 1998, 70, 3557–65.
(8) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22, 214–219.
(9) Kim, S.; Gupta, N.; Pevzner, P. A. J. Proteome Res. 2008, 7, 3354–63.
(10) Purvine, S.; Picone, A. F.; Kolker, E. Omics J. Integr. Biol. 2004, 8, 79–92.
(11) Keller, A.; Purvine, S.; Nesvizhskii, A. I.; Stolyar, S.; Goodlett, D. R.; Kolker, E. Omics 2002, 6, 207–12.
(12) Aebersold, R. Nature 2003, 422, 115–6.
(13) Craig, R.; Cortens, J. P.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2005, 19, 1844–50.
(14) McQuarrie, D. A. Statistical mechanics; Harper and Row: New York, 1976.
(15) Shannon, C. E. Bell System Tech. J. 1948, 27, 379–423.
(16) Klammer, A. A.; Park, C. Y.; Noble, W. S. J. Proteome Res. 2009, 8, 2106–13.
(17) Dongre, A. R.; Jones, J. L.; Somogyi, A.; Wysocki, V. H. J. Am. Chem. Soc. 1996, 118, 8365–8374.
(18) Paizs, B.; Suhai, S. Mass Spectrom. Rev. 2005, 24, 508–548.
(19) Cannon, W. R.; Taasevigen, D.; Baxter, D. J.; Laskin, J. J. Am. Soc. Mass Spectrom. 2007, 18, 1625–37.
(20) Fusaro, V. A.; Mani, D. R.; Mesirov, J. P.; Carr, S. A. Nat. Biotechnol. 2009, 27, 190–8.
(21) Kuster, B.; Schirle, M.; Mallick, P.; Aebersold, R. Nat. Rev. Mol. Cell Biol. 2005, 6, 577–83.
(22) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Ranish, J.; Raught, B.; Schmitt, R.; Werner, T.; Kuster, B.; Aebersold, R. Nat. Biotechnol. 2007, 25, 125–31.
(23) Tang, H.; Arnold, R. J.; Alves, P.; Xun, Z.; Clemmer, D. E.; Novotny, M. V.; Reilly, J. P.; Radivojac, P. Bioinformatics 2006, 22, e481–8.
(24) Webb-Robertson, B. J.; Cannon, W. R.; Oehmen, C. S.; Shah, A. R.; Gurumoorthi, V.; Lipton, M. S.; Waters, K. M. Bioinformatics 2008.
(25) McCammon, J. A. Science 1987, 238, 486–91.

JP905049D