Article pubs.acs.org/jpr
Quantitative Statistical Analysis of Standard and Human Blood Proteins from Liquid Chromatography, Electrospray Ionization, and Tandem Mass Spectrometry
Peter Bowden,† Thanusi Thavarajah,† Peihong Zhu,† Mike McDonell,‡ Herbert Thiele,§ and John G. Marshall*,†
† Department of Chemistry and Biology, Ryerson University, 350 Victoria Street, Toronto, Canada
‡ Bruker Daltonics, Inc., Billerica, Massachusetts, United States
§ Bruker Daltonik GmbH, Bremen, Germany
Supporting Information
ABSTRACT: It is important to determine whether the parent and fragment ion intensity results of liquid chromatography, electrospray ionization and tandem mass spectrometry (LC−ESI−MS/MS) experiments have been randomly and independently sampled from a normal population, so that they may be analyzed statistically by general linear models and ANOVA. The tryptic parent peptide and fragment ion m/z and intensity data in the Mascot generic (mgf) files from LC−ESI−MS/MS of purified standard proteins, and of human blood proteins fractionated by partition chromatography, were parsed into a Structured Query Language (SQL) database and matched with the protein and peptide sequences provided by the X!TANDEM algorithm. The many parent and/or fragment ion intensity values were log transformed, tested for normality, and analyzed using the generic Statistical Analysis System (SAS). Logarithmic transformation of both parent and fragment intensity values yielded distributions that closely approximate the normal distribution; that is, the raw intensity values were approximately log-normal. ANOVA models of the transformed parent and fragment intensity values showed significant effects of treatments, proteins, and peptides, as well as of parent versus fragment ion types, with a low probability of false positive results. Transformed parent and fragment intensity values were compared over all sample treatments, proteins or peptides by the Tukey-Kramer Honestly Significant Difference (HSD) test. The approach provided a complete and quantitative statistical analysis of LC−ESI−MS/MS data from human blood.
KEYWORDS: mass spectra, spectrum, parent ion, fragment ion, intensity, log, ln, transformation, normal distribution, ANOVA, Tukey-Kramer Honestly Significant Difference test, human blood, LC−ESI−MS/MS, structured query language SQL, statistical analysis system SAS, parent fragment ion intensity, log transformation, normal distribution, general linear model, analysis of variance ANOVA
■
INTRODUCTION
Elution peaks in chromatography take the form of a Gaussian (normal) distribution. During unbiased liquid chromatography followed by electrospray ionization and tandem mass spectrometry (LC−ESI−MS/MS), the elution time of the peaks varies, and the MS and MS/MS data are collected randomly over the course of the peak elution and not necessarily at the time of maximum peak intensity. Since there is random variation in the timing of the LC experiment and of the MS and MS/MS sampling, and since the sampling of a peak in one experiment has no effect on the next, LC−ESI−MS/MS approximates a random and independent sampling from a normal population. The prerequisite for the use of the statistical method Analysis of Variance (ANOVA) is that the data are randomly and independently sampled from a normal population. Hence, it might be very useful to use ANOVA to compare chromatography columns and fractions for the detection of proteins from their tryptic peptides. It remains to be considered whether parent and fragment intensity data from LC−ESI−MS/MS have been randomly and independently sampled from a normal distribution for statistical analysis by ANOVA.1 Data that have been randomly and independently sampled from a normal distribution may be modeled using ANOVA followed by means testing with correction for multiple comparisons. The conditions of random and independent sampling are immutable properties of the physical experiment. In contrast, a normal distribution may be attained by transformation of the data. The intensity distribution of the parent and fragment ions must approximate the normal distribution for the use of common statistical methods such as ANOVA. The intensity distributions of ions and noise from MS spectra have been related to the normal distribution, and log transformation resulted in more homogeneous variation.2,3 Previously the sampled ion intensity values were organized into bins along the m/z axis and analyzed by one way and multiple ANOVA prior to means comparison by the Tukey-Kramer Honestly Significant Difference test.4 Here the protein and peptide identification results of the X!TANDEM algorithm were used to organize the transformed parent and fragment intensity values prior to complete statistical analysis. We compared the presence of peptides from blood proteins over 12 pH fractions from two different chromatography resins.
Received: January 8, 2011
Published: February 9, 2012
© 2012 American Chemical Society
dx.doi.org/10.1021/pr2000013 | J. Proteome Res. 2012, 11, 2032−2047
Journal of Proteome Research
sampled by LC−ESI−MS/MS may be transformed with logarithmic or other functions. Ion intensity values could be fit to gamma, normal or log-normal distributions with the deviations from the calculated normal plotted. The most appropriate statistical approaches might be applied to the transformed parent and fragment intensities to reveal the compounds that differ significantly in intensity between treatments by general linear models and ANOVA with correction for multiple comparisons.4
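The idea that a transformation can bring skewed intensity values close to a normal distribution can be illustrated with a short sketch. It is not part of the original analysis: the intensities are simulated with Python's standard library, and the seed, the simulated values, and the helper name `probability_plot_r` are assumptions made for illustration. The sketch compares the normal probability plot correlation before and after a log10 transformation; a correlation near 1 means the ordered values fall close to a straight line on a normal probability plot.

```python
import math
import random
from statistics import NormalDist, mean


def probability_plot_r(values):
    """Correlation between ordered values and theoretical normal quantiles."""
    xs = sorted(values)
    n = len(xs)
    # Blom plotting positions approximate the expected normal order statistics.
    qs = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, mq = mean(xs), mean(qs)
    sxy = sum((x - mx) * (q - mq) for x, q in zip(xs, qs))
    sxx = sum((x - mx) ** 2 for x in xs)
    sqq = sum((q - mq) ** 2 for q in qs)
    return sxy / math.sqrt(sxx * sqq)


# Simulated intensities (arbitrary counts); these are not data from the study.
rng = random.Random(0)
raw = [rng.lognormvariate(math.log(1e4), 1.0) for _ in range(500)]
logged = [math.log10(v) for v in raw]

r_raw = probability_plot_r(raw)
r_log = probability_plot_r(logged)
print(f"raw intensities:   r = {r_raw:.3f}")
print(f"log10 intensities: r = {r_log:.3f}")
```

A formal normality test would normally accompany such a plot; the correlation is used here only because it needs no statistics package.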
Combining mgf and X!TANDEM Results into One Database
From each peptide a precursor ion mass/charge (m/z) ratio is measured and an ion intensity value is obtained. The precursor ion collected in the ion trap is fragmented, producing a set of fragment m/z values for which the charge is assumed by X!TANDEM to be +1, so the fragment m/z values often approximate the fragment [M + H]+. Subsequently the fragment m/z values are correlated to the amino acid sequences of the protein database by algorithms such as X!TANDEM. An SQL database may be used to capture the raw m/z and intensity values of the parent and fragment ions in the mgf files from populations of LC−ESI−MS/MS runs.8,9 Correlation algorithms such as X!TANDEM may be used to assign the parent and fragment ions in the mgf file to peptide sequences within proteins.10 Parsing the raw parent m/z and intensity values together with the protein, peptide and [M + H]+ assignments into the same SQL database permits the hierarchical organization of the data for subsequent complete statistical analysis. The many mgf files from the MS provide the sample treatment and the parent and fragment ion intensity values, while the correlation algorithm X!TANDEM provides the blocking variables of protein, peptide and calculated [M + H]+ value.
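As a rough illustration of combining both sources in one database, the sketch below parses a small mgf text into an in-memory SQLite database and joins it to an identification table. It is a minimal sketch, not the parser used in the study: the table and column names, the sample label, the spectrum values, and the single identification row standing in for an X!TANDEM result are all assumptions made for the example.

```python
import sqlite3

# Illustrative mgf text; the values are made up for the example.
SAMPLE_MGF = """BEGIN IONS
TITLE=scan 101
PEPMASS=484.7 15200
CHARGE=2+
175.1 310
246.2 1250
END IONS
BEGIN IONS
TITLE=scan 102
PEPMASS=652.3 8400
CHARGE=2+
201.1 95
END IONS
"""


def parse_mgf(text):
    """Yield (title, parent_mz, parent_intensity, fragments) for each spectrum."""
    title, parent, fragments = None, (None, None), []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            title, parent, fragments = None, (None, None), []
        elif line == "END IONS":
            yield title, parent[0], parent[1], fragments
        elif line.startswith("TITLE="):
            title = line.split("=", 1)[1]
        elif line.startswith("PEPMASS="):
            parts = line.split("=", 1)[1].split()
            intensity = float(parts[1]) if len(parts) > 1 else None
            parent = (float(parts[0]), intensity)
        elif "=" not in line:
            # Peak lines hold "m/z intensity"; other KEY=VALUE headers are skipped.
            mz, intensity = line.split()[:2]
            fragments.append((float(mz), float(intensity)))


def load_mgf(conn, sample, text):
    """Store parent and fragment ions from one mgf file under a sample label."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS parent (id INTEGER PRIMARY KEY, "
        "sample TEXT, title TEXT, mz REAL, intensity REAL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fragment "
        "(parent_id INTEGER, mz REAL, intensity REAL)"
    )
    for title, mz, intensity, fragments in parse_mgf(text):
        cur = conn.execute(
            "INSERT INTO parent (sample, title, mz, intensity) VALUES (?, ?, ?, ?)",
            (sample, title, mz, intensity),
        )
        conn.executemany(
            "INSERT INTO fragment VALUES (?, ?, ?)",
            [(cur.lastrowid, m, i) for m, i in fragments],
        )


conn = sqlite3.connect(":memory:")
load_mgf(conn, "QA_fraction_01", SAMPLE_MGF)

# Hypothetical identification row standing in for a parsed X!TANDEM result.
conn.execute(
    "CREATE TABLE identification "
    "(parent_id INTEGER, protein TEXT, peptide TEXT, mh REAL)"
)
conn.execute("INSERT INTO identification VALUES (1, 'ADH', 'EALDFFAR', 968.5)")

rows = conn.execute(
    "SELECT p.sample, i.protein, i.peptide, f.mz, f.intensity "
    "FROM parent p "
    "JOIN identification i ON i.parent_id = p.id "
    "JOIN fragment f ON f.parent_id = p.id "
    "ORDER BY f.mz"
).fetchall()
for row in rows:
    print(row)
```

The join returns one row per fragment ion, each carrying its sample, protein and peptide, which is the shape a statistics package needs to treat those columns as blocking variables.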
Random and Independent Sampling of LC−ESI−MS/MS
Proteins were loaded on quaternary amine (QA) anion exchange and propyl sulfate (PS) cation exchange columns, each yielding 12 pH fractions that were digested with trypsin, and the resulting peptides were separated over a reversed-phase column with sequential elution into the electrospray ionization source. The QA and PS columns were previously shown to be complementary.5 LC−ESI−MS/MS may be considered as a sampling of the m/z value and intensity of ions obtained from a population of peptides and the resulting fragments.6 Peptides from the QA and PS fractions were sampled as they eluted from the end of the reversed-phase column into the ionization source. The fragmentation of the most intense ions proceeds as the peptides elute from the end of the reversed-phase column in order of solubility in the water-acetonitrile gradient. The moment of sampling is not fixed with respect to the progress of the chromatographic separation. Some peptides ionize efficiently and others poorly. The mass-to-charge ratio (m/z) of each parent peptide ion is recorded with a small experimental error from the ideal value. In complex mixtures the peptides may be sampled once or a few times over the course of the eluting peak, depending in part on whether a list of recently sampled m/z values is temporarily ignored (rotating exclusion list). The sampling of the peptides from one experiment has no effect on the sampling of other peptides from the same or a separate experiment. After parent ions are detected in the MS scan, the target peptide may be fragmented to yield m/z and intensity values in the MS/MS scan. Variation in peptide sampling is the largest source of disagreement between LC−ESI−MS/MS experiments on blood.5,7
The Structural Relationship of LC−ESI−MS/MS Data
It is important to understand the structure of, and the relationships between, the elements of an LC−ESI−MS/MS data set prior to designing statistical analyses. The set of LC−ESI−MS/MS experiments contains different types of information, including treatments, fractions, replicates, retention times, ion numbers, and other data besides the parent and fragment ion m/z and intensity values. The ion intensity values, the m/z values and the results of the subsequent correlation analysis may be parsed into a database and the hierarchical relationships between the data elements established. There is a hierarchy of data arranged beneath each protein sequence: within one protein there may exist several peptide sequences, each with at least one parent ion and the resulting fragment ion m/z and intensity values. Mass spectrometers detect some peptides redundantly in many instances, other peptides only once, and still other peptides may never be detected even after many attempts.5,7,11 Each parent peptide has many fragment ions, each with a measured m/z and an intensity value. Thus, a large set of LC−ESI−MS/MS results is structured as a set of proteins, each identified from a family of at least one detected peptide, where each peptide may have many parent and fragment ions. While the m/z scale of the mass spectrometer is freely continuous,12,13 the observed peptide m/z values are not necessarily continuous variables. Instead the measured m/z values are best conceived of as an ideal value plus or minus a measurement error. Hence the MS/MS correlation algorithm matches the fragment m/z values onto the predicted [M + H]+ values of the peptide sequence.
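Treating a measured m/z as an ideal value plus or minus a measurement error suggests a simple assignment rule, sketched below. This is only an illustration of nearest-value matching within a tolerance and is not the scoring procedure of X!TANDEM; the function name `match_fragments`, the 0.5 Da default tolerance, and the example masses are assumptions chosen for the example.

```python
def match_fragments(observed, predicted, tolerance=0.5):
    """Pair each observed (m/z, intensity) with the nearest predicted value.

    Observed ions farther than `tolerance` (in Da) from every predicted
    value are left unassigned.
    """
    matches = []
    for mz, intensity in observed:
        nearest = min(predicted, key=lambda p: abs(p - mz))
        if abs(nearest - mz) <= tolerance:
            matches.append((nearest, mz, intensity))
    return matches


# Illustrative values only: two predicted fragment masses, three observed ions.
predicted = [175.119, 246.156]
observed = [(175.3, 300.0), (246.0, 1200.0), (400.0, 50.0)]
for ideal, measured, intensity in match_fragments(observed, predicted):
    print(f"predicted {ideal:.3f}  observed {measured:.1f}  intensity {intensity:.0f}")
```

Once each observed ion carries its predicted value, that value can serve as a discrete label for grouping intensities, rather than the continuous measured m/z.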
Transformation of Ion Intensity Values to a Normal Distribution
The MS and MS/MS spectra intensity values from electrospray remain to be thoroughly explored using transformations and statistical modeling. The fit of the peptide and fragment intensity values to the normal distribution after log transformation remains to be seen. The intensity values of the ions
Declaring the Variables as Nominal, Ordinal or Continuous
Once the relationship between the treatments, proteins, peptides and fragments has been established, it remains to declare the variables as nominal, ordinal, or continuous so that they may be analyzed with a statistical analysis system (SAS). Parent and fragment ion intensity values are continuous variables. The peptide [M + H]+ is an ordinal variable. The sample treatment, protein name and peptide sequence may be declared as nominal variables. The m/z values of the peptide fragments were correlated to known amino acid sequences. The treatment of the identified peptide, or the corresponding protein, as a nominal variable dramatically simplifies the statistical analysis of large sets of transformed intensity values from LC−ESI−MS/MS experiments without the requirement for retention times.
Preparation of Protein Standards
Separately, 1 mg of each standard, alcohol dehydrogenase (ADH), glycogen phosphorylase B (GPB), and cytochrome c (CYC), was weighed and dissolved in 1 mL of 100 mM Tris, 200 mM urea, 5% acetonitrile and digested with 25 μg of trypsin for 15 h at 37 °C before adding 500 μL of 1% acetic acid to the tube. The resulting digests were mixed such that 5 μL of the stock contained a final amount of 694 pmol of ADH, 260 fmol of GPB and 2 pmol of cytochrome c.
Preparative Fractionation and Digestion of Normal Human Serum
The proteins from 25 μL of normal human serum (NHS) were separated by cation and anion exchange chromatography resins that show complementary protein identifications.4,5,17 Serum contains on average ∼55 mg/mL of albumin, and so on the order of a milligram of HSA was applied to the column, which had a total binding capacity of ∼5 mg. Proteins from serum were prefractionated by preparative partition chromatography using propyl sulfate (PS) and quaternary amine (QA) chromatography columns as previously described.4,5,17 The proteins were assayed by the Dumbroff method.18 One hundred micrograms of protein from each column fraction was digested with 1 μg of trypsin for 12 h, and the sample was reduced for 30 min in 1 mM DTT at 50 °C before digestion with another 1 μg of trypsin for 4 h prior to freezing.5
Statistical Analysis
The ion intensity values were rendered normal by log transformation and then matched to the nominal peptide sequences within proteins, and so the problem can be addressed by general linear models (GLM) and analysis of variance (ANOVA). The intensity values over all experiments were examined using whole ANOVA models (treatment, proteins, peptides and ion types) followed by one way ANOVA with the appropriate correction for multiple comparisons, such as the Tukey-Kramer Honestly Significant Difference (HSD) test.4 Previously, the variance in MALDI m/z values from the same peptide between experimental treatments and replicates was addressed by comparing the intensity values in 5 Da bins along the m/z axis by one way and multiple ANOVA with no other relationships between variables.4 Sliding the 5 Da window one Da at a time revealed the register that produced maximal significance of the unknown ions.4 To take the approach of ion statistical analysis to its logical completion, correlation algorithms can be used to supply the structural relationships of parent and fragment ions to peptides and proteins, which will permit detailed statistical modeling using classical methods in SAS. The results of correlation algorithms such as X!TANDEM or PARAGON, from many different equipment manufacturers and algorithms, can be summarized in SQL along with parent and fragment ion intensity values and made available to SAS.7,11,14−16 A large population of parent and fragment intensity values from many LC−ESI−MS/MS experiments was matched to the protein name and peptide sequence variables supplied by the correlation algorithm to create complete models of each protein at the level of the many parent and fragment intensity values, which will yield powerful statistical models for the sampled peptides and proteins.
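The earlier binning approach mentioned above, in which intensities are grouped into 5 Da bins and the window is shifted one Da at a time, can be sketched as follows. This is a hypothetical illustration of the binning step only (the subsequent ANOVA per bin is not shown); the function name `bin_intensities`, the integer bin labels, and the example ions are assumptions.

```python
import math
from collections import defaultdict


def bin_intensities(ions, width=5.0, offset=0.0):
    """Group (m/z, intensity) pairs into bins of `width` Da shifted by `offset`."""
    bins = defaultdict(list)
    for mz, intensity in ions:
        index = math.floor((mz - offset) / width)
        bins[index].append(intensity)
    return dict(bins)


# Illustrative ions (m/z, intensity); not data from the study.
ions = [(1000.4, 1200.0), (1003.9, 900.0), (1005.2, 400.0), (1011.0, 2500.0)]

# Slide the register of the 5 Da window one Da at a time.
for offset in range(5):
    binned = bin_intensities(ions, width=5.0, offset=float(offset))
    print(offset, {index: len(values) for index, values in sorted(binned.items())})
```

Shifting the offset changes which neighboring ions share a bin, which is why different registers can give different levels of significance for the same data.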
■
MATERIALS AND METHODS
Materials
HPLC grade water and solvents from Caledon Laboratories (Mississauga, Ontario, Canada) were used for all steps. Buffers and salts were obtained from the Sigma-Aldrich chemical company (St Louis, MO) unless otherwise indicated. Sequencing grade trypsin was obtained from Roche (Basel, Switzerland). Preparative chromatography resins were obtained from Bio-Rad (Hercules, CA). Analytical C18 columns were obtained from Vydac (Hesperia, CA). The protein standards alcohol dehydrogenase (ADH) from yeast (Sigma A-7011, Mr = 36 kDa), glycogen phosphorylase B (GPB) from rabbit muscle (Sigma P-6635, Mr = 97.4 kDa) and cytochrome c (CYC) from bovine heart (Sigma C-2037, Mr = 12.3 kDa) were obtained from the Sigma-Aldrich chemical company (St Louis, MO).
LC−ESI−MS/MS
The tryptic peptides were collected over preparative C18 micro columns and separated over a 300 μm ID, 15 cm C18 (5 μm, 300 Å) reversed-phase column with an Agilent 1100 HPLC pump. The LC−ESI−MS/MS analysis was performed with an Esquire 3000 ion trap (Bruker Daltonics, Billerica, MA) as previously described.19 No target list was employed, but sampled peptide masses within 5 Da were ignored by a rotating exclusion list. A federated library of ∼135000 human proteins predicted from cDNA and genomic sequences from the NCBI, Swiss-Prot, Ensembl, TrEMBL and other sources was assembled and rendered distinct with SQL prior to output in FASTA format for correlation analysis. The MS/MS spectra were correlated by X!TANDEM against the semitryptic peptides of the federated human library, considering +1, +2 and +3 charge states, within −3 to +3 Da for parent ions [M + H]+ and within 0.5 Da for the +1 b and +1 y ions, with no modifications. Peptides with a ≥90% probability of correct identification by the classical goodness-of-fit test of the fragments by X!TANDEM were accepted into the protein model. The fragment ions in the mgf files were correlated to peptide sequences, and the expectation value of type I error was calculated using the goodness of fit by X!TANDEM as previously described.10,20,21
MGF and X!TANDEM Parser
Blank runs with the LC−ESI−MS/MS system indicated that the upper limit of the parent signal intensity of noise and contamination spectra was ∼E3.0 (about 1000 counts), and so parent ions with intensity values ≤ E3.0 were not converted to mgf outputs. Fragment ions of less than E2 (100) counts were not statistically analyzed, to avoid noise. The contents of the mgf files were parsed into an SQL database and the results of X!TANDEM were parsed into the same SQL database using the methods previously described.11 The X!TANDEM group result table contains a reference to the specific spectra in the mgf file that led to the peptide identification and is thus related to the peptide and protein identification results within the SQL database, which is subsequently imported for statistical analysis in SAS.11
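The intensity cutoffs described in this section can be expressed as a small filter. The sketch below is an assumption about how such a filter might look, not the authors' parser: the constant and function names are hypothetical, and only the two thresholds (parent ions at or below 10^3 counts excluded, fragment ions below 10^2 counts excluded) come from the text.

```python
PARENT_MIN = 1e3    # parent ions at or below this were not converted to mgf
FRAGMENT_MIN = 1e2  # fragment ions below this were not statistically analyzed


def filter_spectrum(parent_intensity, fragments):
    """Return the retained fragments, or None when the parent is at noise level."""
    if parent_intensity <= PARENT_MIN:
        return None
    return [(mz, inten) for mz, inten in fragments if inten >= FRAGMENT_MIN]


# Illustrative spectrum: one fragment falls below the fragment cutoff.
kept = filter_spectrum(15200.0, [(175.1, 95.0), (246.2, 100.0), (375.2, 1250.0)])
print(kept)
```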
indicated that they were not significantly different from normal by curve fit or the log probability plot but the sample sizes are too small for smooth curves (Figures 2 and 3).
Statistical Analysis
Blood Proteins
A database of the raw m/z and intensity data from the mgf files, matched to the corresponding peptide identifications from the X!TANDEM correlation algorithm, results in a format amenable to direct computation by a statistical analysis system (SAS). The transformed intensities of the many parent and fragment ions were mapped to the peptides and proteins provided by the X!TANDEM correlation algorithm. The peptide to protein distribution of the authentic blood data was compared to the null model from a random mass spectra generator by the Chi-square test, χ² = Σ[(observed − expected)²/expected].5,7,11−13 The nominal, ordinal and continuous variables were declared using the SAS JMP graphical interface. The variables treatment (column fraction), protein and peptide were declared nominal, the [M + H]+ value was declared ordinal, and the parent and fragment intensity values were declared continuous using the column information dialogue box. The many parent and fragment ion intensity values (arbitrary counts) were transformed to log10 using the column information > formula dialogue box. The inherent database relationship of transformed fragment and parent intensity values permitted complete ANOVA models of ion currents at the level of sample treatment, protein, peptide and ion type (parent versus fragment). General linear models were fit using the Analyze menu with the Fit Model option, selecting blocking variables such as sample treatment (file name), protein, peptide or fragment, and selecting response variables such as log parent intensity or log fragment intensity. First, significant effects on the mean transformed ion current of peptides and fragments at the level of sample treatment, protein, peptide and ion type were determined by ANOVA models of the whole experiment.
Subsequently the ion intensities of peptides, fragments or both were compared by one way ANOVA using the Analyze menu, selecting the Fit Y by X dialogue box, and selecting response variables such as log parent intensity or log fragment intensity and the appropriate blocking variables such as sample treatment, protein or peptide. The graphical results from the SAS JMP report were converted to metafiles for inclusion in figures. The ANOVA results were converted to rich text format for inclusion as tables. The whole reports were converted to text files for inclusion in the Supporting Information.
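For readers without access to the graphical interface, the core of the one way ANOVA on log10-transformed intensities can be sketched in a few lines. This is an illustrative calculation of the F statistic only, under stated assumptions: the fraction labels and intensity values are invented, the function name is hypothetical, and the p value and the Tukey-Kramer comparison (which require the F and studentized range distributions) are left to a statistics package.

```python
import math


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of level -> values."""
    values = [v for vals in groups.values() for v in vals]
    grand_mean = sum(values) / len(values)
    ss_between = 0.0
    ss_within = 0.0
    for vals in groups.values():
        group_mean = sum(vals) / len(vals)
        ss_between += len(vals) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in vals)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Illustrative parent intensities (arbitrary counts) by column fraction.
raw = {
    "QA_pH_4": [12000.0, 15000.0, 9800.0, 20100.0],
    "QA_pH_6": [4100.0, 5200.0, 3900.0, 6100.0],
    "PS_pH_4": [88000.0, 61000.0, 103000.0, 75000.0],
}
# Log transform the response; the treatment label is the nominal blocking variable.
logged = {level: [math.log10(v) for v in vals] for level, vals in raw.items()}

f_stat, df_b, df_w = one_way_anova(logged)
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

The same function applies unchanged when the blocking variable is the protein or the peptide instead of the sample treatment.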
All of the proteins identified were previously established blood proteins.7,11 Parent and fragment m/z and intensity results from MS and MS/MS spectra were linked to the calculated peptides, proteins and [M + H]+ values in SQL for statistical analysis in SAS. The intensity values were rendered normal by log transformation to yield general linear models of the data at the level of the whole experiment and to compare means by one way ANOVA, where variation was partitioned across treatments, proteins, peptides and fragments. The statistical analysis indicated that the parent and fragment ion data from many LC−ESI−MS/MS experiments together may be completely and efficiently analyzed by SQL and SAS on a personal computer. The results shown are arbitrarily selected to illustrate the nature of the analysis made with parent and/or fragment ion intensity values at the level of peptides or proteins. Statistical analyses using both parent and fragment ion intensity values of 12 pH fractions each from the quaternary amine (QA) and propyl sulfate (PS) chromatography resins are shown in the Supporting Information for completeness.
False Positive Identification Rate. The Chi-square test permits the comparison of frequency counts to determine whether the relative distribution of data across category bins differs. We compared the peptide to protein distribution with that of random expectation and calculated the Chi-square value for bins of 1, 2, and 3 or more peptides per protein. The peptide to protein tabulation of the entire distribution from many sets of LC−ESI−MS/MS fractions was compared to the results from random spectra.
The frequency distribution of experimental data may be compared to the expectation of random or false positive results, which typically show approximately 88% of proteins with a single peptide, 11% with two peptides and 1% (or less) with three or more peptides, depending on the method employed.12,13,22 Thus experimental data that show a greater proportion of proteins with many peptides, and a reduced frequency of proteins with one peptide, would appear to differ from random or false positive results. The standard use of the Chi-square test indicated there is little chance that the protein identifications from ion traps with high signal-to-noise ratios are the same as random false positive results.12,13 The use of the X!TANDEM algorithm for identification of human blood proteins with an ion trap has been shown to have an acceptable false positive rate with expectation scores of E-2 or less5 and shows good agreement with the PARAGON algorithm.16 Here the Chi-square tests of the peptide to protein distribution with respect to random expectation agreed that there was a low probability of false positive identification by X!TANDEM (Table 1). The peptide to protein distribution of the set of proteins identified by X!TANDEM was tested against the expected distribution of random MS/MS spectra (Table 1). The identified proteins showed a high peptide to protein distribution, with many proteins each showing at least several correlated peptides. In contrast, the random spectra show mostly proteins with one peptide. After calculating the expected peptide to protein distribution based on the frequency estimates of Zhu et al.,12,13 a Chi-square value of approximately 811 is obtained, showing a low probability that the complete data set from all quaternary amine and propyl sulfate fractions is the same as random false positive identifications (p < 0.0001).
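The Chi-square comparison against random expectation can be reproduced in outline as follows. This is a sketch under stated assumptions: the observed counts are hypothetical and are not the values of Table 1, so the result will not match the reported Chi-square of approximately 811; only the expected proportions (88%, 11%, 1%) are taken from the text. With three bins and fixed expected proportions there are two degrees of freedom, for which the upper-tail probability of the Chi-square distribution has the closed form exp(−x/2).

```python
import math


def chi_square(observed, expected_fractions):
    """Pearson chi-square of observed counts against expected proportions."""
    total = sum(observed)
    expected = [total * f for f in expected_fractions]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Random-expectation proportions for proteins with 1, 2, and 3 or more peptides.
RANDOM_EXPECTATION = [0.88, 0.11, 0.01]

# Hypothetical protein counts per bin; these are not the counts of Table 1.
observed = [50, 30, 20]

stat = chi_square(observed, RANDOM_EXPECTATION)
# Two degrees of freedom: the upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
```

Most of the statistic comes from the bin of three or more peptides, where random expectation predicts very few proteins.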
■
RESULTS
Standard Proteins
No correlations to the protein standards were observed in the initial blank runs. Analyses of 1, 5, 10, 15, 20, or 30 μL of the tryptic digest of the defined mixture of three standard proteins yielded a linearly increasing series of intensity values for some but not all peptides (Figure 1). A 30-fold increase in the amount of the standard mixture injected resulted in less than a 10-fold increase in parent peptide signal intensity. The relationship between protein amount and signal could be accounted for by a General Linear Model (GLM) with a highly significant result of p < 0.001 in all instances shown. Injecting increasing amounts of digest in each experiment results in trace amounts of new peptides from the target being detected at each new amount, which precludes averaging of all peptides over the protein. Statistical analysis of the intensity values from the same set of linear standard parent peptides at the 20 μL injection
Figure 1. Linearity of log10 parent ion intensity values from a digested mixture of alcohol dehydrogenase (ADH), cytochrome c (CYC) and glycogen phosphorylase B (GPB) that was randomly and independently sampled by unbiased LC−ESI−MS/MS. (A) EELFRSIGGEVFIDFTK (ADH); (B) ATDGGAHGVINVSVSEAAIEASTR (ADH); (C) EALDFFAR (ADH); (D) LPLVGGHEGAGVVVGMGENVK (ADH); (E) NLAENISR (GPB); (F) SRPLSDQEK (GPB); (G) TGPNLHGLFGR (CYC); (H) KTGQAPGFSYTDANK (CYC). A general linear model resulted in a p value of