Article pubs.acs.org/ac
MRMPROBS: A Data Assessment and Metabolite Identification Tool for Large-Scale Multiple Reaction Monitoring Based Widely Targeted Metabolomics
Hiroshi Tsugawa,*,†,‡ Masanori Arita,†,§ Mitsuhiro Kanazawa,∥ Atsushi Ogiwara,∥ Takeshi Bamba,‡ and Eiichiro Fukusaki‡
† RIKEN Center for Sustainable Resource Science, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
‡ Department of Biotechnology, Graduate School of Engineering, Osaka University, Suita, Osaka 565-0871, Japan
§ Department of Biophysics and Biochemistry, Graduate School of Science, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
∥ Reifycs Incorporated, 1-6-12 Nishishinbashi, Minato-ku, Tokyo, 105-0003, Japan
S Supporting Information *
ABSTRACT: We developed a new software program, MRMPROBS, for widely targeted metabolomics using the large-scale multiple reaction monitoring (MRM) mode. This strategy has become increasingly popular for the simultaneous analysis of up to several hundred metabolites with high sensitivity, selectivity, and quantitative capability. However, the traditional method of assessing measured metabolomics data without probabilistic criteria is not only time-consuming but also often subjective and makeshift. Our program overcomes these problems by detecting and identifying metabolites automatically, by separating isomeric metabolites, and by removing background noise using a probabilistic score defined as the odds ratio from an optimized multivariate logistic regression model. Our software program also provides a user-friendly graphical interface to curate and organize data matrices and to apply principal component analyses and statistical tests. For a demonstration, we conducted a widely targeted metabolome analysis (152 metabolites) of propagating Saccharomyces cerevisiae measured at 15 time points by gas and liquid chromatography coupled to triple quadrupole mass spectrometry. MRMPROBS is a useful and practical tool for the assessment of large-scale MRM data and is applicable to any instrument and any experimental condition.
Widely targeted metabolomics is a novel methodology in metabolomics; levels of preselected metabolites are determined by using chromatography coupled to quadrupole (Q) or triple quadrupole (QqQ) mass spectrometry (MS).1−4 High-speed scanning of Q/MS or QqQ/MS data makes it possible to simultaneously analyze a fixed number (several dozen to several hundred) of preselected metabolites in a single run with high sensitivity and a wide dynamic range. Such preselected metabolites can be quantified with the calibration curves of internal or external standards,5 and this quantitative nature is complementary to conventional nontargeted analysis using time-of-flight MS.6,7 In addition, the ever-increasing scan speed of mass spectrometers has led to an increase in the number of selectable metabolites.8 In widely targeted metabolomics, multiple reaction monitoring (MRM) by QqQ/MS is important for its high sensitivity, selectivity, and quantification ability.9,10 The MRM mode consists of two stages of mass selection, the precursor ion (MS1) and its fragment (product ion, MS2). Although additional processes such as product ion scanning would be needed for accurate identification, the appropriate selection of these precursor−product ion pairs in the MRM mode can yield highly sensitive and selective measurements of the target metabolites. Usually two or three MRM transitions are specified for the quantification and identification of metabolites, whereas one or two ion pairs are specified to assess whether an observed peak is true (metabolite) or false (noise).9,11−14 Stein and Heller discussed in detail the risk of false positive identification when monitoring a few selected ions without scan mode analysis and listed steps for reducing the risk.15 Later, Berendsen et al. introduced a method to evaluate the (un)certainty of MRM selectivity to reduce the risk of false positives.16 By using a high-end MS instrument and by optimizing analytical conditions, we can now measure 100−500 transitions in a single run and quantify 50−200 metabolites by the MRM mode.
© 2013 American Chemical Society
Received: February 21, 2013 Accepted: April 12, 2013 Published: April 12, 2013
dx.doi.org/10.1021/ac400515s | Anal. Chem. 2013, 85, 5191−5199
Analytical Chemistry
Figure 1. Flowchart and overview of the MRMPROBS program. The program supports all processes of data analysis from raw data import to statistical analysis for widely targeted metabolomics. A probability is assigned to each peak group by means of the MRMPROBS scoring system.
In contrast to technological improvements, software development for the data analysis of MRM transitions lags behind in metabolomics: data assessment usually relies on manual evaluation due to the lack of automated probabilistic measures. Manual verification of such large-scale MRM data sets is not only laborious, but often subjective, erroneous, and even irreproducible. Therefore, objective evaluation is needed to minimize misinterpretations of biological issues. The dedicated software programs from MS vendors such as Xcalibur (Thermo Fisher Scientific), MassLynx (Waters), MassHunter (Agilent), LabSolutions (Shimadzu), and Analyst (AB Sciex) are indeed useful, but they do not provide a probabilistic evaluation for identifying metabolites. In addition, the packaged software is only available for instruments sold by the vendor. In proteomics, on the other hand, researchers can use, for example, Skyline software for MRM data from most vendors to qualify and quantify proteins and peptides.17 Objective evaluation is also possible with the mProphet algorithm that can evaluate the peptide peaks with a probabilistic scoring measure.18 These approaches are, however, not directly applicable to metabolomics. In proteomics, noise peaks are evaluated by using decoy transitions and heavy peptides, but in metabolomics isomeric metabolites detected by the same MRM transitions must be further discriminated.9 In addition, due to the scarcity of pure label compounds, isomeric differences tend to be assessed without label compounds.19 In the proteomics field, especially in “shotgun proteomics”, the false discovery rate is effectively estimated from appropriate libraries of false identifications by means of decoy techniques.
Metabolomics studies so far have not shown the feasibility or practicality of such a probabilistic scoring system.20 This is because the cleavage pattern and subsequent product m/z of metabolites in the collision cell are difficult to predict due to a large variety of molecular properties.21,22 For these reasons the retention time and spectral library become considerably more important. Under these circumstances, the probability that each peak in fact represents the target molecular structure must be considered, and an easy-to-use graphical user interface is needed to double-check the theoretical results and to construct an organized data matrix.
To satisfy these requirements we developed a software program, MRMPROBS (multiple reaction monitoring based probabilistic system for widely targeted metabolomics), written in the C# language for widely targeted metabolomics. It evaluates the metabolite peaks by posterior probability, defined as the odds ratio by means of a newly optimized multivariate logistic regression model, and visualizes large-scale MRM data sets with user-friendly graphical user interfaces to allow data curation and statistical analyses. Specifically, the probability of a peak is calculated from five machine-independent variables, i.e., peak intensity and retention time, QT ratio, shape, and coelution similarity, without decoy transitions and label compounds. In addition, MRMPROBS offers auxiliary statistical analyses including multi t tests with graphs and principal component analysis (PCA). To demonstrate its main features, we measured the time-course data of propagating Saccharomyces cerevisiae (S. cerevisiae) by gas chromatography coupled to triple quadrupole mass spectrometry (GC/QqQ/MS) and liquid chromatography coupled to triple quadrupole mass spectrometry (LC/QqQ/MS). In a single run, GC/QqQ/MS targeted 110 metabolites with 325 transitions, and LC/QqQ/MS 60 metabolites with 166 transitions. These metabolites cover a wide range of metabolic properties involved in the construction of glycolysis, the pentose phosphate pathway, TCA cycle, nucleic acid metabolism, major amino acids, and cofactors. We sampled 15 time points with four biological replicates from early log to the diauxic shift phase. To our knowledge, this is the first single experiment that provides a detailed time course of the yeast metabolome. We expect MRMPROBS to be a useful and practical tool for the automated, objective, and consistent evaluation in widely targeted metabolomics. Our program and details of the MRM conditions are freely available at http://prime.psc.riken.jp/.23
■ THEORETICAL BASIS
Terminology. A “transition group” is a set of MRM transitions for one metabolite whose cardinality is two or more. “Transition group record” refers to data points obtained in a transition group. A set of peaks (i.e., mass spectra) detected in a transition group is called a “peak group”. In one peak group from two or more MRM transitions, one transition is used to quantify a metabolite; this is defined as the “target” transition. The remaining transitions are “qualifier” transitions and used for qualification of a target metabolite to discriminate it from isomeric metabolites and the background noise. Among the
“target” and “qualifier” transitions, the qualifier/target (QT) ratio is defined as the percentage of the maximum intensity of the qualifier divided by the intensity of the target. The QT ratio is used for assessments of peak groups. “Shape” and “coelution” similarity refer to the peak shape and the peak top deviation between a target and a qualifier transition record, respectively. In this study, each peak group is classified into one of three classes: “true peak”, “false peak”, and “noise peak”. “True peak” refers to peaks derived from the target metabolites. They must be experimentally confirmed by adding standard compounds with a dilution series into biological samples. The remaining peaks in the same transition record are “false peaks” and are used to evaluate isomeric metabolites that produce the same precursor and product m/z and a very similar retention time. Finally, “noise peak” refers to peaks in decoy transition records. They are used to estimate technical or machine noise.
Software Description. Figure 1 shows the MRMPROBS flowchart. We currently use a freely available file-format converter (Reifycs Inc.) that converts raw data from each instrument to the Reifycs Analysis Base File (ABF) format. The purpose of this file conversion is to provide a common data format for the rapid retrieval of desired data ranges. The ABF library accepts raw file formats from several MS instruments and provides indexed data access to designated retention time and/or m/z ranges, and to polarity and/or mass levels. MRMPROBS imports ABF format files from measured samples and the reference (i.e., MRM transition) library in text format, which contains the target compound name, retention time, target or qualifier, QT ratio (optional), and precursor and product m/z. The construction method and an example library file are described in section SI 1 and supplemental data 1 in the Supporting Information, respectively.
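As a small illustration of the QT ratio defined under Terminology, the following Python sketch computes it for one qualifier transition. It assumes the ratio is expressed as a percentage of the target intensity; the function name `qt_ratio` and the intensity values are hypothetical and are not taken from MRMPROBS.

```python
def qt_ratio(qualifier_max_intensity: float, target_intensity: float) -> float:
    """Qualifier/target (QT) ratio, expressed as a percentage of the target intensity.

    Assumption: "ratio" is read as 100 * qualifier / target, following the
    Terminology paragraph; this is a sketch, not the MRMPROBS source code.
    """
    if target_intensity <= 0:
        raise ValueError("target intensity must be positive")
    return 100.0 * qualifier_max_intensity / target_intensity


# Hypothetical peak group: target peak height 10000, qualifier peak height 2500.
print(qt_ratio(2500.0, 10000.0))
```

In a real transition group the same calculation would be repeated for each qualifier transition, giving one QT ratio per qualifier.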
After ABF files are imported, the software detects and constructs the peak groups and calculates five scores (intensity, retention time, QT ratio, shape, and coelution score) for each peak group by comparing the detected peak intensity, qualifier transition records, and reference information. Our peak detection method is described in Supporting Information section SI 2. The software also accepts a peak group consisting of only one transition record, although this is not recommended; in such cases only two scores (the intensity and the retention time score) are calculated. Then the probability is computed for each peak group by a multivariate logistic regression model (see below). The odds ratio is calculated as a posterior probability of a true peak given the five scores. The peak group with the highest probability is selected as the quantification value in the resulting data matrix. This process is repeated for all transition group records of all samples. Upon constructing the matrix, MRMPROBS can normalize the variables by the internal standard and perform further statistical and graphical analyses. The software can visualize the bar or line chart of each metabolite or perform principal component analysis and multi t tests based on the false discovery rate. The results can be exported in several image formats (e.g., JPEG or PNG). The data matrix can also be exported in a text or comma-separated value format for further statistical analysis using external software programs.
Scoring Scheme. Five scores, each standardized from 0 (minimum value) to 1 (maximum value), for machine-independent analysis are used to evaluate detected peaks.
Retention Time and QT Ratio. The retention time (RT) and QT ratio scores are the similarity between a measured peak
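group and a library. The Gaussian function is used to compute both scores.

$$S_{\mathrm{RT}} = \exp\left(-\frac{1}{2}\left(\frac{\mathrm{RT}_{\mathrm{sam}} - \mathrm{RT}_{\mathrm{ref}}}{\delta}\right)^{2}\right)$$

$$S_{\mathrm{QT}} = \frac{1}{n}\sum \exp\left(-\frac{1}{2}\left(\frac{\mathrm{QT}_{\mathrm{sam}} - \mathrm{QT}_{\mathrm{ref}}}{\delta}\right)^{2}\right)$$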
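The two Gaussian similarity scores can be sketched in Python as follows. This is a minimal illustration, not the MRMPROBS implementation: the function names are hypothetical, and the sketch assumes that the sum in the QT score runs over the n qualifier transitions of a peak group and that a single δ is shared by them.

```python
import math
from typing import Sequence


def gaussian_score(sample: float, reference: float, delta: float) -> float:
    """exp(-0.5 * ((sample - reference) / delta)^2); 1.0 means identical to the reference."""
    if delta <= 0:
        raise ValueError("delta must be positive")
    z = (sample - reference) / delta
    return math.exp(-0.5 * z * z)


def rt_score(rt_sam: float, rt_ref: float, delta: float) -> float:
    """Retention time score for one peak group."""
    return gaussian_score(rt_sam, rt_ref, delta)


def qt_score(qt_sam: Sequence[float], qt_ref: Sequence[float], delta: float) -> float:
    """QT ratio score: mean Gaussian score over the qualifier transitions (assumed)."""
    if not qt_sam or len(qt_sam) != len(qt_ref):
        raise ValueError("need one reference QT ratio per measured QT ratio")
    scores = [gaussian_score(s, r, delta) for s, r in zip(qt_sam, qt_ref)]
    return sum(scores) / len(scores)


# Hypothetical values: a peak one standard deviation away from the reference RT,
# and two qualifiers, one of which matches its reference QT ratio exactly.
print(rt_score(10.0, 9.0, 1.0))
print(qt_score([25.0, 40.0], [25.0, 45.0], 5.0))
```

Both scores fall in the range from 0 to 1, so they can be combined with the other scores without further rescaling.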
Subscripts “sam” and “ref” denote the measured sample and the reference values (see below), respectively. δ is the standard deviation for all differences of target metabolite peaks in all samples. The reference value for one metabolite is constructed based on 10 different concentrations, each with five technical replicates (total measurements = 50). The experimental detail is described in Supporting Information section SI 1. A reliable reference is essential for the theoretical investigation of similarity assessments. Therefore, only peaks of more than 10 000 raw intensities were used for the QT ratio evaluation. This threshold is based on our observation that the relative standard deviations of the height of such peaks are less than 10% under our experimental conditions (Supporting Information Figure S-1a). Consequently, the averages of the RT and QT ratio of peak groups with peak intensities of more than 10 000 were calculated as the reference values of each metabolite. We also confirmed that the differences follow a Gaussian distribution (Supporting Information Figure S-1, parts b and c). By normalization, the scores range from 0 (no similarity) to 1 (identical to a reference). Because these scores are strongly dependent on the analytical condition (i.e., machine-dependent), we particularly investigated their features, importance, and contribution in the Results and Discussion section.
Peak Intensity. The intensity score is based on (1) the peak height of a peak group compared to the highest intensity in all target transition records and (2) the rank of the peak intensity in all peak groups of a focused transition record:

$$S_{\mathrm{intensity}} = a\,\frac{\ln(\mathrm{intensity}(\mathrm{peak}_i))}{\ln(\mathrm{highest\ intensity})} + (1 - a)\,\mathrm{rank}(\mathrm{peak}_i)$$
Intensity(peak_i) and highest intensity denote the peak intensity of a peak group and the highest peak intensity in all transition records, respectively. Rank(peak_i) denotes the rank value of the intensity of a peak group in a transition record. The highest peak in a transition record receives a rank value of 1, and the value is decremented by 0.1 for each lower rank until the 10th decrement. The other peaks receive a rank value of 0. In our study the number of detected peaks in a transition record of biological samples tended to be less than 10 (Supporting Information section SI 3 and Figure S-2). The coefficient value “a” is set to 0.5 in our experiment.
Shape and Coelution. In addition to the intensity, retention time, and QT ratio that are often used in metabolomics, we adopted the shape and coelution scores from the mProphet algorithm.18

$$S_{\mathrm{coelution}} = \frac{1}{n}\sum \exp\left(-\frac{1}{2}\left(\frac{\Delta T}{\delta}\right)^{2}\right)$$

$$S_{\mathrm{shape}} = \frac{1}{n}\sum (\Delta\mathrm{shape})$$
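The intensity score defined under Peak Intensity can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the MRMPROBS code: the function names are hypothetical, ranks are assumed to start at 1 for the highest peak, and intensities are assumed to be greater than 1 so that the logarithms are positive.

```python
import math


def rank_value(rank: int) -> float:
    """Rank term: 1.0 for the highest peak, 0.1 less per lower rank, 0 beyond rank 10."""
    if rank < 1:
        raise ValueError("rank starts at 1 for the highest peak")
    return (11 - rank) / 10 if rank <= 10 else 0.0


def intensity_score(peak_intensity: float, highest_intensity: float,
                    rank: int, a: float = 0.5) -> float:
    """a * ln(intensity) / ln(highest intensity) + (1 - a) * rank value.

    Assumption: both intensities are > 1 and peak_intensity <= highest_intensity,
    so the log ratio lies between 0 and 1.
    """
    if not 1.0 < peak_intensity <= highest_intensity:
        raise ValueError("expected 1 < peak_intensity <= highest_intensity")
    log_term = math.log(peak_intensity) / math.log(highest_intensity)
    return a * log_term + (1.0 - a) * rank_value(rank)


# Hypothetical peak: height 1e3 where the highest target peak is 1e6,
# and the peak is the tallest one in its own transition record.
print(intensity_score(1e3, 1e6, rank=1))
```

With a = 0.5, as in the text, the log-intensity term and the rank term contribute equally, so a low-abundance metabolite can still score well when it is the dominant peak in its own transition record.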
tubes were centrifuged at 16 000g for 5 min at 4 °C, and 400 μL of the supernatant was filtered through a PTFE filter (Millex, Millipore, Billerica, MA, U.S.A., pore size 0.2 μm) into two clear tubes. One tube was subjected to GC/QqQ/MS, the other to LC/QqQ/MS. Methanol in the tubes was evaporated by placing them in a vacuum centrifuge dryer for 1 h, and the mixtures were freeze-dried overnight. In the LC/QqQ/MS experiment, the dried samples were dissolved in 200 μL of Milli-Q water and a 3 μL supernatant aliquot was injected into the LC/QqQ/MS. In the GC/QqQ/MS experiment, the derivatization process was as reported elsewhere.25 Briefly, 100 μL of methoxyamine hydrochloride in pyridine (20 mg/mL) was added, and the mixture was incubated at 30 °C for 90 min for oximation. For trimethylsilylation, 50 μL of N-methyl-N-(trimethylsilyl)trifluoroacetamide (MSTFA) was added; this was followed by 30 min of incubation at 37 °C. A 1 μL aliquot of the supernatant was injected into the GC/QqQ/MS in split mode (15/1, v/v).
Decoy Transitions. Decoy transitions were created as in an earlier report on proteomics.18 For the decoy precursors a random integer ranging between 5 and 15 was subtracted from the actual precursor m/z of the target metabolites. To avoid conflicts with adduct ions, the random value was only subtracted and never added. Transitions were removed if they incidentally shared a similar precursor m/z with another true target (