Article pubs.acs.org/ac

Method for Assessing the Statistical Significance of Mass Spectral Similarities Using Basic Local Alignment Search Tool Statistics

Fumio Matsuda,*,†,§ Hiroshi Tsugawa,‡,§ and Eiichiro Fukusaki‡

†Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, and ‡Department of Biotechnology, Graduate School of Engineering, Osaka University, Suita, Osaka 565-0871, Japan
§RIKEN Center for Sustainable Resource Science, 1-7-22 Suehirocho, Turumi-ku, Yokohama, Kanagawa 230-0045, Japan

Supporting Information

ABSTRACT: A novel method for assessing the statistical significance of mass spectral similarities was developed using modified basic local alignment search tool (BLAST; Karlin−Altschul) statistics. In gas chromatography/mass spectrometry-based metabolomics, many signals in raw metabolome data are identified on the basis of unexpected similarities among mass spectra and the spectra of standards. Since there is inevitably noise in the observed spectra, a list of identified metabolites includes some false positives. In the developed method, electron ionization (EI) mass spectrometry−BLAST, a similarity score of two mass spectra is calculated using a general scoring scheme, from which the probability of obtaining the score by chance (P value) is calculated. For this purpose, a simple rule for converting a unit EI mass spectrum to a mass spectral sequence as well as a score matrix for aligned mass spectral sequences was developed. A Monte Carlo simulation using randomly generated mass spectral sequences demonstrated that the null distribution or the expected number of hits (E value) follows modified Karlin−Altschul statistics. A metabolite data set obtained from green tea extract was analyzed using the developed method. Among 171 metabolite signals in the metabolome data, 93 signals were identified on the basis of significant similarities (P < 0.015) with reference data. Since the expected number of false positives is 2.6, the false discovery rate was estimated to be 2.8%, indicating that the search threshold (P < 0.015) is reasonable for metabolite identification.

In nontargeted metabolomics studies for comprehensive profiling of metabolites, structural annotation of metabolite signals (peaks) is performed after acquisition of raw chromatographic data. In the case of metabolome analyses using gas chromatography/mass spectrometry (GC/MS), chromatographic and spectroscopic data of each metabolite signal, including the retention time (or index) and electron ionization (EI) mass spectrum, are subjected to library searching to identify metabolites.1,2 Since the EI mass spectrum and retention time are physicochemical properties unique to each metabolite, peak identification is based on unexpected similarities among metabolite signals observed in raw chromatograms and those of authentic compounds recorded in a library. For similarity scoring of two mass spectra, the cosine (dot)-product method has been the de facto standard in GC/MS metabolomics because of its good performance and simplicity.1−5 The cosine-product method scores the similarity of two mass spectra from 0 (no similarity) to 1 (identical).

The scoring of mass spectral similarity using this method has two limitations, namely, choosing a rational threshold for searching and estimating the probability of false positives. Since an observed mass spectrum obtained from a raw chromatogram inevitably includes some noise derived from the analysis, the similarity score of the observed and standard spectra of identical metabolites must be less than 1. This means that a search threshold for metabolite identification is needed. GC/MS metabolomics studies empirically employ threshold levels at 0.6−0.7. Furthermore, the metabolite identification procedure is iteratively performed for hundreds of metabolite signals in a metabolome data set. This suggests that a list of identified metabolites is likely to include some false identifications derived from chance similarities between mass spectra (false positives). A metabolite identification list produced using a lower (looser) threshold tends to include a larger number of false positives. In this regard, control of the false discovery rate (FDR) for a list of identified metabolites is essential to minimize misinterpretation of metabolome data.6 However, FDR estimation has not been considered in metabolomics because of the lack of relevant methodology.

In the field of proteomics, FDRs in peptide identification results have been estimated using the target−decoy method.7 Database searching was performed for a set of peptide MS/MS spectra using original protein (target) and reversed amino acid

© 2013 American Chemical Society

Received: May 24, 2013
Accepted: August 3, 2013
Published: August 3, 2013

dx.doi.org/10.1021/ac401564v | Anal. Chem. 2013, 85, 8291−8297


sequence (decoy) databases.8 FDRs were determined by comparing the number of query hits in the decoy and target databases since hits in the decoy databases are random.9 The target−decoy method cannot be applied to metabolite identification, since methods of generating unbiased decoy EI mass spectra of metabolites are unknown.

A probability-based method has been used for assessing nucleotide and amino acid sequence similarities.10,11 The basic local alignment search tool (BLAST) employs Karlin−Altschul statistics based on the extreme value distribution to determine the statistical significance of a similarity score of two nucleotide or amino acid sequences (P value).12,13 In the score matrix for a nucleotide sequence, scores of matched and unmatched nucleotide pairs are defined to be +5 and −4, respectively. A similarity score of two sequences (S) is determined for the highest scoring pair of identical-length segments chosen from two sequences (named the maximal segment pair, MSP). When a random sequence of length n is searched against a database with a total sequence length m, the expected number of MSPs with scores of at least S (E value) is given by the formula

$$E = K n m\, e^{-\lambda S} \tag{1}$$

version 2.71 and Metabolite Detector version 2.0.6 beta were used for manual curation of the GC/MS metabolome data.5

Cosine (Dot)-Product Method. In the cosine (dot)-product method, a similarity score (C) is determined using the following equation:

$$C = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2}}$$

Here, $x_i$ and $y_i$ represent the intensity values of the fragment ion with mass number i in mass spectra X and Y, respectively. When spectra X and Y are almost identical, C is close to 1.
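The cosine (dot)-product score defined above can be sketched in a few lines of Python. This is only an illustration, assuming that both spectra are given as equal-length lists of unit-mass intensities over the same m/z range; the function name and the two example spectra are hypothetical and are not taken from the study.

```python
import math

def cosine_product(x, y):
    """Cosine (dot)-product similarity of two unit-mass spectra.

    x and y are equal-length sequences of intensities, one per mass number.
    """
    if len(x) != len(y):
        raise ValueError("spectra must cover the same m/z range")
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    # An all-zero spectrum carries no information; report no similarity.
    return dot / norm if norm > 0 else 0.0

# Two hypothetical five-channel spectra (base peak scaled to 999)
query = [0, 999, 120, 0, 40]
reference = [0, 999, 100, 10, 0]
print(round(cosine_product(query, reference), 3))
```

Because the score is normalized by the lengths of both intensity vectors, it depends only on relative intensities and reaches 1 for any pair of proportional spectra, however few fragments they contain.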



RESULTS AND DISCUSSION

Mass Spectral Sequence and Score Matrix. In this study, the similarity of two mass spectra with nominal (unit) mass numbers is taken into consideration. A typical unit EI mass spectrum consists of an array of intensity values of fragment ions. For example, the mass spectral data used in this study are an array of intensity values of 416 fragments within m/z 85−500 (500 − 85 + 1 = 416). The intensity values range from 0 to 999, since the level of the most intense fragment (base peak) is set at 999. The intensity value is converted to the corresponding symbol using a simple conversion rule, as shown in Table 1. For

where K and λ are Karlin−Altschul parameters depending on the database. The probability of finding at least one such MSP by chance (P value) is

$$P = 1 - e^{-E} \tag{2}$$
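As a rough illustration of how eqs 1 and 2 turn a score into an E value and a P value, the following Python sketch may help. The values of K, λ, n, and m are hypothetical placeholders chosen only for the demonstration; they are not parameters reported in this study or by BLAST.

```python
import math

def e_value(score, n, m, K, lam):
    """Eq 1: expected number of MSPs with a score of at least `score`."""
    return K * n * m * math.exp(-lam * score)

def p_value(e):
    """Eq 2: probability of finding at least one such MSP by chance.

    expm1 keeps precision when E is very small, where P is close to E.
    """
    return -math.expm1(-e)

# Illustrative parameters only (assumed, not taken from the paper)
K, lam = 0.1, 0.2
n, m = 100, 1_000_000
for s in (60, 80, 100):
    e = e_value(s, n, m, K, lam)
    print(s, e, p_value(e))
```

Two properties of the formulas are visible in the sketch: E grows in proportion to the search space n × m, and for small E the P value is practically equal to E.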

Table 1. Rules for Conversion of Mass Spectra into Mass Spectral Sequences

Using this procedure, P values for two amino acid sequences have been determined using the score matrices developed for amino acid analysis, such as the blocks substitution matrix (BLOSUM) and percentage of acceptable point mutations (PAM) series.10,14 This indicates that Karlin−Altschul statistics are applicable to any biological sequence analysis, provided that a suitable score matrix is prepared. For assessing similarities among EI mass spectra, the probability-based matching method has been developed.15 However, the method has not been widely used for GC/MS metabolomics, probably because of its complexity and inadequate performance.4 More recently, X-Rank was developed for probability-based scoring of tandem mass spectra, but it does not support EIMS.16

In this study, a novel method, EIMS−BLAST, was developed to evaluate the similarity of two EI mass spectra as a probability of obtaining the score by chance (P value), using BLAST-based statistics. EIMS−BLAST enables FDRs to be estimated for an identification list from an expected number of false-positive hits in iterative searching.

range of signal intensity (a)    symbol    number (b)    frequency (b)
0 ≤ x < 42                       A         199653        0.934
42 ≤ x < 56                      B         3138          0.015
56 ≤ x < 75                      C         2475          0.012
75 ≤ x < 100                     D         2091          0.010
100 ≤ x < 133                    E         1652          0.008
133 ≤ x < 178                    F         1321          0.006
178 ≤ x < 237                    G         909           0.004
237 ≤ x < 316                    H         662           0.003
316 ≤ x < 422                    I         529           0.002
422 ≤ x < 563                    J         395           0.002
563 ≤ x < 750                    K         280           0.001
750 ≤ x ≤ 1000                   L         719           0.003

(a) The segmentation of intensity values (x) is generally described by $1000 \times 0.75^{n} \le x < 1000 \times 0.75^{n-1}$ (n = 1, 2, 3, ..., 12).
(b) Numbers and frequencies of symbols in all mass spectral sequences converted from the standard spectral library are shown in the table.
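The conversion rule of Table 1 can be sketched as a lookup over the lower bounds of the intensity bins. The code below is a minimal illustration, assuming integer intensities with the base peak scaled to 999; the function names and the example spectrum are hypothetical.

```python
import bisect

# Lower bounds of the intensity bins for symbols B..L, read from Table 1.
# Anything below 42 falls into the first bin, symbol A.
LOWER_BOUNDS = [42, 56, 75, 100, 133, 178, 237, 316, 422, 563, 750]
SYMBOLS = "ABCDEFGHIJKL"

def to_symbol(intensity):
    """Map one intensity value to its Table 1 symbol."""
    if not 0 <= intensity <= 1000:
        raise ValueError("intensity must lie in 0..1000")
    # bisect_right counts how many lower bounds are <= intensity.
    return SYMBOLS[bisect.bisect_right(LOWER_BOUNDS, intensity)]

def to_sequence(spectrum):
    """Convert a list of intensities into a mass spectral sequence."""
    return "".join(to_symbol(v) for v in spectrum)

spectrum = [0, 12, 999, 41, 42, 310, 750, 562, 563]
print(to_sequence(spectrum))  # AALABHLJK
```

A full spectrum covering m/z 85−500 would yield a 416-letter sequence in the same way, with most positions mapped to A.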



example, intensity values (x) for 750 ≤ x < 1000 and 0 ≤ x < 42 are converted to the symbols “L” and “A”, respectively. The segmentation of the intensity values is generally described by $1000 \times 0.75^{n} \le x < 1000 \times 0.75^{n-1}$ (n = 1, 2, 3, ..., 12). Among the segmentation rules tested, such as $1000 \times 0.5^{n} \le x < 1000 \times 0.5^{n-1}$ and 10 equal segmentations, this rule showed the best fit to the modified Karlin−Altschul statistics, as discussed below (data not shown). To investigate the frequency distributions of the symbols, 475 mass spectra in the standard library (Library_RT_NT)17 were converted to mass spectral sequences. The numbers for each symbol in all the mass spectral sequences are shown in Table 1. The frequency of the symbol A ($p_A$) was 0.934, reflecting the fact that most fragment ions in mass spectra have weak or zero intensities. Using the conversion rule, a mass spectral sequence with 416 letters (m/z 85−500) is generated. Although the whole

METHODS

Libraries of Standard Spectra and GC/MS Metabolome Data Sets. A mass spectral library (Library_RT_NT) containing 475 accessions with EIMS spectra (m/z 85−500) and retention times was used in this study.17 The OUF series of MassBank records was derived from the library. The KZ series of EI mass spectral data produced by the Kazusa DNA Research Institute was downloaded from MassBank (http://www.massbank.jp/).3 Three metabolome data sets obtained from green tea,2,18 yeast,19 and mouse plasma17 were processed using MetAlign and AIoutput version 2.0 to produce deconvoluted mass spectra (height threshold, 100; RT bining, 3; filtering method, accurate; height filter, 1000; RSD(CV) filter, 10; RT tolerance, 3 s; match threshold, 0.75).20 All data analyses were performed using in-house scripts written in Perl 5. AMDIS


Figure 1. Generation of the mass spectral sequence from unit mass data and similarity scoring using the score matrix. (a) Two virtual mass spectra consisting of 16 intensity values (m/z 85−100); the values in the table indicate the intensities of fragments of each m/z. (b) Mass spectral sequences converted using the rule shown in Table 1 and scoring of mass spectral similarity using the score matrix shown in Table 2. The sum of the scores is determined for maximal segment pairs (MSPs).

Table 2. Score Matrix Used in This Study

        A      B      C      D      E      F      G      H      I      J      K      L
A       1    −54    −63    −77    −86   −105    −86    −90   −110   −109   −107   −127
B     −54    414    401    402    412    413    364    361    382    352    344    321
C     −63    401    461    453    460    460    418    393    389    379    372    349
D     −77    402    453    531    526    530    473    452    463    439    441    432
E     −86    412    460    526    589    588    524    499    510    470    454    455
F    −105    413    460    530    588    761    598    557    557    553    524    527
G     −86    364    418    473    524    598    615    613    610    596    571    564
H     −90    361    393    452    499    557    613    703    693    703    675    696
I    −110    382    389    463    510    557    610    693    859    836    831    858
J    −109    352    379    439    470    553    596    703    836    922    883    921
K    −107    344    372    441    454    524    571    675    831    883    953    953
L    −127    321    349    432    455    527    564    696    858    921    953   1159
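To make the use of Table 2 concrete, the sketch below scores two aligned mass spectral sequences position by position and then looks for the best contiguous segment. The matrix values are copied from Table 2; the two 16-symbol sequences are hypothetical (they are not those of Figure 1), and the use of Kadane's algorithm to find the highest-scoring segment is an assumption made for this illustration, not necessarily how the authors' scripts locate MSPs.

```python
SYMBOLS = "ABCDEFGHIJKL"

# Table 2, one row per symbol in the order A..L
ROWS = [
    [1, -54, -63, -77, -86, -105, -86, -90, -110, -109, -107, -127],
    [-54, 414, 401, 402, 412, 413, 364, 361, 382, 352, 344, 321],
    [-63, 401, 461, 453, 460, 460, 418, 393, 389, 379, 372, 349],
    [-77, 402, 453, 531, 526, 530, 473, 452, 463, 439, 441, 432],
    [-86, 412, 460, 526, 589, 588, 524, 499, 510, 470, 454, 455],
    [-105, 413, 460, 530, 588, 761, 598, 557, 557, 553, 524, 527],
    [-86, 364, 418, 473, 524, 598, 615, 613, 610, 596, 571, 564],
    [-90, 361, 393, 452, 499, 557, 613, 703, 693, 703, 675, 696],
    [-110, 382, 389, 463, 510, 557, 610, 693, 859, 836, 831, 858],
    [-109, 352, 379, 439, 470, 553, 596, 703, 836, 922, 883, 921],
    [-107, 344, 372, 441, 454, 524, 571, 675, 831, 883, 953, 953],
    [-127, 321, 349, 432, 455, 527, 564, 696, 858, 921, 953, 1159],
]
INDEX = {s: i for i, s in enumerate(SYMBOLS)}

def pair_scores(seq_x, seq_y):
    """Score each pair of symbols sharing a mass number (one fixed alignment)."""
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must have the same length")
    return [ROWS[INDEX[a]][INDEX[b]] for a, b in zip(seq_x, seq_y)]

def maximal_segment_score(scores):
    """Best sum over contiguous segments (Kadane's algorithm)."""
    best = current = 0
    for s in scores:
        current = max(0, current + s)  # restart after the running sum drops below 0
        best = max(best, current)
    return best

# Two hypothetical 16-symbol sequences (m/z 85-100)
seq_x = "AALAAHAAAABAAAAD"
seq_y = "AAKAAHAAAAAAAAAD"
scores = pair_scores(seq_x, seq_y)
print(sum(scores), maximal_segment_score(scores))  # 2145 2145
```

In this example the single mismatch against symbol A (B/A, score −54) is outweighed by the matches on either side, so the best segment is the whole alignment and both numbers coincide.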

sequence with 416 letters is used for similarity scoring in this study, here the scoring procedure is explained by using two virtual mass spectra with 16 intensity values (m/z 85−100, shown in Figure 1a). Similarity scorings, using the whole sequence for ornithine and fructose 6-phosphate, are shown in the Supporting Information (S1). In the case of the two virtual mass spectra, mass spectral sequences with 16 symbols are generated using the conversion rule (Figure 1b). It should be noted that fragment ions with weak or zero intensities are converted to symbol A, which is also used for sequence comparison. The two mass spectral sequences are aligned by pairing symbols of identical mass numbers to determine the similarity score of the two mass spectral sequences (Figure 1b).

The score matrix shown in Table 2 was developed using the following procedure. The score of two symbols, i and j ($s_{ij}$), is determined as the log-odds ratio of the observed and expected frequencies:12

$$s_{ij} = \ln \frac{q_{ij}}{p_i p_j}$$

where $p_i$ represents the frequency of symbol i in the library data (Table 1) and $q_{ij}$ represents the observed frequency of a pair of symbols, i and j, in the total number of alignments, i.e., 112 575 [= (474 × 475)/2], of the library data. The original score matrix is shown in the Supporting Information (S2). To avoid distortion derived from the high frequency of symbol A, the score of $s_{AA}$ is arbitrarily reduced to one-fifth of the original score. The levels of $s_{ij}$ are normalized by setting $s_{AA}$ = 1.0 and rounding to nominal values. These modifications did not affect the Karlin−Altschul statistics, as demonstrated below. The expected value of a score determined using the matrix is negative (−2.89).

Modification of Karlin−Altschul Statistics. To find MSPs for two nucleotide sequences, segments can be started from any pair of nucleotides for all possible alignments. There are therefore approximately m × n starting points for finding an MSP of two sequences with m and n base pairs. The E value determined using eq 1 is corrected using the parameter K to avoid redundancy from overlapping segments. For a comparison of two mass spectral sequences, there is only one alignment since the symbols of identical mass numbers have to be compared (Figure 1b). MSPs are started from (i) the left edge of the alignment (or smallest mass number) and (ii) the position next to a symbol pair with a negative score. In the case of the virtual mass spectral data shown in Figure 1b, a segment pair covering the whole alignment shows the maximal score among candidates. Thus, when a database search of a mass spectral sequence with length n is performed against a database with a total of l spectra, the expected number of MSPs with scores of at least S (E value) is given by the formula

$$E = K l n g\, e^{-\lambda S} \tag{3}$$

where K is a parameter for correcting redundancy, g is the frequency of starting points for MSPs, and λ is a parameter for correcting the score S, determined by the following equation:

$$\sum_i \sum_j p_i r_j e^{\lambda s_{ij}} = 1 \tag{4}$$

where $p_i$ and $r_j$ represent the probabilities of the occurrences of symbols i and j in the library data and query spectra, respectively, and $s_{ij}$ is the score for a pair of symbols, i and j, defined in Table 2. Figure 2 shows the theoretical relationship between S and E, determined by eq 3. The theoretical relationship was then tested by simulation.
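Equation 4 has no closed-form solution for λ, so it has to be solved numerically. The sketch below does this by bisection and then evaluates eqs 3 and 2. It is an illustration under several assumptions: symbol frequencies are derived from the counts in Table 1, the query is assumed to have the same symbol frequencies as the library, K and g are taken from the caption of Figure 2, and the score (3000) and subset size (l = 10) are hypothetical. The resulting λ is not claimed to reproduce the value reported in the paper.

```python
import math

# Symbol counts from Table 1 (A..L) and the score matrix of Table 2
COUNTS = [199653, 3138, 2475, 2091, 1652, 1321, 909, 662, 529, 395, 280, 719]
ROWS = [
    [1, -54, -63, -77, -86, -105, -86, -90, -110, -109, -107, -127],
    [-54, 414, 401, 402, 412, 413, 364, 361, 382, 352, 344, 321],
    [-63, 401, 461, 453, 460, 460, 418, 393, 389, 379, 372, 349],
    [-77, 402, 453, 531, 526, 530, 473, 452, 463, 439, 441, 432],
    [-86, 412, 460, 526, 589, 588, 524, 499, 510, 470, 454, 455],
    [-105, 413, 460, 530, 588, 761, 598, 557, 557, 553, 524, 527],
    [-86, 364, 418, 473, 524, 598, 615, 613, 610, 596, 571, 564],
    [-90, 361, 393, 452, 499, 557, 613, 703, 693, 703, 675, 696],
    [-110, 382, 389, 463, 510, 557, 610, 693, 859, 836, 831, 858],
    [-109, 352, 379, 439, 470, 553, 596, 703, 836, 922, 883, 921],
    [-107, 344, 372, 441, 454, 524, 571, 675, 831, 883, 953, 953],
    [-127, 321, 349, 432, 455, 527, 564, 696, 858, 921, 953, 1159],
]

def karlin_sum(lam, p, r, scores):
    """Left-hand side of eq 4."""
    return sum(p[i] * r[j] * math.exp(lam * scores[i][j])
               for i in range(len(p)) for j in range(len(r)))

def solve_lambda(p, r, scores, tol=1e-12):
    """Positive root of eq 4 by bisection; requires a negative expected score."""
    lo, hi = 1e-6, 1e-3
    if karlin_sum(lo, p, r, scores) >= 1.0:
        raise ValueError("expected score is not negative; no positive root")
    while karlin_sum(hi, p, r, scores) <= 1.0:
        hi *= 2.0  # widen the bracket until the sum exceeds 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if karlin_sum(mid, p, r, scores) > 1.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def e_value(score, K, l, n, g, lam):
    """Eq 3: expected number of MSPs with a score of at least `score`."""
    return K * l * n * g * math.exp(-lam * score)

total = sum(COUNTS)
p = [c / total for c in COUNTS]  # library symbol frequencies
r = p                            # assumption: query frequencies equal the library's
lam = solve_lambda(p, r, ROWS)
e = e_value(score=3000, K=0.094, l=10, n=416, g=0.10, lam=lam)
print(lam, e, -math.expm1(-e))   # lambda, E value (eq 3), P value (eq 2)
```

Bisection is used here only because it is simple and safe: the left-hand side of eq 4 equals 1 at λ = 0, dips below 1 when the expected score is negative, and then rises without bound, so there is exactly one positive root to bracket.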

100 000 pairs of random mass spectral sequences were generated on the basis of the symbol frequency shown in Table 1. The numbers of MSPs with scores of at least S (E values) were determined. When the parameter K is set at 0.094, the simulated null distribution of the E values is essentially identical to the theoretical distribution (Figure 2). The results demonstrate that the similarity score of mass spectral sequences determined using the scoring matrix follows modified Karlin−Altschul statistics. This means that the P value of a spectral similarity could be determined using eqs 2 and 3.

Figure 2. Distribution of expected numbers of maximal segment pairs above score S. The solid line represents the theoretical distribution determined using modified Karlin−Altschul statistics (K = 0.094, l = 100 000, n = 476, g = 0.10, λ = 0.00423). Black dots indicate total numbers of maximal segment pairs above the score observed in 100 000 comparisons of randomly generated mass spectral sequences.

As shown in eq 3, the E value is proportional to the number of spectra in a database (l), indicating that a smaller spectral library is preferable for reducing the probability of false positives. Here, a subset of the spectral library is prepared for each query metabolite signal by obtaining data of the spectral library with retention times or indexes close to those of the query metabolite signal. In this study, thresholds of ±1.5 s and ±5.0 retention index are employed since these are approximately 2σ of the observed gaps (data not shown). A wider and narrower threshold would increase false positives and negatives, respectively. The number of accessions in the subset is l in eq 3.

It should be noted that the P values determined using EIMS−BLAST depend on the segmentation rule and the score matrix. This implies that a different combination of segmentation rule and score matrix would be suitable for analyzing underivatized metabolites and synthetic compounds. It is also expected that the scoring could be improved by further development of the rule and the matrix, as demonstrated for amino acid sequences.10,14 PAM series were originally developed on the basis of the length of evolutionary time, but BLOSUM series of matrices have been widely used in recent studies. BLOSUM was obtained using multiple local alignments of more distantly related sequences in a large sequence database. It is expected that a more sophisticated score matrix would be obtained by analysis of a larger standard library of methoxyaminated and trimethylsilylated metabolites. Since the conversion method to generate a mass spectral sequence (Figure 1) depends on the nature of the unit mass data, the developed method is unsuitable for application to high-resolution mass spectra. Considering this, the method employed in X-Rank is interesting since a probability-based score is determined on the basis of matching the mass numbers of the intense fragments in high-resolution tandem mass spectra. Modification of the EIMS−BLAST method by incorporating this methodology will be essential for more rigorous metabolite identification in advanced GC/MS metabolomics using high-resolution mass spectral data.

Procedure for Probability-Based Searching of the Mass Spectral Database. The procedure for searching a mass spectral database using the developed method (EIMS−BLAST) is as follows:
(1) The EI mass spectrum and retention time (or index) of a query metabolite signal are obtained from raw chromatographic data.
(2) The EI mass spectrum is converted to a mass spectral sequence using the conversion rule shown in Table 1 (Figure 1a). The frequency of each symbol in the mass spectral sequence and the whole standard library is determined, from which the parameter λ is determined using eq 4.
(3) Accessions of the standard library whose retention times or indexes are close to those of the query metabolite signal (±1.5 s or ±5.0 retention index) are obtained from the spectral library. The number of obtained accessions of standard compounds is l in eq 3.
(4) The mass spectrum of each standard compound is converted to a mass spectral sequence using the conversion rule shown in Table 1. The mass spectral sequences are compared with that of the query to produce a score S using the score matrix (Table 2). The full lengths of mass spectral sequence pairs are considered as MSPs to find a completely identical spectrum; g is determined from the number of symbol pairs with negative scores.
(5) The P value is determined from the score S using eqs 2 and 3. The query metabolite signal is identified as the standard compound giving the smallest P value, provided that this P value is below the threshold level.

Comparison with the Cosine-Product Method. A characteristic of similarity scoring using EIMS−BLAST is that the P values can reflect the amount of information in a query mass spectrum. For instance, the P value of two essentially identical EI mass spectra of ornithine acquired at two different laboratories was 5.77E−12 (Figure 3a,b). This P value is higher (less significant) than that obtained for other metabolites such as D-ribose 5-phosphate and fructose 6-phosphate (P = 2.88E−50, Figure 3d,e). This means that a mass spectrum with sparse fragments, such as that of ornithine, has a higher probability of chance similarities. In contrast, the similarity score of the two ornithine spectra determined by the cosine-product method (C = 0.985) is larger than that of D-ribose 5-phosphate and fructose 6-phosphate (C = 0.780). This is because the best similarity score is normalized to 1.0 in the case of the cosine-product method. Furthermore, an actual mass spectrum obtained from raw chromatograms often lacks some data. For example, the mass spectrum of metabolite signal no. 346 in the green tea data set lacks all intensity data except for one fragment ion (Figure 3c). The incomplete spectrum is a result of the difficulty in the deconvolution of peaks with low-intensity signals. A mass spectrum search was simulated to evaluate the effect of missing


Figure 3. Comparison of EI mass spectra: (a, b) standard spectra of ornithine acquired at two different laboratories; (c) mass spectrum of a metabolite signal in the green tea data set (peak no. 346); (d) standard spectrum of ribose 5-phosphate; (e) standard spectrum of fructose 6-phosphate; (f) mass spectrum of a metabolite signal in the green tea data set (peak no. 587). Similarity scores determined by EIMS−BLAST and the cosine-product method are also shown.

data on the similarity score ranking. An EI mass spectrum was sampled from the accessions of a standard library (Library_RT_NT), and some of its intensity values were randomly set to zero. The modified mass spectrum with missing data was searched against the library to check the top-ranked accession. The procedure was repeated 10 000 times (Table 3). When no data are missing, the top-ranked result is always identical to the query. In the case of a modified mass spectrum with 5% of the data missing, 92.8% of top-ranked accessions were matched with queries using the cosine-product method. The frequency of exact ranking was increased to 96.8% by using EIMS−BLAST (Table 3). The results suggest that EIMS−BLAST is robust for scoring actual mass spectral data with missing values.

Application to GC/MS Metabolome Data Sets. The metabolite signals in three GC/MS metabolome data sets,


samples. Although its accuracy still has to be validated, EIMS−BLAST enables us to compare the qualities of metabolite annotations among data sets. As shown in Table 4, caution should be exercised in biological interpretations of the mouse plasma data set, since the FDR, 4.4%, is slightly higher than those of the other data sets.

The green tea data set was also processed by the cosine-product method using the same standard library. The metabolites for 92 signals were successfully deduced when the threshold level was set at C > 0.65 (Table 4). A comparison of the metabolite identification lists demonstrated that the first or second best search results for 79 metabolite signals matched between the BLAST-based and cosine-product methods. In contrast, 14 and 13 metabolite signals were annotated only by the EIMS−BLAST and cosine-product methods, respectively. Similar disagreements were observed for the yeast and mouse plasma data sets (Table 4).

The reason for the disagreement was investigated by comparing the mass spectra. A metabolite signal (assigned no. 346 in the green tea data set, Figure 3c) was deduced to be ornithine by the cosine-product method, with a good score above the threshold level (C = 0.896, comparison between parts b and c of Figure 3). However, the mass spectrum of no. 346 lacks all intensity data except for one fragment ion, as mentioned above. The EIMS−BLAST score of the top-ranked metabolite (ornithine, P = 0.148) failed to satisfy the threshold, suggesting a high probability of chance similarities of these spectra. A metabolite signal (no. 587) in the green tea data set has a lot of data missing (Figure 3f). The signal was annotated as D-fructose 6-phosphate by EIMS−BLAST (P = 2.36E−05), whereas the score determined using the cosine product (C = 0.583) was below the threshold level (compare parts e and f of Figure 3). Manual curation of the raw chromatogram data showed that complete mass spectra of these metabolite signals could not be generated via data processing using AMDIS and Metabolite Detector since the metabolite signals were too weak.

Table 3. Effect of Missing Data on the Similarity Score Ranking

                                              frequency of top exact result (b) (%)
ratio of missing data in spectra (a) (%)      cosine product      EIMS−BLAST
0                                             100                 100
5                                             92.8                96.8
10                                            86.4                94.6
15                                            79.7                91.1
20                                            72.8                88.2
25                                            67.2                85.7
30                                            60.3                82.6

(a) The indicated ratio of intensity values was randomly set to zero in the EI mass spectrum sampled from the standard library (Library_RT_NT).
(b) Frequencies of exact top-ranked results in 10 000 tests are shown.
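The missing-data test summarized in Table 3 can be outlined as follows. This is a loose sketch, not a reproduction of the experiment: it uses a small synthetic library of random sparse spectra in place of Library_RT_NT, ranks hits with the cosine product only, and assumes that the stated ratio applies to all 416 intensity positions. All function names, the library size, and the number of trials are hypothetical.

```python
import math
import random

def cosine_product(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0

def random_spectrum(rng, length=416, n_peaks=25):
    """Hypothetical sparse spectrum with the base peak scaled to 999."""
    spectrum = [0.0] * length
    for pos in rng.sample(range(length), n_peaks):
        spectrum[pos] = rng.uniform(1.0, 999.0)
    top = max(spectrum)
    return [round(v * 999 / top) for v in spectrum]

def drop_values(spectrum, ratio, rng):
    """Set the given ratio of intensity values to zero at random positions."""
    k = round(len(spectrum) * ratio)
    zeroed = set(rng.sample(range(len(spectrum)), k))
    return [0 if i in zeroed else v for i, v in enumerate(spectrum)]

def top_hit_frequency(library, ratio, trials, rng):
    """Fraction of trials in which the damaged query still ranks its source first."""
    hits = 0
    for _ in range(trials):
        source = rng.randrange(len(library))
        query = drop_values(library[source], ratio, rng)
        best = max(range(len(library)),
                   key=lambda k: cosine_product(query, library[k]))
        hits += best == source
    return hits / trials

rng = random.Random(0)
library = [random_spectrum(rng) for _ in range(30)]
for ratio in (0.0, 0.1, 0.3):
    print(ratio, top_hit_frequency(library, ratio, 40, rng))
```

A random library contains no closely related spectra, so the frequencies printed here will stay far closer to 1 than the values in Table 3; the sketch only shows the shape of the procedure. Replacing `cosine_product` with an EIMS−BLAST score would give the second column of the comparison.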

obtained from green tea,2,18 yeast,19 and mouse plasma17 samples, were processed using EIMS−BLAST. The results were compared with metabolite annotations using the cosine-product method (Supporting Information, S3). The green tea data set was originally acquired to evaluate the relationship between tea quality and metabolic phenotypes.18 Among 171 metabolite signals detected from the data set, the metabolites responsible for 93 signals were identified by searching Library_RT_NT using EIMS−BLAST, with threshold levels at P < 0.015 (Table 4). Since the library search was iterated 171

Table 4. Number of Metabolite Signals Annotated by the Cosine-Product and EIMS−BLAST Methods (a)

                                                green tea    yeast     mouse plasma
total number of metabolite signals              171          141       369
number annotated using both methods (b)         79 (0)       83 (0)    106 (1)
number annotated using EIMS−BLAST only          14           7         19
number annotated using cosine product only      13           15        34
number not annotated                            65           36        210
expected number of false positives              2.6          1.8       5.5
false discovery rate (FDR; %)                   2.8          2.4       4.4

(a) The thresholds for metabolite annotations were set at P < 0.015 and C > 0.65 for the EIMS−BLAST and cosine-product methods, respectively.
(b) The number in parentheses indicates the number of metabolite signals whose first or second best search results were unmatched between the BLAST-based and cosine-product methods.
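The FDR figures quoted for the green tea data set follow from simple arithmetic: the expected number of false positives is the number of iterated searches multiplied by the P threshold, and the FDR is that number divided by the number of annotated signals. The Python sketch below restates this calculation; the function names are hypothetical.

```python
def expected_false_positives(n_queries, p_threshold):
    """Expected number of chance hits when every query is searched once."""
    return n_queries * p_threshold

def false_discovery_rate(n_queries, n_annotated, p_threshold):
    """Expected false positives as a fraction of the annotated signals."""
    if n_annotated == 0:
        return 0.0
    return expected_false_positives(n_queries, p_threshold) / n_annotated

# Green tea data set: 171 signals searched, 93 annotated at P < 0.015
fp = expected_false_positives(171, 0.015)
fdr = false_discovery_rate(171, 93, 0.015)
print(f"expected false positives: {fp:.3f}, FDR: {fdr:.1%}")
```

The same two functions make the trade-off of a looser threshold easy to explore: raising the P threshold increases the expected false positives linearly, while the number of annotated signals usually grows much more slowly.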

CONCLUSIONS

In this study, we demonstrated that the statistical significance of mass spectral similarities could be determined using modified BLAST statistics. To develop EIMS−BLAST, a simple rule for converting a unit EI mass spectrum to a mass spectral sequence (Table 1 and Figure 1) and a score matrix for aligned mass spectral sequences (Table 2) were developed. Application of the EIMS−BLAST method to the three GC/MS metabolome data sets demonstrated that there are disagreements in search results obtained using the EIMS−BLAST and cosine-product methods. This is because the similarity score determined using EIMS−BLAST reflects the nature of the query mass spectra. The disagreement suggests that future metabolomics studies with reliable metabolite annotation need further development of similarity scoring methods. In relation to this, recent improvements of the cosine-product method, such as compound matching based on isotope clusters22 and partial and semipartial correlations,23 are promising. MetAlignID also proposed a distinct methodology using a match factor and mass weighting of spectra.24 In addition to these efforts, the present study provides a novel probability-based approach to estimate mass spectral similarities with control of the metabolite annotation quality.

times, the expected number of false-positive identifications was deduced to be 2.56 (= 171 × 0.015). This indicates that the FDR in an identification list containing 93 metabolites is expected to be 2.8% (= 2.56/93). When a higher threshold level at P = 0.05 was used, the FDR was 8.6%, and the number of annotatable metabolite signals increased to 99 (data not shown). For the yeast and mouse plasma data sets, containing 141 and 369 metabolite signals, respectively, EIMS−BLAST successfully annotated 90 and 125 signals (P < 0.015), respectively (Table 4). The FDRs of the annotated metabolite lists were deduced to be 2.4% and 4.4%, respectively.

In proteomics studies, to validate the number of false-positive hits, spiking experiments have been performed. For example, false positives and false negatives were experimentally determined by analysis of 100 synthetic peptides spiked into background samples.21 However, spiking experiments are unrealistic for metabolomics, since they require a series of metabolite-like compounds that never exist in the actual


(20) (a) Lommen, A. Anal. Chem. 2009, 81, 3079−86. (b) Lommen, A.; Kools, H. J. Metabolomics 2012, 8, 719−726.
(21) Reiter, L.; Rinner, O.; Picotti, P.; Huttenhain, R.; Beck, M.; Brusniak, M. Y.; Hengartner, M. O.; Aebersold, R. Nat. Methods 2011, 8, 430−5.
(22) Wegner, A.; Sapcariu, S. C.; Weindl, D.; Hiller, K. Anal. Chem. 2013, 85, 4030−7.
(23) Kim, S.; Koo, I.; Jeong, J.; Wu, S.; Shi, X.; Zhang, X. Anal. Chem. 2012, 84, 6477−87.
(24) Lommen, A.; van der Kamp, H. J.; Kools, H. J.; van der Lee, M. K.; van der Weg, G.; Mol, H. G. J. Chromatogr. 2012, 1263, 169−78.

ASSOCIATED CONTENT

Supporting Information

(S1) Procedure for the similarity scoring of whole mass spectra, (S2) original score matrices, and (S3) metabolite annotations of three GC/MS data sets derived from green tea, yeast, and mouse plasma samples (Excel). This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This research was partially supported by JST, the Strategic International Collaborative Research Program, and SICORP for JP-US metabolomics. F.M. was also supported by Grants in Aid for Scientific Research (B).



REFERENCES

(1) Lisec, J.; Schauer, N.; Kopka, J.; Willmitzer, L.; Fernie, A. R. Nat. Protoc. 2006, 1, 387−96.
(2) Tsugawa, H.; Tsujimoto, Y.; Arita, M.; Bamba, T.; Fukusaki, E. BMC Bioinf. 2011, 12, 131.
(3) Horai, H.; Arita, M.; Kanaya, S.; Nihei, Y.; Ikeda, T.; Suwa, K.; Ojima, Y.; Tanaka, K.; Tanaka, S.; Aoshima, K.; Oda, Y.; Kakazu, Y.; Kusano, M.; Tohge, T.; Matsuda, F.; Sawada, Y.; Hirai, M. Y.; Nakanishi, H.; Ikeda, K.; Akimoto, N.; Maoka, T.; Takahashi, H.; Ara, T.; Sakurai, N.; Suzuki, H.; Shibata, D.; Neumann, S.; Iida, T.; Tanaka, K.; Funatsu, K.; Matsuura, F.; Soga, T.; Taguchi, R.; Saito, K.; Nishioka, T. J. Mass Spectrom. 2010, 45, 703−14.
(4) Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859−866.
(5) Hiller, K.; Hangebrauk, J.; Jager, C.; Spura, J.; Schreiber, K.; Schomburg, D. Anal. Chem. 2009, 81, 3429−39. Halket, J. M.; Przyborowska, A.; Stein, S. E.; Mallard, W. G.; Down, S.; Chalmers, R. A. Rapid Commun. Mass Spectrom. 1999, 13, 279−84.
(6) Matsuda, F.; Shinbo, Y.; Oikawa, A.; Hirai, M. Y.; Fiehn, O.; Kanaya, S.; Saito, K. PLoS One 2009, 4, e7490.
(7) Tabb, D. L. J. Proteome Res. 2008, 7, 45−6. Sadygov, R. G.; Cociorva, D.; Yates, J. R., 3rd. Nat. Methods 2004, 1, 195−202.
(8) Choi, H.; Nesvizhskii, A. I. J. Proteome Res. 2008, 7, 47−50. Elias, J. E.; Gygi, S. P. Nat. Methods 2007, 4, 207−14.
(9) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. J. Proteome Res. 2008, 7, 29−34.
(10) Sansom, C. Briefings Bioinf. 2000, 1, 22−32.
(11) Altschul, S. F.; Erickson, B. W. Mol. Biol. Evol. 1985, 2, 526−38.
(12) Karlin, S.; Altschul, S. F. Proc. Natl. Acad. Sci. U.S.A. 1990, 87, 2264−8.
(13) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. J. Mol. Biol. 1990, 215, 403−10.
(14) Mount, D. W. CSH Protoc. 2008, 2008, pdb.top39. Henikoff, S.; Henikoff, J. G. Proc. Natl. Acad. Sci. U.S.A. 1992, 89, 10915−9.
(15) McLafferty, F. W.; Hertel, R. H.; Villwock, R. D. Org. Mass Spectrom. 1974, 9, 690−702.
(16) Mylonas, R.; Mauron, Y.; Masselot, A.; Binz, P. A.; Budin, N.; Fathi, M.; Viette, V.; Hochstrasser, D. F.; Lisacek, F. Anal. Chem. 2009, 81, 7604−10.
(17) Tsugawa, H.; Bamba, T.; Shinohara, M.; Nishiumi, S.; Yoshida, M.; Fukusaki, E. J. Biosci. Bioeng. 2011, 112, 292−8.
(18) Pongsuwan, W.; Fukusaki, E.; Bamba, T.; Yonetani, T.; Yamahara, T.; Kobayashi, A. J. Agric. Food Chem. 2007, 55, 231−6.
(19) Yoshida, R.; Tamura, T.; Takaoka, C.; Harada, K.; Kobayashi, A.; Mukai, Y.; Fukusaki, E. Aging Cell 2010, 9, 616−25.
