Anal. Chem. 2009, 81, 7604–7610

X-Rank: A Robust Algorithm for Small Molecule Identification Using Tandem Mass Spectrometry

Roman Mylonas,*,† Yann Mauron,*,† Alexandre Masselot,‡ Pierre-Alain Binz,†,‡ Nicolas Budin,‡ Marc Fathi,§ Véronique Viette,§,| Denis F Hochstrasser,§,⊥ and Frederique Lisacek†

Swiss Institute of Bioinformatics (SIB), Geneva Bioinformatics SA, Geneva University Hospitals, ADMed Fundation, and Swiss Center for Applied Human Toxicology

The diversity of experimental workflows involving LC-MS/MS and the extended range of mass spectrometers tend to produce extremely variable spectra. Variability reduces the accuracy of compound identification produced by commonly available software for a spectral library search. We introduce here a new algorithm that successfully matches MS/MS spectra generated by a range of instruments, acquired under different conditions. Our algorithm called X-Rank first sorts peak intensities of a spectrum and second establishes a correlation between two sorted spectra. X-Rank then computes the probability that a rank from an experimental spectrum matches a rank from a reference library spectrum. In a training step, characteristic parameter values are generated for a given data set. We compared the efficiency of the X-Rank algorithm with the dot-product algorithm implemented by MS Search from the National Institute of Standards and Technology (NIST) on two test sets produced with different instruments. Overall the X-Rank algorithm accurately discriminates correct from wrong matches and detects more correct substances than the MS Search. Furthermore, X-Rank could correctly identify and top rank eight chemical compounds in a commercially available test mix. This confirms the ability of the algorithm to perform both a straight single-platform identification and a cross-platform library search in comparison to other tools. It also opens the possibility for efficient general unknown screening (GUS) against large compound libraries.
In the past decade, the use of mass spectrometry has dramatically increased in a broad panel of research fields and applications.1 Some domains are more frequently cited, such as peptide identification,2 forensic investigation,3 food safety,4 antidoping,5

* To whom correspondence should be addressed. E-mail: roman.mylonas@isb-sib.ch (R.M.); [email protected] (Y.M.).
† Swiss Institute of Bioinformatics (SIB).
‡ Geneva Bioinformatics SA.
§ Geneva University Hospitals.
| ADMed Fundation.
⊥ Swiss Center for Applied Human Toxicology.
(1) J. Mass Spectrom. 2008, 43, 1432–1439.
(2) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Proteomics 2007, 7, 655–667.
(3) Concheiro, M.; De Castro, A.; Quintela, O.; Cruz, A.; Lopez-Rivadulla, M. J. Anal. Toxicol. 2007, 31, 573–580.
(4) Pico, Y.; Font, G.; Ruiz, M. J.; Fernandez, M. Mass Spectrom. Rev. 2006, 25, 917–960.


Analytical Chemistry, Vol. 81, No. 18, September 15, 2009

and clinical and toxicological analysis.6 While the use of liquid chromatography tandem mass spectrometry (LC-MS/MS) for peptide identification has been widely assessed in proteomics research, this technology is rather new in other fields. Along with this increase, the need for efficient identification algorithms and software has become crucial, particularly for LC-MS/MS. Meeting this need is the main objective of the present work.

The analysis of LC-MS/MS data is less stable than that of gas chromatography/mass spectrometry (GC/MS) data. GC/MS technology is well established in many laboratories, and as such, it is well understood and mastered. Stable ionization/fragmentation and reliable acquisition processes guarantee reproducible experimental spectra. Consequently, substance identification by searching a library of annotated MS spectra is reliable. Large GC/MS spectral libraries are commercially distributed by NIST7 or Wiley-VCH (for the Maurer/Pfleger/Weber MS Library8). They contain spectra generated in various applications with a range of instrument types. Even though GC/MS analysis allows for efficient identification of many small molecules (e.g., fatty acids, amino acids, and organic acids),9 it involves the extraction and derivatization of analytes and limits the detection range of molecules with respect to size and type. For example, nonvolatile, thermally labile, polar, and high mass substances are poorly detectable. These limitations triggered new initiatives for detecting small molecules and promoted the use of LC-MS/MS. Both high and low mass compounds, such as amines, alcohols, and carboxylic acids, are detectable with an LC-MS/MS system. LC-MS/MS is also highly sensitive and reliable for quantitation. The diversity of workflows confirms the activity of the community behind LC-MS/MS.
These workflows cover both searching for known substances10 and general unknown screening (GUS).11 However, spectra of the same compound can show variability, depending on sample processing and instrument type.12,13 The collision energy (CE) plays a predominant role in the resulting spectra.12,14

(5) http://www.asms.org.
(6) Ostapenko, Y. N.; Lisovik, Z. A.; Belova, M. V.; Luzhnikov, E. A.; Livanov, A. S. Przegl. Lek. 2005, 62, 591–594.
(7) Stein, S. E. J. Am. Soc. Mass Spectrom. 1995, 6, 644–655.
(8) Aebi, B.; Bernhard, W. J. Anal. Toxicol. 2002, 26, 149–156.
(9) Want, E. J.; Cravatt, B. F.; Siuzdak, G. ChemBioChem 2005, 6, 1941–1951.
(10) Mueller, C. A.; Weinmann, W.; Dresen, S.; Schreiber, A.; Gergov, M. Rapid Commun. Mass Spectrom. 2005, 19, 1332–1338.
(11) Saint-Marcoux, F.; Lachatre, G.; Marquet, P. J. Am. Soc. Mass Spectrom. 2003, 14, 14–22.
(12) Bogusz, M. J.; Maier, R. D.; Kruger, K. D.; Webb, K. S.; Romeril, J.; Miller, M. L. J. Chromatogr., A 1999, 844, 409–418.

10.1021/ac900954d CCC: $40.75 © 2009 American Chemical Society. Published on Web 08/24/2009

The high variability among LC-MS/MS spectra thus remains the main obstacle to the efficient use of conventional library search software. This situation has led different research teams to construct their own libraries independently. These homemade libraries are usually not exhaustive due to technical and financial limitations. They cover selected areas of knowledge and are used for specific applications. Various research groups have proposed solutions to overcome the poor reproducibility of LC-MS/MS spectra. A first approach is to combine different analytical methods in order to obtain more stable results. One of these initiatives suggests the combined use of two analyzers (ESI-QqTOF-MS and ESI-QqTOF-MS/MS).15 A second approach is the calibration of different instruments relative to one another. Gergov and his team13 proposed the use of a calibration substance for standardizing spectra acquisition. Unfortunately, the cross-instrument spectrum variation is still greater than the variation among spectra coming from the same instrument. Furthermore, two or more spectral libraries from different databases cannot be calibrated at the same time. As a third possibility, different groups have relied on MS/MS fragmentation of a precursor with different collision energies.13 Even though this was shown to be a powerful technique to improve the overall performance, spectra produced from different instrument types remain weakly comparable. Alternatively, the performance of the library search for LC-MS/MS systems can be improved by optimizing the search algorithms themselves, in particular, the matching procedure.14 Since the 1970s, several algorithms have been described, among which the most popular is the dot-product-derived algorithm.2,16,17 The dot product, or scalar product, takes as input two vectors and returns a scalar quantity.
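As an illustration of the dot-product idea, the following Python sketch bins two peak lists by m/z and computes a normalized dot product (cosine similarity). It is only a minimal sketch: the function names, the unit bin width, and the absence of any weighting are assumptions made here for illustration, and it does not reproduce the weighted variants implemented in the tools discussed in this article.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Map (m/z, intensity) pairs to {bin index: summed intensity}.

    bin_width is a hypothetical parameter chosen for this sketch.
    """
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(peaks_a, peaks_b, bin_width=1.0):
    """Normalized dot product of two binned spectra (0.0 if either is empty)."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    # Only bins present in both spectra contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Because the result depends directly on peak intensities, two spectra of the same compound with the same fragments but different relative intensities score below 1, which is the sensitivity to variability discussed in this section.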
If the two vectors represent the peak masses and intensities of two distinct spectra, the final scalar quantity measures the similarity between those two spectra. In weighted dot products, this value is weighted with some criteria, such as the quality of the spectra or the presence of the precursor peak, among others. The most frequently used tools that implement a dot-product algorithm are MS Search (NIST), Analyst and Cliquid (Applied Biosystems|MDS Sciex), ToxId (Thermo-Fisher), and Mass Frontier (HighChem). To our knowledge, MS Search, Analyst, and Cliquid all use a weighted dot-product algorithm, involving multiple empirically based extensions. The NIST algorithm and ToxId also use this algorithm. While shown to perform well in certain applications,2 these algorithms do not properly address the questions associated with LC-MS/MS. MS Search is well adapted to GC/MS data but cannot cope with the high variability of LC-MS/MS. Analyst and Cliquid are mainly run with Applied Biosystems|MDS Sciex instruments and proprietary data, which are acquired with the same instrument (Q TRAP). Regardless of performance, these tools are not transferable across platforms. A range of other algorithms has been implemented. For instance, some authors introduce the compound structure

(13) Gergov, M.; Weinmann, W.; Meriluoto, J.; Uusitalo, J.; Ojanpera, I. Rapid Commun. Mass Spectrom. 2004, 18, 1039–1046.
(14) Bristow, A. W. T.; Nichols, W. F.; Webb, K. S.; Conway, B. Rapid Commun. Mass Spectrom. 2002, 16, 2374–2386.
(15) Pavlic, M.; Libiseller, K.; Oberacher, H. Anal. Bioanal. Chem. 2006, 386, 69–82.
(16) Gan, F.; Yang, J. H.; Liang, Y. Z. Anal. Sci. 2001, 17, 635–638.
(17) McLafferty, F. W.; Zhang, M. Y.; Stauffer, D. B.; Loh, S. Y. J. Am. Soc. Mass Spectrom. 1998, 9, 92–95.

analysis7 or the comparison of angles formed by spectra represented as vectors of mass over charge.18 However, none of these isolated attempts tackles spectrum variability except for the recent work of Oberacher et al.19,20 The corresponding algorithm is based on peak matching and does not rely on the absolute intensity of peaks.

In this article, we present a new algorithm called X-Rank. This method serves three purposes. First, it efficiently supports cross-platform identification. Second, X-Rank addresses the high variability inherent to LC-MS/MS methodology. Finally, X-Rank provides a solution to general unknown screening (GUS) experiments. The importance of GUS has grown in many applications with the need to detect compounds not known a priori. X-Rank is a library search algorithm based on statistical relations between mass over charge values, ordered by intensities. This new algorithm can be trained with specific data sets and accounts for cross-platform identification, while being robust to interspectrum differences. SmileMS, the platform that uses X-Rank, also offers the scientific community the new possibility of sharing and enriching homemade and high-quality spectral databases.

EXPERIMENTAL SECTION

X-Rank: Algorithm Description. Unlike previously published methods, X-Rank takes into account neither absolute nor relative intensities. A study on peptide identification methods showed improved results when using ranks instead of peak intensities.21 Oberacher et al. reached a similar conclusion in the context of small molecule identification. The accuracy of identification was shown to improve when reducing the importance of peak intensities.19,20 The X-Rank algorithm first sorts the peaks of a spectrum by intensity and then establishes a correlation between two previously sorted spectra. It then relies on a scoring model to differentiate between two fragmentation MS spectra.
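The sorting step can be sketched in a few lines of Python. This is an illustrative sketch rather than the SmileMS implementation: the function name is hypothetical, ties are left in input order (an assumption), and the default cutoff of 30 peaks follows the value the article gives for the trained distributions.

```python
def rank_peaks(peaks, max_rank=30):
    """Return m/z values ordered by decreasing intensity (rank 1 first).

    peaks: iterable of (m/z, intensity) pairs.
    max_rank: number of most intense peaks kept; lower-intensity peaks are dropped.
    """
    # sorted() is stable, so peaks with equal intensity keep their input order.
    ordered = sorted(peaks, key=lambda p: p[1], reverse=True)
    return [mz for mz, _ in ordered[:max_rank]]


# Two acquisitions with different absolute and relative intensities
# but the same intensity order yield the same ranked list.
spectrum_a = [(91.1, 20.0), (119.1, 100.0), (65.0, 5.0)]
spectrum_b = [(91.1, 450.0), (119.1, 600.0), (65.0, 30.0)]
same_order = rank_peaks(spectrum_a) == rank_peaks(spectrum_b)
```

After this step the intensities themselves are discarded; only the order of the m/z values is carried forward into the scoring model.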
The model is built as follows: Suppose $S$ and $S'$ are two measures of fragmentation spectra. Those two measures may originate from the same chemical compound ($H_1$, the alternative hypothesis) or from different compounds ($H_0$, the null hypothesis). $S$ can then be represented as a set of fragment ion peaks $\{f_1, f_2, \ldots, f_{\mathrm{size}(S)}\}$. $f_i \leftrightarrow S'$ indicates that $f_i$ matches any fragment of $S'$, i.e., the mass of $f_i$ differs from that of the matching peak in $S'$ by less than a given tolerance $\tau$. $f_i \leftrightarrow f'_j$ indicates that fragment $f_i$ matches the specific fragment $f'_j$ from $S'$. Let $f_{\lambda(1)}$ be the most intense fragment of $S$ ($f_{\lambda(2)}$ is the second most intense fragment and so on). The first step is to model probabilities for each hypothesis (i.e., $H_1$ and $H_0$) that a peak of $S$ matches one of $S'$ according to their respective ranks. This can be mathematically stated as eqs 1 and 2, respectively.

$$P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid H_1) \tag{1}$$

$$P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid H_0) \tag{2}$$
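The matching relation between ranked fragments can be sketched as follows. The function name, the tolerance value of 0.5, and the rule of keeping the most intense candidate when several reference peaks fall within the tolerance are all assumptions made for this illustration; the article does not state how such cases are resolved.

```python
def match_ranks(ranked_s, ranked_ref, tol=0.5):
    """For each rank i of S, return the rank j of a matching peak in S', or None.

    ranked_s, ranked_ref: m/z values ordered by decreasing intensity.
    tol: mass tolerance (the tau of the text); 0.5 is an illustrative value.
    """
    matches = {}
    for i, mz in enumerate(ranked_s, start=1):
        matches[i] = None
        for j, ref_mz in enumerate(ranked_ref, start=1):
            if abs(mz - ref_mz) < tol:
                # Assumption: keep the first, i.e. most intense, candidate.
                matches[i] = j
                break
    return matches


# Rank 1 matches rank 1, rank 2 matches rank 4, rank 3 has no partner.
example = match_ranks([119.1, 91.1, 65.0], [119.0, 77.0, 150.0, 91.2])
```

This sketch does not prevent one reference peak from being matched by several experimental peaks, which a full implementation would probably need to handle.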

(18) Wan, K. X.; Vidavsky, I.; Gross, M. L. J. Am. Soc. Mass Spectrom. 2002, 13, 85–88. (19) Oberacher, H.; Pavlic, M.; Libiseller, K.; Schubert, B.; Sulyok, M.; Schuhmacher, R.; Csaszar, E.; Kofeler, H. J. Mass Spectrom. 2009, 44, 485–493.


Figure 1. Results of computation of parameters using a large set of manually validated observations. (Left) Density of the probability, depending on $\lambda$, that two fragments from spectra of the same chemical substance match. The experimental distribution is fitted by a negative exponential distribution. This case will account for the correct hypothesis $H_1$. This is denoted $P(f_{\lambda(i)} \leftrightarrow S' \mid H_1)$. (Right) Box plot of the probability that two fragments from spectra of two different chemical substances match. This case will account for the random hypothesis $H_0$. It is denoted $P(f_{\lambda(i)} \leftrightarrow S' \mid H_0)$.

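The theorem of composite probability can be applied to obtain eq 3. The case of $H_0$ is presented, but the same applies to $H_1$.

$$P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid H_0) = P(f_{\lambda(i)} \leftrightarrow S' \mid H_0)\, P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid f_{\lambda(i)} \leftrightarrow S', H_0) \tag{3}$$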
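The two factors on the right-hand side of eq 3 can be estimated by counting over validated spectrum pairs. The Python sketch below assumes each observation is a dictionary mapping a rank of S to the matched rank in S′ (or None); the function name and this data layout are hypothetical, and the raw frequencies it returns are not the fitted curves used in the article.

```python
from collections import Counter, defaultdict


def estimate_factors(observations):
    """Estimate the two factors of eq 3 as raw frequencies.

    observations: list of {rank i in S: matched rank j in S', or None},
    all collected under one hypothesis (H1 or H0).
    Returns (p_any, p_cond) where
      p_any[i]     estimates P(f_i matches any peak of S'),
      p_cond[i][j] estimates P(f_i matches f'_j, given f_i matches S').
    """
    seen = Counter()             # how often rank i was observed
    matched = Counter()          # how often rank i matched some peak
    pair = defaultdict(Counter)  # pair[i][j]: how often rank i matched rank j
    for obs in observations:
        for i, j in obs.items():
            seen[i] += 1
            if j is not None:
                matched[i] += 1
                pair[i][j] += 1
    p_any = {i: matched[i] / seen[i] for i in seen}
    p_cond = {i: {j: c / matched[i] for j, c in pair[i].items()} for i in pair}
    return p_any, p_cond


training = [
    {1: 1, 2: 4, 3: None},
    {1: 1, 2: 2, 3: None},
    {1: 2, 2: None, 3: None},
    {1: None, 2: 4, 3: 5},
]
p_any, p_cond = estimate_factors(training)
# Eq 3: joint probability = P(match anything) * P(match rank j | match anything).
joint_1_1 = p_any[1] * p_cond[1][1]
```

With so few observations many rank pairs are never seen, which is one practical reason to replace raw frequencies by smooth fitted functions.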

The experimental distributions of the two probabilities on the right part of eq 3 can be computed empirically given a large set of observations, i.e., manually validated identifications of spectra against a library of chemical compound spectra. We consider as an example the high quality spectrum library from Weinmann10 (see the Supporting Information for further details). Two spectra of the same compound will account for the correct hypothesis $H_1$. Two spectra of different compounds will account for the random hypothesis $H_0$. The distributions for $P(f_{\lambda(i)} \leftrightarrow S' \mid H_1)$ and $P(f_{\lambda(i)} \leftrightarrow S' \mid H_0)$ are shown in the left and right images of Figure 1. The left image in Figure 1 shows that $P(f_{\lambda(i)} \leftrightarrow S' \mid H_1)$ can be modeled by fitting a negative exponential curve. Since we did not find any correlation between $\lambda$ (i.e., the rank of the fragment) and the estimated probability that two fragments match for $H_0$, we simply approximated this value by the mean, 0.065. The values for $P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid H_1)$ and $P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid H_0)$ are empirically determined and can be different depending on the set selected for training the algorithm. Experimental distributions for $P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid f_{\lambda(i)} \leftrightarrow S', H_1)$ are shown on the

(20) Oberacher, H.; Pavlic, M.; Libiseller, K.; Schubert, B.; Sulyok, M.; Schuhmacher, R.; Csaszar, E.; Kofeler, H. J. Mass Spectrom. 2009, 44, 494–502.
(21) Magnin, J.; Masselot, A.; Menzel, C.; Colinge, J. J. Proteome Res. 2004, 3, 55–60.


right in Figure 2 for selected values of $i$. In the same figure, we show how these distributions can be fitted by “γ-like” distributions (plain lines). A γ distribution is a statistical distribution that involves two parameters: a scale parameter that determines the position of the curve on the x-axis and a second parameter that determines the shape of the curve. Here we use “γ-like” distributions that include a third parameter, which represents a shift of the curve on the y-axis. This choice is justified by a better fit to the experimental curve. In contrast, experimental distributions for $P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid f_{\lambda(i)} \leftrightarrow S', H_0)$, shown on the left in Figure 2, can be fitted with negative exponential functions. Fitting these density functions with explicit functions results in a simpler model and thus avoids overfitting. In these distributions, only the 30 most intense peaks were kept in each spectrum. In practice, lower intensity signals, i.e., higher ranks, are poorly informative.

Given the observed alignment between two spectra $S \sim S'$, we can now compute the global probabilities $P(S \sim S' \mid H_0)$ and $P(S \sim S' \mid H_1)$ of this observation being random or correct. If we assume that matches between individual fragment peaks are independent, we can factorize for $H_0$ (respectively, $H_1$):

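$$P(S \sim S' \mid H_0) = \prod_{i=0}^{n} \begin{cases} P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid H_0), & \text{if } f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \\ 1 - P(f_{\lambda(i)} \leftrightarrow S' \mid H_0), & \text{if } f_{\lambda(i)} \nleftrightarrow S' \end{cases} \tag{4}$$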

Deciding whether an observation is random or not is now reduced to a simple hypothesis testing problem. The Neyman-Pearson lemma shows that the optimal statistic for deciding between the null hypothesis (random match) and the alternative

Figure 2. Results of the computation of parameters using a large set of manually validated observations (for an overview, only the first three ranks of spectra S are displayed). The green circles represent the estimated probability that the first rank in spectrum $S$ matches the fragment of rank $x$ in spectrum $S'$. The red plus signs and the blue asterisks represent the corresponding estimated probabilities for the second and the third rank of spectrum $S$, respectively. (Left) In this figure, $S$ and $S'$ come from two different chemical compounds. This is denoted $P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid f_{\lambda(i)} \leftrightarrow S', H_0)$. These distributions can be fitted by negative exponential distributions (plain lines). (Right) In this figure, $S$ and $S'$ come from the same chemical compound. This is denoted $P(f_{\lambda(i)} \leftrightarrow f'_{\lambda'(j)} \mid f_{\lambda(i)} \leftrightarrow S', H_1)$. These distributions can be fitted by γ-like distributions (plain lines).

Table 1. Four Different Comparative Tests Were Undertaken^a

experimental set    reference library    results
Paul trap           Paul trap            Figure 4, upper left-hand part
Paul trap           QTRAP                Figure 4, upper right-hand part
QTRAP               QTRAP                Figure 4, lower left-hand part
QTRAP               Paul trap            Figure 4, lower right-hand part

Figure 3. Illustration of a score computation (for a clearer view, only the first three ranks of spectrum $S$ are considered). In this example, the first rank of spectrum $S$ matches the first rank of spectrum $S'$, and the second matches the fourth one. This is denoted as $f_{\lambda(1)} \leftrightarrow f'_{\lambda'(1)}$ and $f_{\lambda(2)} \leftrightarrow f'_{\lambda'(4)}$, respectively. The third rank does not have a matching fragment and is consequently denoted as $f_{\lambda(3)} \nleftrightarrow S'$. This results in $P(S \sim S' \mid H_0) = P(f_{\lambda(1)} \leftrightarrow f'_{\lambda'(1)} \mid H_0)\, P(f_{\lambda(2)} \leftrightarrow f'_{\lambda'(4)} \mid H_0)\, (1 - P(f_{\lambda(3)} \leftrightarrow S' \mid H_0))$ (respectively, $H_1$).

hypothesis (correct match), is the likelihood ratio, that is, the probability of a correct match divided by the probability of a random match. Consequently, we define the score of a match as

$$\sigma(S, S') = \log \frac{P(S \sim S' \mid H_1)}{P(S \sim S' \mid H_0)} \tag{5}$$
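The factorized probability of eq 4 and the log-likelihood ratio of eq 5 can be combined in a short Python sketch. The function names and data layout are hypothetical, and every probability value in the example is invented for illustration; none of them is a trained X-Rank parameter.

```python
import math


def log_likelihood(matches, p_any, p_pair):
    """log P(S ~ S' | H) under the independence assumption of eq 4.

    matches: {rank i in S: matched rank j in S', or None}
    p_any[i]: P(f_i matches any peak of S' | H)
    p_pair[(i, j)]: P(f_i matches f'_j | H)
    """
    total = 0.0
    for i, j in matches.items():
        if j is None:
            total += math.log(1.0 - p_any[i])   # unmatched rank
        else:
            total += math.log(p_pair[(i, j)])   # matched rank pair
    return total


def xrank_score(matches, model_h1, model_h0):
    """Eq 5: log-likelihood ratio of a correct match over a random match."""
    return log_likelihood(matches, *model_h1) - log_likelihood(matches, *model_h0)


# Alignment of Figure 3: rank 1 matches rank 1, rank 2 matches rank 4,
# rank 3 is unmatched. All probabilities below are illustrative only.
matches = {1: 1, 2: 4, 3: None}
model_h1 = ({1: 0.9, 2: 0.8, 3: 0.7}, {(1, 1): 0.5, (2, 4): 0.1})
model_h0 = ({1: 0.065, 2: 0.065, 3: 0.065}, {(1, 1): 0.01, (2, 4): 0.005})
score = xrank_score(matches, model_h1, model_h0)
```

A positive score favors the correct-match hypothesis. Summing logarithms instead of multiplying probabilities avoids numerical underflow, but a probability of exactly zero would raise an error here, so the sketch assumes strictly positive model values.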

In practice, the logarithm is taken for numerical convenience. Figure 3 illustrates the computation of a match (for simplicity, only the first three ranks of spectrum $S$ are considered). The first rank of spectrum $S$ matches the first rank of spectrum $S'$, and the second matches the fourth one. This is denoted as $f_{\lambda(1)} \leftrightarrow f'_{\lambda'(1)}$ and $f_{\lambda(2)} \leftrightarrow f'_{\lambda'(4)}$. The third rank does not have a matching fragment and is consequently denoted as $f_{\lambda(3)} \nleftrightarrow S'$. According to eq 4, the probability $P(S \sim S' \mid H_0)$ can thus be expressed as $P(f_{\lambda(1)} \leftrightarrow f'_{\lambda'(1)} \mid H_0)\, P(f_{\lambda(2)} \leftrightarrow f'_{\lambda'(4)} \mid H_0)\, (1 - P(f_{\lambda(3)} \leftrightarrow S' \mid H_0))$ (respectively, $H_1$).

^a The Paul Trap set was identified against the Paul Trap library and the QTRAP library, and the QTRAP set was identified against the QTRAP ion trap library and the Paul Trap library.

X-Rank: Algorithm Implementation. X-Rank was prototyped and tested in Perl and R. It was then reimplemented in Java. The current implementation of this algorithm is integrated in SmileMS, an identification platform we develop for small molecules. SmileMS covers the needs of routine analysis of small molecules, such as data storage and management and an intuitive interface. The core of SmileMS was developed in Java and the web interface in Flash. The lightweight architecture supports a server as well as a desktop installation. This architecture also complies with modern software development paradigms and is therefore highly modular and extensible. All tests presented in this article were conducted on an Apple laptop computer with 2 GB of RAM.

X-Rank: Algorithm Performance. In order to assess the performance of the algorithm, we conducted two different experiments. For the first experiment, each of two data sets was searched against the other and against itself in a leave-one-out manner. Each of the four identification tasks was conducted using X-Rank and the MS Search tool from NIST. Because the MS Search tool does not allow for any parameter training on specific data, X-Rank parameters were trained only with QTRAP data, deliberately avoiding retraining for each comparison. Neither SmileMS nor MS Search includes the precursor ion in score computation. The identification was conducted as described in Table 1. Indications in Table 1 refer to results presented in the Results and Discussion. The two data sets used are (1) the Weinmann data set: This data set consists of


Figure 4. The figure shows the four comparison cases. The upper left-hand figure shows ion trap data identified in a leave-one-out manner, the upper right-hand figure shows ion trap data identified against a QTRAP library, the lower left-hand figure shows QTRAP data identified in a leave-one-out manner, and finally the lower right-hand figure shows QTRAP data identified against an ion trap library. Each figure shows performance in red for X-Rank and in blue for MS Search. On the whole, X-Rank appears as a more specific and sensitive method, especially for cross-platform identification.

1050 product ion spectra acquired on an Applied Biosystems|MDS Sciex QTRAP from 344 different substances. Details of the chemical substances and the number of spectra per substance are provided in the Supporting Information. The data mainly originated from the commercially available Weinmann library and additional compounds acquired by Geneva University Hospital (GUH) on the same instrument under similar conditions. The acquisition method consists of multiple reaction monitoring/enhanced product ion (MRM/EPI) runs with three different collision energies: 20, 35, and 40 eV. (2) IT-NIST data set: Product ion spectra acquired on a Paul Trap machine and containing at least five fragmentation peaks were extracted from the NIST LC-MS/MS 2008 library. Fragmentation spectra of this library were acquired under different conditions and


collected by NIST. A total of 4504 spectra from 1171 different substances were obtained. Details of the chemical substances and the number of spectra per substance are provided in the Supporting Information.

For the second experiment, a publicly available test mix for drug analysis, acquired on an Applied Biosystems QTRAP 4000, was identified against the aforementioned Weinmann data set. This test mix, developed by Restek with the help of Applied Biosystems|MDS Sciex, is routinely used for the assessment of the Applied Biosystems Cliquid software. The test mix consists of the eight forensic drugs presented in Table 2. As for the previously presented comparisons, the identification was conducted with both X-Rank and MS Search.

Table 2. Publicly Available Test Mix for Drug Analysis: Composition^a

chemical substance    concentration (µg/mL)
amiodarone            10
amphetamine           10
caffeine              10
codeine               10
diazepam              10
doxepine              10
haloperidol           1
morphine              10

^a This test mix, developed by Restek with the help of Applied Biosystems|MDS Sciex, consists of the eight forensic drugs.

RESULTS AND DISCUSSION

Results of the comparative study across instruments and libraries are now presented as indicated in Table 1. Recall that four different comparative tests were undertaken whereby the IT set was identified against itself and the QTRAP library and the QTRAP set was identified against itself and the IT library. Figure 4 presents the corresponding receiver operating characteristic (ROC) curves of X-Rank and MS Search performance. The red curves represent identifications conducted with X-Rank, whereas blue curves represent identifications conducted with MS Search. On the whole, X-Rank appears as a more specific and sensitive method, especially for cross-platform identification. X-Rank shows an important improvement over the other algorithm. This good performance opens the possibility of identifying spectra against heterogeneous libraries.

The upper left-hand part of Figure 4 shows the comparison of MS Search and X-Rank while identifying ion trap data in a leave-one-out manner. Along with the lower left-hand part of Figure 4, this figure illustrates one of the easiest cases: test data are identified against data generated with the same experimental design. Throughout, both specificity and sensitivity are improved with X-Rank. Since this situation is rather straightforward, performances for both tools are close. The slight improvement in the upper-left corner of the chart is nonetheless significant, since it indicates an increase of correct identifications given a smaller false positive rate. At most, 3932 identifications were produced by X-Rank, while only 3916 were produced by MS Search.

The upper right-hand part of Figure 4 shows the comparison of MS Search and X-Rank while identifying ion trap data against a QTRAP library. It displays the performance of the two algorithms while identifying a test data set acquired with one fragmentation type against a reference data set acquired with a distinct fragmentation type.
Although other parameters differ, fragmentation is the most important one in this situation. Results are less impressive than in the previous case but still highlight a significant improvement compared to the existing identification algorithm. The curve of MS Search does not reach the 100% value. This indicates that MS Search does not manage to identify all the substances even with a low score. At most, 188 identifications were produced by X-Rank, while only 174 were produced by MS Search.

The lower left-hand part of Figure 4 shows the comparison of MS Search and X-Rank while identifying QTRAP data in a leave-one-out manner. This figure is very similar to the upper left-hand part of Figure 4. X-Rank performs better in the low false positive rate section. MS Search performs slightly better for a false positive rate between 15 and 30. Finally, the two algorithms obtain equal performance for a false positive rate above 30. At most, 1049 identifications were produced by X-Rank, while only 1044 were produced by MS Search.

The lower right-hand part of Figure 4 shows the comparison of MS Search and X-Rank while identifying QTRAP data against an ion trap library. This last figure demonstrates clear improvement with X-Rank. The improvement extends from the low false positive rate up to the maximum false positive rate. As in the upper right-hand part of Figure 4, the curve representing the performance of MS Search does not reach the 100% value. This indicates that MS Search does not manage to identify all the substances even with a low score. At most, 172 identifications were produced by X-Rank, while only 162 were produced by MS Search.

The second part of the performance assessment consisted of a test mix identification comparison. Figure 5 shows in a three-column table the comparison of SmileMS and MS Search for this test mix identification. Columns one and two show two different MS Search scores. The first one is described as a probability score, while the second one is described as a matching score. The third column corresponds to identification with X-Rank. Correct identifications are highlighted in green, according to the known test mix content. The names presented are retrieved without modification from PubChem except for D5-Amphetamine, which is named by our system. In columns one and two, D5-Amphetamine could be considered green, since this substance was not one of the searched substances but was used as an internal standard. The other internal standard, D5-Doxepin, is never present. Our interpretation is that D5-Doxepin is wrongly identified as Doxepin since no filter on the precursor mass was used. The same could apply for D5-Amphetamine in the case of X-Rank. This table clearly shows the better behavior of X-Rank.
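For readers who wish to build ROC curves like those of Figure 4 from their own scored identifications, the following Python sketch computes the curve points from a list of (score, is_correct) pairs. It is a generic construction with a hypothetical function name; the article does not specify how its curves or axes were produced, so the normalization to rates between 0 and 1 is an assumption.

```python
def roc_points(scored_matches):
    """Return ROC points [(false positive rate, true positive rate), ...].

    scored_matches: list of (score, is_correct) pairs, where is_correct is
    True for a validated identification and False otherwise.
    The score threshold is lowered one match at a time, best score first.
    """
    positives = sum(1 for _, ok in scored_matches if ok)
    negatives = len(scored_matches) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, ok in sorted(scored_matches, key=lambda m: m[0], reverse=True):
        if ok:
            tp += 1
        else:
            fp += 1
        fpr = fp / negatives if negatives else 0.0
        tpr = tp / positives if positives else 0.0
        points.append((fpr, tpr))
    return points


# A scorer that ranks both correct matches above both wrong ones
# reaches a true positive rate of 1.0 before any false positive appears.
curve = roc_points([(5.0, True), (3.0, False), (4.0, True), (1.0, False)])
```

In this sketch matches with equal scores are treated as separate threshold steps. A curve that stops short of a true positive rate of 1.0 would arise if correct identifications missing from the result list were still counted among the positives, which this sketch does not model.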
CONCLUSIONS

Recently, LC-MS/MS has gained importance for small molecule identification, mostly because the detection range is increased compared to other methods. However, high fragment spectra variability is the major challenge of this technology. This problem can be addressed either by defining more consistent workflows or by improving matching algorithms. We chose the latter option in the present work, which describes a new library search algorithm, X-Rank. It is based on probabilistic calculations and is trainable for specific conditions. The performance of the X-Rank algorithm implemented in SmileMS and that of the dot-product algorithm implemented in MS Search were compared. Both algorithms were tested using data generated by two different instruments (1050 QTRAP product ion spectra and 4504 Paul Trap product ion spectra). The two data sets were used alternately as input or reference (each against itself and against the other), and in all four cases, X-Rank showed better sensitivity and specificity. Because MS Search offers no data-specific parameter training, the X-Rank algorithm was not trained for the different conditions. A second performance assessment, using a publicly available test mix, confirms the accuracy of X-Rank. Since this comparison involves publicly

(22) Ahrne, E.; Masselot, A.; Binz, P.-A.; Muller, M.; Lisacek, F. Proteomics 2009, 9, 1731–1736.


Figure 5. This figure shows in a three-column table the comparison of SmileMS and MS Search for a test mix. Columns one and two show two different MS Search scores. The first one is described as a probability score, while the second one is described as a matching score. The third column corresponds to X-Rank identification. Correct identifications are highlighted in green, according to the known test mix content. This table clearly shows the better behavior of X-Rank.

available data, it is reproducible by the scientific community. All results presented were obtained using parameters calibrated for QTRAP data. Data-specific parameter training of X-Rank is expected to provide a significant improvement of performance. X-Rank is integrated in SmileMS. This identification platform covers the needs of routine identification of small molecules. It is used in several laboratories on a daily basis. A broader description of the platform is available at http://www.genebio.com/smilems. The increased performance of library-search algorithms along with better wet-lab workflows will bring LC-MS/MS closer to routine identification of small molecules. However, the identification performance under cross-platform conditions needs to be assessed in further studies and subsequently adapted to real lab conditions. Since library-search techniques in proteomics are also attracting attention, the X-Rank algorithm could be adapted to the peculiarities of peptide data. Even though a library search is not the most commonly used peptide identification technique, it was shown that its combination with


conventional sequence-based search could increase the discovery rate.22 It would then be challenging to assess the performance of X-Rank in a similar test.

ACKNOWLEDGMENT

R.M. and Y.M. contributed equally to this work and should be considered co-first authors. The authors thank Marc Augsburger and Hicham Kharbouche for providing data and spending time testing X-Rank and SmileMS. They also thank Nasri Nahas and Catherine Zwahlen for their support as well as Patrick Brechbiehl for his assistance in organizing the computing environment.

SUPPORTING INFORMATION AVAILABLE

Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review May 4, 2009. Accepted July 27, 2009. AC900954D