
LETTER pubs.acs.org/ac

Mass Spectrometry and Informatics: Distribution of Molecules in the PubChem Database and General Requirements for Mass Accuracy in Surface Analysis

F. M. Green,* I. S. Gilmore, and M. P. Seah
National Physical Laboratory, Teddington, Middlesex, U.K.

ABSTRACT: Mass spectrometry is a powerful tool for the analysis and identification of substances across a broad range of technologies from proteomics and metabolomics through to surface analysis methods used for nanotechnology. A major challenge has been the development of automated methods to identify substances from the mass spectra. Public chemical databases have grown over 2 orders of magnitude in size over the past few years and have become a powerful tool in informatics approaches for identification. We analyze the popular PubChem database in terms of the population of substances with mass when resolved with typical mass spectrometer mass accuracies. We also characterize the average molecule in terms of the mass excess from nominal mass and the modal mass. It is shown, in agreement with other studies, that for the identification of unknowns a mass accuracy of around 1 ppm is required together with additional filtering using isotope patterns. This information is an essential part of a framework being developed for experimental library-free interpretation of complex molecule spectra in secondary ion mass spectrometry.

Received: January 10, 2011. Accepted: February 15, 2011. Published: April 01, 2011.
Published 2011 by the American Chemical Society.
dx.doi.org/10.1021/ac200067s | Anal. Chem. 2011, 83, 3239–3243

A grand challenge in mass spectrometry is the automated interpretation of spectra to identify the substances in an analyte from their chemical formula and molecular structure. This challenge spans almost five decades of research from the first innovative attempts in the DENDRAL project1 in the mid-sixties through to the present automated search engines and sequencing tools developed in response to the demands of high-throughput screening in proteomics.2,3 Metabolomics4 has similar needs owing to the complex chemical mixture under study. Mass spectrometry search engines and sequencing tools generally use the MS/MS (tandem) spectra to identify molecular structure. Electron impact spectra are also commonly used. These informatics methods may be categorized into three basic types: (1) direct searching with experimental data such as the National Institute of Standards and Technology (NIST) database,5,6 (2) comparison of experimental data with rule-based fragmentation using programs such as Mass Frontier7 and ACD Fragmentor,8 and (3) non-rule-based methods.9–11 An elegant recent method12 automatically assigns possible chemical compositions to each ion in an MS/MS spectrum and then calculates all possible fragmentation pathways with typical neutral losses, which are represented using graph theory (like the DENDRAL project). Fragmentation graphs may then be compared with those generated using rule-based methods or by expert interpretation.

The effectiveness of some of these methods has been the subject of recent studies and reviews.4,13,14 Briefly, experimental libraries are effective but are necessarily limited in the number of substances contained, and spectral reproducibility15 between different instrument designs can be problematic. Rule-based methods which incorporate well-established mass spectrometry rules16 as well as semiempirical rules have become increasingly effective.4 However, it has been shown13 that these are more robust when only simple rule sets are used. Non-rule-based methods have some advantages, and we now briefly discuss two methods.

Secondary ion mass spectrometry (SIMS) is a popular technique for the direct chemical analysis of surfaces using a primary ion probe and a mass spectrometer to analyze the emitted secondary ions. The chemistry at surfaces is often complicated, and the SIMS process is energetic, creating significant fragmentation that results in complex mass spectra that are hard to interpret. The G-SIMS method17 developed by the National Physical Laboratory (NPL) produces simpler spectra with ions that are simply or directly related to the analyte structure. These spectra have some similarities with MS/MS spectra.18 To aid analysts, a method18 was developed for simulating fragmentation pathways based on the popular Simplified Molecular Input Line Entry Specification (SMILES) molecular structure format.19 A computer program developed at NPL using MATLAB (The MathWorks, USA) simulates the fragmentation pathways by recursively breaking all bonds except bonds to hydrogen and aromatic rings. The simulated fragmentation pathways were then compared with experimental data for Irganox 1010,18 folic acid,18 valine, tyrosine, and a simple peptide valine–tyrosine–valine.20 It was found that approximately 90% of the G-SIMS fragmentation pathways could be explained. Subsequently, this method has been developed with a fragmentation database known as G-DB1 accessed directly from the Internet. This will be discussed in detail in a future publication; the key feature here is that molecules are entered into the database in the SMILES format, which may be taken directly from any of the freely available chemical databases such as PubChem,21 ChemSpider,22 KEGG,23 and LipidMaps.24

PubChem21 is a database of chemical molecules maintained by the National Center for Biotechnology Information (NCBI). It contains over 37 million compounds and over 71 million substances. ChemSpider22 is a chemistry search engine for aggregating and indexing chemical structures and their associated information into a single searchable repository. The database contains more than 20 million molecules. As demonstrated in ref 4, such databases are very powerful aids in the interpretation of mass spectral data.

Recently, a combinatorial fragmenter approach has been developed10 for metabolomics based on a “systematic bond disconnection method without a rule set” developed by Hill and Mortishire-Smith.9 This is similar in concept to the G-SIMS and SMILES method.18 Their system, called MetFrag,10 works by searching the PubChem,21 ChemSpider,22 and KEGG23 databases against the mass of the parent ion, taking into account the addition or loss of hydrogen depending on the charge state. Candidate molecules are then fragmented to the second level of the fragmentation tree. Bonds in ring systems are treated specially. Their algorithm takes into account neutral losses of H2O, HCN, NH3, CH2O, and HCOOH from fragments. Bond dissociation energies are also calculated. The results for each candidate are then scored against the key peaks and intensities in the MS/MS spectrum.
They show examples where MetFrag performs better than the commercially available Mass Frontier 4.0 software,7 which is often used as a benchmark.

It is clear that, in any method that searches a database, the number of results depends on the search tolerance. In the case of mass spectrometry, this is the mass accuracy. Bristow et al.25 show, in an intercomparison study on accurate mass measurement of a small molecule (475 u), that typical mass accuracies for FT-MS and magnetic sector instruments are ≤1 ppm and for time-of-flight instruments are between 5 and 10 ppm. Technological developments in time-of-flight systems have improved their accuracy to between 2 and 5 ppm.26 Unfortunately, in secondary ion mass spectrometry, the mass accuracy is rather poorer. In an interlaboratory study of 19 time-of-flight SIMS instruments, it was found that the accuracy of the mass scale calibration was on average 150 ppm at 647 u.27,28 There are many reasons for this, including the high energy spread of the secondary ions and the limited mass range of the calibration, both of which are discussed in detail elsewhere.28 With instrument optimization and following a recommended procedure, a mass accuracy of 30 ppm may be achieved.28

Kind et al.29 show from an analysis of all possible combinatorial molecules assembled from C, H, N, O, P, and S in the mass range of 20–500 u that a mass accuracy of 1 ppm is insufficient to uniquely identify the chemical formula above 300 u. They show that an orthogonal isotopic pattern filter is very effective so that, with a mass accuracy of 3 ppm and isotopic abundance of 2% accuracy, the number of possible chemical formulas is less than 5 up to 600 u. Böcker et al.30 develop a method known as SIRIUS30 for determining the chemical formula from the high-resolution isotope pattern in the mass spectrum of the molecule and show correct identification for 90% of molecules with masses up to 1000 u.
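The accuracies quoted above are relative (ppm), whereas a database search needs an absolute mass window. The following is a minimal sketch of that conversion for the instrument classes and reference masses mentioned in the text; the function names `ppm_to_u` and `mass_window` are illustrative and do not come from the paper.

```python
def ppm_to_u(mass_u: float, ppm: float) -> float:
    """Absolute mass tolerance (in u) for a relative accuracy given in ppm."""
    return mass_u * ppm * 1e-6


def mass_window(mass_u: float, ppm: float) -> tuple[float, float]:
    """Lower and upper mass limits (in u) for a search of +/- ppm about mass_u."""
    half_width = ppm_to_u(mass_u, ppm)
    return mass_u - half_width, mass_u + half_width


if __name__ == "__main__":
    # (label, mass in u, accuracy in ppm) as quoted in the text above.
    cases = [
        ("FT-MS / magnetic sector at 475 u", 475.0, 1.0),
        ("time-of-flight at 475 u", 475.0, 5.0),
        ("TOF-SIMS interlaboratory average at 647 u", 647.0, 150.0),
        ("TOF-SIMS optimized at 647 u", 647.0, 30.0),
    ]
    for label, mass, ppm in cases:
        print(f"{label}: +/- {ppm_to_u(mass, ppm):.6f} u")
```

For example, 150 ppm at 647 u corresponds to roughly ±0.097 u, about two orders of magnitude wider than the window available at 1 ppm.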
Since the work of Kind,29 public chemical databases have grown enormously and now give a better representation of the molecules relevant in general analyses. The subject of this short communication is to determine (i) how sensitive a search of a chemical database is to the mass accuracy, (ii) what the distribution of molecules with mass is in such databases, and (iii) for effective use of databases, what mass accuracy is required.

Figure 1. Number of substances contained in the PubChem database21 centered around folic acid (441.139681 u) (a) within a mass range of ±x ppm and (b) distribution of substances over a ±100 ppm range with mass intervals of ±0.000221 u. The position of folic acid is shown with a vertical green line.

ANALYSIS OF PUBCHEM DATABASE: GENERAL REQUIREMENTS FOR MASS ACCURACY

Public databases such as PubChem have become important with a large and rapidly growing community of users. This has naturally led to the development of many helpful and freely available software tools to access and search such databases. Here, we use the freely available PubChemSR31,32 tool, version 3.5.2, for searching the PubChem database. This was conducted on 30 and 31 December 2010 (74 561 764 substances). We simply require the number of substances in the PubChem substance database between two mass limits m1 and m2 using the “ExactMass” field. This is accomplished with a search command in the format “m1:m2 [ExactMass]”. The software has a very helpful facility to operate in a batch mode by loading in a text file with a list of such commands. In the following, we generate text file lists automatically using simple MATLAB programs (The MathWorks, USA).
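The batch-file generation described above can be sketched as follows. This is an illustration in Python rather than the authors' MATLAB programs, which are not reproduced in the paper; the function names, the six-decimal formatting, and the output file name are assumptions. Only the “m1:m2 [ExactMass]” command format is taken from the text.

```python
def exact_mass_query(center_u: float, ppm: float) -> str:
    """Build one search command of the form 'm1:m2 [ExactMass]'.

    The limits are center_u -/+ ppm; six decimal places is an assumed precision.
    """
    half_width = center_u * ppm * 1e-6
    return f"{center_u - half_width:.6f}:{center_u + half_width:.6f} [ExactMass]"


def batch_queries(center_u: float, ppm_values: list[float]) -> list[str]:
    """One search command per tolerance, all centered on the same mass."""
    return [exact_mass_query(center_u, ppm) for ppm in ppm_values]


def write_batch_file(path: str, queries: list[str]) -> None:
    """Write one command per line, as a plain text file for batch loading."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(queries) + "\n")


if __name__ == "__main__":
    folic_acid_u = 441.139681  # calculated mass quoted in the Figure 1 caption
    queries = batch_queries(folic_acid_u, [1, 10, 30, 100])
    write_batch_file("folic_acid_queries.txt", queries)  # hypothetical file name
    print("\n".join(queries))
```

Each line of the resulting file is one mass-window search; counting the hits per line gives the kind of data plotted in Figure 1a.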


Figure 2. Distribution of the number of substances across the mass scale at low resolution (±0.01 u) with Gaussian fits (a) at 200 u and (b) at 600 u, (c) the average molecule excess mass from the nominal mass (rounded down) across the mass scale, and (d) the FWHM of the Gaussian fits.

Figure 1a shows the number of substances centered on the calculated mass of folic acid, 441.139681 u, within a mass range of ±x ppm. There are two key features of this plot. First, a high number of substances is found. For example, in time-of-flight (TOF) SIMS with a good mass scale calibration accuracy of 30 ppm,28 there are 31 061 possible substances (this reduces a little to 26 141 substances if only the most common biological elements are included: C, H, N, O, P, and S). Second, the number of substances depends approximately linearly on the search mass range, i.e., the mass scale calibration accuracy. It is clear then that, as the mass scale calibration accuracy improves, the number of possible molecules reduces linearly. In the case of folic acid, even at 1 ppm, there are still almost a thousand possible substances. This is consistent with the analysis of all the combinatorially possible structures by Kind et al.29 They showed that additional filtering by isotopic patterns removes 95% of false candidates.29 With isotopic filtering, it becomes reasonable to identify the substance from a few candidates.

The relationship is not quite a simple linear one but has discrete discontinuities. This is because the mass distribution in the database has fine scale structure, as is seen in Figure 1b. The sharp changes in the number of substances at approximately 10 and 75 ppm relate to two intense peaks in the distribution. The fine scale structure does cause experimental issues. The mass resolution can then become a more significant issue than the mass accuracy. A further issue in ion trap design instruments could be the effect of space charge on the mass accuracy for ions that are close in mass, but this has not been found to be a significant issue in Fourier transform spectrometers.33

How representative of the database is folic acid? To evaluate this, we search the PubChem database at 50 u mass intervals between 50 and 1000 u. Since it is clear that there is structure in the data, we first conduct a coarse scan over 1 u for each of these masses with 0.02 u mass intervals to see how molecules populate this mass range. Two examples are shown in Figure 2a,b for 200 and 600 u, respectively. We find that a Gaussian function gives an excellent fit to the data for masses up to 750 u; the data for higher masses have not been studied closely. The plots for other masses follow a similar trend. The difference between the mass of the average molecule, $M_{\mathrm{av}}$, and the nominal mass (rounded down), $M_{\mathrm{nom}}$, progressively shifts to higher mass as shown in Figure 2c. The mass of the average molecule, $M_{\mathrm{av}}$, is approximately linearly related to $M_{\mathrm{nom}}$, where

$$M_{\mathrm{av}} = M_{\mathrm{nom}}(1 + 0.00037) \qquad (1)$$

which is approximately $\mathrm{C}_n\mathrm{H}_{0.6n}$. The full width at half-maximum, $M_{\mathrm{FWHM}}$, of the distribution increases approximately linearly, as shown in Figure 2d, given by

$$M_{\mathrm{FWHM}} = 0.00032\,M_{\mathrm{nom}} \qquad (2)$$
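As a quick numerical check of eqs 1 and 2, the sketch below evaluates the average-molecule mass and distribution width at a given nominal mass. It also confirms that a composition with 0.6 hydrogen atoms per carbon has a relative mass excess close to 0.00037. The function names are illustrative; the monoisotopic 1H mass is a standard value, and 12C is 12 u by definition.

```python
# Monoisotopic masses in u. 12C is exactly 12 by definition.
C_MASS = 12.0
H_MASS = 1.00782503

REL_EXCESS = 0.00037  # coefficient in eq 1
REL_FWHM = 0.00032    # coefficient in eq 2


def average_mass(m_nom: float) -> float:
    """Eq 1: mass of the average molecule at nominal mass m_nom (in u)."""
    return m_nom * (1.0 + REL_EXCESS)


def distribution_fwhm(m_nom: float) -> float:
    """Eq 2: full width at half-maximum of the mass distribution (in u)."""
    return REL_FWHM * m_nom


def relative_excess_cnh(h_per_c: float) -> float:
    """Relative mass excess (exact/nominal - 1) for a C_n H_(h_per_c * n) unit."""
    exact = C_MASS + h_per_c * H_MASS
    nominal = 12.0 + h_per_c * 1.0
    return exact / nominal - 1.0


if __name__ == "__main__":
    for m_nom in (200, 400, 600):
        print(m_nom, round(average_mass(m_nom), 4), round(distribution_fwhm(m_nom), 4))
    print(f"C_n H_0.6n relative excess: {relative_excess_cnh(0.6):.6f}")
```

At a nominal mass of 400 u, for instance, eq 1 places the average molecule at 400.148 u and eq 2 gives a distribution width of 0.128 u.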

With this average molecule data, we may conduct a refined search, centered where most molecules lie at each 50 u mass interval, with 1, 10, and 100 ppm resolution. Since, at high resolution, there is significant structure in the data, we sample, for each mass, 50 mass positions at intervals x, located evenly over the $M_{\mathrm{FWHM}}$ of the average molecule distribution. This significantly reduces random noise.

TOF-TOF design shows significant promise for improving the mass accuracy toward this goal.

Figure 3. Average number of molecules in the PubChem database within ±1, 10, and 100 ppm across the mass scale. At each mass position, the average is calculated from 50 separate searches centered at the average molecule mass position and spaced evenly within the full width at half-maximum of the distribution for the average molecule (see Figure 2c,d).

In Figure 3, we show the average number of molecules in the PubChem database between 50 and 1000 u representative of mass accuracies of 1, 10, and 100 ppm. Typical scatter factors are ×/÷ 1.030 but are higher at low mass (1.80 at 50 u and 1.24 at 100 u). Fitted functions are also shown, composed of the sum of Gaussian and Lorentzian functions centered at 385 u, which are weighted down at low mass using an error function. The same function fits the data at each of the three mass accuracies (1, 10, and 100 ppm) with scaling factors of exactly 10. Since the function is arbitrary, we do not give further details here, but the linearity of Figure 1 is upheld for all of the masses studied. This shows that the most abundant molecule likely to be encountered has a mass of 385 u (assuming that PubChem is a fair representation of molecules typically encountered), similar to the low resolution (50 u) results in ref 4. Similar results have been obtained through searching the ChemSpider database (not shown). Thus, folic acid is a good representative choice. The typical mass accuracies of FTICR and magnetic sector, Orbitrap, and TOF mass spectrometers are approximately ≤1,25,34 3,35 and 2–5 ppm,26 respectively, and that for TOF-SIMS of unknown molecules is rather poorer at 150 ppm at 647 u28 but can be improved to 30 ppm with optimization and careful procedures.28 It is very clear that, for effective use of large databases such as PubChem, mass accuracies of the order of 1 ppm are required together with the use of isotope pattern filtering29,30 to reduce the number of candidate molecules to a manageable number. For molecules with a mass >600 u or