Automated Strategies To Identify Compounds on ... - ACS Publications

Jan 12, 2011 - saving the output into a table and figures containing the esti- mated data. All scripts mentioned are included in the Supporting Inform...
0 downloads 0 Views 1MB Size
ARTICLE pubs.acs.org/ac

Automated Strategies To Identify Compounds on the Basis of GC/EI-MS and Calculated Properties Emma L. Schymanski,†,* Markus Meringer,‡ and Werner Brack† †

Department of Effect-Directed Analysis, UFZ;Helmholtz Centre for Environmental Research, Permoser Strasse 15, D-04103 Leipzig, Germany ‡ Remote Sensing Technology Institute, DLR;German Aerospace Centre, M€unchner Strasse 20, D-82234 Oberpfaffenhofen-Wessling, Germany

bS Supporting Information ABSTRACT: The identification of unknown compounds based on GC/ EI-MS spectrum and structure generation techniques has been improved by combining a number of strategies into a programmed sequence. The program MOLGEN-MS is used to determine the molecular formula and incorporate substructural information to generate all structures matching the mass spectral information. Mass spectral fragments are then predicted for each structure and compared with the experimental spectrum using a match value. Additional data are then calculated automatically for each candidate to allow exclusion of candidates that did not match other analytical information. The effectiveness of these “exclusion criteria”, as well as the programming sequence, was tested using a case study of 29 isomers of formula C12H10O2. The default classifier precision resulted in the generation of too many structures in some cases, which was improved by up to several orders of magnitude by including additional classifiers or restrictions. Combining this with the exclusion of candidates based on a Lee retention index/boiling point correlation, octanol-water partitioning coefficients, steric energies, and finally spectral match values limited the number of candidate structures further from over 1 billion without any restrictions down to less than 6 structures in 10 cases and below 35 in all but 3 cases. This method can be used in the absence of matching database spectra and brings unknown identification based on MS interpretation and structure generation techniques a step closer to practical reality.

T

he identification of unknown compounds remains one of the drivers behind analytical chemistry. Continuous improvements to instrumental capabilities mean that more and more compounds can be separated and detected at lower concentrations than ever before. As a result, many new compounds can now be identified and assessed in environmental samples or other similarly complex matrixes. The complexity of the sample often excludes the use of structurally information-rich techniques such as nuclear magnetic resonance (NMR) that require high sample purity and relatively high concentrations, such that chromatographic systems coupled with mass spectrometry (MS) are often the only systems available for suitable separation and detection of sample components.1 As the identification of each unknown can be very time-consuming, streamlining of the identification process is needed once unknowns of interest are found. The challenge in environmental analysis based on MS, therefore, is to retrieve as much information as possible from the separation and analysis of samples. As an example, reversedphase high-performance liquid chromatography (RP-HPLC) is often used to fractionate samples, and the partitioning behavior of compounds within the sample can then be determined using standards.2 Gas chromatography coupled with mass spectrometry (GC/ MS), specifically using electron impact (EI) ionization, is a common starting point for environmental investigations for a number of reasons.3 The mass spectra are generally reproducible, r 2011 American Chemical Society

with fragmentation patterns providing structural information that can be used for identification. This has led to the development of a number of quite comprehensive databases (e.g., NIST4 and Wiley,5 containing over 670 000 unique spectra in the latest versions combined) with well-developed search functions to allow the quick identification and/or comparison of sample and database spectra. In addition, retention indexes based on either alkanes (Kovat’s retention index, KRI) or polycyclic aromatic hydrocarbons (PAHs; Lee retention index, LRI) are used for confirmation based on the retention index of the standard compound compared with the measured compound.6 The QPID database developed as part of the EU Project MODELKEY used, for example, criteria of 60% probability for the NIST database search and (20 for KRI match to identify compounds on the basis of mass spectral match.7 The increasing popularity of high-resolution, high-accuracy mass spectrometry, especially tandem or multistage MS, MSn, where only limited spectral databases are available, has resulted in the evolution of alternative identification strategies based on online compound database searching. While the online compound databases contain many more entries than the GC/MS databases mentioned above (e.g., PubChem8 and ChemSpider9 Received: September 30, 2010 Accepted: December 20, 2010 Published: January 12, 2011 903

dx.doi.org/10.1021/ac102574h | Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

Figure 1. Information flow used to identify compounds on the basis of GC/EI-MS and structure generation. Once the molecular formula is determined, substructures identified from the spectrum (via classifiers in MOLGEN-MS, NIST) are used to limit the number of candidate structures, using either the default value (95%) or additional classifiers with lower probability (94% in this case). Mass spectral fragments are then predicted for each structure and compared with the experimental spectrum using a match value (defined by Kerber et al.16). Additional properties can then be calculated and used to exclude structures from consideration. These steps have been automated and tested as part of this study. Spectral data taken from NIST,4 spectrum number 232580.

by the developers17 and shown to identify the correct structure, on average, within the top 27% of all possible structures (without the use of substructural information). Two other programs available to predict mass spectral fragments, Mass Frontier18 and ACD MS Fragmenter (part of the MS Manager Suite19), were tested, along with MOLGEN-MSF20 (an advanced, standalone implementation of the spectral match feature of MOLGEN-MS) in another study. This showed that the simplest fragmentation settings were the most selective and that the best results had the correct structure on average, as previously, only in the top 27% of all possible structures using Mass Frontier with three-step fragmentation and MOLGEN-MSF.16 This means that the prediction of mass spectra is not sufficient alone to identify the correct structure during unknown identification. As a result, the focus needs to shift to the calculation of other properties to assist in the selection of the correct or most likely candidates and to exclude the less likely candidates. Previous work has shown that the match value and the octanol-water partitioning coefficient (log Kow) could be used in some cases to restrict the number of candidates by an order of magnitude.12 Eckel and Kind21 correlated the boiling point and Lee retention index of 370 compounds and found that 95% of those compounds had boiling points within the range of (LRI - 10) to (LRI þ 50) °C. Compounds with measured boiling points outside this range could be excluded from consideration, or predicted boiling point values could be used if the error margin associated with the prediction was also taken into account. 21 The Ph.D. thesis of T. Kind 22 (p 47) also contained a brief exploration of the use of steric energy based on the MM2 algorithm23 to exclude energetically unfavorable PAH structures from consideration.

with approximately 60 million entries each), these entries are not associated with spectra. Recent reviews including identification strategies based on these methods include Kind and Fiehn10 and Krauss et al.11 The combination of substructural information and structure generation provides an alternative to database searching, especially applicable where no database entries are available for the compounds of interest,12 but also useful to provide information in a weight-of-evidence approach.1 MS fragmentation patterns are used to retrieve substructural information from the spectrum for input during structure generation. A number of substructural “classifiers” have been developed over the past decades4,13,14 such that a probability can be associated with the presence or absence of the substructure for a given spectrum. The calculation of probabilities associated with the NIST substructural information4 is still dependent on spectra within the database, while the classifiers developed by Varmuza, Werther, and co-workers13,14 assign probabilities based on the spectrum and information stored within the classifiers, without being coupled to a database for prediction. MOLGEN-MS15 can then be used to determine the molecular formula of the compound, select substructures present and absent by adjusting the probability (precision) level, and then generate all possible structures matching this input information. This information flow is demonstrated in Figure 1. Previous studies have shown that a default value of 95% precision is generally a good starting point for structure generation12,13 and that inclusion of the NIST classifiers improved the results of structure generation dramatically.12 In the final step of MOLGEN-MS, virtual fragment ions predicted for each structure are then compared with the experimental spectrum to calculate a match value. This was evaluated 904

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

criterion (see below) was used for those compounds where the measured LRI was available, excluding those compounds with estimated boiling points outside the range of (LRI - 31) to (LRI þ 71) °C. Following this, the predicted octanol-water partitioning coefficient was used to eliminate compounds outside the range of estimated log Kow ( 1. The calculated steric energy was then used to exclude all compounds with ChemBio3D energies above 213.24 kcal/mol and MOLGEN-QSPR energies above 429.0 kcal/mol. Finally, the spectral match value was used to exclude remaining candidates with match values distinctly below those of the other candidates. As this is a very example-specific criterion, it is best applied following the other criteria rather than before. The derivation of the values used for the exclusion criteria is covered in the Results. Automatic Spectral Interpretation and Data Processing. Spectral processing using MOLGEN-MS, demonstrated pictorially in Figure 1, is described in greater detail elsewhere.12 This method was followed here, with the exception that the NIST substructural information is now included automatically. The MSP spectral files saved from NIST were converted into CSV format for import into MOLGEN-MS using a Matlab24 script. The NIST substructural information was also processed using a script with optional user input to include additional substructures and saved to be automatically uploaded at the Molin stage of MOLGEN-MS. Then the substructure classifier information from NIST and MOLGEN-MS was used to limit the elements present and/or absent on the basis of the spectrum to enable calculation of the molecular formulas. These were selected on the basis of the deviation of the calculated isotope pattern from the experimental pattern. The ring and double bond count (RDB) was incorporated into formula selection using a Matlab script.12 Following calculation and selection of the desired formula, MOLGEN-MS automatically filters the substructural classifiers from both programs to ensure compatibility with the formula. Either the substructure classifier probability was left at the default 95% or additional classifiers of lower precision were considered, depending on the case (see above). The structures were then generated and fragmented within MOLGEN-MS for the calculation of match values16,17 to assess how well each structure matches the experimental spectrum. Once the spectral interpretation with MOLGEN-MS was complete, the input and output files were listed in a summary file. A Matlab script read this summary file and generated the additional data mentioned above to assist in selection of the most likely candidates. The script coordinates file conversions (e.g., from SDF25 into SMILES) and inputs into the EPI Suite26 programs MPBPWin (boiling point and melting point calculation) and Kowwin (octanol-water partitioning coefficient or log Kow calculation), as well as ChemBio3D (steric energy calculation), saving the output into a table and figures containing the estimated data. All scripts mentioned are included in the Supporting Information. Retention Index Calculation. A total of 29 different compounds with the formula C12H10O2 were found in the NIST database,4 shown in Table S-1, Supporting Information, along with CAS numbers and NIST spectrum numbers. Of those 29, 14 are compounds with different functionalities (numbers 1-14, Table S-1) and the remaining 15 are positional isomers of the 14. Estimated Kovat’s retention index (KRI) data for each compound were retrieved from the NIST database. The estimated values are calculated on the basis of group contributions and

Figure 2. Strategy for the identification of unknown compounds on the basis of GC/EI-MS spectra, structure generation, and progressive exclusion of candidates using calculated data, including the exclusion criteria. Derivation of the criteria is covered in the Results.

This study builds on the previous work12 with the incorporation of the NIST substructural information into MOLGEN-MS and automatic calculation of the log Kow, boiling point, and steric energy for all candidates following structure generation. The use of the boiling point-LRI correlation and the steric energy of a compound as potential exclusion criteria were tested using a case study based on 29 unique NIST spectra of formula C12H10O2. These spectra include a number of potentially environmentally relevant compounds with different functional groups as well as several different positional isomers. This was also used to assess the ability of the overall method to reduce the number of structures generated down to “manageable levels” for identification purposes.

’ METHODS Overall Strategy: Structure Generation and Progressive Elimination. The overall method used for the processing of

GC/EI-MS spectra to identify the compound measured using structure generation and progressive exclusion of candidates is shown schematically in Figure 2 and explained in the text below. Figure 2 was used to define the order of material within the Methods, Results, and Discussion of Structure Elimination Strategies sections. To start, two cases were considered for structure generation. First, substructure classifiers were selected according to the default settings of 95% precision (top of Figure 2). As this was not always sufficient to reduce the number of candidates generated within the program limits (here set at 20 000 structures), a second case was introduced. This case, termed “modified classifier selection”, included either additional classifiers (e.g., those with less than 95% precision) or, where no classifiers with reasonable precision were available, alternative restrictions to limit the number of structures generated. This forms the second part of the “Structure Generation” group in Figure 2. Following structure generation and data calculation, the exclusion criteria were applied in the order shown in Figure 2. First, a boiling point-Lee retention index (BP-LRI) correlation 905

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry compound class27 and are reported with 50% and 95% confidence intervals. Of the 29 compounds, 19 were commercially available to generate experimental retention index data, marked in Table S-1, Supporting Information. The compounds were obtained from ABCR (structures 1 and 24), Aldrich (11, 22), Alfa Aesar (2, 7, 21, 23, 25, 27, 29), Applichem (3), Fluka (28), Chembridge (5), Key Organics Ltd. (20), Merck (26), MP Biochemicals (13, 15), and Sigma (19). Experimental Kovat’s and Lee retention indexes were calculated for these compounds according to Rostad and Pereira.6 GC/MS (model 6890 N, detector MSD 5973, Agilent Technologies, Waldbronn, Germany) analysis was performed using an HP-5MS capillary column (30 m  0.25 mm i.d., 0.25 μm film, 5% phenylmethylsiloxane, Agilent Technologies) and temperature program of 70 °C (held for 4 min), 3 °C/min to 300 °C (held for 20 min). The calibration mix C8-C36 from Supelco (46827U) was used to calculate the KRI, whereas the standard EPA-PAH mix from Ermsdorfer was used to record PAHs to calculate the LRI. Measured LRIs were used to generate an expected boiling point range for the correct structure according to Eckel and Kind.21 As the boiling point of compounds can be predicted for generated structures using EPISuite,26 this range can be used to eliminate structures with calculated boiling points outside the calculated range from consideration. The error associated with the boiling point prediction,21 given as 20.4 K,28 was added to the original range of (LRI - 10) to (LRI þ 50) °C to account for prediction error. The LRI measurement error of 0.53 (see below) was also included and rounded up to give the overall expected boiling point range of (LRI - 31) to (LRI þ 71) °C, as shown in Figure 2. Partitioning Coefficient (log Kow). The estimated octanolwater partitioning coefficient for structure candidates was calculated using Kowwin from EPISuite. In line with the previous study,12 an error margin of (1 was considered, which contains 96.5% of all predicted values compared with the experimental data used to build Kowwin.26 Steric Energy Calculation. The use of the steric energy of a molecule to exclude energetically unfavorable molecules from consideration was tested using two programs based on a heat of formation calculation. Two independent programs, ChemBio3D from CambridgeSoft29 and MOLGEN-QSPR,30 were used to calculate the steric energy. ChemBio3D is based on the MM2 algorithm23 and calculates an enthalpy of formation (kcal/mol), while MOLGEN-QSPR uses force field mechanics to calculate the steric energy31 (also kcal/mol). A batch function was programmed for ChemBio3D using Python32 and ChemScript29 which exported the SMILES code and energy. This batch function requires a structure description file (SDF file)25 with 2D coordinates already defined. In contrast, MOLGEN-QSPR works with all SDF files (with and without 2D placement). Following SDF import and the explicit addition of hydrogen atoms, 3D optimization was undertaken (using five iterations) and a molecular descriptor representing the steric energy was calculated, with the results exported as structure number and energy. As the calculations started from random placements, resulting in occasional anomalies where no chemically relevant energy minimum was found, these calculations were repeated three times, with the minimum of the three runs taken as the final result. To have a baseline interpretation of the energy values of both programs, energy values were calculated for a random set of 1000

ARTICLE

molecules used previously16,17 to assess the energy distribution of molecules known to exist in reality. The resulting energy quantiles can be used to include structures likely to exist (in terms of energy) with a given probability. Match Value Comparison. A match value comparison was conducted for the C12H10O2 isomers following the method described in a previous study.16 As the simplest program settings were identified as the most effective in terms of candidate separation, only these have been reported here. The calculations were performed with MOLGEN-MSF20 using the default reaction settings, as well as Mass Frontier18 and ACD Fragmenter,19 both with three-step fragmentation, on each of the 29 molecules and the 29 spectra. More details about the settings used are given by Schymanski et al.16

’ RESULTS This section, as with the Methods, is structured on the basis of the information flow shown in Figure 2 and covers first the use of substructure classifiers and then each exclusion criterion in the order applied followed by the assessment of the combined method both overall and on specific examples. Substructure Classifier Assessment. As mentioned above, two cases were considered for the inclusion of substructure information (classifiers). As well as using the default 95% precision, a modified classifier selection (i.e., including those associated with a probability of less than 95%, removing clashing classifiers, or adding additional restrictions) was considered in this example to assess the benefits of restricting structure numbers versus lowering the classifier probability. The results, showing the substructural classifiers used, number of structures calculated, calculation time, match value (MV) range, and MV and rank of the correct structure, are given in the Supporting Information, Table S-2. Where multiple runs were needed, the modified classifier selection is also included. The results using 95% classifier selection (i.e., no additional user input) and modified classifier selection form the first two columns for each structure in Figure 3. These are also shown individually in the Supporting Information, Figure S-1. The modified classifier selection improves the results significantly, with all but 5 of the 29 structures down to less than 100 possible candidates (from a starting point of >1 500 000 000; see Figure 1), compared with 9 on the basis of the default 95% classifiers. Furthermore, only 1 compound no longer has the correct structure present in the structure space (structure 12), compared with 5 for the 95% classifiers. Furthermore, the modified classifier selection already reduces the possible structures for 19 of 29 compounds down to a group of positional isomers (e.g., compounds 1-3, 7, 8, 11, 15-25, and 27-29). However, for at least 5 of the remaining 10 compounds, several thousand structures are still possible, necessitating the exclusion criteria to restrict the candidate numbers further. Retention Index Comparison. The measured KRI and LRI for the purchased substances are given in Table 1, along with the predicted KRI taken from NIST and the boiling point calculated using EPISuite. A total of 4 of the 19 compounds purchased (structures 2, 22, 23, and 26) could not be detected with the GC/MS method used in this study, as they were too polar and are thus not included in the table. The errors in the indexes were calculated using the standards, i.e., KRI values were calculated for the PAHs and LRI values for the alkanes, to determine the variation between the RI values. 906

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

Figure 3. Number of structures remaining in consideration after structure generation using classifiers with 95% or more precision (black columns), modified classifier selection (striped columns), and the modified classifiers followed by stepwise application of the exclusion criteria described in the text and Figure 2 (gray columns). Key: x, the correct structure is absent; >, the calculation limit of 20 000 was exceeded. The exclusion step was performed on the 95% classifier runs for structures 10 and 14, as removal of incorrect classifiers led to generation beyond the 20 000 limit; as a result the correct structure is absent.

Table 1. Measured KRI and LRI for Purchased Compounds of Formula C12H10O2, as Well as the Predicted KRI from NIST, the BP Range Calculated from the LRI, and the Predicted BP and log Kow from EPISuitea structure number

a

measured KRI

measured LRI

predicted KRI (NIST)

predicted BP range (°C)

predicted BP (°C)

predicted log Kow

20

1568.9

266.8

1664 ( 382

236.4-337.2

315

2.73

3

1575.8

267.9

1611 ( 201

237.5-338.3

302

2.77

15

1602.0

272.4

1611 ( 201

242.0-342.8

302

2.77

24 7

1625.5 1647.8

275.6 279.2

1611 ( 201 1611 ( 201

245.2-346.0 248.8-349.6

302 302

3.00 3.00

1

1690.5

286.2

1801 ( 382

255.8-356.6

338

3.15

19

1696.9

287.4

1801 ( 382

257.0-357.8

338

3.15

21

1729.2

291.9

1808 ( 301

261.5-362.3

354

2.80

28

1732.1

292.9

1808 ( 301

262.5-363.3

354

2.80

2

1775.6

298.4

1664 ( 382

268.0-368.8

315

3.57

25

1814.2

305.2

1722 ( 382

274.8-375.6

320

2.97

27 11

1830.7 1836.5

307.5 308.6

1664 ( 382 1722 ( 382

277.1-377.9 278.2-379.0

315 320

3.35 2.97

29

1930.1

323.8

1808 ( 301

293.4-394.2

354

2.80

5

2095.8

350.9

1597 ( 382

320.5-421.3

302

2.45

The structure numbers correspond with those in Table S-1, Supporting Information. Values are sorted from lowest to highest retention index. RIs were calculated according to Rostad and Pereira.6 The worst case error margins (see the text) are (3.4 (KRI) and (0.53 (LRI).

Determination of the indexes over 25 measurements resulted in an LRI standard deviation between 0.032 for dodecane and 0.528 for octadecane (total of 7 compounds). For the KRI, the standard deviation ranged from 0.115 for naphthalene to 3.355 for fluorene (total of 10 compounds). The KRI errors are well within the error window of (20 units used, for example, to confirm compound identity within MODELKEY.7 As the LRI values are used in terms of candidate exclusion, the worst case errors of 0.53 (LRI) and 3.4 (KRI) were taken as the error margins of the measurements. Combining the errors associated with the LRI (0.53) and the boiling point prediction using EPISuite MPBPWin (20.4 K) leads to a cumulative error of (21 K for the LRI-BP correlation, such that the relationship becomes BP range = (LRI - 31) (LRI þ 71) °C. This predicted range is also shown in Table 1. All compounds in Table 1 are distinguishable from one another within the measurement error margins of 0.53 (LRI) and 3.4 (KRI). Several pairs of compounds elute closely (e.g., 3 and

20, 1 and 19, 21 and 28, 11 and 27) such that these are within the (20 window for KRI (no such window for LRI is reported to the best of our knowledge). Of these pairs, some are distinguishable by the mass spectrum (3 and 20, 11 and 27), while others are not (e.g., 1 and 19, where differences in the spectra are of the same magnitude as differences between the replicate spectra within NIST), which would mean these compounds would need to be separated on alternative chromatographic systems to ensure correct separation and identification of the isomers. The range of the measured values for all isomers was 520 for KRI and 83 for the LRI. The prediction error margins for retention indexes are of similar magnitude, between 201 and 382 for KRI and 102 for the LRI. This confirms previous conclusions that general prediction of RIs is still insufficient to identify single structures but may be useful as a filter.10 The results also confirm the weakness of prediction based on group contributions, discussed by Stein et al.,27 where all isomers have the same predicted values despite large discrepancies in reality 907

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

generation, as these are energetically less likely to exist in reality. The steep increase in the energies for the last 10% of molecules shown in Figure 4 was not entirely expected given that the 1000 molecules are known to exist with a spectrum measured in the NIST database. The energies of the last 10% cover a much wider range than those of the other 90%, especially for ChemBio3D. The energy values calculated using the two programs for the C12H10O2 isomers are shown in Figure S-2, Supporting Information. As a form of quality control of the ChemBio3D batch processing, these 29 values were also calculated manually via the user interface, which does not require predefinition of the 2D coordinates. The values are all well below the 90% quantiles in Figure 4 and Table S-3 (Supporting Information), except structure 26 for the ChemBio3D calculation. Otherwise the programs show a relatively similar distribution of energies, although the energy values are very different, in accordance with Figure 4. The user interface and batch values for ChemBio3D show slight differences (larger for structure 26) as the calculations start from a different 2D placement. The ChemBio3D batch runs (repeated three times) were almost identical as the molecules were optimized from the same 2D placement. The results shown in Figure S-2 (Supporting Information) confirm that the choice of the 90% cutoff for energy is appropriate. Even the clear outlier, structure 26, is within the 90% energy limit of 213.24 kcal/mol for ChemBio3D, while the MOLGENQSPR values are well below the 90% cutoff of 429.0 kcal/mol. Using the 80% cutoff would exclude structure 26 for ChemBio3D, while a less conservative value, e.g., 95% of the values, would result in the inclusion of too many energetically unlikely structures. Match Value Comparison. The results of the match value comparison using ACD MS Manager, MOLGEN-MSF, and Mass Frontier to calculate the predicted spectra for each of the 29 C12H10O2 isomers using the simplest fragmentation settings are summarized here; the values are given in Table S-4, Supporting Information. In contrast with the previous study,16 which indicated that the results of MSF and Mass Frontier were generally comparable and those of ACD generally slightly worse, for this example, Mass Frontier was clearly the better. This is shown using the relative ranking position (RRP16,17), which ranks candidates between 0 (best match) and 1 (worst match) taking into account candidates with equal match values, such that if all candidates have the same match value, RRP = 0.5. For this study, Mass Frontier had a much lower average RRP (0.216) compared with ACD (0.303) and MSF (0.366). In terms of the 29 isomers, the correct molecule is ranked 7th, 9th, and 11th compared with the other 28 isomers on average for Mass Frontier, ACD, and MSF calculations, respectively. The poor spectral prediction performance of MSF (and hence MOLGEN-MS) for some isomers can be seen quickly, for example, in structures 3, 5, 6, and 15, where the match values are significantly lower than for Mass Frontier or ACD. Structures 4 and 14 have relatively low values for MSF and Mass Frontier, while structure 25 has low values for all programs. For structures 3 and 15 (1- and 2-naphthyl acetate, respectively), this was due in part to one main fragment that was missing within MOLGEN-MS, m/z = 144, which was the base peak of both spectra. Inclusion of this reaction in MOLGEN-MS would improve the results for these compounds considerably, e.g., increasing the MV from 1.8% to 49% for structure 3.

Figure 4. Steric energy distribution of 1000 randomly selected molecules, kcal/mol. The data are given in Table S-3, Supporting Information. Black tilted squares represent ChemBio3D results using the batch function and gray triangles MOLGEN-QSPR results.

(e.g., structures 21 and 29). This is one of the largest sources of error in the KRI prediction based on group contribution27 and also affects the LRI prediction (see Table 1). Furthermore, despite the very large errors associated with the predicted KRI reported in NIST, the measured KRI of one compound (structure 5, 2-methyl-6-phenyl-4H-pyran-4-one) is well outside the 95% confidence interval of the predicted value (2095.8 compared with 1597 þ 382 = 1979). The same structure is also outside the error margins used for the predicted LRI. This shows the lack of applicability of these predicted values outside the compound domain used to form the predictions, even with such large error margins. As such, these criteria may be of limited value in structure exclusion in some cases. More accurate prediction algorithms for smaller subsets of compounds exist, reviewed by Kind and Fiehn10 (see also references therein). These could be used on a case-by-case basis to improve candidate selection using retention prediction if sufficient information about the compound is available (e.g., compound class, functional groups). This information is, however, often not available during unknown identification, especially during environmental analysis where a lack of sample often prevents additional analyses that may yield functional group information. Furthermore, compounds with different combinations of functional groups can often occur when using structure generation techniques for unknown identification, unless sufficient substructure classifiers are available to cover all functional groups present in the molecules. Partitioning Coefficient (log Kow). The predicted log Kow for each of the measured structures is also given in Table 1. A brief look at these values shows that all but two structures are within the (1 range used for exclusion, with structures 2 and 5 (log Kow = 3.57 and 2.45) the only structures that could be separated on the basis of the predicted log Kow. Despite this, the results presented below show that this is still a useful criterion to exclude several structures from consideration. Steric Energy Calculation. The distribution of the steric energy calculated for the 1000 randomly selected structures is shown in Figure 4 and given in Table S-3 (Supporting Information) for ChemBio3D and MOLGEN-QSPR. A total of 90% of the molecules had a steric energy below 231.24 kcal/mol (ChemBio3D) and 429.0 kcal/mol (MOLGEN-QSPR). These values can be used to exclude candidates with higher energies following structure 908

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

Table 2. Number of Structures Remaining in Consideration Using MOLGEN-MS Default Settings (>95% Precision Substructure Classifiers) versus Additional Classifiers or Restrictions (Columns) in Structure Generation, Followed by Sequential Exclusion Based on Calculated Data (Rows)a

a

The first column pair is structure 9 and the second column pair structure 15. The criteria used are summarized in Figure 2. Notes indicated by superscript numbers: 1, additional criterion “5 atoms or more in cycles only”; 2, inclusion of additional classifier CdOO (94% precision); 3, N/A indicates this criterion was not applied (standard compound not available to measure LRI).

Despite the results here showing that spectral prediction within MOLGEN-MS was not the best performer for this example, the results of the previous study on more examples show that MOLGEN-MS is generally on par with Mass Frontier.16 As the calculation is relatively fast and inbuilt in MOLGEN-MS, the MOLGEN-MS match values were used further in the overall work flow, rather than those of Mass Frontier. An option for final candidate selection would be to perform additional calculations using Mass Frontier to complement the MOLGEN-MS match values, once less likely candidates have been excluded to speed up the Mass Frontier calculation times. The ability of all three programs to identify the “correct” structure on the basis of the predicted mass spectrum is poor compared with the ability of the NIST database to match the experimental spectrum to the database spectra. All programs only ranked 3-4 of the 29 structures correctly in the first place on the basis of the predicted spectrum, and on average the correct structure was ranked between 7th and 11th of 29 structures. In contrast, the spectral match function within the NIST database identified the correct spectrum using the library search in all 29 cases with a very high probability (range 80.8-98.2%; see Table S-2, Supporting Information). This confirms the conclusions from previous studies16,17 that show that spectral match based on predicted fragments alone is insufficient to identify candidate structures,

even for compounds of the same formula but with very different mass spectra. Combining Substructural Classifiers and Exclusion Criteria. The structures generated using the classifiers from the two cases above (default and modified classifier selection) were processed using the exclusion criteria shown in Figure 2. The results of these two cases without exclusion plus the application of the exclusion steps on the modified classifier selection are shown in Figure 3. This plot shows clearly that the exclusion criteria are effective in eliminating at least 1 order of magnitude of structures in most cases where over 100 structures remain (note the logarithmic scale), including structures 4, 5, 9, 13, and 14. In most cases where the filtering criteria had no effect, the classifiers had already reduced the structures down to positional isomers (see above). The relative influence of the classifiers and the different exclusion criteria is shown in Table 2 for structures 9 and 15. For structure 15, although the exclusion criteria are effective in reducing the number of structures from 3902 to 689, the use of the additional classifier with precision just below the default 95% is much more effective in reducing structure numbers in this case (from 3902 to 36, reduced further to 20 after application of the exclusion criteria). The influence of a different strategy to reduce structure numbers is shown for structure 9. A total of 10 893 structures 909

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

were generated using classifiers with g95% precision, but no additional classifiers were available. Thus, an alternative restriction was introduced to limit the structures generated to those with five or more atoms in a ring only (i.e., excluding three- and four-membered rings). This strategy is not always optimal, as this can lead to elimination of some possible compounds (e.g., epoxides, cyclopropyl groups); however, the drastic reduction in the number of structures generated is demonstrated clearly (from 10 893 to 1474), and such a reduction is necessary in examples with insufficient classifiers to come up with candidates as a first step. The likelihood of eliminating the “correct” structure as a result is generally minimal but case-dependent. In this case, the exclusion criteria reduce the number of candidates further, such that 307 structures remain at the final step, compared with 922 without the additional restrictions. The correct structure remained in consideration in both cases. Although 307 candidates are still too many for identification purposes, this is a big improvement over 10 893 structures or even 1474. In some examples (e.g., structure 15), the classifiers themselves restrict the generation sufficiently and the use of this additional criterion is not necessary.

classifiers to use. Recording a UV spectrum of the sample during liquid chromatographic (LC) measurements (e.g., during sample fractionation) could provide supplementary information to select which of the classifiers applies in this case. The results shown in Figure 3 reveal that the correct candidates are missing for structures 10, 12, and 14. Structures 10 and 14 are quinones, whereas structure 12 has a bridging CH2 group over a naphthalene ring. Although structure 12 is relatively unusual, structures 10 and 14 are not unusual compounds, but removal of the clashing classifiers restricting generation of the correct structure resulted in over 20 000 molecules. The development of substructural classifiers specifically for quinones and similar compounds would help improve the identification of these structures. Improvement to the classifiers overall would also improve the applicability of this method to spectra further outside the domain of the NIST database. Boiling Point-Lee Retention Index. The retention index data presented in this study show that the experimentally measured values are very useful and accurate in terms of identifying the correct from false candidates. The prediction of KRI, included for example in the NIST database, is still too inaccurate to exclude a significant number of candidates (see Table 1) and does not cover all experimental values despite the large error margins. Although the same applies for the LRI, where the boiling point-LRI relationship developed by Eckel and Kind21 results in a very large “inclusion window”, this has still proved to be a useful exclusion criterion in all examples where the LRI was available. Care still needs to be taken when using this relationship, however, as shown for structure 5, where the value is outside the error margins despite the large range. Partitioning Coefficient (log Kow). As shown previously12 and again in Table 2, the log Kow is a useful exclusion criterion despite having a relatively high error associated with the values. This is especially relevant where the LRI-based exclusion is not available. Taking a less conservative error margin would improve the exclusion based on this criterion, as would improvements to the prediction of log Kow. If the compound class of the unknown is clear, more specific calculations of log Kow could be used, rather than the very general calculations available in EPISuite. The disadvantage of more specific calculations is that this would have to be applied on a compound-specific basis and is no longer applicable for an automated strategy such as the one presented here, designed for any unknown organic contaminant. Steric Energy. The steric energy was a surprisingly versatile and effective exclusion criterion in this study. The calculation with MOLGEN-QSPR is very quick (seconds up to minutes for several thousand structures, Intel Core 2 Duo 1.83 GHz, 1.00 GB of RAM) and provides similar information in a much shorter time frame compared with that with ChemBio3D, despite the different values. The quantiles show that molecules known to exist can have very high energies; however, the majority have low energies. Furthermore, molecules that were very sterically hindered (e.g., one-atom bridge bonds over PAH systems) had energies far above the 90% quantile used here as the exclusion criterion. The combination of both programs resulted in the exclusion of slightly more candidates than one alone. These results are in agreement with a previous study investigating the energy of known (natural or synthesized) C6H6 isomers,31 which also found that force field steric energy calculation was sufficiently accurate to describe the energy of structures compared with high-level enthalpy of formation calculation, with much lower computational effort.

’ DISCUSSION OF STRUCTURE ELIMINATION STRATEGIES The results presented above show that the application of exclusion criteria following structure generation using MOLGEN-MS is successful in reducing the number of candidates as a result of structure generation by several orders of magnitude in many cases. Some issues with each of the criteria are discussed briefly below. Substructure Classifier Use. The use of substructural classifiers is clearly shown in Table 2 as the key to success in limiting the number of structures generated for a given molecular formula. Although the exclusion criteria can reduce the number of structures dramatically following generation, exclusion by using the correct substructural information prior to generation is still more effective. The choice of which classifiers to use in some examples can be “trial and error” and may require the consideration of several different structure generation runs to avoid exclusion of candidates for an unknown spectrum. Problems can occur in some cases when combining the NIST and Varmuza classifiers in MOLGEN-MS. When many similar classifiers are identified for a spectrum, it is best to manually review the list prior to structure generation to ensure some classifiers do not restrict the structure generation in undesired ways. For example, the presence of Ar;CdOCH2 instead of the CdOCH2 classifier combined with the naphthalene substructure excludes the generation of 2-naphthaleneacetic acid for structure 13 (see Table S-2, Supporting Information) due to the arrangement of the double bonds. Similarly, the presence of the classifier “ph-O” (interpreted as C6H5-O) restricts the generated structures correctly for structures 2, 20, and 27 but excludes the correct structures for structures 22 and 23. A feature of the example used here is that biphenyl and naphthalene often both appear with probability 95%, leaving the choice of classifier up to the user (only one is possible for the formula C12H10O2). As this example is conducted with known spectra, it was possible to choose the classifiers to get the right answer; this option would need to be covered using multiple runs with different classifier combinations for real unknown determination, as it can be difficult to determine which of the clashing 910

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry Improvements to this criterion could be made by taking a bigger random set of molecules known to exist to generate more accurate quantiles or to restrict the data sets to include only compounds containing certain elements, a certain range in the numbers of atoms, or other properties, such that a more accurate energy distribution for compounds of a given formula could be estimated. As the molecular formula is known prior to structure generation, this could represent a simple improvement to the exclusion criterion presented here, which is currently very conservative. Another factor not considered in the steric energy calculation is, for example, the activation energy. It is possible that this calculation excludes molecules that are energetically unfavorable but stable due to a high activation energy required for the formation of breakdown products. This could be especially relevant for some environmental contaminants that have been produced to take advantage of specific properties, despite unfavorable reaction conditions. Spectral Match. As identified in a previous study, the use of predicted spectra to assess the spectral match appears to be a very subjective selection criterion, depending on the quality of fragment prediction. This is discussed in detail elsewhere.16 The match value still forms an important part of the filtering process shown in Figure 2, however, and was used to exclude approximately 1000 and 100 candidates, respectively, for structure 9 (see Table 2). Furthermore, the MV is also useful to generate a rank of remaining candidates in terms of the original measured data for final candidate selection. Overall Strategy. The method for structure generation based on substructural identification and progressive elimination presented in Figure 2 is very effective in improving the chances of identification of candidates based on structure generation techniques. Using the classifier selection and exclusion criteria, we could limit the number of structure candidates from over 1 000 000 000 on the basis of the molecular formula only to less than 6 candidates in 10 of 29 cases and to less than 35 candidates in 26 of 29 cases. Many of the remaining candidates were positional isomers of the correct structure, which are not able to be separated adequately using the predictive techniques applied during candidate exclusion. Once candidate numbers are restricted to relatively low numbers, more accurate prediction techniques could be used to separate these, or standard measurements could be used to select the candidates on the basis of retention behavior (depending on the cost and availability).

ARTICLE

achieved for spectra similar to those used to develop the current classifiers. Further improvements to substructure identification, such as the inclusion of additional spectra in the NIST database and the development of additional substructure classifiers, will improve the ability of these automated methods to interpret the mass spectrum correctly. Improvements to the prediction techniques and the incorporation of other exclusion criteria (e.g., toxicity prediction, additional partitioning properties) could also improve this method further. As discussed above, the incorporation of more advanced prediction techniques on a case-by-case basis could be applied to improve the exclusion of additional candidates following application of the general criteria presented here. Although based on GC/EI-MS, several principles described here can also be extrapolated to LC and/or to MSn based on other ionization techniques. The lack of substructural classifiers, shown to be the critical first step, is likely to be the limiting factor in developing a similar method for other ionization techniques.

’ ASSOCIATED CONTENT

bS

Supporting Information. Additional information as noted in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

’ AUTHOR INFORMATION Corresponding Author

*E-mail: [email protected]. Fax: þ49 341 235 1785.

’ ACKNOWLEDGMENT This work was supported by the European Commission through the integrated project MODELKEY (Contract Number 511237-GOCE) and by the Helmholtz Interdisciplinary Graduate School for Environmental Research (HIGRADE). We thank Prof. A. Kerber and co-workers for the ongoing use and development of the MOLGEN programs and M. Heinrich for performing the GC/MS measurements. Thanks also go to C. Gallampois, M. Krauss, N. Ulrich, and T. Schulze for their assistance in improving the outcomes of this study and to M. Krauss and the reviewers for improving the manuscript. ’ REFERENCES (1) Brack, W.; Schmitt-Jansen, M.; Machala, M.; Brix, R.; Barcelo, D.; Schymanski, E.; Streck, G.; Schulze, T. Anal. Bioanal. Chem. 2008, 390, 1959–1973. (2) OECD. Guidelines for the Testing of Chemicals; Organisation for Economic Co-operation and Development (OECD): Paris, 2004; Vol. 117. (3) Schymanski, E.; Bataineh, M.; Goss, K.-U.; Brack, W. Trends Anal. Chem. 2009, 28, 550–561. (4) NIST/EPA/NIH. NIST Mass Spectral Library; National Institute of Standards and Technology, U.S. Secretary of Commerce, U.S. Government Printing Office: Washington, DC, 2005. (5) Wiley. Wiley Registry of Mass Spectral Data; Wiley: New York, 2010. (6) Rostad, C. E.; Pereira, W. E. J. High Resolut. Chromatogr. Chromatogr. Commun. 1986, 9, 328–334. (7) Schymanski, E.; Schulze, T.; Hermans, J.; Brack, W. In Handbook of Environmental Chemistry on Effect-Directed Analysis; Brack, W., Ed.; Springer: New York, Chapter 8, submitted for publication. (8) National Center for Biotechnology Information (NCBI). PubChem. http://pubchem.ncbi.nlm.nih.gov/ (accessed March 16, 2010).

’ CONCLUSIONS The methods outlined in this paper have been shown to be very effective in reducing the number of structures generated for a given compound on the basis of data generated during GC/EI-MS analysis. The elimination of candidate structures down to positional isomers in most cases is a very positive step forward in the identification of compounds based on structure generation, without relying on a database spectrum match. The results show clearly that such a step cannot be performed on the basis of pure spectral prediction alone, but that the combination with substructure identification and candidate elimination based on calculated data improves the chances of identification considerably. Testing of the developed method on unknown spectra will be the next step in assessing the practical applicability of this method. As the results depend strongly on the identification of substructure classifiers, the best results will most likely be 911

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912

Analytical Chemistry

ARTICLE

(9) Royal Society of Chemistry. ChemSpider. http://www.chemspider. com (accessed March 15, 2010). (10) Kind, T.; Fiehn, O. Bioanal. Rev. 2010, 2, 23–60. (11) Krauss, M.; Singer, H.; Hollender, J. Anal. Bioanal. Chem. 2010, 397, 943–951. (12) Schymanski, E. L.; Meinert, C.; Meringer, M.; Brack, W. Anal. Chim. Acta 2008, 615, 136–147. (13) Varmuza, K.; Stancl, F.; Lohninger, H.; Werther, W. Lab. Autom. Inf. Manage. 1996, 31, 225–230. (14) Varmuza, K.; Werther, W. J. Chem. Inf. Comput. Sci. 1996, 36, 323–333. (15) Kerber, A.; Laue, R.; Meringer, M.; Varmuza, K. Adv. Mass Spectrom. 2001, 15, 939–940. (16) Schymanski, E.; Meringer, M.; Brack, W. Anal. Chem. 2009, 81, 3608–3617. (17) Kerber, A.; Meringer, M.; R€ucker, C. Croat. Chem. Acta 2006, 79, 449–464. (18) HighChem. Mass Frontier 5.0; HighChem Ltd./Thermo Scientific: Bratislava, Slovakia, 2007. (19) ACD. MS Manager 11.01 Advanced Chemistry Development, Inc. (ACD): Toronto, Ontario, Canada, 2007. (20) Meringer, M. MOLGEN-MSF; MOLGEN: Munich, Germany, 2009. www.molgen.de/documents/MolgenMsf.pdf. (21) Eckel, W. P.; Kind, T. Anal. Chim. Acta 2003, 494, 235–243. (22) Kind, T. Combination of GC-MS and Chemometrics for the Analysis of Compounds in Complex Environmental Samples (in German), Ph.D. Thesis, Chemistry and Mineralogy, University of Leipzig, Leipzig, Germany, 2003. (23) Allinger, N. L. J. Am. Chem. Soc. 1977, 99, 8127–8134. (24) The MathWorks. Matlab, version 7.2.0.232; The MathWorks Inc.: Natick, MA, 2006. (25) Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.; Laufer, J. J. Chem. Inf. Comput. Sci. 1992, 32, 244–255. (26) U.S. EPA. Estimation Program Interface (EPI) Suite, version 3.20; United States Environmental Protection Agency: Washington, DC, 2007. (27) Stein, S. E.; Babushok, V. I.; Brown, R. L.; Linstrom, P. J. J. Chem. Inf. Model. 2007, 47, 975–980. (28) Stein, S. E.; Brown, R. L. J. Chem. Inf. Comput. Sci. 1994, 34, 581–587. (29) CambridgeSoft. ChemBio3D Ultra, version 11.0; CambridgeSoft: Cambridge, MA, 2007. (30) Kerber, A.; Laue, R.; Meringer, M.; R€ucker, C. MATCH 2004, 51, 187–204. (31) Kerber, A.; Laue, R.; Meringer, M.; R€ucker, C. MATCH 2005, 54, 301–312. (32) Python Software Foundation. Python, version 2.5; Python Software Foundation: Wolfeboro Falls, NH, 2006.

912

dx.doi.org/10.1021/ac102574h |Anal. Chem. 2011, 83, 903–912