A numerical method for extracting mass spectra from gas

data array Into matrices containing the resolved spectra and chromatogram profiles. ..... where Ox is an s*s orthogonal matrix, 02 is a c*c orthogonal...
1 downloads 0 Views 1MB Size
Anal. Chem. 1985, 57, 1049-1056

integrated GC/IR/MS system and indicates the information available. The versatility of FTMS recommends it for the approach described, as both accurate mass and chemical ionization data, in addition to E1 mass spectra, are potentially available from a single analysis. Although pressure requirementa differ markedly for the collection of CI and AMM data, the use of pulsed values (16)to provide momentary CI reagent gas pressures, in addition to a dual-differentially pumped trapped ion cell to maintain constant pressure for AMM, should provide the means for performing the integrated measurement.

LITERATURE CITED (1) Hertz, H. S.; Hltes, R. A.; Elernann, K. Anal. Chem. 1971, 43, 681. (2) McLafferty, F. W.; Hertel, R. H.; Villwock, R. D. Org. Mass. Spectrom 1974, 9 , 690. (3) Pesyna. G. M.; Venkataraghavan, R.; Dayrlnger. H. 0.; McLafferty, F. W. Anal. Chem. 1978, 48, 1362. (4) Henneberg. D. Adv. Mass Spectrom. 1980, 68, 1511. (5) Schafer, K. H.; Hayes, T. L.; Erasch, J. W.; Jakobsen, R. J. Anal. Chem. 1084, 56,237-240. (6) Chiu, K. S.; Blernann, K.; Krlshnan, K.; Hill, S. L. Anal. Chem. 1084, 56, 1610-1615.

1049

(7) Wilkins, C. L.; Glss, G. N.; White, R. L.; Erlssey, G. M.; Onylrluka, E. C. Anal. Chem. 1982, 54,2260-2264. (8) Wllkins, C. L.; Giss, G. N.; Brlssey, G. M.; Steiner, S. Anal. Chem. 1081, 53, 113-117. (9) Crawford, R.; Hirschfeld. J.; Sanborn, R. H.; Wang, C. M. Anal. Chem. 1982, 54,817-820. (IO) Wilklns, C. L.; Glss, 0. N.; Erlssey, G. M.; Ijarnes, C.; Steiner, S. 1983 Pacific Conference on Chemistry and Spectroscopy, Oct 27, 1983 Paper No. 267. (11) Laude, D. A., Jr.; Erlssey, G. M.; Ijarnes, C. F.; Brown, R. S.; Wilklns, C. L. Anal. Chem. 1984, 56, 1163-1168. (12) Ledford, E. E., Jr.; Ghaderi, S.; White, R. L.; Spencer, R. B.; Kulkarni. P. S.; Wllkins, C. L.; Grass, M. L. Anal. Chem. 1980, 52, 463-468. (13) Johlman, C. L.; Laude, D. A., Jr.; Wilkins, C. L. Anal. Chem., preceding paper In this Issue. (14) Ledford, E. E., Jr.; Rempel, D. L.; Gross, M. L. Anal. Chem. 1984, 56, 2744-2748 -. . . -. .-.

(15) Lowry, S. R.; Huppler, D. S. Anal. Chem. 1981, 53,889-893. (16) Sack, T. M.; Gross, M. L. Anal. Chem. 1983, 55,2419-2421.

RECEIVED for review October 10,1984. Accepted January 15, 1985. Support from the National Science Foundation under NSF Grants CHE-80-18245 and CHE-82-08073 and a Department Research Instrument Grant, CHE-82-17610, is gratefully acknowledged.

A Numerical Method for Extracting Mass Spectra from Gas ChromatographyIMass Spectrometry Data Arrays Martin D. King*’ and Graham S. King2 Bernard Baron Memorial Research Laboratories, Queen Charlotte’s Hospital, London W 6 OXG, United Kingdom

A numerical method Is descrlbed for extracting single-component mass spectra from GWMS data arrays containing poorly separated spectral and chromatographic informatlon. The procedure Is an extension of the 1981 least-squares method of Knorr, Thorshelm, and Harris for decomposlng the data array into matrices containing the resolved spectra and chromatogram profiles. The modifications include an exponential down-scan lntenslty correction, a routine for reconstructing saturated mass spectra, and a background subtractlon facility. Furthermore, the requirement for previously obtained chromatographic Information is eliminated through the use of a nonilnear unconstrained optimization routine to determine all the parameters that characterize the chromatographlc behavior.

The capability of any chromatographicanalytical technique to detect and identify poorly separated components is, in general, considerably enhanced through the use of a multichannel detection system. A mass spectrometer, when interfaced to a chromatographic apparatus, can function as a particularly powerful multichannel detector, and GC/MS is now a widely used analytical technique, particularly in the pharmaceutical industry and in clinical laboratories. A number of mathematical methods have been developed for processing multichannel chromatographic data arrays and these have been used to tackle GC/MS problems. Dromey *Present address: Abteilung Spektroskopie, Max Planck I n s t i t u t fur Biophysikalische Chemie, D-3400Gottingen, Federal Republic of Germany. Present address: Metabolism a n d Residue Section, I.C.I. P l a n t Protection, Jeallot’s Hill, Backnell, Berkshire, U.K. 0003-2700/85/0357-1049$01.50/0

et al. (I) have adopted a procedure in which the data matrix is scanned for the sharpest mass chromatograms. The ions giving rise to these narrow peaks are assumed to be unique to one of the overlapping components and their chromatograms are used to define model peak shapes and hence to resolve the overlapping components. Other researchers have developed routines which rely more heavily on numerical methods. Rather elegant treatments have been suggested by Sharaf and Kowalski (2) who use factor analysis to achieve the separation and Knorr, Thorsheim, and Harris (3) (hereafter referred to as KTH) who have developed a least-squares procedure. In the present paper we outline a number of modifications that have been incorporated into the method described by KTH in order to tailor the procedure to the needs of our laboratory, which is principally concerned with metabolic profile analysis. These modifications include a cubic-spline interpolation procedure to correct for the exponential downscan time delay, a method for reconstructing saturated mass spectra, and a modification to deal with background interference. KTH adopt a procedure in which the chromatographic “response” is determined by running a series of standards, after which the number of theoretical plates and a decay time constant are calculated. This information is required to perform the decomposition. While this approach has its advantages, for routine use in a busy clinical laboratory the need to run many calibration standards was considered unacceptable in terms of the work it would entail. Furthermore, a large number of compounds are present within a single metabolic profile, and these exhibit a wide variety of chromatographic characteristics, making the calibration almost impossible. We therefore sought to eliminate the need to run calibration standards by using a nonlinear unconstrained optimization routine to extract from the GC/MS data matrix 0 1985 American Chemical Society

1050

ANALYTICAL CHEMISTRY, VOL. 57, NO. 6, MAY 1985

itself all the parameters that characterize the chromatographic behavior. We examine the performance of these modifications by using the method to mathematically separate the spectra of two chromatographically unresolved organic acids whose retention times differ by less than one sampling interval. We show that the method can work well but also examine its limitations and discuss its performance in relation to other procedures.

EXPERIMENTAL SECTION The organic acids were dissolved in acetonitrile and silylated with N,O-bis(trimethylsily1)trifluoroacetamide (BSTFA) at 70 “C for 30 min. Small aliquots were injected on to a 3% OV1 silicone oil packed column coated onto a support of Diatomite CLQ. The injection temperature was held at 300 “C and the column temperature programmed from 100 “C to 270 O C at 8 “C/min. Helium carrier gas was used at a flow rate of 60 mL atm m i d . The eluate was introduced into a VG70/70H double focusing mass spectrometer via an all-glass jet separator held at 230 O C . The ion source was operated at 230 O C , and electrons of energy 70 eV were used to ionize the GC eluate. The electron trap current was 200 MA. The source was held at 4 kV positive potential and the magnetic field scanned at 3-sintervals and at a rate of 2 s/decade with a nominal resolving power of 1000 (10% valley). The signal was processed by on-line acquisition into a commercial PDPSAbased VG-DS2250 data system using an interface dynamic range of 4096:l with 10-V full-scale input, an input threshold of 6 mV, and a frequency response of 3000 Hz. MATHEMATICAL METHOD

Our method is an extension of that suggested by KTH (3). Essentially a nonlinear unconstrained optimization routine (6) is used to minimize an objective function which is the sum of the squared weighted differences between the elements of a calculated intensity data matrix and the experimental data matrix elements. The objective function variables are a set of peak-shape function parameters and retention times. At each iteration the current peak-shape function is used to obtain the chromatographic profile of each component and multiple linear regression is used to generate the corresponding least-squares intensity data matrix. For computational convenience we have used a slightly different formulism to that described by KTH, and each of our matrices is the transpose of those defined in ref 3. The GC/MS data are written as an s*m matrix D, where m is the number of mass chromatograms (Le., channels) and s the number of scans. The ith column in D is therefore the chromatogram of the ith ion. Following the method of KTH we write

C(QA) = D

(1) where C is an s*c chromatogram matrix. A is a c*m mass spectral matrix and Q a diagonal matrix containing a normalization factor for the ith component as its ith diagonal element. c is the number of chemical components. The ith row of A contains the mass spectrum of the ith component and the ith column of C contains its chromatogram. Having defined the matrices in this way, eq 1 can be treated as a standard multiple linear regression problem, but with a multiple right-hand side. Chromatogram Matrix and Peak-Shape Function. In order to cater for the presence of a significant level of background interference, we assign the first component (Le., first column in C) as background. The present analysis is computationally feasible only if each calculation is performed on a restricted region of a total chromatogram. Over such a region we find that a linear background function is adequate and calculate the first column of C using c,, = 1.0 + x t i (2) (note, the first diagonal element of Q contains the scaling

factor) where t, is time at the ith scan and x is one of the objective function variables. This linear expression could be replaced by a more flexible function if it were necessary. KTH have calculated the elements of their chromatogram matrix using a “trailing Gaussian” expression obtained by convoluting a Gaussian function with a single-sided exponential. This function contains three variables, these being the retention time and variance of the Gaussian peak and a tail decay time. They adopted a procedure in which measurements were made on pure samples in order to obtain values for the variance and decay time, leaving only the retention times to be determined. We have adopted a different approach and have used a nonlinear unconstrained optimization technique to obtain values for all of the chromatographic parameters. In the present calculations we used the peak-shape function

Cij = exp(-[(ti - r,)2/2a2]) + [l - 0.5(1 - tanh @(ti (rj + h ) ) ) ) ]exp[4.5X{[(ti f - (rl e))2]1/2+ ( t ,- (r, + e))ll (3)

+

to calculate the chromatogram matrix elements. This expression, which was suggested by Cheder and Cram ( 4 ) , contains a Gaussian and an exponential decay term together with a hyperbolic tangent “joining” function. e specifies the point at which the exponential is “turned on”, expressed in relation to the position of the Gaussian maximum r) h determines the midpoint of the tanh function, again relative to the position of the Gaussian maximum. f is the amplitude of the exponential function and h the rate of its decay. g determines the rate of change of the tanh function. In the following pages we refer to the position of the Gaussian maximum as the retention time. This does not precisely coincide with the position of maximum intensity. We use the term “peak shape function parameters”, meaning those variables that determine the background slope and the shape of the chromatogram profiles. These parameters, together with the retention times, are collectively called the “chromatogram parameters”. In this preliminary study we make the simplifying assumption that two virtually coeluting components have identical peak shapes and differ only by a displacement in time. Given initial estimates of the unknown parameter values the corresponding initial estimate of the elements of the chromatogram matrix can be calculated. We proceed by writing an analytical expression for C+, the pseudoinverse (5) of the chromatogram matrix

c+= (CTC)-lCT

(4)

Multiplying eq 1 from the left by C+ yields

QA = (CTC)-lCTD

(5) Following the procedure outlined in detail by Strang (5)we can show that if d, and q, represent the ith column vectors of D and (QA), respectively, then C(CTC)-lCTd,is the projection of d, onto the column space of C. (CTC)-lCTdiis therefore the least-squares solution to the inconsistent system Cqi = d,, i = 1, 2, ..., m, for a given set of parameter values in eq 2 and 3. A nonlinear unconstrained optimization routine (6) is used to refine our estimates of the unknown parameters. The residuals at the kth iteration are the differences between the elements of Dexptland D(k)cdcdwhere

D(k)c,lcd = Cck)(QA)ck)

(6)

and is obtained from eq 5. Following the suggestion of KTH, the residuals are weighted to discriminate against meaningless solutions in which some intensities are negative. We calculate the elements of the residual matrix using

ANALYTICAL CHEMISTRY, VOL. 57, NO. 6, MAY 1985

The denominator in this equation gives increased weight to those solutions in which the component spectra exhibit the greatest differences. w determines the importance of the weighting. If and I t are, respectively, the largest and smallest intensities in the jth column and the second to cth row of the mass spectral matrix, i.e., the largest and smallest j t h ion intensities in the component spectra, not including the background. In the present calculations w was set to 30, but little spectral sensitivity to w was seen in the interval 1 Iw Iloo.

Exponential Down-Scan Correction. Two approaches for making a down-scan time correction were tested. Method A. A corrected scan time (t,,,) was calculated for each mass using t,,,, = t - In (mass/m,)/l (8) where 1 = -d In (O,l)/s. s is the scan rate, d the interscan delay, and m, the lowest mass in the data matrix, for which ion t,,, = t. t,,,, was substituted for t in eq 2 and 3 and a new chromatogram matrix and pseudoinverse calculated for each mass. In this way the elements of (QA)(k)were calculated column by column (i.e., mass by mass). This approach was found to be prohibitively expensive in terms of CPU time, since m pseudoinverse matrices had to be calculated at each iteration of the optimization process. Method B. Cubic-splineinterpolation was employed to fit a curve to each of the m mass chromatograms. The interpolating cubic splines were then used to calculate the intensities at the corrected times (t,,,,) given by t,,,, = t In (mass/ms)/l (9)

+

We routinely adopted method B. Although less rigorous than A, this approach is not demanding of CPU time. Saturation. Clearly, the method must be modified if it is to be applicable to data in which some mass chromatogram peaks are distorted through saturation of the detection system. The following approach was adopted. Firstly, the experimental data matrix is scanned for saturated elements and a record made of the rows and columns in which saturation occurs. A new data matrix is then constructed in which all saturated elements are removed by setting to zero all the elements of the columns in which they lie. This new matrix is used to optimize the chromatogram matrix and to calculate the majority of the elements of the mass spectral matrix (QA). This procedure amounts to a removal of all saturated mass chromatograms and the use of the remaining chromatograms to optimize C. It yields a (QA) matrix containing no information about those ions whose chromatograms were zeroed. To get a least-squares estimate of the missing elements, we take the original data matrix and set to zero all rows containing a saturated element, Le., set to zero all saturated scans. The corresponding rows of the optimized chromatogram matrix are similarly set to zero and a new pseudoinverse calculated. If we label these two new matrices D'and C H , respectively, then an estimate of the missing elements in (QA) is given by the corresponding elements of the matrix (QA) ' where

1051

It is important to note that we only use those elements in (QA)' that are missing in (QA) and do not replace those elements that have already been assigned a value. This process is equivalent to finding the least-squares fit to the wings of the saturated mass chromatograms (Figure 6). Singular Value Decomposition. It is well-known that given a linear system A x = b, the matrix A can be so illconditioned with respect to inversion that the computed solution bears no relation to the exact solution. A small perturbation in either the matrix A or the vector b, caused by error in the original data or by round-off, can produce a massive perturbation in x. Error analysis (5) leads to the following inequalities/

and

n is the condition number of A. llxll and IlAll are the vector and matrix norms, respectively. An analytical expression for (&A) is given in eq 5, but it is not recommended that this equation be used for serious calculations. The condition number n(PC)is the square of n(C). Clearly, premultiplying C by its transpose can turn a well-conditioned problem into one that is ill-conditioned (5). A more stable procedure is to factorize the s*c matrix C into

c = 0,ZOzT

(13)

where O1is an s*s orthogonal matrix, O2is a c*c orthogonal matrix, and is a diagonal matrix containing the singular values of C. This is called singular value decomposition. The inverse of an orthogonal matrix is its transpose and so the pseudoinverse of C is

c+ = O2Z+OlT

(14)

where Z+ is a diagonal matrix containing the reciprocal singular values of C. This procedure is adopted in the present work. Data Matrix Scaling. Trial calculations showed that the nonlinear least-squaressolutions were dominated by a few high intensity mass chromatograms with the result that meaningless intensities were sometimes obtained for the low intensity ions. This did not occur if the data matrix was scaled. A careful compromise must be found between scaling up the lower intensity mass chromatograms to a degree sufficient that they make a significant contribution and overscaling, which would cause very low intensity and hence noisy mass chromatograms to distort the solution. Our data collection system is fitted with a 12-bit a-d converter and the elements of the data matrix therefore lie in the range 0 5 D, 5 4095. For these data matrices we used the scaling

Dkj"= Dij/Sj, Sj > 100

(15a)

Dij" = Dkj,S, I100

(15b)

where D" is the scaled matrix and Sj the largest element of the j t h column of the intensity-corrected data matrix D. Computer Program. The Fortran IV program, which was run on both the CDC 6600 and the Cray 1s computers at the University of London Computer Centre and in double precision on a Univac 1100 time-sharing system, makes extensive use of the NAG (Numerical Algorithms Group, Banbury Road, Oxford, U.K.) Library of subroutines. The following routines were used. Subroutine EO4FCF for performing the uncon-

1052

ANALYTICAL CHEMISTRY, VOL. 57, NO. 6, MAY 1985

1 2 3 4 5 6 7 ' " SCAN NUMBER

Flgure 1. 4-Hydroxybenzoic acid-DITMS/(4-hydroxyphenyl)acetic acid-DITMS GC/MS data matrix.

2

A-

e

strained optimization,subroutine FO2WCF for computing the singular values, and left- and right-hand singular vectors of the chromatogram matrix and subroutines EOBBAF and EO2BBF for calculating the interpolating cubic splines. These routines, in their turn, call many other NAG library routines and several user-written subroutines (see NAG library manual (ref 7) for details). The amount of CPU time required for any one problem depends, of course, on the accuracy of the initial estimates of the objective function variables, on the criteria used to terminate the search for a minimum, and the size of the data matrix. In the next section we examine the results obtained with a 7 scan by 1 2 mass data matrix. For this 7*12 matrix the computation rate on the CDC 6600 was about 64 objective function evaluationsper second. When the same program was run on the Cray 1s (Le., without reconstructing the program to fully exploit the vectorial nature of the Cray processor), a computation rate of about 340 objective function evaluations per second was achieved. The largest number of function evaluationsperformed during any of our trial calculations was 2178. In this calculation initial estimates of the retention times and variance were obtained by visual inspection of the data matrix. All the remaining objective function variables were initialized to zero except for the tanh slope which was set to 10. The search for a minimum was terminated when ~ ~ (I ~loy3 + ~for) all l i objective function variables, where k is the iteration number. This condition was satisfied after 34 iterations during which the 2178 function evaluations were made. The total Cray CPU time was about 6 s.

RESULTS We have examined the various features of the method outlined in the previous section by processing the data matrix obtained from a mixture of bis(trimethylsily1) (DITMS) derivatives of 4-hydroxybenzoic acid and (4-hydroxyphenyl)acetic acid. Our laboratory is routinely engaged in the identification of organic acids in clinical samples and these data were selected as being typical of those obtained with poorly separated organic acids. A second reason for choosing these data was a need for the presence of unique ions in order to examine and illustrate some aspects of the procedure and its performance. While the mathematical methods were under development, we found it an advantage to perform trial calculations on small data subsets. Not only did this keep our CPU demands at a reasonable level, but through the use of small test data matrices a clearer picture emerged of the response to altered procedures and parameters and the reasons for the observed behavior. We draw on some of the results obtained while this work was in progress to illustrate various aspects of this behavior. The intensity data matrix for the ion subset is shown in Figure 1 and the total ion current in Figure 2. Exponential Down-Scan Intensity Correction by CUbic-Spline Interpolation. Shown in Figure 3 is ii portion of the sampled chromatogram of one of the unique masses, i.e., the mlz 296 molecular ion of (4-hydroxypheny1)acetic acid-DITMS. The curves generated by cubic-spline interpolation and the peak-shape function are also shown. The

Figure 2.

1

2

3 1 5 SCAN NUMBER

6

7

Total ion current.

I

1

2

I

I

3 L 5 SCAN NUMBER

I

I

6

7

Flgure 3. The m / z 296 ion chromatogram profile: X, experimental data: .-, peak-shape function data points. The curve was generated by using the cubic-spline interpolant.

Table I. Experimental and Calculated Ion Intensity Ratios" 267:268 ion intensity

296:297 ion intensity

ratios scan no.

exptl data

cubic-spline

data

exptl data

4.24 4.24 4.24 4.27 4.20 4.44 4.23

4.27 4.23 4.25 4.25 4.22 4.66 4.23

3.87 3.78 3.90 3.96 3.83 3.90

m

ratios cubic-spline data 6.41 3.76 3.83 3.94 3.95 3.73 3.90

a The calculated ratios were obtained from the interpolating cubic salines after amlvinn the time correction given bv ea 9.

discrepancy between the two curves is relatively small, as required if the cubic-spline intensity correction method is to succeed. Furthermore, there is a good agreement between the unique-ion intensity ratios in the raw data and those obtained from the cubic-spline interpolants (Table I). The time correction for the m f z 296 and mlz 297 ions corresponds to a shift to a point approximately midway between two data points. The observation that the ion-intensity ratios are maintained at a point at which the error is expected to be a maximum confirms that the cubic-spline intensity-correction method performs satisfactorily. The Separation of the Component Spectra. Shown in Table I1 are the resolved spectra of the two organic acids together with the background spectrum. The library spectra, which are included for comparison, are taken from the Denver Collection compiled by Markey et al. (8). The first point to note is that a nonlinear least-squares solution has been found in which the entire base line is raised slightly with respect to the true base line. This adds background intensity to every ion, the magnitude of which is a reflection of the contribution each ion makes to the total ion current. A downward base

ANALYTICAL CHEMISTRY, VOL. 57, NO. 6, MAY 1985

1053

Table 11. Library and Calculated Mass Spectraa 4-hydroxybenzoic

background cacld

mass

100.0 45.2 10.3

73

75 193 252

7.4 14.0 2.5 1.2

267 268

269 281

calcd

lib(2)

calcd

lib(1)

lib(2)

75.4 8.2

100.0 12.7

50.5 46.1

100.0 18.5

100.0

62.7 0.0 100.0 23.7

58.6

52.2

0.9

0.6

0.0 100.0 21.9

100.0 12.9 2.4* 10.4 3.1* 0.7 0.3 9.1 3.2 1.0

0.4 19.9 0.0 0.0 0.0 21.1

9.5

14.3 4.1

15.3 4.9 3.2 6.6

282 283 296 297

(4-hydroxypheny1)acetic acid-DITMS

acid-DITMS lib(1)

10.1

99.8 21.8 8.5

2.6 27.0 6.9 0.2 0.2

2.2 28.0 6.8 0.0 0.0

9.3 2.0 26.4 6.2 0.0 0.0

12.0 1.1 0.0 0.0 12.0

3.1 1.5

15.9

5.6 2.3 26.6 7.1

2.4 1.5 "The library spectra, two for each compound, are taken from the Denver Collection (8). Two calculated ion intensities which are unreliable due to high intensity in the adjacent component are indicated by asterisks.

1

I

I

I

2

3 1 5 SCAN NUMBER

I

I

6

7

SCAN NUMBER

Figure 4. Normalized m l z 267 and m l z 296 ion chromatogram the m l z 296 ion profiles: X, the rnlz 267 ion experimental data; experimental data; , the m l z 267 ion data shifted to give coincident maxima. The two curves were generated from the interpolating cubic splines.

Figure 5. Normalized m l z 267 ion and m l z 296 ion chromatogram profiles: 0, the m l z 267 ion intensity data after applying the exponential down-scan correction; 0 , the time-corrected m l z 296 ion the first component chromatogram matrix elements; intensity data; X, the second component chromatogram matrix elements.

line shift gives rise to a background spectrum in which all but the most intense of the true background ions have a negative intensity, Bearing this in mind, we note that a satisfactory separation of the background ions has been achieved. As expected, the ubiquitous mlz 73 and mlz 75 ions are of high intensity together with the mlz 281 ion which originates from the OV1 stationary phase. An upward base line shift accounts for the high background intensity attributed to the mlz 267 ion, this species making the second largest contribution to the total ion current (Figure 1). A comparison of the library and calculated organic acid spectra (Table 11) indicates that, on the whole, a good separation of the two components has been achieved. The results do, however, highlight a fundamental limitation of the method. The level of contamination of an ion in one component increases with increasing ion intensity in the adjacent component, an inescapable consequence of the least-squares nature of the calculation. Two such entries in Table I1 are marked by asterisks. We have attempted to identify the major factors that contribute to the cross contamination of the component spectra. Shown in Figure 4 are the chromatograms of two unique ions, one derived from each component. Also shown are the curves generated by cubic-spline interpolation. The marked similarity of the two curves suggests that a violation of our assumption that the same peak-shape function and parameters can be used for two closely eluting components is not, in the present case, expected to be a major cause of error. Nevertheless, the superimpositionof one chromatogram onto the curve fitted to the other component indicates that an exact.match is not achieved. The resulting mathematical separation of the two single-component spectra is therefore

expected to be less than perfect. The chromatograms of the two unique ions, as obtained after a correction has been made for the exponential down-scan time delay, are shown in Figure 5. Also shown are the chromatogram matrix elements obtained at the nonlinear least-squares solution. The calculated chromatogram matrix elements do not exactly match the experimental chromatogram profiles. This must give rise to some distortion of the component spectra and is the cause of the failure to achieve a perfect separation of the unique ions. The cross contamination is, however, relatively small in magnitude and the component spectra certainly adequate for subsequent library matching. Furthermore, a significant reduction in the level of cross contamination is achieved by taking a data matrix containing a larger number of ions (unpublished observation). We wish to emphasize that a comparison of the experimental and calculated chromatograms is not a feature that can be incorporated into the resolution procedure itself. The present data matrix with its unique ions was selected for the purpose of making these comparisons. The method is, however, intended to be general and does not require the presence of unique ions. A requirement for a prior knowledge about such uniqueness would impose a severe limitation on the applicability of the method. Initial Estimates of the Objective Function Variables. We have found that rapid convergence to a meaningful solution is achieved only if good estimates of the retention times are supplied as a starting point for the iterative process. At present we obtain these values by visual inspection of the elements of the data matrix after making the exponential down-scan intensity corrections. The present data matrix with its unique masses is expected to yield better estimates of the

+,

+,

1054

ANALYTICAL CHEMISTRY, VOL. 57, NO. 6, MAY 1985

Table 111. Reconstructed Mass Spectra" 4-hydroxybenzoic acid-DITMS no. of saturated data matrix elements 73 75 193 252 267 268 269 281 282 283 296 297

(4-hydroxypheny1)aceticacid-DITMS no. of saturated data matrix elements

0

1

2

3

0

1

2

3

75.4 8.2 62.7

70.5 8.5 62.7

70.9 8.5 62.7

70.7 8.5 62.7

100.0

100.0

100.0

100.0

12.9

0.0

0.0

0.0 100.0

10.4

13.1 3.1 10.6 4.2

13.1 3.0 10.6

100.0

0.0 100.0

12.8 3.0 10.4

23.7 10.1 2.6 27.0 6.9 0.2 0.2

23.7 10.1 2.6 27.0 6.9 0.2 0.2

100.0 23.7 10.1 2.6 27.0 6.9 0.2 0.2

23.7 10.1 2.6 27.0 6.9 0.2 0.2

2.4

4.1

3.1 0.7 0.3 9.1 3.2 1.0 9.5

4.2

0.4

1.0 0.4

0.4

9.0 3.4

9.2 3.5

9.2 3.5

1.1 9.4

1.1

1.1

9.6

9.6

2.4

2.4

2.4

1.0

2.4

1.0

a An increasing number of saturated elements were introduced into the data matrix by a stepwise lowering of the level at which an element is treated as being saturated. The mass spectra were reconstructed as outlined in the mathematical methods section.

retention times than would be obtainable in the absence of these unique ions. Nevertheless, even in the absence of uniqueness, those ions having large intensity ratios are expected to yield reasonable initial values for the retention times. For a two-component problem, reasonable estimates can be obtained by inspecting the intensity-corrected data matrix and taking the shortest and longest maximum ion-intensity time points. We have found that providing the requirement for a good estimate of the retention times is met and a reasonable value for the variance is supplied, very approximate initial values are adequate for the remaining objective function variables. These variables determine the background slope and the asymmetry in the peak shape. In the present application a satisfactory solution is obtained by initially setting the tanh slope to 10.0 and the remaining variables to zero. The need for good initial estimates of the retention times arises because more than one solution exists, and these minima are poorly separated. To be more precise, there are several vectors x (xi being the objective function variables) for which the objective function value F(x) is a local minimum, and these solution vectors are poorly separated in their vector space. The use of a peak-shape function containing a relatively large number of variables may exaggerate this problem, a point we shall return to later. Reconstruction of Saturated Mass Spectra. To demonstrate the ability of the method to deal with saturated data elements and to reconstruct saturated mass spectra, we have simulated the effects of saturation by lowering the level at which a data matrix element is treated as being saturated. Shown in Table I11 are the resolved mass spectra obtained as an increasing number of saturated data matrix elements were introduced. The corresponding saturated base-ion chromatograms are shown in Figure 6. It should be noted that if any one data matrix element is saturated, the base-ion chromatogram must be reconstructed and that any error inherent in this process will be reflected in the intensities of all the remaining ions. The results indicate that the method works reasonably well even when little base-ion chromatographic information remains. DISCUSSION An Overall Assessment of t h e Procedure and a Comparison with Other Methods. We have been developing an automated system for extracting resolved mass spectra from CC/MS data matrices in which the spectral and chromatographic information are poorly separated and for performing subsequent library matching. As a part of this work we have been experimenting with several mass spectral resolution methods. The procedure outlined in this paper is an extension of the elegant method suggested by KTH (3). We have made

I I

1

2

I

I

I

3 L 5 SCAN NUMBER

I

6

7

Flgure 6. A diagram showing the loss of base-Ion chromatogram information as an increasing number of data matrix elements become saturated.

a number of modifications in order to adapt the method to our particular GC/MS problems. These modifications include the incorporation of a cubic-spline correction for the exponential down-scan time delay and procedures for reconstructing saturated mass spectra and for dealing with background interference. We have shown in this paper that the modified procedure can be applied to a real GC/MS problem and that it is capable of achieving a good separation of the component spectra. Although all the results given in the previous section were obtained with a rather small subset of ions, a few essentially complete data matrices have been processed. The results of these calculations are very encouraging and indicate that a better resolution is obtained with larger matrices (manuscript in preparation). An essential difference between our method and that of KTH lies in our use of a nonlinear unconstrained optimization procedure to determine all the peak-shape function parameters. KTH use previously obtained chromatographic information to determine their two peak-shape parameters, leaving only the retention times to be determined. This is not feasible in our application. The routine need to run standards with each sample was considered unacceptable in terms of the increased work it would entail. Furthermore, this approach cannot be applied to problems in which two coeluting and unidentified materials exhibit different chromatographic behavior. For some applications, however, the KTH approach might be preferred. A reduction in the number of objective function variables must lead to better overall stability and to increased convergence rates. All of the remaining features of the present method (the down-scan intensity correction,

ANALYTICAL CHEMISTRY, VOL. 57, NO. 6, MAY 1985

the reconstruction of saturated spectra, and the background subtraction) could, of course, be incorporated into the procedure. In the present work we have used the peak-shape function suggested by Cheder and Cram ( 4 ) which contains eight variables. Despite the flexibility of this function some data matrices were encountered in which incompatibility of the function and the experimental chromatogram profiles led to poor overall performance. Poor results were obtained when, for example, “heading” was evident. This behavior, with its characteristic linear leading edges, is presumably the result of poor chromatographicperformance and/or stationary phase overloading. The calculation is, of course, expected to perform badly in this situation since the present function assumes the leading edge to be Gaussian. This particular problem could be dealt with by incorporating some leading edge flexibility into the function. In general, problems of this type might be solved by making several peak-shape functions available to the program, one of which would be selected for a particular application. A second reason for looking at other functions lies in the reduction in CPU time that should be obtainable through a decrease in the number of objective function variables and the resulting reduction in the number of arithmetic operations required at each iteration. An improved convergence rate should also be obtainable together with a better separation between local minima and a consequent relaxation of the requirement for good initial estimates of the retention times. It has been suggested to us that for some applications the exponentiallyconvoluted Gaussian function used by RTH (eq 3 of ref 3), and which contains relatively few parameters, might be considered. Although we have shown that in the present case, the two component chromatogram profiles are similar in shape, this will not always be the case. The availability of functions containing fewer variables would be useful for dealing with problems in which overlapping components exhibit different chromatographic behavior. Our method assumes that it is known, a priori, how many components are present. KTH suggest that the calculation should be performed several times, increasing the number of components with each calculation until no significant decrease in the residual sum of squares is obtained. This approach is rather demanding of CPU time and the use of principal component analysis (2) might be preferred. We have pointed out that good initial estimates of the retention times are essential if the present method is to work. Although we have, to date, obtained these values by visual inspection of the intensity-corrected data matrix, the procedure could be automated along the lines discussed in the previous section and using information obtained during the cubic-spline interpolation to locate the position at which each ion reaches its maximum intensity. Dromey et al. (1) have developed a method for extracting resolved mass spectra from GC/MS data matrices which, unlike the present approach uses no iterative numerical methods and can be implemented on a modest minicomputer. Essentially, a model peak shape is extracted from the data itself. A basic assumption is that the mass spectra of two neighboring and unresolved eluants contain unique ions. These unique ions can be identified since they are expected to give rise to sharper maas chromatograms than the remaining ions. The unique ion chromatograms, once identified, are used to construct the model peak shapes. The major limitation of the method probably lies in the dominant role of the selected unique ions, since the precision of the final resolution is determined by the precision of the measurements made in the individual “unique ion” channels. For this reason the approach relies on the presence of unique ions of reasonably high in-

1055

tensity. This is an essential difference between the present method and the “Dromey” procedure. Our method uses all the information in the data matrix to generate the chromatogram profiles and is not so dependent on the measurements made in a single channel. If, however, the peak-shapefunction and the true chromatographic behavior are incompatible, our method cannot work well. On the other hand, since the Dromey method makes little assumption concerning the chromatogram shapes, their procedure is expected to be relatively unaffected by unusual chromatographic behavior. The two methods have been compared in this laboratory. Our experience to date indicates that although the iterative method described here looks promising, it is unlikely that it could be used as a replacement for the Dromey procedure as a means of processing complete GC/MS profiles. This can be achieved by using the Dromey approach on a modest minicomputer (a chromatogram containing 50 components can be processed in about 30 min; Moon, unpublished work). On the other hand, the present method requires several seconds of mainframe CPU time to process a small region of the chromatogram and a small subset of ions. In our hands, however, the Dromey procedure does not always achieve a separation of overlapping spectra although the components invariably contain unique ions. We therefore feel that the two methods are complementary and that our procedure should be useful for processing those regions of the chromatogram where the Dromey method fails. KTH point out that a fundamental limitation of the least-squares approach is that the presence of high intensity ions in one component leads to a ”contamination” of the resolved spectrum of the overlapping component. Those ions that have large amplitude neighbors must be suspected, therefore, of being in error, a point which must be noted during any subsequent library matching. This is a major limitation if these ions are diagnostically important and it may lead to a misassignment of the parent ion. Although the spectra in Table I1 exhibit only a low level of cross contamination, some spillover of the high intensity ions of 4-hydroxybenzoic acid-DITMS into the (4-hydroxypheny1)aceticacid-DITMS spectrum is evident. It is not, however, of a sufficient magnitude to be a problem during subsequent library matching. Furthermore, a few calculations have been performed with essentially complete data matrices and the results indicate that a better separation of the component spectra is obtainable with larger matrices. Sharaf and Kowalski (2) have developed a principal component/factor analytical method for resolving overlapping GC/MS components. Their procedure yields an upper and lower limit for the intensity of each ion. An inspection of their data indicates the presence of some degree of cross contamination in their spectra. The intensity range is, in general, large for those ions which have a high intensity in the neighboring component, a parallel of the unreliability of some of the entries in our spectra. Their approach does, however, offer a number of advantages. It does not employ peak-shape modeling and shares the advantage of not requiring unique ions. Their method is expected to be less demanding of CPU time than the present calculation since it does not use iterative parameter optimizationmethods. We have indicated that principal component analysis is probably the method of choice for determining the number of coeluting components. The extention of this approach, in which factor analysis is used to generate the component spectra, is useful only if the “diagnostic” ion solution bands are reasonably narrow. In this respect we are not, at present, able to say whether our optimization method is more powerful than factor analysis. Only after testing the two procedures in parallel and using a series of real data arrays will it be possible to make a reliable assessment of their relative performance.

1056

Anal. Chem. 1985, 57, 1056-1060

ACKNOWLEDGMENT We thank R. G. Dromey et al. who provided us with a copy of their “mass fragmentogram analysis” program and I. C. Moon who developed a PDP11/34 implementation of their program and supplied us with the spectra she obtained during a series of trial calculations. We also thank P. Bolard for his comments on the manuscript and B. R. Pettit for useful discussion and for providing much of the GC/MS data used during the developmental stages of this work.

(2) Sharaf, M. A.; Kowalskl, B. R. Anal. Chem. 1981, 53,518-522. (3) Knorr, F. J.; Thorsheim, H. R.; Harris, J. M. Anal. Chem. 1981, 53, 821-825. (4) Cheder, S. N.; Cram, S. P. Anal. Chem. 1973, 45, 1354-1359. (5) Strang. G. “Linear Algebra and Its Applications”, 2nd ed.; Academic Press: New York, 1980. (6) Fletcher, R. “Practical Methods of Optimization”; Wiley: Chichester, 1980; Vol. 1. (7) NAG Library Manual (Mark 9), Numerical Algorithms Group: Oxford, 1983. (8) Markey, S. P., Urban, W. G., Levine, S. P., Eds. ”Mass Spectra of Compounds of Biological Interest”; U S . Atomic Energy Commission Report, No. TID-26553.

LITERATURE CITED (1) Dromey, R. G.; Stefik, M. J.; Rindfleisch, T. C.; Duffleld, A. M. Anal. Chem. 1978, 48, 1368-1375.

RECEIVED for review October 9, 1984. Accepted January 9, 1985.

Probability-Based-Matching Algorithm with Forward Searching Capabilities for Matching Unknown Mass Spectra of Mixtures Douglas B. Stauffer and Fred W. McLafferty*

Chemistry Department, Cornell University, Ithaca, New York 14853 Robert D. Ellis and David W. Peterson

Scientific Instrument Division, Hewlett-Packard Co., 1501 California Avenue, Palo Alto, California 94304

Adjustment of the match rating based on the predicted component concentration (“contamination”) significantly improves the reliability of the probability based matching (PBM) aigorlthm for unknown spectra of pure compounds. Subtracting each best-matching spectrum from the unknown and rematching all resulting residual spectra against the bestmatching reference spectra extend this performance improvement to 60% and 30% components of mixtures. For 60% components two-thirds of the class I V (closely related compounds) wrong answers were eliminated, and for pure compounds such wrong answers were nearly halved. A small performance improvement also resulted from correcting predicted reiiabillties for the uniqueness and abundance of peaks discarded to improve matching.

by reverse searching matches fairly well (reliability 15% less than that of the correct answer) the reference spectrum of 3-octene. Forward searching, however, would show clearly that this is not correct, as the mass spectrum of 3-octene would not contain the base peak at m / z 149 of the unknown dioctyl phthalate. Probability based matching is probably the most widely used algorithm for retrieval from a comprehensive mass spectral data base (1,4,8-10). This report extends preliminary correspondence (8) in describing the addition of forward searching capabilities to PBM applicable to unknown spectra of both pure compounds and mixtures utilizing automated subtraction of the best matching reference spectra from the unknown ( 1 0 , l l ) . An improvement to the “flagging”method (4)for removing spurious peaks also makes unnecessary the use of AK,the difference between the “confidence” ( K ) value found and that for a perfect match.

Computer algorithms for matching an unknown spectrum against a reference spectral file are often classified into “forward-search” and “reverse-search” categories (1-4). The latter is less restrictive, demanding only that the peaks of the reference spectrum are contained in the unknown, not vice versa. For unknown spectra of mixtures, reverse searching has the well-demonstrated (1-4) advantage that nonoverlapping peaks from a second mixture component do not reduce the matching performance of the first. However, for unknown spectra of pure compounds forward searching is more effective as it can demand that peaks absent from the reference cannot be present in the unknown. The advantages of such “no-band” data in matching infrared spectra have been amply demonstrated (5). For unknown mass spectra, particularly those obtained by gas chromatography/mass spectrometry (GC/ MS), it is difficult to be sure that the unknown represents a pure compound; even with capillary GC there is a nonzero probability of peak overlap (6, 7). The problem has been illustrated (8)with a mass spectrum of dioctyl phthalate, which

EXPERIMENTAL SECTION Programs were written and tested on an IBM 4341 multiuser system and the H/P-1000 computer of the Hewlett-Packard 5985 GC/MS system. The data base used contained 76673 different spectra of 64 376 different compounds (12) from the Wiley-NBS collection (13),only excluding spectra containing isotopic labels. The performance of algorithm revisions was evaluated by using recall/reliability plots (14) of data from matching the list of 392 randomly selected unknowns utilized previously (8,9,12) for which other spectra of the same compound are present in the data base. Of such “duplicate” spectra, 3213 (21%) were removed because each was a close PBM match (12) for another spectrum of the same compound. Of the correct answers removed, half matched the unknown spectrum with a reliability (RL) value >92%; to represent more closely the performance of an average unknown, the data were corrected for these missing spectra (15). Groups of three of these spectra, ordered by increasing molecular weight to simulate GC effluents, were combined in 60:3010 (total ions) proportions to give the file of 130 unknown mixture spectra. Reference spectra used as unknowns were omitted from the PBM

0003-2700/85/0357-1056$01.50/00 1985 American Chemical Society