A Stepwise Regression Program for Quantitative Interpretation of Mass Spectra

D. D. TUNNICLIFF and P. A. WADSWORTH
Shell Development Co., Emeryville, Calif.

A computer program that gives both qualitative and quantitative results for the sample without requiring prior detailed knowledge as to the composition has been developed for mass spectrometric analyses. The program is based on a mathematical method of choosing from among a large library of spectra a suitable small group of spectra such that the sum of the individual spectra after multiplying by the proper concentration factors gives the best least-squares fit to the sample spectrum. The library may contain as many as 150 reference spectra with data for 110 masses for each spectrum.

Simultaneous determination of several components by mass spectrometry is usually based on selecting the most significant peak(s) for each compound expected in the sample and then solving the set of simultaneous equations that defines the relations between peak intensities and concentrations. This method is simple and gives good results when the qualitative composition of the sample is known. However, if the sample contains one or more unexpected components, the results for some components may be considerably in error. Such errors may be detected by a difference between the observed and predicted intensity at a peak not used in the simultaneous equations or by a difference between the sum of the computed partial pressures and the observed sample pressure. A large negative value for the concentration of one or more components is also evidence of an error. If there is any evidence of such errors, the calculations can be repeated using a different set of components. Efficient use of the above approach requires considerable previous knowledge about the qualitative composition of the sample. Samples containing unexpected components, at the worst, may give undetected erroneous results or, at the best, may require additional calculation with a corresponding delay in obtaining the results. In addition, the use of this method of calculation requires considerable skill and judgment in choosing the best set of masses for the analysis of a particular sample, in

ANALYTICAL CHEMISTRY

evaluating the results of the calculation, and in deciding what procedure to take if there is evidence of an error. A new computer program that avoids most of these difficulties has been developed. This program makes use of a large library of reference spectra of individual pure compounds. Up to 150 spectra may be included in the library. If the spectra of all components in the sample are present in the spectral library, the program automatically determines both the qualitative and quantitative composition of the sample without requiring any advance information as to the sample composition. Since the calculations are based on a least-squares fit to the observed intensities for as many as 110 masses, no preselection of the most significant mass for each component is required. In fact, the inclusion of the additional data both improves the discrimination between compounds with similar spectra and aids in detecting errors in the data. This new program is based on a modification of a stepwise regression program written by Efroymson (1). The regression procedure is a mathematical method of choosing from the library of reference spectra a small group of spectra which, when each is multiplied by a suitable concentration factor, will give the best fit to the sample spectrum as determined by a least-squares criterion. During the regression the calculations proceed in a stepwise fashion. At each step, an additional spectrum is selected to give the greatest improvement in the variance of the fit of the predicted spectrum to the sample spectrum. The concentration factors for all spectra previously chosen are recomputed at each step. A spectrum chosen previously may be rejected at a later step if the change in variance caused by removing this spectrum is less than a preset value.
This stepwise procedure continues until one of three conditions is met: the ratio of the concentration to the standard error for the next spectrum to be chosen is less than a preset value; the ratio of the computed to expected standard error between the observed and computed spectra is less than a preset value; or the change over two consecutive steps in the ratio of the computed to expected standard error is less than a preset value.
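The stepwise procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Fortran program: the function names, the NumPy-based least-squares solver, and the threshold names and default values (`t_enter`, `t_remove`, `fit_ratio`) are all assumptions made for illustration. Rejection of a previously chosen spectrum is approximated here by a test on the ratio of its concentration to its standard error rather than on the change in variance, and only the first two stopping conditions are implemented.

```python
import numpy as np


def least_squares(A, y):
    """Return coefficients, their standard errors, and the residual variance."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = max(len(y) - A.shape[1], 1)            # degrees of freedom
    var = float(resid @ resid) / dof
    cov = var * np.linalg.pinv(A.T @ A)          # pinv tolerates near-duplicate spectra
    se = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return coef, se, var


def stepwise_fit(library, sample, expected_sd,
                 t_enter=3.0, t_remove=2.0, fit_ratio=1.5):
    """library: (n_masses, n_spectra) array; sample: (n_masses,) array.

    Returns {spectrum index: concentration factor}. All thresholds are
    illustrative assumptions, not values taken from the paper.
    """
    chosen = []
    n_spectra = library.shape[1]
    for _ in range(2 * n_spectra):               # hard cap so the loop always ends
        candidates = [j for j in range(n_spectra) if j not in chosen]
        if not candidates:
            break
        # Try each remaining spectrum; keep the one giving the lowest variance.
        _, best = min((least_squares(library[:, chosen + [j]], sample)[2], j)
                      for j in candidates)
        # All concentration factors are recomputed with the new spectrum included.
        coef, se, var = least_squares(library[:, chosen + [best]], sample)
        # Stop if the new concentration is small relative to its standard error.
        if abs(coef[-1]) < t_enter * se[-1]:
            break
        chosen.append(best)
        # Reject any chosen spectrum that no longer contributes significantly.
        chosen = [j for j, c, s in zip(chosen, coef, se)
                  if abs(c) >= t_remove * s]
        # Stop once the fit is about as good as the expected measurement error.
        if np.sqrt(var) / expected_sd < fit_ratio:
            break
    if not chosen:
        return {}
    coef, _, _ = least_squares(library[:, chosen], sample)
    return {j: float(c) for j, c in zip(chosen, coef)}
```

Calling `stepwise_fit(library, sample, expected_sd)` with one reference spectrum per column returns only the spectra that entered the regression, so spectra that never entered are simply absent from the result.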

General Description of the Program. The library of reference spectra is loaded ahead of time on magnetic tape using a separate program. The data are ordered in such a manner as to facilitate exclusion from the calculations of those compounds that could not be present because of the lack of any significant intensity at the higher mass range. Also included on the magnetic tape are a list of the masses to be used and instructions for reporting the results. Occasionally, the spectra of two or more compounds of interest will be so similar that it is preferable to report only the sum of the concentrations. Two alternative procedures are provided for such compounds. If the spectra are very much alike, the best approach is to prepare a composite spectrum based on the usual relative concentrations of each component. This composite spectrum is then included in the library of reference spectra. However, if the spectra are sufficiently different to permit even an approximate calculation of the individual components, it is preferable to include the spectrum of each individual compound in the library of reference spectra and to add the computed concentrations before reporting. When applicable, the second method will give more accurate results if the relative concentrations of the compounds being grouped together vary widely. The computer input consists of the program deck, a small set of instruction cards, and the spectral data for each sample. Usually these data are first recorded on punched-paper tape using a MASCOT and later converted to punched cards using an IBM 47 tape-to-card punch. Alternatively, the punched cards containing the data are prepared using a Contact Telereader. The calculations proceed sample-by-sample, the results being reported for one sample before calculations are started on the next sample.
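The two procedures for handling compounds with similar spectra can be sketched in a few lines of Python. This is only an illustration under assumed data structures: the function names `composite_spectrum` and `grouped_report`, and the use of dictionaries keyed by compound name, are hypothetical and do not come from the original program.

```python
def composite_spectrum(spectra, usual_fractions):
    """First procedure: blend member spectra in their usual relative amounts.

    spectra: {name: list of intensities, one per mass}
    usual_fractions: {name: usual relative concentration}
    """
    total = sum(usual_fractions.values())
    n_masses = len(next(iter(spectra.values())))
    return [
        sum(usual_fractions[name] / total * spectra[name][i]
            for name in usual_fractions)
        for i in range(n_masses)
    ]


def grouped_report(concentrations, groups):
    """Second procedure: fit each compound separately, then add the
    computed concentrations of grouped compounds before reporting.

    groups: {compound name: reporting label}; ungrouped compounds
    are reported under their own names.
    """
    reported = {}
    for name, value in concentrations.items():
        label = groups.get(name, name)
        reported[label] = reported.get(label, 0.0) + value
    return reported
```

For example, `grouped_report({"isobutene": 12.1, "trans-2-butene": 4.9, "propane": 5.0}, {"isobutene": "Butenes", "trans-2-butene": "Butenes"})` reports a single "Butenes" entry alongside propane.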
The first phase consists of reading and editing the sample spectral data, picking out the intensity data corresponding to the selected list of masses, applying the appropriate galvanometer sensitivity factors, and correcting for background. The maximum mass with a significant intensity is found, and the list of compounds to be considered for the regression is reduced accordingly. A weighting factor is applied to both the calibration data and the observed intensity for each mass. Except for low intensities this weighting factor is proportional to the reciprocal of the observed intensity. This assumes that the probable error in each intensity measurement is a constant percentage of the intensity.

The next phase consists of forming the regression matrix and proceeding through the regression. After finishing the regression, any components with negative concentrations are removed since they do not represent a real solution. The regression is entered again and finally any new compounds with negative concentrations are removed.

The last phase of the program consists of preparing the output. The first part of the output is a table of the observed and predicted (calculated from the computed concentrations) intensities for each mass. Another section is a table of the computed concentrations and standard errors for each component before normalizing and editing but after taking the sums for those compounds which are to be reported together. Compounds with concentrations less than a preset value are eliminated, the results are normalized to total 100%, and a final report is prepared in duplicate. The reported results are all rounded off according to the following equation

$$\mathrm{NSF} = A_0 + \log_{10}\left(\frac{\mathrm{CON}}{\mathrm{ERROR}}\right) - \frac{C}{9.0} - A_1 (R - 1)^{A_2}$$

where
NSF = number of significant figures in the reported concentration (after truncation)
CON = computed concentration
ERROR = computed standard error
C = computed concentration with decimal point adjusted so that there is only one digit to the left of the decimal point
R = ratio of computed standard error in the fit between the observed and predicted spectra to the expected error assuming 1% error in all intensities
A_0, A_1, A_2 = adjustable constants (current values are 1.6, 0.2, and 1.5, respectively).
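As a worked illustration of the rounding rule, the sketch below evaluates the number of significant figures for a given concentration and standard error. It assumes the rule has the form NSF = A_0 + log10(CON/ERROR) - C/9.0 - A_1(R - 1)^A_2, which is an inference from the definitions of the terms rather than a verified transcription. The function name and the treatment of R below 1 (no penalty applied) are also assumptions.

```python
import math


def significant_figures(con, error, r, a0=1.6, a1=0.2, a2=1.5):
    """Number of significant figures to report (assumed form of the rule).

    con: computed concentration (must be positive)
    error: computed standard error (must be positive)
    r: ratio of computed to expected standard error of the fit
    """
    if con <= 0 or error <= 0:
        raise ValueError("con and error must be positive")
    # C: concentration scaled to one digit left of the decimal point.
    c = con / 10 ** math.floor(math.log10(con))
    # Assumption: a fit better than expected (r < 1) gives no penalty.
    excess = max(r - 1.0, 0.0)
    nsf = a0 + math.log10(con / error) - c / 9.0 - a1 * excess ** a2
    return max(int(nsf), 0)   # truncate, never below zero
```

Under these assumptions a concentration of 40.1 with a standard error of 0.4 and R = 1 gives 1.6 + 2.00 - 0.45 = 3.16, which truncates to three significant figures; a poor fit with R = 3 subtracts a further 0.57 and leaves two.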

The first three terms take into account the effect on the number of significant figures of the ratio of the computed concentration to the standard error and the size of the first non-zero digit. The last term reduces the number of significant figures if a poor fit to the sample data is obtained. This may be due either to an error in the input data or to the presence of a compound in the sample whose spectrum was not included in the library of reference spectra. The values for A_0 and A_1 listed above give an accuracy of approximately 1-5 units in the last reported figure as based on the computed standard error.

Figure 1. Relation between number of spectra and computer time required

Options and Special Features. Considerable flexibility in the calculations is available through the use of several options. Some of these are described below.

Since the amount of computer time required is dependent upon the number of compounds to be considered, there is an economic benefit in making use of any prior knowledge as to the qualitative composition of the sample. This can be done by defining subgroups of spectra which are used for the calculations. However, if the ratio of the standard error to the expected error in the fit exceeds a preset value, then the calculations are automatically repeated using a second group option (usually the entire set of spectra). Since these subgroups are defined by merely listing the serial numbers of the desired spectra, they are readily redefined as required.

Any number of magnetic tapes with different sets of reference spectra may be prepared, and any two of these may be mounted on the computer at the same time. Either tape may be specified for the calculations for a given sample. One tape may contain spectral data for a large set of commonly encountered compounds, while the other may contain data for a set of compounds of only temporary interest.

Any selected list of compounds can be omitted from the final report and the results for the balance normalized to 100%. This permits reporting results

on a water-free or air-free basis, if desired.

Computer Costs. As mentioned previously, the computing time is dependent upon the number of reference spectra considered in the regression. Figure 1 shows the computer time required on an IBM 7040 as a function of the number of spectra which may enter the regression. An IBM 7094 has been found to be about 4.8 times faster for these calculations.

Discussion. This program, with minor modifications, has been used successfully for about 2 years for the daily calculation of routine mass spectrometric analyses. Also, it has been tested using data for several hypothetical samples. The data for these samples were computed using the calibration data and the desired concentration of each component, and then random errors were added to each computed intensity such that the standard deviation between the true and modified intensities is 1% of the value. Tables I to III show the results obtained for the analysis of hypothetical samples A, B, and C. Each set of computed values represents results calculated for a different distribution of random errors in the sample data. All of these results were obtained using a library of 129 reference spectra with no subgroup designation. The errors in the qualitative composition observed in Table III for some of the calculations are a result of marked similarity between combinations of one set of spectra and combinations of other spectra. Similarity between just two spectra is easily detected by examining the coefficients in the correlation matrix. These coefficients may range from 0.0 to 1.0 depending on the degree of similarity of the two spectra. If a coefficient is very nearly 1.0, then one of the two spectra should be eliminated. If the set of reference spectra should contain two identical spectra, then the first one of the two considered will be the one chosen. However, if two spectra differ only slightly, then the one chosen will be the spectrum which gives the
VOL. 37, NO. 9, AUGUST 1965

smallest variance even though the difference may be very small. In this case, very small differences in the experimental data can influence the choice. The possibility that the composite spectrum of one group of spectra is very similar to the composite spectrum of another group of different spectra is very difficult to detect. The effect of this condition on the calculations was investigated through the use of a separate program which predicted that

Table I. Results for Hypothetical Sample A
Columns: Compound; Actual, %; Found, %

n-Butane
Isobutane
Nitrogen
Carbon monoxide
Acetone
Acetaldehyde
Ethylene oxide

Compound

40.0 15.0 15.0 20.0 10.0 ...

5.0 15.0 5.0 13.0 22.0

+


...

39.7 15.5 15.0 19.6 10.1 0.2

40.5 14.6 14.6 20.2 10.1

40.2 14.9 14.9 20.0 10.0 ...

40.2 15.1 15.0 19.8 9.9 ...

5.0 15.3 5.0 12.6 21.9

4.9 15.2 4.9 13.3 21.9

...

5.0 15.1 5.2 12.9 21.7

5.0 15.1 5.1 12.9 22.4

4.9 15.2 5.1 12.7 22.2

computed results for sample D were excellent. The largest error consisted of finding 0.8% 1-pentene which was not actually present. The results for sample E were not so good. Most of the errors could be attributed to finding several of the compounds listed for sample D and these errors in the qualitative composition led to errors in the quantitative composition. Some of these errors were as large as 5%. The small differences between the four different sets of data for sample E had a considerable effect on both the qualitative and quantitative results. However, considering the fact that these particular mixtures were the worst possible combinations and that manual calculations would be virtually impossible, the results were surprisingly good. Although the program endeavors to find the set of spectra which gives the best fit to the observed spectrum, there can be several nearly equivalent solutions in such cases. Since no purely mathematical solution can find the correct answer to such a problem, the true results can be obtained only by supplying additional information to the computer. This may be done by using the subgroup option to restrict the list

Table IV.

20.0 8.0 12.0

19.8 8.5 11.8

20.0 7.7 11.8

20.0 7.9 12.0

20.0 8.2 12.0

20.1 7.7 12.2

Table III. Results for Hypothetical Sample C
Columns: Compound; Actual, %; Found, %

Ethene
Ethane
Propene
Propane
Butenes
  1-Butene, 2.0%
  trans-2-Butene, 3.0%
n-Butane
Isobutane
Pentenes
3-Methyl-1-butene
n-Pentane
Isopentane
n-Hexane
Cyclohexane
Methylcyclopentane
Methylcyclohexane
Propadiene
methylacetylene
1-Pentene
ethylcyclopropane
1-Hexene
Hexenes (branched)
Furan
Methyl ethyl ketone
Carbon tetrafluoride
Acrolein
Carbon monoxide

+

40.1 14.8 15.1 20.1 9.8

Table II. Results for Hypothetical Sample B
Columns: Compound; Actual, %; Found, %

n-Pentane
Isopentane
n-Butane
Isobutane
Butenes
  Isobutene, 12.0%
  trans-2-Butene, 5.0%
  cis-2-Butene, 5.0%
Propane
Nitrogen
Carbon monoxide

Compound

the composite spectrum of sample D and sample E as given in Table IV would be very similar. The computed intensities of these two mixtures agreed to about 5% at all the major peaks with a few slightly larger differences in peaks of lower intensity. Four different distributions of random errors were added to the computed spectrum of each sample and then these data were used as input to the regression program which used the entire library of 129 spectra for the calculations. The

2.9 1.9 5.2

10.7 5.0 1.8 2.0 5.4

9.2 5.1 2.4 1.9 5.5

10.0 4.9 2.3 2.1 4.7

10.1 5.1 2.5 2.0 4.5

15.0 10.0 5.0

15.0 9.8 4.9

15.0 8.9 4.6

15.4 9.2 5.0

14.5 11.2 3.3

15.0 10.4 5.1

5.0 20.0 5.0 3.0 2.0 10.0

4.9 20.3 5.0 2.9 2.1 10.0

4.7 20.5 5.0 2.3

5.5 19.7 5.1 3.1 1.6 10.0

...

20.8 5.0 1.8

4.6 20.6 4.9 3.1 2.0 10.0

10.0 5.0 3.0 2.0 5.0

10.0 5.1

...

10.0

...

10.0

...

0.2 1.3 5.2 0.5 0.1 1.8

...

...

... ...

3.1 0.9 0.1

...

... ...


0.2 1.1 0.5

Table IV. Composition of Hypothetical Samples D and E

Sample D
  n-Pentane               27.05%
  3-Methyl-1-butene        0.96
  Isobutane               10.75
  Isobutene                4.19
  Ethane                   8.20
  Nitrogen                 0.09
  Isopropyl alcohol        2.06
  Isobutyl alcohol         0.63
  Diethyl ether            4.50
  Acetone                  6.75
  Diethyl ketone          10.48
  Acetaldehyde             2.82
  Propylene oxide          3.27
  Methyl formate           2.41
  2-Methylpentane          0.62
  3-Methylpentane         14.55
  1-Hexene                 0.09
  Cyclopentane             0.50
  Methylamine              0.11

Sample E
  Isopentane              20.37%
  n-Butane                 9.70
  1-Butene                 2.54
  Propane                  5.29
  Ethene                   4.38
  Ethyl alcohol            0.03
  Allyl alcohol            1.60
  n-Propyl alcohol         1.84
  tert-Butyl alcohol       0.76
  n-Butyl alcohol          4.39
  sec-Butyl alcohol        2.15
  Methyl ethyl ketone      8.05
  Propionaldehyde          5.77
  Methyl acetate           6.51
  n-Hexane                19.40
  1-Pentene                2.10
  3-Methyl-1-pentene       0.07
  Acetic acid              1.73
  Nitric oxide             1.68
  2,2-Dimethylpropane      1.64

of spectra considered and thus eliminate impossible components. Another approach is to use a provision which forces any selected group of spectra (preferably the major components) to enter the regression at the beginning of the calculations and then allows the regression to find the remaining components. Table V shows the stepwise operation of the regression procedure for the first sample from Table I. The variance starts at a high value and the computed results are considerably in error. However, as additional spectra are chosen for the regression, the variance becomes smaller and the computed concentrations approach the true value. Another possible approach for the calculation of mass spectrometric data would be a direct least-squares solution of the set of equations formed by including all possible compounds. However, in practice this method is not as satisfactory as the stepwise regression method. In the least-squares method, the inclusion of compounds which are not present will decrease the value of the determinant and thus increase the probable error in the computed results for compounds actually present. However, with the regression procedure the probable error is based on the value of the determinant for only the com-

Table V.

Steps in Regression

%

Step 1

Step 2

Found, % Step 3

Step 4

Step 5

40.0 15.0 15.0 20.0 10.0

64.5

62.8

53.5

40.2 14.1

23.6

22.0 13.0 17.6

23.0 12.2 11.2

40.1 14.8 15.1 20.1 9.8 1.0

Actual,

Compound n-Butane Isobutane Nitrogen Carbon monoxide Acetone Variance, %

33.9

pounds found and not for the compounds considered. Another disadvantage of the least-squares method is that the number of compounds considered cannot exceed the number of masses for which intensity data are available. In the regression procedure, it is only the number of compounds found which cannot exceed the number of masses. The only limitation on the number of compounds considered is the amount of storage space available in the computer and the amount of computer time available. The provision in this program for 150 reference spectra on each magnetic tape is believed to be quite adequate. The program described in this report is written mostly in Fortran IV for operation under the IBSYS System. A few subroutines written in machine language are also required. The authors

24.6

can supply on request a more extensive report which describes the operation of the program with complete listings of the source decks.

ACKNOWLEDGMENT

The authors acknowledge the valuable advice of J. H. Schachtschneider in some of the mathematical aspects of this project.

LITERATURE CITED

(1) Efroymson, M. A., “Multiple Regression Analysis,” in “Mathematical Methods for Digital Computers,” A. Ralston and H. S. Wilf, eds., Wiley, New York, 1960.

RECEIVED for review March 3, 1965. Accepted June 14, 1965. Annual Meeting of the ASTM Committee E-14 on Mass Spectrometry, St. Louis, Mo., May 16-21, 1965.

Reactor Neutron Activation Analysis by the Single Comparator Method

FRANCESCO GIRARDI, GIAMPAOLO GUZZI, and JULES PAULY
Servizio Chimica Nucleare, Centro Comune di Ricerche Euratom, Ispra, Varese, Italy

A method of activation analysis, based on the irradiation and counting of a single comparator (cobalt) instead of standards prepared from known weights of the elements to be determined, has been critically evaluated. The influence of the variation of experimental parameters such as reactor neutron spectrum, neutron flux, and yield of γ-ray spectrometers, has been studied. In a series of trial runs the accuracy and precision were found similar to those of the relative method. The method can find most useful applications in automated analysis, or when a large number of elements are determined in one sample.

Neutron activation analysis by the relative method, until now used nearly exclusively, requires preparation and counting of a comparator for each element to be determined. These operations are time-consuming and can introduce sources of error, particularly when automated systems of activation analysis are used. Recently the possibility of eliminating the comparators by using a direct or absolute method was considered (10). The results obtained using γ-ray spectrometry to measure induced radioactivity show that this procedure cannot yet compete in accuracy with the relative method. This has been attributed to the uncertainties which still exist in the knowledge of nuclear constants required for the calculations, especially cross sections and γ-ray abundances of radionuclides. Although the accuracy was poor, the precision of results was generally similar to that of the relative method. This suggests that results of the same quality as those given by the relative method are possible, if the factor depending on nuclear constants is not derived by calculation, but determined experimentally by irradiating known weights of the element under study with a neutron flux monitor. Then the unknown sample would be irradiated with a similar flux monitor, which would be used as a single comparator for different elements. The two major difficulties of the absolute method (the necessity of measuring “total” photopeak surfaces and of evaluating the effective activation rates of the elements as a function of the neutron energy spectrum available) would also be avoided, as the experimental parameters of activation and counting are not important, if they are kept constant. The accuracy of the method is strictly dependent on the validity of this assumption. The single comparator method is often applied in activation analysis, when the activating flux (neutrons or gamma or