
Anal. Chem. 1989, 61, 715-719


Rule-Building Expert System for Classification of Mass Spectra

Peter de B. Harrington, Thomas E. Street, and Kent J. Voorhees*
Department of Chemistry and Geochemistry, Colorado School of Mines, Golden, Colorado 80401

Filippo Radicati di Brozolo and Robert W. Odom
Charles Evans & Associates, 301 Chesapeake Drive, Redwood City, California 94063

A rule-building expert system was devised for the classification of mass spectra. The expert system optimizes the numeric to symbolic conversion and develops its own certainty factors. This system performed better than linear discriminant analysis for classification of pyrolysis-mass spectra and laser ionization mass spectra.

INTRODUCTION

The characterization of complex materials by mass spectrometry is a rapidly growing field. Pyrolysis-mass spectrometry (Py-MS) (1) and laser ionization mass spectrometry (LIMS) (2) are just two of the many mass spectrometer configurations that have been successfully used for characterizing nonvolatile materials. In both methods, the samples are classified by the mass spectra of their degradation products, which are often too complex to interpret visually. Computational methods such as pattern recognition may easily be applied to mass spectra because they are usually acquired in a digitized format. These data analysis methods have been used for classifying or for transforming the mass spectral data into a representation that is amenable to visual interpretation (3).

Traditional pattern recognition methods usually require several assumptions concerning the data set (4). They may assume that the distribution of the data is known. Some methods also assume the data are linearly separable while others require a linear model. For simple data sets the violation of these assumptions may not vitiate the results, but for complex data sets these assumptions become more important.

An alternative to pattern recognition is the rule-building expert system, which is not restricted by the assumptions stated earlier. Expert systems are computer programs that use knowledge regarding a specific problem domain (expertise) to infer their conclusions (5). These systems usually store their expertise in a database called the knowledge base. Extracting knowledge from an expert in a consistent and logical form is known as the knowledge acquisition bottleneck because this step is time-consuming, costly, and difficult. Rule-building expert systems develop their knowledge bases from training sets of data, much as pattern recognition methods do.

Because expert systems rely on symbolic relationships, they differ from traditional methods of pattern recognition, which rely on numeric relations of the data. Most expert systems that use numeric data contain an interface to convert numeric data to symbolic data. Commercial rule-building expert systems are available. However, they have two disadvantages: they provide a classification result with little additional information, and the user is often unable to modify the expert system because the software listings are proprietary (6). This paper describes a rule-building expert system that accommodates mass spectral data by optimizing the conversion from numeric data to symbolic data (7, 8). This system is compared to linear discriminant analysis for the classification of bacteria by Py-MS and the identification of polymers by LIMS.

© 1989 American Chemical Society

THEORY

The knowledge base of an expert system is usually constructed from antecedent-consequent rules. An example of this format is given in Figure 1a. Any complex rule may also be written as a series of simpler rules. Figure 1b gives the same antecedent-consequent rule but in the format of a tree composed of simple binary rules. Because rules are allowed to share antecedents in tree format and trees can have several conclusions, structuring rules in tree format is more efficient than maintaining a list of complex rules. A classification tree efficiently represents a set of complex rules. If the classification tree in Figure 1b were converted to the complex rule format of Figure 1a, it would require four complex rules, one for each conclusion. The nodes or branches of the tree contain rules that direct the path through the tree. The conclusions or class designations are stored in the leaves of the tree, hence the name classification tree. Efficient trees contain a minimum number of rules and are minimal spanning. A minimal spanning tree has its symmetry maximized. Information theory may be used for developing efficient classification trees. The most successful method for developing classification trees composed of simple binary rules is the ID3 algorithm (9-11).

The first step in any classification process is to convert the data into a suitable format for classification. Py-MS and LIMS spectra typically consist of intensity values obtained over a mass range of about 300 amu. The number of spectra in a data set may vary between ten and several hundred. For supervised methods of classification, which include expert systems, the number of spectra should exceed 3 times the number of masses to obtain an accurate classification function (12). An accurate classification function classifies by meaningful features in the spectra instead of by random features (noise) and will converge to its true value as the number of observations is increased in the training set.

The mass spectra can be transformed into data that have at least 3 times as many observations as variables by an eigenvector transformation, which maximizes the retained variance (information) while reducing the number of variables. A preselected number of eigenvectors (principal components), which is less than one-third the number of spectra in the training set, is calculated. The basis set of eigenvectors is saved for compression of other analyte spectra. The training set of spectra is projected onto the basis set of eigenvectors to obtain abstract mass spectra composed of principal component scores. This calculation is defined by

$$A = ME \tag{1}$$

where A is the abstract mass spectrum composed of r components, M is the mass spectrum composed of v masses, and E is a v × r dimensional matrix of eigenvectors. A data compression is obtained whenever r is less than v.
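The projection of eq 1 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the RESOLVE implementation: the function names `fit_basis` and `project`, the synthetic data, and the choice of autoscaling before the decomposition (to mimic the "correlation about the mean" preprocessing mentioned later in the text) are all introduced here for the example.

```python
import numpy as np

def fit_basis(spectra, r):
    """Return (mean, std, E), where E is a v x r matrix of eigenvectors.

    Assumption of this sketch: spectra are autoscaled (mean 0, unit
    variance per mass) so that E holds eigenvectors of the correlation
    matrix of the training set.
    """
    mean = spectra.mean(axis=0)
    std = spectra.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # guard against constant masses (hypothetical safeguard)
    z = (spectra - mean) / std
    # The right singular vectors of the scaled data are the eigenvectors
    # of its correlation matrix, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    return mean, std, vt[:r].T

def project(spectra, mean, std, E):
    """Compute A = M E for spectra M scaled with the saved training parameters."""
    return ((spectra - mean) / std) @ E

rng = np.random.default_rng(0)
train = rng.random((30, 200))    # 30 synthetic "spectra" with 200 "masses"
r = 9                            # fewer than one-third of the 30 spectra
mean, std, E = fit_basis(train, r)
A = project(train, mean, std, E)
print(A.shape)                   # (30, 9)
```

Saving `mean`, `std`, and `E` corresponds to saving the basis set for compression of other spectra: a new spectrum is passed through `project` with the stored values rather than being used to recompute the basis.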


ANALYTICAL CHEMISTRY, VOL. 61, NO. 7, APRIL 1, 1989

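Figure 1. (a) Antecedent-consequent rule. Antecedent: if x contains copper ions and if x is colored green. Consequent: then the copper ions are Cu²⁺. (b) The same rule as a tree of simple binary rules with TRUE and FALSE branches; conclusions include "The copper ions are Cu²⁺" and "Do more tests for copper".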
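ID3-style rule building, cited above as the method for constructing trees of simple binary rules, picks each rule so that the entropy of classification is as small as possible. The sketch below shows one way this could be computed for a single variable; it is an assumption-laden illustration, not the authors' code. The function names, the toy scores, and the use of midpoints between adjacent scores as candidate rule attributes are hypothetical, and ties are broken here by the lower threshold rather than by the distance criterion described in the text.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of the class labels that share one attribute."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def rule_entropy(scores, labels, threshold):
    """Entropy of classification for the binary rule 'score > threshold'.

    Each side of the rule is one attribute; its entropy is weighted by
    the fraction of observations that fall on that side.
    """
    sides = {True: [], False: []}
    for s, y in zip(scores, labels):
        sides[s > threshold].append(y)
    n = len(labels)
    return sum(len(side) / n * entropy(side) for side in sides.values() if side)

def best_threshold(scores, labels):
    """Try midpoints between adjacent sorted scores; return (entropy, threshold)."""
    xs = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min((rule_entropy(scores, labels, t), t) for t in candidates)

# Toy principal component scores for two classes (made-up values).
scores = [-3.1, -2.4, -2.0, 0.5, 1.2, 2.8]
labels = ["P", "P", "P", "X", "X", "X"]
print(best_threshold(scores, labels))
```

A full rule builder would repeat this search over every variable, keep the rule with the lowest entropy, and recurse on each side until the entropy reaches zero.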

$$H(C \mid a_j) = -\sum_{i=1}^{N} p(c_i \mid a_j) \log_2 p(c_i \mid a_j) \tag{2}$$

$$H(C \mid A) = \sum_{j=1}^{M} p(a_j)\, H(C \mid a_j) \tag{3}$$

Equation 2 gives the entropy for an attribute. N is the number of different classes, and $p(c_i \mid a_j)$ is a probability obtained by counting the number of observations of class i and dividing that number by the total number of observations of the jth attribute. Usually entropy is calculated in units of bits using a base 2 logarithm, but the base of the logarithm only affects the units of the results. The same results are obtained if natural logarithms (ln) are used, except the units will be in nits instead of bits. Equation 3 is the sum of the entropies for each attribute weighted by the prior probability that the attribute occurs. M is the number of attributes and $p(a_j)$ is the number of observations with a given attribute divided by the total number of observations. Other rule-building criteria have been investigated, but the entropy of classification has always performed remarkably better (14). For attributes with the same entropy of classification, the attribute is selected that has the greatest distance between its closest scores. Using this criterion minimizes the critical region of the rule, that is, the overlapping of points across the attribute. A zero entropy will be obtained when all the samples in the training set are correctly classified.

The decision tree may be evaluated by examining the number of rules that are generated. Even random data will be 100% correctly classified by a decision tree; however, the tree will have many levels. For a binary classification tree, the total number of decision rules required for a worst case situation is simply the number of observations minus one. For a best case situation, the number of rules will be the number of classes minus one. A simple measure of the efficiency of the decision tree is

$$\text{eff. \%} = \left[ 1.0 - \frac{\text{no. of rules} - \text{no. of classes} + 1}{\text{no. of observations} - \text{no. of classes}} \right] \times 100 \tag{4}$$

Cross validation is another method for evaluating an expert system. It is useful because the goal of supervised classification techniques is to classify unknown samples based on a model developed from a training set of data.

Certainty factors are measures of the reliability of the results output from an expert system. Certainty factors are analogous to the centour scores used in discriminant analysis (15). Unfortunately, certainty factors are often arbitrary and are as difficult for a human expert to devise as the rules used in the knowledge base. It is therefore important for expert systems to generate their own certainty factors as well as their own rules. The certainty factors are generated for each attribute by

$$\mu_{a_j}(x) = e^{-|a_j - x|/b} \tag{5}$$

where $\mu_{a_j}(x)$ is a certainty factor, $a_j$ is the attribute for rule j, e is the exponential, x is the rule variable, and b is the average of all the variables in the training set possessing that attribute. The certainty factors vary between 0 and 1. If more than one rule is required for a classification, the minimum certainty factor is used.

EXPERIMENTAL SECTION

All calculations were conducted on an IBM PS/2 Model 60 computer operating under MS-DOS 3.3. All computations were performed by the RESOLVE software package, a mass spectral data analysis system (16). All graphs were obtained from screen dumps of the RESOLVE software to a Hewlett-Packard LaserJet 500+ printer.

All bacteria were grown under identical conditions in a beet molasses medium with shaking at 200 rpm for approximately 7 days. The bacteria were killed and suspended in a methanol-potassium chloride solution before shipment. The pyrolysis mass spectra were obtained on an Extrel quadrupole mass spectrometer equipped with a Curie-point pyrolysis inlet and a Fisher radio frequency power supply (1.1 kW, 750 kHz). The bacterial samples were prepared by applying 5-μL portions directly from the culture tubes to rotating 510 °C Curie-point wires and allowing the wires to dry at room temperature. The samples were pyrolyzed for 10 s. Data acquisition was begun 0.3 s after starting the pyrolysis. Fifty spectra were collected and averaged over a 50-240 amu mass range.

The LIMS analyses were performed on the LIMA-2A laser microprobe instrument manufactured by Cambridge Mass Spectrometry, Ltd. (Cambridge, England). This instrument has

been described in detail elsewhere (17) and comprises a focused, high-irradiance, quadrupled Nd:YAG laser system coupled to a time-of-flight mass spectrometer. The irradiance of the pulsed laser output can be varied from ~10⁸ to 10¹² W/cm², and the laser beam typically irradiates a 2-5 μm diameter area on the sample surface. A complete mass spectrum is obtained for each irradiated volume by employing a transient waveform digitizer in the detection circuitry. The Sony-Tektronix 390 AD transient digitizer employed in these analyses was operated at a sampling frequency of 60 MHz and a record length of 4096 channels, which permitted acquisition of mass spectra over a mass range from 1 to ~300 amu.

Most of the polymer samples were prepared by embedding beads of commercially available polymers in Spurr's epoxy resin (medium hardness) and microtoming 1 μm thick sections. These thin section samples were mounted onto high-purity Si substrates. The poly(ethylene terephthalate) (Mylar 500D) sample was provided in thin film form (~0.5 mm thick) and was analyzed without any sample preparation. Multiple LIMS analyses were performed at three laser irradiances in both positive and negative ion detection modes of analysis. The irradiances corresponded to the threshold for ion detection and typically 10 and 100 times this threshold value. The data presented in this paper were produced from the high irradiance (~10¹¹ W/cm²), positive ion analysis of the various polymers. The analytical craters produced at this irradiance were approximately 5 μm in diameter, corresponding to a sampling volume of ~2.5 μm³.

Figure 2. (a) Pseudomonas aeruginosa pyrolysis-mass spectrum. (b) Xanthomonas campestris pyrolysis-mass spectrum. [Horizontal axis: m/z.]

Figure 3. Plot of Pseudomonas (P) and Xanthomonas (X) scores on the first two principal components.

DISCUSSION OF RESULTS

The Xanthomonas bacteria are plant pathogens that can cause serious damage to a number of plants including citrus trees. The Pseudomonas bacteria are not plant pathogens but are ubiquitous bacteria that are very difficult to differentiate from Xanthomonas bacteria by traditional methods

of bacterial taxonomy. Unfortunately, biological methods of analysis may require up to 4 weeks to obtain results. The bacteria used in this study are as follows: Pseudomonas aeruginosa, Pseudomonas fluorescens, Pseudomonas alcaligenes, Pseudomonas floridana, Pseudomonas pseudoalcaligenes, Xanthomonas campestris pv. citri, Xanthomonas campestris pv. citri (Asia), Xanthomonas campestris pv. citri (Argentina), Xanthomonas campestris pv. citri (Brazil), Xanthomonas campestris pv. citri (unknown), Xanthomonas campestris pv. phaseoli.

Parts a and b of Figure 2 are pyrolysis-mass spectra of Pseudomonas and Xanthomonas bacteria, respectively. All of the Pseudomonas and Xanthomonas spectra appeared to be similar, and no distinguishing trends of intensities were observed by visual inspection. These spectra were normalized to unit vector length and then autoscaled. The data were compressed by eigenvector projection using correlation about the mean preprocessing. The data set was projected onto ten eigenvectors, which accounted for 80% of the cumulative variance. All further analyses treated the component scores as variables. Figure 3 is a plot of the scores on the first two principal components, which account for 57% of the cumulative variance. There is no separation between Xanthomonas and Pseudomonas bacteria in this plot, which shows that the majority of the variance is caused by variations within the groups of spectra.

Figure 4. Histogram of the Pseudomonas (P) and Xanthomonas (X) scores on the discriminant function.

Linear discriminant analysis was compared to the expert system to differentiate Xanthomonas from Pseudomonas bacteria by Py-MS. Discriminant analysis produced one discriminant function. A histogram of the spectra projected

onto this function is given in Figure 4. The separation on this discriminant function is poor. There is considerable overlap between the two classes of bacteria. Furthermore, this is the best possible case because all the data were used for calculating the discriminant function. The analysis was evaluated by a cross validation procedure that removes one spectrum from the training set, calculates the discriminant function, and uses the function to predict the class of the spectrum that was removed. The number of correctly predicted spectra is used to evaluate the procedure. Discriminant analysis correctly predicted 77.6% of the spectra.

Figure 5 is a decision tree obtained from the expert system. The rectangles contain the rules that were derived from the training set of data. The circles contain the classes. The first rule classifies all spectra with scores on the fifth component less than -1.93 (-1.93 is the rule attribute) as Pseudomonas, class P. The second rule classifies all scores greater than -1.93 on the fifth component and greater than 2.56 on the eighth component as Pseudomonas, class P. The scores that are less than 2.56 on the eighth eigenvector and greater than -1.93 on the fifth eigenvector are classified as Xanthomonas, class X. Figure 6 shows the scores projected onto the fifth and eighth principal components. This two-dimensional space is the space of the calculated components where the entropy of classification is minimized.

Figure 5. Classification tree for Pseudomonas (P) and Xanthomonas (X) pyrolysis-mass spectra. The rectangles are rules, the circles are class designations. V is the principal component number and A is the rule attribute.

Figure 6. Plot of Pseudomonas (P) and Xanthomonas (X) scores on principal components 5 and 8.

Figure 7. Plot of the scores of the LIMS polymer spectra on the first two principal components. See text for legend.

The same cross validation procedure used for evaluating the discriminant analysis was also used for the expert system. The expert system correctly classified 98.3% of the bacteria. The spectrum that was misclassified had a relatively low certainty factor of 12.3%. The average certainty factor for cross validation of the entire data set was 47%, with a standard deviation of 23%. Furthermore, the classification tree was 98% efficient because only two rules were derived.
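The two rules described for Figure 5 and the efficiency measure of eq 4 are small enough to write out directly. The following is a minimal sketch, not the authors' software: the function names are hypothetical, the handling of scores that fall exactly on a rule attribute is an assumption (the text does not specify it), and the efficiency formula is written in the form that yields 100% for the best case (classes minus one rules) and 0% for the worst case (observations minus one rules) described in the Theory section.

```python
def classify_bacterium(pc5, pc8):
    """Follow the two-rule tree reported for Figure 5.

    pc5 and pc8 are scores on the fifth and eighth principal components.
    Returns 'P' (Pseudomonas) or 'X' (Xanthomonas). Scores exactly equal
    to a rule attribute fall through to the next test (an assumption).
    """
    if pc5 < -1.93:      # rule 1, fifth component
        return "P"
    if pc8 > 2.56:       # rule 2, eighth component
        return "P"
    return "X"

def tree_efficiency(n_rules, n_classes, n_observations):
    """Tree efficiency in percent, following the best and worst cases of eq 4."""
    excess_rules = n_rules - n_classes + 1
    return (1.0 - excess_rules / (n_observations - n_classes)) * 100.0

print(classify_bacterium(-2.5, 0.0))        # P
print(classify_bacterium(0.0, 3.0))         # P
print(classify_bacterium(0.0, 0.0))         # X

# Polymer study reported later in the text: 7 rules, 7 classes, 105 spectra.
print(round(tree_efficiency(7, 7, 105)))    # 99
```

The polymer figures give 1 - 1/98, or about 99%, which agrees with the efficiency quoted for that tree. The score pairs passed to `classify_bacterium` are made-up inputs chosen to exercise each branch.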
For this study the expert system was much more effective than discriminant analysis. To further investigate the expert system, another study was conducted with a larger data set

which consisted of seven categories of LIMS polymer spectra. Replicate LIMS spectra from 15 different sample locations were obtained for each polymeric thin film. The seven polymers used in this study were as follows: 1, Nylon 6; 2, Nylon 12; 3, poly(1,4-butylene terephthalate); 4, polycarbonate; 5, polystyrene; 6, Spurr's epoxy; 7, poly(ethylene terephthalate). The 105 spectra in the data set were normalized to unit vector length and then autoscaled. The spectra were compressed by eigenvector projection using correlation about the mean preprocessing. Thirty components were calculated, which accounted for 90% of the cumulative variance. Figure 7 is a plot of the scores for the seven polymers on the first two principal components. The scores are nicely grouped, showing that the major variations are caused by differences among the groups, as opposed to the differences within groups that were observed for the bacteria spectra. Figure 8 shows that the separation of the polymers was improved when they were projected on the first two discriminant functions. Discriminant analysis using the cross validation procedure correctly identified 85% of the polymer spectra.

Figure 8. Plot of the scores of the LIMS polymer spectra on the first two discriminant functions. See text for legend.

Figure 9 is the classification tree derived by the expert system. Seven rules were obtained, which used the first five component scores. The classification tree was 99% efficient. The cross validation procedure correctly identified 95% of the polymers.

The authors are not aware of any publications that used standardized data sets for evaluating software. However, standardized and accessible data are imperative for the evaluation of chemometric software. Two available data sets were chosen for test standards. Fisher's Iris data is a classic and simple data set (4, 18) while the chemical analysis of crude

oil by Gerrild and Lantz provides a practical chemical example (4, 19). No normalization was applied to either data set. The variables were autoscaled in both data sets, and the rule-building expert system and linear discriminant analysis were applied directly to the data. Compression by eigenvector projection was not needed because the number of observations was sufficiently larger than the number of variables. The same cross validation procedure was used to compare methods. For the Iris data, discriminant analysis correctly classified 81.3% of the samples and the expert system classified 84.0%. For the crude oil data, discriminant analysis classified 82.1% of the samples and the expert system classified 83.9%. The performance was approximately the same. The expert system tends to perform better than discriminant analysis when the data are not normally distributed or the classes are not linearly separable. Furthermore, expert systems are resistant to outliers caused by mislabeling of samples or instrument malfunctions. Outlying spectra will cause additional rules to be incorporated in the classification tree.

Figure 9. Classification tree for the LIMS polymer spectra. The rectangles are rules, the circles are class designations. V is the principal component number and A is the rule attribute.

CONCLUSION

Expert systems that build their own rules will have a portentous effect on analytical chemistry. The hindrance to the widespread acceptance of expert systems by chemists is the difficult, time-consuming, and costly step of extracting knowledge from an expert to build a knowledge base. The rule-building expert system complements linear discriminant analysis as a method of taxonomy. Cases have arisen where linear discriminant analysis performs better than the expert system. Using the cross validation method, the expert system correctly classified 98.3% of the bacteria and 95% of the polymers. Linear discriminant analysis correctly classified 77.6% of the bacteria and 85% of the polymers.

This system may have potential use for classification of other condensed phase mass spectral data obtained from secondary ion mass spectrometry, fast atom bombardment mass spectrometry, plasma desorption mass spectrometry, and field desorption mass spectrometry. The rule-building expert system offers the specific advantage of insensitivity to outliers. Outlying spectra, which may be caused by mislabeling of the samples, glitches in the instrumentation, or contamination in sample preparation, cause additional rules to be developed in the classification tree. Future expert systems will be developed that synergistically incorporate linear discriminant analysis. Future research goals are the investigation of other knowledge representation formats and the development of adaptive knowledge bases. Adaptive knowledge bases have the ability to acquire new knowledge efficiently without having to be re-created. They also can self-organize to efficient structures.

ACKNOWLEDGMENT

Steven Muskal is acknowledged for his assistance in the software development and David Updegraff for his valuable comments.

LITERATURE CITED

(1) Meuzelaar, H. L. C.; Haverkamp, J.; Hileman, F. D. Pyrolysis Mass Spectrometry of Recent and Fossil Biomaterials; Elsevier Scientific Publishing Company: Amsterdam, 1982.
(2) Lindner, B.; Seydel, U. J. Gen. Microbiol. 1983, 129, 51.
(3) Voorhees, K. J.; Tsao, R. Anal. Chem. 1985, 57, 1830.
(4) Johnson, R. A.; Wichern, D. W. Applied Multivariate Statistical Analysis; Prentice-Hall: Englewood Cliffs, NJ, 1982; Chapter 5.
(5) Wolfgram, D. D.; Dear, T. J.; Galbraith, C. S. Expert Systems for the Technical Professional; John Wiley & Sons: New York, 1987.
(6) Derde, M. P.; Buydens, L.; Guns, C.; Massart, D. L.; Hopke, P. K. Anal. Chem. 1987, 59, 1888.
(7) Harrington, P. B.; Voorhees, K. J. Presented at the 1988 Pittsburgh Conference, New Orleans, LA, March 1988; 948.
(8) Voorhees, K. J.; Harrington, P. B.; Street, T. E.; Hoffman, S.; Durfee, S. L.; Bonelli, J. E.; Firnhaber, C. S. Presented at the 2nd Hidden Peak Conference, Snowbird, UT, June 1988.
(9) Quinlan, J. R. Machine Learning: An Artificial Intelligence Approach; Michalski, R. S., Carbonell, J. G., Mitchell, T. M., Eds.; Tioga Publishing Co.: Palo Alto, CA, 1983; p 483.
(10) Thompson, B.; Thompson, W. Byte 1986, 11, 149.
(11) Schlieper, W. A.; Marshall, J. C.; Isenhour, T. L. J. Chem. Inf. Comput. Sci. 1988, 28, 159.
(12) Foley, D. H. IEEE Trans. Inform. Theory 1972, 18, 618-626.
(13) Eckschlager, K.; Stepanek, V. Information Theory as Applied to Chemical Analysis; John Wiley & Sons: New York, 1979.
(14) Harrington, P. B.; Voorhees, K. J., Colorado School of Mines, Golden, CO, unpublished results.
(15) Lindeman, R. H.; Merenda, P. F.; Gold, R. Z. Introduction to Bivariate and Multivariate Analysis; Scott Foresman and Company: Glenview, IL, 1980; p 203.
(16) Harrington, P. B.; Voorhees, K. J., unpublished results.
(17) Dingle, T.; Griffiths, B. W.; Ruckman, J. C.; Evans, C. A., Jr. Microbeam Analysis-1982; Heinrich, K. F. J., Ed.; San Francisco Press: San Francisco, CA, 1982; p 385.
(18) Fisher, R. A. Ann. Eugenics 1936, 7, 179-188.
(19) Gerrild, P. M.; Lantz, R. J. Open-File Rep.-U.S. Geol. Surv. 1969.

RECEIVED for review September 19, 1988. Accepted December 27, 1988. Charles Evans and Associates acknowledge the support of NSF SBIR Grant No. IS14760431 for the LIMS analysis. The Colorado School of Mines research was supported by Somatogen, Inc.