
Anal. Chem. 1982, 54, 911-917

Hierarchical Nonhierarchical Clustering Strategy and Application to Classification of Iron Meteorites According to Their Trace Element Patterns

Désiré L. Massart* and Leonard Kaufman
Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium

Kim H. Esbensen
Technical University, Lyngby, Denmark

The usefulness of a new advanced clustering strategy (MASLOC) as a general tool in interpretative analytical chemistry is demonstrated. The central algorithm is of a nonhierarchical type, but a hierarchical output is provided. The algorithm selects what are called robust (“significant”) clusters and detects outliers. Comparison with the hierarchical algorithms that are most often used in analytical chemistry shows that better results are obtained with MASLOC. The algorithm is used in a multistep interactive classification strategy. We evaluate this strategy by carrying out a clustering of the complex iron meteorite data base (major and trace element concentrations), which has been intensely studied and which has a universally agreed upon, genetic classification against which we compare our results. We reproduce all essential features of this standard and even find evidence of possible structure among formerly “ungrouped” iron meteorites. It is concluded that the proposed strategy offers real advantages for the treatment of large bases of analytical data.

The interpretation of analytical data forms an integral part of the task of the analytical chemist. More than one variable is almost always measured on the same samples. In recent years more attention has been paid to the multivariate aspect of analytical chemistry. Some techniques such as factor analysis and pattern recognition have become well established and many good examples can be found in the literature. There are too many publications to cite them all, but four books (1-4) and two articles (5, 6) give a good picture of the numerous possibilities. Other techniques, such as unsupervised pattern recognition or clustering, have been studied to a lesser degree. There are applications in the literature, but usually they treat very simple examples using equally simple algorithms of the hierarchical type and applying them to nicely structured data sets. Other analytical chemists trying to apply these algorithms to their own problems are often very disappointed by the results. There are two main points which are usually not taken into consideration in trying to introduce clustering methods in analytical chemistry: (1) The analytical chemist is rarely required to carry out a classification by clustering without any a priori knowledge of the classification to be expected. In fact, usually a classification does already exist and clustering is carried out to prove the correctness of the classification in a formal way or to suggest corrections or alterations to this a priori classification. It is necessary to compare the existing classification with the clustering results and to incorporate this knowledge in the study.

Simple clustering algorithms do not allow this: one needs a clustering strategy. To illustrate the possibility of using clustering in conjunction with existing knowledge, we carried out the present study on the 1978 version of the data base of iron meteorite chemistry (see below). We wanted to study the validity of the present clustering technique in relation to ongoing research within the meteoritical realm. In the ensuing 3 years much has been learned about iron meteorite classification. This provides a rather unique possibility to study the interaction of the present clustering strategy and resulting classifications with independent research. In a sense, we simulate the progress of iron meteorite classification since 1978 by comparing the results of our clustering strategy with subsequent theoretical improvements in meteoritics. In this way, we obtain a completely independent evaluation of the merits of the clustering technique. (2) The other point is that most clustering algorithms do not give an answer to the questions asked by the chemist. The clustering algorithms usually employed permit one to obtain any number of clusters desired but they do not distinguish which are best. In fact, there are some methods in the literature which are designed to yield the “correct” number of classes. These methods are usually based on plots of some statistical or semistatistical measure, such as the average within-cluster distance, as a function of the number of clusters. Breaks in this curve are interpreted as indicating a “correct” number of clusters. As noted by Everitt (7), experience shows that these methods do not yield very good results and, in his recent review, Everitt calls the detection of the correct number a “formidable” problem, which apparently has not yet been resolved. A solution to this problem was proposed by us recently (8, 9) and this method is applied here. It is implemented with a program called MASLOC (see PROGRAM).
The method is designed primarily to select those clusterings that optimally reflect the real, but unknown clusters of data, i.e., the objective data structure. Such clusters could be termed “significant”. To avoid misunderstanding of this term as meaning “statistically significant”, these clusters are called here “robust”. Three additional problems merit some discussion, namely, outliers, scaling, and transformation of data. The first is discussed in the section on the MASLOC technique and the second under Phase III. Concerning the third problem one observes that it is not always easy to decide whether untransformed or transformed data (usually log-transformed, sometimes I’-transformed data, etc.) should be used. In the present case there is a meteoritical consensus that a log transformation should be applied to the data. We wanted to simulate the position of the analytical chemist who is not sure about the kind of data transform to be used. In such instances, it is in our opinion best at least to start the investigation with untransformed data (except for scaling), in

© 1982 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 54, NO. 6, MAY 1982

order to avoid a possible initial bias in the conclusions. For this reason, all the calculations were carried out on data untransformed except for scaling (see below).

CLASSIFICATION OF IRON METEORITES

We have chosen the field of iron meteorite classification because of its complex nature and because the current theoretical understanding of the origin and evolution of the iron meteorites has led to a genetic classification of these objects, the general lines of which are almost universally accepted by meteoriticists and which will serve as a reference for the various statistical evaluations. The understanding of the origin of such complex materials as meteorites requires a genetic classification system, i.e., a system that groups together objects formed from the same genetic event(s). Observation of the formation of iron meteorites is a matter of deduction exclusively. Appraisal of the validity of a specific classification system relative to these genetic processes can thus never be unambiguous but will be determined, at least in part, by the current theoretical consensus pertaining to the genesis of the iron meteorites. In the absence of a truly unambiguous genetic classification, resorting to an optimal descriptive system is almost universally the case in scientific research. The classification of iron meteorites along these lines has been the subject of many studies. Buchwald (11) gave a thorough description of essentially every known iron meteorite, focusing on metallurgical and physical aspects, complemented by the review by Scott and Wasson (12) on the chemical classification of these objects. Together these two references outline a complete survey of taxonomic parameters, of which the present study uses a selected set, as defined below.
The almost universally accepted classification system in use is the chemical grouping of Wasson and co-workers, which has been presented in a series of nine systematic papers, so far culminating with ref 10, in which the pertinent references can be found. A four-dimensional array of the iron alloy elements Ni, Ga, Ge, and Ir is used; three two-dimensional combinations are generally held to be sufficient to classify any meteorite following the delineations in Figure 1. Although the classification of the iron meteorites is a field in which a multivariate approach seems preferable, formal multivariate methods have been applied only recently and partly. Moore et al. (13) applied a hierarchical clustering procedure to the classification of about half of the iron meteorites for which sufficient data are present, using seven variables. The results are subject to criticism mostly because no information is given about the specific algorithm used. Esbensen and Wold (14) solved a specific classification problem of some of the nonmagmatic iron meteorite groups IAB and IIICD using the SIMCA methodology of Wold (15).

MASLOC: A NEW CLUSTERING STRATEGY

The central clustering algorithm is based on an operations research model and belongs to the centrotype group of algorithms (8). Its technical characteristics are given under PROGRAM. This means that p samples are selected in such a way that they function as centrotypes for p clusters. Each sample is assigned to the centrotype to which it is nearest. The centrotypes are selected in such a way that the sum of the distances of all the samples to the nearest centrotype is minimal. The clustering algorithm used here is technically superior to other such methods because no initial assumptions have to be made as is usually the case with this kind of algorithm. A dimension is added to this procedure in the second step in which the sequence of clusters formed when going from p = 2 to p = N is considered. This means that first an optimal

Figure 1. Bivariate plot of the meteoritical taxonomic groups as identified in the reference classification, here using three elements, Ge, Ga, and Ni (horizontal axes: nickel, %). Ungrouped meteorites are not shown (adapted from ref 12).

clustering with p = 2 clusters is obtained and then a p = 3 clustering, etc., until a p = N clustering, where N is the number of samples (of course, the latter clustering is trivial because each cluster consists of one sample). From this sequence one obtains:

Outliers: these are present as clusters with only one member at low p.

Robust clusters: their elements do not intermingle with elements from other clusters at higher p. The lower the p stage at which they are formed for the first time, the more significant they are.

Tight clusters: their elements keep together in one cluster until a high p value is reached.

This second step was added by us to the centrotype sorting algorithm of the first step. It may, in fact, also be used to complete the nonhierarchical methods of the more common centroid sorting type such as FORGY (16), RELOCATE (17), or MacQueen's k-means method (18). The clustering algorithm is described in part and in general terms in ref 8. Formalization of the study of significance of the clusters can be found in ref 19. Very diverse classifications have been carried out with success (organic molecules according to their interaction with GLC phases (20), fatty acids according to their metabolic pathway (9), photographic material in a process control situation (19)). In this article, we complete this sequence with the following step: starting from a p value, where all clusters are robust

Figure 2. Hierarchical plot of the nonhierarchical clustering of 14 objects.
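The centrotype criterion stated in the MASLOC section (select p samples so that the summed distance of every sample to its nearest centrotype is minimal) can be illustrated with a minimal sketch. This is not the MASLOC program, whose technical characteristics are given under PROGRAM. The exhaustive search, the Euclidean distance, and the names `centrotype_clustering`, `points6`, and `points7` are assumptions made purely for illustration, and the search is only practical for very small data sets.

```python
from itertools import combinations
import math


def centrotype_clustering(points, p):
    """Pick p samples as centrotypes by exhaustive search (illustrative only).

    Returns (centrotype indices, label per sample, total distance), where each
    label is the index of the centrotype nearest to that sample.
    """
    n = len(points)
    # Pairwise Euclidean distances (an assumed choice of distance measure).
    dist = [[math.dist(a, b) for b in points] for a in points]
    best_cost, best_centers = math.inf, None
    for centers in combinations(range(n), p):
        # Sum of distances from every sample to its nearest centrotype.
        cost = sum(min(dist[i][c] for c in centers) for i in range(n))
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    labels = [min(best_centers, key=lambda c: dist[i][c]) for i in range(n)]
    return best_centers, labels, best_cost


# Hypothetical one-variable data: two compact groups of three samples each.
points6 = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
centers, labels, cost = centrotype_clustering(points6, 2)
print(centers, labels, cost)

# Adding one remote sample: it is split off as a one-member cluster at p = 2.
points7 = points6 + [(100.0,)]
centers7, labels7, cost7 = centrotype_clustering(points7, 2)
print(centers7, labels7, cost7)
```

In the second call the remote sample becomes its own centrotype already at p = 2, which is the behavior the text describes for outliers ("clusters with only one member at low p").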

Table I. Results of Steps 1 and 2 of MASLOC on the Data Set of Ref 9 (reprinted with permission)

p      clusters
2      1,3,6,7,10,11 / 2,4,5,8,9,12,13,14
3      1,3,6,7,11 / 10,5,9,13 / 2,4,8,12,14
4-10   [entries not recovered]
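The second step (reading outliers and robust clusters from the sequence of clusterings) can be sketched as follows. The sketch takes the definition given in the text literally: a cluster is treated as robust if, at every higher p in the sequence, no cluster contains both its elements and elements from outside it. The function names, the dictionary layout, and the p = 4 partition are hypothetical; only the p = 2 and p = 3 partitions are taken from Table I as read here. The formal treatment of significance is in ref 19 and may differ.

```python
def is_robust(cluster, later_partitions):
    """True if no later cluster mixes members of `cluster` with outsiders."""
    c = frozenset(cluster)
    for partition in later_partitions:
        for other in partition:
            o = frozenset(other)
            if o & c and not o <= c:
                return False
    return True


def robust_clusters(partitions):
    """Map each robust cluster to the lowest p at which it first appears.

    `partitions` maps p to a list of clusters (sets of object numbers).
    Clusters at the highest available p have no later level to be checked
    against, so they are reported as robust trivially.
    """
    levels = sorted(partitions)
    found = {}
    for i, p in enumerate(levels):
        later = [partitions[q] for q in levels[i + 1:]]
        for cluster in partitions[p]:
            c = frozenset(cluster)
            if c not in found and is_robust(c, later):
                found[c] = p
    return found


def early_singletons(partitions, max_p):
    """Objects that form a one-member cluster at some p <= max_p."""
    seen = {}
    for p in sorted(partitions):
        if p > max_p:
            break
        for cluster in partitions[p]:
            if len(cluster) == 1:
                (obj,) = cluster
                seen.setdefault(obj, p)
    return seen


partitions = {
    2: [{1, 3, 6, 7, 10, 11}, {2, 4, 5, 8, 9, 12, 13, 14}],   # Table I, p = 2
    3: [{1, 3, 6, 7, 11}, {5, 9, 10, 13}, {2, 4, 8, 12, 14}],  # Table I, p = 3
    4: [{1, 3, 6, 7, 11}, {5, 9, 10, 13}, {2, 4, 8, 12}, {14}],  # hypothetical
}

for cluster, p in sorted(robust_clusters(partitions).items(), key=lambda kv: kv[1]):
    print(p, sorted(cluster))
print(early_singletons(partitions, max_p=4))
```

With these inputs neither p = 2 cluster is robust, because object 10 later joins objects 5, 9, and 13, whereas the three p = 3 clusters are.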

2.4%), while 3.6% of these meteorites were considered outliers. This means that over these six coarse “supergroups”, a 94% correct classification is obtained. Considering the unfavorable circumstances, such as the use of untransformed data and a few incorrect assignments in the 1978 reference classification (see for example Winburg in phase III), this is considered as a good agreement with the reference classification at this stage.

Phase II. Results. In phase I 100 centrotypes (50 even, 50 odd) were selected. After elimination of centrotypes which represent only themselves (outliers), 56 centrotypes, each representing at least two meteorites, remain. In phase II these 56 centrotypes are clustered.


Figure 3 is the clustering obtained by MASLOC-step 3. Figure 3 therefore represents the relationships between 56 meteorites, each representing a phase I cluster of meteorites. At p = 15, the complete data set is divided into robust clusters. One notes that the following “wrong” classifications are made:

One cluster (35, 42, 44, 45, 48) contains 4 IIIA centrotypes and 1 IIIB. This is not surprising since centrotype 48, meteorite Kinsella, is representative in the preliminary phase for a group of 7 IIIA meteorites and 8 IIIB meteorites. In the even data set it represents the border zone between IIIA and IIIB. It should be noted that according to meteoritical theory IIIA and IIIB form a single genetic group IIIAB.

One cluster consists of the ungrouped centrotype 1, the 2 IIIB centrotypes 47 and 49, and the 2 IIIC centrotypes 50 and 51. Furthermore, 1 is representative for a small cluster containing the ungrouped meteorite Victoria West and one IIIC meteorite, 50 represents 1 IIIC and 3 ungrouped, 51 represents 4 IIICs and 1 ungrouped, 47 represents 8 IIIBs and 1 IIIC, and 49 represents 4 IIIBs. Clearly this cluster represents the overlap between IIIB and IIIC and some ungrouped meteorites which chemically resemble IIIC.

One cluster consists of centrotypes 4, 5, 6, 7, 8, 10, 11 (all IA) and 39 (IIIA). The latter is meteorite Lucky Hill, which is representative for itself and a group of 5 even ungrouped meteorites. The anomalous position of Lucky Hill has been reinvestigated since then and data errors have been detected and corrected.

One cluster consists of all the IIA centrotypes (18-22) and two IIBs (23 and 24). Not surprisingly, 24 represents 7 IIBs and 5 IIAs and 23 represents 5 IIBs and 12 IIAs. These two centrotypes are responsible for the entire overlap between IIA and IIB observed in the preliminary phase. One should also remember that according to present theory IIA and IIB form a single cogenetic group.
Figure 3 completes the information and, for instance, shows the relationship between the IC, IIA, and IIB apparent from Figure 1. It also gathers in a correct way the IA cluster and the IB outlier, so that the only important remaining discrepancy is that IIIA (+IIIB) is apparently cut in two. We conclude that the results of MASLOC phase II are very consistent with the findings of phase I and, on the whole, give a very acceptable classification.

Discussion of Phase II. MASLOC yields much better results than hierarchical classification. Indeed the same set of 56 centrotypes was clustered with 6 hierarchical methods, using programs taken from Anderberg (24). The results of MASLOC and the 6 hierarchical methods were evaluated against the reference classification in the following way. A “bad point” was given for each wrong link in a cluster and for each missing link. For example, the cluster 18, 19, 20, 21, 22, 23, 24 costs 10 “bad points” because according to the reference classification 23 and 24 are part of one group (IIB) and 18 to 22 of another (IIA), so that there are 10 wrong links, namely, 18-23, 18-24, 19-23, 19-24, 20-23, etc. The fact that the IICs are divided over two clusters (28, 29 and 27) adds another two “bad points” because the links 28-27 and 29-27 are missing. The clusterings were compared at the p = 16 level, because at this level the MASLOC clustering is completely significant. The result is given in Table VI. The classification of the hierarchical methods is remarkably consistent with the literature valuation of these methods. Single linkage is usually considered as the worst method, while Ward’s method is usually given as best with the average linkage methods as second (see, for instance, ref 7). It may therefore be thought that the valuation procedure used here is correct, which shows that MASLOC is the best of the seven methods employed.
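The “bad point” count can be written as a short pair-counting routine. The sketch below is one reading of the verbal rule above: a pair of objects placed in the same cluster but in different reference groups is a wrong link, and a pair in the same reference group but in different clusters is a missing link. The function name `bad_points` and the data layout are illustrative assumptions, and every object is assumed to carry a reference group label.

```python
from itertools import combinations


def bad_points(clusters, reference):
    """Count (wrong links, missing links) of a clustering against a reference.

    clusters  -- list of sets of object identifiers
    reference -- dict mapping each object identifier to its reference group
    """
    cluster_of = {obj: k for k, members in enumerate(clusters) for obj in members}
    wrong = missing = 0
    for a, b in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[a] == cluster_of[b]
        same_group = reference[a] == reference[b]
        if same_cluster and not same_group:
            wrong += 1      # linked by the clustering, separate in the reference
        elif same_group and not same_cluster:
            missing += 1    # linked in the reference, separated by the clustering
    return wrong, missing


# The two cases worked out in the text: centrotypes 18-22 are IIA, 23 and 24
# are IIB, and the three IICs (27, 28, 29) are split over two clusters.
reference = {18: "IIA", 19: "IIA", 20: "IIA", 21: "IIA", 22: "IIA",
             23: "IIB", 24: "IIB", 27: "IIC", 28: "IIC", 29: "IIC"}
clusters = [{18, 19, 20, 21, 22, 23, 24}, {28, 29}, {27}]

wrong, missing = bad_points(clusters, reference)
print(wrong, missing, wrong + missing)
```

For these inputs the routine reproduces the 10 wrong links and 2 missing links quoted in the text.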

Table VI. Comparison of MASLOC and Hierarchical Clustering Methods of the Phase II Centrotypes

                                  bad points due to
method                        wrong links   missing links   total
MASLOC                             27             95          122
Ward’s method                      44            116          160
complete linkage                   73            122          195
av linkage within groups           99            115          214
av linkage between groups         155             72          227
centroid linkage                  157             83          240
single linkage                    187             83          270

Phase III. As a result of phases I and II, and a comparison with the reference classification, one concludes that there are four major groups, namely, IA + IB (+IIIC + IIID), IIA + IIB + IC, IIE + IIIA + IIIB + IIIE, and IIIF + IVA, and two minor well separated groups (IIC and IID). Portions of the data set classified entirely according to expectation are removed from further inspection, so that IIC and IID are not considered further. Unclear classification was obtained for two small reference groups (IIIC and IIID). These four major groups, called hereafter A, B, C, and D, were subjected to a new clustering step. There are two reasons for phase III:

An economical reason: in contrast with phase I, which was intended as a rough, preliminary step, this step is intended to achieve final classification. Therefore a detailed clustering, leading to robust clusters, is necessary. This may require the determination of all p clusterings, from p = 2 to N, N being the number of objects in the data set. On a data set of 493 objects the cost would be prohibitive.

A scaling reason: in phases I and II, the scaling was carried out on the whole data set and on otherwise untransformed data. Meaningful differences at the low end of the scale between some groups were therefore not taken into account, leading, for example, to the combined IIIF + IVA “group” (see Figures 1 and 2). Rescaling in each group is now carried out. In this way one selects a portion of the original multivariate space and looks at it as if with a magnifying glass. The effect is that small separations between clusters become more meaningful and that outliers originally incorporated in a cluster are recognized more clearly. In each subspace one considers the groups as they were obtained in phases I and II together with the original grouping in the reference classification, thereby comparing “new” knowledge from phases I and II with the reference classification.
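The “magnifying glass” effect of rescaling within a group can be shown numerically. The scaling formula actually used is not given in this part of the text, so the sketch below assumes autoscaling (subtract the column mean, divide by the column standard deviation); the function name `autoscale` and the numbers are hypothetical.

```python
from statistics import mean, pstdev


def autoscale(rows):
    """Scale each variable (column) to zero mean and unit standard deviation.

    Autoscaling is an assumed choice; a constant column is left unscaled
    (divisor 1.0) to avoid division by zero.
    """
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]


# Hypothetical single-variable data: two low-concentration samples and two
# high-concentration samples measured on the same variable.
data = [[0.1], [0.2], [100.0], [200.0]]

scaled_all = autoscale(data)        # scaling on the whole data set
scaled_low = autoscale(data[:2])    # rescaling within the low group only

gap_global = abs(scaled_all[1][0] - scaled_all[0][0])
gap_local = abs(scaled_low[1][0] - scaled_low[0][0])
print(gap_global, gap_local)
```

The separation between the two low samples is negligible after scaling on the whole data set and becomes of order one after rescaling within their own group, which is why small separations become more meaningful in phase III.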
Since the object of the present article is to discuss the methodology and the results as far as they are of interest to analytical chemists in general and not to meteorite specialists, we will discuss only two major groups, namely, B and C.

Group B. This group consists of 70 meteorites, namely, all IIA, IIB, and IC meteorites as specified by the reference classification and 2 ungrouped meteorites, 2-Elton and 63-Winburg, classified as IIB in phase I. At a very low p level (p = 7) all ICs are removed from the rest of the data set together with 63-Winburg. At p = 10, 179-Murnpeowie is split off from the other ICs, and at p = 17, 63-Winburg (the ungrouped meteorite) is also made an outlier. One concludes that 179-Murnpeowie is more anomalous than 63-Winburg. If 179-Murnpeowie, which was catalogued as IC-AN, retains that status, then 63-Winburg should be classified as IC-AN or IC. It is very gratifying to note that in the 1981 version of the data set this has been done and Winburg is classified there as IC-AN. Esbensen and Wold (14) found it to be a regular


IC member! The IIA + IIBs form a set of clusters which become robust only at levels exceeding p = 30, and at that level 5 IIAs are still found to be clustered together with IIBs. One concludes that IC exists as a separate entity in which 63-Winburg should be included, IIA and IIB are not significantly separated and should rather be one group IIAB, and 2-Elton belongs to the ungrouped. These are exactly the conclusions at which the 1981 edition of the data base arrives.

Group C. Group C consists of all the members of the groups IIIA, IIIB, IIE, and IIIE of the reference classification. All the group IIIF members were also added because two of them were classified with a IIIA centrotype in phase I; the three members of IA, IC, and IIIC classified by us with IIIA or IIIB were also added (see Table IV). Finally, group C also contained the 12 ungrouped meteorites (see Table II) which were classified by us as IIIA or IIIB. Two other ungrouped meteorites were added by error. The following results were obtained.

Thirteen of the fourteen ungrouped meteorites are separated at p = 40. This means that 13 of the 14 ungrouped meteorites are found to be outliers and therefore returned to their ungrouped status. Only one ungrouped meteorite is found to be a member of a robust cluster in phase III, namely, Piedade do Bagre. The significant cluster is formed between p = 31 and 40 and consists of eight members (centrotype: Chambord), all IIIA plus Piedade do Bagre. Between p = 41 and 50, Piedade do Bagre is removed from the other seven members. One could therefore consider adding Piedade do Bagre to the IIIA group as IIIA-AN.

Group IIIE. It is separated completely and significantly at p = 12 and remains as one entity until p = 50. It is clearly a homogeneous group.

The IA, IC, and IIIC misclassifications. These three meteorites are isolated before p = 30 as outliers. There is therefore no reason to keep them in IIIA or IIIB.

The IIE group. Before p = 20, 3 of its 11 elements are isolated as a robust cluster. Between p = 20 and 30, two other clusters of 3 elements are isolated and at p = 50, all of its elements are isolated from the rest. At that level 5 of them are outliers. One concludes that IIE is a separate entity from IIIA and IIIB but that it is, at best, an inhomogeneous cluster, exactly the findings of Scott and Wasson (25). Indeed Esbensen and Wold (14) recommend that group IIE be abandoned because of significantly less coherence than any other group!

The IIIF group. Before p = 10, one element, and before p = 20, two elements (of 6) are separated and considered to be outliers. Two other elements form a robust cluster at that stage. At p = 50, all elements are separated; four of them are outliers. One may conclude that the IIIF group is a separate entity from IIIA + B, but that it is inhomogeneous.

IIIA and IIIB. Even at p = 50, only a few elements of IIIA and B are present in robust clusters. It must be concluded that IIIA and IIIB do not form separate entities. This is a correct prediction of current meteoritical consensus (12), namely, that IIIA and B form one group IIIAB. The most anomalous member is Lucky Hill (outlier at p = 40). Since all ungrouped meteorites are separated at p = 40, Lucky Hill could be considered ungrouped. It was found to contain data of poor quality (25), so that in this sense the 1978 version of Lucky Hill should indeed be considered ungrouped. Again, the detection of this error is very gratifying since it demonstrates the power of the clustering method.

Equivalent results are found for groups A and D, so that one may conclude that excellent agreement is obtained with the reference classification. The major disagreements, such as the coalescence of some individual groups, do not reflect any real discrepancies since the most recent meteoritic classifications arrive at the same conclusions. Some isolated


misclassifications remain, but in nearly all cases (see Winburg and Lucky Hill), the reference classification or the data errors have been located.

Phase IV. The major remaining disagreement between our classification and the meteoritical, genetic classification after phases I to III is that we find more structure among the ungrouped meteorites than could perhaps be expected. In fact, some “groups” thus obtained are as tight as the clusters representing some of the less homogeneous resolved groups in the reference classification. For this reason we clustered all the ungrouped meteorites plus some of the less homogeneous groups (IIIC, IIID, IB, IC). These were added to serve as a reference. If groups are found among the formerly ungrouped meteorites at the same level as the recognized groups, then one may conclude that there may be some significance in the newly detected groups. One observes that the recognized groups are found as clusters, although some of the groups are found in more than one cluster. We find clusters at the same level among the ungrouped, so that some structure is certainly present. This agrees with recent conclusions by Kracher et al. (10), who introduced the newest group, IIIF. It consists of five meteorites and Kracher et al. (10) note a “relatively large compositional hiatus” dividing the five meteorites in exactly the two clusters found by us. These findings make it probable that at least some of the structure found among the ungrouped is not fortuitous. This is supported by Scott (22), who concludes that grouplets must be present among these meteorites. He identifies some pairs or trios of resemblant meteorites. Some of these are among the clusters identified by us. The findings of phase IV, although less conclusive than those of phases I-III, again show that the clustering method does predict some of the 1981 conclusions which were not evident in 1978.

LITERATURE CITED

(1) Malinowski, E. R.; Howery, D. G. “Factor Analysis in Chemistry”; Wiley: New York, 1980.
(2) Varmuza, K. “Pattern Recognition in Chemistry”; Springer: Berlin, 1980; Vol. 21 in Lecture Notes in Chemistry.
(3) Jurs, P. C.; Isenhour, T. L. “Chemical Applications of Pattern Recognition”; Wiley-Interscience: New York, 1975.
(4) Massart, D. L.; Dijkstra, A.; Kaufman, L. “Evaluation and Optimization of Laboratory Methods and Analytical Procedures. A Survey of Statistical and Mathematical Techniques”; Elsevier: Amsterdam, 1978.
(5) Kowalski, B. R. Anal. Chem. 1975, 47, 1152 A.
(6) Coomans, D.; Massart, D. L.; Kaufman, L. Anal. Chim. Acta 1979, 112, 97.
(7) Everitt, B. S. Biometrics 1979, 35, 169.
(8) Massart, D. L.; Kaufman, L.; Coomans, D. Anal. Chim. Acta 1980, 122, 347.
(9) Massart-Leen, A. M.; Massart, D. L. Biochem. J. 1981, 196, 611.
(10) Kracher, A.; Willis, J.; Wasson, J. T. Geochim. Cosmochim. Acta 1980, 44, 773.
(11) Buchwald, V. F. “Handbook of Iron Meteorites”; Center for Meteorite Studies and University of California Press: 1975.
(12) Scott, E. R. D.; Wasson, J. T. Rev. Geophys. Space Phys. 1975, 13, 527.
(13) Moore, C. B.; Pratt, D. D.; Parsons, M. L. Meteoritics 1977, 12 (3), 314.
(14) Esbensen, K. H.; Wold, S., manuscript in preparation.
(15) Wold, S. Pattern Recognition 1976, 8, 127.
(16) Forgy, E. Biometrics 1965, 21, 768.
(17) Wishart, D. CLUSTAN, Computer Centre, University College London.
(18) MacQueen, J. 5th Berkeley Symp. on Math. Statistics and Probability, Proceedings, 1966; p 281.
(19) Plastria, F.; Massart, D. L.; Kaufman, L. Report CSOOTW/137, Vrije Universiteit Brussel, 1980.
(20) De Clercq, H.; Despontin, M.; Kaufman, L.; Massart, D. L. J. Chromatogr. 1976, 122, 535.
(21) Esbensen, K. H.; Buchwald, V. F. Meteoritics 1979, 14 (4), 573.
(22) Scott, E. R. D. Miner. Mag. 1979, 43, 415.
(23) Massart, D. L.; Kaufman, L.; Coomans, D.; Esbensen, K. H. Bull. Soc. Chim. Belg. 1981, 90, 281.
(24) Anderberg, M. R. “Cluster Analysis for Applications”; Academic Press: New York, 1973.
(25) Scott, E. R. D.; Wasson, J. T. Geochim. Cosmochim. Acta 1976, 40.

RECEIVED for review September 17, 1981. Accepted December 21, 1981.