Archaeological Chemistry IV - American Chemical Society

4 Compositional Data Analysis

Downloaded by UNIV OF PITTSBURGH on June 1, 2014 | http://pubs.acs.org Publication Date: July 1, 1989 | doi: 10.1021/ba-1988-0220.ch004

in Archaeology

Ronald L. Bishop and Hector Neff

Conservation Analytical Laboratory, Smithsonian Institution, Washington, DC 20560

As compositional analysis has become more routine in archaeological investigations, deficiencies in the numerical techniques used for data reduction and summary have become more apparent. A brief overview of techniques commonly used in the analysis of compositional data is presented as well as an example illustrating how data modeling (as opposed to data summary) can facilitate both the recognition of relevant data structure and inferences from data structure to underlying natural and cultural processes.

The application of chemical analytical techniques to archaeological questions has a long history that extends back to the late 1700s (1). Many of the earlier investigations dealt with the compositional characterization of objects to elucidate aspects of their properties, such as color. Yet, as Harbottle (2) has noted, by the end of the 19th century, chemists like Damour and Helm viewed the chemical analysis of artifacts as a means of documenting long-distance traffic in particular materials. The basic approach of determining a chemical composition for an object and then comparing that profile to others similarly derived has been elaborated since that time. Today, the chemical characterization of artifacts constitutes a basic archaeological approach that can be used to address not only problems pertaining to long-distance exchange but to intraregional production and distribution (3), development of craft specialization (4), and typological refinement (5, 6), among other issues.

Despite the volume of data generated and the variety of applications, development and testing of data-handling techniques have lagged. There is not a "cookbook" approach to data analysis any more than there is some ideal group or number of elemental concentrations to determine for all applications. Complex natural and cultural interactions can account for much of the observed compositional variation, and one must be aware of these interactions to achieve a greater understanding of the data. In the discussion to follow, we will be concerned with aspects of multivariate data analysis that lead us toward the position that many of the questions being addressed in a compositional investigation require modeling rather than merely summarizing the data.

0065-2393/89/0220-0057$08.50/0 © 1989 American Chemical Society

In Archaeological Chemistry IV; Allen, R.; Advances in Chemistry; American Chemical Society: Washington, DC, 1989.

Background

The development of increasingly sophisticated analytical instrumentation that allows numerous elemental concentrations to be determined in a relatively short time and increased throughput of specimens has had a decided impact on archaeology. It has even been claimed (7) that the availability of analytical capability is partially responsible for concentrated archaeological attention to material exchange during the 1970s. With the interest among archaeologists and the technological advances has come a staggering amount of analytical data. Numerical summarization of these data, assisted by the increasing speed of the computer and availability of general-purpose statistical packages, is a vital link between the generation of data and its interpretation within the archaeological context.

Data analysis has not been neglected by archaeologists and "archaeometricians." Numerous papers have described various techniques applied to specific sets of data. Others have described how particular options of readily available commercial programs are used (8-11). At times, routine numerical procedures applied to well-determined data have been interpreted in a manner that fails to contribute to increased archaeological understanding. In many of these efforts there is an inappropriate use of statistics, failure to understand the component nature of the material being analyzed, or a failure to bridge from the analytical data to the archaeological context. In more general terms, the speed of data production and computation has outpaced the logic of the investigation.
Because we recognize that the merger of archaeological investigation with physicochemical analysis is still evolving, we will try to avoid reference to specific applications where we believe basic mistakes were made. Instead, we will discuss problems arising in the compositional analysis of archaeological materials in an abstract or generic manner. This approach may subdue the inclination some investigators may feel to engage in vitriolic rebuttal like that which followed Thomas's (12) general critique of statistical practice in archaeology (13).

Although many of the comments in this chapter are applicable to situations encountered during the analysis of data from diverse types of archaeological materials, we will illustrate our points in a later section with examples drawn from ceramic compositional systems. Our examples will incorporate very well-understood and artificial (or "dummy") data, as appropriate in a discussion of methodology.

Goals of Data Analysis


Compositional analysis of archaeological materials entails a series of nondiscrete steps of research design:

• problem formulation,
• sample selection,
• analytical approach,
• data analysis, and
• data integration.

The nature of the specified problem will suggest which samples and how many will be considered, whether raw source materials will be included in the investigation, the spatial and temporal extent of sampling, etc. Certainly, sampling of an interregional investigation will differ considerably from the more demanding requirements for an intraregional focus (14). Once the problem is formulated and samples are specified, selection of an appropriate analytical technique ideally depends upon the sensitivity and precision required to address the problem at hand. On a more practical level, one cannot dismiss considerations of instrumental availability and cost.

A data matrix produced by compositional analysis commonly contains 10 or more metric variables (elemental concentrations) determined for an even greater number of observations. The bridge between this multidimensional data matrix and the desired archaeological interpretation is multivariate analysis. The purposes of multivariate analysis are data exploration, hypothesis generation, hypothesis testing, and data reduction. Application of multivariate techniques to data for these purposes entails an assumption that some form of structure exists within the data matrix. The notion of structure is therefore fundamental to compositional investigations.

Structure within a compositional data set is the differential occurrence of data points in the n-space defined by elemental concentrations.
One simple kind of structure consists of points grouped around two centroids, or centers of mass, in the elemental concentration space. Structure within a compositional data set is assumed, implicitly or explicitly, to reflect the underlying process responsible for the data. Thus, in the case of the two-centroid structure just mentioned, an underlying process, such as procurement of clay from two sources, is assumed.

Different operational levels may exist for the inference of process from structure. For example, principal-components analysis (a method described later in this chapter) permits compression of multivariate data into a few dimensions and yields a scatterplot of what appear to be two groups (Figure 1). The two groups are readily recognizable by using several different kinds of cluster analysis; the group separation is readily confirmed with discriminant analysis. In fact, one group was formed from the other by multiplying all elemental concentrations by 0.66. This example approximates the effect of a relatively pure temper (e.g., quartz sand) on the clay composition of ceramics (16). Particularly in ceramic production, an observed compositional profile might relate not only to the natural realm (source rocks, weathering, erosion, transportation, etc.), but also to the cultural realm (social and individual patterns of materials procurement and preparation).

[Figure 1 plot omitted: scatterplot with a PRINCIPAL COMPONENT #2 axis; legend: Non-diluted, + Diluted.]

Figure 1. Example showing how proportional dilution may create compositional subgroups. "Diluted" specimens were created by multiplying 17 elemental concentrations in the nondiluted specimens by 0.66.

The search for structure proceeds according to some mathematical model that can organize and represent the information in a data matrix. Particular kinds of associations between data entities or variables may be examined, but always relative to the particular model used (17). These models are at the same time structure-revealing and structuring. This concept is illustrated by three natural groups shown relative to concentrations of Fe and Sc in the scatterplot in Figure 2. Because of interelemental correlations, the groups form elongated ellipses, yet are fully separable at the 95% confidence interval. If a hierarchical cluster analysis based on Euclidean distances calculated from logged Fe and Sc values were carried out, the resulting spherical clusters would confound the group membership. If, however, a hierarchical cluster analysis were carried out by using Euclidean distances calculated from standardized principal-component scores, the resulting clusters would correspond to the groups evident in the scatterplot.

[Figure 2 plot omitted.]

Figure 2. Scatterplot of Fe and Sc values for three distinct groups.

Without any prior knowledge of the structure in the data set in Figure 2, one might begin a search for structure with an analysis of the straight-line distances between points in the data set. Hierarchical cluster analysis of Euclidean distances implements a systematic approach to the analysis of straight-line distances. (Euclidean distance is the n-dimensional generalization of straight-line distance, and is discussed in greater detail later. Cluster analysis is a method of representing the Euclidean distances in two dimensions, and is also discussed later.) However, the cluster analysis approach in this case fails to represent the known relationships among the known groups, although groups of closely similar samples are formed. The problem lies in the model. The Euclidean distance calculation is inappropriate for use with correlated variables because it is based only on pairwise comparisons, without regard to the elongation of data point swarms along particular axes. In effect, Euclidean distance imposes a spherical constraint on the data set (18). When correlation has been removed from the data (by derivation of standardized characteristic vectors), Euclidean distance and average-linkage cluster analysis return the three groups.
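The contrast between Euclidean distances on correlated variables and on standardized principal-component scores can be sketched with a small artificial data set. The coordinates below are invented for illustration and are not the Fe and Sc data of Figure 2; the function names and the `along` and `across` parameters are likewise hypothetical. As a simplification, a nearest-neighbor check stands in for a full hierarchical cluster analysis, and population (divide-by-n) variances are assumed.

```python
import math

def make_groups(along=0.10, across=0.02):
    """Three hypothetical groups of (log Fe, log Sc) points.

    Each group is stretched along the (1, 1) direction (strong Fe-Sc
    correlation); the groups differ only by a small offset across it.
    """
    points, labels = [], []
    for g in (-1, 0, 1):                 # group offset across the trend
        for t in (-2, -1, 0, 1, 2):      # position along the trend
            points.append((t * along + g * across, t * along - g * across))
            labels.append(g)
    return points, labels

def standardized_pc_scores(points):
    """Project 2-D points on their principal axes and scale each axis to unit variance."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    # Orientation of the first principal axis of a 2 x 2 covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [x * c + y * s for x, y in centered]
    pc2 = [-x * s + y * c for x, y in centered]
    # The variance of each score vector equals the corresponding eigenvalue.
    sd1 = math.sqrt(sum(v * v for v in pc1) / n)
    sd2 = math.sqrt(sum(v * v for v in pc2) / n)
    return [(a / sd1, b / sd2) for a, b in zip(pc1, pc2)]

def same_group_neighbor_fraction(points, labels):
    """Fraction of points whose nearest Euclidean neighbor shares their group."""
    hits = 0
    for i, p in enumerate(points):
        others = [k for k in range(len(points)) if k != i]
        nearest = min(others, key=lambda k: math.dist(p, points[k]))
        hits += labels[nearest] == labels[i]
    return hits / len(points)

points, labels = make_groups()
raw_fraction = same_group_neighbor_fraction(points, labels)
pc_fraction = same_group_neighbor_fraction(standardized_pc_scores(points), labels)
print(raw_fraction, pc_fraction)  # prints 0.0 1.0
```

In this deliberately exaggerated configuration, every point's nearest neighbor in the raw space belongs to a different group, whereas after standardization of the principal-component scores every nearest neighbor belongs to the same group. Real data would rarely be this clear-cut.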
Most of the time, we do not have absolute a priori knowledge regarding the number of groups in a data set, or the relationships among the variates. Several rather distinct populations with differing patterns of interelemental correlations may be represented. In such cases, inspection of correlations whose pattern is pooled over all samples may not be informative (although, as in the example just presented, inspection of scatterplots alone may provide information on the number of groups likely to be found). Assuming that a data set has some natural or optimum structure and that a given multivariate approach will be able to reveal it is a blind approach to data analysis. Because mathematical pattern-recognition techniques not only reveal structure but may impose structure as well, more informed application of multivariate techniques is needed. The choices among data analytical approaches must be made with reference to the stated research problem and an awareness of the requirements and assumptions of the various multivariate techniques. Different groups will be formed or different aspects of the data investigated depending upon specific problem formulation. From this perspective, one may reject outright naive notions of uniform methodology involving multivariate data analysis (11).

Multivariate Techniques and the Search for Structure

The literature dealing with the multivariate techniques of pattern recognition, numerical taxonomy, group evaluation, etc. is extensive (e.g., reference 19). This discussion provides only an outline of the techniques that have been used to search for structure in compositional data matrices generated by the analysis of archaeological materials. Before many of the techniques are used, however, some pretreatment of the data may be necessary.

Data Representation. Transformations can be applied to the data so that they will more closely follow the normal distribution that is required for certain procedures or for removing (or lessening) unwanted influences. Certainly for data analysis in which major, minor, and trace elemental concentrations are used, some form of scaling is necessary to keep the variables with larger concentrations from having excessive weight in the calculation of many coefficients of similarity. Another form of scaling involves equalizing the extent of variation among the variables. In some instances, two transformations of the data are of interest (e.g., transforming the data to log-normal distributions and then ensuring equal weight through additional scaling). Using transformations that equalize the magnitude of the measurements and the amount of variation is in keeping with one of the premises of numerical taxonomy: that is, no individual variable should assume more weight than another in an analysis involving the calculation of resemblance (19, 20). However, when one is modeling, rather than merely summarizing the data, variable contributions may be adjusted according to different criteria. One example now under investigation is weighting chemical determinations as a function of their analytical errors.

Typical transformations include calculation of logarithms; standardization (mean of 0, standard deviation of 1); percent range; and percent of the maximum value.
A different type of transformation has been used for summarizing chemical data from the analysis of steatite or soapstone. Arguing from principles of geochemistry, Allen and coworkers (21-23) normalized the rare earth concentrations relative to abundances in chondritic meteorites. Following Sayre (24) and Harbottle (20), we use base-10 logarithms in most of the examples discussed later in this chapter. A percent range transformation is also employed for one operation.
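The typical transformations listed above can be sketched in a few lines. The definitions used here are assumptions rather than the chapter's stated formulas: percent range is taken as 100 × (x − min)/(max − min), percent of the maximum as 100 × x/max, and standardization uses the sample (n − 1) standard deviation. The function names and the iron and scandium concentrations are invented for illustration.

```python
import math
import statistics

def log10_transform(values):
    """Base-10 logarithms of the concentrations."""
    return [math.log10(v) for v in values]

def standardize(values):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def percent_range(values):
    """Position of each value within the observed range, in percent (assumed definition)."""
    lo, hi = min(values), max(values)
    return [100.0 * (v - lo) / (hi - lo) for v in values]

def percent_of_maximum(values):
    """Each value as a percentage of the largest value (assumed definition)."""
    hi = max(values)
    return [100.0 * v / hi for v in values]

# Hypothetical concentrations (ppm) of a major and a trace element
# in the same three specimens.
iron = [40000.0, 50000.0, 60000.0]
scandium = [12.0, 15.0, 18.0]

for name, values in (("Fe", iron), ("Sc", scandium)):
    print(name, "log10     ", [round(v, 3) for v in log10_transform(values)])
    print(name, "standard  ", [round(v, 3) for v in standardize(values)])
    print(name, "% range   ", [round(v, 1) for v in percent_range(values)])
    print(name, "% maximum ", [round(v, 1) for v in percent_of_maximum(values)])
```

On the raw scale the iron values would dominate any distance calculation by several orders of magnitude. In this contrived case the two elements become identical after standardization, percent range, or percent of maximum, and differ only by a constant offset after the logarithmic transformation.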

Ordination. Ordination procedures place a sample data point in a variable space to represent some trend or variation. No assumptions need to be made regarding the number of groups. A simple type of ordination would be to plot the coordinates of a sample relative to two variables as in Figure 2. For p variables, higher dimensionality prohibits easy inspection, so most ordination techniques attempt to summarize the information within a data set and reduce the dimensionality (e.g., Figure 1). Dimensionality reduction and ordination have had three main uses in compositional investigations. They have been used

1. to inspect the data to see if a general size component, one stemming from a proportional rather than an absolute relationship, is present in the data (16, 25)
2. to project the data into a standardized space that offers a different, possibly more appropriate, perspective on interpoint distances (26)
3. to form a set of reference axes of reduced dimensionality for graphical display of grouped sample distributions determined by some other technique (27)

The most widely used ordination methods are based on extracting eigenvalues and eigenvectors (also called characteristic roots and characteristic vectors) from a minor product matrix, X'X, or major product matrix, XX', of a data matrix, X (28). If the data matrix is centered by columns before calculating X'X, the minor product matrix is a variance-covariance matrix. If the data matrix is first not only centered but standardized, the minor product matrix is a correlation matrix. Whether the starting point of the analysis is a variance-covariance matrix or a correlation matrix, the eigenvectors of X'X are usually called principal components. The eigenvectors of this matrix may be multiplied by their corresponding eigenvalues to produce factors that may be rotated to enhance their interpretability (in this case, the analysis is called a factor analysis).

In principle, neither centering nor standardization is necessary in eigenvector analyses, and each may be undertaken without the other. Orloci (29) and Noy-Meir (30) discuss the effects of centering in ecological applications. To our knowledge, there has been no careful consideration of the effects of centering and standardization on compositional data matrices. The popularity of standardized, centered eigenvector analyses has more to do with the availability of software than with the appropriateness of the assumptions.
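The relation between the minor product matrix and the variance-covariance and correlation matrices can be written out as follows. This is a sketch under the usual textbook conventions, which the chapter does not state explicitly: X is taken to have n rows (specimens) and p columns (elements), sample statistics use the divisor n - 1, and the symbols X_c, Z, S, R, and D are introduced here for illustration.

```latex
% X: n x p data matrix; 1: column vector of n ones; X' denotes the transpose.
\begin{align*}
X_c &= X - \mathbf{1}\,\bar{x}', \qquad \bar{x} = \tfrac{1}{n} X'\mathbf{1}
  && \text{(centering by columns)} \\
X_c' X_c &= (n-1)\, S
  && \text{(minor product of centered data)} \\
Z &= X_c D^{-1/2}, \qquad D = \operatorname{diag}(S)
  && \text{(standardization)} \\
Z' Z &= (n-1)\, R, \qquad R = D^{-1/2} S D^{-1/2}
  && \text{(minor product of standardized data)} \\
S\, v_k &= \lambda_k v_k \;\Longleftrightarrow\; X_c' X_c\, v_k = (n-1)\lambda_k\, v_k
  && \text{(same eigenvectors)}
\end{align*}
```

Under these assumptions the minor product matrix equals the variance-covariance matrix S (or the correlation matrix R) up to the constant n - 1, so the eigenvectors are unchanged and only the eigenvalues are rescaled.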
The first principal component accounts for the direction of maximum variance through the data, with each successive component accounting for the maximum of the remaining variation. The length of each vector is determined by the square root of the associated eigenvalue. The derived components constitute a new set of reference axes that are linear combinations of the original measurements, but that now are orthogonal; that is, the variance of the original data is preserved but the covariance has been eliminated. Depending on how much of the variance one wishes to preserve in the analysis, the number of components may be truncated, but only at the loss of some information. Factor analysis includes an explicit statistical assumption that all meaningful variation in the data is accounted for by m underlying factors, with m