5
Application of Pattern Recognition to High-Resolution Gas Chromatographic Data Obtained from an Environmental Survey

John M. Hosenfeld and Karin M. Bauer
Midwest Research Institute, Kansas City, MO 64110

Downloaded by UNIV LAVAL on July 12, 2016 | http://pubs.acs.org Publication Date: November 6, 1985 | doi: 10.1021/bk-1985-0292.ch005
The application of pattern recognition to a complex chromatographic data base is described. Soil sample extracts were analyzed by high resolution gas chromatography/flame ionization detection (HRGC/FID). The peak retention times were converted to a peak index, which was then examined by principal component analysis. Several linear combinations of the peaks were identified as factors which separated the sludge-treated and untreated garden soils. Vector of change plots were constructed that showed the effect of sludge treatment. This data interpretation was achieved without prior knowledge of chromatogram peak identity for either compound class or type.

In a typical environmental survey, a list of target analytes is usually defined in the design phase of the study and prior to sample collection. These analytes may have been chosen through knowledge about the system being studied (1,2), through related environmental situations, or perhaps even by using the analytes currently in vogue, such as priority pollutants (3) or PCBs (4). Each of these approaches, although it may meet the immediate needs of the study at hand, advances the knowledge of the environmental system being studied only to a limited extent. The use of predesignated analytes restricts the information that can be obtained from the samples collected. If indeed the study is designed so that the samples are collected in a statistically determined manner and yet only a small number of target compounds are included for analysis, then the results and probably the study conclusions will reflect this narrow approach.

An alternative approach is to analyze the samples using procedures or instrumentation that will give the maximum amount of data for each sample. For example, recent advances in atomic spectroscopy, i.e., inductively coupled argon plasma emission spectroscopy (ICP-AES), allow 20 to 30 elements to be detected simultaneously.
0097-6156/ 85/0292-0069506.00/ 0 © 1985 American Chemical Society Breen and Robinson; Environmental Applications of Chemometrics ACS Symposium Series; American Chemical Society: Washington, DC, 1985.
Another means of greatly increasing the amount of data on the organic compounds present in samples is through the use of universal techniques such as high resolution gas chromatography (5,6) combined with flame ionization detection (HRGC/FID). The problem is to sort through and retrieve information from the large amount of quantitative data produced with capillary chromatography. After those peaks that contain the most information that describes the sample have been determined, directed and specific confirmation analysis by GC/MS may occur.

In order to illustrate this concept, the use of a pattern recognition approach on gas chromatographic data will be presented. This paper will focus on an environmental survey of sewage sludge usage on home vegetable gardens. The analysis of the organic content of the soils collected on this survey was an opportunistic study since the original purpose was to monitor trace metal levels in the treated and untreated gardens. The addition of the HRGC/FID analysis will hopefully add to the knowledge base.

Experimental

Soil samples were collected from 92 gardens as part of a nationwide survey of the usage of sewage sludge on home vegetable gardens. In each designated county, soil was collected from each of two garden types, i.e., sludge treated and untreated. Samples of sludge, when available, were also collected from the garden sites.

In the laboratory, each soil sample (40 g) was transferred to a centrifuge bottle. Since the original purpose of the soil collection was to monitor specific organic compounds in the sludge-amended garden soils, a set of surrogate compounds was added to the soil prior to extraction to assess the extraction and cleanup recovery. The surrogate compounds were mono-, tetra-, octa-, and deca-13C-PCBs, d8-naphthalene, 13C-PCP and 13C-phenol.
The soil samples were dried with Na2SO4 (60 g) and then Soxhlet extracted with hexane:acetone (9:1) for 16 h. The extract was dried with sodium sulfate, concentrated, and split. While one portion was held for other analyses, the other portion was placed on a 3% deactivated silica gel column and eluted with solvent systems of increasing polarity [hexane, followed by methylene chloride:hexane (1:1), and then methylene chloride:acetone (95:5)]. The extracts were combined and reduced to 1 mL, split, and two internal standards were added (tetrafluorobiphenyl and d12-chrysene). The extracts were chromatographed on a 15-m DB-5 fused silica capillary column and detected with flame ionization (FID). Sludge samples were extracted according to the EPA sludge protocol (7) developed at Midwest Research Institute.

The output from the FID was captured by a Nelson Chromatographic Data System and stored on floppy disks. The algorithm in the data system processed the raw chromatograms and stored the peak retention times and areas in data tables which were subsequently transferred to a Digital Equipment Corporation (DEC) 11/23+ for further processing.

A relative retention index (8,9) was developed on the internal standards added to each sample. An arbitrary chromatogram was chosen to act as a reference against which the other chromatograms would be
compared and the peak numbering system developed. This procedure consisted of lining up the two internal standards respectively across all chromatograms. Within a given chromatogram, each retention time (X) was transformed to obtain a retention time (Y) using the simple linear equation:

$$Y = aX + b \tag{1}$$

such that

$$a S_{i1} + b = S_{r1} \tag{2}$$

and

$$a S_{i2} + b = S_{r2} \tag{3}$$

where $S_{r1}$ and $S_{r2}$ = retention times of the two internal standards in the reference chromatogram, and $S_{i1}$ and $S_{i2}$ = untransformed retention times of the two internal standards in chromatogram i.

The system of Equations 2 and 3 yields the slope (a) and the intercept (b):

$$a = \frac{S_{r2} - S_{r1}}{S_{i2} - S_{i1}} \tag{4}$$

$$b = \frac{S_{r1} S_{i2} - S_{r2} S_{i1}}{S_{i2} - S_{i1}} \tag{5}$$
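As an illustration of Equations 1, 4 and 5, the following minimal Python sketch computes the slope and intercept from the two internal standards and applies the linear transformation. The function names and the retention times in the usage note are hypothetical and are not taken from the study.

```python
def alignment_coefficients(s_i1, s_i2, s_r1, s_r2):
    """Slope a (Eq. 4) and intercept b (Eq. 5) that map the two internal
    standards of chromatogram i onto those of the reference chromatogram."""
    if s_i2 == s_i1:
        raise ValueError("the two internal standards must have distinct retention times")
    a = (s_r2 - s_r1) / (s_i2 - s_i1)
    b = (s_r1 * s_i2 - s_r2 * s_i1) / (s_i2 - s_i1)
    return a, b


def align_retention_times(times, s_i1, s_i2, s_r1, s_r2):
    """Apply Y = aX + b (Eq. 1) to every retention time in one chromatogram."""
    a, b = alignment_coefficients(s_i1, s_i2, s_r1, s_r2)
    return [a * x + b for x in times]
```

For example, with made-up standards at 305 s and 1210 s in a sample and at 300 s and 1200 s in the reference, `align_retention_times([305.0, 700.0, 1210.0], 305.0, 1210.0, 300.0, 1200.0)` maps the two standards onto 300 s and 1200 s (to within floating-point rounding) and shifts the intermediate peak proportionally.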
Thus, within each chromatogram, each retention time (X) was linearly transformed using Equations 1, 4 and 5 to obtain the adjusted retention time (Y). Next, the retention time of the first internal standard was renumbered as peak index 1. Peaks occurring prior to the first internal standard were deleted in each chromatogram because they were on the solvent peak. Using a 4-sec peak retention window, each retention time in subsequently adjusted chromatograms was numbered based on the window in which it occurred. When two peaks in a given chromatogram were less than 4 sec apart and within the same window, the peaks were assumed to be unresolved and therefore summed (this happened 14 times out of a total of over 10,000 peaks in the entire data set).

Pattern recognition, i.e., principal components analysis, was attempted on the data matrix of 92 chromatograms × 364 peaks. However, the mathematical requirements of the Statistical Analysis System (SAS) specify that the number of observations (chromatograms) be greater than the number of features (peaks) for matrix inversion computations. To solve this problem we considered (1) dividing the chromatogram into three or four sections containing an equal number of peaks or (2) considering sets of 91 randomly selected peaks in an iterative process. However, a significant drawback of these two approaches is that any interrelationships which may exist between different portions of the chromatogram are not taken into account.
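The peak-numbering step described above (deleting peaks before the first internal standard, assigning 4-sec windows, and summing unresolved peaks) can be sketched as follows. The text does not state how the windows were positioned, so this sketch assumes contiguous, fixed-width windows that start at the adjusted retention time of the first internal standard; the function name and data are hypothetical.

```python
from collections import defaultdict


def index_peaks(adjusted_times, areas, first_standard_time, window=4.0):
    """Assign each adjusted retention time to a peak index.

    Assumption (not stated in the text): windows are contiguous bins of
    `window` seconds, with the first internal standard opening window 1.
    Areas of peaks that fall in the same window are summed.
    """
    indexed = defaultdict(float)
    for t, area in zip(adjusted_times, areas):
        if t < first_standard_time:
            continue  # peaks before the first internal standard are discarded
        peak_index = int((t - first_standard_time) // window) + 1
        indexed[peak_index] += area  # unresolved peaks share an index and are summed
    return dict(indexed)
```

Under this binning assumption, a chromatogram whose first standard elutes at 100 s and which has peaks at 100 s and 102 s would report their combined area under peak index 1.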
An alternate approach (10) was to artificially increase the 92 × 364 matrix to a 368 (4 × 92 chromatograms) × 364 (peaks) matrix so that the matrix could be inverted and all 364 peaks considered simultaneously. This was done by replicating the original 92 observations and by slightly modifying the peak areas in each replicate (the area values were multiplied by a random number between 0.99 and 1.01). The entire chromatogram was analyzed rather than portions of it and thus the correlations between peaks were preserved.

In the first pattern recognition step, a principal component analysis using SAS was performed on the combined data set of 368 chromatograms with 364 peaks each. This procedure yielded 364 factors, with the first three explaining 16.2%, 10.5%, and 6.9% of the total variance, respectively. Within these three factors, those peaks with high loadings were kept until a maximum set of 91 peaks was retained, such that factor 1 contributed 66 peaks, factor 2, 31 peaks and factor 3, 6 peaks. Then using these 91 peaks only, the original data set was reexamined by principal components analysis. Eigenvalues greater than one were plotted to determine how many factors should be retained. After varimax rotation, the factor scores were plotted and interpreted.

Results and Discussion

Typical chromatograms of soil extracts are shown in Figure 1. It can be seen that the chromatograms are complex and that the sludge-treated soil sample has a greater number of peaks (~150 vs. ~50) and higher detector response than the untreated soil sample. One might anticipate that there is a structure in the data set of treated and untreated soils and that this structure might be resolved by application of pattern recognition techniques.
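The replication step described in the Experimental section can be illustrated with a short NumPy sketch. It is not the authors' SAS procedure. It assumes that the first block is the unmodified data, that each further copy multiplies every peak area by its own uniform random factor between 0.99 and 1.01, and that rank is judged on the column-centered matrix; the function names are hypothetical.

```python
import numpy as np


def augment_by_jitter(areas, n_copies=4, low=0.99, high=1.01, seed=0):
    """Stack the original area matrix with (n_copies - 1) jittered replicates.

    Assumption: each peak area receives its own random factor. A single
    factor per chromatogram would only rescale rows and add no rank.
    """
    rng = np.random.default_rng(seed)
    areas = np.asarray(areas, dtype=float)
    blocks = [areas]
    for _ in range(n_copies - 1):
        blocks.append(areas * rng.uniform(low, high, size=areas.shape))
    return np.vstack(blocks)


def centered_rank(matrix):
    """Rank of the column-centered matrix, which bounds the number of
    non-zero eigenvalues of its covariance or correlation matrix."""
    matrix = np.asarray(matrix, dtype=float)
    return np.linalg.matrix_rank(matrix - matrix.mean(axis=0))
```

With fewer rows than columns, the centered rank cannot exceed the number of rows minus one, so the correlation matrix is singular. After augmentation the matrix generically reaches full column rank, although the extra dimensions carry only the injected 1% noise, which is worth keeping in mind when interpreting the trailing components.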
The analysis of chromatographic data is usually performed on normalized chromatograms, which is an attempt to account for the mass injected. However, the closure of analytical data is a problem with normalized data which has been described elsewhere (11). We examined our data for this problem by plotting the grand mean variation over all 368 peaks versus the standard deviations of these peaks. Closure did not occur in the unnormalized data.

The plot of the decreasing sequence of eigenvalues of the 91 principal components is shown in Figure 2. Components 1, 2 and 3, with eigenvalues of 42.7, 22.4 and 8.8, respectively, explained 47.0%, 24.6% and 9.7% of the total variance, respectively, a total of 81.3%. The fourth component with an eigenvalue of 2.5 accounted for only 2.7% of the total variance, and thus only the first three principal components were selected to be further explored. (Note that only 9 of the 91 components had eigenvalues greater than 1.0, explaining together 92.3% of the total variance.) After varimax rotation, the eigenvalues of the first three components were only slightly changed to 42.1, 20.9 and 10.8, respectively; thus a strong first factor remains followed by two factors approximately half as important as their predecessor. Next, within each factor, those features (peak numbers) with loadings representing at least 2% (about twice the average of 1/91 · 100%) of the variance of this factor were kept and ordered with respect to these percentages. By this method,
Figure 1. Typical gas chromatograms of soil from an untreated garden (top) and a sludge-treated garden (bottom); horizontal axis: time (minutes). Conditions: 15 m DB-5, 0.25 mm ID capillary column operated at 100 °C (2 min), then programmed at 10 °C/min to 310 °C (7 min hold).
Figure 2. Plot of the eigenvalues of the correlation matrix (horizontal axis: Number).
Factor 1 was characterized by 25 peaks, each having about equal loadings (0.99 to 0.93 or equivalently 2.3% to 2.0% of the variance accounted for by this factor). The first plot in Figure 3 shows that these 25 peaks are spread across the total peak range with a somewhat higher concentration of peaks toward the end of the range (276 to 351). In contrast, Factor 2 contained 24 peaks with loadings ranging from 0.97 to 0.71, representing proportions of variance of this factor from 4.5% to 2.4%. These 24 peaks were somewhat clustered between peak numbers 31 and 80, as shown in the second plot in Figure 3. Factor 3 (third plot in Figure 3) is the most striking in comparison to the previous two factors. A small number (14 out of 91) of peaks account for 84.8% of the variance explained by this factor, with loadings ranging from 1.0 to 0.6 (equivalent to high percent variances ranging from 8.8% to 3.3%). Although these 14 peaks broadly cover the whole range of the original 400 peaks, a minor cluster occurs in the 192 to 248 section. It is interesting to note that only one peak, number 161, is duplicated in any of the factors (2 and 3), thus underlining the orthogonality of these three factors to each other. These loading variance percentage plots indicate that a structure may be present in the data set due to the above-discussed dissimilarities.

In order to determine the scope of the hidden structure, factor plots of the observations were made. A plot was made of the factor scores for each garden soil, i.e., sludge-treated (T) or untreated (U). No clear pattern emerged from these plots of the factor scores and so another approach was taken. The treated and untreated scores were replotted (Figure 4) with a letter code substituted for each county from which a soil sample was collected. However, these plots were only of minor use in providing insight into the data structure.
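The percentages quoted for the eigenvalues and loadings can be checked with simple arithmetic. The sketch below assumes that the analysis was run on the correlation matrix of the 91 retained peaks, so that the total variance equals 91, and that the "loading variance percent" of a peak is its squared loading divided by the rotated eigenvalue of the factor. The second assumption is an interpretation, not something the text states; the function names are hypothetical.

```python
def percent_of_total_variance(eigenvalue, n_variables=91):
    """Share of total variance for one component of a correlation-matrix PCA,
    where the total variance equals the number of variables."""
    return 100.0 * eigenvalue / n_variables


def loading_variance_percent(loading, factor_eigenvalue):
    """Share of a factor's variance attributed to one peak, taken here as
    the squared loading over the factor's eigenvalue (an assumption)."""
    return 100.0 * loading ** 2 / factor_eigenvalue
```

Under these assumptions, eigenvalues of 42.7, 22.4 and 8.8 give about 46.9%, 24.6% and 9.7%, matching the reported values to within rounding. Loadings of 0.99 (Factor 1, eigenvalue 42.1) and 0.97 (Factor 2, eigenvalue 20.9) give about 2.3% and 4.5%. For Factor 3, a loading of 0.6 gives about 3.3%, while a loading of exactly 1.0 would give about 9.3% rather than the reported 8.8%, which is consistent with a top loading slightly below 1.0 that was rounded up.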
From these factor plots of the observations, secondary plots were constructed to determine the effect of treating garden soil with sludge compared to untreated garden soil. These vectors of change plots are shown in Figures 5, 6, and 7. It is important to emphasize again that the treated and untreated soil samples came from at least two separate gardens within a county, i.e., no experimental design of adding sludge to untreated gardens occurred.

Figure 5 presents the plot of factor 1 versus factor 2. The comparison of sludge-treated to untreated soils M-Z, G-T, B-O, and A-N shows an equal and positive combination of factors 1 and 2. Site G-T is profoundly affected by sludge treatment, as evidenced by the large response in these two factors. Soils J-W, L-Y, K-X, and E-R are negatively affected by factor 2. In Figure 6, a similar effect is seen for soils G-T, A-N, and M-Z; however, site C-P is reversed from Figure 5 because of a strong contribution from factor 3. Sites D-Q, I-V, and F-S have the same reversal as site C-P. Sites E-R, K-X, and L-Y are strongly affected by a positive factor 3. Figure 7 shows changes similar to those occurring in Figure 6. It is apparent that the relationship among factors is 3 > 2 > 1. However, it is important to recognize that these eigenvector projections were made without knowledge about the class assignments of the individual soils. The resulting separation is therefore a strong indication of real differences between the two garden soil types in a given county. Similar vectors of change were
Figure 3. Plot of the features (peak numbers) compared to the loading variance percent for the first three factors (one panel per factor; horizontal axis: Peak Number).
Figure 4. Factor score plots of sludge treated and untreated garden soil. (Letters A-M are untreated soil while N-Z are the corresponding sludge treated soil such that A-N are both types of soil from a given county.) 15 scores not plotted because of superposition. Recovered axis label: FACTOR 2.
Figure 6. Vector of change for factors 1 versus 3. Recovered axis label: FACTOR 3.
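The text does not spell out how the vectors of change were computed. One plausible reading, sketched below as an assumption, is that each vector runs from the factor scores of the untreated soil to those of the sludge-treated soil from the same county, using the letter pairing A-N, B-O, ..., M-Z given with Figure 4. The function names and the scores in the usage note are hypothetical.

```python
def county_pairs():
    """Pair untreated letters A-M with treated letters N-Z (A-N, B-O, ..., M-Z)."""
    return list(zip("ABCDEFGHIJKLM", "NOPQRSTUVWXYZ"))


def vectors_of_change(scores, pairs):
    """Difference in factor scores (treated minus untreated) for each county.

    `scores` maps a letter code to a tuple of factor scores. Pairs with a
    missing letter are skipped rather than guessed.
    """
    changes = {}
    for untreated, treated in pairs:
        if untreated not in scores or treated not in scores:
            continue
        u, t = scores[untreated], scores[treated]
        changes[f"{untreated}-{treated}"] = tuple(tj - uj for uj, tj in zip(u, t))
    return changes
```

For instance, with invented scores `{"A": (0.0, 0.1, -0.2), "N": (0.3, 0.4, -0.1)}`, the vector for county A-N is positive on all three factors. Plotting any two components of such vectors as arrows gives a display of the kind shown in the vector of change figures.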