Characterization of simple carbohydrate structure ... - ACS Publications

Apr 1, 1982 - Rachhpal S. Sahota and Stephen L. Morgan. Analytical Chemistry ... of bacterial cell walls. Joseph R. Hudson , Stephen L. Morgan , Alvin...
Anal. Chem. 1982, 54, 741-747

Characterization of Simple Carbohydrate Structure by Glass Capillary Pyrolysis Gas Chromatography and Cluster Analysis

Stephen L. Morgan* and Christopher A. Jacques¹

Department of Chemistry, University of South Carolina, Columbia, South Carolina 29208

The potential of high-resolution glass capillary pyrolysis gas chromatography for the qualitative analysis of carbohydrates is investigated by using a set of 15 simple carbohydrates as model compounds. Pyrograms of the mono-, di-, and trisaccharides sampled from water and aqueous boric acid solutions are compared. A chromatographic similarity measure based upon the comparison of peak ratios is applied to the discrimination of the carbohydrates. Computer-assisted nonlinear mapping and hierarchical clustering are used to display similarity relationships.

Pyrolysis gas chromatography (PGC) involves thermal fragmentation of an analytical sample at elevated temperatures in the absence of oxygen, followed by gas chromatographic separation of the fragments (pyrolysates). The formation of specific patterns of degradation products and their relative amounts present in the chromatogram (in this case, pyrogram) can provide quantitative and qualitative structural information about the parent molecule. Often, the identification of specific peaks in the pyrogram can lead to partial or complete elucidation of the parent molecular structure. Our interest in PGC is focused on the development of analytical methods for medium-to-high molecular weight organic compounds of biomedical interest, particularly those that are not volatile or that decompose at normal GC temperatures.

Analytical methods for carbohydrates usually consist of hydrolysis, methylation, or other derivative formation steps, isolation by chromatography, and finally identification by IR, NMR, X-ray diffraction, or mass spectrometry (1-3). Many of these procedures require substantial amounts of sample and are time-consuming. PGC offers an attractive complementary approach for the rapid characterization of carbohydrates. The application of analytical pyrolysis to the analysis of molecules of biological origin, including carbohydrates, has been reviewed by Irwin and Slack (4).

Previous work (5) in our laboratory demonstrated that (1→4) glycosidic linked disaccharides could be distinguished from (1→6) linked disaccharides using PGC with packed columns. Within each of these isomeric groups, however, the disaccharides with epimeric monomers or with anomeric glycosidic linkages were indistinguishable from one another. In this paper, we present the results of further efforts to enhance discrimination between simple carbohydrates using PGC.
First, the packed columns of our initial studies were replaced with high-resolution glass capillary columns, providing increased separation of pyrolysates and better structural discrimination. Second, the carbohydrates were sampled from boric acid solutions to take advantage of the reactions of carbohydrates in these solutions (6).

The secondary focus of this article is to discuss innovative techniques for chromatographic pattern recognition in the absence of selective detectors that provide identification of peaks.

CHROMATOGRAPHIC PATTERN RECOGNITION

The decision whether a chromatogram is significantly different from one previously obtained, or whether a pattern

¹Present address: Analytical Research and Development Division, Amway Corp., Ada, MI 49355.

0003-2700/82/0354-0741$01.25/0

belongs to a specific class, often depends on the recognition of a subtle difference or trend that is obscured by the wealth of background information. The desired information is in the chromatogram, but it may not be easy to spot. The difference between several samples might not be a simple inflation or reduction of a single peak; there might be a combination of several such changes, each change seemingly within normal variation but the overall change being significant. The manual comparison of more than a few chromatograms can be an exceedingly difficult task, especially when the number of peaks becomes very large. It is desirable, however, to use as many peaks as possible to discriminate among complex patterns. Statistically significant discrimination requires a ratio of samples to features (peaks) greater than 2 or 3 (7). Attempts to reduce the number of features by discarding some peaks may eliminate useful information. Additionally, there is usually no prior knowledge regarding the statistical distribution of the chromatographic data. For some of these reasons perhaps, pattern recognition has not been extensively applied in chromatography. These difficulties are not widely dissimilar to those found in applying pattern recognition to the interpretation of other data in analytical chemistry.

Simple fingerprinting or matching of unknowns to standards, without the use of quantitative measures of similarity, is common in chromatography. Not surprisingly, many of the reported uses of pattern recognition in chromatography have involved PGC. Computer matching of unknown chromatograms with a library of known chromatograms was first discussed by Menger et al. (8). Their prototype program compared two bacteria pyrograms on the basis of retention times and peak heights, with rejection of similarity if disagreement was greater than 5%.
Applications of pattern recognition have since appeared at a variety of different levels of sophistication in chromatography (9, 10) and in analytical pyrolysis (11-22).

EXPERIMENTAL SECTION

Apparatus. A Hewlett-Packard 5831A gas chromatograph (Avondale, PA) with flame ionization detectors was used. The instrument was fitted with a Hewlett-Packard 18835B capillary inlet system coupled to a Chemical Data Systems 120 Pyroprobe in. stainless steel ribbon pyrolyzer (Oxford, PA) by a 1in. X tube connected to the pyrolysis interface with a Cajon fitting (Cajon Co., Solon, OH) and silver soldered to the capillary inlet insert weldment. Gas chromatographic separations were obtained on a 20 m × 0.25 mm Carbowax 20M WCOT glass column (Supelco, Inc., Bellefonte, PA). Nitrogen carrier gas was employed throughout the system. The carrier gas flow scheme was designed to allow independent control of the pyrolysis interface flow, the split ratio, and the column flow (23). The temperatures in the pyrolysis interface, the inlet splitter, and the column oven were also separately controlled.

Chemicals. Figure 1 shows the names and structures of the 15 carbohydrates selected for this study. All were purchased in highest purity from P-L Biochemicals (Milwaukee, WI) except for trehalose and turanose, which were obtained from Pfanstiehl Laboratories (Waukegan, IL). Boric acid and sodium hydroxide (Fisher Scientific, Pittsburgh, PA) were used to make up the borate solutions. Distilled deionized water was the solvent in all solutions.

© 1982 American Chemical Society

Procedure. The concentration of all carbohydrate solutions was 10 µg/µL. Two sets of solutions were prepared, the sugar


ANALYTICAL CHEMISTRY, VOL. 54, NO. 4, APRIL 1982


the ribbon to 100 °C. Following insertion of the probe into the pyrolysis interface (120 °C), the system was purged with carrier gas for 30 s. The pyrolysis ribbon temperature was then ramped at 75 °C/ms to the final temperature of 600 °C with a time interval setting of 10 s. Pyrolysis fragments were swept into the capillary inlet splitter by a carrier flow of 50 mL/min through the pyrolysis interface. The capillary column flow rate was 0.7 mL/min at the initial column oven temperature (50 °C). These conditions produced a sample split ratio of approximately 70:1. The oven temperature program, initiated at the pyrolysis step, consisted of a 2-min isothermal delay at 50 °C, a 30 °C/min ramp to 110 °C, and a 15 °C/min ramp from 110 to 190 °C. With this temperature program, analysis times under 15 min were achieved for all samples without noticeable losses in resolution.

Data Treatment. The data set for carbohydrates sampled from water solutions consisted of 35 pyrograms of the 15 different sugars. Of the 30-40 peaks usually present in each pyrogram, 16 reproducible peaks with measurable heights were selected. This set of 16 peaks was sufficient to differentiate between several sugar types. The heights of these peaks were stored for off-line processing on the university's Amdahl 470/V6 interactive computer system. The data set for carbohydrates sampled from boric acid solutions consisted of 37 pyrograms of the same 15 sugars. These pyrograms were preprocessed in the same manner as before to produce a reduced set of patterns with 16 peak heights. For the two data sets, a total of 1152 peak heights were recorded. The heights of 21 peaks were off scale and were estimated by calibration on the GC digital integrator results. A listing of the data is available from the authors on request.
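The figures quoted in the Procedure and Data Treatment text can be cross-checked with a few lines of arithmetic. The following is only a sketch: the helper name `program_minutes` is hypothetical, and treating the split ratio as roughly the interface flow divided by the column flow is an assumption made here for illustration, not a statement of how the authors measured it.

```python
# Worked check of numbers quoted in the text. The helper name is hypothetical
# and the flow-ratio estimate of the split ratio is an assumption.

def program_minutes(hold_min, ramps):
    """Initial hold plus the duration of each (start_C, end_C, rate_C_per_min) ramp."""
    return hold_min + sum((end - start) / rate for start, end, rate in ramps)

# 2-min hold at 50 C, 30 C/min to 110 C, then 15 C/min from 110 to 190 C
oven_time = program_minutes(2.0, [(50, 110, 30), (110, 190, 15)])

# Pyrolysis interface flow relative to capillary column flow (both in mL/min)
flow_ratio = 50 / 0.7

# 35 water pyrograms plus 37 boric acid pyrograms, 16 peak heights each
total_peak_heights = (35 + 37) * 16

print(f"oven program duration: {oven_time:.2f} min")
print(f"interface/column flow ratio: about {flow_ratio:.0f}:1")
print(f"peak heights recorded: {total_peak_heights}")
```

Under these assumptions the programmed ramps end well inside the stated 15-min analysis time, the flow ratio is close to the quoted split of approximately 70:1, and the peak-height count matches the 1152 reported.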

[Figure 1 structures. Panel labels: A, D-mannose; B, D-galactose; C, D-glucose; D, maltose; E, D-cellobiose; F, lactose; G, isomaltose; J, trehalose; M, maltotriose; N, melezitose; O, raffinose. Lactulose, gentiobiose, turanose, and melibiose occupy panels H, I, K, and L.]

RESULTS AND DISCUSSION

Measures of Chromatographic Similarity. Any approach to measuring chromatographic similarity must first consider how to best represent a chromatogram in the internal memory of a computer. Because storage of digitized chromatograms will rapidly fill available space as well as complicate later processing, only retention times and peak areas or heights are stored. With complex chromatograms a select number of peaks or features might be chosen (9) or the chromatogram could be divided into arbitrary segments within which the peak areas are summed (10). Judgement should be exercised during this preprocessing of the chromatogram to ensure that information relevant to the classification is retained. In any case, the chromatographic pattern is transformed into a data vector with components (peak areas or heights) in a number of dimensions equal to the number of features (peaks or segments).

Peak intensity normalization is employed to compensate for differences in the chromatogram due to variations in sample size. Peak area or height normalization is accomplished by dividing all peak areas or heights in a chromatogram by the sum of those intensities

$$P_k' = P_k \Big/ \sum_{k=1}^{n} P_k \qquad (1)$$

Figure 1. Structure of 15 carbohydrates. Each structure will be referenced by the letter above it.

in pure water and the sugar in aqueous borate buffer solution with a boric acid to sugar molar ratio of one. The pH of these solutions was adjusted to 7.8 by titration. The pyrolysis of carbohydrate samples required several steps. A thin layer of sugar was placed on the pyrolyzer ribbon by applying a 2.5-µL aliquot of the appropriate solution evenly on one side of the ribbon. The water was then removed by heating

where $P_k$ is the peak intensity of the kth peak of n peaks. Other normalization procedures used with mass spectra are also applicable to chromatograms (24). Retention time normalization may be necessary for comparison of retention times between chromatograms (25).

To avoid biasing similarity measures when the variation in each pattern dimension is not of the same magnitude, autoscaling is often used. Each pattern dimension is scaled to have a mean of zero and a standard deviation of one

$$X_{ik} = (P_{ik} - \bar{P}_k)/s_k \qquad (2)$$

where $P_{ik}$ is the kth peak intensity for the ith chromatogram, $\bar{P}_k$ is the mean peak intensity over all chromatograms for the kth peak, $s_k$ is the standard deviation over all chromatograms of the kth peak intensity, and $X_{ik}$ is the autoscaled value for the kth peak of the ith chromatogram.

The comparison of chromatographic patterns using quantitative measures of similarity has been of wide interest in PGC (18-22). Since each pattern vector can be considered a point in n-dimensional pattern space, the simplest measure of similarity between two pattern vectors $X_i$ and $X_j$ is the Euclidean distance, $d_{ij}$, between them

Table I. Numerical Calculation of Chromatographic Peak Ratio Similarity. Test Data Set: peak intensities, chromatogram no. 1 2 3 4 5 1 1 1 1 1 1

$$d_{ij} = \left[ \sum_{k=1}^{n} (X_{ik} - X_{jk})^2 \right]^{1/2} \qquad (3)$$

When the two pattern points are close to one another in the n-dimensional space, this number will be near zero, and when the patterns are completely different it will be indeterminately large. Euclidean distance is dominated by those peaks for which there exists the greatest absolute difference. Autoscaling, which equally weights all features, will compensate for this.

The most straightforward way to assess the similarity of two chromatograms is the direct visual comparison of the two patterns. Visual comparison of more than two or three chromatograms becomes difficult and the automatic calculation of a quantitative measure is necessary. The choice of a similarity measure is heuristic in the sense that there is no theoretical justification other than that patterns from similar objects should be close to one another in the pattern space. Stack et al. (18), for example, used as a similarity measure the average of the results of dividing, in each peak dimension, the smaller peak intensity by the larger peak intensity. With this measure, 15 peaks were used to classify 138 oral bacteria cultures; duplicate cultures gave a similarity value above 0.9, with 1.0 representing complete similarity. Peak normalization drastically affects the behavior of this measure when the two patterns have a different number of peaks of nonzero intensity. The same criticism can be applied to the Euclidean distance measure.

The chromatographic similarity measure developed for this study was designed to be insensitive to sample size and to not require peak intensity normalization. In comparing two chromatograms with peak intensities in n dimensions, two row vectors for each chromatogram are initially formed. The first vector, the peak ratio vector R, consists of the $(n^2 - n)/2$ peak intensity ratios for all pairs of peaks, $i < j$ and $i \neq j$.
If any peak ratio element is greater than one, that element is replaced by its reciprocal and the corresponding element of the second vector (the flag vector F) is set to minus one; otherwise, no change is made to the first vector and the corresponding element of the second vector is set to plus one. There are three exceptions when one and/or the other of the peaks being ratioed have zero intensity: (1) If the first peak intensity is zero and the second is nonzero, the peak ratio element ($R_{ij}$) is set to zero and the flag vector element ($F_{ij}$) is set to plus one. (2) If the first peak intensity is nonzero and the second is zero, the peak ratio element ($R_{ij}$) is set to zero and the flag vector element ($F_{ij}$) is set to minus one. (3) If both peak intensities are zero, the peak ratio element ($R_{ij}$) is set to one and the flag vector element ($F_{ij}$) is set to zero. The peak ratio vector and the flag vector are of equal dimensionality and together describe the interrelationship of peak intensities within each chromatogram by itself.

The calculation of chromatographic peak ratio similarity is performed by comparing peak ratio and flag vectors for two chromatograms in the following manner:

$$S_{ij} = \left( \sum_{k=1}^{(n^2-n)/2} \left| F_{ik}(1 - R_{ik}) - F_{jk}(1 - R_{jk}) \right| \right) \Big/ (n^2 - n) \qquad (4)$$

where $S_{ij}$ is the chromatographic peak ratio similarity between chromatograms i and j, $F_{ik}$ and $F_{jk}$ are the kth flag vector elements for patterns i and j, and $R_{ik}$ and $R_{jk}$ are the kth peak


2 3

0

1

1

1

1

1

1 1

6

0 0 0

7

1

1

1 1 1 0 0

1

4 5

0 0 0 0

1

1 0

0 0

0

1

Chromatographic Peak Ratio Similarity

chromatogram              chromatogram number
no.            1      2      3      4      5      6      7
1              0
2              0.2    0
3              0.2    0.4    0
4              0.4    0.2    0.2    0
5              0.6    0.4    0.4    0.2    0
6              0.8    0.6    0.6    0.4    0.6    0
7              0.6    0.8    0.8    1.0    0.8    0.6    0

ratio elements for patterns i and j. There is a further modification to this calculation when either flag vector element for a given comparison is zero: (1) If only one of the flag vector elements for comparison k ($F_{ik}$ or $F_{jk}$) is zero and the other peak ratio vector element ($R_{jk}$ or $R_{ik}$) is nonzero, the numerator term of eq 4 for that comparison is set to plus two (total dissimilarity). (2) If only one of the flag vector elements for comparison k ($F_{ik}$ or $F_{jk}$) is zero and the other peak ratio vector element ($R_{jk}$ or $R_{ik}$) is zero, the numerator term of eq 4 for that comparison is set to plus one (partial dissimilarity). The absolute value of the numerator in eq 4 for each peak comparison has a range from zero to two. Finally, the denominator normalizes the similarity measure to range from zero for complete similarity to one for complete dissimilarity.

Although the chromatographic peak ratio similarity appears cumbersome at first, it correctly handles all possible combinations of peak ratio comparisons between two chromatograms. A numerical example is provided in Table I where peak intensities are given for seven different chromatograms having five peaks. These test chromatograms do not all have the same number of nonzero peak intensities and, for convenience, peak intensities are allowed only two levels, zero and one. The algorithm is applicable, of course, to chromatograms having a continuum of peak intensities. Examination of the triangular similarity matrix in Table I reveals that the peak ratio measure rates the similarity between chromatograms 1 and 2, 1 and 3, 2 and 4, and 4 and 5 at the same level; this is expected since these comparisons yield an identical difference in only one of five peak features. In the same manner, comparisons of chromatograms 1 and 4, 2 and 3, 2 and 5, 3 and 5, and 4 and 6 involve differences in two peak features and yield a higher dissimilarity.
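The rules for building the peak ratio and flag vectors, eq 4, and the zero-flag modifications can be sketched in a few lines of Python. This is an illustrative reading of the text and not the authors' BASIC subroutine: the function names are hypothetical, the two-level test patterns are chosen here for illustration, and the treatment of a comparison in which both flags are zero (a contribution of zero, as eq 4 gives directly) is an assumption, since the text only describes the case where exactly one flag is zero.

```python
from itertools import combinations

def ratio_and_flag_vectors(peaks):
    """Build the peak ratio vector R and flag vector F for one chromatogram."""
    ratios, flags = [], []
    for first, second in combinations(peaks, 2):  # all peak pairs with i < j
        if first == 0 and second == 0:
            r, f = 1.0, 0            # exception 3: both intensities zero
        elif first == 0:
            r, f = 0.0, 1            # exception 1: first zero, second nonzero
        elif second == 0:
            r, f = 0.0, -1           # exception 2: first nonzero, second zero
        else:
            r, f = first / second, 1
            if r > 1:                # fold ratios above one into (0, 1]
                r, f = 1 / r, -1
        ratios.append(r)
        flags.append(f)
    return ratios, flags

def peak_ratio_similarity(p, q):
    """Eq 4 with the zero-flag modifications: 0 = identical, 1 = dissimilar."""
    if len(p) != len(q) or len(p) < 2:
        raise ValueError("patterns need the same number (at least 2) of peaks")
    n = len(p)
    rp, fp = ratio_and_flag_vectors(p)
    rq, fq = ratio_and_flag_vectors(q)
    total = 0.0
    for ri, fi, rj, fj in zip(rp, fp, rq, fq):
        if fi == 0 and fj == 0:
            continue                 # assumed: zero contribution, as eq 4 gives
        if fi == 0 or fj == 0:
            other_ratio = rj if fi == 0 else ri
            total += 2.0 if other_ratio != 0 else 1.0
        else:
            total += abs(fi * (1 - ri) - fj * (1 - rj))
    return total / (n * n - n)

full = [1, 1, 1, 1, 1]
print(peak_ratio_similarity(full, [0, 1, 1, 1, 1]))   # one peak differs
print(peak_ratio_similarity(full, [0, 0, 1, 1, 1]))   # two peaks differ
print(peak_ratio_similarity([2, 4, 1], [6, 12, 3]))   # same pattern, 3x sample
```

For these two-level patterns the sketch gives 0.2 for a single differing peak and 0.4 for two, and it returns zero when one pattern is a scaled copy of the other, which is the insensitivity to sample size that the measure was designed to have.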
The chromatographic peak ratio similarity measure correctly classifies the interrelationships among these test patterns; the Euclidean distance similarity also ranks these relationships correctly. Peak intensity normalization within each chromatogram, or autoscaling of peak features, causes the Euclidean distance similarity to yield a different classification. The chromatographic peak ratio similarity is insensitive to such scaling. A short BASIC computer subroutine for calculating the peak ratio similarity is available on request from the authors.

PGC of Carbohydrates Sampled from Water Solutions. Figure 2 shows two high-resolution glass capillary pyrograms of the monosaccharide epimers, galactose and glucose. The overall appearance of the two pyrograms is very similar and,

Figure 3. Pyrograms of raffinose (O1), maltose (D1), and isomaltose (G1) sampled from water solutions.

within experimental uncertainty, they are indistinguishable. Attempts to eliminate the broad unresolved envelope were unsuccessful, even with changes in column length and stationary phase. Figure 3 compares three carbohydrates sampled from water solution that are easily differentiated. Raffinose (see Figure 1), a trisaccharide with a (1→6) glycosidic linkage between two monomer units, appears more similar to isomaltose (with a (1→6) linkage) than to maltose (with a (1→4) linkage). Although there are more differentiating peaks obtained on capillary columns, these results correspond well with those obtained previously on packed columns (5).

The complete set of 35 chromatograms from this study was examined and 16 peaks were chosen for peak height measurement. The broad unresolved envelope was not included among these 16 features. The chromatographic peak ratio similarities (eq 4) for the 595 different pairs of chromatograms were calculated. Although these calculated values can be displayed in a form similar to that shown in Table I, they are far too numerous to present here. The peak ratio similarity value for the two pyrograms in Figure 2 is 0.06. The average similarity value of replicate samples is about 0.04. The peak ratio similarity values for the pyrograms in Figure 3 are 0.25, 0.31, and 0.14 for the comparison of O1 to D1, D1 to G1, and O1 to G1, respectively. The high dimensionality and large number of comparisons prevent adequate visualization of the interrelationships among this set of chromatograms and computer-assisted techniques

Figure 2. Pyrograms of galactose (B2) and glucose (C2) sampled from water solutions. See Figure 1 for structures. The number after the letter code refers to the replicate analysis number.

for data display must now be considered. Nonlinear mapping (26) is an effective technique for the analysis of multidimensional data that maps a set of n-dimensional pattern vectors into a two-dimensional space in such a way that the inherent structure of the original data is approximately preserved. An advantage of nonlinear mapping is that it requires no prior assumptions concerning the structure of the data. The two-dimensional representation is achieved by adjusting the set of two-dimensional coordinates for the data set to minimize the error function

$$E = \left( 1 \Big/ \sum_{i<j} d^*_{ij} \right) \left( \sum_{i<j} (d^*_{ij} - d_{ij})^2 / d^*_{ij} \right) \qquad (5)$$