Anal. Chem. 1985, 57, 295-302


Application of Pyrolysis/Gas Chromatography/Pattern Recognition to the Detection of Cystic Fibrosis Heterozygotes

Judith A. Pino and John E. McMurry*
Department of Chemistry, Baker Laboratory, Cornell University, Ithaca, New York 14853

Peter C. Jurs* and Barry K. Lavine
Department of Chemistry, The Pennsylvania State University, University Park, Pennsylvania 16802

Alice M. Harper*
Biomaterials Profiling Center, University of Utah, Salt Lake City, Utah 84112

High-resolution pyrolysis/gas chromatography/pattern recognition methods have been used to develop a potential method for the detection of carriers of the cystic fibrosis gene. The test data consisted of 144 pyrochromatograms (Py/GCs) of cultured human skin fibroblasts from obligate cystic fibrosis heterozygotes and from normal controls. A two-stage pyrolysis procedure using a modified chromatographic inlet yielded well-resolved reproducible profiles. Microcomputer-controlled instrumentation enabled transmission of pyrochromatograms to a host facility, where data-conditioning software provided for peak alignment and data set optimization. Each Py/GC contained 214 peaks corresponding to a set of standardized retention-time windows. Discriminants were developed by nonparametric pattern-recognition procedures that could classify these Py/GCs into the proper group based on chemical differences. A discriminant based on six of the Py/GC peaks correctly classified 136 of the 144 Py/GCs (94%), and a discriminant based on nine principal components formed from the Py/GC peaks correctly classified 134 of the 144 (93%).

ANALYTICAL CHEMISTRY, VOL. 57, NO. 1, JANUARY 1985

0003-2700/85/0357-0295$01.50/0 © 1984 American Chemical Society

Cystic fibrosis (CF) is the most common life-threatening genetic disorder in Caucasians. With an occurrence rate of one in every 1600-2000 live births, CF appears to be inherited as an autosomal recessive trait and to have a gene frequency of 0.05. In spite of intensive effort, the underlying genetic defect(s) has not been identified (1). One of the most critical problems in CF research is the development of a method for detecting carriers (heterozygotes) of the CF gene. There is currently no method available for identifying carriers and there is no reliable method of prenatal diagnosis. We wish to report our work evaluating the use of pyrolysis/gas chromatography/pattern recognition (Py/GC/PR) as an analytical technique for detecting carriers of the CF gene.

Pyrolysis/gas chromatography (Py/GC) is an analytical technique that consists of rapid thermal fragmentation of a sample in the absence of oxygen, followed by separation of the volatile fragments on a gas chromatograph (2-4). The chromatographic record of pyrolysates forms a reproducible fingerprint of the parent material, while the individual peaks and their relative intensities provide both qualitative and quantitative information about the original sample.

Pyrolytic analysis was first applied to complex materials in 1952 by Zemany, who showed that reproducible decomposition patterns could be obtained from biopolymers such as albumin and pepsin (5). In 1960, the application of Py/GC to amino acids was reported (6), and in 1965 the use of Py/GC for characterization of bacteria was first published (7). In addition, applications have been reported in the past decade for tissue pathology (8, 9), forensic science (10), microorganism taxonomy (11, 12), and carbohydrate chemistry (13). It now seems well established that Py/GC is suitable for the analysis of complex biomaterials that are nonvolatile or for which derivatization is not feasible.

Two major problems have plagued investigators in the Py/GC field. The first has been reproducibility: Minor variations in sample preparation may affect fragmentation pathways, and minor variations in analytical conditions may affect retention times, thus hindering the comparative identification of peaks between pyrochromatograms. To a considerable extent, this reproducibility problem has been minimized by improved instrument design. Better control of pyrolysis conditions has been achieved by the design of low-mass pyrolyzers with minimal dead volume and with temperature rise times of better than 1 °C/ms. Even more important has been the advent of microprocessor-controlled gas chromatographs and fused-silica capillary columns, which allow precise control over the analysis and which have high resolving power to separate the maximum number of fragments. With these advances, reproducible pyrochromatograms containing 150-200 well-resolved peaks can now be obtained.

The second problem in Py/GC is that much data interpretation to date has been subjective. Although many investigators have demonstrated that pyrochromatograms of different materials show features allowing qualitative visual discrimination to be made, this simple visual analysis may fail when complex and closely related biological materials are analyzed. Often, the discriminatory information sought may consist of subtle variations in relative intensities distributed across several peaks in the pyrochromatogram.

The solution to this second problem lies in the application to the pyrochromatographic data of computer-based peak-matching and pattern-recognition techniques. Peak matching has generally been approached by a combination of retention-time scaling and time windowing (14) or by signal-processing techniques. For the most part, published methods have concentrated on automation at the expense of accuracy; although hand matching of peaks between pyrochromatograms is clearly inefficient, the option of operator input is nevertheless a desirable feature when dealing with large and complex data sets such as the one we have generated.

Pattern recognition (15-17), a loose and synergistic collection of parametric and nonparametric statistical techniques, seeks to elucidate relationships in multidimensional data sets.
The major underlying principles are that the significant data may be represented in some reduced dimensionality and that the distribution of the data points in this space will reveal a continuum or discontinuity that may be quantified on the basis of some property of interest. In binary classification studies such as ours, the expectation is that a few orthogonal features derived from the (correlated) pyrochromatogram peaks will describe a space where data from each class will cluster in a separate region.

Application of pattern-recognition techniques to pyrochromatographic data sets has been reported by several workers: For example, classification of Pseudomonas (18) and Enterobacteriaceae (19) at the species level has been achieved, three strains of Penicillium have been correctly classified (20), and Clostridium botulinum has been distinguished by physiological group (21).

If Py/GC/PR is to be applied to the problem of CF heterozygote detection, it is first necessary to define the specific biological sample to be analyzed. Cystic fibrosis is characterized by a dysfunction of exocrine (secretory) glands, resulting in chronic pulmonary disease, pancreatic insufficiency, and elevated sweat electrolyte levels. Although the primary clinical manifestations in CF homozygotes appear centered in the pancreas and in mucus-producing cells, numerous studies have indicated that biochemical abnormalities are also present in skin fibroblast cells. For example, studies on cultured homozygous CF skin fibroblasts suggest, among other things, that these cells exhibit reduced efficiency in ascorbate-induced collagen synthesis (22), that they exhibit a diminished colchicine-binding activity and tyrosyl tubulin-ligase activity (23), that they exhibit an increased mitochondrial oxygen uptake (24), that they have an abnormal monosaccharide composition of peripheral cell-surface glycopeptides (25), that they exhibit a reduced membrane magnesium-calcium ATPase activity (26), that they secrete a factor(s) capable of inhibiting ciliary motility (27), and that they show a decreased thermostability of α-mannosidase (28). Skin fibroblasts are also an appealing model system for their ease of culture and their immunity to the transient metabolic status of the donor.

Preliminary studies (29-31) have indicated that CF homozygous fibroblasts may be differentiated from normal controls by using Py/GC/PR. We surmised that the genetic abnormality might also be expressed in heterozygous cells with sufficient specificity to distinguish them from normal controls, and we therefore undertook a study to compare the pyrochromatograms of CF heterozygotes with those of normal controls.

In this study, cultured skin fibroblasts from 24 cystic fibrosis obligate heterozygotes and from 24 presumed normal controls were analyzed in triplicate by Py/GC/PR. A new two-stage pyrolysis procedure was developed for the analysis, an interactive data preprocessor was developed for peak matching the pyrochromatograms, and pattern-recognition methods were applied to the data to obtain discriminant models that yielded an encouraging prediction rate in distinguishing CF heterozygotes from normal controls.

EXPERIMENTAL SECTION

Tissue Culture. Full experimental details are available from ref 32. Fibroblasts were grown from skin biopsies using explant techniques or were obtained from the Human Genetic Mutant Cell Repository (Camden, NJ). Skin biopsies were performed at the Upstate Medical Center, Syracuse, NY, using a drill biopsy procedure (33) and following protocols approved by the Human Subjects Committee. Forty-eight cell lines were established and analyzed: 24 samples were donated by parents of children with cystic fibrosis (4 male, 20 female obligate heterozygotes), and 24 by controls (16 male, 8 female presumed normals). Fibroblasts were cultured in modified Eagle's Minimum Essential Medium, supplemented with 15% fetal bovine serum (Sterile Systems, Logan, UT) and gentamicin (Schering Corp., Kenilworth, NJ). Cell lines were established in 25-cm² flasks, serially passaged three times to 300 cm², harvested at confluency, centrifuged, rinsed briefly with distilled water, lyophilized, and stored under argon at -90 °C. A low-passage subsample of each line was cryopreserved against future need, and sufficient material was harvested for at least four Py/GC analyses. Standard precautionary measures were taken to monitor cell viability and the absence of contamination.

Instrumentation. A Hewlett-Packard 5880A gas chromatograph equipped with a flame ionization detector was used; the inlet was modified to accept a Chemical Data Systems 120 Pyroprobe platinum-coil pyrolyzer. The inlet gas flow lines were modified so that the instrument always functioned in split mode, and a microprocessor-controlled solenoid valve was installed to route carrier flow either directly to the column or through the pyroprobe interface. Separation of the volatile pyrolysates was achieved on Carbowax 20M fused-silica capillary columns (25 m long by 0.2 mm i.d.; Hewlett-Packard). These were replaced when performance deteriorated after approximately 100 runs.

An Apple II microcomputer was interfaced to the GC data communications board so that operating parameters and timetables for analysis, bakeout, column conditioning, etc. could be downloaded from diskette. The results of all analytical runs were uploaded from the GC to diskette, and selected files were transferred by modem to a DEC-2060 for statistical processing. The communications programs, GCLINK and DECLINK, were written in Pascal and ran under the Apple Pascal operating system.

Procedure. For Py/GC analysis of fibroblasts, a 230 ± 1 µg (Mettler M5 Electrobalance) sample was centered in a quartz tube (14 mm × 2.4 mm o.d. × 2.0 mm i.d., Wilmad Glass, Buena, NJ), and placed in the heating coil of the pyroprobe. The probe was inserted into the pyrolysis interface at 200 °C, and the system was allowed to equilibrate. Pyrolysis at 400 °C was initiated with a temperature rise time of 20 °C/ms and a duration of 10 s, while carrier-gas flow was directed through the interface to sweep pyrolysates onto the column. After 60 s, the interface flow was discontinued, effectively isolating the chamber and preventing entrainment of spurious volatiles. The column was temperature programmed from 45 °C to 200 °C, and data were collected for 56 min. The results of this run (the HP Numeric and Cardinal Point files) were transferred to the Apple while the system cooled.

The 700 °C pyrolysis was then carried out on the same sample in a similar manner; data were collected and transferred to the Apple. The sample and tube were discarded, and a blank run was performed before the subsequent analysis. All 48 samples were analyzed in triplicate. Numeric data files for the high-temperature pyrolyses were uploaded from the Apple to a DEC-2060 for processing.

Figure 1. A representative pyrochromatogram. The peak identities indicated are those assigned by the 4P peak-matching software.

RESULTS AND DISCUSSION

Data Preparation. The multivariate procedures of pattern recognition require a fully populated data matrix such that, for all pyrogram row vectors, each column corresponds to values for the same known peak or feature across the entire data set. We developed a pyrogram preprocessing package (4P) for assigning a formal identity to each peak using a cumulative reference file (34). Each pyrogram is divided into a set of "intervals" bounded by "majors" (large peaks observed in all pyrograms). Screen-driven software linearly scales intervals under assignment and juxtaposes them with the corresponding reference interval. Although most assignments are made automatically, all queries are resolved by visual inspection of the original chromatogram. It is relatively easy to coalesce peaks after peak matching has been completed, and several rarely occurring peaks in the final data set were combined with their neighbors. We believe that the option of operator control is particularly important in applications with diagnostic potential. The 4P software supports entry of new peaks into the reference file as necessary, as well as the option of using any matched sample as a reference, which considerably simplifies the matching of replicates. The package was written in FORTRAN on a DEC-2060 and further information is available on request.

Three replicate pyrolyses of 24 samples in each class yielded 144 pyrograms for analysis. The two-stage pyrolysis technique resulted in well-resolved, reproducible chromatograms from the high-temperature run, as indicated by the typical pyrochromatogram profile in Figure 1. The resolution, reproducibility, and consistently flat base line are superior to comparable analyses using only a single-stage pyrolysis. Poorly defined early peaks and sporadic trailing peaks outside the reference range (5 min < retention time < 35 min) were excluded from further consideration.
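The linear scaling of an interval between two majors can be illustrated with a short sketch. This is not the 4P package (which was written in FORTRAN and is not reproduced in the text); the function names, the identities 110 and 120, the retention times, and the matching tolerance are all hypothetical and chosen only for illustration.

```python
def scale_interval(times, sample_bounds, ref_bounds):
    """Linearly map retention times from a sample interval onto the
    corresponding reference interval (both bounded by two majors)."""
    s_lo, s_hi = sample_bounds
    r_lo, r_hi = ref_bounds
    slope = (r_hi - r_lo) / (s_hi - s_lo)
    return [r_lo + (t - s_lo) * slope for t in times]


def match_peaks(scaled_times, reference, tolerance):
    """Assign each scaled peak the identity of the nearest reference peak.

    reference maps formal identity -> reference retention time (min).
    Peaks with no reference peak within `tolerance` get None; these stand in
    for the queries that the text says are resolved by visual inspection.
    """
    assignments = []
    for t in scaled_times:
        ident, ref_t = min(reference.items(), key=lambda kv: abs(kv[1] - t))
        assignments.append(ident if abs(ref_t - t) <= tolerance else None)
    return assignments


# Hypothetical numbers: majors 100 and 200 elute at 5.0 and 8.0 min in the
# reference, but at 5.2 and 8.5 min in the sample being matched.
reference = {100: 5.0, 110: 5.9, 120: 6.8, 200: 8.0}
sample_times = [5.2, 6.19, 7.9, 8.5]
scaled = scale_interval(sample_times, (5.2, 8.5), (5.0, 8.0))
identities = match_peaks(scaled, reference, tolerance=0.1)
print(identities)
```

In this toy case the third peak falls between two reference peaks and is left unassigned, which is the situation where operator input would be requested.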
The peaks in the pyrochromatograms are numbered, with the formal identities assigned by the peak-matching software. We selected a reference scheme comprising 12 intervals bounded by 13 majors. These are identified 100, 200, ..., 1300 and were present in all pyrograms. The 4P software allowed reliable operator-controlled generation of a coherent data set for multivariate analysis. Our final reference chromatogram contained 214 formal identities, though not all peaks were present in all pyrograms. In the heterozygote group the number of peak assignments was 115 ± 5; in the control group, it was 131 ± 16. Figure 2 shows the distribution of peak assignments. Both normal and heterozygous groups are bimodal, with a significant number of rarely seen peaks. The broader low-end distribution of the normal group may reflect phenotypic microheterogeneity of the control population; a similar effect is noted in cluster analyses of the normal data.

Figure 2. Peak-matching assignments for heterozygous and control samples (panels: normal controls; heterozygous samples). Frequency of assignment is the number of samples in which a given formal identity was assigned to any pyrochromatogram peak.


Pattern-Recognition Analysis. The data were analyzed with the pattern-recognition package ADAPT (35). The primary objective was to develop a binary classification of the data set on the basis of cystic fibrosis heterozygosity. This is the most basic pattern-recognition analysis, and may be summarized as follows. Each peak-matched pyrogram is initially represented by a data vector $X_i = (x_1, x_2, \ldots, x_j, \ldots, x_n)$, where the $x_j$ are the peak areas for $n$ peaks. This data set may be normalized and autoscaled to remove any bias arising from differences in magnitude, so that each pyrogram and all features have equal weight in the analysis. The data may now be considered as a set of points in n-dimensional space. If there exists a hyperplane that separates the two classes in this space, then

$W \cdot X_i > 0$ for $X_i$ describing one class
$W \cdot X_i < 0$ for $X_i$ describing the other class

where $W = (w_1, w_2, \ldots, w_j, \ldots, w_n)$, the weight vector, is the surface normal vector of the dividing hyperplane.

Pattern recognition is a set of methods for investigating data represented in this manner to assess the degree of clustering and general structure of the data space. The three main types of pattern-recognition methodology are mapping and display, discriminant development, and clustering (15-17). The ADAPT computer software system has routines in all three areas, and most were used in this study.

The classification will be significant only if the ratio of samples to analytical dimensionality is three or greater (36). Similar criteria affect the significance of bivariate regressions, where the solution is to accumulate more data. The approach in pattern recognition, given a fixed number of samples, is to reduce the dimensionality.
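The hyperplane condition above can be made concrete with a small error-correction training loop of the kind usually called a linear learning machine. The sketch below is an assumption-laden illustration, not the ADAPT routine: the function names, the appended constant term, the stopping rule, and the toy two-peak data are all hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_learning_machine(patterns, labels, max_passes=100):
    """Error-correction training of a weight vector W.

    patterns: list of feature vectors X_i; a constant 1.0 is appended to each
              so the hyperplane need not pass through the origin (assumption).
    labels:   +1 for one class, -1 for the other.
    Returns W with label * (W . X_i) > 0 for every training pattern, or None
    if no separating hyperplane was found within max_passes.
    """
    augmented = [list(x) + [1.0] for x in patterns]
    w = [0.0] * len(augmented[0])
    for _ in range(max_passes):
        errors = 0
        for x, y in zip(augmented, labels):
            if y * dot(w, x) <= 0:  # misclassified, or lying on the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:
            return w
    return None


def classify(w, x):
    return 1 if dot(w, list(x) + [1.0]) > 0 else -1


# Toy, made-up two-peak data; not values from the study.
patterns = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5),
            (-1.0, -1.5), (-2.0, -1.0), (-1.2, -2.2)]
labels = [1, 1, 1, -1, -1, -1]
w = train_linear_learning_machine(patterns, labels)
predictions = [classify(w, x) for x in patterns]

# Rule of thumb from the text: samples / dimensionality should be 3 or more.
ratio = len(patterns) / len(patterns[0])
```

With six toy patterns and two features the ratio is exactly three, the minimum the text considers significant.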
Primary feature selection may be accomplished either a priori, based on some measure of the contribution of each feature to the intraclass variance, or a posteriori, by discarding the least significant elements in hyperplane classifiers developed from an unbiased initial selection of features. Peaks arising from a common macromolecular moiety of the sample will be correlated, and factor analysis may be used to transform the data to a reduced dimensionality while preserving most of the variance information. Classifiers may be developed by using the majority of the data (the training set) and validated against the remainder (the test set). Routines to implement these analyses form a subset of ADAPT.

The pattern-recognition analyses were directed toward three specific goals: (1) finding discriminants that could separate the 72 heterozygote Py/GCs from the 72 normal Py/GCs on the basis of chemical differences between the two groups; (2) studying the structure of the Py/GC data to seek obscure relationships with mapping-and-display and clustering methods; (3) developing the ability to predict class membership of unknowns.

This set of data (144 Py/GCs of 214 peaks each) was transferred on magnetic tape from the Cornell DEC-2060 to the Penn State Prime 750, where it was entered into the disk storage of ADAPT. The data were standardized and autoscaled so that each variable (peak) had a mean of zero and a standard deviation of unity within the entire set of 144 Py/GCs.

To apply pattern-recognition methods to this overdetermined data set, the necessary first step was feature selection. The number of peaks per Py/GC must be reduced to less than about one-third the number of members of each class to avoid chance separations (36, 37). For the final results of the analysis to be meaningful, this feature selection must be done objectively without using the class membership labels of the Py/GCs.
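Autoscaling and the one-third guideline can be sketched in a few lines. The function names are hypothetical, the use of the sample (n - 1) standard deviation is an assumption because the text does not state the convention, and the guideline is coded as a strict "less than one-third" even though the text says "about".

```python
import math


def autoscale(matrix):
    """Column-wise autoscaling: each peak (column) is shifted to mean 0 and
    divided by its standard deviation, taken over all pyrograms (rows)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    scaled = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        mean = sum(column) / n_rows
        # Assumption: sample standard deviation with an n - 1 denominator.
        std = math.sqrt(sum((v - mean) ** 2 for v in column) / (n_rows - 1))
        for i in range(n_rows):
            scaled[i][j] = (matrix[i][j] - mean) / std if std > 0 else 0.0
    return scaled


def max_features(class_size):
    """Largest whole number of features strictly below class_size / 3."""
    return (class_size - 1) // 3


# Toy peak-area matrix: three pyrograms by two peaks (made-up values).
areas = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
scaled = autoscale(areas)
limit = max_features(72)  # 72 Py/GCs per class in this study
```

For 72 Py/GCs per class the strict reading gives an upper limit of 23 peaks, consistent with the much smaller feature sets (six peaks, nine principal components) reported in the abstract.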
For fingerprinting experiments of the type that we are considering it is inevitable that there will be relationships between sets of conditions used in generating the data and the patterns that result. One must realize this in advance when approaching the task of analyzing such data. Therefore, the problem is to utilize the information specific to the genetic alterations characteristic of CF heterozygotes without being swamped by the large amount of qualitative and quantitative data due to experimental conditions that is also contained in the complex Py/GCs. In our studies, we have observed that experimental variables such as cell-culture batch number, passage number, donor gender, and column identity can all contribute to the overall classification process. Other workers have recognized and commented on some of these difficulties in similar situations (38-40).

Figure 3. A principal-components representation of the pattern space defined by the Py/GC peaks of interval three. Pattern groupings indicative of batch and column effects are evident. The squares represent the CF heterozygotes, and the inverted triangles are the normal controls.

We investigated the confounding of the chemical information by experimental effects through the following experimental sequence. One way of looking at the confounding used eigenvector projections of the data. We plotted the 144 Py/GCs in a two-dimensional map using the first two principal components derived from the n-dimensional data. Two examples are shown in Figures 3 and 4, where some subgroupings related to batch and column identity are apparent.

A second way of studying the confounding involved reordering of the data set in terms of experimental variables rather than the CF heterozygote vs. normal classes. In such experiments, the data set was reordered so that one class contained only those Py/GCs analyzed on a particular capillary column and the other class contained all other samples. A discriminant was then developed using Py/GC peaks from the reordered data, and the degree of classification success was noted. Similar reordering experiments were carried out for batch number, passage number, and gender. We learned from these studies that several sets of peaks useful for discriminant development in the CF heterozygote vs.
normal classification problem also supported discriminants that could differentiate among Py/GCs grouped according to experimental variables. However, descriptor sets for discriminant development that only yield favorable classification results for the CF heterozygote vs. normal classification problem have also been found.

Figure 4. Principal-components representation of the pattern space defined by principal components generated from the first 60 Py/GC peaks (horizontal axis: first principal component). Column effects are evident. The squares represent the CF heterozygotes, and the inverted triangles are the normal controls.

Regression-analysis experiments were carried out to determine the influence of experimental variables on the overall classification problem. For example, a set of Py/GC peaks used for discriminant development was regressed against an indicator variable constructed for gender. An indicator variable is a variable of zeros and ones where the zeros correspond to one class and the ones to the other. The residuals from this regression analysis were stored; ostensibly, only gender information was removed. Discriminants were then developed using these residuals to represent the Py/GCs. The classification success of a discriminant developed using the residuals was compared to the classification success of a discriminant developed using the corresponding Py/GC peaks themselves. In some cases, the discriminants based on the residuals exhibited a reduction of up to 10% in classification success rate compared to decision functions developed using the corresponding Py/GC peaks as descriptors. Such a marked difference in classification power suggests that gender has a strong effect. Regression studies were carried out for passage number as well, and similar results were obtained.

Notwithstanding the effects of the experimental variables described above, several discriminants have been developed that differentiate between the 72 Py/GCs from the CF heterozygotes and the 72 Py/GCs from the presumed normal subjects essentially on the basis of chemical difference. Two such discriminants are now described in detail.

Case A. In the first case to be discussed, we used all the Py/GC peaks that were present in at least 90% of the pyrochromatograms as a starting point for the analysis. We assessed the ability of each of these 65 Py/GC peaks alone to discriminate between pyrochromatograms from CF heterozygotes and normals. We then assessed the ability of each of these 65 Py/GC peaks alone to discriminate between pyrochromatograms with respect to gender, passage number, and column identity. The 12 Py/GC peaks that had larger classification-success rates for the CF vs.
normal classification than for any other dichotomy were selected for further analysis. This procedure identifies those peaks that contain the most information about the CF vs. normal problem as opposed to the experimental variables. The 12 peaks were not chosen for CF vs. normal classification success alone, for this could have resulted in chance separation. These 12 peaks come from all regions of the pyrochromatogram. Variance feature selection (41) combined with the linear learning machine and the adaptive least-squares method (42) was used to remove the 6 of the 12 peaks not relevant to the classification problem. A discriminant that only misclassified eight of the pyrochromatograms (136 correct of 144, 94%) was developed by use of the final set of six Py/GC peaks (see Table I).

Figure 5. A plot of the first two principal components of the six Py/GC peaks for case A. The squares represent the CF heterozygotes, and the inverted triangles represent the normal controls.

The contribution to the overall dichotomization power by experimental parameters of the decision function that is based on just six Py/GC peaks was assessed by reordering experiments. In one study, the set of pyrochromatograms was first reordered in terms of donor gender, and poor classification results were obtained. Next, the Py/GCs were arbitrarily assigned to one of two classes, and the discriminatory power of the decision function for this random data set was then determined. There was little difference in the classification success of the decision function for these two tests. That is, reorderings of the pyrochromatograms by donor gender or randomly are essentially equivalent in terms of separability of the data into classes. Similar studies were done for passage number and column identity, and comparable results were obtained. The results of the reordering tests suggest that the decision function based on the six Py/GC peaks mainly incorporates chemical information in separating the pyrochromatograms of the CF heterozygotes from those of the normals.

An eigenvector projection corresponding to the first two eigenvalues of the 144 Py/GCs, each represented by these six peaks, is shown in Figure 5. The first two principal components account for 60% of the total variance. Although some pattern grouping indicative of batch and column effects is observed, we do not believe that such effects are important in the classification process. The two main classes are substantially, but not completely, separated in this two-dimensional approximation to the six-dimensional space of case A.

The ability of the decision function to predict the class of a simulated unknown sample was tested by using a procedure known as validation. Twelve sets of Py/GCs were developed by random selection, where the training set contained 44 triplicates and the prediction set contained the remaining 4 triplicates. Any particular triplicate was only present in one prediction set of the 12 generated.
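The partitioning used for this validation can be sketched as follows. It is a minimal illustration under stated assumptions: the function names and the random seed are hypothetical, donors are represented only by integer identifiers, and the key point shown is that all three replicates of a donor move together so that no triplicate is split between training and prediction sets.

```python
import random


def make_prediction_sets(n_samples=48, n_sets=12, seed=0):
    """Randomly divide sample (donor) identifiers into disjoint prediction
    sets of equal size; each sample appears in exactly one set."""
    rng = random.Random(seed)
    sample_ids = list(range(n_samples))
    rng.shuffle(sample_ids)
    size = n_samples // n_sets
    return [sample_ids[k * size:(k + 1) * size] for k in range(n_sets)]


def split(pyrograms, held_out):
    """pyrograms: list of (sample_id, replicate) tuples.
    All replicates of a held-out sample go to the prediction set."""
    held = set(held_out)
    train = [p for p in pyrograms if p[0] not in held]
    test = [p for p in pyrograms if p[0] in held]
    return train, test


# 48 samples analyzed in triplicate gives 144 pyrograms.
pyrograms = [(s, r) for s in range(48) for r in range(3)]
prediction_sets = make_prediction_sets()
train, test = split(pyrograms, prediction_sets[0])
print(len(train), len(test))  # 44 triplicates vs. 4 triplicates
```

A discriminant would be trained on `train` and scored on `test` for each of the 12 sets, and the scores averaged.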
Discriminants were developed by using the training sets and were tested on the prediction sets. The average correct classification for the prediction set members was 87%. This same experiment was repeated except that members of the prediction set included triplicated samples analyzed on the same column or grown in the same batch of growth media. The average correct classification for the prediction set in this set of runs was 82%. Although the predictive ability of the decision function was diminished when we took into account these confounding effects, favorable results were still obtained.

Case B. The second case to be discussed used a set of principal components to represent the entire Py/GC. Each Py/GC was divided from left to right into seven intervals of 28 peaks each, and a final interval of 18 peaks. A principal-components analysis was carried out for the peaks of each interval, and the two principal components with the largest eigenvalues were retained for each interval. The fractional variance explained by the two principal components ranged from 0.30 to 0.61 with a mean of 0.37. Principal components that were capable of yielding favorable classification results by themselves were discarded since they convey not only chemical information but probably unwanted experimental noise as well. This fact was later confirmed by reordering experiments. A discriminant was developed by using the remaining nine principal component descriptors, and it misclassified only ten of the Py/GCs (134 of 144 correct, 93%).

Data-reordering experiments for passage number and gender were performed, and the classification power of the discriminant for these parameters was noted. The degree of classification success for these experimental variables was also calculated for a hypothetical discriminant that encoded only chemical information and could successfully categorize every Py/GC into its respective CF heterozygote vs. normal class. We constructed an indicator variable of zeros and ones that yielded perfect classification when used as a decision function (that is, it contains just the class membership information for the CF heterozygote vs. normal classes).
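The baseline provided by such an indicator variable can be illustrated with the donor genders reported in the Experimental Section (heterozygotes: 4 male, 20 female; controls: 16 male, 8 female). How classification power was scored in the study is not stated, so the simple agreement fraction used below, the function name, and the 0/1 coding are assumptions.

```python
def best_agreement(class_labels, other_labels):
    """Fraction of samples for which a perfect class indicator also
    'classifies' a second binary variable, taking the better of the two
    possible pairings of the 0/1 labels."""
    n = len(class_labels)
    same = sum(1 for c, o in zip(class_labels, other_labels) if c == o)
    return max(same, n - same) / n


# Arbitrary coding: class 1 = heterozygote, 0 = control;
# gender 1 = female, 0 = male. One entry per donor (triplicates would
# simply repeat each entry three times and leave the fraction unchanged).
class_labels = [1] * 24 + [0] * 24
gender_labels = [0] * 4 + [1] * 20 + [0] * 16 + [1] * 8
agreement = best_agreement(class_labels, gender_labels)
print(agreement)
```

Because gender is unevenly distributed between the two groups, even a discriminant carrying only class information would sort donors by gender at this rate, which is the baseline against which an actual discriminant's gender classification would be compared.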
The data set can then be reordered with respect to any experimental variable such as passage number, and the dichotomization power of the indicator variable can be assessed for the reordered data set. This experiment determines the classification power of a discriminant that only encodes chemical information for any one of the experimental parameters. A comparison of the classification results for these two decision functions (passage number and donor gender) indicated that the principal-component-based discriminant encoded little if any information about these two experimental variables. However, reordering experiments for batch number and principal component representations of the pattern space did reveal a batch effect. One group of normal samples grown in the same batch of growth medium is clustered and distinctly separated from the other samples of the normal and CF classes. It is our opinion, however, that the remaining carriers and normals are being separated largely on the basis of a desired chemical difference between the two classes. A summary of the classification results for the decision functions described in cases A and B is presented in Table I.

Throughout these discriminant-development studies, we have noticed that several Py/GCs have been misclassified more often than most. Several were from the CF-carrier class and several were from the normal class. They are outliers in the sense that they are different from the majority of the members of their class. For example, a CF obligate heterozygote that was often misclassified was an olive-complected woman, whereas the other heterozygote donors were fairer complected. It may also be that one of the presumed-normal controls could have been a CF heterozygote. One normal sample was often misclassified, and we cannot account for this fact using the clinical information available to us. The probability of finding a CF heterozygote in a random control population of 24 is greater than 70%.
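The 70% figure can be checked with a short calculation. The sketch assumes that the frequency of 0.05 quoted in the introduction is used as the per-person probability of being a carrier and that the 24 controls are independent draws from the population; neither assumption is stated explicitly in the text, and the function name is hypothetical.

```python
def prob_at_least_one_carrier(carrier_frequency, n_controls):
    """P(at least one carrier among n independent controls)
    = 1 - (1 - f) ** n."""
    return 1.0 - (1.0 - carrier_frequency) ** n_controls


p = prob_at_least_one_carrier(0.05, 24)
print(f"{p:.3f}")
```

Under these assumptions the probability comes out just above 0.70, in line with the statement in the text.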

Figure 6. Plots of the distributions of the relative standard deviations (RSD, plotted against the mean) for (a) unnormalized data and (b and c) normalized data. The normalization factor used in (b) was the summation of the peak areas appearing in the 214 standard retention-time windows. The factor used in (c) was the summation of the total peak areas as calculated by the GC peak integrator.

We have observed during these studies that normalization of the data has little effect on the results. This has also been reported by other workers (40). The grand mean variation over all 214 peaks and the distribution of the mean variation for each of the 48 samples for the unnormalized data were nearly identical to those for the normalized chromatograms. In Figure
