Ind. Eng. Chem. Res. 1992, 31, 1989-1998
Exploratory Data Analysis Using Inductive Partitioning and Regression Trees

Don Shyan-Shu Shieh and Babu Joseph*

Chemical Engineering Department, Washington University, St. Louis, Missouri 63130-4899

* To whom correspondence should be addressed.

Improved control of product quality can only be achieved through increased understanding of the causes of quality variations. Usually this is achieved through a detailed analysis of the process variables and their interrelationships. This step is very time consuming because large volumes of data are generated during the operation of a process. In this article we present a method for exploratory data analysis that can be used to scan the process data to suggest relationships among variables. The method is based upon a combination of inductive techniques (developed by computer scientists for machine learning) and classical regression methods. The method is illustrated using two example problems. The results show the power and efficiency of the proposed algorithm. The method should prove useful to engineers and scientists who are interested in analyzing the past operational history of a process to understand causes of quality deviation.

1. Introduction

Recently there has been much emphasis in the manufacturing sector on implementing statistical quality control techniques. The basic concept emphasized here is proper monitoring of the process through statistically rigorous methods to detect off-quality product. While detection of off-spec product is an important first step, real productivity gains are obtained when the cause of this deviation is identified and corrected. Sometimes this may require minor adjustment of the operating conditions. Other situations may require more drastic action such as improving the quality of the feed material. The identification of processing conditions that lead to quality deviations is a challenging task for the process engineer.

The objective of this paper is to present some methodologies available from statistics for analyzing the historical log of process data to discover possible causes of product quality deviations. In addition we propose an algorithm for exploratory data analysis based on inductive methods from artificial intelligence and classical regression techniques. Application of the algorithm to a simulated autoclave curing process illustrates its potential as a tool for exploratory data analysis. The algorithm can serve as another diagnostic tool for the process engineer who is interested in examining the data for the purpose of identifying the underlying causes of quality variation. The quantitative nature of the output resulting from the algorithm is useful for determining the remedial actions needed and for determining what further experimentation is needed to build a processing model.

The scope of this paper is limited to data that are not time dependent. We primarily focus on data from batch processes where quality varies from batch to batch depending on raw material quality and processing conditions. The methodology is also applicable to analysis of steady-state data from continuous processing plants. The widespread installation of process control computers facilitates the acquisition and storage of historical operational data in machine-readable form. This, combined with access to very powerful computers, should make it more attractive to examine operational history to identify and correct quality control problems associated with a process.

2. Background

2.1. Nature of the Problem. The problem can be stated mathematically as follows. Given a dependent variable, Y (for example, quality of product), and a set of


independent variables, X1, X2, X3, ..., Xn, representing variables that are measured and logged on the process, derive a relationship

Y = f(X1, X2, ..., Xn)   (1)

Typically an operational log consists of a table of Y and X. Some important characteristics of f, Y, and X in this problem are the following: (i) The vector X may be incomplete (some important variables which affect Y may not be measured). (ii) Y and X are both subject to noise. (iii) f is typically nonlinear. (iv) The Xi may be correlated with each other. (v) The variations in Xi may not be totally random, and data are usually bunched up around an operating point. (vi) Some of the variables in X and Y may be categorical (i.e., qualitative in nature, such as high, low, etc.).

The classical approach would be to seek a regression relationship between Y and X. This generally does not work with routine data for a number of reasons, including (i) the nonlinear nature of the relationship, (ii) the data are not totally random, so that the validity of the regression is in question, and (iii) the Xi can be correlated with each other, in which case the regression may not work. Furthermore, regression assumes Y and X are both numerical variables, whereas in practice some of the Xi and Y may be categorical.

Some of these problems can be overcome by data pretreatment. Correlated variables can be removed using prior knowledge or using principal component analysis (Draper and Smith, 1981) and partial least squares (Geladi and Kowalski, 1986) techniques. However, the problems of nonlinearity and interaction among variables still remain. Lack of randomness, however, does not restrict the use of exploratory tools which explore the potential structure of the operational data (Bernstein et al., 1988).

Other approaches to obtaining hidden relationships among Y and X in the data include classification and induction methods. These are covered in detail in the books by James (1985) and Weiss and Kulikowski (1991) and are briefly discussed next.

2.2. Statistical Classification Methods. There are three major statistical classification methods, namely, the Bayes rule, the linear discriminant function, and the k-nearest-neighbor (KNN) method.

The classifier based on the Bayes rule is used for categorical data. Categorical data are composed of a set of finite values which are orderless. Sometimes numerical data and categorical data are distinguished as quantitative



and qualitative. Application of the Bayes rule requires knowledge of prior probabilities and class conditional probability density functions (pdfs). The calculation of these probabilities, especially the pdfs, needs a large volume of data. This makes the Bayes rule very impractical even though it usually has the best classification performance.

The linear discriminant function is usually applied to problems which have a categorical dependent variable and numerical independent variables. The linear discriminant function method derives a set of linear equations to be used as a classifier. There are several methods to obtain discriminant functions. Three popular methods for the calculation of linear discriminant equations from a training set are the Perceptron learning algorithm, the Fisher method, and the regression method (James, 1985; Young and Calvert, 1974). The major drawback of linear discriminant methods is that the hyperspace of a training set is not always linearly separable. In such cases, a high misclassification rate can be expected.

For the k-nearest-neighbor method, a training set of n objects (or data records) is stored. When a new object is observed, the distance between this new object and each object in the training set is calculated. Based on these distances, the k objects which are nearest to the new object are found. Accordingly, the new object is classified as the class of the majority of these k nearest neighbors. The classifier in this method is simply the storage of a training set in memory. This method is reported to have a very low misclassification rate. However, it also suffers the highest expense in terms of computation and memory.

These classifiers have been widely used and their performance has been justified in the area of pattern recognition, where a major concern is to search "how" to assign an object correctly to one of the classes. However, in some areas, an explanation of "why" is no less important than a method of "how". For example, a good classifier in the diagnosis of a disease is unlikely to be accepted unless it can provide a meaningful explanation to the physician other than a raw indication of discriminant scores. The search for comprehensible classifiers led to the development of induction methods.
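As a concrete illustration of the k-nearest-neighbor classifier described above, the sketch below stores a training set and assigns a new object the majority class of its k closest records; the Euclidean distance and the small demonstration data are assumptions made only for illustration.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    """Assign x_new the majority class of its k nearest training records."""
    # Euclidean distance from the new object to every stored object
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest records
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]             # majority class

# Hypothetical two-attribute training set with classes "high"/"low"
X_train = np.array([[1.0, 0.2], [1.1, 0.3], [3.0, 2.1], [3.2, 2.0]])
y_train = np.array(["low", "low", "high", "high"])
print(knn_classify(X_train, y_train, np.array([2.9, 1.8]), k=3))  # -> "high"
```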
2.3. Induction Methods. The first inductive learning algorithm for generating a decision tree (also called a classification tree) from a training set is the Conceptual Learning System presented by Hunt and co-workers and fully documented in their book (Hunt et al., 1966). However, this algorithm is considered computationally expensive, and Quinlan (1983, 1987) presented an efficient algorithm (called ID3) using concepts derived from information theory. The algorithm selects the attribute which has the largest information gain to partition the training set (called the root node) into subsets (called branches). This process is repeated until a tree is generated. The partitioning of a node terminates when the class of the objects in the node is unique.

Induction methods attracted a great deal of attention from the artificial intelligence (AI) community. The terminology used in the AI literature differs from that used by statisticians. To avoid ambiguity, these terms are defined as follows. An attribute is an independent variable; a class variable (sometimes just class for short) is a dependent variable; a record or an object is called an instance; and a decision tree is a classifier.

ID3 uses Shannon and Weaver's theory of information (Shannon and Weaver, 1964). The amount of information in a message is

I(Pi) = -Pi log2 Pi   (2.1)

where Pi is the probability of the delivered message.

Table I. An Example of a Training Set S (six instances, each described by four binary attributes A, B, C, and D and a binary class, p or n; three instances belong to class p and three to class n)

Considering the training set S (Wong and Wong, 1987) shown in Table I, S includes six instances, each of which consists of four attributes (A, B, C, and D) and a class. All of the attributes and the class (which is to be classified) have two possible values. A complete decision tree is regarded as a source of messages telling when the class is "p" and when the class is "n". Given the training set S, the information content of the complete decision tree can be calculated as

Inf(S) = -Σ(i=p,n) Pi log2 Pi   (2.2)

where Pi is the probability of class i in the training set S. For the present training set in Table I,

Inf(S) = -(3/6) log2 (3/6) - (3/6) log2 (3/6) = 1

The ID3 algorithm selects the attribute which delivers the largest amount of information to partition the training set into subsets. Suppose A is chosen at the root node to branch the data set into two subsets; one subset is grouped by the instances (objects) having A = 0 while the other subset is grouped by the instances having A = 1. The information content of the two subtrees (called branches) based on these two subsets is calculated by

Inf(SA) = Σ(Ai=0,1) PAi Inf(SAi)   (2.3)

where PAi is the weight of the subtree whose instances all have attribute A = Ai, and Inf(SAi) is the information content of that subtree, calculated as above. Therefore, for the case where A is the selected attribute,

Inf(SA) = PA=0 Inf(SA=0) + PA=1 Inf(SA=1)
        = (2/6)(-(2/2) log2 (2/2)) + (4/6)(-(1/4) log2 (1/4) - (3/4) log2 (3/4))
        = 0.540852

The information gained when branching the root node into two subtrees using attribute A is

G(A) = Inf(S) - Inf(SA) = 1 - 0.540852 = 0.459148   (2.4)

Similarly,

Inf(SB) = 0.540852   G(B) = 0.459148
Inf(SC) = 1.0        G(C) = 0
Inf(SD) = 0.918296   G(D) = 0.081704

Therefore, choosing A or B provides the largest information gain; i.e., it maximizes G. In other words, A or B delivers the greatest amount of information at the root node; either one may be selected. Two subtrees are subsequently formed, each of which is treated as another individual tree and is further divided recursively until a terminal node is reached. At a terminal node, the classes of the instances are identical and the information content of the terminal node is zero.
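The information-gain bookkeeping of eqs 2.2-2.4 is easy to verify numerically. The sketch below is an illustrative check, not the authors' code; the six-instance data set is an assumption chosen only to reproduce the 3 p / 3 n class split and the 2/4 partition on attribute A discussed above.

```python
import math
from collections import Counter

def info(labels):
    """Inf(S) = -sum_i Pi log2 Pi over the classes present (eq 2.2)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(records, labels, attribute):
    """G(attribute) = Inf(S) - sum_v Pv Inf(Sv)  (eqs 2.3 and 2.4)."""
    n = len(labels)
    total = info(labels)
    for value in set(r[attribute] for r in records):
        subset = [lab for r, lab in zip(records, labels) if r[attribute] == value]
        total -= (len(subset) / n) * info(subset)
    return total

# Hypothetical six-instance set with the class split used in the text:
# two instances with A = 0 (both class "p"), four with A = 1 (one "p", three "n").
records = [{"A": 0}, {"A": 0}, {"A": 1}, {"A": 1}, {"A": 1}, {"A": 1}]
labels  = ["p", "p", "p", "n", "n", "n"]
print(round(gain(records, labels, "A"), 6))   # 0.459148, as in eq 2.4
```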

The final decision tree, shown in Figure 1, can be converted to the following production rules: 1. If A = 0 then class = p. 2. If A = 1 and B = 0 then class = p. 3. If A = 1 and B = 1 then class = n.

Figure 1. A decision tree.

ID3 deals with data that are restricted to having categorical attributes and a categorical class. To apply ID3 to the case of operational data which have numerical attributes, it is necessary to convert these numerical data into nominal categories. The conversion is not only inconvenient but also arbitrary. The accuracy of the data is also lost in this process.

CART (Classification and Regression Tree). Breiman et al. (1984) developed another induction method, CART, which uses an ID3-like algorithm to generate a decision tree from a training set whose attributes can be numerical. To find the attribute which gives the maximum information gain G (used to partition a node), Breiman and co-workers proposed calculating G(Xij) along the interval [a, b] of each attribute Xi with a fixed step size (e.g., 0.1) and choosing the attribute whose G at its best point Xi* is the largest among all attributes. For example, consider a problem with four attributes X1, ..., X4 in the training set and let us say X1 spans the range [0, 10]. Using a step size of 0.1, 99 points are examined to find the best information gain on the interval of X1, G*(X1). This process is repeated for the rest of the attributes. The best information gain among G*(X1), ..., G*(X4) is chosen for partitioning the data set. Using this approach, eqs 2.2 and 2.3 can be written as follows; the resulting tree is shown in Figure 2.

G(Xij) = Inf(S) - PSL Inf(SL) - PSR Inf(SR)   (2.5)

Inf(S) = -Σi Pi log2 Pi,  i = w1, ..., wc   (2.6)

where PSL is the probability of the subinterval Xi < Xij, i.e., NSL/NS; PSR is the probability of the subinterval Xi ≥ Xij, i.e., NSR/NS; Pi is the probability of class wi, i.e., Ni/NS; NS is the number of instances in the training set S; NSL is the number of instances in the subset SL where Xi < Xij; NSR is the number of instances in the subset SR where Xi ≥ Xij; Ni is the number of instances whose class is wi in the training set S; and w1, ..., wc are the finite set of categorical values of the class variable.

Figure 2. Branch in the CART algorithm.

Besides eqs 2.2 and 2.3, which were suggested by Quinlan, other indices which can be used to calculate G include the Gini index and the Twoing index (Breiman et al., 1984):

Gini index

G(Xij) = Inf(S) - PSL Inf(SL) - PSR Inf(SR)   (2.7a)

Inf(S) = Σ(i≠j) Pi Pj   (2.7b)

Twoing index

G(Xij) = PSL PSR [Σi |P(wi|SL) - P(wi|SR)|]^2   (2.8)
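The threshold scan that CART performs over a numerical attribute, as described above (eq 2.5), can be sketched as follows; the function names, the entropy-based impurity, and the tiny example data are assumptions made for illustration.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a set of class labels (eqs 2.2/2.6)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(x, labels, step=0.1):
    """Scan thresholds over [min(x), max(x)] with a fixed step and
    return (best threshold, best gain) according to eq 2.5."""
    n, parent = len(labels), info(labels)
    best_t, best_g = None, -1.0
    t = min(x) + step
    while t < max(x):
        left  = [lab for xi, lab in zip(x, labels) if xi <  t]
        right = [lab for xi, lab in zip(x, labels) if xi >= t]
        if left and right:
            g = parent - len(left) / n * info(left) - len(right) / n * info(right)
            if g > best_g:
                best_t, best_g = t, g
        t += step
    return best_t, best_g

# Hypothetical numerical attribute with a categorical class
x      = [0.5, 1.2, 2.8, 3.4, 4.1, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]
print(best_split(x, labels))   # a threshold between 2.8 and 3.4 separates the classes
```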

The generation of decision trees is insensitive to the choice of the indices used for calculating G (Breiman et al., 1984; Mingers, 1989; Mantaras, 1991); this point has been confirmed in our preliminary study. To date, no theoretical comparison or analysis of the advantages and disadvantages of these indices has been published.

The advantages of these inductive algorithms are as follows: (1) The algorithms are very simple and can be easily coded. (2) A large amount of data can sometimes be condensed into a small decision tree. (3) The resulting decision tree is fully comprehensible and can be easily verified by experts in the problem domain. (4) The decision tree format can be readily converted to the production rules format (e.g., for implementation into an expert system).

There are also several disadvantages associated with the induction methods. First, the resulting decision trees are sometimes large and impractical, especially when the training set is corrupted with noisy data. In this case, either the data should be pretreated or the decision tree should be pruned. If the data are meaningful, a comprehensible decision tree is almost always obtained. If the data are incremental and noisy, we may get a decision tree from the first batch of data and subsequently a slightly different decision tree from the second batch of data. This leads to the second problem: the reconciliation of decision trees resulting from different batches of data. The combination of these two characteristics severely restricts the application of ID3 to many problems that we are interested in. In the following section, a composite algorithm called inductive partitioning and regression tree (IPRT), which combines induction and regression, is proposed to overcome these limitations.

2.4. Feature Selection. In a commercial chemical process environment, up to hundreds of measurements are taken. The direct application of data analysis such as regression analysis usually needs tens of thousands of data points. For instance, a quadratic regression analysis with 100 independent variables will estimate 5151 parameters in the regression equation. A rule of thumb (Draper and Smith, 1981) estimates that the number of data points needed is roughly 10 times the number of parameters. If the number of independent variables to be examined can be reduced to 10, then only 66 parameters need to be estimated. This is a 78-fold reduction in data size. Other factors that favor the reduction of the feature dimension before conducting knowledge acquisition include the following:

1. The redundant information from highly correlated variables can be eliminated without the loss of information (problem of collinearity).

2. Hand (1981) has shown that while using linear discriminant functions, the misclassification rate initially decreases as the number of features increases, but the method begins to degrade with even more features. The same thing happens in regression analysis. If the number of data points is fixed, then increasing the dimension of the feature vector leads to the problem of overfitting.

3. The fewer the number of independent variables, the easier it is for the scientist or engineer to analyze the results.

Therefore, the selection of the relevant independent variables becomes an important issue in this study. In order to choose the best subset of features from the original set of features for the classifier, it is necessary to have an index on which feature selection is based. The most direct and reliable index is the misclassification rate (also called the error rate). The misclassification rate is the percentage error in classification when a test data set is classified by a classifier obtained using a training set. However, a meaningful classifier may not be obtainable until the irrelevant variables are sorted out from a large set of independent variables by feature selection. Instead of the error rate, some other measures are typically used. A commonly used measure is Wilks' lambda (James, 1985), which measures the ratio of the dispersion of objects within a group to the dispersion of the total population for a selected feature subset of d dimensions. Wilks' lambda is defined as

J = |Sw| / |T|,   T = Sw + Sb

Sw = Σi P(wi) Ei{(x - μi)(x - μi)^T}

Sb = Σi P(wi) (μi - μ)(μi - μ)^T

where Sw is the within-class covariance matrix, Sb is the between-class covariance matrix, μi is the mean vector in class wi, μ is the mean vector of the population, x is the vector of points in the hyperspace, and P(wi) is the probability of class wi.

Another popular index used for feature selection is the conditional F ratio. It is an F test from an analysis of covariance that tests the difference between the group means (conditional on the variables which have already been selected). The conditional F ratio is commonly used as a stopping rule. The feature selection process stops when none of the conditional F ratios of the remaining variables is significant at some level (in SAS, a popular package for statistical analysis, the default value is 0.15).

One of the most effective and efficient methods for feature selection is stepwise selection (Hand, 1981). It starts with no features in the selected subset. Features are added one at a time to this subset. The feature having the smallest J (or the largest conditional F ratio) is first added to the empty subset. The next feature is selected from the remaining set such that, when it is paired with the first selected feature, it gives the smallest J. This is called forward selection. Then the backward selection procedure starts, which examines the decrease in the goodness measure when one of the features in the selected subset is deleted. If the decrease is below a specified threshold, then this feature is deleted. The cycle of forward and backward selection is repeated until the measure of goodness cannot be significantly improved by further inclusion of features. The final subset of features represents the dimensions of the hyperspace in which the data are well separated into groups according to their classes.

There is another procedure for feature extraction in statistical pattern recognition which reduces the number of dimensions by linearly transforming the original dimension D of features into a smaller dimension, d. The algorithm is principal component analysis. Because the resulting d transformed features are not physically meaningful, this type of feature extractor will not be considered in this study.
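A rough sketch of the forward/backward stepwise procedure described above, scored with Wilks' lambda, is given below. The function names, the fixed improvement threshold used as a stopping rule, and the scatter-matrix formulation are assumptions for illustration; the case studies later in the paper use the SAS STEPDISC procedure instead.

```python
import numpy as np

def wilks_lambda(X, y, features):
    """J = |Sw| / |T| for the columns listed in `features` (smaller is better)."""
    Xs = X[:, features]
    mu = Xs.mean(axis=0)
    T = sum(np.outer(r - mu, r - mu) for r in Xs)           # total scatter
    Sw = np.zeros_like(T)
    for cls in np.unique(y):
        Xc = Xs[y == cls]
        mc = Xc.mean(axis=0)
        Sw += sum(np.outer(r - mc, r - mc) for r in Xc)     # within-class scatter
    return np.linalg.det(Sw) / np.linalg.det(T)

def stepwise_select(X, y, threshold=0.01):
    """Forward steps add the feature that most reduces J; backward steps drop a
    feature whose removal barely changes J."""
    selected, remaining = [], list(range(X.shape[1]))
    best = 1.0
    improved = True
    while improved and remaining:
        improved = False
        # forward step: try each remaining feature and keep the best
        scores = [(wilks_lambda(X, y, selected + [f]), f) for f in remaining]
        j, f = min(scores)
        if best - j > threshold:
            selected.append(f)
            remaining.remove(f)
            best, improved = j, True
        # backward step: delete any feature whose removal barely changes J
        for f in list(selected):
            rest = [g for g in selected if g != f]
            if rest and wilks_lambda(X, y, rest) - best < threshold:
                selected = rest
                remaining.append(f)
                best = wilks_lambda(X, y, rest)
    return selected

# Example usage: X is an (n_records, n_features) array, y the class labels
# selected = stepwise_select(X, y)
```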

Table II. Characteristics of Statistical and Induction Methods

     multiple regression                    induction
1.   assumes linear systems                 no such assumption needed
2.   assumes randomly generated data        no such assumption needed
3.   quantitative data                      qualitative (or mixed) data
4.   results are mathematical formulas      results are production rules
5.   insensitive to noise                   sensitive to noise
6.   compact representation                 decision trees can be large

Recall that the primary objective is exploratory data analysis, and hence we need to obtain relationships among physically meaningful variables. However, this method may have applications in the use of artificial neural networks to model the process using operational data (Joseph et al., 1992).

One assumption made in the application of these feature selection algorithms is that the attributes are randomly distributed within their range of operation. In a typical operation log, this may not be true since a particular input variable may have remained constant during that period. If this is the case, feature selection will (incorrectly) reject this attribute as being irrelevant. Hence, the conclusions drawn from the feature selection should be limited to the range of the input data set provided.

3. Inductive Partitioning and Regression Trees (IPRT)

Both induction methods and regression analysis have characteristics that are desirable when processing routine operational data. Chemical process systems are usually nonlinear and have strong interactions among the process variables. These characteristics favor the use of induction methods over regression analysis. However, operational data are usually quantitative. Also, the most desirable knowledge for use in process control is in the form of quantitative formulas which can be used for estimation or prediction. These characteristics render the use of induction algorithms less appealing. Table II compares the characteristics of induction methods with those of regression techniques.

In this section, we propose a preprocessing algorithm, based on Breiman's inductive classification algorithms, to divide the operational data into groups within which we have a greater chance of finding a numerical relationship among the variables, since there is less nonlinearity and interaction. Very often, one independent variable has a more significant impact on the output quality than the other independent variables. Nonlinearity and the extent of interaction can vary with the range of the independent variables. Therefore, it is meaningful to separate the data into groups using the most significant independent variable to do the partitioning. Consequently, the range of variation of the output variable in each divided group is shortened and the skewness is smoothed. There is also a better chance of obtaining a good linear or quadratic model within a smaller range of variation in the variables. It is important to point out that the partitioning is done on the range of the independent variables rather than the output variable.

In order to implement inductive partitioning, i.e., the calculation of information gain by either eq 2.5, 2.7, or 2.8, it is necessary to convert quantitative dependent variables into categorical dependent variables if the operational data consist of quantitative dependent variables. In the proposed algorithm, multiple quadratic regression is suggested instead of the commonly used linear regression because the operational data are nonlinear. Let us first

consider problems involving only quantitative data. Extensions to problems with categorical data will be considered later. The IPRT algorithm can be stated as follows (a code sketch is given at the end of this section):

Step 0. Assume a set of operational data is given. Divide the data into two sets: a training set and a test set. Conduct feature selection to find the independent variables relevant to the output variable. Perform multiple quadratic regression on the training data set. If the fit is good, then stop; otherwise proceed with the following steps using the training set.

Step 1. Convert the dependent variable, Y, from its numerical values into categorical values. Some guidelines for this step are suggested below.

Step 2. Find the feature leading to the largest information gain as outlined earlier in the CART algorithm. Group the training set into two subsets using this feature.

Step 3. For each subset, perform multiple quadratic regression analysis.

Step 4. Test the goodness of fit on the training set using R2 as a measure. If R2 indicates that the training set data are well fitted, then go to step 5; otherwise, repeat step 1 for each subset.

Step 5. Test the goodness of fit on a test set. If R2 indicates that the data are well fitted, then stop; otherwise, repeat step 1 for each subset.

For step 1, depending on the distribution of Y, the following two rules are recommended. (1) If the distribution of Y in the training set is normal, convert Y into a two-category variable using the mean or median, Ȳ, to split the set. In other words, if Y > Ȳ, then the class of categorical Y is high (H); otherwise the class is low (L). (2) If the distribution of Y is skewed, then convert Y into two or three categories such that, between each group of categories, the means of numerical Y are far separated. Also aim to produce subsets of data which have a similar number of data records. For most of the cases that we studied, the algorithm is not sensitive to the rules used to convert numerical Y into categorical data. Occasionally, when the distribution of Y is highly skewed, the second rule of conversion is necessary.

It is important to maintain a reasonable number of data records in each node so that a meaningful regression can be performed; too few records will lead to overfitting. A heuristic rule recommends at least twice as many data records as the number of parameters in the regression model.

The selection of a minimum R2 above which a good fit is assumed is very subjective and usually depends on the context in which it is used. To determine an association relationship, i.e., correlation, the R2 threshold could be as low as 0.30 in a social science domain. A high R2 threshold leads to large decision trees while a low R2 threshold yields imprecise knowledge. A value of R2 of at least 0.85 for the test data set is recommended for reasonable results.
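A minimal sketch of the partition-then-regress loop in steps 0-5 is given below, assuming feature selection (step 0) has already been performed. The helper names, the mean split used to convert Y into categories (rule 1 of step 1), the scan of candidate thresholds for the information gain, and the minimum-records stopping rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

def quad_design(X):
    """Design matrix for a full quadratic model: 1, Xi, Xi*Xj (i <= j)."""
    n, p = X.shape
    cols = [np.ones(n)] + [X[:, i] for i in range(p)]
    cols += [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack(cols)

def quad_fit_r2(X, y):
    """Least-squares quadratic fit; returns (coefficients, R2 of eq 4.4)."""
    A = quad_design(X)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return beta, 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y_cat):
    """Best (feature, threshold) by information gain (eq 2.5)."""
    parent, best = entropy(y_cat), (None, None, -1.0)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j])[1:]:
            left, right = y_cat[X[:, j] < t], y_cat[X[:, j] >= t]
            g = (parent - len(left) / len(y_cat) * entropy(left)
                 - len(right) / len(y_cat) * entropy(right))
            if g > best[2]:
                best = (j, t, g)
    return best[0], best[1]

def iprt(X, y, r2_min=0.85, min_split=30):
    """Return a nested dict: leaves hold quadratic models, internal nodes hold splits."""
    beta, r2 = quad_fit_r2(X, y)
    if r2 >= r2_min or len(y) < 2 * min_split:   # good fit, or node already small
        return {"model": beta, "r2": r2}          # terminal node (steps 3-5)
    y_cat = np.where(y > y.mean(), "H", "L")      # step 1: categorize Y at its mean
    j, t = best_split(X, y_cat)                   # step 2: inductive partition
    mask = X[:, j] < t
    return {"split": (j, t),
            "left":  iprt(X[mask],  y[mask],  r2_min, min_split),
            "right": iprt(X[~mask], y[~mask], r2_min, min_split)}
```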

4. Case Studies

In this section, two example problems are used to demonstrate the use of the IPRT algorithm. A comparison will be made between simple multivariate quadratic regression without IPRT (which is called a one-node tree) and the IPRT algorithm (which leads to a multiple-node tree). Before employing the IPRT algorithm, feature selection is performed using the procedure STEPDISC of the SAS package (SAS Institute, 1988). In this procedure, the significance level of the conditional F ratio was set at 0.15 (the default value in SAS).

The quality of the results is measured in terms of the ability to accurately predict Y. There are many ways to index accuracy. The common one is the percent error, i.e.,

errori (%) = [(Yi - Ŷi)/Yi] × 100%,  i = 1, ..., n   (4.1)

where Yi is the observed value, Ŷi is the prediction of Yi, and n is the number of records in the test set. The percent error is not a good index of performance when Y varies over a wide range. Therefore, the average of the absolute errors, ē, as defined in (4.3), will be used as an index of performance:

ei = |Yi - Ŷi|   (4.2)

ē = (1/n) Σ(i=1 to n) ei   (4.3)

where n is the number of records in the test set. In addition, R2 is also utilized in this study. Generally, R2 is defined as

R2 = 1 - Σ(Yi - Ŷi)^2 / Σ(Yi - Ȳ)^2   (4.4)

where Ȳ is the mean of the Yi. R2 measures how much of the variation in Yi can be explained by the regression. R2 is considered to be an objective measure insensitive to the scaling of Y in different cases. Therefore, R2 is used in IPRT to determine if further partitioning of a node is needed. For linear regression

Σ(Yi - Ȳ)^2 = Σ(Ŷi - Ȳ)^2 + Σ(Yi - Ŷi)^2   (4.5)

so that R2 can be calculated as

R2 = Σ(Ŷi - Ȳ)^2 / Σ(Yi - Ȳ)^2   (4.6)
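The two accuracy measures used in the case studies, ē of eq 4.3 and R2 of eq 4.4, can be computed directly; the small helpers below are illustrative and assume numpy arrays of observed and predicted values.

```python
import numpy as np

def mean_abs_error(y_obs, y_pred):
    """ē = (1/n) Σ |Yi - Ŷi|  (eqs 4.2-4.3)."""
    return np.mean(np.abs(np.asarray(y_obs) - np.asarray(y_pred)))

def r_squared(y_obs, y_pred):
    """R2 = 1 - Σ(Yi - Ŷi)^2 / Σ(Yi - Ȳ)^2  (eq 4.4)."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return 1.0 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)
```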

However, for quadratic regression, eq 4.4 is not equal to eq 4.6. Since multiple quadratic regression is used throughout this work, R2 is calculated from eq 4.4. Sometimes a method models the whole population well except for a few outliers. Since the sum of the squares of the errors is used in the calculation of R2, one exceptional outlier can hide good performance on the rest of the data. For such cases, the mean of the absolute errors, ē, gives a better performance measurement since it weighs every point equally. In both examples below, the performance of the IPRT algorithm will be presented in terms of these two measures, i.e., ē and R2, as well as a graph. A graph of the observed Y versus the predicted Y gives a clear, direct sense of how good the mapping is.

4.1. Example 1. The following set of equations is used to demonstrate the IPRT algorithm:

X1 < 3.0 and X3 < 3.5:   Y = X1X2X3 + ε   (4.7a)
X1 ≥ 3.0 and X3 < 3.5:   Y = 4X1X2X3 + ε   (4.7b)
X1 < 3.0 and X3 ≥ 3.5:   Y = 8X1X2X3 + 5X3 + ε   (4.7c)
X1 ≥ 3.0 and X3 ≥ 3.5:   Y = 8X2X3 + 5X3 + ε   (4.7d)

where 1 ≤ X1 ≤ 5, 0.5 ≤ X2 ≤ 1.5, and 1 ≤ X3 ≤ 5. The response of Y to X1, X2, and X3 is different in the four regions. We have deliberately chosen different functional relationships in different regions of X to see if they can be recovered using IPRT. Y is a nonlinear function of X in each region. In (4.7d), there is no effect of X1 on Y. A mild noise ε, randomly generated between [0, 1], is added. Two spurious variables, X4 and X5, are added to the records.
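The tutorial data set of eqs 4.7a-d can be regenerated as sketched below. The uniform sampling of X1, X2, and X3 over their stated ranges follows the text; the ranges chosen for the spurious variables X4 and X5 and the random seed are assumptions, since they are not specified.

```python
import numpy as np

rng = np.random.default_rng(0)   # seed chosen arbitrarily for reproducibility

def make_example1(n):
    """Generate n records following eqs 4.7a-4.7d plus two spurious variables."""
    X1 = rng.uniform(1.0, 5.0, n)
    X2 = rng.uniform(0.5, 1.5, n)
    X3 = rng.uniform(1.0, 5.0, n)
    X4 = rng.uniform(0.0, 1.0, n)    # spurious (range assumed)
    X5 = rng.uniform(0.0, 1.0, n)    # spurious (range assumed)
    eps = rng.uniform(0.0, 1.0, n)   # mild noise on [0, 1]
    Y = np.where(X3 < 3.5,
                 np.where(X1 < 3.0, X1 * X2 * X3, 4 * X1 * X2 * X3),
                 np.where(X1 < 3.0, 8 * X1 * X2 * X3 + 5 * X3, 8 * X2 * X3 + 5 * X3)
                 ) + eps
    return np.column_stack([X1, X2, X3, X4, X5]), Y

X_train, y_train = make_example1(120)   # training set
X_test,  y_test  = make_example1(60)    # test set
```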


Table III. Results of Feature Selection for Example 1

step   variable entered   removed   conditional F ratio   signif level   Wilks' lambda
1      X3                 none      102.017               0.0001         0.2748
2      X1                 none      9.659                 0.0001         0.2195
3      X2                 none      7.158                 0.0002         0.1847


Table IV. Results of the Application of IPRT to Example 1

node                          mean of Y   training set R2   test set R2
1. One-Node Tree
  overall                     43.2        0.7680            0.8076
2. Two-Node Tree
  overall                     43.2        0.8442            0.8538
  X3 < 3.48                   19.8        0.9541            0.7965
  X3 ≥ 3.48                   61.7        0.6308            0.5173
4. Four-Node Tree
  overall                     43.2        0.9999            0.9996
  X3 < 3.48 and X1 < 2.98     2.9         0.9689            0.9009
  X3 < 3.48 and X1 ≥ 2.98     29.2        0.9989            0.9964
  X3 ≥ 3.48 and X1 < 3.01     81.0        0.9996            0.9981
  X3 ≥ 3.48 and X1 ≥ 3.01     46.0        0.9994            0.9987

Figure 3. Classification tree for example 1.

One hundred twenty records, randomly generated from this set of equations, are used as a training set while another 60 randomly generated records are used as a test set. For this tutorial case, the following results should be obtained. (1) X1, X2, and X3 should be selected by the feature selection. (2) A classification tree with nodes branched at X1 = 3.0 and X3 = 3.5 should be obtained from the partitioning of the training set. After partitioning, the resulting regression equations in each terminal node should be able to predict Y in the test set.

The results of the feature selection process are shown in Table III; the selected variables are listed in the order of selection. After three steps, X3, X1, and X2 are selected. In the following step, among the remaining variables, X5 has the largest conditional F ratio (1.408). Its corresponding significance level, 0.2441, is higher than the threshold. Therefore, X5 is rejected and the feature selection process stops. The selected set of significant variables is the same set that appears in eqs 4.7a-d.

Following the algorithm presented in section 3, a quadratic regression analysis is first performed on the training set. The result, R2 = 0.7680, is not an acceptable fit. Therefore, IPRT is employed. A threshold criterion of R2 = 0.85 is used. At the root node, the maximum information gains G of X1, X2, and X3 are calculated according to (2.5):

G(X1=2.996) = 0.0116
G(X2=1.130) = 0.0172
G(X3=3.478) = 0.2603

Using the best information gain, X3 = 3.48 is chosen to partition the training set into two subsets. One is the group in the LH node whose X3 is less than 3.48 while the other is in the RH node whose X3 is greater than or equal to 3.48. The subsets in the LH node and RH node have 53 records and 67 records, respectively. A quadratic regression is conducted following step 3 in the IPRT algorithm. In the LH node, the R2 obtained with the training set, 0.9541, is higher than the threshold. But in step 5, the R2 obtained with the test set, 0.7965, indicates that further partitioning in this node is needed. In the RH node, the R2, 0.6308, is below the threshold; further partitioning, therefore, is also needed. Repeating step 1 and step 2, the best information gain is obtained at X1 = 2.98 and X1 = 3.01 for the LH and RH nodes, respectively. With further partitioning, a four-node tree is generated as shown in Figure 3. Following step 3 through step 5, all R2 values are above the threshold for both the training set and the test set in each node; thus, no more partitioning is necessary.

The process of applying the IPRT algorithm to this example is summarized in Table IV. The results show that the means of Y in the four nodes are well separated. To show the improved performance of the resulting functional mapping, the results of quadratic regression analysis on the test set with/without IPRT (i.e., one-node tree/four-node tree) are shown in Figure 4 in terms of the three measures of accuracy: visual inspection of graphs of observed Y versus predicted Y, R2, and ē. It has been proven that the measure of information gain reflects the extent of association of attributes with the class (Mingers, 1987). This explains why the IPRT algorithm could select the most influential independent variables, X1 and X3, for the partitioning of the data set. Partitioning of the data set reduces the interactions among the data. The resulting tree partitions the data along the lines X1 = 3.0 and X3 = 3.5 as expected.

As mentioned in section 3, a system of nonlinear equations can be approximated by linearizing the system if the variations of the independent variables do not deviate too far from the mean values. Using a Taylor series expansion,

Y = f(X1, X2, X3) = a0 + a1X1 + a2X2 + a3X3   (4.9)

Table V. Linearization Coefficients for Example 1 (left block: linear equations obtained by linearization of eqs 4.7a-d; right block: linear equations obtained by linear regression analysis)

node    a0      a1      a2     a3     |   a0      a1      a2      a3
1       3.5     1.75    3.5    2      |   3.0     1.55    2.60    1.81
2       28      7       28     16     |   29.2    6.9     26.1    17.8
3       78.8    30      60     21     |   81.0    28.6    60.7    21.9
4       48.8    0       30     13     |   46.0    0       27.8    13

On applying linear regression analysis to the four-node tree, the following results were found: (1) The R2's of the linear regression equations are 0.9427, 0.9589, 0.9418, and 0.9811 for nodes 1, 2, 3, and 4, respectively (when performed on the test set). In other words, the original system can be approximated by four pieces of linear equations. (2) The resulting linear equations are in close agreement with the equations obtained by linearizing the original set of nonlinear equations in the proximity of X̄1, X̄2, and X̄3 for each node. A comparison of the coefficients ai is given in Table V. The coefficients are quite similar. This demonstrates that small variations of the most influential variables within the terminal nodes (made possible by the partitioning of the data set) allow a piecewise linearization of the nonlinear system.
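The piecewise linearization reported in Table V can be checked on synthetic data of the same form: fit a first-order model (eq 4.9) separately in each of the four regions and inspect the coefficients. The snippet below is an illustrative sketch (its data generation and seed are assumptions); the fitted coefficients should land near the values in Table V, up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(1)       # arbitrary seed
n = 120
X1, X2, X3 = rng.uniform(1, 5, n), rng.uniform(0.5, 1.5, n), rng.uniform(1, 5, n)
Y = np.where(X3 < 3.5,
             np.where(X1 < 3.0, X1 * X2 * X3, 4 * X1 * X2 * X3),
             np.where(X1 < 3.0, 8 * X1 * X2 * X3 + 5 * X3, 8 * X2 * X3 + 5 * X3)
             ) + rng.uniform(0, 1, n)

def linear_fit(x1, x2, x3, y):
    """Least-squares fit of Y = a0 + a1*X1 + a2*X2 + a3*X3 within one region (eq 4.9)."""
    A = np.column_stack([np.ones(len(y)), x1, x2, x3])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.round(coef, 2)          # [a0, a1, a2, a3]

for name, mask in [("node 1", (X1 < 3) & (X3 < 3.5)),
                   ("node 2", (X1 >= 3) & (X3 < 3.5)),
                   ("node 3", (X1 < 3) & (X3 >= 3.5)),
                   ("node 4", (X1 >= 3) & (X3 >= 3.5))]:
    print(name, linear_fit(X1[mask], X2[mask], X3[mask], Y[mask]))
```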

Figure 4. Comparison of observed and predicted Y for example 1 with and without IPRT (observed Y versus predicted Y; one-node tree: R2 = 0.8076, ē = 11.45; four-node tree: R2 = 0.9996, ē = 0.48).

Table VI. Brief View of Autoclave Process Data Generated by the Autoclave Simulator

description                 no. 1     no. 2     no. 3
void size (cm)              0.005     0.012     0.080
thickness (cm)              1.689     1.685     1.820
wt fracn of resin           0.38      0.418     0.436
impurity in feed            0.072     0.060     0.081
initial void size (cm)      0.0653    0.0520    0.0997
initial temp (K)            296.8     290.2     299.0
autoclave pressure (atm)    3.8       3.2       2.5
first holding temp (K)      382.3     378.1     384.2
second holding temp (K)     444.2     437.2     448.1

The data used in this study were generated randomly. Therefore, the inferences of the linear regression analysis are reliable. The high significance level of the F statistic of X1 in node 4, which is 0.7237, shows that X1 is only weakly associated with Y in this region. This trend is expected because Y is not a function of X1 in eq 4.7d. The significance of the F statistic for the remaining independent variables in every node is less than 0.001, which shows a very strong association between the independent variables X and the dependent variable Y. These inferences are verified by eqs 4.7a-d.

4.2. Example 2: Application to Autoclave Curing Data. The autoclave process is a batch reactor used to produce fiber-reinforced composite materials. There are strong interactions between autoclave temperature, autoclave pressure, and raw material properties (e.g., impurity contents, weight fraction of resin). This process is a good example to demonstrate how IPRT can be used to improve data mapping. The data are generated from an autoclave simulator which was developed at Washington University (Wu, 1990). Historical logs of process data were generated by randomly varying the process parameters around a fixed operation plan. Table VI shows a typical data set that might appear in the operational log of the process. The key processing variables are the quality of the final product (i.e., thickness and void size), the raw material properties (i.e., weight fraction of resin, impurity in the feed, and initial void size), and the operating conditions (i.e., initial temperature, autoclave pressure, first holding temperature, and second holding temperature).


Table VII. Results of Feature Selection for the Autoclave Process Data

step   variable entered   removed   conditional F ratio   signif level   Wilks' lambda
1      pr                 none      58.664                0.0001         0.3973
2      t2                 none      37.112                0.0001         0.2019
3      im                 none      12.114                0.0001         0.1526
4      t1                 none      7.114                 0.0002         0.1284

For this study, 280 records were collected for a training set and 98 records were used for a test set. Before employing the IPRT algorithm, the feature selection method (discussed in section 3) is used to find the independent variables responsible for the variation of the dependent variable, i.e., void size. Four out of the seven independent variables are selected, as shown in Table VII. Prior feature selection is important for quadratic regression because the number of samples needed to conduct a reliable analysis is proportional to the number of coefficients in the quadratic equation. Note that 28 coefficients in a seven-variable quadratic equation have to be estimated whereas 15 coefficients are needed in a four-variable equation. To avoid overfitting problems, the sample size needed for a seven-variable quadratic regression analysis is almost twice the sample size needed for a four-variable regression analysis.

Table VIII shows the detailed process of using the IPRT algorithm. The performance of the mapping steadily improved as the data set was partitioned further and further until a seven-node tree was obtained. Data mapping by quadratic regression analysis with/without IPRT, i.e., one-node tree/seven-node tree, is compared in Figure 5. The R2 values on the test set for the one-node tree and the seven-node tree are 0.4962 and 0.9616, respectively. Unlike the tutorial example above, the model of the autoclave process is very complex: it consists of several si-


Table VIII. Results of Using IPRT on the Autoclave Data. For the two-node, three-node, and seven-node trees, the table lists, for the overall data set and for each node, the training-set quadratic R2, the training-set linear R2, the mean void size, and the test-set R2; the nodes are defined by splits on autoclave pressure (pr), impurity in the feed (im), and first holding temperature (t1).

Table IX. Results of Linear Regression in Selected Nodes of the Seven-Node Regression Tree

node   1        3        4        5
a0     0.0123   0.0978   0.0120   0.0047