Ind. Eng. Chem. Res. 2009, 48, 4899–4907
Genetic Programming Based Variable Interaction Models for Classification of Process and Biological Systems

Raghuraj K. Rao,† Kyaw Tun,‡ and S. Lakshminarayanan*,†

† Department of Chemical and Biomolecular Engineering, National University of Singapore, 4 Engineering Drive 4, Singapore 117576. ‡ Synthetic Genomics Team, RIKEN Yokohama Institute, Tsurumi, Japan.

* To whom correspondence should be addressed. Tel.: (65) 68748484. Fax: (65) 67791936. E-mail: [email protected].

Classification of data originating from complex process and biological systems is challenging owing to the presence of multivariate and highly nonlinear interactions between variables. Patterns that are difficult to distinguish using decision boundaries or available discriminating rules can often be separated based on unique interrelations among the feature vectors. Given the complex nature of such systems, however, variable interaction models are difficult to establish. Genetic programming (GP), a data-driven evolutionary modeling approach, is suggested here as a potential tool for designing variable dependency models and exploiting them for class discriminant analysis. This paper therefore proposes a new GP model based classification approach. The approach is applied to illustrative data sets, and its performance is benchmarked against well-established linear and nonlinear classifiers such as LDA, kNN, CART, ANN, and SVM. It is demonstrated that GP based models can play an effective role in classification of data into multiple classes.

1. Introduction

Automatic classification of observations into known groups of characteristics plays a key role in systems analysis and decision making. Discriminant analysis has widespread applicability in problems ranging from fault detection and isolation in process industries, to product quality characterization in chemometrics, to clinical diagnosis in medicine, and to knowledge inference in biology. Pattern recognition theory and algorithms have been researched extensively in the fields of industrial engineering, statistics, and information sciences, and a number of machine learning and data mining approaches have been proposed.1,2 Supervised learning algorithms mainly involve two steps. In the first step, the classifier is formulated and trained using data collected on several attributes (variables) for different distinct classes that characterize the system. Next, these models are used to predict the class of an unknown sample from the same system using the same set of features. The wide range of statistical approaches employed to resolve such classification problems includes simple probabilistic Bayesian rules, linear and nonlinear discriminating functions, and more powerful machine learning techniques such as decision trees, neural networks (NN), and support vector machines (SVM). Theories for these methods are already well established, and readers are directed to refs 1-4 for details on the strengths and limitations of each method.

Measurements on different features of a system are generally used to understand various system behaviors. The profiles of the individual variables and the correlation patterns that exist between sets of these features characterize certain behaviors of the system. These patterns, unique to each class, can therefore serve as signatures to identify the particular nature of the system output. Hence, the principal objective of any supervised classifier is to learn these signatures so as to predict the output patterns for given values of input features. If the input data belonging to two classes exhibit distinct patterns, then the classes can be linearly separable when projected on the descriptor space.5 Such

data can be effectively classified using classifiers such as linear discriminant analysis (LDA) or its variants such as diagonal LDA (DLDA) and regularized discriminant analysis (RDA).1 In complex multivariate data sets, as found in many modern large scale chemometrics problems (spectral analysis), bioinformatics problems, and image analysis problems, the class data points show overlapping clusters and are not easily separable by a simple linear decision boundary in the multidimensional space.5 Such a lack of separation between samples degrades the performance of methods like LDA and RDA, which construct classifiers based on distinct interclass separations. To overcome this difficulty for systems with nonlinear characteristics, quadratic discriminant analysis (QDA)2 and SVM3 employ the "kernel trick". These methods map the original observations into a higher dimensional nonlinear space, using suitable kernel transformation functions, such that linear classification in the new space is equivalent to nonlinear classification in the original space. Owing to their ability to perform well, kernel approaches (especially SVM) have found widespread use in modern complex multivariate classification tasks.6 Although SVM has shown consistent success rates, the method requires rigorous tuning of kernel parameters and is computationally taxing for large data sets, as it employs intensive optimization routines. Further, these decision boundary searching algorithms are binary in nature (separating only two classes at a time), and their extensions to multiclass problems are not trivial.7

Alternate classification concepts have also been effectively applied: the distance measure based k-nearest-neighbor approach (kNN),2,5 artificial neural networks (ANN),8 which employ a black-box model mapping the features to the classes, regression based discriminant partial least squares (DPLS),5 and rule based decision tree methods such as CART,4 C4.5,9 and the ant-colony classifier.10 These methods and their extensions have shown data-specific performance and lack generality. This aspect comes to the fore when testing the classifiers with data samples for which they were not trained. Beyond all these observations, the main motivation for this work is that the existing methods do not capitalize much on the nature of multivariate associations between the features, which can bring out distinct dissimilarities between classes.11 This is especially important for data sets with overlapping patterns which cannot
be efficiently separated based on distance measures or decision boundaries. Structures of such variable interactions can be mathematically established, and the distinct intervariable relations, specific to each class, can be used as discriminating models. However, when the system under consideration is not well understood (e.g., complex process and biological systems), establishing the structure of such variable interaction models can be a formidable challenge. First principles modeling approaches have limited application in this domain because the causal relations between variables are at best unclear in most cases. Due to functional complexity and unknown states (input or output) of variables, it is difficult to employ dependency models based on conservation or other physical principles. For example, how are process variables related in a multiphase, multicomponent polymer reactor, and how does one relate gene expression values to diagnose a cancer tumor? Given the complex and poorly understood nature of the systems that are of interest to us, it makes sense to use data-driven evolutionary modeling methods such as genetic programming (GP) to uncover relationships between variables and to use the resulting models for classification purposes. The new genetic programming model based class discrimination (GPMCD) technique proposed in this paper implements this evolutionary computing based data classification approach. To the best of our knowledge, this work presents, for the first time, the direct application of genetic programming to construct and use variable interaction models for multiclass discriminant analysis.

This paper is organized as follows. We first explain the concept of variable interaction models and the importance of GP in designing them. Next, the objective of verifying the existence of unique variable interaction structures for each class is established. Then we provide a detailed description of the GPMCD methodology along with implementation details. Finally, we establish the performance of GPMCD, vis-à-vis existing advanced classifiers, using several process and biologically significant case studies.

2. Methodology

2.1. Concept of Variable Predictive Models. As stated earlier, the measurable attributes or features which are selected to define the characteristics of a system will exhibit dependencies and interactions among them. In complex multivariate systems, the continuous predictor variables are bound to have certain dependencies due to redundant measurements or the regulatory/causal nature of the system. Correlation based methods or conditional probabilities can qualitatively define these associations between continuous random predictor attributes and have been widely employed in the literature for many data mining problems.12 However, the correlation coefficient alone cannot identify and quantify all forms of direct or indirect multivariate relationships between variables. Consider, for example, the Iris flower system where four attributes (sepal length, SL; sepal width, SW; petal length, PL; and petal width, PW) are used to characterize three groups of flowers (Iris setosa, Iris virginica, and Iris versicolor). The changes in SL can be correlated to changes in PL. Similarly, SW and PW might be correlated. However, these individual correlations do not capture simultaneous multivariate dependencies between sets of variables.
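As a quick illustration of such class-specific signatures (a minimal Python sketch offered only as a reader's aid; the paper's own analysis was carried out in MATLAB), the per-class correlation structure behind Figure 1 can be computed directly:

```python
# Sketch: class-wise variable correlation structure for the Iris data.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target            # 150 samples x 4 features, 3 classes
names = ["SL", "SW", "PL", "PW"]         # sepal/petal length and width

for k, flower in enumerate(iris.target_names):
    R = np.corrcoef(X[y == k], rowvar=False)   # 4 x 4 correlation matrix
    print(f"Class {k} ({flower}):")
    for name, row in zip(names, R):
        print("  ", name, np.round(row, 2))
```

Each class yields a distinct off-diagonal pattern; it is exactly this kind of signature that the proposed method learns and exploits.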
Overall, we can check for interactions (relationships/dependencies) of each variable with the other variables defined on the same system. Such direct one-to-one (univariate) or multivariate associations between variables need not always be linear. For complex process and biological systems with several thousand variables (e.g., spectral data analysis for
chemometrics, gene expression data for cancer tumor characterization), nonlinear associations are quite common, and considering correlation measures alone can be misleading at best. The variable associations require richer quantitative representations and mathematical structure for characterizing definitive system behavior. Such deterministic relations can be suitably developed and validated from the observations made on the system. These relations are termed variable predictive models (VPM) and have recently been shown to be an effective way of characterizing steady state data sets without prior system knowledge.11,13 A VPM is a mathematical model representing changes in a particular variable as a function of one or a set of other variables. With the basic assumption that certain measured variables on the same system are dependent, a VPM can be developed for each variable in the system. Prediction of a variable using the corresponding VPM, if statistically significant, highlights the existence of a deterministic association between the predicted variable and the set of predictor variables used to build that specific VPM. The models are designed using the sample measurements (data matrix N [n × p], with n observations on p different variables) and are validated using tests based on the prediction errors.

Given that dependencies exist between variables of any given system, the immediate question is, what should be the nature of these models? In the research domain that focuses on poorly understood complex systems, the structure of these models cannot be established a priori. Even though enough external measurements are made to observe different behaviors of such systems, the underlying principles and reasons for variations in these variables are not yet established. Also, these variations are unique to specific systems and cannot be easily generalized over different forms of the system, which further hinders development of universal laws and theories explaining these changes in system variables. Hence, defining a specific structure and mathematically modeling the design principles for complex process and biological systems remains a major research challenge. This lack of prior system knowledge provides opportunities to employ evolutionary modeling techniques such as genetic algorithms (GA) and genetic programming (GP) for such problems. In the present study, we exploit the power of GP for designing the previously explained VPM to characterize variable interactions in process/biological systems. The models developed using GP are then extended to discriminate different system behaviors.

2.2. GP Models (GPM) for Variable Prediction. Genetic programming (GP)14 is an evolutionary method for automatically generating nonlinear input-output models. Given continuous variables, GP generates a population of models with random (but mathematically meaningful) structure without needing any prior knowledge of the system. These models are then validated on part of the initial data (i.e., data that are not used to build the model) using a suitable fitness measure. The probability of a given model surviving into the next generation is proportional to how well it predicts the output data. Components of successful models are continuously recombined (using a variety of genetic operators such as crossover, mutation, etc.) with those of others to form new models.
In each generation, GP optimizes the model structure, with a lower level nonlinear least-squares algorithm harnessed to estimate the associated model parameters.15 For a detailed discussion on the application of GP modeling to various linear/nonlinear steady state/dynamic process systems, please refer to the literature.16-20
Figure 1. Variable interaction correlation structure for Iris data. All three types of flowers exhibit distinct variable associations.

The main components of a GP tool are the following: the terminal set, which is a list of relevant input/output variables and constants; the functional set, which is a list of mathematical operators (e.g., +, -, /, *, ^, sqrt, sin, exp, log, etc.); the search space, which is the set of all possible models that can be constructed using the functional and terminal sets; and the fitness function, which is a measure for evaluating the suitability of a candidate model in terms of its fitting/prediction capability. Measures such as root-mean-square error (RMSE) are augmented by a term that penalizes complex (long) structures to form the fitness function. Relatively successful individuals are selected from the current population (in proportion to their fitness) to serve as parents for the next generation. This selection policy is called fitness proportional selection. The following genetic operators are used for the creation of a new generation of individuals: (i) crossover, the generation of two offspring from two parents, each of which is split arbitrarily into two subtrees and reassembled as two new entities; (ii) mutation, the generation of a single new individual from a selected parent, by randomly changing one of its terminal or functional elements; (iii) permutation, similar to mutation, except that two terminal elements are picked at random and exchanged. Typical tunable parameters in a GP include the following: population size; number of generations (iterations); probabilities for crossover, mutation, and permutation; functional weightings (the probability of selecting specific operators); the fitness measure; and the selection policy. Typically, the genetic program is run for either a prespecified number of generations or until a desired performance is attained by the best solution.

In the present setup, one cannot really distinguish between the input and output variables of the system. For example, to characterize product quality in a bioprocess, we are not sure of the direction of dependencies between different system variables (the concentrations of different biological and chemical components in the bioreactor). Therefore, each variable is treated as an output variable and modeled using the remaining variables with GP to obtain the corresponding VPM. For example, consider a system N [n × p] with n observations on p different continuous variables. If the set of p variables is written as X = {X1, X2, X3, ..., Xp}, then each variable Xi (i = 1, 2, ..., p) is considered as an output variable (predicted variable) and is modeled as a function of the best set of other variables [Xj] ⊂ {X1, X2, ..., Xi-1, Xi+1, ..., Xp; j ≠ i} in the same system. Each set of input variables [Xj] along with output variable Xi is subjected to a separate GP run to obtain the genetic programming model GPMi, which can effectively predict Xi given the values for [Xj]. For a selected Xi, all the (p - 1) remaining variables are used as members of the terminal set during the corresponding GP run. The best model for predicting Xi, i.e., GPMi, evolves over several generations, automatically retaining the best subset of predictor variables from [Xj]. For the entire system (N), we generate a set of p best GPMi which represent the variation in each system variable Xi (i = 1, 2, 3, ..., p) as a function of the best set of remaining variables.
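To make these mechanics concrete, the following is a deliberately minimal GP sketch in Python. It is only a stand-in for the authors' GeMS tool (not reproduced here): expression trees over a reduced function set {+, -, *} (division and the transcendental operators are omitted for brevity and numerical safety), subtree mutation and crossover, truncation selection in place of fitness proportional selection, and a node-count penalty as a crude surrogate for the complexity correction described later (eq 3, section 2.5).

```python
# Minimal GP sketch for one variable predictive model (GPM_i): evolve an
# expression tree predicting one column of X from the columns listed in
# `predictors`. Illustrative only; not the authors' GeMS implementation.
import copy
import random
import numpy as np

FUNCS = {"+": np.add, "-": np.subtract, "*": np.multiply}

class Node:
    def __init__(self, op=None, kids=None, var=None, const=None):
        self.op, self.kids, self.var, self.const = op, kids, var, const

    def eval(self, X):                        # X: (n_samples, p) array
        if self.op is not None:
            return FUNCS[self.op](self.kids[0].eval(X), self.kids[1].eval(X))
        if self.var is not None:
            return X[:, self.var]
        return np.full(X.shape[0], self.const)

    def nodes(self):                          # flat list of all subnodes
        out = [self]
        if self.op is not None:
            for k in self.kids:
                out += k.nodes()
        return out

def random_tree(predictors, depth=3):
    if depth == 0 or random.random() < 0.3:   # grow a terminal element
        if random.random() < 0.7:
            return Node(var=random.choice(predictors))
        return Node(const=random.uniform(-2.0, 2.0))
    return Node(op=random.choice(list(FUNCS)),
                kids=[random_tree(predictors, depth - 1) for _ in range(2)])

def fitness(tree, X, y, k1=0.05):             # RMSE with complexity penalty
    err = tree.eval(X) - y
    return np.sqrt(np.mean(err * err)) * (1 + k1 * len(tree.nodes()))

def mutate(tree, predictors):                 # replace a random subtree
    t = copy.deepcopy(tree)
    random.choice(t.nodes()).__dict__.update(
        random_tree(predictors, depth=2).__dict__)
    return t

def crossover(a, b):                          # graft a subtree of b onto a
    child = copy.deepcopy(a)
    donor = copy.deepcopy(random.choice(b.nodes()))
    random.choice(child.nodes()).__dict__.update(donor.__dict__)
    return child

def evolve_gpm(X, y, predictors, pop=120, gens=30):  # paper's GP settings
    P = [random_tree(predictors) for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=lambda t: fitness(t, X, y))
        elite = P[: pop // 4]
        P = elite + [crossover(*random.sample(elite, 2))
                     if random.random() < 0.5
                     else mutate(random.choice(elite), predictors)
                     for _ in range(pop - len(elite))]
    return min(P, key=lambda t: fitness(t, X, y))
```

For a class-k data matrix Gk with p columns, evolve_gpm(Gk, Gk[:, i], [j for j in range(p) if j != i]) would yield a candidate GPMi for variable i; one such run per variable per class produces the model pool used below.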

Hence the GP modeling technique serves the dual purpose of optimizing the structure and parameters of the GPMi and selecting the optimal set of predictor variables. Most importantly, these two features are achieved without any prior knowledge of the system. At the end of all the GP runs, a pool of p GPMi effectively represents the nonlinear intervariable relationships of the complex, unknown system. The concept of intervariable GPM is further extended to design a unique classifier.

2.3. GP Models for Class Discrimination (GPMCD). The main basis of this new classification approach (discriminant analysis) is that some of the variables exhibit definite and different dependencies among each other across classes. The hypothesis being tested is, do the specific classes of the system exhibit unique and independent structures of such intervariable relations? We demonstrate this idea with the Iris flower classification data, with four variables characterizing three classes of flowers (as explained in section 2.1). Figure 1 provides the shade map of the intervariable correlation structure for each type of flower. All the diagonals have correlation values equal to unity, as each variable is perfectly related to itself. Beyond this, it can be seen that SW-SL are strongly related for the I. setosa type, PW-PL are related for I. versicolor, and SW-PW are related for the I. virginica type flower. This establishes a unique pattern for each class, which can be exploited to distinguish it from the other types. To quantify these relations, one can define a linear fit between strongly related variables within each class. In reality, however, as systems get more complex, the variable associations are seldom linear/univariate. Hence, a naturally evolved model relating the variables in a class, built using the observations for that class, has the ability to mimic the true form of multivariate relationships as long as they can be captured in the form of mathematical equations. GP based variable predictive models provide the perfect solution to this problem. Once such class-specific variable interaction models are established by learning the system observations for each class, they can be effectively used to predict the class of an unknown sample. This class-specific variable interaction modeling concept forms the basis of the proposed supervised classification approach, named here GPMCD. As shown in Figure 2, the underlying principle of the GPMCD method is to build variable predictive models (GPM) for all the features using the variable interactions within each class and to use their predictive ability to classify a new observation. Let us consider a generalized classification problem having a training set N [n × p; g], with n observations made on p features, belonging to one of the g different classes of system behavior, and a test set M [m × p; g]. The first step in GPMCD is to

Table 1. Data Sets Used in the GPMCD Investigation; Sample and Class Details for Case Studies

data set  | data N [n × p; g]              | application                                                  | source
IRIS      | [150 × 4; 3]                   | taxonomy of flowers                                          | UCI24
DIABETES  | [768 × 8; 2]                   | diabetic disease diagnosis                                   | UCI24
CANCER    | N [38 × 4; 2], M [34 × 4; 2]   | leukemia cancer tissue detection                             | refs 25, 26
WINE      | [178 × 13; 3]                  | wine quality classification for three different cultivators | UCI24
FDD       | N [200 × 5; 4], M [200 × 5; 4] | nonlinear system, fault detection                            | simulated

Figure 2. Flow diagram for the GPMCD algorithm. The method involves data processing, training, and testing.

learn the class-specific variable associations in N by modeling the variable dependency structure for each variable in each class. For this, as part of the preprocessing step, data belonging to each class are extracted from the set N into g different group matrices Gk [nk × p], with nk being the number of samples belonging to class k (k = 1, 2, ..., g). For each attribute vector Xi (i = 1, 2, ..., p) in Gk, the best GPMi is built as described in section 2.2. This gives a set of p best predictive models belonging to a given class k (collectively denoted as the group k model GPMik), uniquely characterizing the variable associations for that class (training step). The individual class models GPMik are represented by a definite structure optimally designed by GP based on its predictive ability. Since the GPMCD algorithm captures all the direct intervariable relations, it is conjectured that the final set of GPMik obtained on the training data represents a variable interaction discriminatory model to be used for sample testing. At the end of this training stage, the set of optimally learned GPMik models have distinctly higher accuracies in predicting any sample belonging to group k. By sample prediction it is meant that, given the values of the predictor variables [Xj] in a particular model GPMik, the model predicted value X̂i has a higher likelihood of being close to the measured value Xi if that sample belongs to group k. The probability of the full sample belonging to group k is assessed by checking the predictive capability of GPMik for all the attribute values in that sample, using the remaining predictor values from the same sample. The sample is considered to belong to class k if the corresponding GPMik provides the best reprediction accuracy. The sample reprediction capabilities of each GPMik (k = 1, 2, ..., g) are thus the primary discriminating criteria used in GPMCD. During the testing step, the proposed model based class characterization approach (GPMCD) statistically tests each new sample S in the test matrix M by projecting all the features of S onto the GP models and repredicting the feature vector Ŝ using the corresponding GPMik. The GPMCD classifier is then built using the objective function given in eq 1. The unknown sample S is classified as belonging to class k based on the minimum squared prediction error SSEk.

min_k |SSEk|,   SSEk = Σ_{i=1}^{p} (Si - Ŝi)^2,   k ∈ {1, 2, ..., g}   (1)
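Read as code, the training and testing steps amount to the short sketch below (illustrative; evolve_gpm is the hypothetical GP routine sketched in section 2.2, and the class labels are used directly as dictionary keys):

```python
import numpy as np

def train_gpmcd(X, y, evolve):
    """Training step: one model pool per class, one model per variable.
    `evolve` is a GP routine such as the evolve_gpm sketch in section 2.2."""
    p = X.shape[1]
    pools = {}
    for cls in np.unique(y):
        Gk = X[y == cls]                      # group matrix for class k
        pools[cls] = [evolve(Gk, Gk[:, i], [j for j in range(p) if j != i])
                      for i in range(p)]
    return pools

def classify(sample, pools):
    """Testing step (eq 1): assign `sample` (length-p vector) to the
    class whose models repredict it with the smallest squared error."""
    best, best_sse = None, np.inf
    for cls, mods in pools.items():
        s_hat = np.array([m.eval(sample[None, :])[0] for m in mods])
        sse = float(np.sum((sample - s_hat) ** 2))
        if sse < best_sse:
            best, best_sse = cls, sse
    return best, best_sse
```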

As all or some of the measurable features out of a large pool of variables in a multivariate system are related to one another,
the group model GPMik is robust to outliers or noisy data and can distinguish the group characteristics from other groups even if the samples in matrix N are closely located in the p-dimensional descriptor space. Since the criteria for discrimination are the prediction errors obtained using GPMik, the GP model based discrimination method does not suffer from the problem of inseparability of samples on a distance measure. GPMCD is effective in classifying overlapped data points and achieves nonlinear classification without having to find a decision boundary, trace distributions, or blow up the data onto a higher dimensional space. This unique feature distinguishes GPMCD from existing decision boundary classifiers such as LDA, QDA, and SVM, from probability function based Bayesian classifiers, and from distance measure based kNN algorithms. The new method is also distinct from all the decision tree based algorithms because it does not perform any logical comparisons or generate rules based on the numerical values of the variables.

Application of GP to class discrimination is of recent interest, and many forms of GP implementation have been attempted.21-23 These studies utilize the power of GP to design decision rules or discriminating functions which are then directly used to predict the class of a new sample. Also, these are binary (two class) implementations and do not address simultaneous multiclass problems. Unlike these approaches, GPMCD does not use GP to predict the classes directly. Instead, it uses GP to develop variable prediction models separately for each class, which are then employed for class discrimination. It also differs in its evolutionary modeling basis, as it uses the variable prediction accuracy as the model selection criterion and not the class prediction accuracy. Due to its ability to design independent class interaction models, GPMCD can train on and predict multiple class samples simultaneously. The performance of the new GPMCD algorithm is illustrated in the following sections, using different classification applications of process and biological significance. The results are compared with different existing classifiers such as LDA, kNN, ANN, and SVM. For the sake of brevity, the theory and detailed description of the methods used for comparison are omitted here; they can be found in the relevant literature.1-3

2.4. Case Studies. Five representative case studies with different process and biological significance are selected to demonstrate the generalized performance of GPMCD. Table 1 lists the details of the data sets along with their respective sources. The first data set (IRIS) is the simple classification problem considered earlier for explaining the GPMCD concept. This data set represents a typical industrial product classification problem and attempts to resolve the differences between flowers belonging to different biological taxa. Simple measurements of petal and sepal lengths (in millimeters) are used to characterize three distinct flower types. The data points (n = 150) have a uniform sample distribution over the three groups (50 in each group) and are therefore very appropriate for statistical analysis.
The second data set (DIABETES), with eight variables and two classes, is selected because of its clinical significance. The eight continuous variables, used to characterize the presence or absence of the disease, represent various physiological parameters of the patient such as plasma glucose, blood pressure, skin-fold thickness, and body mass index. As no definite relationship is known between these variables, the unique interactions between them provide a classical case for evolutionary programming based classification. These two data sets are downloaded from http://www.ics.uci.edu/∼mlearn/databases/.24 The data set (CANCER) on leukemia tissue prognosis provides a high dimensional data set for binary classification (identification of acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL)). Most of the 6172 variables (gene expressions) are shown to be irrelevant, as they are equally expressed in AML and ALL tissues. Hence, only a small set of differentially expressed genes is selected for further analysis. Many statistical gene selection methods have been suggested in the literature,25 and a few important genes for this problem have been experimentally identified. We select four such important genes as suggested by Fua and Fu-Liu.26 The next two data sets have been chosen to demonstrate the GPMCD performance on process systems. The WINE data set is a product quality characterization problem already studied for different classifier performances10 and hence provides opportunities for comparative studies. The FDD data set represents a fault detection and diagnosis problem for a simulated nonlinear system involving five variables. In this data set, variables x1 and x2 are the independent variables. Variables x3, x4, and x5 are calculated using the following model equations:
x3 = 2x1^2 + x1x2;   x4 = x2x3 + exp(x2);   x5 = x1x2/x4   (2)
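For concreteness, a data set of this form can be generated as sketched below. The paper does not state the input distributions or the sign of each bias, so the uniform ranges and the +15% choice are assumptions; the fault scheme itself follows the description in the next paragraph.

```python
# Sketch: simulated FDD data per eq 2 with bias faults (assumed details
# noted above). Biases are applied to the measured columns only, i.e.,
# they are not propagated through eq 2.
import numpy as np

def make_fdd(n=200, seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.5, 2.0, n)            # assumed input range
    x2 = rng.uniform(0.5, 2.0, n)
    x3 = 2 * x1**2 + x1 * x2                 # eq 2
    x4 = x2 * x3 + np.exp(x2)
    x5 = x1 * x2 / x4
    X = np.column_stack([x1, x2, x3, x4, x5])
    y = np.repeat([1, 2, 3, 4], n // 4)      # 50 samples per class
    for cls, col in [(2, 1), (3, 2), (4, 3)]:  # bias x2, x3, x4 (0-indexed)
        X[y == cls, col] *= 1.15             # +-15% bias; + chosen here
    return X, y
```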

Two hundred samples are generated. Subsequently, faults are introduced into the data. The first 50 samples (class 1) are uncorrupted. Suitable biases (±15%) are added to variables x2, x3, and x4 for each successive set of 50 samples to simulate fault classes 2, 3, and 4, respectively, thus providing a system N [n = 200 × p = 5; g = 4] for classification analysis. Similarly, another set of 200 samples is extracted as test cases (M [200 × 5; 4]). Though representative data sets are used for illustrating the performance of the proposed GPMCD classifier, the concept can be extended to any system N with varying characteristics (n, p, g, and distribution of samples).

2.5. Implementation and Analysis. All the data sets listed in Table 1 are separately subjected to GPMCD training and testing. The GPM models for predicting variables in each class are generated using the genetic programming software GeMS,27 developed in-house for various modeling applications. This software can be made available to interested readers upon request. Here, we only point out the unique features of the GeMS software. GeMS uses a novel pointer based indexing approach to store the genes and chromosomes; this provides fast access to the models, needs less computer memory, and renders the toolbox easily extensible. With the provision to limit the number of operands each functional element can take, GeMS also exploits any available a priori knowledge in order to limit the search space or to transform the model assembled by GP into more effective models. The user can define the evolution policy using keywords for the action to be taken (inclusion, exclusion, parameter reduction, etc.) under conditions such as repetition of a specific gene, existence of a specific gene, etc. GeMS uses stochastic "global" optimization methods such as simulated annealing, genetic algorithms, and differential evolution in conjunction with local optimization techniques like the Gauss-Newton method to determine the global optimum values of the parameters.

Table 2. Variable Association Models Designed by GP for Different Case Studies

data set  | sample GPMik (a)
IRIS      | X2^1 = 0.6853*X1 (equivalently, SW = 0.6853*SL in group 1)
          | PL^2 = X4 + 0.42039*(X4 + X1) = 1.42039*PW + 0.42039*SL
          | PW^3 = ((1 + (1 + (0.5457*X2 + 1))) + 1) = 3 + 0.5457*SW
DIABETES  | X5^1 = 0.9291×10^-2*((X8 - X2) - X6)^2
          | X2^2 = -90.23*(X6/X5) + 128.3 + X6
CANCER    | X3^1 = exp(9.643) + 3.607*(X2 - 1)
          | X1^2 = X3
          | X4^2 = -4715*((1^0.5 - 1) - X2) (which simplifies to 4715*X2)
WINE      | X3^1 = (0.5023×10^-2*X5 + 1) + X8 + 0.3703×10^-1*X4
          | X9^2 = -0.3486*(X8 - (X12 + X6))
          | X10^2 = X7 + 1/X9
          | X1^3 = 0.3658×10^-1*(X9*X10) + 12.82
FDD       | X3^1 = 0.8465*(((3.362*X1 + 1.181*X2)*X1) - X1^2)
          | X5^2 = 1.000*(X1*X2)/X4
          | X3^3 = 1.000*(((X1 + X2) + X1)*X1)
          | X3^4 = 2.000*X1^2 + 0.8333*(X4*X5) - 0.1545×10^-4*X4

(a) The GP model GPMik predicts the variable Xi in group k (written Xi^k above) using other variables (Xj) from the same group.
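The FDD entries can even be checked numerically. The sketch below (relying on the hypothetical make_fdd generator from section 2.4) evaluates the class 1 model for x3 from Table 2 on uncorrupted samples; the residual against the true x3 of eq 2 is small, since 0.8465*(3.362 - 1) ≈ 2.000 and 0.8465*1.181 ≈ 1.000:

```python
import numpy as np

X, y = make_fdd()                            # sketch from section 2.4
nom = X[y == 1]                              # uncorrupted class 1 samples
x1, x2 = nom[:, 0], nom[:, 1]
gpm_x3 = 0.8465 * (((3.362 * x1 + 1.181 * x2) * x1) - x1**2)   # X3^1
print(np.max(np.abs(gpm_x3 - nom[:, 2])))    # close to zero (cf. eq 2)
```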

GeMS provides an option to tune the model complexity in terms of chromosome size and a "complexity corrected" fitness criterion. In the present study, the settings used for each GP run are the following: maximum generations = 30; population size = 120; fitness criterion = RMSE; mutation/crossover probability = 0.5; terminal set = {variables other than Xi: set [Xj]}; output variable = {Xi}; model type = multivariate algebraic (one to many). Though GP runs can generate models with random structures of varying sizes and element compositions, too lengthy and complex structures may lead to data overfitting. To regulate this, chromosome-related parameters are fixed as follows. The maximum number of parameters in a model is Npmax = 5; the maximum length and depth of the chromosome are restricted to 10 in order to control the model complexity. The functional elements are restricted to the set {^2, ^3, sqrt, sin, cos, log, exp} to avoid highly nonlinear component interactions (which can increase the risk of convergence to local optima). Correction for model complexity is also important for efficient computation, since GP is thereby forced to simplify the model through successive generations, reducing the search space. This is achieved by correcting the RMSE fitness criterion with a higher penalty for more complex structures using the formulation given in eq 3, which is built into the GeMS settings.27

RMSEcorrected = RMSE*(1 + k1*Np + k2*Mc)   (3)

where Np is the number of parameters in the model, Mc is the complexity value of the chromosome, and k1 and k2 are weighting factors, both set at 0.5 (default GeMS settings).
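In code, the correction is a one-liner (k1 = k2 = 0.5 being the default GeMS settings quoted above):

```python
def corrected_rmse(rmse, n_params, complexity, k1=0.5, k2=0.5):
    # eq 3: penalize models with more parameters / larger chromosomes
    return rmse * (1.0 + k1 * n_params + k2 * complexity)
```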

During each GP run, GeMS takes the training data for variable Xi and the set of variables [Xj] and designs the GPM models with the settings explained above. Part of the training data is used for model fitness evaluation using the criterion in eq 3. The selected population evolves over generations, retaining the models with the best RMSE and lower complexity. Models generated for each class are compiled and made compatible for the next step. Sample models generated by GeMS for each data set are given in Table 2. The GPMCD algorithm discussed in section 2.3 has been implemented and executed in MATLAB.28 Selected classical supervised learning algorithms are also implemented in MATLAB to carry out a comparative analysis.
Decision boundary based LDA and decision tree based CART (without pruning) are implemented using the MATLAB multivariate statistics toolbox. The ANN classifier, with a three hidden layer architecture and error back-propagation training algorithm, is developed using the Neural Network toolbox. kNN is coded as a separate classifier function. SVM test/train algorithms in the Bioinformatics toolbox are utilized to obtain benchmark results. The learning ability of the GPMCD classifier is tested in two ways: (i) by resubstitution (RS) analysis, where the training data are used as test samples, and (ii) by a new sample (NS) test, wherein half the initial data points, not included during training, are later used as test samples. The RS test indicates the self-consistency of the classifier, while the NS test gauges the real performance of the classifier by testing its ability to generalize.2 In each of these tests, overall classification accuracy is selected as the primary indicator of classifier performance. For all the results, this accuracy is reported as the percentage of test samples (the full set N for RS and half the set for NS, if a separate test set is not available) that are classified correctly by the trained classifier. As the data sets are clean, no preprocessing or data elimination is performed. The same data set is used for all the tests, for all the classifiers. Wherever a separate test set is unavailable, the data sets are split equally into training and testing sets, proportionately maintaining the class distribution in each split. The GPM for each class are optimally constructed using only the training set data. The training data are further randomly split into two parts: two-thirds of the data are used to construct the models, and the remaining one-third (holdout sample) is used to evaluate the fitness criterion in GP. The pool of class-specific models collected from the GP runs as GPMik is then utilized as discriminating functions in GPMCD on test data not used during training (except in the RS test). All the test results are reported and discussed in the following section.
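The two evaluation protocols can be sketched as follows (hypothetical helpers: train_gpmcd, classify, and evolve_gpm are the sketches from sections 2.2 and 2.3, not the MATLAB/GeMS code actually used):

```python
import numpy as np

def accuracy(X, y, pools):
    # `classify` and `pools` as in the section 2.3 sketch
    preds = np.array([classify(x, pools)[0] for x in X])
    return 100.0 * np.mean(preds == y)

def rs_ns_test(X, y, seed=0):
    rng = np.random.default_rng(seed)
    tr, te = [], []
    for cls in np.unique(y):                 # stratified 50/50 split
        idx = rng.permutation(np.where(y == cls)[0])
        tr += list(idx[: len(idx) // 2])
        te += list(idx[len(idx) // 2 :])
    pools = train_gpmcd(X[tr], y[tr], evolve=evolve_gpm)
    rs = accuracy(X[tr], y[tr], pools)       # resubstitution (RS) test
    ns = accuracy(X[te], y[te], pools)       # new sample (NS) test
    return rs, ns
```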
3. Results

Sample class-specific variable association models (GPMi) obtained from the GP runs for each class are outlined in Table 2 for each data set. The model structure as given by GeMS is simplified and reported in terms of the system variables {Xi, Xj}. For continuity with the terminology used earlier, the variable notations are replaced with the original four variable names for the IRIS data set alone. The model structures automatically optimized by the evolutionary approach for the IRIS data are coherent with the classwise variable correlation structure depicted in Figure 1. Also, without any initial system knowledge and bias, the natural selection based modeling procedure (GP) has designed a linear variable association structure to characterize each type of flower. Out of the three predictive models possible for each variable, the dominant relations (as seen in Figure 1 for each class) are perfectly captured as the respective best GPMi. For I. versicolor flowers (group 2), GP designs variable PL as a bivariate linear model with a higher weight on PW (strongly correlated) along with the SL variable in the same class. These observations establish the power of the GP based modeling technique in identifying the true nature of intervariable dependencies and learning the characteristic signatures of each class. Further, complex models with multivariate nonlinear associations are observed for the different physiological properties in the DIABETES data set. The gene expression values exhibit almost univariate dependence on some other gene for cancer characterization. These model structures, typical for each class of the given system, provide better insights into the underlying design principles of biological systems, apart from their discriminative power during GPMCD implementation. The GPM for the IRIS data highlight
the direct linear relation between sepal and petal properties. The nonlinear relations for the DIABETES data are evident from the complex physiological relations between the selected variables (blood pressure, blood composition, etc.). The linear association between gene expression values is a clear indication of the existence of correlated genes which are coexpressed or suppressed during a definite genetic activity such as mutation or tumor growth. Similarly, for the process data sets, the variable interaction models evolve as complex nonlinear structures to characterize the WINE data (relating nonlinear dependencies between compositions). The GP models for the FDD data bring out the inherent nature of the equations used to generate the data. For class 1, which is the no-fault set (50 samples), GP generates almost exact representations for variables x3 and x5. For the fault situations (classes 2, 3, and 4), GPMCD uniquely learns the deviations for each fault and generates class-specific variable dependency structures which essentially help in discriminating the test samples. Hence, this simulated FDD data set with a known system structure helps reveal the strength of GP in recognizing pattern-specific variable interactions, thereby facilitating subsequent class discrimination.

Figure 3 illustrates the working principle of GPMCD for the IRIS data set. All the class models are used to predict the four variables (rows) in all three classes (columns). For the I. setosa class (class 1), the GPM1 predictions (the dash-dotted line) almost overlap the actual sample variable values (circles) for all four variables. On the contrary, when GPM2 (dashed line) or GPM3 (solid line) are used to predict the variables of samples belonging to class 1, the predicted values are far from the actual values. A similar distinction between the predictive ability of the class-specific GPM and the other GPM can also be seen in the other groups: the dashed lines in group 2 (I. versicolor) and the solid lines in group 3 (I. virginica) are closer to the respective group variables. This distinction enables the GPMCD classifier to segregate samples into their respective groups using the predictive ability of the groupwise GP models. Another important observation that can be made from Figure 3 is the possible existence of variables which do not distinguish groups very well. It can be seen that the variable SL, although clearly predicted by the I. setosa group model, shows significant overlap in model prediction for classes 2 and 3. This implies that SL is a poor feature for separating I. versicolor and I. virginica flowers. Such observations made during GPMCD analysis can provide further insights for relevant variable subset selection.

The results of the GPMCD analysis, as outlined in Table 3, reveal the underlying concept of model based distinction of classes. The analysis is carried out using the new GPMCD method, and the results are analyzed along with the accuracies obtained by several other standard classifiers. Results for the RS tests clearly indicate the ability of the GPM based classification approach to learn the patterns from the observations made on the system. For the binary classification problems (DIABETES and CANCER), the RS results are much higher than those of a random classifier (based on the distribution of samples, a minimum of 35% for DIABETES and 52% accuracy for CANCER if all the samples are predicted as the same class). This clearly provides evidence that the variable association structures defined by the GPM for each class are distinct and can be used to methodically distinguish the samples. The high RS accuracies for a complex system with 13 variables (WINE) with nonuniform class distribution are encouraging and provide evidence of the self-consistency of the new method.
Figure 3. Variable prediction plots for IRIS data. Four variables are predicted (one per row) in all three groups. Column a, I. setosa; column b, I. versicolor; column c, I. virginica. In each subplot, the dash-dotted, dashed, and solid lines are the group 1, group 2, and group 3 GP model predictions, respectively; circles mark the corresponding actual sample values. Refer to the text for more explanation.

Table 3. Comparative Performance of Classifiers (% Accuracy) (a)

data set  | test | LDA   | kNN   | CART  | ANN   | SVM   | GPMCD
IRIS      | RS   | 96    | 100   | 98    | 98    | 100   | 97.33
          | NS   | 96    | 94.67 | 94.67 | 97.33 | 93.33 | 98.67
DIABETES  | RS   | 76.82 | 100   | 92.84 | 78.52 | 100   | 92.84
          | NS   | 75    | 66.4  | 72.92 | 76.30 | 83.33 | 82.29
CANCER    | RS   | 97.37 | 100   | 100   | 97.37 | 100   | 94.12
          | NS   | 97.06 | 100   | 97.06 | 97.06 | 97.06 | 100
WINE      | RS   | 100   | 100   | 97.19 | 100   | 100   | 100
          | NS   | 98.0  | 94.5  | 89.55 | 97.5  | 97.5  | 98.88
FDD       | RS   | 66.0  | 100   | 85    | 83.0  | 93    | 98
          | NS   | 65.5  | 61    | 52    | 78.5  | 83    | 93

(a) For each data set, the RS row gives resubstitution test accuracies; the NS row gives new sample test accuracies with half the data left out during training.

These results highlight the supervised learning capabilities of GPMCD and support the new classifier design philosophy, which seeks neither a decision boundary, nor distance measures, nor distribution functions. The new sample prediction test results (the NS row for each data set in Table 3) provide better insight into the superiority of the proposed method in generalizing the class prediction. The overall consistent and comparable results of GPMCD for data sets of varying classification complexity indicate that the objectives of the present study have been achieved.

The GPMCD performance is comparable with that of all the other methods, which employ different classifier principles. RS accuracies for GPMCD are lower in most cases, especially compared to CART and ANN. However, the new sample test results for GPMCD are not far from its RS accuracies, which indicates that GPMCD learning is near optimal and does not suffer from the data overfitting seen with CART and ANN. Validation test results for the WINE data are comparable to the performances of computationally more intensive classifiers (97.9% for the ant colony classifier and 92.2% for C4.5), as reported by Shelokar et al.10 For the simulated data set where the errors are within known bounds (FDD), the performance of GPMCD is superior to that of the nonlinear classifiers such as ANN and SVM.
These observations establish the intrinsic strength of the new method, which captures the variable interactions distinctly into GPMik for predicting the characteristic groups.

4. Discussion

We now discuss some of the important advantages of GPMCD, along with certain limitations and suggestions for overcoming them. During the analysis, we observed that for large sample sizes (DIABETES data set), ANN and SVM have longer computational times. LDA is superior on simpler data sets, but its performance suffers on multiclass, linearly inseparable problems (DIABETES and FDD). In contrast, the computational time for GPMCD depends largely on the number of variables rather than the sample size. High dimensional data sets pose a serious challenge to the GP based method, as all the variables need to be subjected to multivariate GP runs before a best set of GPMi can be optimized. One way to avoid this combinatorial explosion is to select a smaller set of important variables using domain knowledge or statistical feature selection techniques.29-31 The viability of this approach can be demonstrated using the illustrative IRIS data set: GPMCD classification of Iris flowers with only two variables (petal width and petal length), shortlisted using a partial correlation metric based variable selection algorithm,31 provides an accuracy of 97.33% on the NS set, compared to 98.67% using all four variables (as shown in Table 3). As also demonstrated during the CANCER data set analysis, only four out of 6172 genes can achieve 100% classification of samples. In a separate GPMCD analysis for GPMi selection, the performance remained unchanged even after eliminating two of these four genes from the data set. Hence, the GPMCD algorithm can be suitably combined with variable selection algorithms to address classification problems with a large number of variables.

Another significant advantage of the GPMCD method compared to SVM is its simple formulation for multiclass problems. For example, simple variable association models for each type of fault in a process system (simulated as the FDD data) effectively capture the multiclass distinction without projecting the variables onto a new space. The computational time is on the order of a few seconds for training and predicting all four classes simultaneously. This is significantly less than SVM's one-against-one (OAO) multiclass formulation, which needs six different binary classifiers to complete the task, with each classifier employing a rigorous optimization algorithm to determine the optimum hyperplane. Methods like LDA/QDA need at least as many training data points as variables (and preferably more) for good model fitting, as they try to model all the selected variables at the same time. Since GPMs involve very few coefficients, the GPMCD approach does not need many data points in each class. Though a larger data set with more samples in each class increases the statistical significance of the prediction accuracies, the GPMCD classifier can nevertheless be effectively trained using fewer samples (as seen in the CANCER data set analysis). Further, since a pool of multiple GPMs is designed, the performance of the final classifier model (GPM) is less prone to noise and outliers compared to distance and decision rule based methods.
Finally, as established before, the GPMCD algorithm is inherently an individual class-specific formulation (modeling each class independently), making the multiclass extension quite trivial.

Like any other data-driven supervised classification algorithm, GPMCD requires prior learning of the system classes using training data, which is generally available in the domain of interest. Such learning/prediction tools are important for a wide range of applications focusing either on automation of existing know-how or on learning the mechanism underlying an observed series of complex system behaviors. The new algorithm attempts to extend the domain of such machine learning expertise. However, it must be highlighted that, with suitable modifications, the same algorithm can be made to extract new knowledge (as in unsupervised settings) that was not known a priori. One possibility is to employ strict prediction performance cutoffs while testing the class-specific GPMs: if the optimally trained models belonging to all the known classes fail to satisfy the criteria, the sample can be categorized as belonging to a new class. Such analysis can provide practically important insights for further understanding of the system. On the other hand, the performance of this model based approach may be affected by several factors, among them the availability of continuous and dependent features, the model complexity, and the parameters employed during the GP runs. The sensitivity of the proposed classifier to these factors needs to be investigated further. The limitations of evolutionary computing, such as the possibility of getting trapped in local optima, the difficulty in elucidating random or lengthy model structures, and the computational complexity for data sets involving a huge number of variables, also need to be addressed. If these factors can be overcome with advances in hardware and software, the proposed GPMCD method appears to be a strong and potent candidate for modern large scale pattern recognition applications.

5. Conclusion

The evolutionary computing capabilities of the GP modeling technique are exploited to establish a model based classifier. The models obtained from GP provide insights into the linear/nonlinear dependencies among variables in a complex system. A new and simple classification approach based on these variable association GP models is proposed. The performance of the new method is analyzed using five significant and statistically meaningful classification problems involving process and biological systems. From the results obtained, it is evident that the GPMCD method performs better than several of the existing leading classifiers, particularly on new data samples. This makes GPMCD eminently suitable for the analysis of complex systems with continuous and associated features but unknown structure.

Literature Cited

(1) McLachlan, G. J. Discriminant Analysis and Statistical Pattern Recognition; Wiley Interscience: New York, 1992.
(2) Duda, R. O.; Hart, P. E.; Stork, D. G. Pattern Classification; John Wiley: New York, 2000.
(3) Vapnik, V. Statistical Learning Theory; Wiley-Interscience: New York, 1998.
(4) Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and Regression Trees; Wadsworth International Group: Belmont, CA, 1984.
(5) Tominaga, Y. Comparative study of class data analysis with PCA-LDA, SIMCA, PLS, ANNs, and k-NN. Chemom. Intell. Lab. Syst. 1999, 49, 105–115.
(6) Chiang, L. H.; Kotanchek, M. E.; Kordon, A. K. Fault diagnosis based on Fisher discriminant analysis and support vector machines. Comput. Chem. Eng. 2004, 28, 1389–1401.
(7) Allwein, E. L.; Schapire, R. E.; Singer, Y. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res. 2000, 1, 113–141.
(8) Venkatasubramanian, V.; Vaidyanathan, R.; Yamamoto, Y. Process fault detection and diagnosis using neural networks-I. Steady-state processes. Comput. Chem. Eng. 1990, 14 (7), 699–712.
(9) Quinlan, J. R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Francisco, CA, 1993.
(10) Shelokar, P. S.; Jayaraman, V. K.; Kulkarni, B. D. An ant colony classifier system: application to some process engineering problems. Comput. Chem. Eng. 2004, 28, 1577–1584.
(11) Rao, R.; Lakshminarayanan, S. Variable predictive models - a new multivariate classification approach for pattern recognition applications. Pattern Recognit. 2009, 42 (1), 7–16.
(12) Pearl, J. Causality: Models, Reasoning, and Inference; Cambridge University Press: Cambridge, U.K., 2000.
(13) Rao, R.; Lakshminarayanan, S. VPMCD: Variable interaction modeling approach for class discrimination in biological systems. FEBS Lett. 2007, 581, 826–830.
(14) Koza, J. R. Genetic Programming: On the Programming of Computers by Means of Natural Selection; MIT Press: Cambridge, MA, 1992.
(15) Bard, Y. Nonlinear Parameter Estimation; Academic Press: New York, 1974.
(16) Madar, J.; Abonyi, J.; Szeifert, F. Genetic Programming for the Identification of Nonlinear Input-Output Models. Ind. Eng. Chem. Res. 2005, 44 (9), 3178–3186.
(17) Gray, G. J.; Murray-Smith, D. J.; Li, Y.; Sharman, K. C. Nonlinear model structure identification using genetic programming and a block diagram oriented simulation tool. Electron. Lett. 1996, 32, 1422–1424.
(18) Lakshminarayanan, S.; Fujii, H.; Grosman, B.; Dassau, E.; Lewin, D. R. New product design via analysis of historical databases. Comput. Chem. Eng. 2000, 24 (2-7), 671–676.
(19) Hinchliffe, M. P.; Willis, M. J. Dynamic systems modelling using genetic programming. Comput. Chem. Eng. 2003, 27, 1841–1854.
(20) Grosman, B.; Lewin, D. R. Adaptive genetic programming for steady-state process modeling. Comput. Chem. Eng. 2004, 28 (12), 2779–2790.

(21) Kishore, J. K.; Patnaik, L. M.; Mani, V.; Agrawal, V. K. Application of Genetic Programming for Multicategory Pattern Classification. IEEE Trans. Evol. Comput. 2000, 4 (3), 242–258.
(22) Zhang, Y.; Bhattacharyya, S. Genetic programming in classifying large-scale data: an ensemble method. Inf. Sci. 2004, 163, 85–101.
(23) Zhang, L.; Nandi, A. K. Fault classification using genetic programming. Mech. Syst. Signal Process. 2007, 21, 1273–1284.
(24) Newman, D. J.; Hettich, S.; Blake, C. L.; Merz, C. J. UCI Repository of Machine Learning Databases; Department of Information and Computer Science, University of California: Irvine, CA, 1998.
(25) Golub, T. R.; Slonim, D.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.; Coller, H.; Loh, M.; Downing, J.; Caligiuri, M.; Bloomfield, C.; Lander, E. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 1999, 286, 531–537.
(26) Fua, L. M.; Fu-Liu, C. S. Multi-class cancer subtype classification based on gene expression signatures with reliability analysis. FEBS Lett. 2004, 561, 186–190.
(27) Tun, K.; Lakshminarayanan, S. Identification of algebraic and state space models using genetic programming. Presented at DYCOPS 2004, Boston, MA, July 5-7, 2004.
(28) MATLAB 7.0.4, release 14; The MathWorks Inc.: Natick, MA, 2005.
(29) McCabe, G. P. Principal variables. Technometrics 1984, 26, 137–144.
(30) Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
(31) Rao, R. K.; Lakshminarayanan, S. Partial correlation based variable selection approach for multivariate data classification method. Chemom. Intell. Lab. Syst. 2007, 86 (1), 68–81.

Received for review July 25, 2008. Revised manuscript received January 20, 2009. Accepted March 24, 2009.

IE801147M