Support Vector Machine Classification Trees

Peter de Boves Harrington

Ohio University Center for Intelligent Chemical Instrumentation, Department of Chemistry & Biochemistry, Clippinger Laboratories, Athens, OH 45701-2979 USA
Abstract

Proteomic and metabolomic studies based on chemical profiling require powerful classifiers to accurately model complex collections of data. Support vector machines (SVMs) are advantageous in that they provide a maximum margin of separation for the classification hyperplane. A new method for constructing classification trees, for which the branches comprise SVMs, has been devised. The novel feature is that the distribution of the data objects is used to determine the SVM encoding: the variance and the covariance of the data objects are used to determine the bipolar encoding required by the SVM. The SVM that yields the lowest entropy of classification becomes the branch of the tree. The SVM-tree classifier has the added advantage that nonlinearly separable data may be accurately classified without optimization of the cost parameter C or a search for a suitable higher-dimensional kernel transform. It compares favorably to regularized linear discriminant analysis (RLDA), to SVMs in a one-against-all multiple classifier, and to the fuzzy rule-building expert system (FuRES), a tree classifier with a fuzzy margin of separation. SVMs offer a speed advantage, especially for data sets that have more measurements than objects.

Keywords: Support vector machine (SVM), classification tree, chemometrics, chemical profiling
Introduction

Chemical profiling and qualitative analysis is a burgeoning field because of the advances in chemical instrumentation and its synergistic coupling to chemometrics.1 Authentication of complex materials such as food and nutraceuticals often requires fast and robust classifiers.2 Furthermore, methods are required that can be applied automatically and, therefore, should be parameter-free. Support vector machines (SVMs)3 are a relatively new type of classifier. Their key advantage is that they can construct classification models very quickly, especially for megavariate data4 (i.e., data sets that have many more measurements than objects). The analysis of megavariate data is becoming commonplace because of the ever-increasing resolution and speed of chemical measurement.
SVMs have found many applications in analytical chemistry. Perhaps the earliest application was the detection of cancer biomarkers.5 Another example is the classification of the traditional Chinese medicine Semen Cassiae (Senna obtusifolia) seeds into roasted and raw groups from their infrared spectra.6 SVMs were also used with high-throughput mass spectrometry for the prediction of cocoa sensory properties.7 They have been used for the forensic analysis of inks and pigments using Raman spectroscopy and laser-induced breakdown spectroscopy.8 SVMs predicted fuel properties from near-infrared spectroscopy (NIRS) data.9 Together with NIRS, SVMs have been used to detect endometrial cancers.10
SVMs are binary classifiers that require bipolar encoding of the classes (i.e., the classes are encoded as -1 or +1). Some basic schemes have been applied to adapt SVMs to classify more than two classes at a time. Perhaps the simplest and most common is the one-against-all approach.11 In this approach, an SVM model is built for each class, with all the other objects grouped together into an opposing class. During prediction, the SVM that yields the largest output designates the predicted class of the object. A second approach builds an SVM model for each pairwise combination of classes. All the SVM models are evaluated and polled, and the models with the largest number of positive results for a class designate the predicted class. Neither of these two approaches works particularly well. There are other, more complicated approaches as well that treat the SVMs as black-box classifiers.
SVMs work by finding a classification hyperplane that separates two classes of objects and provides the largest margin of separation. Having a large separation between objects of differing classes provides stability to the classifier, in that small perturbations of the data in the form of drift or noise will not cause misclassification. Around the same time that the SVM was developed, the fuzzy rule-building expert system (FuRES) was devised,12 which also maximizes a fuzzy margin around the plane of separation. FuRES solved the binary classification problem by using a divide-and-conquer approach through the formation of a classification tree. At each branch of the tree, a multivariate discriminant separates the objects and directs them to the other branches (i.e., rules) that perform further separations of the objects until all the objects are classified.
For complex problems such as clinical proteomic or metabolomic studies, the data are often multimodal and simple classifiers are not appropriate. Tree-based approaches distribute the data objects into smaller groups for which a simple linear classifier will be effective. A key advantage of the tree-based classifier is that nonlinearly separable data may be classified, and for SVMs this advantage avoids the necessity of finding a workable kernel transform.
This paper presents a strategy of using a classification tree to assemble SVMs. Unlike other work, which uses permutations of the classes to find a binary encoding, the novel feature of this work is that the distribution of the data itself is used to determine the encoding. Two approaches are used: the first is variance or data driven and is based on principal component analysis (PCA);13 the second is covariance driven and is based on partial least squares (PLS).14 After the SVM models are built, the one that provides the lowest entropy of classification (i.e., is the most efficient classifier) is selected for the branch of the tree.
SVMs have an inherent advantage over FuRES in that they are kernel-based classifiers, so for data sets with few objects m and many measurements n, forming a kernel of size m × m makes the computation very fast when m is much smaller than n. As will be seen, with the tree algorithm the requirement that the data be linearly separable no longer applies. However, any kernel transform may be used with this algorithm.
Theory

The SVM is a binary linear classifier that optimizes a classification hyperplane between the surface data points of two clusters in the data space.3,15 The constrained optimization relies on Lagrange multipliers that allow for a primal solution in the native space or a dual solution in a kernel space. When the number of measurements is greater than the number of objects, transforming to a kernel can yield a significant advantage with respect to computational speed and load. However, when there are fewer variables than objects, optimization in the primal space is more efficient. Furthermore, because objects located far from the boundary are removed from the calculation as the support vectors are determined, further computational efficiency is obtained. Several excellent tutorials present the mathematical details.16
The classification is achieved via a hyperplane, defined by an orthogonal weight vector w, that separates objects in the data space. Predictions are made by

$$\hat{y}_i = \mathbf{x}_i \cdot \mathbf{w} + b \qquad (1)$$

for which the predicted class of object i is obtained by multiplying the data object by the weight vector w. The bias value b defines the point of intersection of the weight vector with the hyperplane. In the two-class model, bipolar encoding is used (i.e., the class descriptor $y_i$ is either +1 or -1) for the data object $\mathbf{x}_i$. The data object is a row vector with n measurements. An example of the constrained optimization for the weight vector w and its intercept b is given below.

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{m}\xi_i \qquad (2)$$

$$y_i\left(\mathbf{x}_i \cdot \mathbf{w} + b\right) \ge 1 - \xi_i \;\;\; \forall i \qquad (3)$$

$$\xi_i \ge 0 \;\;\; \forall i \qquad (4)$$
The weight vector w is determined by minimizing the convex function given in (2), for which the first term simply minimizes the Euclidean length of w and the second term is weighted by a regularization parameter C that controls the slack variables ξ. The slack variable ξi allows training object i to fall inside the margins so that a wider plane of separation for the m objects can be obtained. A smaller value of C yields a larger margin of separation at the cost of misclassifying some of the training objects. C is the cost-function parameter that is critical to the performance of the SVM models. The optimization is constrained by equations (3) and (4).
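To make the optimization concrete, the primal problem in equations (2)-(4) can be posed for a general quadratic-programming solver. The following MATLAB sketch is illustrative only and is not the author's implementation; the function name and the stacking of w, b, and ξ into a single optimization variable are assumptions.

    % Illustrative sketch: primal soft-margin SVM, equations (2)-(4), via quadprog.
    % X is m x n (objects in rows), y is m x 1 with entries +1 or -1, C is the cost.
    function [w, b, xi] = svmPrimalSketch(X, y, C)
        [m, n] = size(X);
        % Optimization variable z = [w; b; xi]
        H = blkdiag(eye(n), 0, zeros(m));          % quadratic term: (1/2)*||w||^2 only
        f = [zeros(n + 1, 1); C * ones(m, 1)];     % linear term: C*sum(xi)
        % Constraint (3): y_i*(x_i*w + b) >= 1 - xi_i, rewritten as A*z <= -1
        A = [-diag(y) * X, -y, -eye(m)];
        rhs = -ones(m, 1);
        % Constraint (4): xi >= 0; w and b remain unbounded
        lb = [-inf(n + 1, 1); zeros(m, 1)];
        z = quadprog(H, f, A, rhs, [], [], lb, []);
        w = z(1:n);
        b = z(n + 1);
        xi = z(n + 2:end);
    end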
As a consequence of the dual solution afforded by the Lagrange multipliers, equations (2)-(4) can be rewritten as

$$\min_{\boldsymbol{\alpha}} \;\; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_i \alpha_j y_i y_j K_{ij} - \sum_{i=1}^{m}\alpha_i \qquad (5)$$

$$\sum_{i=1}^{m}\alpha_i y_i = 0 \qquad (6)$$

$$0 \le \alpha_i \le C \;\;\; \forall i \qquad (7)$$

In the dual formulation, the αi are weights that are used to define the support vectors (i.e., the data objects used to define the w vector). The weight vector w and the bias value b are obtained from the equations below. The m × m kernel matrix K in this work comprises the outer product of the mean-centered data matrix. Other kernels may be used, but this study utilizes the linear kernel as defined above. It also makes sense to decouple the nonlinear transform used to achieve linear separability from the kernel, which merely provides a reduced space for the calculation. Equation (5) can be minimized subject to constraints (6) and (7) in MATLAB using the quadprog function of the Optimization Toolbox. The weight w and intercept b are obtained from equations (8) and (9) below.

$$\mathbf{w} = \sum_{i=1}^{m}\alpha_i y_i \mathbf{x}_i \qquad (8)$$

$$b = \frac{\sum_{i=1}^{m}\alpha_i y_i - \mathbf{w}\cdot\sum_{i=1}^{m}\alpha_i \mathbf{x}_i}{\sum_{i=1}^{m}\alpha_i} \qquad (9)$$
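The dual problem in equations (5)-(9) can likewise be sketched in a few lines of MATLAB, assuming a linear kernel computed from mean-centered data. Again, this is an illustration rather than the author's code.

    % Illustrative sketch: linear-kernel dual SVM, equations (5)-(9), via quadprog.
    % X is m x n (objects in rows), y is m x 1 with entries +1 or -1, C is the cost.
    function [w, b, alpha] = svmDualSketch(X, y, C)
        m = size(X, 1);
        Xc = X - mean(X, 1);                       % mean-centered data
        K = Xc * Xc';                              % m x m linear kernel, equation (10)
        H = (y * y') .* K;                         % quadratic term of equation (5)
        f = -ones(m, 1);                           % linear term of equation (5)
        Aeq = y';  beq = 0;                        % constraint (6)
        lb = zeros(m, 1);  ub = C * ones(m, 1);    % constraint (7)
        alpha = quadprog(H, f, [], [], Aeq, beq, lb, ub);
        w = Xc' * (alpha .* y);                    % equation (8)
        b = sum(alpha .* (y - Xc * w)) / sum(alpha);   % equation (9)
    end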
SVMs are limited to solving two-class or binary problems. However, several approaches have been adopted to solve multiple-classification problems. The simplest is to build an SVM classification model for each class, with all the other classes grouped into a single negative class. An object then receives the class designation of the model that yields the greatest output, as sketched below.
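Purely as an illustration (the variable names are assumed, and this is not the author's code), the one-against-all decision reduces to two lines of MATLAB once the g per-class models are collected into a weight matrix and a bias vector:

    % Illustrative sketch: one-against-all prediction with g class models.
    % W is n x g (one weight vector per class), bias is 1 x g, Xnew is p x n.
    scores = Xnew * W + bias;                  % SVM output of every model for every object
    [~, predictedClass] = max(scores, [], 2);  % the model with the greatest output wins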
Two approaches have been devised to assign multiple classes to one of the two bipolar classes. One approach is based on variance (i.e., PCA based) while the other approach is based on covariance (i.e., PLS based). These operations are performed on the mean-corrected data X for the primal form (m greater than n) and directly on the kernel K for the dual form (m less than n). The equations will be given for the dual form, but the calculations for the primal form are similar and give the same result.
The equations for deriving the variance-based encoding are given first. Step one is to correct the model-building data set by subtracting the mean and then to form the kernel as given below

$$\mathbf{K} = \left(\mathbf{X} - \bar{\mathbf{X}}\right)\left(\mathbf{X} - \bar{\mathbf{X}}\right)^{\mathrm{T}} \qquad (10)$$

for which the m × m kernel is calculated as the outer product of the mean-corrected data. The objects are rows, and the mean matrix $\bar{\mathbf{X}}$ is composed of the average of the objects. The advantage of the kernel is that it is computationally efficient when the number of measurements (i.e., columns) exceeds the number of objects (i.e., rows).
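For clarity, the mean correction and kernel formation of equation (10) amount to the following MATLAB lines (repeated from the dual-SVM sketch above; illustrative only):

    % Illustrative sketch: mean correction and kernel formation, equation (10).
    % X is the m x n data matrix with objects in rows.
    Xbar = mean(X, 1);      % 1 x n vector of column means (the average object)
    Xc = X - Xbar;          % mean-corrected data
    K = Xc * Xc';           % m x m kernel as the outer product, efficient when n >> m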
Next, an initial guess is made for the vector y0; a good initial guess is to pick the column of the kernel K that has the largest sum of squares.

$$\mathbf{y}_{i+1} = \mathbf{K}\mathbf{y}_i \qquad (11)$$

$$\mathbf{y}_{i+1} = \mathbf{y}_{i+1}\left(\mathbf{y}_{i+1}^{\mathrm{T}}\mathbf{y}_{i+1}\right)^{-1/2} \qquad (12)$$

A new estimate of the vector yi+1 is made, and equations (11) and (12) are iterated until convergence. At each iteration, equation (12) normalizes the vector yi+1 to unit length. In this way the first principal component is calculated, and the converged yi+1 are the normalized object scores. After convergence, the scores are sorted from low to high and searched to find the largest gap between adjacent scores. The midpoint of this gap is used for the binary encoding, so that scores less than the midpoint are bipolarly encoded as -1 and scores greater than the midpoint as +1. In this approach, the encoding for the SVM is based on the distribution of the data objects, that is, on their variance.
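The variance-driven encoding described above may be summarized by the following MATLAB sketch; the function name, iteration limit, and convergence tolerance are assumptions rather than the author's settings.

    % Illustrative sketch: variance-based bipolar encoding, equations (10)-(12).
    % K is the m x m kernel from equation (10); ybin is returned as a vector of -1/+1.
    function ybin = varianceEncodingSketch(K)
        [~, j] = max(sum(K.^2, 1));           % column with the largest sum of squares
        y = K(:, j);                          % initial guess y0
        for iter = 1:500                      % iterate equations (11) and (12)
            yNew = K * y;                     % equation (11)
            yNew = yNew / norm(yNew);         % equation (12): unit length
            if norm(yNew - y) < 1e-10, y = yNew; break; end
            y = yNew;
        end
        scores = sort(y);                     % sorted principal component scores
        [~, k] = max(diff(scores));           % largest gap between adjacent scores
        midpoint = (scores(k) + scores(k + 1)) / 2;
        ybin = ones(size(y));                 % scores above the midpoint -> +1
        ybin(y < midpoint) = -1;              % scores below the midpoint -> -1
    end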
The second approach for binary encoding is based on covariance and uses the class designations that are binary encoded in a target matrix Ybin. The kernel K is obtained as in equation (10), and an iteration similar to that of equations (11) and (12) is used. However, two additional steps, given in equations (13) and (14), are inserted between (11) and (12).

$$\mathbf{q} = \mathbf{y}_{i+1}^{\mathrm{T}}\mathbf{Y}_{\mathrm{bin}} \qquad (13)$$

$$\mathbf{y}_{i+1} = \mathbf{Y}_{\mathrm{bin}}\mathbf{q}^{\mathrm{T}} \qquad (14)$$

The row vector q contains a weight for each class defined in the m × g matrix Ybin of binary encoded class descriptors. The process iterates until convergence, and then, in a similar fashion as before, the vector y is sorted and the largest gap between values is found, with the added constraint that the gap must also separate two different classes. The midpoint of this gap is the criterion that is used for the bipolar encoding, as described previously.
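A matching sketch of the covariance-driven encoding is given below. It assumes that Ybin contains one column per class with a value of 1 marking membership, and it is again illustrative rather than the author's implementation.

    % Illustrative sketch: covariance-based bipolar encoding, equations (11)-(14).
    % K is the m x m kernel; Ybin is m x g with 1 marking class membership.
    function ybin = covarianceEncodingSketch(K, Ybin)
        [~, j] = max(sum(K.^2, 1));
        y = K(:, j);                          % initial guess y0
        for iter = 1:500
            yNew = K * y;                     % equation (11)
            q = yNew' * Ybin;                 % equation (13): one weight per class
            yNew = Ybin * q';                 % equation (14)
            yNew = yNew / norm(yNew);         % equation (12): unit length
            if norm(yNew - y) < 1e-10, y = yNew; break; end
            y = yNew;
        end
        [scores, order] = sort(y);            % sorted scores and the object ordering
        [~, classIdx] = max(Ybin, [], 2);     % class label of each object
        sortedClass = classIdx(order);
        gaps = diff(scores);
        gaps(sortedClass(1:end-1) == sortedClass(2:end)) = -inf;  % gap must separate classes
        [~, k] = max(gaps);
        midpoint = (scores(k) + scores(k + 1)) / 2;
        ybin = ones(size(y));
        ybin(y < midpoint) = -1;
    end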
If the two bipolar encodings are the same, only a single SVM model is built. However, if they are different, an SVM model is calculated for each encoding. Then, the entropy of classification H for each model is calculated from the estimates of the variance-based and covariance-based SVM models. The model with the lower entropy of classification is chosen for the branch of the tree structure. If the two entropies are equal, the SVM model with the shorter weight vector w, and hence the larger margin, is selected.
The entropy of classification H is defined by counting the number of objects on each side of the hyperplane, as determined by positive or negative values of the SVM output $\hat{y}_i$ in equation (1). Probabilities $p_{gs}$ are calculated by dividing the number of objects of class g on side s of the plane by the total number of objects $m_s$ on that side of the hyperplane. Then the entropy of classification is calculated as

$$H = -\sum_{s}\frac{m_s}{m}\sum_{g=1}^{G} p_{gs}\log_2 p_{gs}$$

for which m is the total number of objects and G is the number of classes.
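Under the assumptions stated above (side membership from the sign of the SVM output and integer class labels), the entropy calculation may be sketched in MATLAB as follows; this is an illustration, not the author's code.

    % Illustrative sketch: entropy of classification H for one SVM split.
    % yhat is m x 1 (SVM outputs), classIdx is m x 1 (integer class labels 1..G).
    function H = classificationEntropySketch(yhat, classIdx)
        m = numel(yhat);
        H = 0;
        for side = [false, true]                         % negative side, then positive side
            onSide = (yhat >= 0) == side;
            ms = sum(onSide);
            if ms == 0, continue; end
            counts = accumarray(classIdx(onSide), 1);    % objects of each class on this side
            p = counts(counts > 0) / ms;                 % class probabilities on this side
            H = H - (ms / m) * sum(p .* log2(p));        % weighted Shannon entropy
        end
    end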