Peter de Boves Harrington. Support Vector Machine Classification Trees. Anal. Chem., Just Accepted Manuscript. DOI: 10.1021/acs.analchem.5b03113. Publication Date (Web): October 13, 2015.

Support Vector Machine Classification Trees

Peter de Boves Harrington

Ohio University Center for Intelligent Chemical Instrumentation
Department of Chemistry & Biochemistry, Clippinger Laboratories
Athens, OH 45701-2979 USA

Abstract

Proteomic and metabolomic studies based on chemical profiling require powerful classifiers to accurately model complex collections of data. Support vector machines (SVMs) are advantageous in that they provide a maximum margin of separation for the classification hyperplane. A new method for constructing classification trees, for which the branches comprise SVMs, has been devised. The novel feature is that the distribution of the data objects is used to determine the SVM encoding. The variance and covariance of the data objects are used to determine the bipolar encoding required by the SVM. The SVM that yields the lowest entropy of classification becomes the branch of the tree. The SVM-tree classifier has the added advantage that nonlinearly separable data may be accurately classified without optimization of the cost parameter C or a search for a suitable higher-dimensional kernel transform. It compares favorably to regularized linear discriminant analysis (RLDA), SVMs in a one-against-all multiple classifier, and a fuzzy rule-building expert system (FuRES), a tree classifier with a fuzzy margin of separation. SVMs offer a speed advantage, especially for data sets that have more measurements than objects.

Keywords: Support vector machine (SVM), classification tree, chemometrics, chemical profiling


Introduction

Chemical profiling and qualitative analysis is a burgeoning field because of the advances in chemical instrumentation and its synergistic coupling to chemometrics1. Authentication of complex materials such as food and nutraceuticals often requires fast and robust classifiers2. Furthermore, methods are required that can be applied automatically and, therefore, should be parameter-free. Support vector machines (SVMs)3 are a relatively new type of classifier. Their key advantage is that they can construct classification models very quickly, especially for megavariate data4 (i.e., data sets that have many more measurements than objects). The analysis of megavariate data is becoming commonplace because of the ever-increasing resolution and speed of chemical measurement.

SVMs have found many applications in analytical chemistry. Perhaps the earliest application is the detection of cancer biomarkers5. Another example is the classification of the traditional Chinese medicine Semen Cassiae (Senna obtusifolia) seeds into groups of roasted and raw from their infrared spectra6. SVMs were also used with high-throughput mass spectrometry for the prediction of cocoa sensory properties7. They have been used for the forensic analysis of inks and pigments using Raman spectroscopy and laser-induced breakdown spectroscopy8. SVMs predicted fuel properties from near-infrared spectroscopy (NIRS)9 data. Together with NIRS, SVMs have been used to detect endometrial cancers10.

SVMs are binary classifiers that require bipolar encoding of the classes (i.e., the classes are encoded as -1 or +1). Some basic schemes have been applied to adapt SVMs to classify more than two classes at a time. Perhaps the simplest and most common is the one-against-all approach11. In this approach, an SVM model is built for each class, and all the other objects are grouped together into an opposing class. During prediction, the SVM that yields the largest output designates the predicted class of the object. A second approach builds an SVM model for each pair-wise combination of classes. All the SVM models are evaluated and polled, and the class with the largest number of positive results designates the predicted class. Neither of these two approaches works particularly well. There are other, more complicated approaches as well that treat the SVMs as black-box classifiers.

SVMs work by finding a classification hyperplane that separates two classes of objects and provides the largest margin of separation. Having a large separation between objects of differing classes provides stability to the classifier in that small perturbations to the data in the form of drift or noise will not cause misclassification. Around the same time that the SVM was developed, the fuzzy rule-building expert system (FuRES) was devised12, which also maximizes a fuzzy margin around the plane of separation. FuRES solved the binary classification problem by using a divide-and-conquer approach through the formation of a classification tree. At each branch of the tree, a multivariate discriminant separates the objects and directs them to the other branches (i.e., rules) that perform further separations of the objects until all the objects are classified.

For complex problems such as clinical proteomic or metabolic studies, the data are often multimodal and simple classifiers are not appropriate. Tree-based approaches distribute the data objects into smaller groups for which a simple linear classifier will be effective. A key advantage of the tree-based classifier is that nonlinearly separable data may be classified, and for SVMs this advantage avoids the necessity of finding a workable kernel transform.

This paper presents a strategy of using a classification tree to assemble SVMs. Unlike other work, which uses permutations of the classes to find a binary encoding, the novel feature of this work is that the distribution of the data itself is used to determine the encoding. Two approaches are used: the first is variance or data driven and is based on principal component analysis (PCA)13; the second is covariance driven and is based on partial least squares (PLS)14. Then, after the SVM models are built, the one that provides the lowest entropy of classification (i.e., the most efficient classifier) is selected for the branch of the tree.
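To make the overall procedure concrete, the following is a minimal, pseudocode-style sketch of the tree construction just described. The helper names (encode_by_variance, encode_by_covariance, train_svm, entropy_of_classification) are hypothetical stand-ins for the steps detailed in the Theory section, not the author's implementation, and NumPy arrays are assumed for the data and labels.

import numpy as np

# Pseudocode-style outline of the SVM classification tree described above.
# encode_by_variance, encode_by_covariance, train_svm, and
# entropy_of_classification are hypothetical helpers (sketched in the
# Theory section), not the author's actual routines.
def build_svm_tree(X, labels):
    if np.unique(labels).size == 1:                  # pure node -> leaf
        return {"leaf": True, "class": labels[0]}

    candidates = []
    for encode in (encode_by_variance, encode_by_covariance):
        y = encode(X, labels)                        # bipolar (-1/+1) encoding
        svm = train_svm(X, y)                        # one SVM per encoding
        H = entropy_of_classification(svm, X, labels)
        candidates.append((H, svm, y))

    H, svm, y = min(candidates, key=lambda c: c[0])  # lowest-entropy SVM wins
    neg = y < 0                                      # route objects to branches
    # A full implementation would also stop when a branch fails to split.
    return {"leaf": False, "svm": svm,
            "neg": build_svm_tree(X[neg], labels[neg]),
            "pos": build_svm_tree(X[~neg], labels[~neg])}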

SVMs have an inherent advantage over FuRES in that they are kernel-based classifiers, so for data sets with few objects m and many measurements n, forming a kernel of size m×m makes the computation very fast when m is much smaller than n. As will be seen, with the tree algorithm the requirement that the data be linearly separable no longer applies. However, any kernel transform may be used with this algorithm.

Theory

The SVM is a binary linear classifier that optimizes a classification hyperplane between the surface data points of two clusters in the data space.3, 15 The constrained optimization relies on Lagrangian multipliers that allow for a primal solution in the native space or a dual solution in a kernel space. When the number of measurements is greater than the number of objects, transforming to a kernel can yield a significant advantage with respect to computational speed and load. However, when there are fewer variables than objects, optimization in the primal space is more efficient. Furthermore, as objects located far from the boundary are removed from the calculation while the support vectors are determined, further computational efficiency is obtained. Several excellent tutorials that present the mathematical details are available.16

The classification is achieved via a hyperplane that separates objects in the data space and that is defined by an orthogonal weight vector w. Predictions are made by

\hat{y}_i = \mathbf{x}_i \mathbf{w} + b        (1)

for which the predicted class \hat{y}_i of object \mathbf{x}_i is obtained by multiplication with the weight vector w. The bias value b defines the point of intersection of the weight vector with the hyperplane.
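As a small illustration of equation (1), the decision values for a set of objects are the projections of the rows of the data matrix onto w plus the bias, and their signs give the bipolar class assignments. This is a sketch with illustrative names, not the author's code.

import numpy as np

# Sketch of equation (1): each row of X is an object x_i; the decision value
# is x_i . w + b, and its sign is the bipolar (-1/+1) class prediction.
def svm_decision(X, w, b):
    return X @ w + b

def svm_predict(X, w, b):
    return np.sign(svm_decision(X, w, b))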

In the two-class model, bipolar encoding is used (i.e., the class descriptor y_i is either +1 or -1) for the data object x_i. The data object is a row vector with n measurements. An example of the constrained optimization for the weight vector w and its intercept b is given below.

\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{m}\xi_{i}        (2)

y_{i}(\mathbf{x}_{i}\cdot\mathbf{w} + b) \ge 1 - \xi_{i} \quad \forall i        (3)

\xi_{i} \ge 0 \quad \forall i        (4)

The weight vector w is determined by minimizing the convex function given in (2), for which the first term simply minimizes the Euclidean length of w and the second term penalizes the slack variables ξ, weighted by the regularization parameter C. The slack variable ξ_i allows training object i to fall inside the margins so that a wider plane of separation for the m objects can be obtained. A smaller value of C yields a larger margin of separation at the cost of misclassifying some of the training objects. C is the parameter of the cost function that is critical to the performance of SVM models. The optimization is constrained by equations (3) and (4).
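For a given w and b, the smallest feasible slack in equations (3) and (4) is ξ_i = max(0, 1 − y_i(x_i·w + b)), so the primal objective (2) can be evaluated directly as a hinge-loss expression. The sketch below is illustrative only and simply shows how C trades margin width against training errors.

import numpy as np

# Illustrative evaluation of the primal objective (2)-(4): for fixed w and b
# the optimal slack variables are xi_i = max(0, 1 - y_i (x_i . w + b)).
def primal_objective(X, y, w, b, C):
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))   # xi_i, one per object
    return 0.5 * np.dot(w, w) + C * slack.sum()      # margin term + penalty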

As a consequence of the dual solution afforded by the Lagrange multipliers, equations (2)-(4) can be rewritten as

\min_{\boldsymbol{\alpha}} \; \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m}\alpha_{i}\alpha_{j}y_{i}y_{j}K_{ij} - \sum_{i=1}^{m}\alpha_{i}        (5)

\sum_{i=1}^{m}\alpha_{i}y_{i} = 0        (6)

0 \le \alpha_{i} \le C \quad \forall i        (7)

In the dual formulation, the α_i are weights that are used to define the support vectors (i.e., the data objects used to define the w vector). The m × m kernel matrix K in this work comprised the outer product of the mean-centered data matrix. Other kernels may be used, but this study utilizes the linear kernel as defined above. It also makes sense to decouple the nonlinear transform that achieves linear separability from the kernel that merely provides a reduced space for the calculation. Equation (5) can be minimized subject to constraints (6) and (7) in MATLAB using the quadprog function of the Optimization Toolbox. The weight vector w and intercept b are then obtained from equations (8) and (9) below.

\mathbf{w} = \sum_{i=1}^{m}\alpha_{i}y_{i}\mathbf{x}_{i}        (8)

b = \frac{\sum_{i=1}^{m}\alpha_{i}\left(y_{i} - \mathbf{x}_{i}\cdot\mathbf{w}\right)}{\sum_{i=1}^{m}\alpha_{i}}        (9)
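The following sketch solves the dual problem (5)-(7) for a linear kernel and recovers w and b from equations (8) and (9). The paper itself uses MATLAB's quadprog; here a general-purpose SciPy solver stands in purely for illustration, and the function and variable names are assumptions rather than the author's code.

import numpy as np
from scipy.optimize import minimize

# Hedged sketch: solve the dual problem (5)-(7) with a generic solver, then
# recover w and b via equations (8) and (9). Defaults are illustrative.
def train_linear_svm_dual(X, y, C=1.0):
    m = X.shape[0]
    K = X @ X.T                                   # linear kernel, m x m
    Q = (y[:, None] * y[None, :]) * K             # y_i y_j K_ij

    def objective(a):                             # equation (5)
        return 0.5 * a @ Q @ a - a.sum()

    def gradient(a):
        return Q @ a - np.ones(m)

    constraints = [{"type": "eq", "fun": lambda a: a @ y}]   # equation (6)
    bounds = [(0.0, C)] * m                                   # equation (7)
    result = minimize(objective, np.zeros(m), jac=gradient,
                      bounds=bounds, constraints=constraints, method="SLSQP")
    alpha = result.x

    w = (alpha * y) @ X                           # equation (8)
    b = np.sum(alpha * (y - X @ w)) / alpha.sum() # equation (9)
    return w, b, alpha

The objects with nonzero α are the support vectors; all other objects drop out of equation (8), which is the source of the computational savings noted above.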

SVMs are limited to solving two-class or binary problems. However, several approaches have been adopted to solve multiple-classification problems. The simplest is to build an SVM classification model for each class that has all the other classes grouped into a single negative class. An object is then assigned the class designation of the model that yields the greatest result.
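A minimal sketch of this one-against-all scheme, reusing the illustrative train_linear_svm_dual function from above (again, an assumption for illustration, not the author's code):

import numpy as np

# One-against-all: one SVM per class (that class encoded +1, all others -1);
# a new object receives the label of the model with the largest output.
def fit_one_against_all(X, labels, C=1.0):
    models = {}
    for c in np.unique(labels):
        y = np.where(labels == c, 1.0, -1.0)
        w, b, _ = train_linear_svm_dual(X, y, C)   # illustrative trainer above
        models[c] = (w, b)
    return models

def predict_one_against_all(X, models):
    classes = list(models)
    scores = np.column_stack([X @ w + b for (w, b) in models.values()])
    return np.asarray(classes)[np.argmax(scores, axis=1)]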

Two approaches have been devised to assign the multiple classes to one of the two bipolar classes. One approach is based on variance (i.e., PCA based) while the other approach is based on covariance (i.e., PLS based). These operations are performed on the mean-corrected data X for the primal form (m greater than n) and directly on the kernel K for the dual form (m less than n). The equations will be given for the dual form, but the calculations for the primal form are similar and give the same result.

The equations for deriving the variance-based encoding are given first. Step one is to correct the model-building data set by subtracting the mean and then forming the kernel as given below

\mathbf{K} = (\mathbf{X} - \bar{\mathbf{X}})(\mathbf{X} - \bar{\mathbf{X}})^{\mathrm{T}}        (10)

for which the m × m kernel is calculated as the outer product of the mean-corrected data. The objects are rows, and the mean matrix \bar{\mathbf{X}} is composed of the average of the objects. The advantage of the kernel is that it is computationally efficient when the number of measurements (i.e., columns) exceeds the number of objects (i.e., rows).
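A two-line illustration of equation (10), assuming the objects are the rows of a NumPy array:

import numpy as np

# Equation (10): mean-center the model-building data and form the m x m
# linear kernel as the outer product of the centered rows.
def mean_centered_kernel(X):
    Xc = X - X.mean(axis=0)     # subtract the mean object from every row
    return Xc, Xc @ Xc.T        # K is m x m; cheap when m is much less than n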


Next, an initial guess is made for the binary vector y0; a good initial guess is to pick the column of the kernel K that has the largest sum of squares.

\mathbf{y}_{i+1} = \mathbf{K}\mathbf{y}_{i}        (11)

\mathbf{y}_{i+1} = \mathbf{y}_{i+1} / (\mathbf{y}_{i+1}^{\mathrm{T}}\mathbf{y}_{i+1})^{1/2}        (12)

for which a new estimate of the vector y_{i+1} is made, and equations (11) and (12) are iterated until convergence. For each iteration, the vector y_{i+1} is normalized to unit vector length.

The first principal component is thereby calculated, and the y_{i+1} are the normalized object scores. After convergence, the scores are sorted from low to high and searched to find the largest gap between scores. The midpoint of this gap is used for binary encoding, so that values less than the gap value are bipolarly encoded as -1 and values greater than the gap value as +1. In this approach, the encoding for the SVM is based on the distribution of the data objects or their variance.
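The sketch below illustrates this variance-driven encoding (equations (11) and (12)): power iteration on the kernel yields the first principal-component scores, and the largest gap in the sorted scores splits the objects into the -1 and +1 groups. The tolerance and iteration limit are illustrative assumptions, not values from the paper.

import numpy as np

# Hedged sketch of the variance-based bipolar encoding described above.
def variance_encoding(K, tol=1e-10, max_iter=500):
    y = K[:, np.argmax((K ** 2).sum(axis=0))]        # initial guess y0
    y = y / np.linalg.norm(y)
    for _ in range(max_iter):
        y_new = K @ y                                # equation (11)
        y_new /= np.linalg.norm(y_new)               # equation (12)
        if np.linalg.norm(y_new - y) < tol:
            y = y_new
            break
        y = y_new
    scores = np.sort(y)
    gap = np.argmax(np.diff(scores))                 # widest gap in the scores
    split = 0.5 * (scores[gap] + scores[gap + 1])    # midpoint of the gap
    return np.where(y < split, -1.0, 1.0)            # bipolar encoding

The overall sign of the converged score vector is arbitrary, but that does not matter here because only the grouping of the objects on either side of the gap is used.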

The second approach for binary encoding is based on covariance and uses the class designations that are binary encoded in a target matrix Ybin. The kernel K is obtained as in equation (10), and a similar iteration is used as in equations (11) and (12). However, additional steps are inserted between (11) and (12).

\mathbf{q} = \mathbf{y}_{i+1}^{\mathrm{T}}\mathbf{Y}_{\mathrm{bin}}        (13)

\mathbf{y}_{i+1} = \mathbf{Y}_{\mathrm{bin}}\mathbf{q}^{\mathrm{T}}        (14)

for which the row vector q contains a weight for each class that is defined in the m×g matrix Ybin of binary encoded class descriptors. The process iterates until convergence, and then, in a similar fashion as before, the vector y is sorted and the largest gap between values is found, but with the added constraint that the gap must also separate two different classes. The midpoint of this gap is the criterion that is used for bipolar encoding, as described previously.
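An illustrative sketch of this covariance-driven (PLS-like) encoding follows. It assumes Ybin has one 1 per row (one column per class), and it implements the class-separation constraint in a simplified form by excluding gaps between neighboring objects of the same class; these details are assumptions for illustration only.

import numpy as np

# Hedged sketch of the covariance-based bipolar encoding, equations (13)-(14)
# inserted between the kernel step (11) and the normalization step (12).
def covariance_encoding(K, Ybin, tol=1e-10, max_iter=500):
    y = K[:, np.argmax((K ** 2).sum(axis=0))]
    y = y / np.linalg.norm(y)
    for _ in range(max_iter):
        y_new = K @ y                     # equation (11)
        q = y_new @ Ybin                  # equation (13): one weight per class
        y_new = Ybin @ q                  # equation (14)
        y_new /= np.linalg.norm(y_new)    # equation (12)
        if np.linalg.norm(y_new - y) < tol:
            y = y_new
            break
        y = y_new
    order = np.argsort(y)
    scores, classes = y[order], Ybin.argmax(axis=1)[order]
    gaps = np.diff(scores)
    gaps[classes[:-1] == classes[1:]] = -np.inf   # gap must separate classes
    gap = np.argmax(gaps)
    split = 0.5 * (scores[gap] + scores[gap + 1])
    return np.where(y < split, -1.0, 1.0)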

If the two bipolar encodings are the same, only a single SVM model is built. However, if they are different, an SVM model is calculated for each encoding. Then, the entropy of classification H for each model is calculated from the estimates of the variance-based and covariance-based SVM models. The model with the lowest entropy of classification is chosen for the branch of the tree structure. If the two entropies are equal, the SVM model with the shorter weight vector w, and hence the larger margin, is selected.

The entropy of classification H is defined by counting the number of objects on each side of the hyperplane, as defined by positive or negative values of \hat{y}_i. Probabilities are calculated by dividing the number of objects of each class on a side of the hyperplane by the total number of objects on that side. Then the entropy of classification is calculated as

H = \sum_{s} \frac{m_{s}}{m} \sum_{j=1}^{g} -p_{sj}\log_{2}p_{sj}

where m_s is the number of objects on side s of the hyperplane, m is the total number of objects, and p_{sj} is the probability of class j on side s.
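An illustrative computation of this entropy is sketched below: the objects are split by the sign of their decision values, per-class probabilities are formed on each side, and the two Shannon entropies are combined, weighted by the fraction of objects on each side. This follows the reconstruction of the truncated equation given here and is not the author's code.

import numpy as np

# Hedged sketch of the entropy of classification H for one SVM branch.
def entropy_of_classification(decision_values, labels):
    m = len(labels)
    H = 0.0
    for side in (decision_values < 0, decision_values >= 0):
        n_side = side.sum()
        if n_side == 0:
            continue
        _, counts = np.unique(labels[side], return_counts=True)
        p = counts / n_side                       # class probabilities, one side
        H += (n_side / m) * np.sum(-p * np.log2(p))
    return H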