Consensus Classification Using Non-Optimized Classifiers Brett Brownfield, Tony Lemos, and John H. Kalivas Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.7b04399 • Publication Date (Web): 05 Mar 2018 Downloaded from http://pubs.acs.org on March 11, 2018




Analytical Chemistry

Consensus Classification Using Non-Optimized Classifiers Brett Brownfield,† Tony Lemos,† and John H. Kalivas†* †Department of Chemistry, Idaho State University, Pocatello, Idaho 83209, United States *[email protected]

ABSTRACT: Classifying samples into categories is a common problem in analytical chemistry and other fields. Classification is usually based on only one method, but numerous classifiers are available, some complex such as neural networks and others simple such as k nearest neighbors. Regardless, most classification schemes require optimization of one or more tuning parameters for best classification accuracy, sensitivity, and specificity. A process not requiring exact selection of tuning parameter values would be useful. To improve classification, several ensemble approaches have been used in past work to combine classification results from multiple optimized single classifiers. The collection of classifications for a particular sample is then combined by a fusion process such as majority vote to form the final classification. Presented in this paper is a method to classify a sample by combining multiple classification methods without specifically classifying the sample by each method, i.e., the classification methods are not optimized. The approach is demonstrated on three analytical data sets. The first is a beer authentication set with samples measured on five instruments, allowing fusion of multiple instruments in three ways. The second data set is composed of textile samples from three classes based on near infrared spectra. This data set is used to demonstrate the ability to classify simultaneously with different data pre-processing strategies, thereby reducing the need to determine the ideal pre-processing method, a common prerequisite for accurate classification. The third data set contains three wine cultivars for three classes measured on 13 unique chemical and physical variables. In all cases, fusion of non-optimized classifiers improves classification. Also presented are atypical uses of Procrustes analysis and extended inverted signal correction (EISC) for distinguishing sample similarities to respective classes.

Classifying a sample into a particular category is common in analytical chemistry.1-7 The meaning of classification varies between fields. Classification in this paper is the problem of determining which class (category) a sample belongs to, regardless of the type of classifier (traditional generative and discriminative groupings or other approaches not fitting this binary grouping paradigm). Thus, any method used to determine class membership is considered a classifier in this paper. There are a multitude of stand-alone classifiers that can be used,8,9 including partial least squares discriminant analysis (PLS-DA)10 and k nearest neighbors (kNN). Most classifiers depend on respective tuning parameter values and hence require optimization. For PLS-DA, this is the number of latent variables (LVs); for kNN, it is the number of nearest neighbors and the distance measure. The analytical chemist has to decide which algorithm to use and how best to optimize it for the particular data set being studied. Specifically, the most suitable optimized classifier is not the same one for all data sets. Additionally, optimization of a classifier to its "best" tuning parameter value(s) is a confounding problem with no commonly accepted solution and is an actively researched area. To avoid this method selection, ensemble (fusion) approaches have been developed.11-18 In general, fusion is "a multilevel, multifaceted process dealing with the automatic detection, association, correlation, estimation, and combination of data and information from multiple sources."19 With fusion, information presented by a number of classifiers infers a more robust classification beyond the capacity of a single classifier, i.e., a consensus classification is obtained, thereby reducing the chance of misclassifying an abnormal sample. While improved classification accuracy, specificity, and sensitivity are important goals

sought with fusion, it should be kept in mind that fusion also makes the classification less risky.20 There are generally two approaches to ensemble classifiers. One is concerned with forming a collection of classifiers using a single algorithm, such as random forests or other bagging processes.8,9,14,21,22 In these methods, different classifiers are formed from the same algorithm by randomly changing the training (learning, calibration, reference) set, such as selecting different samples (observations) and/or variables (features) multiple times. The goal is to ensemble a collection of weak classifiers to form a strong classifier. However, this does not really address the problem facing the analytical chemist as to which single classification algorithm to use and how to optimize it. The other strategy to ensemble classifiers is to combine results obtained from multiple optimized independent single classifiers such as PLS-DA, kNN, and others.13-18 Included in this combination can be the classification results from ensemble classifiers such as an optimized random forest classifier. As an example, rather than classification by one neural network, a collection of optimized neural networks was fused.23,24 Much of the work on this front involves determining the optimal linear or non-linear weighted combination of each neural network classification result to form the final classification. Another example is where several optimized self-organizing maps (SOM) were used to form a consensus geographical classification of crude oil samples.25 When multiple classifiers are combined, training (optimization) of each classifier is required and a decision is needed on how best to combine the classifiers. Rather than optimizing each classifier, further diversity in the collection of unique classifiers can be obtained by using a group of classifiers from a single algorithm based on a window of the respective tuning

ACS Paragon Plus Environment


parameter values. For example, as used in this study, a window of 10 single PLS2-DA classifiers would be composed of classification results based on 1 LV, 1 and 2 LVs, 1 through 3 LVs, and so on up to the 1 through 10 LV PLS2-DA classification results. For kNN, values for 1 NN, 2 NN, and so on up to 10 NN would be included. Thus, the fusion approach presented in this study to obtain a consensus classification does not involve optimizing the classifiers, thereby simplifying the classification ensemble. By using a window of tuning parameter values, training can be eliminated. With a collection of classifiers, however, an appropriate fusion rule is still needed. The combination of a collection of independent classifiers is a general problem occurring in all areas and is actively researched.11,12 Numerous fusion approaches have been proposed for small groups of single optimized classifiers using linear or non-linear combination rules. If a weighting scheme is included (weights given to each stand-alone classifier), then determination of those weights is also a confounding problem. Stacking with weighting is a common approach to fuse multiple classifiers.18,26 However, the classification approach presented in this paper creates too many non-optimized classifiers to use stacking as a fusion rule. The stacking would be overdetermined, requiring a biased regression method such as PLS and thereby necessitating a tuning parameter optimization. Majority vote is another fusion approach. However, majority voting requires actual classification by each classifier based on respective determined classifier thresholds. Fusion rules not requiring pre-classification are combination processes such as the sum, product, minimum, median, etc. of the raw values obtained by each classifier.
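The cumulative tuning parameter windows described above can be enumerated directly. A minimal Python sketch (the authors' implementation was in Matlab; the helper name is illustrative):

```python
def lv_windows(k_max):
    """Cumulative latent-variable (or eigenvector) windows:
    [1], [1, 2], ..., [1..k_max], one per classifier in the window."""
    return [list(range(1, k + 1)) for k in range(1, k_max + 1)]
```

For k_max = 10 this yields the ten PLS2-DA classifiers of the example above, each built from one more latent variable than the last.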
It has been shown that the simple sum rule outperforms other combination rules studied8 and it is used in this study without any weighting scheme, thereby simplifying the ensemble of non-optimized classifiers. A common problem in analytical chemistry classification studies based on spectral data is determining the "best" spectral pre-processing method.27-29 A large set of possibilities is available. Rather than trying to identify an appropriate pre-processing method, recent work used stacking to fuse calibration models based on different data pre-processing methods.29 In this paper, multiple pre-processing methods are used simultaneously rather than optimizing to the ideal pre-processing method (with no pre-processing being one of the possibilities). Presented in this paper is application of the sum fusion rule14 to 17 independent classifiers. The sum fusion is over the raw values computed for each classifier. The smallest sum for a class is the predicted class for that sample. Classifiers requiring tuning parameter selection are used over windows of tuning parameters, thereby forming a collection of that particular classifier. For example, one data set is based on five instruments and the 17 classifiers per instrument become 113 stand-alone classifiers per instrument for a total of 564 single classifiers to sum and classify each sample. As previously noted, fusion of different pre-processing methods is also evaluated. No weights, thresholds, or training (optimization of classifiers) is used. In summary, the contribution of this paper is presentation of a simple fusion method for the analytical chemist that can be used with any number of single independent classifiers. If a classifier requires optimization, a collection of that classifier


can be used, removing the need to optimize and train. Eliminating the training step allows automation of the process. Because raw values are used without thresholds, innumerable measures that distinguish sample or class differences can be included in the fusion. To demonstrate the potential of this unique classification approach, three data sets are studied with two and three classes. Letting p and n denote the number of corresponding variables and objects (samples), two data sets have p >> n and the third data set has p < n.

METHODS

In this study, 17 single classifiers are used. The six classifiers listed in Table S-1 (Mahalanobis distance (MD), Q-residual, sinθ, divergence criterion (DC), PLS2-DA, and kNN) require tuning parameter selection. The eleven classifiers listed in Table S-2 are tuning parameter independent (cosθ, Euclidean distance, determinant, inner product correlation, unconstrained Procrustes analysis (PA), constrained PA for two classifiers, and extended inverted signal correction difference (EISCD)). The next two sections describe the 17 single classifiers. Numerous other classifiers could be included.8 Because all classifiers except PLS2-DA and kNN are described as outlier measures in a recent fusion study for improving outlier detection,30 only brief clarifications are provided for the previously used 15 classifiers. The section Classification by Fusion describes the fusion process to classify a sample. As previously noted, tuning parameter windows are used for some of the classifiers and maximum window sizes are data set dependent relative to respective data set ranks (sizes). Determination of exact window sizes (k for the number of respective tuning parameter values) for fusion is not performed. Instead, trends in classification quality measures (see Classification Quality Measures section) are plotted from fusion of all 17 classifiers as fusion window sizes increase (increasing values of k) for each data set. Comments are provided in respective sections on how to determine tuning parameter window sizes.

Tuning Parameter Based Classifiers. Of the six classifiers listed in Table S-1, two are commonly used in analytical chemistry: PLS2-DA and kNN. The classification method known as soft independent modeling of class analogy (SIMCA)31 is another popular classifier in analytical chemistry and combines MD32,33 and Q-residual32,33 for classification.
Thus, the four classifiers PLS2-DA, kNN, MD, and Q-residual are evaluated as stand-alone classifiers in the Results and Discussion. The sinθ measure in Table S-1 was shown to classify well34 and is included. The DC35 is a similarity measure utilized in this study as a classifier. All six methods are based on comparing a vector xi of p measured variables (a spectrum for spectral data) to each class space spanned by the respective n class samples X.

Eigenvector Based Classifiers. For MD, Q-residual, sinθ, and DC (Table S-1, eq. sets S1, S2, S3, and S4, respectively), the class spaces are spanned with eigenvectors from the singular value decomposition (SVD) of X. Classification by these methods requires optimization of the number of eigenvectors (k). However, how to determine a value for k is not consistent in classification applications. The value used decides whether a sample is a class member or not, depending on the spectral relationships captured by the particular number of eigenvectors.
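As an illustration of one such eigenvector based measure, the Q-residual of a sample against a class subspace can be computed for every window size at once. A Python sketch, assuming the common Q-residual definition (the squared norm of the part of the sample lying outside the k-eigenvector class space); the exact formulas used in the paper are in Table S-1:

```python
import numpy as np

def q_residuals(X_class, x, k_max):
    """Q-residual of sample x (length p) against a class matrix X_class
    (n samples x p variables) for each eigenvector window k = 1..k_max."""
    # The right singular vectors of X_class span the class (spectral) space.
    _, _, Vt = np.linalg.svd(X_class, full_matrices=False)
    qs = []
    for k in range(1, k_max + 1):
        Vk = Vt[:k].T                    # p x k eigenvector basis
        resid = x - Vk @ (Vk.T @ x)      # part of x outside the class space
        qs.append(float(resid @ resid))  # squared residual norm
    return qs
```

The returned list supplies one raw value per window entry, ready to be placed in the Q-residual rows of the fusion input matrix; the residual shrinks (or stays constant) as k grows.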




Instead of optimizing k for each independent method, respective classifiers are formed over a window of multiple values of k (tuning parameter values) to form a collection of classifiers. Specifically, classifier values are computed with the first eigenvector (eigenvector 1) for the first window, then the two sets of classifier values from eigenvector 1 and eigenvectors 1 and 2 for the second window, followed by the three sets of classifier values from eigenvector 1, eigenvectors 1 and 2, and eigenvectors 1-3 for the third window, and so on until the last window with all possible classifier values from 1 through the rank of the smallest class. To determine an appropriate window size to classify samples in real-world settings, a cross-validation procedure could be used and classification quality trends evaluated. However, this would essentially entail a training phase. Alternatively, a simple rule can be used to determine the number of eigenvectors for a window: eigenvectors are included starting with the first and stopping when approximately 99.9% of the X information is captured and the corresponding eigenvectors are not excessively composed of noise structure. Shown in Figure 1 is a generic example of the fusion input matrix containing classifier values for a data set with three classes. The eigenvector based methods MD, sinθ, Q-residual, and DC are in the rows noted. Shown is the situation with 23 classifiers each (k is 23, for a 1 eigenvector classifier (the first eigenvector), a 2 eigenvector classifier (eigenvectors 1 and 2), and so forth up to the 23rd classifier using eigenvectors 1 through 23). Thus, each stand-alone classifier has expanded to a window of 23 classifiers for a total of 92 classifiers from the four eigenvector based classifiers. Values shown are the raw values computed for each classifier relative to each class.
Each row in Figure 1 has been processed by row normalization to unit length to remove magnitude differences. By using raw values, thresholds for each classification method are not needed.
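The 99.9% stopping rule described above for choosing a window size can be sketched as follows (Python; the noise-structure inspection mentioned in the text is a visual judgment and is not captured here):

```python
import numpy as np

def window_size(X_class, fraction=0.999):
    """Number of eigenvectors capturing ~99.9% of the class variance,
    a simple stopping rule for the fusion window size."""
    s = np.linalg.svd(X_class, compute_uv=False)   # singular values
    cum = np.cumsum(s**2) / np.sum(s**2)           # cumulative variance
    return int(np.searchsorted(cum, fraction) + 1)
```

The same rule can serve for the latent-variable windows of PLS2-DA, as noted later in the text.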

Figure 1. Example of classifiers assembled into the fusion input matrix for classifying one sample into one of three classes. Raw values in each row are normalized to unit length. Each range represents a window of k = 23 classifiers.

Detailed in the Classification by Fusion section is the fusion process; briefly, the sum fusion rule is applied to each column and the class with the smallest sum is the predicted class membership for that sample. For comparison to fusion, MD and Q-residual are also used as independent stand-alone classifiers. In these situations, windows are not used and the plotted classification quality trends are at a specific number of eigenvectors as if that number had

been optimized to be best for classification. The decision rule was that the class with the smallest value determines class membership.

PLS2-DA Classification. As previously indicated, the method of PLS2-DA (see eq. set S5, Table S-1) is used for all data sets. Shown in the first block of the fusion input matrix in Figure 1 are the 23 PLS2-DA classifiers (rows 1-23) over the 23 LV window. With PLS2-DA, dummy prediction values are required. The dummy variables are a 1 in the rows and columns of Y that identify class membership and -1 elsewhere. Because the column sum minimum of Figure 1 identifies the predicted class for that particular sample, each predicted class value ŷi is transformed to a minimum-oriented value by ŷt = ŷmax − ŷi, where ŷmax denotes the maximum predicted value over all rows of predicted values for the current window. The same value used for the maximum number of eigenvector windows for each data set is used for the number of LV windows. To determine a window size to classify samples in real-world situations, then as with determining an eigenvector window size, LVs are retained up to when approximately 99.9% of the information is captured and the corresponding LVs are not excessively composed of noise structure. Classification results are reported using PLS2-DA as a stand-alone single classifier at each number of LVs as if this were the number of LVs optimized as best. Class membership is decided by the maximum positive predicted value. These results are compared in the Results and Discussion section to those from the fusion process.

kNN. Rows 24-46 of the fusion input matrix displayed in Figure 1 contain the kNN results as the value of k increases from 1 to 23, i.e., 23 kNN classifiers. With kNN, two tuning decisions are normally needed. One is the number of nearest neighbors on which to base the classification decision and the other is the distance measure. Only the Euclidean distance is used here (Table S-1, eq. S6), but other distances could be additionally included in the fusion matrix. One particular value for k is not required with the fusion process. Instead, a window of kNN classifiers is used where each row in the fusion matrix records the respective column (class) number of NN. For example, starting with 1 NN, values in the first row of the kNN block in Figure 1 are 0 except for the class with the 1 NN. For the 2 NN classifier (row), two situations are possible for the three-class example in Figure 1. Row values could be a 1 in two classes (columns) for 1 NN each and 0 in the third class, or a 2 in one class (2 NN) and 0 for the other two classes.
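The kNN block rows just described, including the flip to minimum-oriented values, can be sketched in Python (the paper's implementation was in Matlab; function and argument names are illustrative):

```python
import numpy as np

def knn_window(X_train, labels, x, n_classes, k_max):
    """Rows of the kNN block of the fusion input matrix: per-class
    neighbor counts for k = 1..k_max, flipped (NNt = NNmax - NN) so the
    smallest entry, like the other classifiers, marks the likely class."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    order = np.argsort(d)                     # nearest samples first
    rows = []
    for k in range(1, k_max + 1):
        counts = np.bincount(labels[order[:k]], minlength=n_classes)
        rows.append(counts.max() - counts)    # transform to minima
    return np.array(rows)
```

Each returned row corresponds to one kNN classifier in the window; summing these rows with the rest of the fusion matrix requires no single choice of k.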
This pattern of entries in the kNN block of the fusion input matrix continues up to the maximum number of NN desired. For this study, the maximum k value is the same as the value for the number of eigenvectors and LVs. Because the column sum minimum of Figure 1 identifies the predicted class for that particular sample, each NN value is transformed to a minimum-oriented value NNt by NNt = NNmax − NN, where NNmax denotes the maximum NN value over all rows of NN values in the current window. For comparison to fusion, kNN was also used as an independent stand-alone classifier, assuming each independent NN acts as an optimized kNN classifier. As a single classifier, ties are possible at particular values of k. For binary classification, the easiest way to avoid ties is to always use an odd number for k. With



multiclass problems, several options are possible. One is to reduce the value of k by one until the tie is broken, but this can produce inconsistency in the classification rule across all the samples. Used here for tabulation of classification quality measures when a tie occurs is adding 1 to both the true positive (TP) and false positive (FP) categories. With a window of kNN classifications, the significance of ties is diminished, assuming a majority of the kNN classifiers in the window do not have ties. As previously noted, when kNN is used as a stand-alone classifier, the distance measure and a value for the number of NN require optimization. In practice, there is no hard-and-fast rule of thumb. With the proposed fusion approach, multiple distances and values of k can be used.

Classifiers with No Tuning Parameters. Classifiers described in this section can be considered non-traditional classifiers. The goal with the classifiers in Table S-2 is to determine the degree of similarity for a target sample compared to each class mean. All of these classifiers are minimized for classification and are detailed in previous work as outlier measures in a fusion study for improving outlier detection.30 Thus, the classifiers are only listed and two are briefly described. The first three listed in Table S-2 (cosθ, the Euclidean distance, and the determinant (eqs. S7, S8, and S9, respectively))36 are common similarity measures. The next measure used as a classifier is the inner product correlation (eq. S10). It characterizes the correlation between the two outer product matrices Xi (Xi = xi xiT) and X (X = x xT).37 The next set of classifiers in

Table S-2 are now briefly described. Procrustes analysis (PA) is not normally thought of as a classifier. The goal of unconstrained PA is to estimate a transformation matrix T to map data set X2 to resemble data set X1 by solving X1 = X2T for T.38-40 The transformation matrix encompasses the degree of rotation and dilation needed to make X2 resemble X1. Translation is another PA step, accomplished by prior mean centering of the respective data sets, but this was not included in this study. Rather than determining how well X2 is transformed to resemble X1 (the usual PA criterion), the degree of difficulty (amount of work) needed to carry out the transformation is determined. Thus, even though PA may be able to carry out an accurate transformation, the more difficult the transformation, the greater the amount of rotation and dilation needed and the more likely the sample does not belong to the particular class. The degree of difficulty is assessed relative to the base self-transformation of X to X by X = XT (eq. set S11). The other type of PA is constrained PA, where rotation and dilation are separated into an orthogonally constrained rotation matrix Hi and a dilation constant ρi (eq. sets S12-S14). Similar to unconstrained PA, baseline H and ρ are obtained for self-transformation of X (eq. set S13). Rather than determining the quality of the transformation, analogous to unconstrained PA, the degree of difficulty to carry out the transformations is assessed (eq. set S14). As with unconstrained PA, the translation step was left out. Extended inverted multiplicative signal correction (EISC)41,42 has also not been used as a classifier. It is the inverted form of


the spectral pre-processing method extended multiplicative scatter correction (EMSC).43,44 The fundamental operation is similar to PA: correct a spectrum to resemble a target spectrum as a function of multiple parameters. As used for outlier detection with spectral data or other continuous data such as chromatography, the focus is on transformation of the spectral difference45,46 between xi and x (d = x − xi) and is referred to as EISC differences (EISCD). The same correction terms used previously for outlier detection are applied as classifiers (see eq. sets S15 and S16),30 but other terms can be included. Also like PA, classification is based on the difficulty of the EISCD correction for a sample relative to each class. The extent of correction needed is determined by the Euclidean norms of each correction vector b and of the corresponding amount of correction actually applied, Xcb. The smaller the values, the more likely the sample belongs to the respective class. These two classifiers are also calculated by interchanging xi with x for a total of four EISCD classifiers. Using the measures presented in this section as sole classifiers would be weak because thresholds would be difficult to determine. Hence, in the fusion approach, only raw values are used and the combination of these classifiers should strengthen the consensus of class membership. In fact, because raw values are used, any similarity measure that can distinguish sample (vector) differences could be included.

Classification by Fusion. As briefly noted previously, pictured in Figure 1 is the sum fusion rule applied to a fusion input matrix for one sample from a three class system. The fusion input matrix is formally defined as being m × c for m classifiers and c classes. Each row of raw classifier values is normalized to unit length to eliminate magnitude differences between the many different classifiers.
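This construction — row-normalize the m × c matrix of minimum-oriented raw values, then sum each column and take the smallest — can be sketched in Python (the authors used Matlab; the guard against all-zero rows is an added safeguard, not from the paper):

```python
import numpy as np

def sum_fusion(F):
    """Sum-rule fusion of an m x c fusion input matrix F (m classifiers,
    c classes, minimum-oriented raw values): normalize each row to unit
    length, sum the columns, and predict the class with the smallest sum."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # guard: leave all-zero rows unchanged
    R = F / norms                    # unit-length rows
    return int(np.argmin(R.sum(axis=0)))
```

No weights, thresholds, or per-classifier decisions are involved; an individual classifier that disagrees (like the kNN rows in Figure 1) is simply outvoted by the column sums.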
Each column is then summed and the class with the smallest sum is the predicted class membership for that sample. The sum rule has been recommended when compared to several other fusion processes including the product, minimum, maximum, median, and majority vote rules.14 Note that for the particular sample shown in Figure 1, the kNN classifier misclassifies the sample. The fusion input matrix can also be transformed such that each classifier specifically classifies the sample in question. In this case, and not used in this study, decision rules on all raw classifier values are needed to predict the sample class membership. For example, rules can be simple for the MD classifier based on minimum MD or more complex using statistically based thresholds. Using a simple classification rule of minimum values on Figure 1 produces the fusion input matrix in Figure S-1, where class membership is a 0 and 1 otherwise. As with Figure 1, kNN misclassifies the sample in Figure S-1. Using the sum rule again identifies the final class membership by the smallest sum. If class membership were a 1 and 0 otherwise, then the class with the greatest sum would identify class membership, i.e., the majority vote.

Classification Quality Measures. To assess the classification quality of sum fusion across tuning parameter windows and the four single stand-alone classifiers PLS2-DA, MD, Q-residual, and kNN at respective optimized tuning parameter values, the accuracy, sensitivity, and specificity are calculated by




accuracy = (TP + TN)/(TP + TN + FP + FN)   (1)

sensitivity = TP/(TP + FN)   (2)

specificity = TN/(TN + FP)   (3)

where TP, FP, TN, and FN are respectively the true positive, false positive, true negative, and false negative counts. These values are calculated from a leave-one-out cross-validation on each class. Briefly, a sample is removed from a class and acts as the target sample to be classified. All classifiers are then used to classify the sample. This process is repeated until each sample has been left out of each class. The number of tuning parameters used for respective classifier windows and data sets are noted in the corresponding Experimental data set descriptions.

Figure 2. ReAle beer results for the four single classifiers and for fusion of all 17 classifiers using only the UV instrument. Accuracy (red), sensitivity (blue), and specificity (green).
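Equations 1-3 translate directly into code; a small Python helper for the counts tallied in the leave-one-out loop:

```python
def quality(tp, tn, fp, fn):
    """Accuracy, sensitivity, and specificity from eqs. 1-3, given the
    true positive, true negative, false positive, and false negative
    counts tallied over the leave-one-out cross-validation."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity
```

For example, 8 true positives, 85 true negatives, 5 false positives, and 2 false negatives give an accuracy of 0.93, a sensitivity of 0.80, and a specificity of about 0.94.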

EXPERIMENTAL

Software. Matlab code written by the authors was used for all classifications.

Data Sets. Beer. The beer data set is a product authentication classification problem with 19 Birra del Borgo ReAle samples and 41 other craft beer samples.47 The 41 non-authentic samples are composed of 12 non-ReAle samples from the same manufacturer, Birra del Borgo, and 29 samples from producers in Italy and other parts of Europe. All beer samples were measured on five instruments: mid-infrared (MIR), near infrared (NIR), ultraviolet (UV), visible (VIS), and thermogravimetric analysis (TG). Plotted in Figure S-2 are the respective spectra and thermogravimetric data. The TG data have been pre-processed by the first derivative using a Savitzky-Golay third-order polynomial and a 19 point window. All data are used as shown in Figure S-2 without further pre-processing. This data set was originally analysed as a binary classification problem and it is used the same way in this study to demonstrate the utility of multi-instrument fusion in two different ways. The maximum k value across all five instruments for eigenvector, LV, and NN windows is 16. For example, the MD is computed with eigenvector 1, then eigenvectors 1-2, and so on up to eigenvectors 1-16. At the maximum window value of 16, there are 16 MD classifiers. Because the beer data are being run as a binary classification problem, the accuracy is the same for each class and the sensitivity and specificity switch in values between the two classes, i.e., sensitivity for class 1 is specificity for class 2 and vice versa. Thus, results are only shown for the ReAle beer class.

Textiles. The textile data set is a three class NIR data set of 223 samples of manufactured textiles with various compositions.48 The classification problem is to categorize a physical property that can have three discrete values. The three classes have respectively 54, 123, and 46 samples. Class 2 was randomly downsized to 54 samples to maintain class size equivalency.

Spectra were measured from 1100 to 2500 nm at 0.5 nm increments for 2800 wavelengths for each spectrum. Plotted in Figure S-3 are the raw spectra and a principal component plot for the raw spectra. In previous classification work, spectra were pre-processed by the standard normal variate (SNV).49 Classification with this data set is performed by fusion in three ways: raw spectra, SNV pre-processed spectra, and a fusion of both the raw and SNV pre-processed spectra. In all situations, the maximum k value for the eigenvector, LV, and NN windows is 44. Classification results shown are the overall results for all three classes.

Wine Cultivars. This data set stems from the chemical analysis of three wine cultivars grown in the same region of Italy and is downloadable.50 There are 13 chemical constituents (variables) for each sample (alcohol, malic acid, ash, ash alkalinity, magnesium, total phenols, flavanoids, nonflavanoid phenols, proanthocyanins, color intensity, hue, OD280/OD315 of diluted wines, and proline). Shown in Figure S-4 is a list of the 13 variables and the corresponding boxplots over the 59, 71, and 48 samples for the three classes. Also plotted in Figure S-4 is the PCA plot. The maximum k value for the eigenvector, LV, and NN windows is 12. Classification results shown are the overall results for all three classes.
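The growing eigenvector windows used for all three data sets (MD with eigenvector 1, then eigenvectors 1-2, up to the maximum k) can be sketched as follows. This is an illustrative Python reading of the procedure, not the authors' Matlab code, and the toy class data are hypothetical.

```python
import numpy as np

def md_over_windows(class_data, target, k_max):
    """Mahalanobis-type distance of a target sample to a class,
    computed in the space of the first k PCA eigenvectors for
    k = 1, ..., k_max (one distance per non-optimized classifier)."""
    mean = class_data.mean(axis=0)
    centered = class_data - mean
    # PCA via SVD: rows of vt are the eigenvectors of the covariance matrix
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s**2 / (len(class_data) - 1)   # eigenvalues (score variances)
    t = (target - mean) @ vt.T           # target scores
    distances = []
    for k in range(1, k_max + 1):
        d2 = np.sum(t[:k]**2 / var[:k])  # squared MD in the first k PCs
        distances.append(np.sqrt(d2))
    return distances  # non-decreasing with k (nonnegative terms are added)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))   # hypothetical class data: 20 samples, 8 variables
d = md_over_windows(X, X[0], 4)
print(len(d))  # 4
```

Each window size contributes one non-optimized classifier, so a maximum window of 16 (beer data) yields 16 MD classifiers per class, exactly as described above.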

RESULTS AND DISCUSSION

Results from all 17 classifiers (Tables S-1 and S-2) as optimized single classifiers are not presented. Instead, individual results are only shown for PLS2-DA, MD, Q-residual, and kNN. As previously noted, these four are isolated because PLS2-DA and kNN are commonly used in analytical chemistry and SIMCA (also a common classifier) is a combination of MD and Q-residual. Fusion results presented are over all 17 classifiers simultaneously, including the respective tuning parameter windows for the non-optimized classifiers PLS2-DA, MD, Q-residual, kNN, sinθ, and the DC (Table S-1).

Beer. Shown in Figures 2 and S-5 are the classification results using the UV and the other four instruments, respectively. As a reminder, the x-axis for each single classifier (PLS2-DA, MD, Q-residual, and kNN) is the corresponding classification tuning parameter value. For these situations, the accuracy, sensitivity, and specificity values plotted denote the classification quality at the respective optimized classifier. For example, the classification quality values plotted in Figure 2 at tuning parameter value 7 denote the quality obtained if 7 were used as the optimized tuning parameter value, i.e., single respective classifiers at 7 LVs for PLS2-DA, 7 eigenvectors for MD and Q-residual, and 7 NN for kNN. Conversely, with fusion, each tuning parameter window on the x-axis is based on all 17 non-optimized respective classifiers. Using the 7 tuning parameter window in Figure 2 for the UV instrument as an example, PLS2-DA, MD, Q-residual, kNN, sinθ, and the DC each act as 7 single classifiers for 42 classifications of each sample. Combined with the remaining 11 single classifiers in Table S-2, the total comes to 53 classifications for the sum rule.

From Figures 2 and S-5, it is observed that the quality of the classification depends on the classification method and the corresponding tuning parameter value as well as on which instrument is being used. However, fusion of all 17 classification methods shows consistency regardless of the instrument. Specifically, the single classifiers are erratic relative to the respective tuning parameter values; thus, depending on the optimized tuning parameter value, the classifier quality varies. The fusion process smooths out this irregular behavior. Another general trend observed is that accuracy, sensitivity, and specificity improve as the tuning parameter window size increases. Thus, once a large enough window is used, acceptable classification is possible, thereby removing the need to separately tune each single classifier. Overall, the MIR instrument performs best across the single classifiers. All the results may be improvable by optimizing the data pre-processing method for each instrument and classifier,28 but this was not investigated. As noted previously, multiple data pre-processing methods can be included in the fusion process, as characterized with the next data set.

When multiple instruments are used to measure the same samples, a standard data fusion practice is to augment the spectra column-wise to form a single multi-instrument array; Figure 3 illustrates the situation (low-level fusion). This combined data set was classified by the same 17 classifiers and fusion using the same tuning parameter windows used on the individual instruments. The classification results are similar to those shown in Figures 2 and S-5, with the fusion results displayed in Figure S-6.
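The sum rule over a window of non-optimized classifiers can be sketched as follows. This is a minimal illustration (not the authors' Matlab implementation), assuming each classifier supplies one row of per-class values with smaller values indicating stronger class membership, as described later for Figure 4.

```python
import numpy as np

def sum_rule_fusion(classification_values):
    """Consensus class by the sum rule. Each row of the input matrix
    holds one non-optimized classifier's per-class values for a single
    target sample (smaller value = stronger class membership). Rows are
    normalized to unit length before summing down the columns."""
    v = np.asarray(classification_values, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit-length rows
    consensus = v.sum(axis=0)                         # sum rule
    return int(np.argmin(consensus)) + 1              # 1-based class label

# Hypothetical fusion input: 5 classifiers x 2 classes; most favor class 1
values = [[0.2, 0.9],
          [0.4, 0.8],
          [0.9, 0.3],   # one classifier disagrees
          [0.1, 0.7],
          [0.3, 0.6]]
print(sum_rule_fusion(values))  # 1
```

With the 7 tuning parameter window, the matrix would hold 53 such rows (7 each from the six windowed classifiers plus the 11 single classifiers), and the consensus is simply the column with the smallest sum.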

Figure 3. Example of augmenting the spectra and TG data from the five instruments (MIR, NIR, UV, VIS, and TG) column-wise into one array X for classification by single classifiers or fusion of the 17 classifiers.

An alternative fusion process with multiple instruments is to stack the classification results for all 17 methods vertically by instrument. This arrangement is shown in Figure 4 for one particular ReAle beer sample. From Figure 4, it is observed that the NIR instrument misclassifies this sample. The corresponding fusion results for all samples are plotted in Figure 5. Compared with the results in Figure S-6 based on the conventional way to augment multiple instruments, the new instrument augmentation method by stacking is superior. This improvement is probably due to the increased number of classifiers. For example, at a window size of 7 tuning parameters for the situation in Figure 3, there are 53 non-optimized classifiers summed. When the instruments are stacked as in Figure 4, a total of 265 non-optimized classifiers are summed at window size 7. There also appears to be a parsimony advantage to fusing the instruments in the format of Figure 4 over that of Figure 3. Specifically, comparing the plots in Figures 5 and S-6, a smaller window of tuning parameters is needed for the same level of prediction quality in Figure 5. This is probably due to the greater degree of variance that must be captured by augmenting as in Figure 3 in order to represent the unique information in each instrument from the unequal intensities of the raw instrumental responses. Regardless, both instrument fusion processes smooth out the irregular behavior of the single optimized classifiers.

Figure 4. The fusion input matrix from augmenting the five instruments vertically for classifying one ReAle beer sample. Values in each row are normalized to unit length, and ranges represent the number of k values in each classifier window. Class 1 is ReAle beer and class 2 is non-ReAle beer. Each instrument block contains 107 classifiers on the y-axis.

Note that for the sample shown in Figure 4, the classification values (smaller values indicate class membership) show that the classification is instrument, method, and tuning parameter dependent. However, by combining all the information, a consensus classification does correctly classify the sample as ReAle beer.

Figure 5. Fusion results for ReAle beer samples from stacking the five instruments as in Figure 4. Accuracy (red), sensitivity (blue), and specificity (green).

The classification values from the novel approach of stacking multiple instruments can be combined with the classification values from the conventional instrument augmentation approach. Essentially, another block of multiple classifiers from the conventional instrument augmentation is used to enhance the classification consensus for each sample. The classification results shown in Figure S-6 reveal small improvements from this third approach of combining multiple instruments. Using the 7 tuning parameter window size again, the number of classifiers now totals 318 (53 + 265). No new instrumental analyses are needed to gain the improved classification results; hence, the extra block of classification values is easy to obtain and useful.
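The two multi-instrument arrangements (column-wise augmentation as in Figure 3 versus vertical stacking of classification-value blocks as in Figure 4) reduce to simple array operations. The sketch below uses hypothetical matrix sizes and a placeholder classifier block; with three instruments and 53 classifiers per instrument, stacking triples the number of classifiers summed.

```python
import numpy as np

# Hypothetical responses for one sample set on three instruments
n_samples = 10
mir = np.random.rand(n_samples, 300)
nir = np.random.rand(n_samples, 200)
uv  = np.random.rand(n_samples, 100)

# Low-level fusion (Figure 3): augment column-wise into one array X,
# then run the 17 classifiers on the combined responses.
X = np.hstack([mir, nir, uv])
print(X.shape)  # (10, 600)

def classifier_block(responses):
    """Placeholder for one instrument's block of classification values
    (53 non-optimized classifiers x 2 classes at a window size of 7)."""
    return np.random.rand(53, 2)

# High-level fusion (Figure 4): run the classifiers per instrument and
# stack the resulting blocks vertically before applying the sum rule.
fusion_input = np.vstack([classifier_block(m) for m in (mir, nir, uv)])
print(fusion_input.shape)  # (159, 2)
```

Stacking only reshapes classifier outputs that have already been computed, which is why adding the conventional augmentation block (53 + 265 = 318 classifiers in the paper's five-instrument example) requires no new instrumental analyses.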



Figure 6. Four textile single classifier results and fusion of all 17 classifiers using the raw data (top) and SNV data (bottom). Accuracy (red), sensitivity (blue), specificity (green), and unclassified samples (black).

From Figures 2, 5, S-5, and S-6 showing the quality measure trends for the fusion method, it is observed that there is little if any sacrifice to the accuracy, sensitivity, and specificity if the full rank value is used for k. Conversely, the single classifiers do not follow this observation, and hence optimization is critical for these classifiers. No data pre-processing was carried out to optimize the classification results, i.e., the data was used as provided. With this new approach to fusion, it is simple to add a block of fusion input data for each pre-processing method. Blocks for different sets of selected variables, such as wavelengths for spectral data, can also be included. When the data is continuous, such as with this spectral and TG data, moving windows of variables could be used. There would be an augmented block of input fusion data for each window size as well as blocks for each degree of variable change between windows. With the next data set, raw spectra and standard normal variate (SNV) pre-processing are studied individually and in combination.

Textiles. From the PCA plot of the raw spectra in Figure S-3, it can be concluded that the three classes overlap in the first two PCs (84.69% captured variance), indicating accurate results may be difficult to achieve by a particular single optimized classifier. Contained in Figure 6 are plots of the accuracy, sensitivity, and specificity using the raw data with single classifiers and fusion of all 17 classifiers. Analogous to the beer data, the single classifiers tend to be erratic relative to the tuning parameter value, while fusion is more consistent and the classification quality measures trend upward as the tuning parameter window size increases.
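SNV has a standard definition:49 each spectrum is centered and scaled by its own standard deviation. A minimal sketch (illustrative, not the authors' code):

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center each spectrum (row) by its own
    mean and scale by its own standard deviation."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

raw = np.array([[1.0, 2.0, 3.0],
                [10.0, 20.0, 30.0]])
corrected = snv(raw)
print(np.round(corrected, 3))  # both rows become [-1, 0, 1]
```

Because SNV is computed row-wise, the raw and SNV blocks have identical dimensions and can be appended to the same fusion input matrix without any alignment step.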
The pre-processing method SNV was used in previous work42 and is applied here. The results in Figure 6 indicate that little is gained by using SNV; all the trends are similar to those for the raw data. Nevertheless, rather than optimizing the pre-processing method, the two sets of results can be fused, as can results from other pre-processing methods. Shown in Figure 7 are the classification results using the raw and SNV values simultaneously in the fusion input matrix. As expected, because both the raw and SNV results do well, the combined fusion results are similar. Thus, little was gained in this case, but more importantly, nothing was sacrificed. Similar to multiple instruments and Figure 3, each pre-processing method could also be fused by augmenting to form more columns of data. This type of fusion was not studied.

Figure 7. Textile fusion results using both the raw and SNV fusion input data. Accuracy (red), sensitivity (blue), and specificity (green).

Wine Cultivars. The PCA plot in Figure S-4 indicates that all three cultivars have a common overlap region, with the greatest overlap between cultivars 2 and 3. Plotted in Figure 8 are the classification results showing the inconsistency of the single classifiers at each specific tuning parameter value compared to the regularity gained from fusion. Note that among the single classifiers, PLS2-DA and Q-residual at greater tuning parameter values perform best for the textile data (raw or SNV), while for the beer data, no one classifier was particularly the best. For the wine data, PLS2-DA and MD function better at higher tuning parameter values. Thus, using only one classifier for classification is not always best.
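As one concrete example of a single classifier swept over its tuning parameter window, a kNN classifier (Euclidean distance only, as in this work) evaluated at k = 1, ..., k_max can be sketched as below; the two-class toy data are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_over_window(train_X, train_y, target, k_max):
    """Predicted class for each k = 1..k_max (one non-optimized
    kNN classifier per k), using Euclidean distance."""
    d = np.linalg.norm(train_X - target, axis=1)
    order = np.argsort(d)  # training samples sorted by distance
    preds = []
    for k in range(1, k_max + 1):
        votes = Counter(train_y[i] for i in order[:k])
        preds.append(int(votes.most_common(1)[0][0]))
    return preds

# Hypothetical two-class toy data (e.g., 2 of the 13 wine variables)
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y = np.array([1, 1, 2, 2])
print(knn_over_window(X, y, np.array([0.05, 0.0]), 3))  # [1, 1, 1]
```

In the fusion framework, each of these per-k predictions becomes one row of classification values rather than a final answer, so no single k ever has to be selected.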

Figure 8. Four wine cultivar single classifier results and fusion of all 17 classifiers. Accuracy (red), sensitivity (blue), specificity (green), and unclassified samples (black).

CONCLUSIONS

Using the sum rule across collections of non-optimized classifiers allows enhanced flexibility compared to single optimized classifiers. Seventeen classification schemes were used to form consensus classifications. Other processes can easily be included in the fusion process. For example, the kNN classifier used only the Euclidean distance; several other distance measures could be included in the fusion as additional blocks of classification values over collections of NN. Similarly, there are a variety of support vector machine (SVM) strategies and a large selection of neural networks that could be added. Equally, collections of single ensemble classifiers such as random forests could be added.

Not studied for the spectral data, or other data with continuous variables, was the use of moving windows of variables (wavelengths).51 In this case, classification values from a collection of window sizes can be added to the fusion input matrix. Numerous window sizes can be used, as well as several sets relative to how a window slides across a spectrum. A similar process can be envisioned for a set of discrete measured variables. This pre-processing approach prior to the final fusion can be considered as forming blocks of mid-level classification measures. Essentially, blocks of classifiers can be stacked, where each block is a collection of respective classifier outputs across a window of tuning parameter values. There is no limit to the number of methods that can be included because the simple sum rule is used for the final classification.

Stacking blocks of different classifiers as in Figures 1 and 4 allows the option of using a form of cross-validation to assess the quality of the classification. For example, a random set of rows of the fusion input matrix can be removed and the classification made again with the remaining classifiers. This process can be repeated multiple times to form a frequency distribution of classifications.

The emphasis of this paper is not identifying the best classifier or how to determine the tuning parameter window size, but demonstrating enhanced classification through the flexibility of a new strategy to combine classifiers. Approaches are suggested for selecting a tuning parameter window. For the data sets studied, it was found that using k values up to the full rank was satisfactory, with at most a small sacrifice to the accuracy, sensitivity, and specificity. Because the classifiers do not need to be optimized, the fusion process makes it easier to automate classification.
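The row-resampling cross-validation suggested in the conclusions (removing random rows of the fusion input matrix and re-classifying) could be sketched as follows. This is a hypothetical illustration, not the authors' implementation; the keep fraction and repeat count are assumed parameters.

```python
import numpy as np

def consensus_frequency(fusion_input, n_repeats=200, keep_frac=0.7, seed=1):
    """Repeatedly classify with random subsets of the classifier rows
    to build a frequency distribution of the consensus class."""
    rng = np.random.default_rng(seed)
    v = np.asarray(fusion_input, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit-length rows
    n_rows = len(v)
    n_keep = max(1, int(keep_frac * n_rows))
    counts = {}
    for _ in range(n_repeats):
        rows = rng.choice(n_rows, size=n_keep, replace=False)
        label = int(np.argmin(v[rows].sum(axis=0))) + 1  # sum rule on subset
        counts[label] = counts.get(label, 0) + 1
    return {c: n / n_repeats for c, n in counts.items()}

# Hypothetical fusion input: 5 classifiers x 2 classes, mostly favoring class 1
values = [[0.2, 0.9], [0.4, 0.8], [0.9, 0.3], [0.1, 0.7], [0.3, 0.6]]
freq = consensus_frequency(values)
print(max(freq, key=freq.get))  # 1
```

A sample whose consensus label is stable across most resamples can be classified with more confidence than one whose label flips between subsets.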

ASSOCIATED CONTENT
Supporting Information. Tables S-1 and S-2 and Figures S-1 through S-6 as referenced in the text.

AUTHOR INFORMATION
Corresponding Author
*Email: [email protected]

ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant No. CHE-1506417 (co-funded by CDS&E), which is gratefully acknowledged by the authors. The authors are thankful to Federico Marini and Cyril Ruckebusch for the beer and textile data.

REFERENCES
(1) Eberhardt, K.; Beleites, C.; Marthandan, S.; Matthäus, C.; Diekmann, S.; Popp, J. Anal. Chem. 2017, 89, 2937-2947.
(2) Perez, J.J.; Watson, D.A.; Levis, R.J. Anal. Chem. 2016, 88, 11390-11398.
(3) Mahoney, C.M.; Kelly, R.T.; Alexander, L.; Newburn, M.; Bader, S.; Ewing, R.G.; Fahey, A.J.; Atkinson, D.A. Anal. Chem. 2016, 89, 3598-3607.
(4) Ong, T.-H.; Kissick, D.J.; Jansson, E.T.; Comi, T.J.; Romanova, E.V.; Rubakhin, S.S.; Sweedler, J.V. Anal. Chem. 2015, 87, 7036-7042.
(5) Szymańska, E.; Brodrick, E.; Williams, M.; Davies, A.N.; van Manen, H.-J.; Buydens, L.M.C. Anal. Chem. 2015, 87, 869-875.
(6) Jones, A.E.; Turner, P.; Zimmerman, C.; Goulermas, J.Y. Anal. Chem. 2014, 86, 5399-5405.
(7) Fu, C.; Khaledi, M.G. Anal. Chem. 2014, 86, 2371-2379.
(8) Fernández-Delgado, M.; Cernadas, E.; Barro, S. J. Mach. Learn. Res. 2014, 15, 3133-3181.
(9) Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: New York, NY, 2009.
(10) Brereton, R.G.; Lloyd, G.R. J. Chemom. 2014, 28, 213-225.
(11) Campos, F.M.; Correia, L.; Calado, J.M.F. J. Intell. Robotic Syst. 2015, 77, 377-390.
(12) Woźniak, M.; Graña, M.; Corchado, E. Inform. Fusion 2014, 16, 3-17.
(13) Ho, T.K.; Hull, J.J.; Srihari, S.N. IEEE Trans. Pattern Anal. Mach. Intell. 1994, 16, 66-75.
(14) Kittler, J.; Hatef, M.; Duin, R.P.W.; Matas, J. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 226-239.
(15) Ruta, D.; Gabrys, B. Comp. Inf. Syst. 2000, 7, 1-10.
(16) Ruta, D.; Gabrys, B. Inf. Fusion 2005, 6, 63-81.
(17) Xu, L.; Krzyzak, A. IEEE Trans. Syst., Man, Cybern. 1992, 22, 380-384.
(18) Džeroski, S.; Ženko, B. Mach. Learn. 2004, 54, 255-273.
(19) Klein, L.A. Sensor and Data Fusion Concepts and Applications; SPIE Optical Engineering Press: Bellingham, WA, 1999.
(20) Hibon, M.; Evgeniou, T. Int. J. Forecasting 2005, 21, 15-24.
(21) Breiman, L. Mach. Learn. 1996, 24, 123-140.
(22) Ho, T.K. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832-844.
(23) Cho, S.-B.; Kim, J.H. IEEE Trans. Syst., Man, Cybern. 1995, 25, 380-384.
(24) Hashem, S.; Schmeiser, B. IEEE Trans. Neur. Net. 1995, 6, 792-794.
(25) Borges, C.; Gómez-Carracedo, M.P.; Andrade, J.M.; Duarte, M.F.; Biscaya, J.L.; Aires-de-Sousa, J. Anal. Chim. Acta 2010, 101, 43-55.
(26) Breiman, L. Mach. Learn. 1996, 24, 49-64.



(27) Gerretzen, J.; Szymańska, E.; Jansen, J.J.; Bart, J.; van Manen, H.-J.; van den Heuvel, E.R.; Buydens, L.M.C. Anal. Chem. 2015, 87, 12096-12103.
(28) Engel, J.; Gerretzen, J.; Szymańska, E.; Jansen, J.J.; Downey, G.; Blanchet, L.; Buydens, L.M.C. Trends Anal. Chem. 2013, 50, 96-106.
(29) Xu, L.; Zhou, Y.P.; Tang, L.J.; Wu, H.L.; Jiang, J.H. Anal. Chim. Acta 2008, 616, 138-143.
(30) Brownfield, B.; Kalivas, J.H. Anal. Chem. 2017, 89, 5087-5094.
(31) Wold, S.; Sjöström, M. In Chemometrics: Theory and Application; American Chemical Society Symposium Series 52; Kowalski, B.R., Ed.; American Chemical Society: Washington, D.C., 1977; pp 243-282.
(32) Næs, T.; Isaksson, T.; Fearn, T.; Davies, T. A User Friendly Guide to Multivariate Calibration and Classification; NIR Publications: Chichester, UK, 2002.
(33) Ferré, J. In Comprehensive Chemometrics: Chemical and Biochemical Data Analysis, Vol. 3; Brown, S.D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; pp 33-89.
(34) Higgins, K.; Kalivas, J.H.; Andries, E. J. Chemom. 2012, 26, 66-75.
(35) Goodenough, D.G.; Narendra, P.M.; O'Neill, K. Can. J. Remote Sens. 1978, 4, 143-148.
(36) Sánchez, F.C.; Toft, J.; van den Bogaert, B.; Massart, D.L. Anal. Chem. 1996, 68, 79-85.
(37) Ramsay, J.O.; ten Berge, J.; Styan, G.P.H. Psychometrika 1984, 49, 403-423.
(38) Andrade, J.M.; Gómez-Carracedo, M.P.; Krzanowski, W.; Kubista, M. Chemom. Intell. Lab. Syst. 2004, 72, 123-132.
(39) Kalivas, J.H. J. Chemom. 2008, 22, 227-234.
(40) Anderson, C.E.; Kalivas, J.H. Appl. Spectrosc. 1999, 53, 1268-1276.
(41) Helland, I.S.; Næs, T.; Isaksson, T. Chemom. Intell. Lab. Syst. 1995, 29, 233-241.
(42) Pedersen, D.K.; Martens, H.; Nielsen, J.P.; Engelsen, S.B. Appl. Spectrosc. 2002, 56, 1206-1214.
(43) Geladi, P.; MacDougall, D.; Martens, H. Appl. Spectrosc. 1985, 39, 491-500.
(44) Martens, H.; Stark, E. J. Pharm. Biomed. Anal. 1991, 9, 625-635.
(45) Berns, R.S.; Petersen, K.H. Color Res. Appl. 1988, 13, 243-256.
(46) Ottaway, J.; Kalivas, J.H. Appl. Spectrosc. 2015, 69, 407-416.
(47) Biancolillo, A.; Bucci, R.; Magrì, A.L.; Magrì, A.D.; Marini, F. Anal. Chim. Acta 2014, 820, 23-31.
(48) Jacques, J.; Bouveyron, C.; Girard, S.; Devos, O.; Duponchel, L.; Ruckebusch, C. J. Chemom. 2010, 24, 719-727.
(49) Barnes, R.J.; Dhanoa, M.S.; Lister, S.J. Appl. Spectrosc. 1989, 43, 772-777.
(50) Lichman, M. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]; University of California, School of Information and Computer Science: Irvine, CA, 2013.
(51) Fasasi, A.; Mirjankar, N.; Stoian, R.-I.; White, C.; Allen, M.; Sandercock, M.P.; Lavine, B.K. Appl. Spectrosc. 2015, 69, 84-94.

FOR TOC ONLY

[TOC graphic: fusion accuracy (red), sensitivity (blue), and specificity (green) versus tuning parameter window size]