SIMCA Modeling for Overlapping Classes: Fixed or Optimized

SIMCA Modeling for Overlapping Classes: Fixed or Optimized Decision Threshold? Raffaele Vitale*†‡ ... Publication Date (Web): August 24, 2018. Cop...

Article

SIMCA modelling for overlapping classes: fixed or optimised decision threshold? Raffaele Vitale, Federico Marini, and Cyril Ruckebusch Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b01270 • Publication Date (Web): 24 Aug 2018 Downloaded from http://pubs.acs.org on August 25, 2018


is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

SIMCA modelling for overlapping classes: fixed or optimised decision threshold?

Raffaele Vitale,*,†,‡ Federico Marini,¶ and Cyril Ruckebusch‡

†Molecular Imaging and Photonics Unit, Department of Chemistry, Katholieke Universiteit Leuven, Celestijnenlaan 200F, B-3001, Leuven, Belgium
‡Laboratoire de Spectrochimie Infrarouge et Raman - UMR 8516, Université de Lille Sciences et Technologies, Bâtiment C5, 59655, Villeneuve d’Ascq, France
¶Department of Chemistry, Università degli Studi di Roma La Sapienza, Piazzale Aldo Moro 5, 00185, Roma, Italy

E-mail: [email protected]
Phone: +33 7 69 47 66 54

Abstract

An approach exploiting the principles of Receiver Operating Characteristic (ROC) curves for the simultaneous optimisation of both the complexity and the decision threshold in Soft Independent Modelling of Class Analogy (SIMCA) classification models is proposed here. The outcomes resulting from the analysis of two simulated and four real case-studies highlight that, in the presence of strong overlap among various categories of samples, the implemented method can lead to better classification efficiency in external validation, compared to fixing such a threshold a priori. This results in higher robustness towards class dispersion. On the other hand, when the different groups of observations are more clearly separated, the two strategies perform equally well on test samples.


1 Introduction

Nowadays, a large number of problems in fields like foodstuff origin authentication, quality control or process monitoring are addressed by Class Modelling (CM) statistical methods. 1–3 Techniques such as UNEQual class modelling (UNEQ 4) or Soft Independent Modelling of Class Analogy (SIMCA 5,6) have been extensively used in the last decades for such purposes. Contrary to the more popular Discriminant Analysis (DA 7), the basic principle of CM is that classification rules are derived using only samples/objects belonging to a single target category. Faults in the definition of non-target categories, which could bias the classification performance, can thus be avoided. 8

To understand how CM works, imagine a univariate case in which samples belonging to two distinct classes have to be differentiated according to the value of a particular property of interest measured for each of them. CM estimates a class delimiter (corresponding to a certain significance level) based on the distribution of these values within the target category. Afterwards, it assigns new samples to this category if the values of the same property recorded for them fall within such a threshold. Figure 1 shows an example. Assuming the target class is associated with the distribution represented by the red bars, setting the 95th percentile of this distribution as delimiter (black dashed vertical line) makes it possible to distinguish most of the target class samples from those coming from the second category (blue bars).

Figure 1: Example of univariate CM classification for two slightly overlapping categories of samples: within-class distributions (red and blue bars, respectively) of a particular property of interest measured for samples belonging to these two categories. Assuming the target class is associated with the distribution represented by the red bars, setting the 95th percentile of this distribution as classification delimiter (black dashed vertical line) would be enough to distinguish most of the target class samples from those coming from a second hypothetical category (blue bars).

In general, the significance level imposed for the estimation of the delimiter is fixed a priori, no matter the nature of the data under study and the type of approach resorted to (either univariate or multivariate), and this may generate severe issues. Think now of a scenario similar to the one outlined before, but where the target class distribution is more dispersed, as in Figure 2. Here, the two distributions overlap significantly and the 95th percentile of the target class distribution no longer represents a satisfactory classification threshold. In this case, in fact, almost half of the samples belonging to the second category would be mistakenly assigned to the target class. A more reasonable solution would be e.g. setting as delimiter the 85th percentile of the target class distribution (black dotted vertical line), which would guarantee a better compromise between True Positive (the number of objects correctly identified as belonging to the target category - TP) and True Negative (the number of objects correctly identified as not belonging to the target category - TN) rate.

Figure 2: Example of univariate CM classification for two overlapping categories of samples: within-class distributions (red and blue bars, respectively) of a particular property of interest measured for samples belonging to these two categories. Assuming the target class is associated with the distribution represented by the red bars, setting the 95th percentile of this distribution as classification delimiter (black dashed vertical line) would not be enough to distinguish most of the target class samples from those coming from a second hypothetical category (blue bars). A more efficient solution would be e.g. fixing such a threshold at the 85th percentile of the target class distribution (black dotted vertical line).

But then, how to decide which significance level to pick? In this paper, a new data-driven methodology is proposed to address this task. This approach exploits the concept of Receiver Operating Characteristic (ROC) curve. 9,10 A ROC curve (see Figure 3) is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. In practice, it displays the TP rate (also known as sensitivity) vs. the (1 − TN) rate (i.e. 1 − specificity, specificity being the TN rate) at various threshold settings. The optimal class delimiter can then be found as the value returning the best compromise between these two figures of merit (i.e. the lowest distance to the top-left corner of the ROC plot, having coordinates [0, 1]).

Figure 3: Example of Receiver Operating Characteristic (ROC) curve (black solid line and dots), plotted as sensitivity (vertical axis) vs. 1-specificity (horizontal axis). The coordinates of each displayed point depend on the values of classification sensitivity and specificity observed at a specific classification threshold setting. In an ideal case of perfect separation between classes, the ROC curve would be a perfect L-shaped curve going from coordinates [0, 0] to [0, 1] to [1, 1] and defining an area of 1 squared unit (red dashed line and circles). In case of complete overlapping between classes, the ROC curve would be a straight line going from coordinates [0, 0] to [1, 1] with unit slope and defining an area of 0.5 squared unit (red dashed line and crosses).

ROC curves are widely utilised in many application fields, ranging from psychology to medicine and meteorology. 11–13 Here, they are employed as optimisation tools in a SIMCA classification framework, even if the implemented methodology can easily be extended and applied in combination with other CM multivariate modelling strategies. The only requirement for ROC curves to be used is that measurements for samples belonging to non-target classes are also available for the sake of a proper threshold selection. Although this is actually not strictly needed in the CM context, it can be highly beneficial in all situations in which significant overlapping exists between categories. 8 This article explores the potential of this procedure as a possible way of tuning SIMCA model parameters in circumstances like this.
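The univariate reasoning of Figures 1 and 2 can be illustrated with a minimal Python sketch. It is not taken from the paper: the two Gaussian classes, their parameters, the one-sided acceptance rule (a sample is accepted when its value does not exceed the delimiter) and the helper names `percentile` and `rates` are all assumptions made for illustration. The sketch only shows that lowering the delimiter from the 95th to the 85th percentile of a dispersed target class trades some TP rate for a better TN rate.

```python
import random

def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def rates(target, other, delimiter):
    """TP and TN rates when samples with value <= delimiter are accepted."""
    tp_rate = sum(v <= delimiter for v in target) / len(target)
    tn_rate = sum(v > delimiter for v in other) / len(other)
    return tp_rate, tn_rate

rng = random.Random(0)  # fixed seed, illustration only
# Hypothetical data: a dispersed target class and a narrower second class.
target = [rng.gauss(0.0, 2.0) for _ in range(500)]
other = [rng.gauss(3.0, 1.0) for _ in range(500)]

results = {}
for q in (95, 85):
    delimiter = percentile(target, q)
    results[q] = rates(target, other, delimiter)
    print(q, round(delimiter, 2), results[q])
```

Because the delimiter is a percentile of the target class itself, the TP rate is fixed by construction, and only the TN rate depends on how far the second class extends into the target distribution.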

2 Methods

2.1 Soft Independent Modelling of Class Analogy (SIMCA)

Let $\mathbf{X}$ be an $N \times J$ dataset constituted by $Z$ submatrices $\mathbf{X}_z$ ($N_z \times J$), each one containing the measurements collected for a single class of samples (with $z \in [1, \ldots, Z]$ and $\sum_{z=1}^{Z} N_z = N$). In SIMCA, every category of objects is modelled independently from the others based on a Principal Component Analysis (PCA) model of appropriate dimensionality or complexity (say $A_z$) as:

$$\mathbf{X}_z = \mathbf{T}_z \mathbf{P}_z^{\mathrm{T}} + \mathbf{E}_z \quad \forall z \in [1, \ldots, Z] \qquad (1)$$

where $\mathbf{T}_z$ ($N_z \times A_z$), $\mathbf{P}_z$ ($J \times A_z$) and $\mathbf{E}_z$ ($N_z \times J$) denote the scores, loadings and residuals matrices resulting from the decomposition of $\mathbf{X}_z$, respectively. Once the single class subspaces have been defined as in Equation 1, the degree of outlyingness of new unlabelled samples with respect to all of them can be estimated according to a combined index. 14 For a generic observation $\mathbf{x}_{new}^{\mathrm{T}}$ ($1 \times J$), this combined index is calculated as:

$$d_{new,z} = \sqrt{\left(\frac{T^2_{new,z}}{T^2_{lim,z}}\right)^2 + \left(\frac{Q_{new,z}}{Q_{lim,z}}\right)^2} \qquad (2)$$

$T^2_{new,z}$ being a statistic reflecting the (Mahalanobis) distance between the origin of the z-th model hyperplane and the projection of $\mathbf{x}_{new}^{\mathrm{T}}$ onto it, $Q_{new,z}$ a statistic reflecting the perpendicular (orthogonal) distance between $\mathbf{x}_{new}^{\mathrm{T}}$ and the z-th model hyperplane, $T^2_{lim,z}$ an empirical threshold for the $T^2_z$-statistic (usually corresponding to a significance level of 95%) and $Q_{lim,z}$ an empirical threshold for the $Q_z$-statistic (usually corresponding to a significance level of 95%). Yue and Qin 14 have reported how a decision threshold for the aforementioned combined index (squared) can be obtained for a generic significance level, say $\alpha$, as:

$$d^2_{lim,z} = g_z \chi^2_{h_z,\alpha} \qquad (3)$$

where the coefficient $g_z$ and the degrees of freedom of the $\chi^2$ distribution, $h_z$, are retrieved according to the Box theory. 15 However, in a more immediate and commonly utilised SIMCA framework, 16–18 $\mathbf{x}_{new}^{\mathrm{T}}$ is considered an outlier for the z-th class model and, thus, rejected by it if $d_{new,z}$ is found to be larger than $\sqrt{2}$. On the other hand, if $d_{new,z} \leq \sqrt{2}$, the investigated sample is recognised as part of the z-th category (footnote i). Whether such a threshold is always suitable no matter the nature of the data under study will be explored in the next sections.
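As a small illustration of Equation 2 and of the fixed acceptance rule, the following Python sketch computes the combined index from given values of the two statistics and their limits. The function names and the numerical values are hypothetical and only serve the example; the PCA projection that produces the T² and Q values is assumed to have been carried out elsewhere.

```python
import math

def combined_index(t2, q, t2_lim, q_lim):
    """Combined index d (Equation 2) from the T2 and Q statistics and their limits."""
    return math.sqrt((t2 / t2_lim) ** 2 + (q / q_lim) ** 2)

def is_accepted(t2, q, t2_lim, q_lim, threshold=math.sqrt(2)):
    """Accept the sample into the class if d does not exceed the threshold."""
    return combined_index(t2, q, t2_lim, q_lim) <= threshold

# Hypothetical limits and statistics, for illustration only.
t2_lim, q_lim = 9.5, 0.8
print(combined_index(4.0, 0.3, t2_lim, q_lim))  # sample well inside the class model
print(is_accepted(4.0, 0.3, t2_lim, q_lim))
print(is_accepted(12.0, 1.1, t2_lim, q_lim))    # both statistics above their limits
```

A sample lying exactly on both limits has a combined index of √2, which is where the fixed threshold comes from; exposing `threshold` as a parameter is what makes it possible to replace this fixed value with an optimised one.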

2.2 ROC curve-based SIMCA model optimisation

The algorithm proposed here allows both the complexity (number of principal components) and the significance level of a SIMCA model to be simultaneously tuned through the construction of cross-validated ROC curves. It encompasses the following six steps:

1. the subset of samples belonging to the target class is subjected to a preliminary round of cross-validation. Specifically, a certain number of observations is removed from the whole training set, a PCA model of a particular dimensionality is calibrated using the remaining ones, and the $T^2$- and $Q$-statistics are computed for the left-out objects after their projection onto the PCA subspace. This is needed to have a measure of the sensitivity of the current classification;

2. training samples belonging to the non-target categories are also projected onto the target class model subspace and their $T^2$ and $Q$ values calculated accordingly. This is needed to have a measure of the specificity of the current classification;

3. the combined index is estimated for both groups of target and non-target observations. At this point and for a given significance level $\alpha$, $T^2_{lim,z}$ and $Q_{lim,z}$ are set as:

$$T^2_{lim,z} = \frac{A_z (N_z^2 - 1)}{N_z (N_z - A_z)} F_{A_z, N_z - A_z, \alpha} \qquad (4)$$

and

$$Q_{lim,z} = \frac{v}{2m} \chi^2_{2m^2/v, \alpha} \qquad (5)$$

$F_{A_z, N_z - A_z, \alpha}$ being the critical value of an $F$ distribution with $A_z$ and $N_z - A_z$ degrees of freedom at the significance level $\alpha$, $v$ the sample variance of the $Q$-statistic as observed in the training set, $m$ the sample mean of the $Q$-statistic as observed in the training set and $\chi^2_{2m^2/v, \alpha}$ the critical value of a $\chi^2$ distribution with $2m^2/v$ degrees of freedom at the significance level $\alpha$. 19,20 Alternatively, if $N_z$ is sufficiently high, $T^2_{lim,z}$ and $Q_{lim,z}$ are equated to the $(1 - \alpha) \times 100$th percentile of the sample distribution of their reference statistics as observed in the training set;

4. based on these combined index values, a ROC curve is plotted by varying the corresponding threshold within a certain range (say from 0 to the maximum observed for the projected samples). In practice, at the various threshold settings, classification sensitivity and specificity are estimated as:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN} \times 100 \qquad (6)$$

$$\mathrm{specificity} = \frac{TN}{TN + FP} \times 100 \qquad (7)$$

where FN and FP stand for False Negatives (the number of objects mistakenly identified as not belonging to the target category) and False Positives (the number of objects mistakenly identified as belonging to the target category), respectively. Sensitivity and specificity can then be represented in a graph like the one in Figure 3. All the previous steps are iterated for different numbers of principal components;

5. the complexity of the class model is determined as the number of principal components maximising the Area Under the ROC curve (AUROC), which is a measure of the general classification quality;

6. once the class model dimensionality has been optimised, the final combined index threshold is selected as the one returning the lowest Euclidean distance to the top-left corner of the ROC plot.

If multiple categories of samples are concerned, the methodology is applied separately to every one of them.

i Therefore, samples can potentially be assigned to none or to multiple modelled classes.
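Steps 4 to 6 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the cross-validated combined index values of target and non-target training samples are assumed to be already available for each candidate number of components, the function names (`roc_points`, `auroc`, `best_threshold`, `tune`) are hypothetical, sensitivity and specificity are kept as fractions rather than percentages, and the area is computed with the trapezoidal rule.

```python
import math

def roc_points(d_target, d_other):
    """(threshold, sensitivity, specificity) for thresholds from 0 to the largest observed d."""
    thresholds = [0.0] + sorted(set(d_target) | set(d_other))
    points = []
    for thr in thresholds:
        tp = sum(d <= thr for d in d_target)  # target samples accepted
        tn = sum(d > thr for d in d_other)    # non-target samples rejected
        points.append((thr, tp / len(d_target), tn / len(d_other)))
    return points

def auroc(points):
    """Area under the ROC curve (x = 1 - specificity, y = sensitivity), trapezoidal rule."""
    xy = sorted((1.0 - spec, sens) for _, sens, spec in points)
    return sum((x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(xy, xy[1:]))

def best_threshold(points):
    """Threshold whose ROC point lies closest to the top-left corner [0, 1]."""
    return min(points, key=lambda p: math.hypot(1.0 - p[2], 1.0 - p[1]))[0]

def tune(d_by_complexity):
    """Pick the number of components with the largest AUROC, then its best threshold."""
    curves = {a: roc_points(t, o) for a, (t, o) in d_by_complexity.items()}
    best_a = max(curves, key=lambda a: auroc(curves[a]))
    return best_a, best_threshold(curves[best_a])

# Hypothetical combined index values: {components: (target samples, non-target samples)}.
d_by_complexity = {
    1: ([0.5, 0.9, 1.2, 1.6], [1.0, 1.4, 2.0, 2.5]),
    2: ([0.4, 0.6, 0.8, 1.1], [1.3, 1.9, 2.2, 3.0]),
}
best_a, threshold = tune(d_by_complexity)
print(best_a, threshold)
```

In this toy example the two-component model separates the groups completely, so it is the one retained, and the selected threshold is the largest combined index observed among the target samples.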

3 Datasets

3.1 Simulated data #1

Two-class data were simulated through the following computational procedure:

1. for each category, two orthogonal score vectors were generated at random following a normal distribution with equal mean and standard deviation. The mean of these vectors was kept identical for the two classes but with opposite sign to guarantee their separation;

2. the two groups of scores were concatenated row-wise and multiplied by a two-dimensional block of loadings, extracted by performing PCA on an experimental spectral dataset; 21

3. heteroscedastic noise was finally added to the resulting matrix to mimic a more realistic scenario.

Class overlapping was induced by gradually increasing the standard deviation of the score vectors for the second category, and, thus, its dispersion. Specifically, this standard deviation was varied at 9 levels (1, 1.5, 2, 2.5, 3, 4, 6, 9, 12 times the standard deviation of the score vectors associated to the first class - see Figure S1 and S2 for a graphical illustration of the simulated score vectors and of the synthetic spectral data resulting from them). For the sake of a more comprehensive comparison, sample size was also varied at 5 levels (25, 50, 100, 250, 500 objects per class were simulated). 300 generation replicates were run for each dispersion/sample size combination.
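The simulation procedure above can be sketched roughly as follows. Several elements are assumptions made only for illustration: the loadings are invented five-variable vectors (in the paper they are extracted from an experimental spectral dataset), the score vectors are drawn independently rather than being made exactly orthogonal, and the noise model (standard deviation proportional to the absolute signal) is just one possible form of heteroscedastic noise.

```python
import random

def simulate_class(n, mean, sd, loadings, rng, noise_level=0.01):
    """Simulate n samples: 2 random scores times 2 loadings, plus signal-dependent noise."""
    data = []
    for _ in range(n):
        t1, t2 = rng.gauss(mean, sd), rng.gauss(mean, sd)
        row = [t1 * p1 + t2 * p2 for p1, p2 in zip(*loadings)]
        # noise standard deviation grows with the absolute signal (heteroscedastic)
        data.append([x + rng.gauss(0.0, noise_level * abs(x) + 1e-9) for x in row])
    return data

rng = random.Random(1)  # fixed seed, illustration only
# Hypothetical loadings over 5 variables (not the ones used in the paper).
loadings = ([0.1, 0.4, 0.7, 0.4, 0.1], [0.6, 0.2, -0.1, -0.4, -0.6])
class1 = simulate_class(50, mean=5.0, sd=1.0, loadings=loadings, rng=rng)
class2 = simulate_class(50, mean=-5.0, sd=3.0, loadings=loadings, rng=rng)  # 3x the dispersion
X = class1 + class2  # row-wise concatenation of the two classes
print(len(X), len(X[0]))
```

Increasing the `sd` argument of the second class while keeping its mean fixed is what progressively induces the overlap between the two categories.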

3.2 Simulated data #2

Additional two-class datasets were simulated according to the generic scheme represented in Figure S3 and following a procedure analogous to the previous case. Here, the first category was designed to comprise two separate subclasses of observations, whose dispersion was varied simultaneously at 6 levels only along PC #1 (the standard deviation of the corresponding PC #1 score vector was set at 4, 6, 9, 12, 15 and 20, respectively). On the other hand, class #2 dispersion was varied at 5 different levels, again only along PC #1 (the standard deviation of the corresponding PC #1 score vector was set at 5, 10, 15, 20 and 25, respectively). Sample size was fixed at 500 samples per category and 300 simulation replicates were run for each combination of class #1/class #2 dispersion.

3.3 Real data

Except for the NIR spectra, all the real datasets are available for download at the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php 22).

3.3.1 NIR data 23,24

483 NIR spectra of 3 groups of manufactured textiles with different physical properties were recorded by a XDS Rapid Content Analyzer (FOSS, Hilleroed, Denmark) in reflectance mode from around 1350 to 2600 nm. Prior to the analysis, Standard Normal Variate (SNV 25) was applied to all the registered profiles.

3.3.2 Cell data 26

52 light intensity and shape descriptors (rectangularity, convexity, etc.) were extracted from microscopy images of 5 morphological classes of cells: deformed (693 instances), long (564 instances), normal (893 instances), round (498 instances), and small (387 instances). The data were originally collected to monitor the onset of structural alterations of the cellular membrane under the effect of an antibiotic.

3.3.3 Glass data 27

9 attributes (refractive index, sodium, magnesium, aluminium, silicon, potassium, calcium, barium and iron content) were quantified in window (163) and non-window (51) glass samples. The original classification study conducted on these data was motivated by criminological investigation: at the crime scene, in fact, glass traces can be used as evidence only if correctly identified.

3.3.4 Parkinson’s disease data 28

22 biomedical voice measurements (average, maximum and minimum voice frequency, amplitude, etc.) were recorded for two groups of patients, healthy (48 instances) and affected by Parkinson’s disease (147 instances).

4 Results

The proposed ROC curve-based SIMCA model optimisation approach is here compared to a more standard procedure for tuning SIMCA model parameters. The latter consists in following the procedure outlined in Section 2.2 (steps 1-3) but fixing a priori the significance level for the $T^2$- and $Q$-statistics at 95% (footnote ii). Subsequently, the decision threshold for the combined index values is set at $\sqrt{2}$, and the number of principal components to extract is determined in cross-validation by maximising the classification efficiency. This is defined as the geometric mean of the classification sensitivity and specificity observed for each category of samples under study.

For the sake of a fair and more comprehensive study, each dataset was split into a training (2/3 of the total number of samples) and a test (1/3 of the total number of samples) set by applying the Duplex algorithm 29 class-wise. The performance of the two methodologies was assessed in terms of classification sensitivity, specificity and/or efficiency in external validation. The results will give a clear view of the robustness of the SIMCA models in situations characterised by different degrees of class overlapping.

ii The significance limits for these two statistics are estimated based on the internally cross-validated values of $T^2$ and $Q$ associated to the training samples.
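The efficiency figure of merit defined above is simple to compute. The short Python sketch below (the function name is hypothetical) applies it to the average test-set sensitivity and specificity listed in Table 2 for the deformed class under standard optimisation; the result is about 54.1%, close to the tabulated efficiency, although an average of per-replicate efficiencies need not coincide exactly with the geometric mean of averaged sensitivity and specificity.

```python
import math

def efficiency(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (same units as the inputs)."""
    return math.sqrt(sensitivity * specificity)

# Average test-set values listed in Table 2 for the deformed class (standard optimisation).
eff = efficiency(90.53, 32.36)
print(round(eff, 2))
```

Because it is a geometric mean, the efficiency is dragged down by whichever of the two rates is poorer, which is why a high sensitivity cannot compensate for a very low specificity.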

4.1 Simulated data #1

Figures 4 and 5 display the outcomes achieved by the two SIMCA model optimisation methodologies under study for the two categories of observations of the first simulated dataset at the different levels of sample size and class #2 dispersion. The plots represent the average classification efficiency obtained in external validation for every data generation setting (red and black dots), with an associated error bar estimated over the 300 simulation replicates performed. The upper and lower limits of these error bars correspond to the 90th and 10th percentiles of the distribution of the efficiency values yielded by these replicates, respectively. The results show that for class #1, for which dispersion is kept constant and sample size is the only parameter that varies, the two compared approaches performed equally well, always returning average classification efficiencies of around 90-95%, no matter the number of observations concerned. As could be expected, the width of the error bars decreased as this number increased. On the other hand, for class #2, an evident difference between the two techniques is observable. In fact, when the standard deviation of the simulated score vectors for this category was at least 2.5-3 times the standard deviation of the score vectors associated to the first class (dataset ID 4-5 to 9), the ROC curve-based optimisation guaranteed higher average classification efficiency and less variable performance than the standard SIMCA parameter selection strategy. Therefore, in cases like this, the proposed procedure might allow a SIMCA classifier to be more robust towards potential increases in the dispersion of the target class and, thus, less prone to misclassify samples when the degree of overlapping between categories is relatively large.


Figure 4: Simulated data #1 - Class #1 - Average classification efficiency obtained in external validation by the standard and ROC curve-based SIMCA model optimisation procedures (red and black dots) while varying sample size (a - 25 samples; b - 50 samples; c - 100 samples; d - 250 samples; e - 500 samples) and class #2 dispersion (dataset ID). The represented error bars are estimated over the 300 simulation replicates performed. Their upper and lower limits correspond to the 90th and 10th percentiles of the distribution of the efficiency values yielded by these replicates, respectively.


Figure 5: Simulated data #1 - Class #2 - Average classification efficiency obtained in external validation by the standard and ROC curve-based SIMCA model optimisation procedures (red and black dots) while varying sample size (a - 25 samples; b - 50 samples; c - 100 samples; d - 250 samples; e - 500 samples) and class #2 dispersion (dataset ID). The represented error bars are estimated over the 300 simulation replicates performed. Their upper and lower limits correspond to the 90th and 10th percentiles of the distribution of the efficiency values yielded by these replicates, respectively.

4.2 Simulated data #2

As in Section 4.1, Figures 6 and 7 display the outcomes achieved by the two SIMCA model optimisation methodologies under study for the two categories of observations of the second simulated dataset at the different levels of class #1 and class #2 dispersion. The results largely corroborate the previous conclusions. Concerning the first category of samples, for low class #2 dispersion values, as class #1 dispersion increases, the ROC curve-based optimisation guaranteed a higher classification efficiency, while the two compared approaches exhibited similar performance when class #2 was more dispersed. On the other hand, regarding category #2, the ROC curve-based SIMCA model optimisation was found to systematically return more efficient classifications (which resulted from better compromises between classification sensitivity and specificity - not shown) for large values of class #2 dispersion, no matter the dispersion of class #1. As class #2 dispersion decreases, this difference becomes less pronounced until the performance of the two strategies becomes virtually indistinguishable.

4.3 Real data

The outcomes resulting from the analysis of the 4 different real datasets were validated by Latin partition bootstrapping. 30 In all the scenarios, i) 20 sample partitions were generated, ii) one was iteratively left out, iii) the remaining 19 were used for SIMCA model training, and iv) classification sensitivity, specificity and efficiency in external validation were estimated exploiting the test set initially assembled by applying the Duplex algorithm. One hundred bootstrapping replicates were carried out for every dataset and paired t-tests 31 were performed to detect statistically significant differences between methods in terms of classification efficiency.
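The partitioning scheme described above can be sketched as follows. This is only a rough illustration under assumptions: partitions are built here as random, class-stratified splits of the training indices (the exact Latin partition procedure of the cited reference may differ in its details), the class labels and sizes are hypothetical, and the function name `latin_partitions` is introduced for the example. Model training and evaluation are left as a comment.

```python
import random
from collections import defaultdict

def latin_partitions(labels, n_partitions, rng):
    """Split sample indices into partitions that preserve class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    partitions = [[] for _ in range(n_partitions)]
    offset = 0
    for members in by_class.values():
        rng.shuffle(members)
        # deal the shuffled class members out in turn, continuing where the last class stopped
        for k, idx in enumerate(members):
            partitions[(offset + k) % n_partitions].append(idx)
        offset += len(members)
    return partitions

labels = ["A"] * 100 + ["B"] * 40  # hypothetical training-set class labels
parts = latin_partitions(labels, 20, random.Random(0))
for left_out in range(len(parts)):
    train = [i for p, part in enumerate(parts) if p != left_out for i in part]
    # a SIMCA model would be trained on `train` and evaluated on the fixed test set here
print(len(parts), len(train))
```

Leaving out one of the 20 partitions at a time perturbs the training set only slightly, so the spread of the resulting test-set figures of merit reflects the stability of the tuned model rather than a change of test samples.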


Figure 6: Simulated data #2 - Class #1 - Average classification efficiency obtained in external validation by the standard and ROC curve-based SIMCA model optimisation procedures (red and black dots) while varying class #1 and class #2 dispersion (class #2 score vector standard deviation: a - 5; b - 10; c - 15; d - 20; e - 25). The represented error bars are estimated over the 300 simulation replicates performed. Their upper and lower limits correspond to the 90th and 10th percentiles of the distribution of the efficiency values yielded by these replicates, respectively.


Figure 7: Simulated data #2 - Class #2 - Average classification efficiency obtained in external validation by the standard and ROC curve-based SIMCA model optimisation procedures (red and black dots) while varying class #1 and class #2 dispersion (class #2 score vector standard deviation: a - 5; b - 10; c - 15; d - 20; e - 25). The represented error bars are estimated over the 300 simulation replicates performed. Their upper and lower limits correspond to the 90th and 10th percentiles of the distribution of the efficiency values yielded by these replicates, respectively.

4.3.1 NIR data

The NIR data feature three different classes of observations that, as also shown in Figure S4, do not present a high degree of overlapping. As expected in a situation like this, both the optimisation methodologies under study return similar outcomes for all of them (see Table 1). Only slight differences in the classification efficiency indices, due to distinct values of the decision thresholds, can be spotted. This is in good agreement with what was outlined in the previous section.

Table 1: NIR data - Average and 95% confidence intervals of classification sensitivity, specificity and efficiency in external validation yielded by the two SIMCA model optimisation methodologies under study for each considered class of observations. Bold characters highlight statistically significant differences between the two compared approaches in terms of classification efficiency. Statistical significance was assessed as detailed in Section 4.3

Sensitivity (test set)
Optimisation method             Class #1         Class #2         Class #3
Standard optimisation           95.18 ± 0.31%    86.39 ± 0.42%    85.08 ± 0.57%
ROC curve-based optimisation    93.25 ± 0.18%    85.61 ± 0.19%    85.92 ± 0.60%

Specificity (test set)
Optimisation method             Class #1         Class #2         Class #3
Standard optimisation           87.62 ± 0.19%    92.93 ± 0.25%    95.85 ± 0.12%
ROC curve-based optimisation    90.01 ± 0.22%    93.42 ± 0.23%    95.67 ± 0.15%

Efficiency (test set)
Optimisation method             Class #1         Class #2         Class #3
Standard optimisation           91.32 ± 0.19%    89.58 ± 0.17%    90.28 ± 0.29%
ROC curve-based optimisation    91.61 ± 0.09%    89.42 ± 0.12%    90.63 ± 0.27%

4.3.2 Cell data

As illustrated by Figure S5, the cell dataset is an example of 5-class classification in which 4 categories of samples (long, normal, round and small) are rather well defined and separated, while the fifth (deformed) exhibits a more or less pronounced overlap with all the others. For this reason, the classification efficiency yielded by the two compared optimisation approaches in external validation is comparable when long, normal, round and small cells are concerned (see Table 2). Conversely, an evident improvement in the performance of the SIMCA model for the deformed cells can be observed when the ROC curve-based methodology is applied for the selection of its dimensionality and for the estimation of the decision threshold. In this case, class models guaranteeing a classification efficiency of around 70% for the test set are obtained, against approximately 54% resulting from the standard strategy described in Section 4.3.

Table 2: Cell data - Average and 95% confidence intervals of classification sensitivity, specificity and efficiency in external validation yielded by the two SIMCA model optimisation methodologies under study for each considered class of observations. Bold characters highlight statistically significant differences between the two compared approaches in terms of classification efficiency. Statistical significance was assessed as detailed in Section 4.3.

| Index (test set) | Class | Standard optimisation | ROC curve-based optimisation |
|---|---|---|---|
| Sensitivity | Round | 95.49 ± 0.04% | 93.87 ± 0.11% |
| Sensitivity | Deformed | 90.53 ± 0.13% | 70.42 ± 0.35% |
| Sensitivity | Long | 89.01 ± 0.09% | 90.62 ± 0.09% |
| Sensitivity | Normal | 86.66 ± 0.13% | 84.22 ± 0.14% |
| Sensitivity | Small | 94.72 ± 0.12% | 92.82 ± 0.21% |
| Specificity | Round | 92.65 ± 0.11% | 94.46 ± 0.10% |
| Specificity | Deformed | 32.36 ± 0.20% | 72.93 ± 0.48% |
| Specificity | Long | 88.78 ± 0.16% | 87.87 ± 0.12% |
| Specificity | Normal | 86.98 ± 0.11% | 88.73 ± 0.14% |
| Specificity | Small | 92.14 ± 0.14% | 93.40 ± 0.20% |
| Efficiency | Round | 94.06 ± 0.06% | 94.16 ± 0.07% |
| Efficiency | Deformed | 54.12 ± 0.15% | 71.62 ± 0.08% |
| Efficiency | Long | 88.89 ± 0.06% | 89.23 ± 0.04% |
| Efficiency | Normal | 86.82 ± 0.04% | 86.44 ± 0.02% |
| Efficiency | Small | 93.42 ± 0.06% | 93.10 ± 0.08% |

4.3.3 Glass data

The glass data include measurements for two categories of observations whose distributions (at least in the subspace of their first two principal components - see Figure S6) closely resemble some of those resulting from the simulation procedure detailed in Section 3.1 (see e.g. Figure S1f). As for the corresponding synthetic dataset, the same conclusions can be drawn here (see Table 3). Both the standard and ROC curve-based SIMCA model optimisation techniques led to analogous classification efficiency values for the less dispersed group of samples (window glass), but the second outperformed the first when modelling the more dispersed class (other glass). Specifically, average classification efficiencies of 42.04% and 60.24% were achieved for this class when the model parameters were tuned by the standard and the ROC curve-based approach, respectively.

Table 3: Glass data - Average and 95% confidence intervals of classification sensitivity, specificity and efficiency in external validation yielded by the two SIMCA model optimisation methodologies under study for each considered class of observations. Bold characters highlight statistically significant differences between the two compared approaches in terms of classification efficiency. Statistical significance was assessed as detailed in Section 4.3.

| Index (test set) | Class | Standard optimisation | ROC curve-based optimisation |
|---|---|---|---|
| Sensitivity | Window glass | 89.02 ± 0.28% | 87.45 ± 0.46% |
| Sensitivity | Other glass | 72.76 ± 1.78% | 53.35 ± 1.87% |
| Specificity | Window glass | 90.71 ± 0.52% | 92.76 ± 0.62% |
| Specificity | Other glass | 26.71 ± 2.38% | 72.76 ± 3.61% |
| Efficiency | Window glass | 89.83 ± 0.24% | 90.02 ± 0.30% |
| Efficiency | Other glass | 42.04 ± 1.64% | 60.24 ± 1.04% |

4.3.4 Parkinson’s disease data

The Parkinson’s disease dataset concerns two classes of patients exhibiting a high degree of overlap (see Figure S7). Here, the ROC curve-based optimisation methodology yielded SIMCA models showing a significantly higher classification efficiency in external validation for the Parkinson’s disease category than the standard approach exploiting an a priori classification threshold (see Table 4).

Table 4: Parkinson’s disease data - Average and 95% confidence intervals of classification sensitivity, specificity and efficiency in external validation yielded by the two SIMCA model optimisation methodologies under study for each considered class of observations. Bold characters highlight statistically significant differences between the two compared approaches in terms of classification efficiency. Statistical significance was assessed as detailed in Section 4.3.

| Index (test set) | Class | Standard optimisation | ROC curve-based optimisation |
|---|---|---|---|
| Sensitivity | Healthy | 69.19 ± 1.21% | 86.62 ± 2.12% |
| Sensitivity | Parkinson’s disease | 84.65 ± 0.65% | 60.88 ± 1.73% |
| Specificity | Healthy | 85.06 ± 0.53% | 68.80 ± 2.21% |
| Specificity | Parkinson’s disease | 41.50 ± 0.83% | 62.25 ± 1.54% |
| Efficiency | Healthy | 76.54 ± 0.58% | 76.11 ± 0.65% |
| Efficiency | Parkinson’s disease | 59.12 ± 0.54% | 60.85 ± 0.53% |
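The way sensitivity and specificity trade off into a single efficiency figure can be checked numerically. The sketch below assumes that classification efficiency is the geometric mean of sensitivity and specificity; this definition is an assumption of the example (it is not restated in this section), although it is consistent with the values reported in Table 4. The helper name `efficiency` is hypothetical.

```python
import math

# (sensitivity, specificity, reported efficiency) in %, as listed in Table 4.
table4 = {
    ("Healthy", "standard"): (69.19, 85.06, 76.54),
    ("Healthy", "ROC-based"): (86.62, 68.80, 76.11),
    ("Parkinson's disease", "standard"): (84.65, 41.50, 59.12),
    ("Parkinson's disease", "ROC-based"): (60.88, 62.25, 60.85),
}


def efficiency(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    return math.sqrt(sensitivity * specificity)


# Compare the value recomputed from the averages with the reported average.
differences = {}
for (cls, method), (sens, spec, reported) in table4.items():
    computed = efficiency(sens, spec)
    differences[(cls, method)] = computed - reported
    print(f"{cls:20s} {method:10s} computed {computed:6.2f}  "
          f"reported {reported:6.2f}")
```

Exact agreement should not be expected: the tabulated figures are averages, and the average of geometric means computed run by run cannot exceed the geometric mean of the averaged sensitivity and specificity (Cauchy–Schwarz inequality). The recomputed values are therefore expected to be slightly higher than the reported ones.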

5 SIMCA model complexity and significance level: sequential or simultaneous optimisation?

So far, the advantages of properly adjusting the significance level for the combined index in SIMCA modelling, rather than fixing it a priori, have been shown for situations in which categories of samples overlap. To the best of the authors’ knowledge, no extensive study highlighting these advantages has been conducted before. Nevertheless, the possibility of such an adjustment has already been explored elsewhere [32–34]. In these works, however, the optimisation of the dimensionality of SIMCA models and of their significance level has always been conceived in a sequential way, i.e. the significance level has been tuned after having assessed the number of principal components to extract (based on a specific user-defined criterion). Which is better, then: a simultaneous or a sequential optimisation?

In order to investigate this aspect, an additional dataset was exploited, containing 241 NIR spectra of pistachio nuts of two different geographical origins (121 from Iran and 120 from USA) [35]. In this case, the proposed ROC curve-based algorithm was compared (in the same way as for all the other real case-studies - see Section 4.3) to a double cross-validation approach by which, in two consecutive steps, first the optimal complexity of the SIMCA model and then the significance level of the SIMCA model with the optimal dimensionality were estimated by maximising the classification efficiency. A further 41 NIR spectra of pistachios from Bronte (Italy) were also utilised to evaluate the robustness of the two methodologies towards a completely new set of observations not taken into account for tuning the classifier. All spectral profiles were SNV-corrected prior to the analysis.

Table 5: Pistachio data - Average and 95% confidence intervals of classification sensitivity, specificity and efficiency in external validation yielded by the two SIMCA model optimisation methodologies under study for each considered class of observations (Iran and USA). Classification specificity values for the samples belonging to the third external class (Bronte) and provided by both the class models are also reported (see the last two rows). Bold characters highlight statistically significant differences between the two compared approaches in terms of classification efficiency and specificity for the Bronte category. Statistical significance was assessed as detailed in Section 4.3.

| Index | Class model | Standard optimisation | ROC curve-based optimisation |
|---|---|---|---|
| Sensitivity (test set) | Iran | 52.68 ± 1.39% | 62.56 ± 1.44% |
| Sensitivity (test set) | USA | 80.12 ± 0.89% | 90.40 ± 0.41% |
| Specificity (test set) | Iran | 82.70 ± 0.87% | 77.20 ± 1.27% |
| Specificity (test set) | USA | 89.41 ± 0.62% | 83.37 ± 1.00% |
| Efficiency (test set) | Iran | 65.64 ± 0.61% | 69.05 ± 0.51% |
| Efficiency (test set) | USA | 84.54 ± 0.35% | 86.71 ± 0.43% |
| Specificity (samples from Bronte - Italy) | Iran | 100.00 ± 0.00% | 100.00 ± 0.00% |
| Specificity (samples from Bronte - Italy) | USA | 100.00 ± 0.00% | 100.00 ± 0.00% |

The outcomes reported in Table 5 corroborate the conclusions drawn in all the previous sections: the simultaneous SIMCA model optimisation based on the principles of the ROC curves yields a better compromise between classification sensitivity and specificity in external validation for the two categories of samples directly involved in the computational procedure. Regarding the third external class (Bronte), both the considered techniques allowed the two calibrated class models to reject all the extraneous objects.
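The difference between the two search strategies can be illustrated on a toy example. In the sketch below, the grid of cross-validated efficiency values is entirely hypothetical, and the sequential strategy is assumed to select the number of principal components at a default significance level of 0.05 before tuning the significance level; this default is an assumption of the example, not a detail taken from the compared procedure.

```python
import numpy as np

# Hypothetical cross-validated efficiency values (%).
# Rows: number of principal components; columns: significance level.
n_components = np.array([1, 2, 3, 4])
alphas = np.array([0.01, 0.05, 0.10, 0.20])
eff = np.array([
    [60.0, 66.0, 68.0, 65.0],
    [62.0, 70.0, 71.0, 69.0],
    [64.0, 68.0, 75.0, 78.0],
    [63.0, 67.0, 72.0, 74.0],
])

# Simultaneous optimisation: best cell over the whole grid.
i_sim, j_sim = np.unravel_index(np.argmax(eff), eff.shape)

# Sequential optimisation: complexity first (at the assumed default
# significance level), then significance level at that complexity.
j_default = int(np.where(alphas == 0.05)[0][0])
i_seq = int(np.argmax(eff[:, j_default]))
j_seq = int(np.argmax(eff[i_seq, :]))

print(f"simultaneous: {n_components[i_sim]} PCs, alpha = {alphas[j_sim]}, "
      f"efficiency = {eff[i_sim, j_sim]}")
print(f"sequential:   {n_components[i_seq]} PCs, alpha = {alphas[j_seq]}, "
      f"efficiency = {eff[i_seq, j_seq]}")
```

On the criterion being optimised, a full grid search can never do worse than a sequential one, since the sequential path only visits one column and one row of the same grid. Whether this advantage carries over to external validation depends on the data, which is what the comparison in Table 5 addresses.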

6 Conclusions

A novel procedure exploiting the principles of ROC curves for the simultaneous optimisation of both the complexity and the combined index decision threshold in SIMCA classification models was proposed. Two interesting points arose from the analysis of the different handled case-studies:

• in the presence of strong overlap amongst classes due to a higher dispersion of the target one, the implemented approach was found to lead to better classification efficiency in external validation compared to the more standard procedure based on a fixed significance level. Therefore, it can be said that adequately tuning the significance level through a proper adjustment of the decision threshold guarantees a classification that is more robust towards the dispersion of the target category (i.e. a better compromise between classification sensitivity and specificity may be achieved - see e.g. Tables 1-5);

• in cases of clearer and more definite separation among classes, the two aforementioned methodologies enabled a similar and equally satisfactory classification of test samples not utilised in the model calibration stage.

The only caveat to the use of this tuning strategy is that both target and non-target category objects need to be available. However, although in the CM framework only samples from the target class are necessary for building the category model, it is often beneficial, if not even recommended, to have individuals from non-target classes in order to evaluate the specificity of such a model, especially when the experimental data may contain non-relevant and non-meaningful information [36].
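The caveat about non-target objects can be made concrete with a short sketch. It is not the algorithm proposed in this work: the distance values are simulated placeholders, the fixed threshold is taken as the empirical 95th percentile of the target-class values (a stand-in for a threshold derived from a theoretical distribution), and the tuned threshold is chosen by maximising the geometric mean of sensitivity and specificity, which is an assumed criterion. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder distance-to-model values: target objects tend to have small
# values, non-target objects larger ones, with some overlap.
d_target = rng.chisquare(df=3, size=200)
d_nontarget = rng.chisquare(df=3, size=200) + 4.0


def rates(threshold):
    """Sensitivity, specificity and their geometric mean at a threshold."""
    sens = float(np.mean(d_target <= threshold))    # accepted targets
    spec = float(np.mean(d_nontarget > threshold))  # rejected non-targets
    return sens, spec, float(np.sqrt(sens * spec))


# Fixed threshold: needs target objects only.
t_fixed = float(np.percentile(d_target, 95))
sens_fixed, spec_fixed, eff_fixed = rates(t_fixed)

# Tuned threshold: needs non-target objects as well, because specificity
# must be evaluated at every candidate value.
candidates = np.unique(np.concatenate([d_target, d_nontarget]))
effs = np.array([rates(t)[2] for t in candidates])
t_opt = float(candidates[np.argmax(effs)])
sens_opt, spec_opt, eff_opt = rates(t_opt)

print(f"fixed: threshold {t_fixed:.2f}, sens {sens_fixed:.3f}, "
      f"spec {spec_fixed:.3f}, eff {eff_fixed:.3f}")
print(f"tuned: threshold {t_opt:.2f}, sens {sens_opt:.3f}, "
      f"spec {spec_opt:.3f}, eff {eff_opt:.3f}")
```

Both thresholds are evaluated here on the same objects used to set them, so the comparison is optimistic by construction; an independent test set, as used throughout this study, is needed for a fair assessment.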

7 Supporting information

Additional figures provided in SM.pdf. The file includes additional representations of the simulated data and the scores plots resulting from the PCA decomposition of the 4 real datasets.

References

(1) Moya, M.; Koch, M.; Hostetler, L. One-class classifier networks for target recognition applications. Proceedings of the World Congress on Neural Networks. Portland, USA, 1993; pp 797–801.
(2) Tax, D.; Duin, R. Outlier detection using classifier instability; Springer-Verlag GmbH: Berlin, Germany, 1998.
(3) Albano, C.; Dunn III, W.; Edlund, U.; Johansson, E.; Nordén, B.; Sjöström, M.; Wold, S. Four levels of pattern recognition. Anal. Chim. Acta 1978, 103, 429–443.
(4) Derde, M.; Massart, D. UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta 1986, 184, 33–51.
(5) Wold, S. Pattern recognition by means of disjoint principal components models. Pattern Recogn. 1976, 8, 127–139.
(6) Wold, S.; Sjöström, M. SIMCA: a method for analyzing chemical data in terms of similarity and analogy. Chemometrics: Theory and Application First Edition. 1977; pp 243–282.
(7) Fisher, R. The use of multiple measurements in taxonomic problems. Ann. Eugenic. 1936, 7, 179–188.
(8) Rodionova, O.; Oliveri, P.; Pomerantsev, A. Rigorous and compliant approaches to one-class classification. Chemometr. Intell. Lab. 2016, 159, 89–96.
(9) Swets, J. Measuring the accuracy of diagnostic systems. Science 1988, 240, 1285–1293.
(10) Brown, C.; Davis, H. Receiver operating characteristics curves and related decision measures: a tutorial. Chemometr. Intell. Lab. 2006, 80, 24–38.
(11) Mossman, D.; Somoza, E. ROC curves, test accuracy, and the description of diagnostic tests. J. Neuropsychiatry Clin. Neurosci. 1991, 3, 330–333.
(12) Centor, R. Signal detectability: the use of ROC curves and their analyses. Med. Decis. Making 1991, 11, 102–106.
(13) Murphy, A. The Finley affair: a signal event in the history of forecast verification. Weather Forecast. 1996, 11, 3–20.
(14) Yue, H.; Qin, S. Reconstruction-based fault identification using a combined index. Ind. Eng. Chem. Res. 2001, 40, 4403–4414.
(15) Box, G. Some theorems on quadratic forms applied in the study of analysis of variance problems: effect of inequality of variance in one-way classification. Ann. Math. Stat. 1954, 25, 290–302.
(16) Durante, C.; Bro, R.; Cocchi, M. A classification tool for N-way array based on SIMCA methodology. Chemometr. Intell. Lab. 2011, 106, 73–85.
(17) Marini, F. Classification methods in chemometrics. Curr. Anal. Chem. 2010, 6, 72–79.
(18) de la Guardia, M.; Gonzálvez, A. In Comprehensive Analytical Chemistry, 1st ed.; D., B., Ed.; Elsevier B.V., Oxford, United Kingdom, 2013; Vol. 60.
(19) Qin, S. Statistical process monitoring: basics and beyond. J. Chemometr. 2003, 17, 480–502.
(20) Nomikos, P.; MacGregor, J. Multivariate SPC charts for monitoring batch processes. Technometrics 1995, 37, 41–59.
(21) Jaumot, J.; Gargallo, R.; de Juan, A.; Tauler, R. A graphical user-friendly interface for MCR-ALS: a new tool for multivariate curve resolution in MATLAB. Chemometr. Intell. Lab. 2005, 76, 101–110.
(22) Lichman, M. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, School of Information and Computer Sciences, University of California, Irvine, USA. 2013.
(23) Devos, O.; Ruckebusch, C.; Durand, A.; Duponchel, L.; Huvenne, J. Support Vector Machines (SVM) in near infrared (NIR) spectroscopy: focus on parameters optimization and model interpretation. Chemometr. Intell. Lab. 2009, 96, 27–33.
(24) Jacques, J.; Bouveyron, C.; Girard, S.; Devos, O.; Duponchel, L.; Ruckebusch, C. Gaussian mixture models for the classification of high-dimensional vibrational spectroscopy data. J. Chemometr. 2010, 24, 719–727.
(25) Barnes, R.; Dhanoa, M.; Lister, S. Standard Normal Variate transformation and detrending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 1989, 43, 772–777.
(26) Vitale, R.; Camacho, R.; Zahir, T.; Fauvart, M.; Michiels, J.; Ruckebusch, C.; Hofkens, J. Sequential multivariate classification for high throughput sorting of bacterial cells, XV Scandinavian Symposium on Chemometrics (SSC15), Naantali, Finland. 19-22/06/2017.
(27) Evett, I.; Spiehler, E. Rule induction in forensic science, Central Research Establishment, Home Office Forensic Science Service; 1987.
(28) Little, M.; McSharry, P.; Roberts, S.; Costello, D.; Moroz, I. Exploiting nonlinear recurrence and fractal scaling properties for voice disorder detection. Biomed. Eng. Online 2007, 6, 23.
(29) Snee, R. Validation of regression models: methods and examples. Technometrics 1977, 19, 415–428.
(30) de Boves Harrington, P. Statistical validation of classification and calibration models using bootstrapped Latin partitions. Trends Anal. Chem. 2006, 25, 1112–1124.
(31) Student. The probable error of a mean. Biometrika 1908, 6, 1–25.
(32) Pirro, V.; Oliveri, P.; Sciutteri, B.; Salvo, R.; Salomone, A.; Lanteri, S.; Vincenti, M. Multivariate strategies for screening evaluation of harmful drinking. Bioanalysis 2013, 5, 687–699.
(33) Rodionova, O.; Balyklova, K.; Titova, A.; Pomerantsev, A. Quantitative risk assessment in classification of drugs with identical API content. J. Pharmaceut. Biomed. 2014, 98, 186–192.
(34) Oliveri, P. Class-modelling in food analytical chemistry: development, sampling, optimisation and validation issues - A tutorial. Anal. Chim. Acta 2017, 982, 9–19.
(35) Vitale, R.; Bevilacqua, M.; Bucci, R.; Magrì, A.; Magrì, A.; Marini, F. A rapid and noninvasive method for authenticating the origin of pistachio samples by NIR spectroscopy and chemometrics. Chemometr. Intell. Lab. 2013, 121, 90–99.
(36) Forina, M.; Oliveri, P.; Lanteri, S.; Casale, M. Class-modeling techniques, classic and new, for old and new problems. Chemometr. Intell. Lab. 2008, 93, 132–148.
