

Domain-Invariant Partial Least Squares Regression
Ramin Nikzad-Langerodi, Werner Zellinger, Edwin Lughofer, and Susanne Saminger-Platz
Anal. Chem., Just Accepted Manuscript. DOI: 10.1021/acs.analchem.8b00498. Publication Date (Web): May 3, 2018.




Domain-Invariant Partial Least Squares Regression

Ramin Nikzad-Langerodi, Werner Zellinger, Edwin Lughofer, and Susanne Saminger-Platz

Department of Knowledge-Based Mathematical Systems, Johannes Kepler University Linz, Austria

E-mail: [email protected]
Phone: +43 7236 3343 430

Abstract

Multivariate calibration models often fail to extrapolate beyond the calibration samples due to changes associated with the instrumental response, environmental conditions or sample matrix. Most of the current methods used to adapt a source calibration model to a target domain apply exclusively to calibration transfer between similar analytical devices, while generic methods for calibration model adaptation are largely missing. To fill this gap, we here introduce domain-invariant partial least squares (di-PLS) regression, which extends ordinary PLS by a domain regularizer in order to align source and target distributions in the latent variable space. We show that a domain-invariant weight vector can be derived in closed form, which allows integration of (partially) labeled data from the source and target domains as well as entirely unlabeled data from the latter. We test our approach on a simulated data set where the aim is to desensitize a source calibration model to an unknown interferent in the target domain (i.e. unsupervised model adaptation). In addition, we demonstrate unsupervised, semi-supervised and supervised model adaptation by di-PLS on two real-world near-infrared (NIR) spectroscopic data sets.

Introduction

The method of partial least squares (PLS) has become a key tool for modeling multivariate chemical data and is currently used in a wide array of applications ranging from quality control, process analytics and drug development to biomarker discovery.1-5 PLS solves the general problem of least squares regression involving fat data matrices (i.e. when the number of variables exceeds the number of samples) and mutually correlated variables, while at the same time offering a high degree of interpretability.6,7 However, a well-known problem with statistical models is that predictions on new data are reliable only if calibration and test data come from the same data-generating process, which is rarely the case in real-world applications.8-10 Differences in environmental conditions during measurements, matrix effects, changing instrumental responses over time (drifts), differences between instruments and different sample handling protocols in different laboratories potentially lead to wrong predictions on new data. Therefore, multivariate calibration models usually need to be adapted in order to maintain reliability and predictive accuracy upon such changes.
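The mechanics of ordinary PLS on "fat" data can be illustrated with a minimal sketch: a PLS1 fit via the classical NIPALS algorithm on synthetic data with far more wavelengths than samples, where ordinary least squares would be ill-posed. The function, variable names and synthetic data below are our own illustration under stated assumptions, not code from the paper.

```python
import numpy as np

def pls1_nipals(X, y, n_components):
    """Minimal PLS1 via NIPALS: extract weight vectors that maximize the
    covariance between the scores t = X w and y, deflating after each step."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)            # unit-norm weight vector
        t = Xk @ w                        # scores
        p = Xk.T @ t / (t @ t)            # X-loadings
        c = (yk @ t) / (t @ t)            # y-loading
        Xk = Xk - np.outer(t, p)          # deflate X
        yk = yk - c * t                   # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # regression coefficients
    return b, x_mean, y_mean

# "Fat" data: 30 samples, 200 wavelengths, analyte signal in 10 bands
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))
beta = np.zeros(200); beta[:10] = 1.0
y = X @ beta + 0.1 * rng.normal(size=30)

b, xm, ym = pls1_nipals(X, y, n_components=3)
y_hat = (X - xm) @ b + ym
print(float(np.corrcoef(y, y_hat)[0, 1]))   # in-sample correlation, close to 1
```

With only three latent variables, the fit recovers the concentration signal despite the 200 correlated predictor columns, which is exactly the regime where plain least squares breaks down.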

State of the Art

Calibration model adaptation (sometimes referred to as model update, maintenance, calibration transfer or instrument standardization) is a well-studied subject in chemometrics.11-16 A straightforward and widely used approach to model adaptation is to collect additional (reference) measurements in the secondary condition (i.e. the target domain) and to augment the calibration data in order to account for variation not captured by the source domain model. Desensitizing a calibration model with respect to known interferents in the target domain has been reported previously, and Kalivas recently introduced a generalized model


adaptation framework based on Tikhonov regularization, allowing incorporation of additional (reference) measurements from the target domain along with interferent information.17-19 However, it is not always possible to obtain interferent information in practice, and leveraging additional (reference) measurements in the target domain involves careful design of experiments, validation and successful combination of source and target domain data sets.

Calibration transfer (CT), also referred to as instrument standardization, can be seen as a special case of model adaptation where the goal is to use a source model established on a primary device to make predictions on samples acquired on a secondary device. In order to correct for differences in the instruments' response, a typically small number of samples (i.e. transfer standards) needs to be measured on both instruments. For a concise overview of CT we refer to a recent review by Workman.16 The same CT techniques apply when the instrumental response has changed over time, after instrument maintenance, or if the environmental conditions (e.g. temperature, humidity) have changed between calibration and application of the model.11 However, in situations where differences between calibration and test data are related to changes in sample composition rather than caused by instrumental or environmental changes, adaptation via physical standards is not possible. To this end, different approaches for data-driven generation of virtual standards representing samples with similar properties in the source and target domain have been developed.20,21 Du et al. recently introduced a technique called spectral space transformation (SST), which can be applied without calibration standards. SST corrects target domain spectra by projection onto the eigenvectors of the target domain covariance matrix (i.e. the loadings), followed by reconstruction using the projections (i.e. the scores) from the target domain and the loadings from the source domain.22 Conceptually, SST is equivalent to subspace alignment, introduced by Fernando et al. in the context of image classification, with the difference that in the latter the source domain subspace is aligned to match the target domain subspace.23 Both approaches belong to a broader class of techniques known as asymmetric feature transformation in the machine learning community.9 However, asymmetric feature transformation basically forces the source

ACS Paragon Plus Environment

Analytical Chemistry 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

domain to match the target domain (or vice versa), which can be sub-optimal when one domain is not well represented by the features of the other. In contrast, symmetric feature transformation techniques aim at "bridging" source and target domains through intermediate, domain-invariant representations.24-28 In chemometrics such representations are referred to as common components and have been successfully employed for data integration.29 In machine learning, domain-invariant representations are frequently used for domain adaptation. Most of these techniques minimize the maximum mean discrepancy (MMD), a widely used, non-parametric measure of similarity between empirical data distributions in a reproducing kernel Hilbert space.30 However, despite its solid theoretical foundation in statistics and large success in domain adaptation applications, non-linear transformation through kernels is mostly inappropriate for chemical, Beer-Lambert-type data (i.e. data where the relationship between instrument response and target property is linear).31 On the other hand, kernel methods generally suffer from poor interpretability, a key factor in chemometrics and analytical chemistry for understanding the relationship between a measurement (e.g. a metabolic fingerprint) and a target property (e.g. a disease). More recently, several kernel-free methods have been proposed for distribution alignment in the context of domain adaptation32,33 and calibration transfer.31 However, these methods are either unsupervised (i.e. incorporation of reference measurements for latent space construction is not possible) or lack a simple closed-form solution and thus require complicated numerical optimization.
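The MMD mentioned above can be made concrete with a short sketch: the (biased) squared MMD between two empirical samples under a Gaussian RBF kernel. The synthetic data and the bandwidth choice are illustrative assumptions of ours, not values from the paper.

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy between
    samples X and Y under the RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    def gram(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

rng = np.random.default_rng(0)
source = rng.normal(size=(200, 2))
target_same = rng.normal(size=(200, 2))            # same distribution
target_shifted = rng.normal(size=(200, 2)) + 2.0   # mean-shifted distribution

# MMD is near zero for matching distributions and grows with the shift
print(mmd2_rbf(source, target_same) < mmd2_rbf(source, target_shifted))
```

Domain adaptation methods built on this quantity learn a representation in which the discrepancy between the projected source and target samples is driven toward zero while the predictive task is preserved.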

Our Approach

In the present work we introduce a novel method for model adaptation via domain-invariant features, which we term domain-invariant partial least squares (di-PLS) regression. Our extension introduces a domain regularizer into the PLS objective in order to trade off alignment of source and target domain data against predictive ability w.r.t. the property of interest while constructing the latent variable (LV) space. Notably, our method is easy to implement and flexible in the sense that it can cope with (partially) labeled data in the source


and target domain as well as entirely unlabeled data in the latter.

Theory

Notation

We denote scalars using italic symbols (e.g. x or V). Lowercase and uppercase boldface-italic symbols denote vectors (e.g. t) and matrices (e.g. X), respectively. Unless otherwise stated, vectors are column vectors, and the superscripts T and −1 indicate the transpose and inverse, respectively, of a vector or matrix. Vertical concatenation (stacking) is indicated using semicolon notation (e.g. X = [X1; X2]); horizontal concatenation is indicated using comma notation (e.g. T = [t1, ..., tA]). The Frobenius norm of an I × J matrix X of I samples measured at J wavelengths is $\|X\|_F = \sqrt{\sum_{i=1}^{I} \sum_{j=1}^{J} |x_{ij}|^2}$. | · | denotes the absolute value of its argument and det the determinant. By var(X) and cov(X) we express the variance and covariance, respectively, of a random variable X. Following standard notation in chemometrics, we use X, y, T, P, W, E and A to denote the instrumental response matrix, the corresponding analyte concentrations, the scores, loadings, weights, the residual matrix and the number of latent variables, respectively. The matrices XS and XT denote the instrumental response of samples from the source and target domain, respectively, and I denotes the identity matrix. We further follow common definitions from the domain adaptation literature according to Pan et al.:8 A domain consists of a marginal probability distribution P(X) over an input space X. A task is defined as a label space Y together with a predictive model that maps from the input to the label space, expressed as P(Y|X) in probabilistic terms. Since we deal with regression-type problems, Y ∈ R. We refer to unsupervised model adaptation when source domain samples are fully labeled and target domain samples are unlabeled (referred to as transductive transfer learning by some authors8). In contrast, we refer to semi-supervised model adaptation when only a few samples from the target domain and all samples from the source domain are labeled.
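As a quick numerical sanity check of the notation (our own illustration, not part of the paper), the Frobenius norm definition and the two concatenation conventions behave as follows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 4))   # I = 5 samples, J = 4 wavelengths

# Frobenius norm from its definition vs. NumPy's built-in
fro = np.sqrt((np.abs(X) ** 2).sum())
print(np.isclose(fro, np.linalg.norm(X, 'fro')))   # True

# Semicolon notation [X1; X2]: vertical stacking of sample blocks
X1, X2 = X[:3], X[3:]
print(np.vstack([X1, X2]).shape)                   # (5, 4)

# Comma notation [t1, ..., tA]: score vectors side by side as columns
t1, t2 = rng.normal(size=5), rng.normal(size=5)
print(np.column_stack([t1, t2]).shape)             # (5, 2)
```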


Domain-Invariant Partial Least Squares (di-PLS) Regression

For the reader's convenience, we start this section by briefly reviewing the method of ordinary partial least squares (PLS) regression and then introduce our domain-invariant extension. PLS is a multivariate calibration technique concerned with the prediction of analyte concentrations from multiple instrumental responses (i.e. spectra). Since direct regression of concentrations y on the corresponding spectra X of the calibration samples is often hampered by high correlations between individual variables and situations where I