
Joint and Unique Multiblock Analysis for integration and calibration transfer of NIR instruments
Tomas Skotare, David Nilsson, Shaojun Xiong, Paul Geladi, and Johan Trygg
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b05188 • Publication Date (Web): 13 Feb 2019
Downloaded from http://pubs.acs.org on February 16, 2019


Analytical Chemistry is published by the American Chemical Society, 1155 Sixteenth Street N.W., Washington, DC 20036. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Analytical Chemistry

Joint and Unique Multiblock Analysis for integration and calibration transfer of NIR instruments

Tomas Skotare,† David Nilsson,† Shaojun Xiong,‡ Paul Geladi,‡ and Johan Trygg∗,†,¶

†Computational Life Science Cluster (CLiC), Department of Chemistry, Umeå University, 901 81 Umeå, Sweden
‡Department of Forest Biomaterials and Technology, Swedish University of Agricultural Sciences, 90183, Umeå, Sweden
¶Corporate research, Sartorius Stedim Biotech GmbH, Umeå, Sweden

E-mail: [email protected]


ACS Paragon Plus Environment


Abstract

In the present paper we introduce an end-to-end workflow named joint and unique multiblock analysis (JUMBA), which allows multiple sources of data to be analyzed simultaneously to better understand how they complement each other. In near infrared (NIR) spectroscopy, calibration models between NIR spectra and responses are used to replace wet chemistry methods, but the models tend to be instrument-specific. Calibration transfer techniques are used to standardize NIR instrumentation, enabling the use of one model on several instruments. The current paper investigates both similarities and differences between a variety of NIR instruments using JUMBA. We demonstrate JUMBA on both a previously unpublished dataset, in which five NIR instruments measured mushroom substrate, and a publicly available dataset measured on corn samples. We found that NIR spectra from different instruments largely shared the same underlying structures, an insight we took advantage of to perform calibration transfer. The proposed JUMBA transfer displayed excellent calibration transfer performance across the two analyzed datasets and outperformed existing methods in terms of both prediction accuracy and stability. When applied to a multi-instrument environment, JUMBA transfer can integrate all instruments in the same model and will ensure a higher consistency between them compared to existing calibration transfer methods.

Introduction

Near infrared (NIR) spectroscopy is widely used to quickly and non-destructively characterize and determine the chemical content of various, mostly organic, materials. NIR spectroscopy has been used successfully in a wide range of applications, for example to determine water [1–3], carbohydrate [4] and protein content [5]. NIR instrumentation within R&D can roughly be categorized into three types: benchtop, portable and imaging, where imaging instruments are used specifically when spatial information is important. The availability of different types of instruments provides further opportunities for a sample to be measured on more than a single type of instrument. However, instrumental differences or changes to the instrumentation can pose a problem when applying a multivariate calibration model and lead to unacceptable model errors [6,7]. This would force a full model recalibration, which carries a high cost; a calibration transfer of the model between instruments is a more cost-effective alternative. Common calibration transfer techniques include direct standardization (DS), piecewise direct standardization (PDS) [8–10], spectral space transformation (SST) [11] and transfer by orthogonal projection (TOP) [12]. Different calibration transfer methods have been compared in previous works [8,13–15].

General unsupervised analysis of multiple types of NIR spectra can be performed using principal component analysis (PCA) [16]. Supervised techniques such as partial least squares (PLS) [17] and orthogonal partial least squares (O-PLS) [18] can be used to predict responses from NIR spectra to replace wet chemistry reference methods [19]. For a single response variable, O-PLS partitions spectral data into uncorrelated and predictive variation, which can be analyzed separately. While O-PLS is relevant for the investigation of a single data matrix against a single response variable, O2-PLS [20] is suited to integrating two different data matrices or blocks. Essentially, the O2-PLS algorithm finds the variation shared between two blocks (joint) while also separating the variation found in only one block (unique). For more than two blocks, OnPLS [21,22] was later developed as a generic multiblock method and has been successfully applied in a variety of applications and case studies [23–26]. As with O2-PLS, OnPLS separates joint and unique variation for each analyzed block. However, as more than two blocks are integrated simultaneously, the shared variation will be either locally joint (shared between at least two blocks but not all) or globally joint (shared between all blocks). The integration provided by multiblock modeling combines several data sources and gives a better understanding of the data as a whole.

In the present paper we introduce joint and unique multiblock analysis (JUMBA) for integration and calibration transfer of NIR instrumentation. The JUMBA workflow involves data treatment, model building and evaluation, as well as analysis using both conventional and specialized visualizations [24]. We investigate whether JUMBA can be used to reveal similarities as well as differences in NIR spectra measured on multiple instruments on the same samples. As the joint structures comprise the spectral variation shared between the instruments, using only them for modeling purposes would be equivalent to performing instrument standardization. We will demonstrate JUMBA on two different datasets:

I A previously unpublished dataset comprising five different NIR instruments used for measuring 68 samples of mushroom substrate.
II The corn dataset provided by Eigenvector Research and originally supplied by Mike Blackburn at Cargill (http://www.eigenvector.com/data/Corn/).

Experimental section

Mushroom Substrate dataset

This dataset was measured on 68 mushroom substrate samples, of which 45 had corresponding reference response values. The responses included sugars (glucose, xylose, mannose, arabinose, galactose and rhamnose), total lignin content (Klason lignin plus acid soluble lignin), ash content and extractives. The studied samples originate from cultivation research at the Swedish University of Agricultural Sciences (SLU) and contain a large amount of spent mushroom substrate (SMS). SMS is produced globally by the mushroom industry, where cultivation degrades the initial substrate, and is often considered a waste product. Potential uses of SMS have been the focus of several previous studies [27–29]. NIR characterization of SMS allows for rapid and non-destructive determination of its contents. While a previous study measured wet samples [29], the SMS samples we measured were dried at 45 °C for 96 hours and ground into powder (≤ 0.5 mm).

Mushroom substrate treatment

The mushroom substrate dataset includes samples from two years. Samples from 2015 (n = 17) and 2016 (n = 51) had identifiers starting with LE and N respectively. A set of controls was included in the samples. They included non-heat treated uncultivated substrate (LEC0, N0) as well as samples that were heat treated, inoculated and sampled on the first day of growth (N1, N11, N21, N31, N41). Additional control samples were LE16-18, which were subjected to heat treatment but not inoculation, and sampled on day 150. The other samples include a spread of materials subjected to different heat treatments and sampled on different days. The purpose of the heat treatment was to sterilize or pasteurize the raw substrate, i.e., to remove or deactivate competitive microbes in the substrate and to secure the mushroom growth. Heat from saturated steam at 121 °C and from hot air ranging from 75 °C to 100 °C was used as the different heat treatments. The substrates were packed in polypropylene bags in the form of synthetic logs during the heat treatment and cultivation, with 8 replicates (bags) for each treatment. Samples were taken periodically and at different stages of mushroom cultivation. One bag from each treatment was sampled at each stage: after mycelium colonization (day 54), mycelia ripening (day 110-120), and first harvest (day 130). On day 146-150, after the harvest, the remaining bags were collected and the remaining substrate biomass was defined as SMS. The SMS from 2015 was collected after three harvests of fruit bodies and the SMS from 2016 after one harvest. All samples were dried in an oven at 45 °C for 96 hours to determine moisture content and mass, and then milled to ≤ 0.5 mm and stored in sealed plastic bags before analysis.

Determination of the contents of carbohydrates (glucose, xylose, mannose, arabinose, galactose, rhamnose), total lignin (Klason and acid soluble lignin), extractives (water and ethanol extraction) and ash was performed by Celignis (http://www.celignis.com/) following the methods described by Hayes [30].


Table 1: Overview of instruments used for the Mushroom dataset, including the used spectral range

Instrument                     Type       Used spectral range
Foss NIR Systems 6500          Benchtop   458-2498 nm
Perten DA 7250                 Benchtop   950-1650 nm
Specim Spectral Camera SWIR    Imaging    999-2548 nm
VIAVI MicroNIR                 Portable   908-1676 nm
Tec5 HandySpec Field           Portable   435-2050 nm

Mushroom substrate measurement

We measured the samples on five different instruments (Table 1; see supporting information for further instrumental information). We let the instruments warm up for at least 45 minutes before measuring, and sample bags were well mixed before the sample material was put into the respective measurement containers. We measured white and dark reference spectra prior to each sample measurement but not between replicates. For three of the instruments (Perten DA 7250, Specim Spectral Camera SWIR, Foss NIR Systems 6500) we used white and dark reference samples provided with the instruments. We sampled the white and dark references manually for the other two instruments (VIAVI MicroNIR, Tec5 HandySpec Field). The white reference used for these two instruments was Spectralon with > 98% reflectivity. For the dark reference of the Tec5 HandySpec Field instrument we used Spectralon with 2% reflectivity, and for the VIAVI MicroNIR we measured the dark reference by pointing the probe below a table with minimal exterior light. Three replicate measurements were performed on all instruments except the Specim Spectral Camera SWIR, which used a different method described below.

Data processing and pre-treatment

The raw reflectance spectra from the five instruments were transformed into units of pseudo-absorbance using the reference spectra. For the Specim Spectral Camera SWIR we extracted the center 145x145 spectra from each image and averaged them to produce a single mean spectrum. A principal component analysis was performed on each of the other four instruments separately to find outliers between replicates and noisy wavelengths. For each instrument, noisy wavelengths were removed and replicate spectra for each observation were averaged to a mean spectrum, resulting in five blocks of data with 68 observations. The order of the observations was identical between the blocks, but the number of variables differed between instruments. We labelled the resulting dataset the full mushroom substrate dataset (FMD), with wavelengths specified in Table 1.

To enable direct comparison between the instruments we created an additional dataset with matching wavelengths for the purpose of calibration transfer, labelled the matching mushroom substrate dataset (MMD). We used wavelengths from 1102 to 1645 nm to avoid the transition between the two sensors of the Foss instrument at 1100 nm. To ensure that all instruments had matching wavelengths we used linear interpolation, resulting in 544 wavelengths for each block. For both datasets we applied the standard normal variate transform (SNV) [31] as data pre-treatment prior to modeling. As the Foss instrument uses two separate sensors, we applied the SNV pre-treatment separately for each sensor for the full mushroom substrate dataset. All blocks were additionally column-centered.
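The conversion to pseudo-absorbance and the interpolation onto matching wavelengths can be sketched as follows. This is a minimal illustration and not the authors' MATLAB script: the formula -log10((raw - dark) / (white - dark)) is the common convention and is assumed here, the function names are hypothetical, and the spectra are synthetic. Only the 1102 to 1645 nm range and the 544 wavelengths are taken from the text.

```python
import numpy as np

def pseudo_absorbance(raw, white, dark):
    """Convert raw intensities to pseudo-absorbance, log10(1/R).

    Assumes the usual reference correction R = (raw - dark) / (white - dark).
    """
    reflectance = (raw - dark) / (white - dark)
    return -np.log10(reflectance)

def resample_to_grid(wavelengths, spectra, grid):
    """Linearly interpolate each spectrum (one per row) onto a shared grid."""
    return np.vstack([np.interp(grid, wavelengths, row) for row in spectra])

# Hypothetical instrument sampled every 5 nm, with two synthetic samples.
wl = np.arange(1000.0, 1701.0, 5.0)
raw = np.vstack([0.4 + 0.0001 * (wl - 1000.0),
                 0.5 - 0.0001 * (wl - 1000.0)])
white = np.full_like(wl, 0.98)   # synthetic white reference
dark = np.full_like(wl, 0.02)    # synthetic dark reference

absorbance = pseudo_absorbance(raw, white, dark)
grid = np.linspace(1102.0, 1645.0, 544)   # range and count stated for the MMD
matched = resample_to_grid(wl, absorbance, grid)
print(matched.shape)
```

With 544 points between 1102 and 1645 nm the shared grid has a spacing of 1 nm, so every block ends up with the same variables regardless of the native sampling of each instrument.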

Corn dataset

The corn dataset is a reference dataset for calibration transfer originally provided by Mike Blackburn at Cargill. The data can be obtained from Eigenvector Research (http://www.eigenvector.com/data/Corn/; accessed on 20 June 2018). The dataset contains reference values for protein, starch, moisture and oil, and was measured on 80 different corn samples on three NIR instruments, labelled M5, MP5 and MP6. All instruments covered the same wavelengths, both in range and number (1100-2498 nm, 700 wavelengths), so no further modifications to the wavelengths were required for this dataset. SNV and column-centering were applied to the spectra before analysis.
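The two pre-treatment steps applied to both datasets, SNV followed by column-centering, can be sketched as below. This is an illustrative Python version with hypothetical function names and synthetic data; the use of the sample standard deviation (ddof=1) and the 1100 nm split in `snv_by_segment` are assumptions based on common practice and on the sensor transition mentioned for the Foss instrument.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row)."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def snv_by_segment(spectra, wavelengths, split_nm=1100.0):
    """Apply SNV separately below and above a sensor transition."""
    below = wavelengths < split_nm
    out = np.empty_like(spectra, dtype=float)
    out[:, below] = snv(spectra[:, below])
    out[:, ~below] = snv(spectra[:, ~below])
    return out

def column_center(block):
    """Subtract the mean of every variable (column); return the means too."""
    col_mean = block.mean(axis=0)
    return block - col_mean, col_mean

# Synthetic block with the corn dimensions: 80 samples, 700 wavelengths.
rng = np.random.default_rng(0)
spectra = rng.normal(size=(80, 700)) + 3.0
treated = snv(spectra)
centered, col_mean = column_center(treated)
print(centered.shape)
```

Returning the column means makes it possible to apply the same centering to new observations later, for example to a validation set.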


Data integration and multivariate regression methods

JUMBA: Joint and Unique Multiblock Analysis

Joint and unique multiblock analysis (JUMBA) is a structured analysis workflow, and can be described as:

I Preprocessing of data
   (a) Align datasets (i.e. ensure observations match between blocks)
   (b) Run PCA analysis for overview and quality control of each dataset
II Determine the number of pairwise joint components between blocks
   (a) Run a generalized regression method, e.g. PLS or O-PLS, between each pair of blocks to map the number of components
III Run a modified OnPLS algorithm, which is independent of block order, to produce a multiblock model, from now on referred to as the JUMBA model
IV Evaluate the model
   (a) Create and inspect the correlation matrix plot [24] for unwanted correlations and also check for outliers in the model components
   (b) Potentially revise or repeat the workflow if the created model demonstrates issues (e.g. outliers)
V The accepted model is then visualized and interpreted using tools as described in Skotare et al. [24], such as the correlation matrix plot and metadata correlation plot (for more details on these see supporting information)
VI Use standard chemometric tools such as loading plots to interpret specific contents in each block


The final approved model splits the variation found in each analyzed block into joint, unique and residual variation (Equation 1).

$$X_i = \underbrace{X_J}_{\text{Joint}} + \underbrace{X_U}_{\text{Unique}} + \underbrace{E}_{\text{Residual}} \qquad (1)$$

The joint variation is variation shared by at least two blocks, and unique variation is found only in a single block. In contrast, the naming convention in OnPLS distinguishes between globally joint (joint components including all blocks) and locally joint (joint components including fewer than all blocks). JUMBA therefore uses the same naming convention for components as described in our previous work [24], but XG and XL are replaced with XJ. For example, XJ2 (1 2 3) denotes the second joint component and indicates that this component involves the first three blocks. A schematic of the JUMBA workflow including additional details is included in the supporting information.

Software

JUMBA and PLS were performed using an in-house script written in MATLAB (Version 9.4.0.813654 (R2018a). Natick, Massachusetts: The MathWorks Inc.).

Calibration transfer

Calibration transfer setup

For calibration transfer, the matching mushroom substrate and corn datasets were split into three subsets: calibration, transfer and validation sets. The calibration set (CS) was used to create reference master calibration PLS models, and all observations in this set required corresponding response values. The transfer set (TS) was measured on both master and slave instruments. The TS was used for creating the actual calibration transfer models, with no need for reference values. The validation set (VS) was used to evaluate the performance of the transfer methods. All observations in the VS need to have corresponding reference values.

As the matching mushroom substrate dataset had missing response values for a portion of the observations, the CS was chosen first, as it was constrained to only contain observations with corresponding response values. For consistency, we followed the same selection routine for the corn dataset even though it contained no missing response values. To ensure a good spread of the CS observations, we used the Kennard-Stone selection algorithm (KS) [32] on the concatenated full data from all instruments. KS has previously been used for calibration transfer to ensure that the samples are representative [6,33]. We then randomly selected a number of the remaining observations for the TS. The remaining observations were used in the VS. Consequently, the numbers of selected observations for the mushroom substrate dataset were 25 for CS, 10 for TS and 33 for VS. The corn dataset also used 25 observations for CS and 10 for TS, but 45 for the VS. In order to evaluate the selection routine visually, we utilized color-coded multiblock score scatter plots for the largest joint components of the full multiblock models (supporting information Figure S.5 for mushroom substrate, Figure S.10 for corn).

To evaluate the performance of the calibration transfer we selected instruments to represent both masters and slaves and used these for all calibration transfer methods. For the matching mushroom substrate dataset, we used the two typical benchtop instruments (Foss NIR Systems 6500 and Perten DA 7250) as masters in two separate calibration transfer scenarios. The remaining instruments in either case acted as slaves. For the corn dataset all possible combinations were used, as the number of possible combinations was much lower.
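The selection routine described above (Kennard-Stone for the CS among observations with responses, a random draw for the TS, the rest as VS) can be sketched as follows. This is an illustrative Python version, not the authors' implementation: the function names, the random seed and the synthetic data are assumptions, and the Kennard-Stone variant shown is the common one that starts from the two most distant observations.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows chosen by Kennard-Stone (n_select >= 2)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two observations that are furthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(X.shape[0]) if k not in selected]
    while len(selected) < n_select:
        # Distance from each candidate to its nearest already-selected point.
        min_dist = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(min_dist))]
        selected.append(pick)
        remaining.remove(pick)
    return selected

def split_sets(X, has_response, n_cal, n_transfer, seed=0):
    """Split observation indices into calibration, transfer and validation sets."""
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(has_response)      # CS needs response values
    cal = candidates[kennard_stone(X[candidates], n_cal)]
    rest = np.setdiff1d(np.arange(X.shape[0]), cal)
    transfer = rng.choice(rest, size=n_transfer, replace=False)
    validation = np.setdiff1d(rest, transfer)
    return cal, transfer, validation

# Synthetic stand-in for the concatenated spectra: 68 observations, 45 with responses.
rng = np.random.default_rng(1)
X = rng.normal(size=(68, 20))
has_response = np.zeros(68, dtype=bool)
has_response[:45] = True
cal, transfer, validation = split_sets(X, has_response, n_cal=25, n_transfer=10)
print(len(cal), len(transfer), len(validation))
```

With 68 observations this reproduces the set sizes stated for the mushroom substrate dataset (25, 10 and 33); with 80 observations the same call would leave 45 for the VS, as for the corn dataset.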
Calibration transfer methods

To evaluate whether JUMBA can be used as a calibration transfer technique, we selected a subset of the established calibration transfer methods to compare against. The selected methods were readily available (either straightforward to implement or available in PLS Toolbox) and had at most one hyperparameter, to minimize the problem of selecting optimal settings. The selected transfer methods were direct standardization (DS), piecewise direct standardization (PDS) [8–10], spectral space transformation (SST) [11] and transfer by orthogonal projection (TOP) [12]. In addition to these dedicated methods, we also used partial least squares regression (PLS) [17,34] as a reference for non-dedicated transfer methods.

Calibration transfer using JUMBA

In addition to the other transfer methods, we also investigated JUMBA for calibration transfer. JUMBA transfer uses the predicted scores of the slave instrument together with the loadings of the master instrument to reproduce the master block (Equation 2). By default, the same JUMBA model is used regardless of which instrument is used as master or slave.

$$X_M = (T_{nS} \times Mf_{M,S}) \times P_M \qquad (2)$$

where:

$X_M$ = Predicted master block
$T_{nS}$ = Predicted slave scores normalized to length 1
$Mf_{M,S}$ = Multiplication factor (difference in length between master and slave scores)
$P_M$ = Master loadings (only joint with slave block)
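A numerical reading of Equation 2 is sketched below. It is one possible interpretation rather than the authors' implementation: the slave scores are assumed to be scaled by the per-component score lengths of the slave block in the fitted model, the multiplication factor is assumed to be the per-component length of the corresponding master scores, and the loadings are assumed to be stored as variables by components (hence the transpose). All names are hypothetical.

```python
import numpy as np

def jumba_transfer(T_slave, slave_len, master_len, P_master):
    """Reconstruct the master block from slave scores, following Equation 2.

    T_slave    : (n, A) predicted joint scores of the slave block
    slave_len  : (A,) score lengths of the slave block in the fitted model
    master_len : (A,) score lengths of the master block in the fitted model
    P_master   : (K, A) joint loadings of the master block
    """
    Tn_S = T_slave / slave_len        # assumed normalization to unit length
    Mf = master_len                   # assumed multiplication factor
    return (Tn_S * Mf) @ P_master.T

# Synthetic check: slave scores are a per-component rescaling of master scores.
rng = np.random.default_rng(2)
T_master = rng.normal(size=(10, 3))
T_slave = T_master * np.array([0.5, 2.0, 3.0])
P_master = rng.normal(size=(50, 3))
X_master = T_master @ P_master.T

X_hat = jumba_transfer(
    T_slave,
    slave_len=np.linalg.norm(T_slave, axis=0),
    master_len=np.linalg.norm(T_master, axis=0),
    P_master=P_master,
)
print(np.allclose(X_hat, X_master))
```

In this toy case the reconstruction is exact because the two score sets differ only in length. With real spectra, only the joint part of the master block is reproduced, which is the standardization effect the text describes.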

Compared to other techniques, JUMBA and TOP differ in that they standardize all instruments together, including the master instrument. TOP uses the transfer model to remove variation in the CS that is specific to the master instrument, while JUMBA uses it to remove non-joint variation. The non-joint variation in the CS is removed by inserting the CS into Equation 2, where the master instrument is considered both master and slave.

Calibration transfer performance evaluation

We used the root mean square error of prediction, RMSEP, to evaluate the performance of each calibration transfer method. The RMSEP is the expected error of a prediction, presented in the same unit as the original response. As the VS normally requires all observations to have corresponding response values but the mushroom substrate dataset lacks several responses, we estimated responses using a methodology similar to that described by Eskildsen et al. [35]. In their paper, estimated response reference values were calculated as the median of cross-validated values obtained for each of the analyzed instruments. In our paper, we fitted separate PLS models on the calibration data, one for each instrument. We then predicted the response values for the validation set with each model and used the mean value to estimate the VS responses. As we wanted to treat the matching mushroom substrate and corn datasets identically, we performed this procedure on both to generate new responses.

Setting hyperparameters

We automatically tested all reasonable hyperparameter settings for the PDS, SST and TOP transfer methods and selected the ones with the lowest RMSEP value. Optimizing against RMSEP is in most cases not desirable, as it can lead to overfitting and overly optimistic results, but we considered this acceptable within the scope of the current study. For PLS, we automatically decided the number of PLS components for each model using up to twelve components. The predicted explained variance by cross-validation (Q2Y) was used as the selection criterion, not RMSEP. This selection procedure was used when creating PLS calibration models for each instrument and for the PLS transfer procedure. JUMBA transfer optimizes the integration model by selecting different settings for the number of pairwise joint components for each combination of master and slave instruments per response. Details on how the hyperparameters for the different methods were selected are given in the supporting information.
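The evaluation metric and the estimated reference values can be written out in a few lines. The sketch below is illustrative only: the function names and the numbers are hypothetical, and the per-instrument predictions stand in for the outputs of the separate PLS models described above.

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean square error of prediction, in the unit of the response."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def estimate_reference(predictions_per_instrument):
    """Average per-instrument predictions (rows = instruments) per observation."""
    return np.mean(np.asarray(predictions_per_instrument, dtype=float), axis=0)

# Hypothetical predictions for two validation observations from three instruments.
predictions = [[1.0, 2.0],
               [1.2, 2.2],
               [0.8, 1.8]]
reference = estimate_reference(predictions)   # mean over instruments
error = rmsep(reference, predictions[1])      # error of the second instrument
print(reference, error)
```

Using the median instead of the mean in `estimate_reference` would give the variant described for Eskildsen et al.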


Table 2: Summed explained variation in the FMD JUMBA model.

Instrument   Joint (all blocks)   Other joint   Unique   E
Foss         97.7%                1.7%          0.3%     0.2%
VIAVI        98.8%                0.8%          0.4%     0.1%
Perten       99.2%                0.7%          0.0%     0.1%
Specim       94.1%                4.1%          1.6%     0.2%
Tec5         98.4%                1.2%          0.3%     0.1%

Results

Mushroom substrate dataset

JUMBA integration of FMD

We used JUMBA to integrate the blocks of NIR spectra from the five instruments, resulting in joint and unique components for each block (see supporting information Figure S.6a). The model is strong, as the first eight joint components have high internal correlations between the instruments within the same component and limited correlations to other components. We also see that the amount of variance explained by the individual instruments in a joint component can be very different. As seen in Table 2, the multiblock model mostly found joint variance shared by all blocks. Inspection of the metadata correlation plot (the first seven components in Figure 1a; for full results see supporting information Figure S.6b) shows a strong relationship between the first joint component and mannose, while the second joint component mostly relates to different kinds of sugar. To investigate the causes of these correlations we created multiblock scatter plots. The second joint component is largely due to the separation between control and normal samples (Figure 1b). The control samples were treated differently, e.g. not used for cultivation, and retain much of the initial sugar in the substrate, explaining the observed effect. If we instead plot the first and third components and color by the sugar with the strongest correlation (mannose), we see a trend in the samples from low to high (Figure 1c).
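The kind of relationship shown in a metadata correlation plot can be approximated by correlating component scores with the responses. The sketch below is an assumption about how such values could be computed, not the procedure from Skotare et al.: it uses the Pearson correlation, a single score matrix rather than one per block, and it skips observations with missing responses. All names and data are hypothetical.

```python
import numpy as np

def score_metadata_correlation(scores, metadata):
    """Pearson correlation of every score column with every metadata column.

    Missing metadata values (NaN) are ignored column by column.
    """
    n_comp, n_meta = scores.shape[1], metadata.shape[1]
    corr = np.full((n_comp, n_meta), np.nan)
    for j in range(n_meta):
        ok = ~np.isnan(metadata[:, j])        # observations with a response
        for a in range(n_comp):
            corr[a, j] = np.corrcoef(scores[ok, a], metadata[ok, j])[0, 1]
    return corr

# Synthetic example: one component, two responses, one missing value.
t = np.array([0.1, -0.3, 0.5, 0.2, -0.4])
scores = t[:, None]
metadata = np.column_stack([2.0 * t + 1.0, -t])
metadata[0, 1] = np.nan
corr = score_metadata_correlation(scores, metadata)
print(np.round(corr, 3))
```

Handling missing values pairwise matters for the mushroom substrate dataset, where only 45 of the 68 samples have reference values.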

[Figure 1, panels (a) to (c); image not reproduced. Panel (a): metadata correlation plot of the joint components XJ1 (1 2 3 4 5) to XJ7 (1 2 3 4) against total sugars, glucose, xylose, mannose, arabinose, galactose, rhamnose, Klason lignin, acid soluble lignin, extractives and ash. Panel (b): multiblock score scatter plot of TJ1 (1 2 3 4 5) against TJ2 (1 2 3 4 5), colored by arabinose, with control samples marked. Panel (c): multiblock score scatter plot of TJ1 (1 2 3 4 5) against TJ3 (1 2 3 4 5), colored by mannose.]

Explained variation per instrument as labelled in panel (a), listed in the order the labels appear and assumed to correspond to XJ1 through XJ7:
XJ1: Foss 32.35%, VIAVI 67.05%, Perten 64.73%, Specim 24.58%, Tec5 69.52%
XJ2: Foss 42.35%, VIAVI 7.37%, Perten 5.93%, Specim 26.75%, Tec5 24.64%
XJ3: Foss 7.16%, VIAVI 17.86%, Perten 18.20%, Specim 15.34%, Tec5 3.23%
XJ4: Foss 11.73%, VIAVI 4.94%, Perten 8.10%, Specim 19.64%, Tec5 0.47%
XJ5: Foss 2.91%, VIAVI 1.27%, Perten 2.05%, Specim 6.31%, Tec5 0.35%
XJ6: Foss 0.64%, VIAVI 0.10%, Perten 0.08%, Specim 0.63%, Tec5 0.11%
XJ7: Foss 1.02%, VIAVI 0.67%, Perten 0.61%, Specim 4.08%

Figure 1: Metadata correlation plot for FMD (a), showing the first seven joint components. Strong correlations are shown in deeper colors and larger circles, and the color indicates the direction of the correlation. The corresponding multiblock scatter plots are also shown. The first two joint components, colored by arabinose (b), separate the control samples from the others in the second component. The first and third joint components (c) are colored by mannose, with an arrow showing the trend from low to high.

Table 3: Summed explained variation in the MMD JUMBA model.

Instrument | Joint (all blocks) | Remaining joint | Unique | E
Foss | 99.9% | 0.0% | 0.0% | 0.1%
VIAVI | 99.7% | 0.0% | 0.2% | 0.1%
Perten | 99.8% | 0.0% | 0.0% | 0.2%
Specim | 99.5% | 0.3% | 0.2% | 0.1%
Tec5 | 99.5% | 0.0% | 0.4% | 0.1%

JUMBA integration of MMD

We applied JUMBA to the matching mushroom substrate dataset (1102 to 1645 nm). Compared to the FMD, this model comprises even more joint variance, as summarized in Table 3. The correlation matrix plot (supporting information, Figure S.7a) reveals 11 joint components. The first components show strong correlations and their loadings overlap, as exemplified by the loadings of the second joint component (Figure 3a). The tenth joint component contains weaker correlations, but inspection of the corresponding loadings (Figure 3c) shows a strong internal resemblance, with the same peaks. These later components are therefore kept in the model. Further visual inspection of the loadings shows that the Tec5 instrument appears to be noisier than the other four in later joint components such as the sixth (Figure 3b). The remaining variance is distributed over a number of small unique components.

As for the FMD, we selected components to view from the metadata correlation matrix plot (Figure 2a, full results in supporting information, Figure S.7b) and created multiblock scatter plots. The first two joint components (Figure 2b) display distinct groups for the control samples as well as for the samples of the 2015 study. We expect the samples from 2015 to be more decomposed, considering that three harvests were performed that year, in contrast to a single harvest in 2016. We also see a clear trend from low to high xylose content. The first and third joint components, shown in Figure 2c, display a trend for glucose in the larger of the two distinct groups. They also separate the group of controls, largely due to the first joint component. The observation LE8 appears to be an outlier at first glance, but it is clearly not a measurement error, as the multiblock scatter plot shows that it has near-identical score values for the different instruments. It is also consistent with the wet chemical analysis, where LE8 had the lowest glucose content.

Calibration transfer for MMD

To find a small number of representative responses, we created a two-component PCA model of the responses (n = 45, k = 11, supporting information, Figure S.11). From the loading plot (supporting information, Figure S.11b) we selected three representative responses (glucose, mannose and rhamnose) and used them for evaluating calibration transfer. As a baseline, we created separate PLS calibration models between the CS and the three selected responses for each instrument. The RMSEP values for these models can be seen in Figure S.12 and represent a best-case scenario for calibration transfer.

For JUMBA transfer on the MMD, we found strong performance using default settings.
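The RMSEP used for the baseline models and for the transfer comparison is the root mean square error of prediction on the validation set. The following is a minimal sketch of that calculation, assuming reference values and model predictions for a single response; the function name and the example numbers are illustrative and are not taken from the study.

```python
import math

def rmsep(y_true, y_pred):
    """Root mean square error of prediction for one response."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical reference values and predictions for four validation samples
reference = [30.0, 34.0, 38.0, 42.0]
predicted = [31.0, 33.0, 38.0, 44.0]
print(round(rmsep(reference, predicted), 3))
```

A lower value means that predictions lie closer to the wet chemical reference values, which is why the native PLS models serve as a best-case reference for the transferred models.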

[Figure 2 (plot residue removed). Panel (a): metadata correlation plot against joint components XJ1 to XJ4. Panel (b): TJ1 versus TJ2, colored by xylose, with the groups "Controls" and "2015 Study" and a xylose trend arrow. Panel (c): TJ1 versus TJ3, colored by glucose, with a glucose trend arrow. Per-instrument percentages from the axis labels, in the four groups recovered from the extraction: Foss 51.18%, VIAVI 50.81%, Perten 53.06%, Specim 52.03%, Tec5 66.99%; Foss 44.23%, VIAVI 44.19%, Perten 41.11%, Specim 42.84%, Tec5 24.94%; Foss 2.97%, VIAVI 3.37%, Perten 3.32%, Specim 3.12%, Tec5 5.78%; Foss 0.79%, VIAVI 0.64%, Perten 1.49%, Specim 0.57%, Tec5 0.76%.]

Figure 2: (a) Metadata correlation plot showing the first four joint components of the MMD model. Strong correlations are shown in deeper colors and larger circles, and the color indicates the direction of the correlation. (b) Multiblock scatter plot of the MMD, showing the first two joint components colored by xylose. A clear separation can be seen between the control samples and the others. (c) The first and third joint components colored by glucose, with an arrow showing the trend from low to high.

[Figure 3 (plot residue removed). Loading line plots for joint components 2 (a), 6 (b) and 10 (c): loadings PJ2, PJ6 and PJ10 versus wavelength (axis from 1100 to 1700), with one line each for Foss, VIAVI, Perten, Specim and Tec5.]

Figure 3: Loadings of the MMD JUMBA model. The second joint component (a) is strong, with very high visual similarity between the instruments. Component 6 (b) overlaps less but still shows strong similarities, and component 10 (c) shows more noise but overall still has a high resemblance.

As JUMBA transfer already outperformed all other methods with default settings, we did not further optimize its hyperparameters. The results of the calibration transfer can be observed in Figure 4 for the two different master systems, Foss and Perten. For reference, we included the baseline RMSEP values for these two instruments. The calibration transfer methods differ in performance. There is also considerable variation in performance between the instruments, with TOP generally displaying the largest spread. JUMBA transfer, however, performs very consistently.

[Figure 4 (plot residue removed). Three panels showing RMSEP for glucose, mannose and rhamnose. The x-axis lists Baseline, PLS, JUMBA, DS, PDS, SST and TOP; the legend lists slave and master combinations of the Foss, Perten, VIAVI, Specim and Tec5 instruments.]

Figure 4: RMSEP for the different transfer methods on the MMD, including the baseline performance of the master instruments (Foss NIR Systems 6500 and Perten DA 7250). Master systems are represented by the marker used and slave instruments by the color. The different instruments are slightly offset on the x-axis depending on the slave instrument. The transfer method legend has the slave instrument before the arrow and the master instrument after. JUMBA transfer is more consistent across different slave instruments than the other transfer methods.

Corn dataset

JUMBA integration of corn dataset

The JUMBA integration of the corn dataset resulted in ten joint components containing all instruments and three joint components shared only between instruments MP5 and MP6 (supporting information, Figure S.8). The summed explained variation can be seen in Table 4. During analysis we detected that instrument M5 has weaker correlations than the other two instruments in joint components 8, 9 and 10 (see the correlation matrix plot in supporting information, Figure S.8a). To further analyze the instrumental differences, we created normalized score scatter plots for both the first and the eighth component, along with the corresponding loading line plots (Figure 5). Even in the first component we can see that the MP5 and MP6 instruments are more similar to each other and display a stronger correlation (Figure 5b versus 5a). The loadings are also more similar (Figure 5c). The trend is even more pronounced for the eighth component (Figure 5d, e, f).

Table 4: Summed explained variation in the corn JUMBA model.

Instrument | Joint (all blocks) | Remaining joint | Unique | E
M5 | 99.6% | 0.0% | 0.2% | 0.2%
MP5 | 99.7% | 0.0% | 0.0% | 0.1%
MP6 | 99.7% | 0.0% | 0.0% | 0.1%

[Figure 5 (plot residue removed). Panels (a), (b), (d) and (e): scatter plots of normalized scores with the axes M5 TJ1, MP5 TJ1, MP6 TJ1, M5 TJ8, MP5 TJ8 and MP6 TJ8. Panels (c) and (f): loadings versus wavelength (axis from 1000 to 2500), with one line each for M5, MP5 and MP6.]

Figure 5: Normalized scores between M5 and MP5 (a), MP5 and MP6 (b) and their corresponding loadings (c) for the first component in the corn dataset. Similarly, (d), (e) and (f) show the same plots for the eighth component.

Calibration transfer for the corn dataset

We used all four available responses for the calibration transfer evaluation. As with the MMD, we created a baseline from separate PLS calibration models between the CS and the four responses for each instrument (see supporting information, Figure S.13). The results of the calibration transfer for the three instruments are displayed in Figure 6. Each instrument is used as master once, and the baseline value for the calibration model is also shown.
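The pairwise comparison of normalized scores for a joint component (Figure 5a, b, d, e) can also be summarized by a correlation coefficient. The sketch below assumes that "normalized" means centered and scaled to unit standard deviation, which the text does not specify, and it uses hypothetical score values rather than scores from the corn model.

```python
import math

def normalize(scores):
    """Center to zero mean and scale to unit sample standard deviation."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))
    return [(s - mean) / sd for s in scores]

def score_correlation(scores_a, scores_b):
    """Pearson correlation between two instruments' scores for one joint component."""
    za, zb = normalize(scores_a), normalize(scores_b)
    return sum(a * b for a, b in zip(za, zb)) / (len(za) - 1)

# Hypothetical block scores for the same five samples on three instruments
m5 = [0.9, -0.2, 0.4, -1.1, 0.1]
mp5 = [1.0, -0.5, 0.2, -0.9, 0.3]
mp6 = [1.1, -0.5, 0.2, -1.0, 0.3]

print("M5 vs MP5: ", round(score_correlation(m5, mp5), 3))
print("MP5 vs MP6:", round(score_correlation(mp5, mp6), 3))
```

In this made-up example the MP5 and MP6 scores correlate more strongly than the M5 and MP5 scores, which is the kind of pattern the scatter plots display visually.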


We see apparent differences in the performance of the calibration transfer methods, with overall strong performance especially for the JUMBA and TOP transfer methods. We also observe that calibration transfers involving the M5 instrument generally perform worse.

Calibration transfer performance summary

To evaluate the aggregated calibration transfer performance for the two datasets, we calculated mean RMSEP values for all transfer methods and all seven responses. Additionally, the upper limits of the 95% confidence intervals of the mean RMSEP were calculated as a performance measure that also takes stability into account. The transfer methods were ranked with respect to their performance in these two measures, where the best method was assigned a rank of 1 and the worst a rank of 6. Average ranks could then be determined for the six transfer methods and the two performance measures. For the mean RMSEP, the ranking order (from best to worst) was JUMBA, TOP, SST, PDS / PLS and DS, and for the upper limit of the 95% confidence interval of the mean RMSEP it was JUMBA, TOP, PLS, PDS / SST and DS. For full results, see the supporting information.
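The ranking procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the upper confidence limit uses a normal approximation (the text does not say how the interval was computed), tied methods share the average of their ranks, and all RMSEP values, like the function names, are hypothetical. Only three methods and two responses are shown for brevity.

```python
import math
from statistics import NormalDist, mean, stdev

def upper_ci_limit(values, level=0.95):
    """Upper limit of a two-sided CI for the mean (normal approximation, an assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean(values) + z * stdev(values) / math.sqrt(len(values))

def rank_methods(measure_by_method):
    """Rank methods from best (1) to worst; tied methods share the average rank."""
    ordered = sorted(measure_by_method, key=measure_by_method.get)
    ranks = {}
    start = 0
    while start < len(ordered):
        end = start
        while (end + 1 < len(ordered)
               and measure_by_method[ordered[end + 1]] == measure_by_method[ordered[start]]):
            end += 1
        shared_rank = (start + end + 2) / 2  # mean of 1-based positions start+1 .. end+1
        for method in ordered[start:end + 1]:
            ranks[method] = shared_rank
        start = end + 1
    return ranks

def average_ranks(rmsep, measure):
    """Rank the methods per response with `measure`, then average ranks over responses."""
    per_response = [
        rank_methods({method: measure(values) for method, values in by_method.items()})
        for by_method in rmsep.values()
    ]
    methods = next(iter(rmsep.values())).keys()
    return {method: mean(r[method] for r in per_response) for method in methods}

# Hypothetical RMSEP values: response -> method -> one value per instrument pair
rmsep = {
    "glucose": {"JUMBA": [1.0, 1.1, 1.0], "TOP": [1.2, 2.0, 1.0], "DS": [2.0, 2.5, 3.0]},
    "mannose": {"JUMBA": [0.10, 0.11, 0.12], "TOP": [0.10, 0.20, 0.15], "DS": [0.20, 0.30, 0.25]},
}

print("mean RMSEP ranks:    ", average_ranks(rmsep, mean))
print("upper 95% CI ranks:  ", average_ranks(rmsep, upper_ci_limit))
```

Ranking on the upper confidence limit penalizes a method whose RMSEP varies strongly between instrument pairs, even if its mean is competitive, which is the sense in which the second measure accounts for stability.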

[Figure 6 (plot residue removed). Four panels showing RMSEP for oil, protein, starch and moisture. The x-axis lists Baseline, PLS, JUMBA, DS, PDS, SST and TOP; the legend lists slave and master combinations of the M5, MP5 and MP6 instruments.]

Figure 6: RMSEP for the different transfer methods for the corn dataset. Reference values for the master instruments are also shown. Master systems are represented by the marker used and slave instruments by the color. The different instruments are slightly offset on the x-axis depending on the slave instrument. The transfer method legend has the slave instrument before the arrow and the master instrument after. When the M5 instrument is used as master, most transfer methods perform relatively poorly, and when it is the slave this is generally also the case.

Discussion

In all the presented JUMBA integrations, joint components containing all included instruments accounted for a large majority of the explained variation. We believe this is explained by the fact that the instruments all measure the same type of physicochemical properties. The FMD model had more joint components that did not include all instruments. It also had more unique components than the other two models. This is expected, as the FMD contained different spectral ranges.

Typically, the JUMBA integration extracted a high number of joint components from the ingoing blocks. It would normally be difficult to verify whether components extracted using techniques such as PCA represent relevant physicochemical properties. However, JUMBA integration allows us to inspect the visual similarity between the joint loadings. If the same underlying signal is the basis of a joint component, we can conclude that it is relevant, as it is detected in multiple instruments. We saw an example of this in the second joint component of the MMD model (Figure 2a). While the explained variance differed between the blocks, the loadings were still visually very similar (Figure 3a). This indicates that they still shared the same underlying structures.

The JUMBA integration with its related plots allowed us to assess instrumental similarities on two different levels, which would not have been possible using, for example, PCA. In addition to the overall similarity, the integration allowed us to inspect variations on the observation level. For example, we saw that the Tec5 instrument in the mushroom substrate dataset and the M5 instrument in the corn dataset differed overall from the other instruments. For the M5 instrument, these differences were also reflected in the calibration transfer results later on. On the observation level, we determined that a sample suspected to be an outlier (sample LE8 in the matching mushroom substrate dataset) was not due to a measurement error but rather due to actual chemical differences in the sample. Had a sample deviated, i.e. produced a large difference in scores between instruments, it could for example be that the sample was involuntarily modified between instrument measurements or that it simply was incorrectly measured. In addition to the specific benefits gained from the integration, JUMBA also provided general unsupervised analysis for finding trends and clusters in the data, as shown in the multiblock scatter plots.

JUMBA integration models, which are based on the OnPLS algorithm, use highly correlated but not identical joint scores called block scores, in contrast to other methods, which use identical scores for all blocks. As a consequence of not forcing identical scores, some degree of non-correlated variation is included in joint components, especially in those with weak internal correlations. Orthogonal variation, either explicit in unique components or included in later joint components, accounted for a small fraction of the total explained variance. If the findings in the current study can be extended to other cases, this implies that techniques for calibration transfer do not mainly remove orthogonal variation per se, but instead normalize common underlying signals shared between the instruments.

For calibration transfer, we found that the multiblock scatter plot could be used for visualizing the partitioning into transfer, calibration and validation sets (see supporting information, Figures S.5 and S.10). The multiblock scatter plot effectively displayed the joint structures for all included blocks simultaneously, which enabled us to ensure representative calibration transfer conditions.

For the dedicated transfer methods, performance often varied considerably between instruments, and in this respect JUMBA transfer excelled. The performance of JUMBA transfer was exceptional for the MMD, where it outperformed the other methods. Its performance for the corn dataset was also strong, with characteristics similar to the well-performing TOP. This performance and stability can most likely be credited to the use of joint structures, where the same model can be used for the transfer between all instruments, as was the case for the MMD.
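The observation-level check discussed above, where a sample with near-identical scores on all instruments is treated as chemically different rather than mismeasured, can be sketched as a per-sample spread of block scores across instruments. The data layout, the threshold and all score values below are hypothetical, and the sketch assumes that the block scores have already been put on a comparable scale.

```python
from statistics import pstdev

def score_spread(block_scores):
    """Per-sample spread of one joint component's scores across instruments.

    block_scores maps instrument name -> {sample name: score}.
    Only samples measured on every instrument are compared.
    """
    shared = set.intersection(*(set(scores) for scores in block_scores.values()))
    return {
        sample: pstdev(scores[sample] for scores in block_scores.values())
        for sample in sorted(shared)
    }

# Hypothetical scores: S2 is extreme but consistent, S3 disagrees between instruments
block_scores = {
    "instrument_1": {"S1": 0.10, "S2": -0.50, "S3": 0.20},
    "instrument_2": {"S1": 0.12, "S2": -0.49, "S3": -0.15},
    "instrument_3": {"S1": 0.09, "S2": -0.51, "S3": 0.05},
}

threshold = 0.1  # hypothetical cut-off for "large difference between instruments"
spread = score_spread(block_scores)
flagged = [sample for sample, value in spread.items() if value > threshold]
print(flagged)
```

A sample such as S2 would stand out in the scatter plot without being flagged here, whereas S3 would be a candidate for a handling or measurement problem on one of the instruments.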


TOP was the second-best performer overall, mostly thanks to its performance for the similar instruments in the corn dataset, while its performance suffered for the varied instruments in the MMD. In addition to the difference in instrumentation, it is possible that the clustering in the MMD had an impact on the performance of TOP and its use of average spectra for instrument standardization.12 JUMBA transfer and TOP are related in the sense that their calibration transfer procedures integrate several instruments at the same time. In comparison, JUMBA transfer may offer a more robust calibration transfer for more difficult datasets, as it matches every observation across the instruments instead of relying on average spectra.

We found that the portable instruments used in the mushroom substrate study produced results as accurate as the others in terms of RMSEP. The measurement procedures for the two portable instruments were set up to be as consistent as possible, which may not be achievable in the field. While the data integration revealed the Tec5 instrument to have noisy spectral profiles, this was reflected neither in its native PLS model nor in the calibration transfer results.

On a side note, the RMSEP values for the native PLS models may overall be artificially lowered due to the procedure for estimating the response values. Values are to a small extent influenced by each model, so the prediction of the validation set may be overly optimistic when using a given model. We deem this effect to be minor, as all other instruments together have the largest influence. This does not impact the relative comparison between the native reference models or between the calibration transfer methods. The selection of the different sets (calibration, transfer and validation) was performed on the FMD and the transfer was done on the MMD, but as all transfer methods use the same sets we believe the comparison is still valid.


Conclusions

We found the data integration provided by JUMBA to be a comprehensive technique for simultaneous analysis of multiple blocks of NIR spectra. For both investigated datasets, we found that the instruments mostly share the same underlying structures, albeit to varying extents. In addition to providing full exploration of the datasets, the simultaneous investigation could also detect differences in instrumental behavior, which were further reflected in the calibration transfer results for established methods.

We successfully employed JUMBA as a novel method for calibration transfer. Overall, JUMBA transfer displayed the strongest calibration transfer performance among the tested methods, with excellent results for both the more diverse MMD and the more uniform corn dataset. The results demonstrate that JUMBA and its joint domain transfer can produce accurate and stable prediction results for all instruments in a multi-instrument environment. Compared to existing methods, JUMBA offers a complete package of analysis and integration coupled with leading calibration transfer performance. The JUMBA joint domain and its concept could also potentially be employed for establishing multi-instrument calibration models, reducing the amount of data collection needed for each instrument.

Future work will further explore how to set up JUMBA for the purpose of calibration transfer and how the joint components and other possible hyperparameters should ideally be selected for optimal prediction accuracy. Its potential use for multi-instrument calibration modeling also needs to be examined.

Acknowledgements

This work was supported by the Swedish strategic research programme eSSENCE and by the Swedish Research Council grant number 2016-04376. We would like to thank associate professor Johan Linderholm for providing the VIAVI MicroNIR instrument. We would also like to thank Frida Torell and Kristina Lundquist for constructive feedback on the manuscript.


Competing interests

The authors declare no competing interests.

Supporting Information

Experimental section additional information: mushroom substrate measurement procedures, selection of hyperparameters for calibration transfer, the JUMBA workflow and model algorithm schematics, multiblock visualization details. Additional result details: selection figures for model transfer and mushroom substrate dataset responses, full correlation matrix and metadata correlation plots for JUMBA models, full loadings for the MMD JUMBA model and reference RMSEP values for all responses.

References

(1) Serrano, L.; Ustin, S. L.; Roberts, D. A.; Gamon, J. A.; Penuelas, J. Deriving water content of chaparral vegetation from AVIRIS data. Remote Sens. Environ. 2000, 74, 570–581.
(2) Hunt Jr, E. R.; Rock, B. N. Detection of changes in leaf water content using near- and middle-infrared reflectances. Remote Sens. Environ. 1989, 30, 43–54.
(3) Ceccato, P.; Gobron, N.; Flasse, S.; Pinty, B.; Tarantola, S. Designing a spectral index to estimate vegetation water content from remote sensing data: Part 1: Theoretical approach. Remote Sens. Environ. 2002, 82, 188–197.
(4) Rambla, F. J.; Garrigues, S.; De La Guardia, M. PLS-NIR determination of total sugar, glucose, fructose and sucrose in aqueous solutions of fruit juices. Anal. Chim. Acta 1997, 344, 41–53.


(5) Osborne, B. G.; Fearn, T.; Hindle, P. H.; et al. Practical NIR spectroscopy with applications in food and beverage analysis; Longman Scientific and Technical, 1993.
(6) De Noord, O. E. Multivariate calibration standardization. Chemom. Intell. Lab. Syst. 1994, 25, 85–97.
(7) da Silva, V. H.; da Silva, J. J.; Pereira, C. F. Portable near-infrared instruments: Application for quality control of polymorphs in pharmaceutical raw materials and calibration transfer. J. Pharm. Biomed. Anal. 2017, 134, 287–294.
(8) Wang, Y.; Veltkamp, D. J.; Kowalski, B. R. Multivariate Instrument Standardization. Anal. Chem. 1991, 63, 2750–2756.
(9) Griffiths, M. L.; Svozil, D.; Worsfold, P.; Hywel Evans, E. The application of piecewise direct standardisation with variable selection to the correction of drift in inductively coupled atomic emission spectrometry. J. Anal. At. Spectrom. 2006, 21, 1045.
(10) Alam, T. M.; Alam, M. K.; McIntyre, S. K.; Volk, D. E.; Neerathilingam, M.; Luxon, B. A. Investigation of Chemometric Instrumental Transfer Methods for High-Resolution NMR. Anal. Chem. 2009, 81, 4433–4443.
(11) Du, W.; Chen, Z. P.; Zhong, L. J.; Wang, S. X.; Yu, R. Q.; Nordon, A.; Littlejohn, D.; Holden, M. Maintaining the predictive abilities of multivariate calibration models by spectral space transformation. Anal. Chim. Acta 2011, 690, 64–70.
(12) Andrew, A.; Fearn, T. Transfer by orthogonal projection: Making near-infrared calibrations robust to between-instrument variation. Chemom. Intell. Lab. Syst. 2004, 72, 51–56.
(13) Bouveresse, E.; Massart, D. L. Standardisation of near-infrared spectrometric instruments: A review. Vib. Spectrosc. 1996, 11, 3–15.


(14) Liu, Y.; Cai, W.; Shao, X. Standardization of near infrared spectra measured on multi-instrument. Anal. Chim. Acta 2014, 836, 18–23.
(15) Bin, J.; Li, X.; Fan, W.; Zhou, J.-h.; Wang, C.-w. Calibration transfer of near-infrared spectroscopy by canonical correlation analysis coupled with wavelet transform. The Analyst 2017, 142, 2229–2238.
(16) Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
(17) Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130.
(18) Trygg, J.; Wold, S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 2002, 16, 119–128.
(19) Beebe, K. R.; Kowalski, B. R. An introduction to multivariate calibration and analysis. Anal. Chem. 1987, 59, 1007A–1017A.
(20) Trygg, J. O2-PLS for qualitative and quantitative analysis in multivariate calibration. J. Chemom. 2002, 16, 283–293.
(21) Löfstedt, T.; Trygg, J. OnPLS-a novel multiblock method for the modelling of predictive and orthogonal variation. J. Chemom. 2011, 25, 441–455.
(22) Löfstedt, T.; Hoffman, D.; Trygg, J. Global, local and unique decompositions in OnPLS for multiblock data analysis. Anal. Chim. Acta 2013, 791, 13–24.
(23) Reinke, S. N.; Galindo-Prieto, B.; Skotare, T.; Broadhurst, D. I.; Singhania, A.; Horowitz, D.; Djukanović, R.; Hinks, T. S.; Geladi, P.; Trygg, J.; Wheelock, C. E. OnPLS-Based Multi-Block Data Integration: A Multivariate Approach to Interrogating Biological Interactions in Asthma. Anal. Chem. 2018, 90, 13400–13408.


(24) Skotare, T.; Sjögren, R.; Surowiec, I.; Nilsson, D.; Trygg, J. Visualization of descriptive multiblock analysis. J. Chemom. 2018, e3071.
(25) Sjögren, R.; Stridh, K.; Skotare, T.; Trygg, J. Multivariate patent analysis-Using chemometrics to analyze collections of chemical and pharmaceutical patents. J. Chemom. 2018.
(26) Obudulu, O.; Mähler, N.; Skotare, T.; Bygdell, J.; Abreu, I.; Ahnlund, M.; Gandla, M.; Petterle, A.; Moritz, T.; Hvidsten, T.; Jönsson, L.; Wingsle, G.; Trygg, J.; Tuominen, H. A multi-omics approach reveals function of Secretory Carrier-Associated Membrane Proteins in wood formation of Populus trees. BMC Genomics 2018, 19.
(27) Williams, B. C.; McMullan, J. T.; McCahey, S. An initial assessment of spent mushroom compost as a potential energy feedstock. Bioresour. Technol. 2001, 79, 227–230.
(28) Finney, K. N.; Ryu, C.; Sharifi, V. N.; Swithenbank, J. The reuse of spent mushroom compost and coal tailings for energy recovery: Comparison of thermal treatment technologies. Bioresour. Technol. 2009, 100, 310–315.
(29) Wei, M.; Geladi, P.; Xiong, S. NIR hyperspectral imaging and multivariate image analysis to characterize spent mushroom substrate: a preliminary study. Anal. Bioanal. Chem. 2017, 409, 2449–2460.
(30) Hayes, D. J. Development of near infrared spectroscopy models for the quantitative prediction of the lignocellulosic components of wet Miscanthus samples. Bioresour. Technol. 2012, 119, 393–405.
(31) Barnes, R. J.; Dhanoa, M. S.; Lister, S. J. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 1989, 43, 772–777.


(32) Kennard, R. W.; Stone, L. A. Computer aided design of experiments. Technometrics 1969, 11, 137–148.
(33) Malli, B.; Birlutiu, A.; Natschläger, T. Standard-free calibration transfer - An evaluation of different techniques. Chemom. Intell. Lab. Syst. 2017, 161, 49–60.
(34) Geladi, P.; Kowalski, B. Partial Least-Squares Regression - a Tutorial. Anal. Chim. Acta 1986, 185, 1–17.
(35) Eskildsen, C. E.; Hansen, P. W.; Skov, T.; Marini, F.; Nørgaard, L. Evaluation of multivariate calibration models transferred between spectroscopic instruments: Applied to near infrared measurements of flour samples. J. Near Infrared Spectrosc. 2016, 24, 151–156.
