The Influence of Correlated Calibration Samples on the Prediction

Anal. Chem. 2002, 74, 5227-5236

The Influence of Correlated Calibration Samples on the Prediction Performance of Multivariate Models Based on Mid-Infrared Spectra of Animal Cell Cultures

Martin H. Rhiel,†,§ Michael I. Amrhein,‡ Ian W. Marison,† and Urs von Stockar*,†

Institute of Chemical Engineering, Swiss Federal Institute of Technology Lausanne (EPFL), 1015 Lausanne, Switzerland, and Online Control Ltd., Avenue de la Gare 10, 1003 Lausanne, Switzerland

The effect of metabolism-induced concentration correlations in the calibration samples on the prediction performance of partial least-squares regression (PLSR) models based on mid-infrared spectra from Chinese hamster ovary cell cultures was investigated. Samples collected from batch cultures contained highly correlated metabolite concentrations as a result of metabolic relations. Calibrations based on such samples could only be used to predict concentrations in new samples if a similar correlation structure was present and failed when the new samples were randomly spiked with the analytes. On the other hand, such models were able to predict glucose correctly even if they were based on a spectral range in which glucose does not absorb, provided that the correlations in the calibration and in the new samples were similar. If, however, samples from a calibration culture were randomly spiked with the main analytes, much more robust PLSR models resulted. It was possible to predict analyte concentrations in new samples irrespective of whether the correlation structure was maintained or not. Validity of all established models for any given use could be predicted a priori by computing the space-inclusion and observer conditions. Predictions from these computations agreed in all cases with the experimental test of model validity.

* Corresponding author: Tel.: +41-21-693-3191. Fax: +41-21-693-3680. Email: [email protected].
† Swiss Federal Institute of Technology Lausanne.
‡ Online Control Ltd.
§ Current address: Process Development, Cytos Biotechnology AG, Wagistrasse 25, CH-8952 Zurich-Schlieren, Switzerland.
10.1021/ac020165l CCC: $22.00. Published on Web 09/18/2002. © 2002 American Chemical Society.
Analytical Chemistry, Vol. 74, No. 20, October 15, 2002.

Recently, multivariate analysis of spectroscopic data has been successfully applied to gain analytical information for reaction monitoring in a number of fields, including polymerization,1 pharmaceuticals,2 refineries,3,4 distillation,5 food,6 and bioprocesses.7 The advantages of spectroscopic sensors include simultaneous multianalyte determinations, in situ sterilizability, and low maintenance during operation. These advantages are particularly important for bioprocess monitoring. Since the many components in the mostly complex cell culture media typically have overlapping absorbance features, successful application of these sensors requires appropriate multivariate analysis.

Multivariate forward calibration using, for example, partial least-squares regression (PLSR) is often used to obtain quantitative information from overlapping spectra.8-12 In the calibration step, spectra from selected spectral ranges are regressed against the corresponding concentrations of the analyte of interest. The validity of established calibration models is subsequently judged by estimating the concentrations of the analyte from spectra unseen in the calibration step. For real-time process monitoring in particular, it is necessary to judge the validity of the calibration models a priori, that is, before subsequent experiments are started. It is well-known that, for the correct concentration prediction of an analyte from a linear calibration model, the corresponding spectrum must lie in the space spanned by the calibration spectral data (termed the space-inclusion condition) and the pure-component spectra of the absorbing species must be linearly independent.11,13 Meeting this condition requires an adequate design of experiments (DOE). DOE for multivariate calibration has been discussed previously.11,14-17 These designs are typically based on synthetic mixtures. By sufficiently varying the composition of these calibration mixtures, the space-inclusion condition can be met.

In practical situations, such as in bioprocesses, however, synthetic mixtures may not be available because of proprietary compositions or high numbers of absorbing (intermediate) species. As a result, many investigators have used spectra from biotechnological reactions for which the associated composition was known (analyzed with an established reference analysis) and used this information to build calibration models. This technique ensures that the influence of the many unknown components in a complex cell culture medium on the calibration model is correctly allowed for. Examples include near-infrared (NIR) spectroscopic data from microbial fermentations,7,18-24 NIR spectroscopic data from animal cell cultures,25-27 and mid-infrared (MIR) spectroscopic data from microbial fermentations.28-30

In these applications, however, the correlation between the concentrations of absorbing species due to the metabolism (reaction) was not taken into account in the DOE for calibration. This technique, therefore, carries the risk that the space-inclusion condition is not met during actual measurement because of unforeseen alterations of the metabolism, leading to modifications in the correlations of the metabolites or to their breakdown. If, however, the correlations are still valid, they may lead to estimating a nonabsorbing species that, in fact, has no spectral features in the spectral range selected31,32 or that is actually not present during calibration33 but is correlated to some absorbing species in the calibration step (phantom measurements).33 For example, Pekeler et al.31 could predict the concentrations of glucose from fluorescence spectra, although they were aware that glucose does not fluoresce. In this case, the estimation ability of the calibration models was due to correlations between the concentration of the unobservable glucose and the concentration of other species in the reaction mixture that had measurable spectral features. From an analytical chemistry point of view, the concentration of the analyte is, thus, not really measured but modeled through the correlation structure. Although this feature may seem advantageous when analytes with weak or no absorption characteristics are to be measured, the use of such models may be risky, because they may break down if the correlation structures of the calibration samples do not match the ones in the validation or test samples.

However, all these problems can be avoided by removing the correlations from samples coming from an actual reaction through special calibration designs. Such DOE have been proposed for chemical reactions13,34 and for cell cultures (biotechnological reactions).35-40 For reacting mixtures, Amrhein et al.13 formulated rigorous conditions for the initial concentrations and the yield coefficients (reaction or stoichiometric coefficients) of a new (prediction) set that must be fulfilled to predict absorbing analytes of interest correctly. In addition, they proposed several DOE methods for the calibration step to satisfy the space-inclusion condition.

Despite the increasing popularity of multivariate analysis of spectroscopic data for process monitoring, there has not been an explicit analysis comparing calibrations based on correlated and uncorrelated samples with respect to their validity for unseen process data. This study is aimed at raising the awareness of process analysts of the need to check the validity of a calibration model before a new batch progresses. Special attention is therefore given to the space-inclusion condition for concentration data, and a novel, so-called observer condition is proposed for assessing the validity of “measuring” nonabsorbing analytes.

For the experimental study, we used the production of recombinant human secretory component (SC) in Chinese hamster ovary (CHO) cells as a model reaction. On the basis of this system, the following questions were experimentally investigated:

• How well does a calibration model based on samples in which the concentrations are correlated perform when used to predict concentrations in other samples exhibiting similar correlations?
• How well do such models perform when they are used to predict uncorrelated concentrations?
• How well do such models predict an analyte that does not absorb in the spectral range used for the calibration, as long as the concentrations in the test samples are correlated similarly to the ones used for calibration?
• How well does a calibration model based on uncorrelated concentration data perform if applied to correlated and uncorrelated validation samples?
• How do these models perform for predicting nonabsorbing analytes?
• How can the validity of these models in all these cases be foreseen a priori?

(1) Dallin, P. Process Control Qual. 1997, 9, 167-172.
(2) Buchanan, B. R.; Baxter, M. A.; Chen, T. S.; Qin, X. Z.; Robinson, P. A. Pharm. Res. 1996, 13, 616-621.
(3) Buttner, G. Process Control Qual. 1997, 9, 197-203.
(4) Chung, H.; Lee, J. S.; Ku, M. S. Appl. Spectrosc. 1998, 52, 885-889.
(5) Van den Berg, F. W. J.; van Osenbruggen, W. A.; Smilde, A. K. Process Control Qual. 1997, 9, 51-57.
(6) Hoyer, H. Process Control Qual. 1997, 9, 143-152.
(7) Hagman, A.; Sivertsson, P. Process Control Qual. 1998, 11, 125-128.
(8) Sjöström, M.; Wold, S.; Lindberg, W.; Person, J.-A.; Martens, H. Anal. Chim. Acta 1983, 150, 61-70.
(9) Wold, S.; Albano, C.; Dunn, J., III; Edlund, U.; Esbensen, K.; Geladi, P.; Hellberg, S.; Johannson, E.; Lindberg, W.; Sjöström, M. In Chemometrics, Mathematics and Statistics in Chemistry; Kowalski, B. R., Ed.; D. Reidel Publishing Company: Dordrecht, The Netherlands, 1984; pp 17-95.
(10) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193-1202.
(11) Martens, H.; Naes, T. Multivariate Calibration; John Wiley & Sons: New York, 1989.
(12) ASTM Practice E 1655-94. Annual Book of ASTM Standards Vol. 03.06; ASTM: West Conshohocken, PA, 1995.
(13) Amrhein, M.; Srinivasan, B.; Bonvin, D.; Schumacher, M. M. Chemom. Intell. Lab. Syst. 1999, 46, 249-264.
(14) Araujo, P. W.; Brereton, R. G. Trends Anal. Chem. 1996, 15, 156-163.
(15) Erikson, L.; Johansson, E.; Wikström, C. Chemom. Intell. Lab. Syst. 1998, 43, 1-24.
(16) Munoz, J. A.; Brereton, R. G. Chemom. Intell. Lab. Syst. 1998, 43, 89-105.
(17) Swierenga, H.; Weijer de, A. P.; Wijk van, R. J.; Buydens, L. M. C. Chemom. Intell. Lab. Syst. 1999, 49, 1-17.
(18) Eggeling, L.; Oberle, S.; Sahm, H. Appl. Microbiol. Biotechnol. 1998, 49, 24-30.
(19) Hall, J. W.; McNeil, B.; Rollins, M. J.; Draper, I.; Thompson, B. G.; Macaloney, G. Appl. Spectrosc. 1996, 50, 102-108.
(20) Kasprow, R. P.; Lange, A. J.; Kirwan, D. J. Biotechnol. Prog. 1998, 14, 318-325.
(21) Macaloney, G.; Draper, I.; Preston, J.; Anderson, K. B.; Rollins, M. J.; Thompson, B. G.; Hall, J. W.; McNeil, B. Trans. IChemE. 1996, 74, 212-220.
(22) Majara, M.; Mochaba, F. M.; Oconnorcox, E. S. C.; Axcell, B. C.; Alexander, A. J. Inst. Brewing 1998, 104, 143-146.
(23) Schmidt, S.; Kircher, M.; Kasala, J.; Locaj, J. Bioprocess Eng. 1998, 19, 67-70.
(24) Yano, T.; Aimi, T.; Nakano, Y.; Tamai, M. J. Ferment. Bioeng. 1998, 85, 461-465.
(25) Harthun, S.; Matischak, K.; Friedl, P. Anal. Biochem. 1997, 251, 773-78.
(26) McShane, M. J.; Coté, G. L. Appl. Spectrosc. 1998, 52, 1073-1078.
(27) Yano, T.; Harata, M. J. Ferment. Bioeng. 1994, 77, 659-662.
(28) Fayolle, P.; Picque, D.; Perret, B.; Latrille, E.; Corrieu, G. Appl. Spectrosc. 1996, 50, 1325-1330.
(29) Fayolle, P.; Picque, D.; Corrieu, G. Vib. Spectrosc. 1997, 14, 247-252.
(30) Tseng, D. Y.; Vir, R.; Traina, S. J.; Chalmers, J. J. Biotechnol. Bioeng. 1996, 52, 661-671.
(31) Pekeler, T.; Lindemann, C.; Scheper, T.; Hitzmann, B. Chem.-Ing.-Tech. 1998, 70, 1610-1611.
(32) Bro, R. Chemom. Intell. Lab. Syst. 1999, 46, 133-147.
(33) Arnold, M. A.; Burmeister, J. J.; Small, G. W. Anal. Chem. 1998, 70, 1773-1781.
(34) Amrhein, M. Ph.D. Thesis 1861, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 1998.
(35) Chung, H.; Arnold, M. A.; Rhiel, M.; Murhammer, D. W. Appl. Spectrosc. 1996, 50, 270-276.
(36) Riley, M. R.; Rhiel, M.; Zhou, X.; Arnold, M. A.; Murhammer, D. W. Biotechnol. Bioeng. 1997, 55, 11-15.
(37) Riley, M. R.; Arnold, M. A.; Murhammer, D. W.; Walls, E. L.; DelaCruz, N. Biotechnol. Prog. 1998, 14, 527-533.
(38) Riley, R. R.; Arnold, M. A.; Murhammer, D. W. Appl. Spectrosc. 1998, 52.
(39) Rhiel, M.; Cohen, M. B.; Murhammer, D. W.; Arnold, M. A. Biotechnol. Bioeng. 2002, 77, 73-82.
(40) Rhiel, M.; Ducommun, P.; Bolzonella, I.; Marison, I.; von Stockar, U. Biotechnol. Bioeng. 2002, 77, 174-185.

EXPERIMENTAL SECTION

Cell Line, Culture Medium, Reaction Vessels, and Culture Conditions. CHO/SSF3 cells propagated in ChoMaster HP1 medium (Ferrucio Messi Cell Culture Systems, Zürich, Switzerland) were used for all cell culture experiments. Cells were cultivated in (1) 250-mL spinner flasks (Tecnomara, Wallisellen, Switzerland) at a working volume of 100 mL and (2) a 2-L stirred tank bioreactor (BioLafitte, St-Germain-en-Laye, France) with a working volume of 1 L. A detailed description may be found elsewhere.41

Time Line of Cell Culture Experiments and Sampling for Reference Analysis. Cell culture experiments were performed in parallel; that is, one bioreactor culture and four spinner flask cultures were started at the same time. All of these cultures were inoculated from the same inoculum source. Spinner flask cultures were sampled at 0, 16, 40, 51, 64, 75, 95, 114, and 140 h culture time with a 10-mL sample volume. The bioreactor culture was sampled similarly, except that, in addition, samples of 100-mL volume were taken at 0, 37, 61, 93, and 111 h culture time. These samples were termed large volume (LV) samples.37 Each sample was immediately centrifuged at 1000g for 6 min to separate the cells, and the supernatant was stored at -20 °C. Removing the cells is sufficient to stop the cell culture reaction. After completion of the parallel experiment period, all samples were subjected to reference analysis and off-line spectra collection with a ReactIR spectrometer, as discussed later.

Reference Analysis. Glucose and lactate concentrations were determined by HPLC analysis. A sample volume of 10 µL of the respective cell culture sample was injected onto a Supelcogel H column (Supelco, Bellefonte, PA) equilibrated with a 0.02 N sulfuric acid solution at a flow rate of 0.8 mL/min. Elution peaks were detected with a refractive index detector (model HP 1047A, Hewlett-Packard, Waldbronn, Germany) and quantified with software supplied by Hewlett-Packard. Ammonia concentrations were determined by standard enzyme analysis (Boehringer Mannheim, Mannheim, Germany). Amino acids were determined by HPLC analysis using a system equipped with a programmable autosampler, diode array detector (DAD) (all Kontron Instruments, Zürich, Switzerland), and a 4-µm particle size, reversed-phase column (Superspher 100 RP-18e, Merck, Darmstadt, Germany) according to a method developed earlier.41 Cell densities (biomass) were determined by counting the cells in a Neubauer counting chamber, which encloses a defined volume. Details are described elsewhere.42

Sample Preparation for Random Spike Addition. Each of the five LV samples collected during the bioreactor culture was aliquoted into 10 small-volume (SV) samples.37 Mixtures of (known) randomly assigned amounts of glucose, lactate, glutamine, ammonia, asparagine, arginine, and alanine were added to the SV samples. The resulting samples were termed spiked samples. The absolute concentrations of the spiked samples were computed from the known concentrations of the added mixtures and the measured concentrations of the LV samples. For the spiked samples, Figure 2e-h displays the concentrations of glucose, lactate, asparagine, and ammonia, respectively.

FT-IR Setup and Spectra Collection. Single-beam spectra were collected with a ReactIR spectrometer (ASI Applied Systems, Millersville, MD) equipped with an MCT detector, an ATR diamond probe (DiComp 14.25′′ long, 0.625′′ diameter), and an optical conduit. Purge gas was supplied to the spectrometer housing, the optical conduit, and the probe shaft at 20 L/min with a Whatman FT-IR purge gas generator model 75-52 (Whatman International Ltd, Kent, England). A water single-beam reference spectrum was obtained at the beginning of each reaction. For off-line spectra collection, the ATR probe was immersed in the spiked samples, which were contained in 50-mL Falcon centrifuge tubes (Becton Dickinson) immersed in a 37 °C water bath. This temperature was equal to the temperature during cell culture experiments in the bioreactor to minimize potential influences of different spectra collection conditions. Care was taken to avoid entrapment of air bubbles at the sensor tip. Each spectrum is the average of 128 scans at a data spacing of 1.9 cm⁻¹. The spinner flask culture samples were randomized before spectra collection to avoid any possible correlation between spectra collection time and original culture time.

(41) Ruffieux, P.-A. Ph.D. Thesis 1875, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 1998.
(42) Ducommun, P.; Ruffieux, P.-A.; Furter, M.-P.; Marison, I.; von Stockar, U. J. Biotechnol. 2000, 78, 139-147.

THEORETICAL PRELIMINARIES

Forward calibration, that is, regressing the absorbance data (calibration data) on the concentration data of the analytes of interest, was performed using partial least-squares regression (PLSR).11 A small selection of various forward calibration methods13 of spectral reaction data is illustrated in this paper. The corresponding theoretical preliminaries are limited here to a minimum, and the corresponding proofs may be found in Amrhein et al.13 For pedagogical reasons, the equations below are based on the ideal noise-free case. Note, however, that the implications of the corresponding results will also hold for the noisy case.
Calibration of Spectral Reaction Data for an Absorbing Analyte. Let $a(k)$ denote the L-dimensional spectral (absorbance) vector of the L-channel ReactIR instrument at the observation instant k measuring a mixture with S species. Unless noted otherwise, it is assumed that all S species absorb. For unit path length and Beer's law being valid (spectral data depend linearly on molar concentrations),

$$a^T(k) = c^T(k)\,E \tag{1}$$

where superscript T denotes the transpose of a vector or a matrix. $E$ is the S × L pure-component coefficient matrix and $c(k)$ is the S-dimensional vector of the S molar component concentrations at observation instant k. For K observations, eq 1 can be written in matrix form as

$$A = C\,E \tag{2}$$

with $A$ being the K × L spectral (absorbance) matrix and $C$ being the K × S molar concentration matrix.

Space-Inclusion Condition to be Satisfied by the Spectral Data. The concentrations of an absorbing analyte are predicted correctly from a new (validation) spectrum, $a_n$, using a forward calibration model (based on the spectral calibration data matrix $A$) if $E$ has rank S and $a_n$ satisfies the following space-inclusion condition (e.g., in terms of the relative 2-norm13)

$$\epsilon_A \equiv \frac{\lVert a_n^T (I_L - A^+ A) \rVert_2}{\lVert a_n \rVert_2} \le \epsilon_{A,\mathrm{lim}} \tag{3}$$

where $I_L$ is the L-dimensional identity matrix; $A^+$, the pseudoinverse of $A$; and $\epsilon_{A,\mathrm{lim}}$, a threshold value to be specified. Equation 3 is similar to the well-known lack-of-fit or Mahalanobis test.11 Note that space-inclusion condition eq 3 is an a posteriori condition, since the new spectrum must be available for testing.

Space-Inclusion Condition to be Satisfied by the Yield Coefficient Matrix and Inlet Concentration Data. A yield coefficient for a biotechnological reaction refers to the consumption or production rate of a species in a reaction per consumption or production rate of a normalizing species. The coefficient of the normalizing species is defined as 1. The major difference between stoichiometric coefficients in chemical reactions and yield coefficients in biotechnological reactions is that, in chemical reactions, the coefficients are usually integers, whereas in biotechnological reactions, they are typically real numbers. However, yield and stoichiometric coefficients can be handled in an equivalent manner for the following theoretical results.

On the basis of the principle of mass conservation, a space-inclusion condition can be defined that takes into account the so-called extended yield matrix ($N_e$) of the calibration and new sets, that is, the matrix composed of the initial ($c_0$) and inlet ($c_{in}$) molar concentrations and the yield coefficient matrices ($N$). Let the experimental concentration matrices and extended yield matrices of the calibration and new sets be defined as

$$C_x \equiv [\,C_{in,1}^T \;\cdots\; C_{in,m}^T \mid c_{0,1} \;\cdots\; c_{0,m}\,]^T, \qquad N_e \equiv [\,N_1^T \;\cdots\; N_m^T \mid C_x^T\,]^T$$

$$C_{x,n} \equiv [\,C_{in,n}^T \mid c_{0,n}\,]^T, \qquad N_{e,n} \equiv [\,N_n^T \mid C_{x,n}^T\,]^T \tag{4}$$

where $C_{x,i}$ (($p_i$ + m) × S) and $C_{x,n}$ (($p_n$ + 1) × S) are the experimental concentration matrices for the ith calibration set and the new set, respectively; $C_{in,i}$ ($p_i$ × S) and $C_{in,n}$ ($p_n$ × S), the corresponding molar concentration matrices of the $p_i$ and $p_n$ mixtures added to the aliquoted LV samples; $c_{0,i}$ and $c_{0,n}$, the corresponding S-dimensional initial molar concentrations; m, the number of batch runs included in the calibration set; $N_i$ (R × S) and $N_n$ (R × S), the corresponding yield coefficient matrices; and $N_e$ and $N_{e,n}$, the corresponding extended yield matrices. Note that S refers to the union of significantly absorbing species in the calibration set and the new set.

Then, the concentrations of an absorbing analyte are predicted correctly from a new spectrum using a forward calibration model (based on the spectral calibration data matrix $A$) if $E$ has rank S and each row vector of $C_{x,n}$, $c_{x,n,k}$, satisfies the following space-inclusion condition (e.g., in terms of the relative 2-norm13),

$$\epsilon_C \equiv \frac{\lVert c_{x,n,k}^T (I_S - N_e^+ N_e) \rVert_2}{\lVert c_{x,n,k} \rVert_2} = 0 \quad \text{for all } k = 1, \ldots, p_n + 1 \tag{5}$$

where $I_S$ is the S-dimensional identity matrix. Thus, the initial molar concentrations, the added molar concentrations, and the yield coefficients of the new set must lie in the row space spanned by the initial molar concentrations, the added molar concentrations, and the yield coefficients of the m runs of the calibration set. Note that, in contrast to eq 3, space-inclusion condition eq 5 is an a priori condition, since the new spectrum need not be available for testing.

Calibration of Spectral Reaction Data for a Nonabsorbing Analyte. Let $N_e^a$ be the extended yield matrix corresponding to the S absorbing species of the calibration set. Let $n_e^{na}$ be the column vector for the nonabsorbing analyte containing the yield coefficients and the initial and spiked concentrations in the calibration set. For nonabsorbing analytes, $n_e^{na}$ needs to be appended to $N_e^a$:

$$N_e \equiv [\,N_e^a \mid n_e^{na}\,] \tag{6}$$
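The space-inclusion tests of eqs 3 and 5 share one computation: the relative 2-norm of the part of a new row vector that lies outside the row space of the calibration matrix. The sketch below is a minimal illustration in Python with NumPy, not the Matlab code used in the study. The function name `space_inclusion_residual` and all numbers are hypothetical, chosen only to show a row that stays in the calibration row space and one that leaves it.

```python
import numpy as np

def space_inclusion_residual(new_row, calibration_matrix):
    """Relative 2-norm of the part of new_row outside the row space
    of calibration_matrix (the left-hand side of eqs 3 and 5)."""
    new_row = np.asarray(new_row, dtype=float)
    calibration_matrix = np.asarray(calibration_matrix, dtype=float)
    # pinv(M) @ M is the orthogonal projector onto the row space of M.
    projector = np.linalg.pinv(calibration_matrix) @ calibration_matrix
    outside = new_row @ (np.eye(calibration_matrix.shape[1]) - projector)
    return float(np.linalg.norm(outside) / np.linalg.norm(new_row))

# Hypothetical calibration batch with three species (glucose, lactate, asparagine).
yield_row = np.array([1.0, -1.4, 0.16])    # illustrative yield coefficients
initial_row = np.array([15.0, 7.0, 2.5])   # illustrative initial concentrations [mM]
N_e = np.vstack([yield_row, initial_row])  # extended yield matrix in the sense of eq 4

# New batch started after 3 mM glucose was consumed: stays in the row space of N_e.
c_in_span = initial_row - 3.0 * yield_row
# Sample spiked with extra asparagine: leaves the row space of N_e.
c_spiked = np.array([15.0, 7.0, 10.0])

eps_in_span = space_inclusion_residual(c_in_span, N_e)
eps_spiked = space_inclusion_residual(c_spiked, N_e)
print(f"in span: {eps_in_span:.2e}, spiked: {eps_spiked:.2f}")
```

Passing the spectral matrix A and a new spectrum instead of the extended yield matrix and a concentration row gives the a posteriori test of eq 3. With real, noisy data the residual is compared with a threshold rather than with zero.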

The concentrations of a nonabsorbing analyte are then predicted correctly from a new spectrum using a forward calibration model (based on the spectral calibration data matrix A) if (i) E has rank S, (ii) $c_{x,n,k}$ satisfies the space-inclusion condition eq 5 for the redefined extended yield matrix eq 6, and (iii) the following observer condition is satisfied:

$$\operatorname{rank}(N_e^a) = \operatorname{rank}(N_e) \tag{7}$$
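The observer condition of eq 7 is a plain rank comparison: appending the column of the nonabsorbing analyte must not raise the rank, which is equivalent to that column being a linear combination of the columns of the absorbing species. The sketch below assumes NumPy. The helper name `observer_condition_holds` and the matrices are hypothetical illustrations, not data from this study.

```python
import numpy as np

def observer_condition_holds(n_e_absorbing, n_e_nonabsorbing):
    """Check eq 7: the rank must not increase when the column of the
    nonabsorbing analyte is appended to the absorbing-species matrix."""
    n_ea = np.asarray(n_e_absorbing, dtype=float)
    n_ena = np.asarray(n_e_nonabsorbing, dtype=float).reshape(-1, 1)
    n_e = np.hstack([n_ea, n_ena])  # eq 6
    return bool(np.linalg.matrix_rank(n_ea) == np.linalg.matrix_rank(n_e))

# Hypothetical extended yield matrix for two absorbing species (columns)
# and three rows (for example, one yield row and two concentration rows).
N_ea = np.array([[-1.4, 0.16],
                 [7.0, 2.5],
                 [3.0, 1.0]])

# Nonabsorbing column that is a linear combination of the absorbing columns:
# its variation is fully determined by the absorbing species.
column_correlated = N_ea @ np.array([-0.6, 1.5])
# Nonabsorbing column that varies independently (for example, after random spiking).
column_independent = np.array([1.0, 15.0, 5.0])

holds_correlated = observer_condition_holds(N_ea, column_correlated)
holds_independent = observer_condition_holds(N_ea, column_independent)
print(holds_correlated, holds_independent)
```

For measured data, the default tolerance of `numpy.linalg.matrix_rank` is usually too strict, so a tolerance reflecting the noise level would have to be chosen; that choice is outside this sketch.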

The observer condition eq 7 is similar to the condition found for the state estimation of unmeasured concentrations in (bio)reactions.43 In other words, the information about the correlation structure must be contained in the S absorbing species. Thus, the observer condition is trivially satisfied for absorbing analytes. For nonabsorbing analytes, however, the observer condition might not be fulfilled even though the space-inclusion conditions eqs 3 and 5 are satisfied, leading to incorrect concentration predictions.

Calibration Procedure. QuantIR (ASI Applied Systems) was used for calibration and Matlab (The Mathworks Company) for the computation of the space-inclusion and observer conditions. All data sets were mean-centered for calibration. The optimum number of PLS factors was estimated from a PRESS plot (predictive residual error sum of squares). The PRESS value for a given calibration model was computed by iteratively leaving one sample out, building a calibration model with the remaining samples, and predicting the left-out sample (leave-one-out cross-validation). The optimum number of factors was chosen to be at the minimum PRESS.10 When no minimum was observed, the optimal number of PLS factors was chosen according to the principle of parsimony.10 The calibration statistics, R², standard error of calibration (SEC), and standard error of prediction (SEP), were computed as detailed in the ASTM practice.12 For the noise-free case and Beer's law being valid, the optimum number of PLS factors in the calibration model is typically less than the number of absorbing species (rank-deficient spectral data) as a result of the linear dependency of the concentrations of the absorbing species in (bio)reactions.13 However, for data with additional variation sources, such as nonlinearities (e.g., baseline shifts due to day-to-day instrumental variations) and heteroscedastic noise, the optimum number of PLS factors is expected to be larger. This is because the additional variation sources will span an additional space independently of that of the extended yield matrices.

(43) Bastin, G.; Dochain, D. On-line Estimation and Adaptive Control of Bioreactors; Elsevier: Amsterdam, 1990.

Table 1. Relation of the CHO Cell Culture Key Metabolites during Cellular Growth in Batch Culture

metabolite (species)   abbreviation   initial concn [mM]   end concn [mM]   yield coefficient to glucose [mM/mM]   R² to glucose
cell density           X              0.22                 1.08             -0.062a                                0.918
glucose                G              14.87                1.74             1                                      1
lactate                L              6.78                 25.00            -1.404                                 0.991
asparagine             N              2.47                 0.33             0.162                                  0.998
alanine                A              0.19                 0.71             -0.039                                 0.952
ammonia                M              0.98                 2.01             -0.079                                 0.993
glutamate              E              0.12                 0.35             -0.018                                 0.881
aspartate              D              0.08                 0.18             -0.006                                 0.849

a (10^9 cells/L)/mM.

RESULTS AND DISCUSSION

Analysis of Concentration Data. The metabolite concentrations are governed by the cellular metabolism, a highly complex reaction network.41 Ruffieux41 described the main metabolism of Chinese hamster ovary (CHO) cells by 22 reactions, involving 27 species (metabolites) and biomass. All of the reactions occur inside the cell. Some metabolites, however, are exchanged with the extracellular environment, namely, glucose, glutamine, asparagine, glutamate, aspartate, alanine, ammonia, carbon dioxide, and oxygen. In addition, the biomass increases while the cells are growing. Typical concentration profiles of these metabolites during a batch culture are shown in Figure 1. While the cells grow (Figure 1a), the substrates glucose (Figure 1b) and asparagine (Figure 1c) are consumed. Lactate (Figure 1b), alanine (Figure 1c), and ammonia (Figure 1c) accumulate as byproducts. Not shown is the substrate glutamine, for which no analytical measurement was available. Also shown in Figure 1 are the respective metabolite relations to glucose in the cell growing phase. Regression analysis on these relations revealed that most of the metabolites are correlated to each other as a result of the coupled reactions (Figure 1e-h). The corresponding correlation coefficients (yield coefficients) for this cell culture reaction are summarized in Table 1. On the basis of the reaction pathway analysis,41 the correlation (yield) of lactate to glucose is obvious, because lactate is produced from glucose.
The other metabolites, however, are not directly related.41 As a result of the pathway and correlation analyses, the reactive species of the 22 metabolic reactions are observed as one single global reaction. With respect to the measured metabolites, glucose (G), asparagine (N), cells (X), lactate (L), alanine (A), ammonia (M), glutamate (E), and aspartate (D), the global reaction appears as

$$G + y_N N \rightarrow y_X X + y_L L + y_A A + y_M M + y_E E + y_D D \tag{8}$$

with the capital letters referring to moles or units of the respective species in the same way as chemical formulas in equations of chemical reactions and $y_i$ being the yield coefficients in the respective reaction. However, the yield coefficients, $y_i$, may change as a result of adaptation of the metabolism to altering internal and external conditions.44 In fact, slight changes in the yield coefficients occurred in this study when the cultures were started at different initial conditions. Cell cultures 1-4 were each started with different ratios of the inoculum size to fresh culture medium. The metabolite profiles are displayed in Figure 2a-d, and the corresponding yield coefficients are summarized in Table 2. It can be seen that the yield coefficient for lactate ranges from -1.40 to -1.87 mM lactate/mM glucose for cell culture reactions 2 and 1, respectively (Table 2). The average yield coefficient for lactate on glucose is -1.60 with an 11.4% relative standard deviation. Thus, concentration values of unaltered samples obtained during cell culture reactions are correlated.

(44) Zeng, A.-P.; Hu, W.-S.; Deckwer, W.-D. Biotechnol. Prog. 1998, 14, 434-441.

Figure 1. Typical metabolite concentration profiles (a-d) and their relations to glucose (e-h) during batch cultures of the discussed CHO/SSF3 cells. (The metabolite relations to glucose (e-h) are shown for only the culture phase containing glucose.)
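Because the batch behaves like one global reaction, the end concentration of each metabolite follows from its initial concentration, its yield coefficient, and the amount of glucose consumed. The short sketch below checks this with the values of Table 1. It assumes that each tabulated yield coefficient is the slope of the metabolite concentration against the glucose concentration, which is an interpretation of the sign convention rather than something stated explicitly in the text.

```python
# Values copied from Table 1. Cell density X is in 10^9 cells/L, all others in mM.
initial = {"X": 0.22, "L": 6.78, "N": 2.47, "A": 0.19, "M": 0.98, "E": 0.12, "D": 0.08}
measured_end = {"X": 1.08, "L": 25.00, "N": 0.33, "A": 0.71, "M": 2.01, "E": 0.35, "D": 0.18}
yield_to_glucose = {"X": -0.062, "L": -1.404, "N": 0.162, "A": -0.039,
                    "M": -0.079, "E": -0.018, "D": -0.006}
glucose_initial, glucose_end = 14.87, 1.74

# Assumed convention: c_i = c_i,0 + y_i * (G - G_0), so products have negative slopes.
delta_glucose = glucose_end - glucose_initial
predicted_end = {
    species: initial[species] + yield_to_glucose[species] * delta_glucose
    for species in initial
}

for species in initial:
    print(f"{species}: predicted {predicted_end[species]:.2f}, "
          f"measured {measured_end[species]:.2f}")
```

Under this assumption the predicted end concentrations stay close to the measured ones, which is the linear dependency among the concentrations that the calibration samples inherit from the metabolism.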


Table 2. Metabolite Yield Coefficients with Respect to Glucose

metabolite     batch run 1   batch run 2   batch run 3   batch run 4   batch run 5   av yield coeff [mM/mM]   RSD %
cell density   n.a.          -0.06         -0.08         -0.05         -0.06         -0.06a                   18.4
lactate        -1.87         -1.40         -1.48         -1.59         -1.69         -1.61                    11.4
asparagine     0.28          0.16          0.16          0.17          0.16          0.18                     27.7
alanine        -0.02         -0.04         -0.04         -0.04         -0.07         -0.04                    49.0
ammonia        -0.13         -0.08         -0.09         -0.09         -0.09         -0.10                    22.2
glutamate      -0.05         -0.02         -0.02         -0.03         -0.05         -0.03                    42.6

a (10^9 cells/L)/mM.
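The average and relative standard deviation (RSD) columns of Table 2 can be reproduced from the per-batch values. The sketch below does this for lactate using only the Python standard library. It assumes the sample standard deviation (n - 1 in the denominator), which is not stated in the text but gives a value close to the tabulated one.

```python
from statistics import mean, stdev

# Lactate yield coefficients of batch runs 1-5, copied from Table 2 [mM/mM].
lactate_yields = [-1.87, -1.40, -1.48, -1.59, -1.69]

average_yield = mean(lactate_yields)
# stdev() is the sample standard deviation (n - 1 denominator): an assumption here.
rsd_percent = 100.0 * stdev(lactate_yields) / abs(average_yield)

print(f"average: {average_yield:.2f} mM/mM, RSD: {rsd_percent:.1f} %")
```

Small differences from the tabulated entries are expected, because the per-batch values in Table 2 are themselves rounded to two decimals.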

Concentration randomization in cell culture samples may be accomplished by spiking the samples with randomly selected amounts of specific analytes. This procedure has been used by Riley et al.37 and was used here as described in the Experimental Section. The resulting concentration values after spike additions are shown in Figure 2e-h for glucose, lactate, asparagine, and ammonia, respectively. Thus, concentration information of cell cultures may be available in two forms: (1) unaltered cell culture samples containing correlated concentration values and (2) spiked cell culture samples containing randomized concentration values.

Analysis of Spectral Data. Figure 3a shows absorbance spectra from samples taken at different cell culture times. It can be observed that the spectral features around 1563 and 1119 cm⁻¹ increase, whereas the spectral features at 1078 cm⁻¹ decrease. Comparing these absorbance features with the absorbances of the pure-component metabolites dissolved at 10 g/L in water shows that the features at 1563 and 1119 cm⁻¹ correspond to lactate absorbance and the feature at 1078 cm⁻¹ corresponds to glucose absorbance (Figure 3b). Spectra of the other metabolites (asparagine, ammonia, glutamine, and alanine), obtained from 10 g/L pure-component solutions, are also shown. However, the concentrations of the metabolites under actual cell culture reaction conditions are much lower. They correspond to 0.33, 0.71, 2.01, 0.35, and 0.18 mM for asparagine, alanine, ammonia, glutamate, and aspartate, respectively (Table 1). Thus, the absorbance features under actual conditions become less significant. Figure 3c displays the pure-component absorbance spectra scaled to the above-listed maximum concentration values under the actual cell culture conditions (Table 1). Thus, only glucose and lactate absorb significantly.
In the spiked samples, however, the added ammonia (M), asparagine (N), and alanine (A) also absorb significantly, as judged by visual inspection (Figure 3d), and they are therefore considered when spiked samples are used for calibration (Table 4). On the basis of Figure 3a-c, the spectral range between 1800 and 800 cm⁻¹ may be divided into two distinct regions. Region 1 spans from 1150 to 950 cm⁻¹ and contains all glucose spectral information; however, the glucose absorbance bands overlap with lactate bands. Region 2, on the other hand, spans from 1700 to 1500 cm⁻¹ and is void of glucose absorbance features. It contains predominantly the lactate spectral variation occurring during cell culture reactions.

Choice of the Calibration and Validation (New) Sets. The two different spectral information regions (overlapping absorbance, no absorbance) and the two different types of concentration information (correlated, uncorrelated) were used to establish four different calibration models. The statistics of these calibration models are summarized in Table 3.

Calibration model cal-corr-a (corr = correlated concentration data; a = absorbing spectral features) is based on the glucose concentrations from 27 samples (Table 3) collected during three batch runs, that is, batch runs 1, 3, and 4 (Figure 2a-d). The concentration relation between glucose and lactate in these samples is visualized in a scatter plot of lactate-versus-glucose concentration (solid dots in Figure 4). The regression factor on these 27 concentration pairs is -1.44, with an R² value of 0.990 (Table 3). This value is indeed similar to the average yield coefficient obtained from all performed cell cultures (Table 2). The spectral information used in model cal-corr-a is from the 1150-950 cm⁻¹ spectral range, which contains major glucose absorbances overlapping with lactate absorbances (Figure 3c). PLSR analysis of the K = 27 × L = 105 matrix resulted in an optimum calibration model at five PLS factors with a standard error of calibration (SEC) of 0.27 mM (Table 3).

Calibration model cal-corr-n (corr = correlated concentration data; n = no absorbing features in spectra) is based on the same 27 samples described for model cal-corr-a (Table 3). In contrast to model cal-corr-a, however, the spectral information of the 1700-1500 cm⁻¹ spectral range is used (Table 3). This spectral range does not contain any glucose absorbances (Figure 3c). The majority of the absorbances under reaction conditions correspond to lactate (Figure 3c). PLSR analysis resulted in an optimal calibration model at six PLS factors with an SEC of 0.51 mM and an R²SEC value of 0.992 (Table 3). Thus, a calibration model for glucose could be established from a spectral region void of glucose absorbance.
The basis of this model is probably the correlation structure between glucose and lactate, which has a correlation factor of -1.44 with an R²yL of 0.990 (Table 3).

Calibration model cal-rand-a (rand = randomized concentration data; a = absorbing spectral features), in contrast, is based on the glucose concentrations from 34 samples spiked with randomized concentrations (Figure 2e-h). That the concentration values of glucose and lactate are indeed uncorrelated can be seen by visual inspection of the corresponding scatter plot (Figure 4) and from the resulting R²yL value of 0.086 (Table 3). The spectral information for this calibration set was again the 1150-950 cm⁻¹ spectral range described above for model cal-corr-a. PLSR analysis resulted in an optimal calibration model at six PLS factors with an SEC of 0.44 mM and an R²SEC value of 0.989 (Table 3).

Calibration model cal-rand-n (rand = randomized concentration data; n = no absorbing features in spectra) is based on the same 34 spiked samples described for calibration model cal-rand-a. In contrast to model cal-rand-a, however, the 1700-1500 cm⁻¹ spectral range was used (Table 3). As described above for model cal-corr-n, this spectral region does not contain glucose information (Figure 3c). PLSR analysis resulted in an optimal calibration model at 11 PLS factors with an SEC of 0.66 mM and an R²SEC value of 0.979 (Table 3). That a model could be established in this case is surprising, because neither spectral glucose absorbance nor any concentration correlation exists (R²yL = 0.086, Table 3). The kind of information structure on which this model was based could not be identified.
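The correlation factor (yield coefficient) and R² value quoted for the lactate-versus-glucose relation can be obtained from an ordinary least-squares line through the concentration pairs. The sketch below assumes that this is how the values in Table 3 were computed, which the text does not state explicitly; the function name and the six concentration pairs are hypothetical stand-ins, not the 27 calibration samples.

```python
def slope_and_r2(x, y):
    """Least-squares slope of y on x and the squared Pearson correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r2 = sxy ** 2 / (sxx * syy)
    return slope, r2


# Hypothetical batch-culture data [mM]: lactate rises as glucose is consumed.
glucose = [25.0, 20.0, 15.0, 10.0, 5.0, 0.0]
lactate = [0.5, 7.0, 14.8, 21.3, 29.1, 35.6]

yield_coefficient, r2 = slope_and_r2(glucose, lactate)
print(f"slope = {yield_coefficient:.2f} mM lactate/mM glucose, R2 = {r2:.3f}")
```

A strongly negative slope with R² close to 1 indicates the metabolism-induced correlation discussed for the unaltered samples, whereas an R² near zero, as reported for the spiked samples, indicates that the correlation has been removed.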

Figure 2. Metabolite concentrations in reference samples taken directly from cell cultures (a-d) and in spiked samples (e-h). The 36 reference samples of cell cultures 1 (closed circles), 2 (closed triangles), 3 (closed squares), and 4 (closed diamonds) and the 50 spiked samples (open circles) were distributed for PLS calibration model development and validation as discussed in the text. The 50 spiked samples are based on the “large volume” samples taken from culture 5 (crosses) performed in a 2-L bioreactor (see Experimental Section).

To illustrate the validity limits when the above-described calibration models are applied to different kinds of unseen spectra, two validation sets were composed. Validation set val-corr contains nine samples of cell culture reaction 2, which was not used for model building. The name reflects the fact that the metabolite concentrations in these samples are correlated (Figure 4, open circles). This is confirmed by a correlation factor (yield coefficient) of -1.40 mM lactate/mM glucose with an R² value of 0.991 (Table 1). Validation set val-rand contains 16 spiked samples distinct from cal-rand. The name reflects the fact that the concentration values in this set are randomized (Figure 4, open squares), resulting in a correlation factor of -0.2 mM lactate/mM glucose with an R² value of 0.017.
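Validation results for such sets are reported as a standard error of prediction (SEP) and an R² value. The following sketch shows one common way to compute both; it assumes that SEP is the root-mean-square difference between predicted and reference concentrations and that R² is the squared Pearson correlation between them, since the exact formulas are not given in this section. The function names and the five reference values are hypothetical.

```python
import math


def sep(reference, predicted):
    """Root-mean-square difference between predicted and reference values."""
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Squared Pearson correlation between predicted and reference values."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    srr = sum((r - mean_r) ** 2 for r in reference)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    srp = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    return srp ** 2 / (srr * spp)


# Hypothetical reference glucose values [mM] and predictions with a constant offset.
reference = [0.0, 5.0, 10.0, 15.0, 20.0]
predicted = [r + 5.0 for r in reference]

offset_sep = sep(reference, predicted)
offset_r2 = r_squared(reference, predicted)
print(offset_sep, offset_r2)
```

With these definitions, a constant offset leaves R² at 1 while the SEP equals the size of the offset, so the two statistics should be read together rather than in isolation.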

Figure 3. (a) Mid-infrared absorbance spectra during batch culture of CHO cells. (b) Pure component absorbance features of 10 g/L solutions. (c) Pure component spectra scaled to the maximum concentration value obtained under culture conditions. (d) Spectra of spiked samples containing the maximum amount of the corresponding component. (Spectra in (b-d) are offset to facilitate visibility.)

Table 3. Summary of Discussed Glucose Calibration Models and Resulting Model Statistics

model name   Tᵃ   Kᵇ   yLᶜ     R²yLᵈ   spectral range [cm⁻¹]   Lᵉ    Aᶠ   PLSᵍ   SECʰ   R²SECⁱ
cal-corr-a   C    27   -1.44   0.990   1150-950                105   A    5      0.27   0.997
cal-corr-n   C    27   -1.44   0.990   1700-1500               105   N    6      0.51   0.992
cal-rand-a   R    34   -0.8    0.086   1150-950                105   A    6      0.44   0.989
cal-rand-n   R    34   -0.8    0.086   1700-1500               105   N    11     0.66   0.979

ᵃ Concentration type: C = correlated; R = randomized. ᵇ Number of samples. ᶜ Correlation factor (yield coefficient) to lactate. ᵈ R² value of correlation to lactate. ᵉ Number of channels. ᶠ Absorbance type: A = absorbing with overlap; N = nonabsorbing. ᵍ Number of PLS factors. ʰ Standard error of calibration [mM]. ⁱ R² value of calibration results.

Table 4. Summary of Validation Statistics When Each Calibration Model Is Applied to the Sets Containing Samples with Correlated Concentrations (val-corr) and Randomized Concentrations (val-rand)ᵃ

cal model    val set    species considered for computing obs   C      obs fulfilled   a priori validity   SEP     R²      a posteriori validity
cal-corr-a   val-corr   G, L                                   0.00   yes             yes                 1.66    0.957   yes
cal-corr-a   val-rand   G, L, A                                0.18   yes             no                  13.31   0.451   no
cal-corr-n   val-corr   G, L, M, N                             0.00   yes             yes                 3.23    0.912   yes
cal-corr-n   val-rand   G, L, M, A, N                          0.17   no              no                  16.87   0.032   no
cal-rand-a   val-corr   G, L, A                                0.00   yes             yes                 5.03    0.986   yes
cal-rand-a   val-rand   G, L, A                                0.00   yes             yes                 0.73    0.982   yes
cal-rand-n   val-corr   G, L, M, N, A                          0.00   no              no                  8.91    0.007   no
cal-rand-n   val-rand   G, L, M, A, N                          0.00   no              no                  6.66    0.009   no

ᵃ The validation statistics include the concentration space inclusion condition (C), the observer condition (obs), the standard error of prediction (SEP) [mM], and the corresponding R² values. Among the species considered for computing obs are glucose (G), lactate (L), alanine (A), ammonia (M), and asparagine (N).
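The way the a priori verdicts in Table 4 combine the two conditions can be checked with a short script. The sketch below transcribes the table and tests a rule that is an assumption made here, not a definition from the paper: a model is taken to be valid a priori only if the space inclusion value C is zero (to the table's two-decimal precision) and the observer condition is fulfilled. The function name and tolerance are hypothetical.

```python
# Rows transcribed from Table 4:
# (cal model, val set, C, obs fulfilled, a priori valid, a posteriori valid)
TABLE_4 = [
    ("cal-corr-a", "val-corr", 0.00, True, True, True),
    ("cal-corr-a", "val-rand", 0.18, True, False, False),
    ("cal-corr-n", "val-corr", 0.00, True, True, True),
    ("cal-corr-n", "val-rand", 0.17, False, False, False),
    ("cal-rand-a", "val-corr", 0.00, True, True, True),
    ("cal-rand-a", "val-rand", 0.00, True, True, True),
    ("cal-rand-n", "val-corr", 0.00, False, False, False),
    ("cal-rand-n", "val-rand", 0.00, False, False, False),
]


def a_priori_valid(c_value, obs_fulfilled, tol=0.005):
    """Assumed rule: C must be zero (within rounding) and obs must hold."""
    return abs(c_value) < tol and obs_fulfilled


# Does the assumed rule reproduce the "a priori validity" column?
rule_matches = all(
    a_priori_valid(c, obs) == prior for _, _, c, obs, prior, _ in TABLE_4
)
# Do the a priori verdicts agree with the experimental (a posteriori) verdicts?
prediction_confirmed = all(prior == post for *_, prior, post in TABLE_4)
print(rule_matches, prediction_confirmed)
```

Under this reading, every a priori verdict in the table follows from the two conditions, and each one matches the experimental outcome, in line with the agreement stated in the abstract.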

Results from Correlated Concentration-Based PLSR Models. Calibration models cal-corr-a and cal-corr-n are based on unaltered samples obtained from batch runs 1, 3, and 4 (Figure 2a-d).

When model cal-corr-a is applied to data set val-corr, the space inclusion condition and observer condition indicate validity a priori, that is, before the actual experiment (Table 4). This is obvious, because the yield coefficients of the calibration and validation sets

Table 5. Summary of Key Results with Respect to Concentration Information and Spectral Information Used to Build PLS Calibration Models spectral information

Figure 4. Scatter plot of glucose and lactate concentration distributions in the calibration and validation sets. Calibration set, “cal-corr” (solid dots) containing the unaltered cell culture reaction samples from runs 1, 3, and 4. Calibration set, “cal-rand” (solid squares) containing randomized concentration spiked samples. The validation samples from run 2 for unaltered samples (open circles) and randomized concentration spiked samples (open squares) are also included.

are similar (Table 2). A posteriori, validity is then confirmed by the experimental data. Almost all predicted concentration values of data set val-corr lie on the unity line (open circles in Figure 5a), resulting in a standard error of prediction (SEP) of 1.66 mM with an R² value of 0.957 (Table 4). As might be expected, a calibration model based on correlated samples measures concentrations correctly in samples in which similar correlations exist.

In contrast, application of model cal-corr-a is not valid for the data set val-rand. This is already indicated a priori, because the space inclusion condition is not fulfilled (Table 4). A posteriori validity testing by computing the SEP and R² value of the prediction also confirms the inability to predict glucose concentration values (diamonds in Figure 5a) that do not obey the correlation structure of the data used in model cal-corr-a. Inspection of the concentration correlation plot for this case shows that all concentrations are predicted in a cloud lying far below the unity line (Figure 5a), resulting in SEP and R² values of 13.31 mM and 0.451, respectively. This demonstrates that models based on correlated calibration samples fail when applied to samples devoid of these correlations.

When model cal-corr-n is applied to data set val-corr, the space inclusion and observer conditions are fulfilled (Table 4). This a priori validity test is confirmed by inspecting the concentration correlation plot for this case (circles in Figure 5b). The best estimations are made for high glucose concentrations, but the predicted value does not quite reach zero for samples devoid of glucose (Figure 5b). Nevertheless, an SEP of 3.23 mM with an R² of 0.912 was obtained (Table 4). The results demonstrate convincingly that, on the basis of correlations between glucose and lactate, glucose can be predicted correctly although it does not absorb in the spectral range used (1700-1500 cm⁻¹).
From an analytical chemistry standpoint, these valid estimations must be considered “computed” rather than “measured”. Estimating the uncorrelated glucose concentrations in data set val-rand with model cal-corr-n, however, is not possible (Table 4). A very high SEP of 16.87 mM and a very low R² value of 0.032 resulted for data set val-rand when model cal-corr-n was applied (Table 4). This is not surprising, since model cal-corr-n “computes” glucose through its correlation structure with lactate. This correlation structure, however, is absent from the randomized, spiked samples.
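The difference between a “computed” and a “measured” glucose value can be illustrated with a deliberately simplified toy model. The sketch below is not the PLSR procedure used in this work: it replaces the 105-channel spectrum with a single hypothetical channel that responds only to lactate and swaps PLSR for a univariate least-squares line. The absorptivity, the concentration ranges, and all function names are assumptions made for illustration; only the yield coefficient of -1.44 is taken from Table 3.

```python
import math
import random

K_LACTATE = 0.01   # hypothetical absorbance per mM lactate; glucose gives no signal
YIELD = -1.44      # mM lactate per mM glucose (Table 3)
GLUCOSE_0 = 25.0   # hypothetical initial glucose concentration [mM]


def signal(lactate_mm):
    """Single-channel absorbance that depends on lactate only."""
    return K_LACTATE * lactate_mm


def fit_line(x, y):
    """Ordinary least-squares intercept and slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(reference, predicted):
    n = len(reference)
    return math.sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)


def correlated_lactate(glucose_mm):
    """Lactate formed from consumed glucose according to the yield coefficient."""
    return YIELD * (glucose_mm - GLUCOSE_0)


# Calibrate glucose against the lactate-only signal using correlated samples.
cal_glucose = [25.0, 20.0, 15.0, 10.0, 5.0, 0.0]
cal_signal = [signal(correlated_lactate(g)) for g in cal_glucose]
intercept, slope = fit_line(cal_signal, cal_glucose)

# New samples that keep the correlation are predicted correctly.
corr_glucose = [22.0, 12.0, 3.0]
corr_pred = [intercept + slope * signal(correlated_lactate(g)) for g in corr_glucose]
rmse_corr = rmse(corr_glucose, corr_pred)

# New samples with independently randomized glucose and lactate are not.
rng = random.Random(0)
rand_glucose = [rng.uniform(0.0, 25.0) for _ in range(16)]
rand_lactate = [rng.uniform(0.0, 36.0) for _ in range(16)]
rand_pred = [intercept + slope * signal(l) for l in rand_lactate]
rmse_rand = rmse(rand_glucose, rand_pred)

print(f"RMSE correlated: {rmse_corr:.3g} mM, randomized: {rmse_rand:.3g} mM")
```

In this toy setting the model never sees glucose; it infers it from lactate, so the prediction error is negligible while the correlation holds and becomes large once glucose and lactate vary independently.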

spectral information     concn information: correlated                                    concn information: uncorrelated
overlapping absorbance   risky, since models were valid only for same-kind correlation   no risk, because models were always valid
no absorbance            risky, since models were valid only for same-kind correlation   no valid model was possible

Thus, correlated sample-based calibration models are valid only for the same type of correlation. Such models fail when the correlation structure is removed.

Results from Randomized Concentration-Based PLSR Models. Calibration models cal-rand-a and cal-rand-n are based on randomized concentration-spiked samples prepared as described in the Experimental Section (Figure 2e-h).

Estimations from model cal-rand-a are valid for data set val-corr (Table 4). This was already indicated a priori by the observer and space inclusion conditions (Table 4). A posteriori, the estimations for data set val-corr contained a constant offset (Figure 5c), resulting in an SEP of 5.03 mM with an R² of 0.986 (Table 4). This offset may be due to a baseline variation, since spectra for data sets val-corr and cal-rand were collected on different days. A better estimation was obtained for data set val-rand, as manifested by a low SEP of 0.73 mM (Table 4). All 16 concentrations could be predicted accurately (Figure 5c). In contrast to model cal-corr-a, model cal-rand-a can be used for both sample types; that is, it is independent of a correlation structure. Thus, a calibration model based on randomized calibration samples will correctly measure the concentrations in new samples whether they are correlated or not.

Using model cal-rand-n to predict the concentrations in data sets val-corr and val-rand is not valid. This can already be judged by the a priori computed observer and space inclusion conditions, neither of which is fulfilled (Table 4). In fact, the estimations for both validation sets result in random clouds (Figure 5d), manifested by R² values of 0.007 and 0.009 for data sets val-corr and val-rand, respectively (Table 4). From an analytical chemistry standpoint, this is not surprising, because neither spectral glucose information nor any concentration correlation structure was present.
It is impossible to measure analyte concentrations if the analyte does not absorb and if no correlations exist in the calibration samples that could be exploited by the model.

CONCLUSIONS

This paper investigated the effects of calibrating MIR absorbance spectra of samples with and without metabolism-induced concentration correlations. Table 5 summarizes the main results of this paper. Calibration models based on samples taken directly from a culture prepared for calibration purposes include the necessary background spectral features contributed by unknown metabolites and species always present in cellular cultures. However, the main metabolite concentrations are correlated, and such models can, thus, predict concentrations only in new samples with a similar correlation structure. They are to be avoided for


Figure 5. Prediction results for unaltered samples (open circles) and spiked samples (open diamonds) when the following calibration models are applied: (a) cal-corr-a, (b) cal-corr-n, (c) cal-rand-a, and (d) cal-rand-n.

calibration model development, since this structure may change for unforeseen reasons. On the other hand, correlated-sample-based calibrations may yield useful information on an analyte that does not absorb, so long as the correlation structure in the calibration samples and in the test samples is similar. To build more robust models, it is recommended to randomly spike samples from a calibration batch culture with the main analytes. Calibration models based on randomized samples predict concentrations in new samples correctly, irrespective of whether the correlation structure is maintained or not. However, the analytes must exhibit distinct absorption features in this case. If the species of interest does not absorb, calibration models based on randomized calibration samples completely fail to predict the analyte in new samples, whether they are correlated or not. For any calibration model, correlated or uncorrelated, validity can be predicted a priori by the observer and space-inclusion conditions before a new experiment. For nonabsorbing or not significantly absorbing analytes, commonly applied spectral projection criteria (e.g., Mahalanobis distance) will fail to detect that concentration predictions are invalid. The observer condition, however, will help to detect such situations.


The correlation issue may be applicable to the general field of process analytical chemometrics. Chemical reactions obey concentration correlations as a result of the underlying stoichiometry and kinetics. In addition, physical effects, such as light scattering, may correlate with reaction progress. For multivariate calibrations, however, it is recommended that correlations in the calibration samples be avoided in order to ensure model robustness and to provide a real “measurement” rather than a “computed value” of an analyte. This may prevent false interpretation in the case of changing correlation structures and should enhance confidence in the meaningfulness of the predicted value from multivariate calibration models.

ACKNOWLEDGMENT

Support from the Swiss National Science Foundation (SNF) is gratefully acknowledged.

Received for review March 13, 2002. Accepted July 7, 2002. AC020165L