Article pubs.acs.org/IECR

Quantitative Structure Retention Relationship Models in an Analytical Quality by Design Framework: Simultaneously Accounting for Compound Properties, Mobile-Phase Conditions, and Stationary-Phase Properties

Koji Muteki,* James E. Morgado, George L. Reid, Jian Wang, Gang Xue, Frank W. Riley, Jeffrey W. Harwood, David T. Fortin, and Ian J. Miller

Pfizer Worldwide Research & Development, Eastern Point Road, Groton, Connecticut 06340, United States

ABSTRACT: Quantitative structure retention relationships (QSRRs) can play an important role in enhancing the speed and quality of chromatographic method development. This paper presents a novel (compound-classification-based) QSRR modeling strategy that simultaneously accounts for the analyte properties, mobile-phase conditions, and stationary-phase properties. It involves the adoption of two models: (A) partial-least-squares discriminant analysis (PLS-DA) to classify compounds into subclasses having similar interactive relationships between the mobile-phase conditions and stationary phase; (B) L partial least squares (L-PLS) to predict the compound’s retention time based on the mobile-phase conditions, stationary phase, and compound properties. For the retention time of a compound to be modeled, the most favorable compound class is identified in an optimization framework that simultaneously minimizes both the compound misclassification rate (based on PLS-DA) and the retention time prediction error (based on L-PLS) through a mixed-integer optimization. The proposed QSRR model (L-PLS with compound classification) significantly improves the retention time predictability compared with traditional QSRR or L-PLS models without compound classification.
When combined with the linear solvation energy relationship parameters (using Abraham coefficients) as the column properties, the approach allows the following: (1) prediction of (new, never analyzed) compound retention times under chromatographic conditions (columns and mobile-phase conditions) used to train the model; (2) prediction of (previously analyzed under training conditions) compound retention times under chromatographic conditions that have not been previously evaluated; (3) optimization of the chromatographic conditions (mobile-phase and column selection) to maximize critical pair resolution, including new compounds; (4) enhanced mechanistic understanding of the interactive retention relationship between compounds, the mobile phase, and the column (e.g., compound retention mechanism). The effectiveness of the proposed modeling strategy will be demonstrated through two practical pharmaceutical applications in supercritical fluid chromatography and reversed-phase liquid chromatography.

1. INTRODUCTION

Analytical quality by design (AQbD) is well established in the pharmaceutical industry. At a high level, the aim of AQbD is to design a quality, robust method that consistently delivers the intended performance using an efficient and systematic workflow.1 The concept, principle, and general workflow are in the public domain.1−4

Retention and separation in chromatography are the result of differential partitioning or adsorption of analytes between mobile and stationary phases. The distribution of compounds between these phases is determined by the analyte properties (e.g., hydrogen-bonding group, pKa, log D), column properties, and mobile-phase composition. The process of relating chromatographic retention times to the analyte (compound) structure is commonly referred to as the quantitative structure (chromatographic) retention relationship (QSRR).5 The main objectives of QSRR include the following: (1) to develop an understanding of the separation mechanism (for instance, to select a new column with equivalent performance and orthogonal selectivity or to maximize the column characteristics responsible for the retention and minimize the overall run time) and (2) to predict retention for a new compound (analyte).

Industrial & Engineering Chemistry Research
Special Issue: John MacGregor Festschrift
Received: December 14, 2012; Revised: March 20, 2013; Accepted: March 20, 2013
dx.doi.org/10.1021/ie303459a | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX

QSRRs have been extensively researched over the past decade. Nord et al.6 applied a partial least-squares (PLS) model for the prediction of liquid chromatographic retention times of steroids by three-dimensional (3D) structure descriptors. Hancock et al.7 performed a comparative study of multiple linear regression, PLS, and classification and regression trees using 83 basic drugs. Bolanca et al.8 developed a gradient elution retention time model in ion chromatography using radial basis function artificial neural networks. Among QSRRs, the linear solvation energy relationship (LSER) model using Abraham coefficients9,10 has gained acceptance as a general tool to explore factors affecting the retention time in chromatographic systems.11 The LSER model is based on excess molar refraction (E), McGowan volume (V), dipolarity/polarizability (S), hydrogen-bonding acidity (A), and hydrogen-bonding basicity (B) as parameters of the analytes (Fields et al. extended LSER with a sixth term, D, the degree of ionization molecular descriptor12). Researchers have used databases of these column properties to improve QSRR models.13−15

As summarized in many review papers,5,16−19 QSRR can improve the speed of chromatographic method development19 and the quality of the resultant methods. Several practical limitations remain that prevent the widespread adoption of QSRR. First, most disclosed QSRR models are built upon a single mobile- and stationary-phase condition. Thus, retention time prediction would only provide local optimization with the fixed mobile- and stationary-phase combination, and the “universal optimal condition” (e.g., the “best” chromatographic condition across multiple mobile and stationary phases) may not be identified. This is especially important because the sample mixture to be analyzed changes during the life cycle of the project as a result of process, formulation, or raw material changes. New impurities may emerge, or some existing impurities may no longer need to be controlled. As a consequence, the previously optimized column and mobile-phase combination may no longer be valid or the most efficient analysis condition. Typically, additional experiments are performed to reoptimize the method based on the new sample set. Therefore, it would be beneficial to build a QSRR model that would allow retention prediction across the matrix of columns, mobile phases, and analyte properties.

In addition, the retention time prediction accuracy of disclosed QSRR models may not be sufficient to provide a reliable method or optimization guidance. While some QSRR studies have reported reasonable retention time prediction results,6,8,21 many others noted significant model prediction errors.22,23 Kaliszan5 noted that QSRRs derived from liquid chromatographic retention data are generally of low statistical significance.
More recently, Morgado et al.20 also reported that the traditional QSRR model had poor predictability and thus could not be used for optimization purposes.

One significant cause of modeling error is that the compound properties incorporated into the models are missing or poorly selected. A plethora of physicochemical properties related to hydrophobicity (e.g., log P, pKa, solubility parameters), topological electronic index,24 electrostatic and quantum-chemical indices,25,26 and 3D descriptors6 have been proposed and applied. More recently, Gramatica et al.27 presented a QSRR study using weighted holistic invariant molecular descriptors, which represent the whole 3D molecular structure in terms of size, shape, symmetry, and atom distribution. The appropriate incorporation of these additional descriptors would help improve the QSRR modeling accuracy.

Another cause of modeling error is the inclusion of compounds with widely diverse molecular characteristics in a single QSRR model. For example, the retention times of acidic and basic analytes are strongly (and inversely) related to the aqueous mobile-phase pH in reversed-phase liquid chromatography (RPLC; nonpolar stationary and aqueous/organic mobile phases).28,29 Fields et al.12 present a modified LSER model as one alternative to accommodate ionizable analytes. A graphical representation of the retention time (log D) change for acidic and basic analytes versus pH is shown in Figure 1. In the acidic environment (low pH), acids are protonated and thus neutral in charge. As the pH increases, the acids deprotonate (ionize) and become less hydrophobic; i.e., their log D decreases, and they elute sooner in RPLC than the neutral species. Conversely, the opposite occurs with basic analytes (the bases ionize at low pH and are neutral at high pH). (Note that molecular properties significantly impact the point at which an analyte ionizes. The analyte’s pKa is the pH at which 50% of the species is ionized and 50% is neutral.) Therefore, the retention times for acidic and basic compounds are inversely correlated with the pH in RPLC. If compounds with such different molecular characteristics are forced into a single model, the model would unavoidably result in significant prediction error. This pH dependence is often observed, and other similar relationships exist between analytes and mobile- and stationary-phase conditions that impact retention (e.g., column, solvents, temperature).

Figure 1. Change in the acidic and basic compound retention time as a function of the eluent pH.

To address these modeling error sources, QSRR models built around compound classification (into subgroups/classes that have similar characteristics) have been proposed and applied to retention time prediction. Wang et al.30 presented a univariate compound classification method based on log D profile similarity (as a function of the pH), prior to QSRR modeling, to increase the robustness of QSRR prediction. The apparent pH of the mobile phase changes during the solvent gradient31 and can be a major source of retention time prediction error for ionizable compounds analyzed in gradient elution. Because many other compound properties collectively contribute to the analyte’s retention behavior in addition to the pH (e.g., hydrogen bonding), the use of univariate classification may not provide the most accurate retention time predictions. However, the work did successfully demonstrate effective visualization of the chromatographic conditions providing both qualitative specificity and method robustness.

This paper presents a novel (compound-classification-based) QSRR modeling strategy that simultaneously accounts for analyte properties, mobile-phase conditions, and stationary-phase properties. It involves the adoption of two models: (A) partial-least-squares discriminant analysis (PLS-DA) to classify compounds into subclasses having similar interactive relationships between the mobile-phase conditions and stationary phase; (B) L partial least squares (L-PLS) to predict the compound’s retention time based on the mobile-phase conditions, stationary phase, and compound properties. For the retention time of a compound to be modeled, the most favorable compound class is identified in an optimization framework that simultaneously minimizes both the compound misclassification rate (based on PLS-DA) and the retention time prediction error (based on L-PLS) through mixed-integer optimization. The proposed QSRR model (L-PLS with compound classification) significantly improves the retention time predictability compared with traditional QSRR or L-PLS models without compound classification. When combined with the LSER parameters (using Abraham coefficients) as the


column properties, the approach allows the following: (1) prediction of (new, never analyzed) compound retention times under chromatographic conditions (columns and mobile-phase conditions) used to train the model; (2) prediction of (previously analyzed under training conditions) compound retention times under chromatographic conditions that have not been previously evaluated; (3) optimization of the chromatographic conditions (mobile-phase and column selection) to maximize critical pair resolution, including new compounds; (4) enhanced mechanistic understanding of the interactive retention relationship between compounds, the mobile phase, and the column (e.g., compound retention mechanism). The effectiveness of the proposed modeling strategy will be demonstrated through two practical pharmaceutical applications in supercritical fluid chromatography (SFC) and RPLC.

2. METHODS

2.1. Data Structures and Modeling Problems. The overall data structure generally available from chromatographic method development is depicted in Figure 2. The Y matrix refers to the (N × K) retention times of the compounds used in the experiments. N is the total number of experiments (chromatographic runs), and K is the total number of compounds to be analyzed in the key predictive sample set. The (N × L) Z matrix contains L mobile-phase conditions used during the experiments. The mobile-phase condition matrix Z may include not only intrinsic measurement conditions (e.g., temperature, pH, solvents, gradient time profile condition) but also some additional variables calculated based on fundamental models such as gradient eluent parameters,32 steepness of gradients,33 etc. The (K × M) X matrix consists of M compound descriptors that correspond to the K compounds used in the experiments. The (KK × M) XDB matrix contains all possible KK compounds in the database. The X matrix is a subset of XDB. The (N × R) S matrix contains R column properties used in the past chromatographic experiments. The (C × R) SDB matrix contains all C columns (e.g., stationary phase, vendor, dimensions) in the database, and the S matrix is a subset of SDB. The simplest column property is a binary flag matrix that marks the column used in each experiment. For example, if columns A−C were tested at the first, third, and second batches (chromatographic conditions), respectively, the binary flag matrix consisting of columns A−C can be represented as [1 0 0; 0 0 1; 0 1 0]. This representation of the columns is sufficient to find the optimal column among those previously tested. However, it does not allow prediction of the retention time on new, never tested columns. For such a case, LSER consisting of five parameters (electron lone pairs, dipole-type, hydrogen-bonding basicity, hydrogen-bonding acidity, and cavity formation and dispersion interactions)12,13,23,34 or a PQRI database35,36 can be applied to characterize the column nature. Including these LSER parameters as the column properties for the S and SDB matrices will allow the compound retention prediction on new columns [e.g., evaluate if alternative columns that have never been used in the past may be able to improve the chromatographic method performance (e.g., critical-pair resolution, run time, elution order, analyte retention) via a simulation, without additional experiments]. The application of “an expert in the box” would allow for several potential “optimal” and robust chromatographic conditions to be identified for subsequent experimental confirmation. This development would be highly useful and practical and allow better use of scientists’ time in obtaining an appropriate chromatographic method.

Figure 2. Overall data structure available from chromatographic method development.

An important feature of this data structure is that no dimension is common to all data matrices. Y has one dimension in common with Z and S but no dimension common to XT. This kind of L-shaped data structure can occur in situations when auxiliary data such as XT are obtained from different sources. Muteki and MacGregor37 proposed an L-PLS regression (L-PLSR) algorithm for L-shaped data structures with applications to mixture modeling.38−41 However, it is quite different in terms of modeling objectives and information flow and is unsuitable for this modeling problem. Martens et al.42 presented another L-PLSR algorithm for the L-shaped data structure that occurs in food marketing data analyses. Because the objective of this L-PLSR algorithm is to relate two different information matrices [SZ] and XT to the Y matrix in Figure 2, it can be applied. This L-PLSR algorithm enhances the variable interpretation because of the simultaneous modeling using [SZ] and XT on the Y matrix in Figure 2. However, this L-PLSR method (proposed by Martens et al.42) does not allow for retention time predictions in unevaluated space (e.g., with new compounds, new mobile-phase conditions, or new stationary phases). Therefore, another modeling approach is required to address these gaps.

2.2. QSRR Model Accounting for Mobile- and Stationary-Phase Conditions as Well as Compound Properties. A QSRR modeling approach that simultaneously accounts for


the mobile- and stationary-phase conditions as well as the compound properties is presented. First, the L-shaped data structure of Figure 2 is unfolded and arranged into a multiblock data structure depicted in Figure 3. The mobile-phase condition matrix Z and the stationary-phase condition matrix S are repeated for each of the K compounds and augmented with the compound property matrix Xk, where Xk is the property matrix of compound k made by repeating the property vector of compound k over the N experiments. The augmented matrix XZS is obtained by stacking the blocks [S Z X1], [S Z X2], ..., [S Z XK], which gives N × K samples with (R + L + M) properties. The retention time vector RT is made by rearranging the retention time matrix Y so that each retention time lines up with its XZS sample. The augmented data structure has one dimension in common over all data matrices.

Figure 3. Unfolded and rearranged data structure for PLS modeling.

Next, ordinary PLS43−45 is applied to represent the relationship between the augmented matrix XZS and the retention vector RT. The retention time RT is represented as a function of the mobile-phase conditions, stationary phases, and compound properties. J PLS models are built separately using eq 1 (J is the total number of compound classes). The class itself will be determined by a mixed-integer optimization framework described in section 2.4. The jth PLS retention time model is shown in eq 1.

$$
RT_j = f_j(XZS_j) + \varepsilon \qquad (j = 1, 2, \ldots, J) \tag{1}
$$

where RTj is a vector of retention times that consists of only compounds belonging to class j and is a subset of RT. XZSj is a matrix that consists of only compounds belonging to class j and is a subset of the augmented matrix XZS. Because PLS has been described extensively elsewhere,43−45 it is not detailed in this paper. PLS was selected for this study mainly because it deals effectively with the multicollinear relationship among the compound properties, the mobile-phase conditions, and the stationary phases and provides a more robust prediction of compound retention.

2.3. PLS-DA for Compound Classification. A PLS-DA46,47 is used to classify the K compounds into J subclasses (K ≥ J). The attribute of compound class is represented as a binary flag matrix Ck,j consisting of binary values (0 or 1), to show which class the individual compound belongs to. If a compound k belongs to a class j, then the Ck,j value is equal to 1. If not, the Ck,j value is equal to 0. As a constraint, the individual compound can belong to only one class (this constraint will be accounted for in the mixed-integer optimization framework described in section 2.4). For example, for three compounds and two classes, if compound 1 belongs to class 2 and compounds 2 and 3 belong to class 1, then Ck,j will be [0 1; 1 0; 1 0]. The Ck,j itself will be determined through the mixed-integer optimization described in the next section. Ck,j is related to the compound property matrix X by a PLS. The PLS models are built separately depending on the compound class j, based on eq 2

$$
C_j = f_j(X) + \varepsilon \qquad (j = 1, 2, \ldots, J) \tag{2}
$$

where Cj is a binary vector of class j and is a subset of Ck,j. At each PLS-DA class model in eq 2, a threshold of the predicted binary class vector Ĉj is determined to classify the compounds into 1 or 0, so as to minimize the misclassification rate in terms of both type I and type II errors.46,47 Once the PLS-DA models are built, they will be used to predict a suitable class for a new compound. The misclassification rate CRj at each class model is quantitatively calculated in eq 3.

$$
CR_j = \frac{\text{total number of misclassified compounds per class } j}{\text{total number of compounds } (K)} \; (\%) \tag{3}
$$

Figure 4. Change of (1) the predictability of the retention time and (2) the flexibility/capability of classification of new compounds as a function of the total number of compound classes.

2.4. Optimal Compound Classification through Mixed-Integer Optimization. To achieve an optimal compound classification, there are two key criteria depending on the total number of classes J (1 ≤ J ≤ K): (1) the predictability of the retention time; (2) the flexibility/capability of the prediction and classification of new compounds. The scheme is depicted in Figure 4. When the total number of

classes J is equal to 1 (i.e., one L-PLS model without compound classification), the predictability of the retention time would be low because diverse compound physical−chemical properties will exist in a model. However, this model would be straightforward and may be flexible because it covers a wide range of compound properties. On the other hand, when the total number of classes J is equal to the total number of compounds K (i.e., one compound per class), there is no variability of the compound properties, and the L-PLS model corresponds to a PLS model between mobile-phase conditions and stationary phases ([SZ]) and the retention time matrix of known compounds Y. This approach is commonly used as a traditional design of experiments (DOE) model in AQbD. This PLS model (Y = f([SZ])) will be suitable for optimization of the separation condition (generally involving a subtle change of the mobile-phase conditions) based only on


the set of known/fixed compounds, which will occur particularly in the late stage of development. However, the PLS model (Y = f([SZ])) does not allow the retention time prediction for new compounds. Therefore, an intermediate number of classes will be best for balancing model flexibility and retention predictability (good retention time predictability and good capability for predicting the retention time of compounds under untested chromatographic conditions). Considering the above two criteria, an optimal number of compound classes is determined using an optimization framework. This approach simultaneously minimizes both the total prediction error of the compound’s retention time (by the L-PLS model in eq 1) and the total misclassification rate of compounds (by PLS-DA models in eqs 2 and 3). The optimization framework is shown in eq 4.

$$
\begin{aligned}
&\underset{C_{k,j}}{\text{minimize}} \quad \left( w_1 RT_{\mathrm{err}} + w_2 C_{\mathrm{err}} \right) \\
&\text{s.t.} \quad RT_{\mathrm{err}} = \sum_{j=1}^{J} \left( RT_j - RT_{\mathrm{pre},j} \right)^2, \qquad RT_{\mathrm{pre},j} = f_j(XZS_j) \\
&\phantom{\text{s.t.} \quad} C_{\mathrm{err}} = \sum_{j=1}^{J} CR_j, \qquad CR_j \text{ is calculated based on } \hat{C}_j = f_j(X) \\
&\phantom{\text{s.t.} \quad} \sum_{j=1}^{J} C_{k,j} = 1, \qquad \sum_{k=1}^{K} C_{k,j} \ge \alpha, \qquad C_{k,j} \in \{0, 1\}
\end{aligned}
\tag{4}
$$

where J is the total number of classes, α is the minimum number of compounds per class, and w1 and w2 are the weight values of the objective terms of the total retention time error RTerr and the total misclassification rate Cerr, respectively. The total retention time error RTerr and the total misclassification rate Cerr are obtained from eq 1 and eqs 2 and 3, respectively. The optimized variable is the binary flag matrix Ck,j. In most cases, the two objective terms RTerr and Cerr will be naturally reduced by a suitable compound classification. A general observation in chromatography is that compounds of similar properties will have similar retention behavior, that is, a similar partition ratio between the mobile and stationary phases. As a constraint, the individual compound can belong to only one class ($\sum_{j=1}^{J} C_{k,j} = 1$). The J and α parameters are given as fixed constant values prior to the optimization computation. A minimum number of compounds per class (α) should be included in the optimization framework to capture a range of compound properties within a model for improved retention time predictability and to retain the flexibility/capability of predicting and classifying new compounds. The minimum number of compounds per class α should be ≥3. To ensure that a suitable α value is selected, it is recommended to vary the α value in the optimization to maximize the retention time predictability. The total number of classes J should also be suitably determined while considering the possible clustering of compounds, the total number of compounds K, and the minimum number of compounds per class α. Solutions to the modeling and mixed-integer optimization problems48 were obtained using Matlab49 and GAMS.50 In the practical implementation, the weighting values w1 and w2 were set equal to find robust models that can minimize both the total retention time error RTerr and the total misclassification rate Cerr. However, these weighting values can be changed depending on the modeling needs. For example, one may want to build a QSRR model having higher predictability only on the specific training samples, without accounting for the prediction of new compounds. In such a case, the weighting value w1 will be greater than w2. Through the above modeling and optimization, the optimized models (L-PLS and PLS-DA) will provide robust prediction of the retention time of compounds. In addition, the proposed models combined with the LSER parameters (using Abraham coefficients) as the column properties can be used to explore suitable columns in the early stages of method development.

3. EXPERIMENT

The effectiveness of the proposed modeling strategy will be demonstrated using two separation examples. The first example is an SFC retention prediction study utilizing 21 training and 4 validation compounds analyzed via gradient elution. The main purpose of this study was to increase the speed and quality of method development and optimization through building a general QSRR model for retention time prediction. A generic set of chromatographic conditions was used in this study (i.e., three columns, three additives, and two modifiers), and the compounds selected covered a wide range of physical−chemical properties. The second example is an RPLC study employing 11 training compounds and 1 validation compound analyzed via gradient elution. The retention times of the validation compound were examined over the method operable design region (MODR, the design space of the chromatographic method). Subsequent to this retention time prediction, the validation compound was synthesized and the retention times were experimentally determined.

3.1. Instrumentation and Chromatographic Conditions for SFC. The SFC study was performed using a Waters Corp. (Milford, MA) Thar supercritical fluid chromatograph equipped with a high-pressure pump, UV detector, autosampler, and backpressure regulator set at 150 bar. The flow rate was 3.5 mL/min, and the column temperature was 40 °C. All experiments were performed via gradient elution. The chromatographic conditions are given in Table 1a. Three columns [2-ethylpyridine (2-EP; Princeton Inc.), silica (Phenomenex Inc.), and Luna HILIC (Phenomenex Inc.)] were selected (R = 3), and two organic modifiers (100% methanol and 97%:3% methanol/water) and three additives [10 mM ammonium acetate, 0.2% (v/v) trifluoroacetic acid (TFA), and 0.2% (v/v) isopropylamine (IPA)] were used (L = 8). With the full factorial design of these factors, the total number of experiments was 24 chromatographic runs (N = 24).

The compounds51 tested in the SFC study are shown in Table 1b. The 21 training compounds (K = 21) were selected and consist of acidic, basic, neutral polar, and neutral nonpolar compounds. The 4 validation compounds were then selected so that the properties were not too similar or too different from the 21 training compounds while considering the available test compounds. Statistical molecular design based on PCA scores,52−54 taking into account all possible compounds (database), is a more suitable approach to selecting validation compounds because it can maximize the total explanatory compound property space. In our case, the available compounds


Table 1. SFC Experimental Conditions (b) Compound List

(a) Separation Conditions column

mobile phase (A) mobile-phase modifier and additive (B)

outlet pressure (bar) column temperature (°C) flow rate (mL/min) detection run time elution mode gradient program time time time time

point point point point

0 1 2 3

2-EP (Princeton Inc.; 4.6 × 250 mm, 5 μm) silica (Phenomenex Inc.; 4.6 × 250 mm, 5 μm) HILIC diol (Phenomenex Inc.; 4.6 × 250 mm, 5 μm) CO2 methanol

training compounds

methanol + NH4OCOCH3 (10 mM) methanol + TFA (0.1%) methanol + IPA (0.2%) methanol/water (97:3) methanol/water (97:3) + NH4OCOCH3 (10 mM) methanol/water (97:3) + TFA (0.1%) methanol/water (97:3) + IPA (0.2%) 150 40 3.5 UV 12.5 min for standards and samples gradient time (min) %A %B 0 0.5 12.5 12.6

95 95 50 95

5 5 50 5

validation compounds

Table 2. RPLC Experimental Conditions

aqueous mobile phase (A) organic mobile phase (B) injection volume (μL) flow rate (mL/min) detection elution mode gradient time point 1 organic (%) gradient time point 1 (min) gradient time point 2 organic (%) gradient time point 2 (min) pH temperature (°C)

compound name

Comp-1 Comp-2 Comp-3 Comp-4 Comp-5 Comp-6 Comp-7 Comp-8 Comp-9 Comp-10 Comp-11 Comp-12 Comp-13 Comp-14 Comp-15 Comp-16 Comp-17 Comp-18 Comp-19 Comp-20 Comp-21 Comp-22 Comp-23 Comp-24 Comp-25

ibuprofen estradiol prednisolone theophylline theobromine fenoprofen caffeine sulfamethoxazole sulfaguanidine sulfadimethoxine adenine prednisone uracil cortisone naproxen thymine ketoprofen hydrocortisone sulfaquinoxaline sulfamethazine hypoxanthine cytosine phenoxybenzamine 1-naphthylacetic acid 4-butylbenzoic acid

category acidic neutral neutral basic basic acidic basic neutral neutral neutral basic neutral basic neutral acidic basic acidic neutral neutral neutral basic basic basic acidic acidic

nonpolar nonpolar

polar polar polar nonpolar nonpolar

nonpolar polar polar

ammonium acetate, pH 6.65). The current study was focused on predicting the retention times for one new analyte under six gradient profile conditions: gradient time point 1 organic (% initial organic solvent), gradient time point 1 (hold time of the initial organic solvent in minutes), gradient time point 2 organic (% final organic solvent), gradient time point 2 (time in minutes to get to the final organic solvent %), pH, and temperature. In the data analysis, in addition to these six factors, three additional fundamental variables were calculated from these six factors, and set parameters (gradient elution parameter32 and the steepness and time difference of the gradient33) were included (L = 9). The DOE consisted of 40 runs (N = 40), while accounting for all of the main effects and two factor interactions. The chemical structures for 11 compounds (K = 11) were listed by Reid et al.,1 and these are structurally similar to each other. The one new compound structure is not shown; however, it is similar in structure to the other compounds. 3.3. Compound Properties. The compound properties used in the above studies are listed in Table 3. A total of physical−chemical properties (M = 16) related to chromatographic retention (e.g., log D, pKa) were used. The physical− chemical properties were obtained from ACDLab software.55

were limited, and therefore selection of the compounds was performed by separation scientists.

3.2. Instrumentation and Chromatographic Conditions for RPLC. The RPLC study was performed on a Waters Acquity UPLC system (Milford, MA) equipped with a binary solvent pump, a sample manager, a photodiode-array UV detector, a single quadrupole detector, and four-position temperature-controlled column managers. The flow rate was 0.4 mL/min, and the experiments were performed via gradient elution. The chromatographic conditions are listed in Table 2.

system column

no.

Waters Acquity UPLC system (Waters Inc.) Waters Acquity UPLC BEH C18 (2.1 × 100 mm, 1.7 μm) 10 mM ammonium acetate acetonitrile 2 0.4 UV at 260 nm and electron-spin mass spectrometry gradient 13−23 (center: 18) 0.5−1.5 (center: 1) 52−62 (center: 57)

4. RESULTS
The results of two examples (SFC and RPLC) are described below.

4.1. SFC Example. 4.1.1. Preliminary Analysis (PCA and PLS; SFC Study). Prior to optimization, preliminary studies with PCA and PLS were used to understand the overall characteristics of the data. The PCA score plot of the compound property matrix X is shown in Figure 5. The scores of the training compounds are well dispersed without a dense

7.5−8.5 (center: 8) 6−7 (center: 6.5) 27−37 (center: 32)

In a previous study,1 the chromatographic method MODR was established, including the column (Waters Acquity UPLC BEH C18), organic solvent (acetonitrile), and buffer/pH (10 mM

dx.doi.org/10.1021/ie303459a | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX

Industrial & Engineering Chemistry Research

Article

a subset of compounds with similar properties were selected for retention time modeling.

4.1.2. Result of Optimal Compound Classification (SFC Study). Some parameters were fixed during optimization on the basis of the preliminary analysis described in section 4.1.1. The total number of classes (J) was set to 3 because compound retention was mainly affected by the columns rather than the mobile-phase conditions. The minimum number of compounds per class (α) was set to 3. Four PLS components for L-PLS and three PLS components for PLS-DA were selected for each class to limit the complexity of the optimization (the numbers of PLS components could be treated as additional optimized variables, but doing so would make the computation more complicated). After optimization, it was confirmed that a slight change in the number of PLS components (e.g., ±1) has no significant effect on the prediction result.

The results of optimal compound classification for the training compounds are shown in Table 4 (the classification result of the validation compounds will be discussed in the next section). The training compounds were classified into three classes (A−C). Each compound belongs to only one class because of the constraint in eq 4. The classification in Table 4 is consistent with the PLS model result (Y = f([SZ])) because the three compound classes approximately match the three clusters obtained from the PLS loading distances among the compounds (Figure 6).

The class A group consists mostly of acidic and neutral compounds, has relatively higher hydrophobicity (e.g., log P > 0), and includes Lewis acids. The class A compounds have a strong affinity for the 2-EP column. The pyridine ligand is a base (proton acceptor), and the (Lewis base) pyridine functional group would carry a net positive charge in the presence of a carbon dioxide and methanol mobile phase.56 Therefore, the 2-EP column is expected to preferentially interact with Lewis acid analytes (increased retention) and repel Lewis base (positively charged) compounds (reduced retention). A similar result was presented by White and Burnett.57 This mechanistic interpretation will be further corroborated using the model information from L-PLS (section 4.1.4).

The class B group contains five basic compounds that are either polar (e.g., log P ≤ 0) or Lewis bases. The class B compounds have a strong affinity for the silica column, similar to the results of Bui et al.58 These basic compounds interact with (acidic) silanol groups, leading to increased adsorption and retention on the silica column.

The class C group consists of three basic or neutral compounds. The properties of the HILIC diol phase are somewhat similar to those of silica (polar although less acidic). The three compounds classified into class C have much lower pKa basicity. For the class C group, additional compounds will be added to further investigate and understand the structure−retention mechanism.

Note: An unknown source of variability can be an interaction between the analyte and residual silanol groups in each stationary phase. All stationary phases in this study are based upon silica. During the bonding of the ligand (if applicable) and end-capping (if applicable), some, but not all, silanol groups are functionalized; the remaining silanol groups are termed “residual”. The availability of these residual silanol groups for chromatographic separation is dependent upon the synthesis

Table 3. Compound Properties Used for the Two Studies (SFC and RPLC)

molecular descriptor             description
pKa-MA                           strength of acidity
pKa-MB                           strength of basicity
log P                            partition coefficient (octanol/water)
MW                               molecular weight
PSA                              molecular polar surface area: defined as the sum of the surfaces of polar atoms in a molecule
FRB                              number of freely rotatable bonds
HDonors                          number of hydrogen-bond donors (H-donors)
HAcceptors                       number of hydrogen-bond acceptors (H-acceptors)
rule of 5                        Lipinski’s rules
molar refractivity (cm^3)        molar refractivity
molar volume (cm^3)              molar volume
parachor (cm^3)                  scientific quantity defined according to the formula P = γ^(1/4)·M/d (γ = surface tension)
index of refraction              number that describes how light, or any other radiation, propagates through that medium
surface tension (dyn/cm)         contractive tendency of the surface of a liquid that allows it to resist an external force
density (g/cm^3)                 mass per unit volume
polarizability (10^−24 cm^3)     ratio of the induced dipole moment of an atom to the electric field that produces this dipole moment

overlap, and the scores of the validation compounds are located at the edge of the score range of the training compounds (i.e., the selected compounds are approximately suitable). Next, a PLS model between the mobile-phase conditions and stationary phases ([SZ]) and the retention time matrix of known compounds Y (Y = f([SZ])), without the compound properties X, was built to investigate the overall variable relationships in the experimental data (i.e., Y and [SZ]). The column property matrix S is represented as a binary flag matrix. This PLS model corresponds to a traditional DOE model used in AQbD. Because the PLS model is built directly on the retention times of specific compounds, it does not include the potential error of the compound properties in the prediction of retention. Therefore, this PLS model will give the best retention time prediction for the specific compounds and provide reliable/consistent information on the relationship between the mobile- and stationary-phase conditions and the retention time of specific compounds.

The loading plot of the PLS model is shown in Figure 6. The average total variance in the retention time of all compounds explained by the PLS model (two PLS components) is 82.6% (R2Y), and therefore this loading plot represents the overall variable relationship. The loadings of all compounds are relatively well dispersed, which means that the selected compounds cover a wide range of compound properties. The stationary phases (2-EP, silica, and HILIC) have a dominant contribution to the retention times of all compounds (demonstrated by the larger loadings), while the mobile-phase conditions (Z1−Z6) have a smaller effect on the retention times (smaller loadings). Analyte retention times differ among the columns and are related to the properties of the compound. For example, sulfamethazine (Comp-20) and hydrocortisone (Comp-18) have a strong relationship with the 2-EP column, and theobromine (Comp-5) and caffeine (Comp-7) have a strong relationship with the silica column. If all of these compounds were included in one model, the predictability of the resultant model would be lower than that if


Figure 5. PCA score plot of compound properties (SFC study).

Figure 6. PLS loading plot of the PLS model (Y = f([SZ])).
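The binary flag representation of the column property matrix S used in the PLS model Y = f([SZ]) can be illustrated with a short sketch. The column names follow the text (2-EP, silica, HILIC); the helper names and the mobile-phase values below are hypothetical and are not taken from the study.

```python
# Minimal sketch of building [S Z] rows with a binary flag (one-hot) matrix S.
# Assumption: three columns as named in the text; Z1..Z6 values are made up.
COLUMNS = ["2-EP", "silica", "HILIC"]


def binary_flags(column_name, columns=COLUMNS):
    """Return the binary flag row of S for one experiment."""
    if column_name not in columns:
        raise ValueError(f"unknown column: {column_name}")
    return [1 if c == column_name else 0 for c in columns]


def design_row(column_name, z_values):
    """Concatenate the column flags S with the mobile-phase settings Z."""
    return binary_flags(column_name) + list(z_values)


# Two hypothetical experiments, each with six mobile-phase variables (Z1..Z6).
experiments = [
    ("2-EP", [5.0, 1.0, 40.0, 6.0, 35.0, 150.0]),
    ("silica", [10.0, 0.5, 50.0, 8.0, 40.0, 120.0]),
]
SZ = [design_row(name, z) for name, z in experiments]
for row in SZ:
    print(row)
```

Because each flag only identifies a column that was present in the training experiments, a representation of this kind cannot describe a column that was never run.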

4.1.3. Result of PLS-DA (SFC Study). The classification results of the training compounds using PLS-DA models are shown in Table 5. The detailed plots of the classification results on class A−C compounds are shown in parts a−c of Figure 7, respectively. The threshold of each class was determined so as to minimize the misclassification rate of both TYPE-I and

conditions and column age/previous analysis conditions and varies by vendor and stationary phase. There are “probe” molecules used to experimentally determine these values; however, these molecules do not solely probe silanols. For the purposes of this paper’s method, residual silanol content was not considered.


Cytosine (class B), 1-naphthylacetic acid (class A), and 4-butylbenzoic acid (class A) were classified as intuitively expected because cytosine is basic and 1-naphthylacetic acid and 4-butylbenzoic acid are acidic. However, phenoxybenzamine was classified as class A irrespective of its basicity, implying that its hydrophobicity (e.g., log P = 3.7) is a more dominant factor in the retention mechanism.

When classifying additional compounds using PLS-DA models, there will be the potential for intermediate/boundary values as described above (or even classification into multiple classes). For example, a new compound may be classified as both class B and C. In such a situation, both models can be used for retention time prediction, followed by experimental confirmation to validate the result. The retention time predictions from both models should provide similar values because of the implied similarity of the compound’s physical−chemical properties to both class B and C compounds.

The PLS-DA models can help to understand which compound properties have an impact on the compound classification (with implications for the retention mechanism as well). The PLS-DA coefficient plots of classes A and B are shown in parts a and b of Figure 8, respectively. The variables were mean-centered and scaled to unit variance; thus, each variable coefficient can be reasonably compared to another. For example, compounds with higher log P and lower pKa-MA values (compounds that are hydrophobic and acidic/neutral in nature) are classified as class A compounds, while compounds with higher pKa-MA values, lower log P, and hydrogen-bond donors (proton acceptors; compounds that are more polar and basic in nature) are classified as class B.

4.1.4. Result of L-PLS (SFC Study). The retention time prediction results for the training and validation compounds using L-PLS with and without compound classification are shown in parts a and b of Figure 9, respectively.
The data set consisted of 504 (21 compounds × 24 separation conditions) retention time training data points and 96 (4 compounds × 24 separation conditions) retention time validation data points. The root-mean-square error (RMSE) for both compound sets is shown in Table 7. Note that the validation compounds have larger prediction errors than the training compounds. This is because the validation compounds have compound properties slightly different from those of the training set (the scores of the validation compounds are located toward the edges of the training set range, as shown in Figure 5). The predictability of the model will be improved with the inclusion of additional compounds in the model space. Overall, L-PLS with compound classification provided better retention time predictability for both training and validation compounds than L-PLS without compound classification (i.e., a single model): the RMSE decreased by a factor of about 2.5 for the training compounds and about 1.8 for the validation compounds (Table 7). Therefore, compound classification contributed to improved compound retention time predictability.

L-PLS with compound classification can provide useful information about which analyte properties have a greater impact on a compound’s retention time. For example, the model coefficients of L-PLS with classification (class A) are shown in Figure 10. On the 2-EP column, major factors in the retention (mechanism) include hydrogen bonding (donor) and higher log P values for class A compounds. Nonpolar compounds (larger log P) adsorb more strongly to the 2-EP column (a less polar column than the silica or diol stationary phase), resulting in longer retention times. The 2-EP ligand is a hydrogen-bond acceptor, so hydrogen-bond donors

Table 4. Result of Compound Classification (SFC Study)

                        compound   category             result of compound classification (A−C)
training compounds      Comp-1     acidic               A
                        Comp-2     neutral nonpolar     A
                        Comp-3     neutral nonpolar     A
                        Comp-4     basic                B
                        Comp-5     basic                B
                        Comp-6     acidic               A
                        Comp-7     basic                B
                        Comp-8     neutral polar        A
                        Comp-9     neutral polar        C
                        Comp-10    neutral polar        A
                        Comp-11    basic                B
                        Comp-12    neutral nonpolar     A
                        Comp-13    basic                C
                        Comp-14    neutral nonpolar     A
                        Comp-15    acidic               A
                        Comp-16    basic                C
                        Comp-17    acidic               A
                        Comp-18    neutral nonpolar     A
                        Comp-19    neutral polar        A
                        Comp-20    neutral polar        A
                        Comp-21    basic                B
validation compounds    Comp-22    basic                B
                        Comp-23    basic                A
                        Comp-24    acidic               A
                        Comp-25    acidic               A
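The two classification constraints described in the text (each compound belongs to exactly one class, and each class contains at least α = 3 compounds) can be checked against the training assignment in Table 4. The sketch below is only an illustrative feasibility check with hypothetical helper names; it does not reproduce the mixed-integer optimization itself.

```python
from collections import Counter

ALPHA = 3  # minimum number of compounds per class, as stated in the text
CLASSES = ("A", "B", "C")

# Training-compound classes transcribed from Table 4 (Comp-1 to Comp-21).
# A dict maps each compound to exactly one class by construction.
assignment = dict(zip(
    [f"Comp-{i}" for i in range(1, 22)],
    "A A A B B A B A C A B A C A A C A A A A B".split(),
))


def class_sizes(assign):
    """Count how many compounds were assigned to each class."""
    counts = Counter(assign.values())
    return {c: counts.get(c, 0) for c in CLASSES}


def is_feasible(assign, alpha=ALPHA):
    """Check that only known classes are used and each has at least alpha members."""
    only_known = all(c in CLASSES for c in assign.values())
    return only_known and all(n >= alpha for n in class_sizes(assign).values())


print(class_sizes(assignment))
print(is_feasible(assignment))
```

The resulting class sizes (13, 5, and 3) agree with the totals reported for the PLS-DA models in Table 5.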

Table 5. Results of the PLS-DA Model for Three Classes (A−C; SFC Study)

           total number of        number of compounds    misclassification
           compounds assigned     misclassified          rate (%)
Class-A    13                     0                      0
Class-B    5                      0                      0
Class-C    3                      0                      0
total      21                     0                      0

TYPE-II errors. All compounds were classified without error, indicating that the optimization (eq 4) worked as intended. The classification of intermediate/boundary compounds into one specific class may be challenging. For example, theophylline was classified as class B (Figure 7a,b). However, the class B PLS-DA value predicted for theophylline (0.58) was closer to the class B threshold (0.51) than that of any other class B compound, and the class A PLS-DA value predicted for theophylline (0.25) was closer to the class A threshold (0.6) than that of any other class B compound. Intuitively, this makes sense because (1) theophylline is more hydrophobic than the other class B compounds (e.g., log P = −0.02) and more closely resembles a “typical” class A compound physicochemically and (2) the loading of theophylline (Figure 6) is located at the edge of class B (in proximity to class A), which indicates that theophylline would have slightly different separation/retention mechanisms compared to other class B compounds.

The class values predicted for the validation compounds using the PLS-DA models are listed in Table 6. Each validation compound was classified into a single class.
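As an illustration of how a class threshold might be chosen to minimize the combined TYPE-I and TYPE-II misclassifications, the sketch below scans candidate thresholds for a single PLS-DA model. It assumes that a TYPE-I error is a class member predicted below the threshold and a TYPE-II error is a nonmember predicted at or above it; the scores are made up, and the scan is a simple stand-in rather than the authors' procedure.

```python
def count_errors(scores, is_member, threshold):
    """Return (type_1, type_2) error counts for one threshold.

    Assumed convention: type 1 = member scored below the threshold,
    type 2 = nonmember scored at or above the threshold.
    """
    type_1 = sum(1 for s, m in zip(scores, is_member) if m and s < threshold)
    type_2 = sum(1 for s, m in zip(scores, is_member) if not m and s >= threshold)
    return type_1, type_2


def best_threshold(scores, is_member):
    """Scan midpoints between sorted scores and keep the one with fewest errors."""
    ordered = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    return min(candidates, key=lambda t: sum(count_errors(scores, is_member, t)))


# Hypothetical predicted class values for one PLS-DA model (not data from the paper).
scores = [0.95, 0.80, 0.62, 0.40, 0.10, -0.05]
is_member = [True, True, True, False, False, False]

threshold = best_threshold(scores, is_member)
print(threshold, count_errors(scores, is_member, threshold))
```

When two candidate thresholds give the same number of errors, this scan keeps the lower one; a different tie-breaking rule could be substituted.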


Figure 7. Results of the PLS-DA (determined by optimization vs predicted) on the SFC study.

Table 6. Result of PLS-DA on Four Validation Compounds (SFC Study)

                          class A PLS-DA model       class B PLS-DA model       class C PLS-DA model
compound                  predicted   over the       predicted   over the       predicted   over the       result
                                      threshold?                 threshold?                 threshold?
threshold of the model    0.600                      0.510                      0.370
cytosine                  −0.110      no             0.808       yes            0.368       no             B
phenoxybenzamine          1.195       yes            0.013       no             −0.169      no             A
1-naphthylacetic acid     0.671       yes            0.230       no             0.076       no             A
4-butylbenzoic acid       0.797       yes            −0.064      no             0.257       no             A
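The threshold comparison summarized in Table 6 can be written out as a short sketch. The thresholds and predicted values are transcribed from Table 6; the function name is hypothetical. Returning a list of classes leaves room for the boundary case in which a compound exceeds more than one threshold.

```python
# Thresholds and predicted class values transcribed from Table 6.
THRESHOLDS = {"A": 0.600, "B": 0.510, "C": 0.370}
PREDICTED = {
    "cytosine": {"A": -0.110, "B": 0.808, "C": 0.368},
    "phenoxybenzamine": {"A": 1.195, "B": 0.013, "C": -0.169},
    "1-naphthylacetic acid": {"A": 0.671, "B": 0.230, "C": 0.076},
    "4-butylbenzoic acid": {"A": 0.797, "B": -0.064, "C": 0.257},
}


def assigned_classes(predicted, thresholds=THRESHOLDS):
    """Return every class whose predicted value exceeds that class's threshold."""
    return [c for c in sorted(thresholds) if predicted[c] > thresholds[c]]


results = {name: assigned_classes(values) for name, values in PREDICTED.items()}
for name, classes in results.items():
    print(name, classes)
```

A compound that returned two classes would be a candidate for prediction with both class models, followed by experimental confirmation.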

Figure 8. PLS-DA model coefficient plots of Class-A (a) and Class-B (b) (SFC study).
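The coefficient plots are comparable across variables because each compound property is mean-centered and scaled to unit variance before modeling. A minimal sketch of that scaling is given below; it assumes the sample standard deviation (n − 1 denominator), and the log P values are hypothetical.

```python
from statistics import mean, stdev


def autoscale(values):
    """Mean-center a variable and scale it to unit (sample) variance."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    if s == 0:
        raise ValueError("a constant variable cannot be scaled to unit variance")
    return [(v - m) / s for v in values]


# Hypothetical log P values for a handful of compounds (not data from the paper).
log_p = [3.7, -0.02, 1.5, 0.8, 2.4]
scaled = autoscale(log_p)
print([round(v, 3) for v in scaled])
```

After this transformation, a descriptor measured in large units (e.g., molecular weight) no longer dominates one measured in small units (e.g., log P) simply because of its numerical range.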

for each column used in this study were obtained from Poole.13 These values were subsequently used in the L-PLS model with compound classification to investigate if the model predictability can be improved and if the model can be applied to the prediction of compound retention times on new columns. The results of L-PLS with compound classification using either the binary flag matrix or LSER parameters are shown in Table 8. Note that the L-PLS with compound classification using the LSER parameters provided similar or better results in terms of R2Y (%) and Q2Y (%) of the retention time and the explained variance of X space (%), compared to using the binary flag matrix. Therefore, L-PLS with compound classification in

are likely to interact in an attractive manner, leading to increased retention. Consequently, the L-PLS model with compound classification can help elucidate the separation mechanism.

4.1.5. L-PLS with Compound Classification Combining LSER Parameters (SFC Study). In the analysis above, a binary flag matrix representing the column used in each experiment was employed. However, this approach does not allow the prediction of retention times on new columns. The five LSER parameters [(1) electron lone pairs, (2) dipole type, (3) hydrogen-bonding basicity, (4) hydrogen-bonding acidity, and (5) cavity formation and dispersion interactions]


Figure 9. Prediction result of the retention time on (a) 21 training compounds and (b) 4 validation compounds (SFC study).

Table 7. Retention Time Prediction Result (with/without Compound Classification: SFC Model)

data                                                         method                                      RMSE
training on 21 compounds with 24 separation conditions       L-PLS without classification (ONE model)    0.5697
(i.e., 504 retention time points)                            L-PLS with classification                   0.2321
validation on 4 compounds (on a new compound) with 24        L-PLS without classification (ONE model)    1.4107
separation conditions (i.e., 96 retention time points)       L-PLS with classification                   0.7955

Table 8. Result of L-PLS with Compound Classification Using the Binary Flag Matrix versus Abraham Coefficients as Column Properties

                                                                          Class-A    Class-B    Class-C
case 1: using the binary flag matrix as the column property   R2X (%)    64.5       66.3       71.6
                                                              R2Y (%)    89.2       89.3       98.4
                                                              Q2Y (%)    86.7       82.5       96.9
case 2: using the Abraham coefficients as column properties   R2X (%)    67.3       69.5       72.6
                                                              R2Y (%)    88.7       89.2       98.4
                                                              Q2Y (%)    86.9       84.2       96.7
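For orientation, the five LSER parameters used as column properties correspond to the system constants of the Abraham solvation parameter model. A commonly used form of that model is sketched below; it is given here as background under the assumption that the column coefficients follow this convention, and the symbols are not defined in this section of the paper.

```latex
\log k = c + eE + sS + aA + bB + vV
```

Here k is the retention factor; E, S, A, B, and V are solute descriptors (excess molar refraction, dipolarity/polarizability, hydrogen-bond acidity, hydrogen-bond basicity, and McGowan volume), and the system constants e, s, a, b, and v describe the column and mobile phase, matching in order the five interaction types listed in the text: electron lone pairs, dipole type, hydrogen-bonding basicity, hydrogen-bonding acidity, and cavity formation and dispersion interactions.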

interest prior to performing chromatographic experiments, thereby enhancing the speed of method development and performance of the final method. 4.2. RPLC Example. 4.2.1. Preliminary PLS Analysis (RPLC Study). The PLS model between mobile-phase conditions Z and the retention time matrix of known compounds Y is built to investigate the overall variable relationship of experimental data. The loading plot of the PLS model Y = f(Z) is shown in

addition to the LSER parameters may provide good compound retention time prediction on new columns that have similar column characteristics [e.g., polyamide gel bonded silica, (aminopropyl)siloxane-bonded silica13]. The result of this study will be reported in a future communication. Ultimately, it is anticipated that L-PLS with compound classification combined with LSER parameters will be useful for the prediction of columns that will separate compounds of

Figure 10. Model coefficient of L-PLS with classification (Class-A, SFC study).


Figure 11. The average total retention time variance explained for the 11 compounds is 96.0% (R2Y). The PLS model has only two components; therefore, the loading plot represents an overall variable relationship. Note that the loadings of the compounds lie in a narrow region because all of the compounds are structurally very similar and the experiments were performed on a single column over narrow changes in the chromatographic conditions (the conditions represent the MODR, the area where acceptable chromatographic performance is obtained). This type of situation may arise in late-stage pharmaceutical development, where a new impurity is observed or hypothesized. L-PLS with compound classification will be investigated for improvement in compound retention time predictability.

4.2.2. Result of Optimal Compound Classification (RPLC Study). Some optimization parameters were set as follows. Analogous to the SFC study, the total number of classes (J) was set at 3, and the minimum number of compounds per class (α) was also set at 3. Three PLS components for L-PLS and two PLS components for PLS-DA were selected for each class during optimization. The results of the training compound classification are shown in Table 9. Compounds with similar PLS loading vectors were classified into the same classes (Figure 11). Note that the PLS loading vectors for all compounds are similar because the compounds are structurally similar.

4.2.3. Result of PLS-DA (RPLC Study). The result of the PLS-DA compound classification for the training compounds is shown in Table 10. Parts a−c of Figure 12 show plots of the classification results for class A−C compounds, respectively. All compounds except Comp-3 were classified without error. Comp-3 was misclassified as class B (Figure 12b; TYPE-II error) while also being appropriately classified into

Table 9. Result of Optimal Compound Classification (RPLC Study)

compound        result (class)
Comp-1          A
Comp-2          A
Comp-3          A
Comp-4 (API)    B
Comp-5          B
Comp-6          B
Comp-7          B
Comp-8          B
Comp-9          C
Comp-10         C
Comp-11         C
NewComp         A

Table 10. Result of PLS-DA (RPLC Study)

           total number of        number of compounds    misclassification
           compounds assigned     misclassified          rate (%)
Class-A    3                      0                      0
Class-B    5                      1                      20
Class-C    3                      0                      0
total      11                     1                      9.1
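The misclassification rates in Table 10 follow directly from the counts. The short sketch below recomputes them, assuming (as the Class-B and total rows suggest) that the rate is the number of misclassified compounds divided by the number assigned; the names are hypothetical.

```python
# (assigned, misclassified) counts transcribed from Table 10.
TABLE_10 = {"Class-A": (3, 0), "Class-B": (5, 1), "Class-C": (3, 0)}


def rate_percent(assigned, misclassified):
    """Misclassification rate in percent, rounded to one decimal place."""
    return round(100.0 * misclassified / assigned, 1)


rates = {name: rate_percent(*counts) for name, counts in TABLE_10.items()}
total_assigned = sum(assigned for assigned, _ in TABLE_10.values())
total_misclassified = sum(missed for _, missed in TABLE_10.values())
total_rate = rate_percent(total_assigned, total_misclassified)

print(rates)
print(total_assigned, total_misclassified, total_rate)
```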

class A (Figure 12a). Comp-3 is an intermediate/boundary compound between classes A and B, as demonstrated in Figure 11. The PLS-DA models used in eq 4 may not necessarily provide a perfect classification (i.e., zero misclassification rate) because the objective of optimization is to minimize the total misclassification rate of compounds as well as the total prediction error of retention time. As long as suitable compound properties are selected and the chromatographic experiments are carefully controlled, optimization will find the “best” compound classification (the one that minimizes the total misclassification rate and total retention time prediction error).

Figure 11. PLS loading plots of the PLS model [Y = f(Z)] (RPLC study).


Figure 12. Result of PLS-DA on three classes (RPLC study).

Figure 13. Prediction results of retention time of 11 training compounds using L-PLS with/without compound classification on (a) all compounds, (b) Class-A, (c) Class-B, and (d) Class-C (RPLC study).

4.2.4. Result of L-PLS (RPLC Study). The overall retention time prediction results for the entire training set (440 data points: 11 compounds × 40 separation conditions) using L-PLS with and without compound classification are shown in

Figure 13a. The retention time prediction results of the training set plotted into class A (120 data points: 3 compounds × 40 separation conditions), class B (200 data points), and class C (120 data points) subcategory models are shown in parts


b−d of Figure 13, respectively. The RMSE on the training compounds is shown in Table 11. Overall, L-PLS with compound classification provided significantly better retention time predictability for the training compounds (>10X for RMSE) compared to L-PLS without compound classification (single model). Therefore, compound classification enhanced the compound retention time predictability. As shown in Figure 13, the L-PLS model built with all compounds without compound classification resulted in both a larger prediction error and predicted retention time biases for class A (higher retention predicted than actual) and class B (lower retention predicted than actual). Class B predicted

retention was not biased. This is consistent with Figure 11 (loading plot of the PLS model), which demonstrates that each compound class (and each compound) has different characteristics in terms of the relationship between mobile-phase conditions and retention time. 4.2.5. Validation Result of the New Compound (RPLC Study). Validation experiments using a new compound with 6 separation conditions (a subset of the experimental conditions run in the DOE) were performed to investigate the predictability of the proposed QSRR models. The results of the predicted class value of the new compound based on the above three PLS-DA models are listed in Table 12. The new compound was solely classified into class A, and therefore the class A L-PLS model was used to predict retention times (Figure 14). The RMSE on the new (validation) compound is shown in Table 11. The predicted retention time result for this compound has a slight bias versus the experimental result (Figure 14), and the L-PLS model without compound classification had a larger validation error (Table 11). The bias may be due to low diversity (minimal variability in the chemical properties, lack of inclusion of experimental variability associated with performing experiments on different days, with different equipment, columns, systems, mobile-phase preparations, analyst, etc.) of the three compounds included in the class A L-PLS model. To improve the model predictability in the future, additional compounds will be added and the model retrained.

Table 11. Result of L-PLS with/without Compound Classification on 11 Training Compounds with 40 Separation Conditions (i.e., 440 Retention Time Points) and 1 New Compound with 6 Separation Conditions (i.e., 6 Retention Time Points)

data                                                         method                                      RMSE
training on 11 compounds with 40 separation conditions       L-PLS without classification (ONE model)    0.4534
(i.e., 440 retention time points)                            L-PLS with classification                   0.0389
validation (on a new compound) with 6 separation             L-PLS without classification (ONE model)    0.4543
conditions                                                   L-PLS with classification                   0.2704
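The RMSE values in Table 11 and the improvement attributed to compound classification can be related with a small sketch. The `rmse` function shows the standard definition (assumed to be the one used here), and the ratios are computed from the tabulated values; the function and variable names are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted retention times."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# RMSE values transcribed from Table 11 (RPLC study).
TABLE_11 = {
    "training": {"without": 0.4534, "with": 0.0389},
    "validation": {"without": 0.4543, "with": 0.2704},
}


def improvement(entry):
    """Factor by which the RMSE drops when compound classification is used."""
    return entry["without"] / entry["with"]


for name, entry in TABLE_11.items():
    print(name, round(improvement(entry), 2))
```

The training ratio exceeds 10, in line with the text, whereas the improvement for the single validation compound is considerably smaller.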

Table 12. Result of PLS-DA on a New Compound (RPLC Study)

                          class A PLS-DA model       class B PLS-DA model       class C PLS-DA model
compound                  predicted   over the       predicted   over the       predicted   over the       result
                                      threshold?                 threshold?                 threshold?
threshold of the model    0.50                       0.63                       0.78
NewComp                   0.61        yes            0.38        no             −0.22       no             A

Figure 14. Prediction result of retention time of 1 new validation compound (with 6 separation conditions) using L-PLS with compound classification (RPLC study).


(4) Bhatt, D. A.; Rane, S. I. QbD approach to analytical RP-HPLC method development and its validation. Int. J. Pharm. Pharm. Sci. 2011, 3 (1), 179−187.
(5) Kaliszan, R. QSRR: Quantitative Structure (Chromatographic) Retention Relationship. Chem. Rev. 2007, 107, 3212−3246.
(6) Nord, L. I.; Fransson, D.; Jacobsson, S. P. Prediction of liquid chromatographic retention times of steroids by three-dimensional structure descriptors and partial least squares modeling. Chemom. Intell. Lab. Syst. 1998, 44, 257−269.
(7) Hancock, T.; Put, R.; Coomans, D.; Heyden, Y. V.; Everingham, Y. A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies. Chemom. Intell. Lab. Syst. 2005, 76, 185−196.
(8) Bolanca, T.; Cerjan-Stefanovic, S.; Lusa, M.; Regelja, H.; Loncaric, S. Development of gradient elution retention model in ion chromatography by using radial basis function artificial neural networks. Chemom. Intell. Lab. Syst. 2007, 86, 95−101.
(9) Abraham, M. H.; Chadha, H. S.; Whiting, G. S.; Mitchell, R. C. An analysis of water−octanol and water−alkane partitioning and the Δlog P parameter of Seiler. J. Pharm. Sci. 1994, 83, 1085−1100.
(10) Abraham, M. H.; Ibrahim, A.; Zissimos, A. M. J. Determination of sets of solute descriptors from chromatographic measurements. J. Chromatogr. A 2004, 1037 (1−2), 29−47.
(11) West, C.; Lesellier, E. A unified classification of stationary phases for packed column supercritical fluid chromatography. J. Chromatogr. A 2008, 1191, 21−39.
(12) Fields, P. R.; Sun, Y.; Stalcup, A. M. Application of a modified linear-solvation energy relationship (LSER) model to retention on a butylimidazolium-based column for high performance liquid chromatography. J. Chromatogr. A 2011, 1218, 467−475.
(13) Poole, C. F. Review: Stationary phases for packed-column supercritical fluid chromatography. J. Chromatogr. A 2012, 1250, 157−171.
(14) West, C.; Lesellier, E. Chemometric methods to classify stationary phases for achiral packed column supercritical fluid chromatography. J. Chemom. 2012, 26, 52−65.
(15) Lesellier, E. Review: Retention mechanisms in super/subcritical fluid chromatography on packed columns. J. Chromatogr. A 2009, 1216, 1881−1890.
(16) Kaliszan, R. Quantitative Structure−Retention Relationships applied to reversed-phase high-performance liquid chromatography. J. Chromatogr. A 1993, 656, 417−435.
(17) David, V.; Medvedovici, A. Structure−retention correlation in liquid chromatography for pharmaceutical applications. J. Liq. Chromatogr. Relat. Technol. 2007, 30 (5−7), 761−789.
(18) Nikitas, P.; Pappa-Louisi, A. Review: Retention models for isocratic and gradient elution in reversed-phase liquid chromatography. J. Chromatogr. A 2009, 1216, 1737−1755.
(19) Kaliszan, R.; Baczek, T. QSAR in chromatography: Quantitative Structure−Retention Relationships (QSRRs). Recent Advances in QSAR Studies; Springer Science: Berlin, 2010; Chapter 8, pp 223−259.
(20) Morgado, J. E.; Muteki, K.; Reid, G. L.; Fortin, D. T.; Harwood, J. F.; Wang, J.; Xue, G. Comparative Investigation of the Predictive Capability of an L-PLS Model versus ACD Labs Commercial Software using a Typical Pharmaceutical Compound. CoSMoS Annual Conference, 2012.
(21) Lippa, K. A.; Sander, L. C.; Wise, S. A. Chemometric studies of polycyclic aromatic hydrocarbon shape selectivity in reversed-phase liquid chromatography. Anal. Bioanal. Chem. 2004, 378 (2), 365−77.
(22) Dimov, N.; Stoev, S. A new approach to structure−retention relationships. Acta Chromatogr. 1999, 9, 17−18.
(23) Vitha, M.; Carr, P. W. Review: The chemical interpretation and practice of linear solvation energy relationships in chromatography. J. Chromatogr. A 2006, 1126, 143−194.
(24) Ośmiałowski, K.; Halkiewicz, J.; Kaliszan, R. Quantum chemical parameters in correlation analysis of gas−liquid chromatographic retention indices of amines. J. Chromatogr. 1986, 361, 63−69.

However, overall and as a concept, the L-PLS with compound classification approach provided good predictability of the new compound retention time. In numerous instances, the retention time of a compound needs to be predicted for an established method or screening conditions (e.g., predicted degradation product, new process-related impurity, authentic standard not available). An advantage of L-PLS with compound classification is that it may not require (many or any) additional chromatographic experiments to predict the retention of new compounds because the relationship between chromatographic conditions and physical−chemical properties of known compounds has been established experimentally. L-PLS with compound classification can also be used for optimization purposes to maximize the chromatographic peak resolution for a set of compounds (including the new compounds). Thus, QSRR can contribute to enhancing the speed and quality of chromatographic method development.

5. CONCLUSION

A novel compound-classification-based QSRR modeling strategy that simultaneously accounts for mobile-phase conditions, stationary phases, and compound properties was demonstrated using SFC and RPLC separation examples. This QSRR model (L-PLS with compound classification) significantly improved the retention time predictability of compounds compared with traditional QSRR models and L-PLS models without compound classification. The proposed method was also applied to enhance the understanding of chromatographic retention mechanisms. The approach can be used to optimize separation conditions to maximize peak resolution, including for new compounds. Combined with LSER parameters as the column properties, the proposed QSRR model can be used to explore additional column space. These are key advantages of L-PLS with compound classification. This study is foundational, and work will continue toward a fully computational approach for identifying initial target columns and chromatographic conditions “by simulation” for compound sample sets (from initial method development through method optimization) for use throughout the life cycle of a project.



AUTHOR INFORMATION

Corresponding Author

*Tel.: (860) 686-2748. Fax: (860) 441-5423. E-mail: koji.muteki@pfizer.com.

Notes

The authors declare no competing financial interest.


ACKNOWLEDGMENTS

The authors thank Loren Wrisley and Pfizer Inc. for support of this work.

REFERENCES

(1) Reid, G. L.; Cheng, G.; Fortin, D. T.; Harwood, J. F.; Morgado, J. E.; Wang, J.; Xue, G. Reversed-Phase Liquid Chromatographic Method Development in an Analytical Quality by Design Framework. J. Liq. Chromatogr. Relat. Technol. 2013, 10.1080/10826076.2013.765457.
(2) Graul, T. W.; Barnett, K. L.; Bale, S. J.; Gill, I.; Hanna-Brown, M., Ende, D. J., Eds. Chemical Engineers in the Pharmaceutical Industry: R&D to Manufacturing; John Wiley & Sons: New York, 2011; p 545.
(3) Vogt, F. G.; Kord, A. S. Development of quality-by-design analytical methods. J. Pharm. Sci. 2011, 100, 797−812.

dx.doi.org/10.1021/ie303459a | Ind. Eng. Chem. Res. XXXX, XXX, XXX−XXX


(25) Karelson, M.; Lobanov, V. S.; Katritzky, A. R. Quantum-chemical descriptors in QSAR/QSPR studies. Chem. Rev. 1996, 96, 1027−1043.
(26) Lucic, B.; Trinajstic, N.; Sild, S.; Karelson, M.; Katritzky, A. R. A New Efficient Approach for Variable Selection Based on Multiregression: Prediction of Gas Chromatographic Retention Times and Response Factors. J. Chem. Inf. Comput. Sci. 1999, 39, 610−621.
(27) Gramatica, P.; Navas, N.; Todeschini, R. 3D-modelling and prediction by WHIM descriptors. Part 9. Chromatographic relative retention time and physico-chemical properties of polychlorinated biphenyls (PCBs). Chemom. Intell. Lab. Syst. 1998, 40, 53−63.
(28) Molnar, I. Computerized design of separation strategies by reversed-phase liquid chromatography: development of DryLab software. J. Chromatogr. A 2002, 965, 175−194.
(29) Monks, K.; Molnar, L.; Rieger, H. J.; Bogati, B.; Szabo, E. Quality by design: Multidimensional exploration of the design space in high performance liquid chromatography method development for better robustness before validation. J. Chromatogr. A 2012, 1232, 218−230.
(30) Wang, C.; Skibic, M. J.; Higgs, R. E.; Watson, I. A.; Bui, H.; Wang, J.; Cintron, J. M. Evaluating the performances of quantitative structure−retention relationship models with different sets of molecular descriptors and databases for high-performance liquid chromatography predictions. J. Chromatogr. A 2009, 1216, 5030−5038.
(31) Sýkora, D.; Tesarová, E.; Armstrong, D. W. Practical Considerations of the Influence of Organic Modifiers on the Ionization of Analytes and Buffers in Reversed-Phase LC. LCGC North Am. 2002, 20 (10), 974−981.
(32) Drylab Chromatography Reference Guide (manual); LCResources Inc.: Walnut Creek, CA, 2000.
(33) Snyder, L. R.; Dolan, J. W.; Gant, J. R. Gradient elution in high-performance liquid chromatography. I. Theoretical basis for reversed-phase system. J. Chromatogr. 1979, 165, 3−30.
(34) West, C.; Fougere, L.; Lesellier, E. Combined supercritical fluid chromatographic tests to improve the classification of numerous stationary phases used in reversed-phase liquid chromatography. J. Chromatogr. A 2008, 1189, 227−244.
(35) Bidlingmeyer, B.; Chan, C. C.; Fastino, P.; Henny, R.; Koerner, P.; Maule, A. T.; Marques, R. C.; Neue, U.; Pappa, H.; Sander, L.; Santasania, C.; Snyder, L. HPLC column classification. Pharmacopeial Forum; The United States Pharmacopeial Convention Inc.: Rockville, MD, 2005; Vol. 31, Issue 2 (Mar−Apr).
(36) Snyder, L. R.; Maule, A.; Heebsch, A.; Cuellar, R.; Paulson, S.; Carrano, J.; Wrisley, L.; Chan, C. C.; Pearson, N.; Dolan, J. W.; Gilroy, J. Choosing an equivalent replacement column for a reversed-phase liquid chromatographic assay procedure. J. Chromatogr. A 2004, 1057, 49−57.
(37) Muteki, K.; MacGregor, J. F. Multi-block PLS modeling for L-shape data structures with applications to mixture modeling. Chemom. Intell. Lab. Syst. 2007, 85, 186−194.
(38) García-Muñoz, S.; Polizzi, M. WSPLS: A new approach towards mixture modeling and accelerated product development. Chemom. Intell. Lab. Syst. 2012, 114, 116−121.
(39) Muteki, K.; MacGregor, J. F.; Ueda, T. Rapid Development of New Polymer Blends: The Optimal Selection of Materials and Blend Ratios. Ind. Eng. Chem. Res. 2006, 45 (13), 4653−4660.
(40) Muteki, K.; MacGregor, J. F. Optimal purchasing of raw materials: a data-driven approach. AIChE J. 2008, 54 (6), 1554−1559.
(41) Muteki, K.; MacGregor, J. F.; Ueda, T. Mixture designs and models for the simultaneous selection of ingredients and their ratios. Chemom. Intell. Lab. Syst. 2007, 86, 17−25.
(42) Martens, H.; Anderssen, E.; Flatberg, A.; Gidskehaug, L. H.; Hoy, M.; Westad, F.; Thybo, A.; Martens, M. Regression of a data matrix on descriptors of both its rows and of its columns via latent variables: L-PLSR. Comput. Stat. Data Anal. 2005, 48 (1), 103−123.
(43) Höskuldsson, A. PLS regression methods. J. Chemom. 1988, 2, 211−228.

(44) Burnham, A. J.; MacGregor, J. F.; Viveros, R. Frameworks for latent variable regression. J. Chemom. 1996, 10, 31−45.
(45) Martens, H.; Tormod, N. Multivariate calibration; John Wiley & Sons: New York, 1991.
(46) Baker, M.; Rayens, W. Partial least squares for discrimination. J. Chemom. 2003, 17, 166−173.
(47) Geladi, P.; Grahn, H. Multivariate Image Analysis; John Wiley and Sons: Chichester, U.K., 1996; pp 34−44.
(48) Bentley, J.; Kawajiri, Y. Prediction−Correction Method for Optimization of Simulated Moving Bed Chromatography. AIChE J. 2012, DOI: 10.1002/aic.13856.
(49) Matlab manual; Mathworks Inc.: Natick, MA, 2007.
(50) GAMS: The Solver Manuals; GAMS Development Corp.: Washington, DC, 1993.
(51) Brunelli, C.; Zhao, Y.; Brown, M.; Sandra, P. Pharmaceutical analysis by supercritical fluid chromatography: optimization of the mobile phase composition on a 2-ethylpyridine column. J. Sep. Sci. 2008, 31 (8), 1299−1306.
(52) Eriksson, L.; Johansson, E.; Muller, M.; Wold, S. On the selection of the training set in environmental QSAR analysis when compounds are clustered. J. Chemom. 2000, 14, 599−616.
(53) Eriksson, L.; Johansson, E.; Kettaneh-Wold, N.; Wold, S. Multi- and Megavariate Data Analysis: Principles and Applications; Umetrics Academy: Umeå, 2001.
(54) Linusson, A.; Wold, S.; Norden, B. Statistical molecular design of peptoid libraries. Mol. Divers. 1999, 4, 103−104.
(55) ACDLab manual; ACDLabs Inc.: Toronto, Ontario, Canada, 2008.
(56) Jessop, P. G.; Heldebrant, D. J.; Li, X.; Eckert, C. A.; Liotta, C. L. Nature 2005, 436, 1102.
(57) White, C.; Burnett, J. Integration of supercritical fluid chromatography into drug discovery as a routine support tool II. Investigation and evaluation of supercritical fluid chromatography for achiral batch purification. J. Chromatogr. A 2005, 1074, 175−185.
(58) Bui, H.; Masquelin, T.; Perum, T.; Castle, T.; Dage, J.; Kuo, M. S. Investigation of retention behavior of drug molecules in supercritical fluid chromatography using linear solvation energy relationships. J. Chromatogr. A 2008, 1206, 186−195.
