Prediction of Optimal Salinities for Surfactant Formulations Using a QSPR Approach


Article

Prediction of optimal salinities for surfactant formulations using a QSPR approach Christophe Muller, Ana G Maldonado, Alexandre Varnek, and Benoit Creton Energy Fuels, Just Accepted Manuscript • DOI: 10.1021/acs.energyfuels.5b00825 • Publication Date (Web): 15 Jun 2015 Downloaded from http://pubs.acs.org on June 19, 2015


Energy & Fuels is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Energy & Fuels

Prediction of optimal salinities for surfactant formulations using a QSPR approach.

Christophe Muller,1 Ana G. Maldonado,2,† Alexandre Varnek,3 Benoit Creton1,*

1 IFP Energies nouvelles, 1 et 4 avenue de Bois-Préau, 92852 Rueil-Malmaison, France.
2 Solvay - Laboratory of the Future, 178 avenue du Dr Schweitzer, 33600 Pessac, France.
3 Laboratory of Chemoinformatics, UMR 7140 CNRS/UniStra, 1 rue Blaise Pascal, 67000 Strasbourg, France.

* To whom the correspondence should be addressed. E-mail: [email protected]

ABSTRACT. Each oil reservoir can be characterized by a set of parameters such as temperature, pressure, oil composition, and brine salinity. In the context of chemical Enhanced Oil Recovery (EOR), the selection of high-performance surfactants is a challenging and time-consuming task because it depends strongly on reservoir conditions. The situation becomes even more complicated if the surfactant formulation is a blend of two or more surfactants. In the present work, we report Quantitative Structure-Property Relationships (QSPR) correlating surfactant structures and their composition in a mixture with the optimal salinity (Sopt), corresponding to the minimal interfacial tension in the reference brine/surfactants/n-dodecane
system, at T = 313 K and P = 0.1 MPa. Particular attention was paid to selected families of surfactants: α-Olefin Sulfonate (AOS), Internal Olefin Sulfonate (IOS), Alkyl Ether Sulfate (AES) and Alkyl Glyceryl Ether Sulfonate (AGES). The models were built and validated on a database containing Sopt values for 75 surfactant formulations. Molecular structures of the amphiphilic molecules were encoded by Functional Group Count Descriptors (FGCD), ISIDA Substructural Molecular Fragment (SMF) descriptors, and Codessa Molecular Descriptors (CMD). For mixtures, descriptors were calculated as linear combinations of the descriptors of individual compounds weighted by their mass fractions in the mixture. Three machine-learning methods, Support Vector Machine (SVM), Partial Least Squares (PLS) regression and Random Subspace (RS), were used for the modeling. Both global (on the entire database) and local (on individual families) models were built. The models display reasonable accuracy (about 0.2 logSopt units), which is comparable with the experimental error of the measured Sopt. Our results show that the suggested approach can be successfully used to build predictive models for relatively small datasets of mixtures of chemical compounds.

KEYWORDS. Optimal salinity, surfactant mixtures, QSPR, support vector machine, partial least squares, random subspace.

INTRODUCTION

The actual capacity of oil extraction can be roughly estimated at 30-60% of the original reservoir content. Three distinct phases can be identified during oil extraction: (i) primary recovery, which results from the natural pressure in the reservoir; (ii) secondary recovery, which consists of injecting water or gas into the reservoir to move the oil to the wellbore, and after which more than half of the oil is often left in place; and (iii) tertiary recovery, also called Enhanced Oil Recovery (EOR),1 which includes various techniques such as the
chemical EOR.2 Chemical EOR consists of injecting combinations of Surfactant (S), Surfactant/Polymer (SP), or Alkaline/Surfactant/Polymer (ASP) to move the trapped oil to the wellbore. In these combinations, alkalis are used to reduce surfactant adsorption, surfactants to reduce the oil/water Interfacial Tension (IFT), and polymers to improve the sweep efficiency. Surfactants are amphiphilic molecules composed of a lipophilic group (Tail, labelled T) and a hydrophilic group (Head, labelled H), which makes them soluble in both oil and water. The high performance of surfactants in chemical EOR is explained by a strong reduction of the oil/water IFT, which allows the residual oil to move. Surfactants belonging to the families of α-Olefin Sulfonate (AOS), Internal Olefin Sulfonate (IOS), Alkyl Ether Sulfate (AES) and Alkyl Glyceryl Ether Sulfonate (AGES) have already been identified as promising candidates for chemical EOR.3,4,5 In nearly all applications involving surfactants, mixtures rather than pure compounds are considered, which can be attributed both to the difficulty of preparing pure single-species surfactants and to synergism.6,7 Selecting a surfactant formulation, a blend of two or more surfactants, for EOR applications is a challenging and time-consuming task, considering that each potentially eligible reservoir exhibits different conditions such as oil composition, brine salinity, and temperature. Computational approaches may significantly reduce the cost of selecting optimal mixtures of surfactants. Recent studies have shown that techniques grouped under the acronym QSPR (Quantitative Structure-Property Relationship) are efficient tools for the rapid estimation of surfactant properties.8,9 In chemical EOR, one of the key properties of surfactants is the optimal salinity (Sopt), the salinity corresponding to a minimum of IFT in an oil/brine/surfactant system. More precisely, Sopt is reached when equal amounts of oil and water are solubilized in the middle-phase microemulsion.10,11 While this property is important for the selection of surfactant formulations in chemical EOR, only a few predictive models of Sopt are reported in the literature. Barnes et al. proposed a predictive model for the optimal salinity for a series of IOS
surfactants: Sopt = -57*DPSA/MW + 81, where Sopt is expressed as the percentage of salt in the system, DPSA is the differential charged surface area, and MW is the molecular weight of the surfactant.12 The determination coefficient was reported only for data fitting (R² = 0.95); no external validation was performed, so no conclusion can be drawn about the predictive performance of this model on new data. Recently, Moreau et al.13 used functional group count descriptors to predict the optimal salinity (S*, the natural logarithm of Sopt expressed in g/100 mL) for four families of surfactants grouped into three sets: AOS/IOS, AES/AGES, and AOS/IOS/AES/AGES. Reasonable performances were obtained, but the training and test sets, selected with the help of Principal Component Analysis (PCA), are structurally very similar, which leads to an overestimate of the predictive performance of the model. In this study, we report the development of QSPR models for S* of AOS, IOS, AES and AGES surfactants using the database of Moreau et al.13 Three machine-learning methods, i.e. Support Vector Machine (SVM), Partial Least Squares (PLS) regression and Random Subspace (RS), were coupled with three types of descriptors: Functional Group Count Descriptors (FGCD), ISIDA Substructural Molecular Fragments (SMF) and CODESSA Molecular Descriptors (CMD). The paper is organized as follows: we first describe the experimental data, the descriptors, and the machine-learning methods used; we then analyze the QSPR models developed for S*; the last section gives conclusions and perspectives.

MATERIALS AND METHODS

Experimental data. Four families of surfactants were considered in this study: α-Olefin Sulfonate (AOS) (alkene, hydroxyalkane and vinylidene species), Internal Olefin Sulfonate (IOS) (alkene, hydroxyalkane and vinylidene species), Alkyl Ether Sulfate (AES) and Alkyl Glyceryl Ether Sulfonate (AGES). In total, 75 surfactant mixtures were extracted from industrial products, and the associated S* values
are presented in Table 1. The numbers of carbon atoms in the alkyl chains of the surfactants constituting these mixtures range from 14 to 30 for the AOS and IOS families, and from 8 to 17 for the AES and AGES families. For confidentiality reasons, the chemical structures and the mixture compositions associated with the experimental S* values are not given in this article. Thirty-six olefin sulfonate samples, based on 13 alpha-olefin and 9 internal olefin raw materials, were produced using a falling film sulfonation pilot unit.13 Fourteen additional samples were synthesized from some of these raw materials using significantly modified sulfonation parameters. Twenty-four AES were obtained using pilot alkoxylation and subsequent sulfonation, or were purchased. The starting alcohols, issued from different synthesis procedures, exhibited different degrees or natures of branching. The average numbers of propylene oxide (PO) and ethylene oxide (EO) units per alkyl chain vary from 4 to 14 and from 0 to 4, respectively. In the 15 AGES surfactants obtained from episulfonation of alkoxylated alcohols, the numbers of PO and EO units vary in the same ranges as for AES. Note that each of the synthetic procedures mentioned above leads to a mixture rather than to a single compound. For instance, sulfonation of an olefin feedstock using falling film sulfonation results in the formation of a variety of species (hydroxyalkane sulfonate, alkene sulfonate, and disulfonate species). Thus, the AOS/IOS family is represented by 36 mixtures involving 54 individual compounds, each mixture being composed of roughly 15 to 20 compounds. The AES/AGES family consists of 39 mixtures involving 98 different individual compounds, and a mixture is generally composed of at most 6 individual compounds. Generic molecular structures for AOS, IOS, AES and AGES species are given in Figure 1. All Sopt values were measured for the same oil (n-dodecane), brine (water + NaCl), temperature (T = 313 K) and pressure (P = 0.1 MPa); see reference 13 for details. The quality of the Sopt measurements was ensured by a robotic platform14 providing very small statistical noise
in the data (relative error = 0.4 g/100 mL). QSPR models were developed for S* = ln Sopt, where the optimal salinity is expressed in g/100 mL. This choice follows from Salager's law, according to which the logarithm of the optimal salinity varies linearly with various characteristics of brine/surfactant/oil systems.15 Optimal salinity values vary according to some intuitive structure-property rules.2 For instance, S* decreases with the size of the surfactant tail and increases with the number of hydrophilic groups. Analysis of the experimental data (see Table 1) showed that four mixtures (#53, #69, #70, #71) did not follow these rules; they were therefore discarded from the database. Similarly, mixture #24, a structural and activity outlier, was also discarded. Thus, the dataset used for the modeling contains 70 mixtures (see Table 1). This dataset contains two groups of structurally similar compounds: the first one combining the AOS and IOS families (35 mixtures) and the second one combining the AES and AGES families (35 mixtures). S* values vary in the ranges 1.39 - 2.81 and 0.26 - 2.85 for the AOS/IOS and AES/AGES subsets, respectively (see Figure 2). The data analysis shows that the optimal salinity decreases when the number of PO groups increases, when the number of EO groups decreases, and when the length of the surfactant chain increases. QSPR models were built both on the AOS/IOS and AES/AGES subsets (local models), and on the entire set of 70 mixtures (global models).
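The logarithmic transform used as the modeling target can be illustrated with a short sketch. The function names are hypothetical and only the S* bounds quoted in the text are used as inputs; nothing here reproduces the authors' own tooling.

```python
import math

def to_s_star(s_opt):
    """Return S* = ln(Sopt) for an optimal salinity given in g/100 mL."""
    if s_opt <= 0:
        raise ValueError("optimal salinity must be positive")
    return math.log(s_opt)

def from_s_star(s_star):
    """Invert the transform: Sopt = exp(S*), in g/100 mL."""
    return math.exp(s_star)

# S* bounds quoted in the text for the two subsets, converted back to g/100 mL.
for label, low, high in [("AOS/IOS", 1.39, 2.81), ("AES/AGES", 0.26, 2.85)]:
    print(label, round(from_s_star(low), 2), round(from_s_star(high), 2))
```

Because the target is a natural logarithm, an additive error on S* corresponds to a multiplicative factor on Sopt: an error of 0.2 in S* is a factor of exp(0.2), roughly 1.22.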

Descriptors for mixtures. Three types of descriptors were considered in this study. The first type, Functional Group Count Descriptors (FGCD), consists of counts of selected atoms and/or molecular fragments identified as relevant from chemical intuition. Such a simple representation of compounds has been shown to provide relevant descriptors for QSPR modeling.16,17,18,19 The descriptors used in this study (X1 to X38) are given in Table 2. FGCD X19 to X38, named "Me2Me" (methyl-to-methyl spacing), were used to encode information on both the amount and the location of branching in the hydrocarbon skeleton of the surfactant tail. For instance, Me2Me-1 counts
fragments in which two methyl groups are separated by one atom. A Simplified Molecular Input Line Entry Specification (SMILES) string was assigned to each individual compound in the surfactant mixtures. FGCD were computed using the SMARTS (SMILES Arbitrary Target Specification) matching functionalities of Open Babel,20 and the SMARTS codes used are given in Table 2. The second type of descriptors comprises the ISIDA Substructural Molecular Fragments (SMF) generated using the ISIDA Fragmentor program.21,22,23 SMF combine two types of molecular subgraphs: sequences (I) and augmented atoms (II). Sequences are defined by a succession of atoms and bonds (AB), atoms only (A), or bonds only (B) connecting two atoms in the molecular graph. Only the shortest path connecting two atoms is used. For the atom pairs (AP) subtype, only the terminal atoms and the topological distance between them are explicitly annotated. For each type of sequence, fragments of length ranging from nmin = 2 atoms to nmax = 15 atoms were generated. An extended augmented atom represents a given atom with its neighborhood, encoded as atoms and bonds (AB), atoms only (A), bonds only (B), or atom pairs (AP). More explicitly, the neighborhood corresponds to the concatenation of sequences of length nmin = 2 to nmax = 10 atoms starting from the selected atom. An example of SMF descriptors is given in Figure 3. The third type of descriptors, Codessa Molecular Descriptors (CMD), gathers a set of features that encode structural and chemical information. For some descriptors, the required inputs are the three-dimensional (3D) structures of individual compounds. Since all surfactants considered in this study (Figure 1) are very flexible molecules, a conformational search was performed using the Hyperchem-7 program.24 First, a gas-phase molecular dynamics simulation was carried out, followed by the selection of several low-energy structures. Each of them was then optimized using the MM+ force field25 and the Polak-Ribière geometry optimization algorithm. Finally, the lowest-energy conformer was selected for the calculation of 3D descriptors. The Codessa PRO software26 was then used to compute 167 CMD
belonging to four descriptor classes: (i) constitutional (molecule composition, etc.), (ii) topological (connectivity, paths, shape, etc.), (iii) geometrical (inertia moment, surface, volume, 3D shadows, etc.), and (iv) electrostatic (charge distribution, electronegativity, polarity, etc.).27 Descriptor values for mixtures were calculated as linear combinations of the corresponding descriptors of individual compounds weighted by the associated mass fractions, xi. For instance, for a given descriptor X1, the mixture descriptor X1mix is defined as follows:

$$X1_{\mathrm{mix}} = \sum_{i=1}^{N} x_i \, X1_i \qquad (1)$$

where N is the number of compounds in the mixture. Equation (1) has already been used to compute mixture descriptors in the modeling of properties such as flash points16 and normal boiling points28 using FGCD, SMF, and CMD. The amount of synthesis residues in the surfactant mixtures was accounted for through a descriptor labeled "Free Oil". The multidimensional space formed by the descriptors can be used to define the applicability domain (AD) of QSPR models. Among the numerous existing approaches,29 we defined the AD by the Euclidean distance between a test object (a mixture of surfactants) and the training set objects in the multidimensional descriptor space.
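The mass-fraction weighting of Equation (1) can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name, the data layout, and the numerical values are assumptions made for the example.

```python
def mixture_descriptors(components):
    """Mass-fraction-weighted sum of component descriptor vectors (Equation 1).

    components: list of (mass_fraction, descriptor_vector) pairs, one per
    individual compound in the mixture.
    """
    if not components:
        raise ValueError("a mixture needs at least one component")
    n_desc = len(components[0][1])
    mix = [0.0] * n_desc
    for x_i, descriptors in components:
        if len(descriptors) != n_desc:
            raise ValueError("all components need the same descriptor set")
        for k, value in enumerate(descriptors):
            mix[k] += x_i * value  # contribution x_i * X_k,i
    return mix

# Hypothetical two-component mixture with three made-up descriptor values
# (for example a carbon count, a sulfonate count and a branching count).
example = [(0.7, [16.0, 1.0, 0.0]), (0.3, [18.0, 1.0, 2.0])]
print(mixture_descriptors(example))
```

The same function applies unchanged to any descriptor type (FGCD, SMF or CMD), since each is reduced to a numeric vector per compound before weighting.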

Machine-Learning Methods. Three machine-learning methods were used in QSPR modeling: Partial Least Squares (PLS), Random Subspace (RS), and Support Vector Machine (SVM). Partial Least Squares (PLS)30 is a well-known multivariate data analysis technique used to build multi-linear models. PLS introduces new variables, known as latent variables, constructed as orthogonal linear combinations of the descriptors, resulting in a new space of lower dimension. The Weka31 software was used to derive PLS models. The number of latent variables was optimized in 10-fold cross-validation by minimizing the Root Mean Square Error (RMSE). Random Subspace,32 as implemented in Weka, is a method which randomly chooses a subset of descriptors from the original set and constructs a model with a selected regression algorithm.
This step is repeated N times (N is a user-defined parameter), resulting in an ensemble of individual models. Finally, for each mixture a consensus prediction was obtained by averaging the values predicted by the individual models. RS is considered to be efficient when the number of descriptors is larger than the number of samples in the dataset. The number of randomly chosen descriptors was optimized in 10-fold cross-validation to give the smallest RMSE. The number of models computed for the consensus was set empirically according to the number of descriptors in the initial set. The Support Vector Machine33 implementation from the LIBSVM package34 was used to generate ε-SVM regression models with a linear kernel. This method finds a linear function in the high-dimensional feature space induced by the kernel. The method parameters were optimized in 10-fold cross-validation to obtain the best RMSE for the models built on each pool of descriptors: ε ranged from 0.05 to 0.20 in increments of 0.05, and the cost was tested for values from 0.5 to 10 in increments of 0.5.
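The SVM parameter search described above amounts to an exhaustive scan of a small grid. The sketch below enumerates that grid, assuming both end points of each range are included; `svm_grid`, `best_setting` and the scoring function are hypothetical names, and the scoring function is a placeholder standing in for a real cross-validated RMSE.

```python
from itertools import product

def svm_grid():
    """All (epsilon, cost) pairs for the ranges quoted in the text."""
    # Values are built from integers to avoid floating-point drift.
    epsilons = [round(0.05 * i, 2) for i in range(1, 5)]  # 0.05 ... 0.20
    costs = [0.5 * i for i in range(1, 21)]               # 0.5 ... 10.0
    return list(product(epsilons, costs))

def best_setting(grid, cv_rmse):
    """Return the (epsilon, cost) pair with the smallest cross-validated RMSE."""
    return min(grid, key=lambda params: cv_rmse(*params))

def placeholder_rmse(epsilon, cost):
    """Stand-in for a 10-fold cross-validated RMSE (illustration only)."""
    return abs(epsilon - 0.10) + 0.01 * abs(cost - 2.0)

grid = svm_grid()
print(len(grid), best_setting(grid, placeholder_rmse))
```

With 4 values of ε and 20 values of the cost, each pool of descriptors requires 80 cross-validated fits per outer fold, which remains cheap for a dataset of 70 mixtures.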

Statistical validation of models. External validation of a model is required to ensure its ability to predict properties of "new" compounds (within the applicability domain of the model), i.e. those which were not used in model building.35 A popular version is n-fold cross-validation (n-CV), in which the entire data set is split into n approximately equal portions. On each fold, (n-1) portions form the modeling set on which the model is built, and the model is then validated on the remaining portion (test set). This procedure is repeated n times, choosing at each new fold another portion of the data as the test set. In this way, each object (compound or mixture) in the entire data set is predicted. In this work, the n-CV procedure was applied twice (see the workflow in Figure 4). The general performance of the models was assessed in 5-fold cross-validation. In addition, in order to optimize the parameters of the methods (see the previous section), a 10-CV repeated 3 times was applied to the modeling set on each fold. Thus, on each fold of 5-CV, the set of parameters
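minimizing the RMSE in 10-CV on the modeling set (4/5 of the data) was used to build a "final" model validated on a test set (1/5 of the data). The performance of the model was estimated by the Root Mean Square Error (RMSE), as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_{\mathrm{exp},i} - y_{\mathrm{pred},i} \right)^{2}} \qquad (2)$$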

where N is the number of data points in the ensemble of all test sets, i.e., the overall size of the parent dataset; yexp is the experimental value of S* and ypred is the predicted value of S*. Note that cross-validation of models for mixtures is more complicated than for individual compounds. According to Muratov et al.,36 one should consider different strategies: "point out", "compound out" and "mixture out". Here we applied only "point out" (i.e., classical n-fold cross-validation) because of the small size of the data sets and the large number of mixture components used in the modeling.
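The two-level validation scheme (outer 5-CV for performance, inner 10-CV repeated 3 times for parameter selection) and the pooled RMSE of Equation (2) can be sketched as follows. This is an illustrative reading of the workflow, not the authors' implementation: all function names are hypothetical, the inner score is assumed to be the mean of the fold RMSEs, and the learner and data are placeholders.

```python
import math
import random

def rmse(y_exp, y_pred):
    """Root mean square error over paired experimental and predicted values."""
    return math.sqrt(sum((e - p) ** 2 for e, p in zip(y_exp, y_pred)) / len(y_exp))

def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle positions 0..n_items-1 and split them into n_folds portions."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]

def nested_cv(y, fit_predict, candidates, outer=5, inner=10, repeats=3):
    """Outer CV for performance; inner repeated CV picks the parameter.

    fit_predict(train_idx, test_idx, param) must return predictions for test_idx.
    """
    y_exp, y_pred = [], []
    for test in k_fold_indices(len(y), outer):
        held_out = set(test)
        modeling = [i for i in range(len(y)) if i not in held_out]

        def inner_score(param):
            scores = []
            for r in range(repeats):
                for fold in k_fold_indices(len(modeling), inner, seed=r + 1):
                    val = [modeling[j] for j in fold]
                    val_set = set(val)
                    train = [i for i in modeling if i not in val_set]
                    preds = fit_predict(train, val, param)
                    scores.append(rmse([y[i] for i in val], preds))
            return sum(scores) / len(scores)

        best = min(candidates, key=inner_score)
        y_exp.extend(y[i] for i in test)
        y_pred.extend(fit_predict(modeling, test, best))
    # Pooled over all outer test sets, so N equals the size of the dataset.
    return rmse(y_exp, y_pred)

# Synthetic S*-like values, for illustration only (not data from the paper).
y_demo = [0.5 + 0.03 * i for i in range(70)]

def shifted_mean_model(train_idx, test_idx, shift):
    """Placeholder learner: predicts the training mean plus a tunable shift."""
    mean = sum(y_demo[i] for i in train_idx) / len(train_idx)
    return [mean + shift for _ in test_idx]

print(round(nested_cv(y_demo, shifted_mean_model, [-0.5, 0.0, 0.5]), 3))
```

Keeping the parameter search strictly inside the modeling set means the outer test fold never influences the chosen parameters, which is what makes the outer RMSE an external estimate.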

RESULTS AND DISCUSSION

In this section, we report the QSPR models used to predict S* of AOS/IOS/AES/AGES surfactants. Three machine-learning methods (PLS, RS, SVM) coupled with three types of descriptors (FGCD, SMF, CMD) lead to nine types of models: PLS-FGCD, PLS-SMF, PLS-CMD, RS-FGCD, RS-SMF, RS-CMD, SVM-FGCD, SVM-SMF, SVM-CMD. Global models were built on the entire set of 70 surfactant mixtures, whereas local models were built on its two subsets containing structurally similar compounds: AOS/IOS surfactants (35 mixtures) and AES/AGES surfactants (35 mixtures). All models were built and validated according to the workflow presented in Figure 4. Hereafter, we discuss the predictive performances of local and then global QSPR models for S*; statistical parameters are reported in Table 3 and Table 4 for internal and external validation, respectively.

AOS/IOS modeling

Mean RMSE values computed in 10-fold internal cross-validation (Table 3) range from 0.161 to 0.191, 0.185 to 0.201, and 0.195 to 0.213 for SMF, FGCD and CMD based models, respectively. The performances of RS-based models are similar for the three types of descriptors investigated. For PLS- and SVM-based models, SMF descriptors outperform FGCD and CMD. This is not surprising because SMF provides a more flexible solution than the other descriptor types. Indeed, for a given dataset one can generate only one set of FGCD or CMD descriptors, but several hundred sets of SMF descriptors corresponding to different sizes, topologies and levels of explicitness in encoding atoms and bonds. Thus, for a particular dataset and machine-learning method, the program may select the most appropriate fragment type. The best local models for AOS/IOS surfactants involve augmented atom pairs of length ranging from 8 to 9 (IIAP(8-9)) in the case of the RS method, and augmented atom pairs of length 10 (IIAP(10-10)) for PLS and SVM. Note that atom pair fragments encode only sequences of carbon atoms, or of carbon atoms and one sulfur atom. These descriptors make it possible to distinguish between the two families AOS and IOS, and between their subtypes (i.e. hydroxyalkane sulfonate, alkene sulfonate, and disulfonate species). The augmented atom can account for molecular branching and in this way determines, for example, whether the sulfonate group is attached at the end of the olefin (AOS) or not (IOS). Mean RMSE values computed in 5-fold external cross-validation (Table 4) lie between 0.159 and 0.226. This is consistent with the RMSE ranges observed for the internal cross-validation procedure. In line with the conclusions drawn from internal validation, the PLS-SMF (RMSE = 0.159) and SVM-SMF (RMSE = 0.164) models exhibit the best performances. It is interesting to note that for mixture #14 (experimental S* = 1.39), S* = 1.70 and 1.83 were predicted using the SVM-SMF and PLS-SMF models, respectively. According to the Euclidean distance in the space of SMF descriptors, mixture #13 (experimental S* = 1.70) is the most similar to mixture #14, which
explains the values predicted by the models. Mixture #24 (experimental S* = -0.10), discarded from the database, lies as expected outside the applicability domains of the SVM-SMF and PLS-SMF models, which predict S* = 1.18 and 1.38, respectively.

AES/AGES modeling

Mean RMSE values computed in 10-fold internal cross-validation (Table 3) range from 0.186 to 0.237, 0.251 to 0.338, and 0.292 to 0.347 for SMF, FGCD and CMD based models, respectively. As a function of the machine-learning method, model performances vary as RS < PLS ≈ SVM. The best model is SVM-SMF (RMSE = 0.186), which uses as descriptors augmented atom pairs of length ranging from 3 to 8 atoms (IIAP(3-8)); note that 3 represents the upper limit to encode an ethylene oxide group (see Figure 1). Mean RMSE values computed in 5-fold external cross-validation (Table 4) range from 0.180 for the SVM-SMF model to 0.333 for the SVM-CMD model. This range of RMSE values is in accordance with that observed in the internal cross-validation procedure. The performances of models built for AES/AGES are worse than those of models built for AOS/IOS mixtures. Predictions of the SVM-SMF model follow the intuitive structure-property rules (see the experimental data section): for mixtures #54, #55 and #56, which differ only in their number of propylene oxide groups (6, 8, and 10, respectively), the predicted S* values are 1.67, 1.39 and 1.16. Moreover, for mixture #53, which is similar to mixtures #54, #55 and #56 and has 4 propylene oxide groups, the SVM-SMF model predicts S* = 1.89, in accordance with the intuitive structure-property rules. Similarly, for mixtures #69 and #70, which have 7 and 10 propylene oxide groups, respectively, the SVM-SMF model returns S* = 3.20 and 2.70, in line with the intuitive structure-property rules. For mixtures #69 and #71, which differ in their number of ethylene oxide groups (0 and 4), the SVM-SMF model predicts an increase of S* with the number of ethylene oxide groups: S* = 3.20 for mixture #69 and S* = 3.58 for mixture #71. For mixture #38 (experimental S* = 2.27), the SVM-SMF model returns S* = 1.69. This large prediction error can be explained by the fact that the Euclidean
distance between mixture #38 and its nearest neighbor is much larger than the other inter-mixture distances.
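The nearest-neighbor argument used here for mixture #38 follows from the distance-based applicability domain defined in the methods. A minimal sketch is given below; the function names are hypothetical, the descriptor vectors are made up, and the cut-off is an assumed user parameter, since the text does not state a threshold value.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptor vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_training_distance(query, training):
    """Distance from a query mixture to its closest training-set mixture."""
    return min(euclidean(query, t) for t in training)

def in_domain(query, training, threshold):
    """Assumed rule: inside the AD if the nearest neighbor is close enough."""
    return nearest_training_distance(query, training) <= threshold

# Made-up two-dimensional descriptor vectors, for illustration only.
training_set = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(nearest_training_distance([3.0, 4.0], training_set))
print(in_domain([0.5, 0.5], training_set, threshold=1.0))
```

In practice the descriptors would usually be scaled before computing distances, so that no single descriptor dominates; whether and how this was done is not specified in the text.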

AOS/IOS/AES/AGES modeling

Mean RMSE values computed in 10-fold internal cross-validation (Table 3) range from 0.197 to 0.216, 0.228 to 0.261, and 0.273 to 0.291 for SMF, FGCD and CMD models, respectively. SMF descriptors outperform FGCD and CMD. For SMF, the performance of the machine-learning methods varies as RS < PLS ≈ SVM. The best models are PLS-SMF (RMSE = 0.197) and SVM-SMF (RMSE = 0.199). PLS-SMF models were built on sequences of atoms and bonds containing from 9 to 13 atoms (IAB(9-13)), whereas SVM-SMF models were built on sequences of atoms and bonds containing exactly 13 atoms (IAB(13-13)). Mean RMSE values computed in 5-fold external cross-validation (Table 4) range from 0.215 (for the PLS-SMF and SVM-SMF models) to 0.328 (for the PLS-CMD model). This range of RMSE values roughly agrees with that observed in the internal cross-validation procedure. For mixture #38 (experimental S* = 2.27), SVM-SMF predicts S* = 1.69 (AES/AGES local model) or 1.72 (global model). According to Euclidean distances, this mixture is very dissimilar to the other modeling set objects. Figure 5 presents a scatter plot of S* predicted using the SVM-SMF model vs. experimental S*, for mixtures belonging to the test sets and for mixtures discarded from the database. For mixture #24 (experimental S* = -0.10), the SVM-SMF and PLS-SMF models predict S* = 1.05 and 1.69, respectively. The two predicted values differ from each other, and both differ from the experimental value. This is explained by the fact that mixture #24 is an activity and structural outlier. For the four other mixtures discarded from the dataset, similar conclusions based on the intuitive structure-property rules can be drawn.

Figure 6 shows the percentage of mixtures whose predicted S* values fall in successive 10% bins of deviation from experiment. It can be seen that 64% of the mixtures in the dataset have predicted S* values within 10% of the reported experimental values, 91% are predicted within 20% of the experimental values, and 98% are predicted within 30% of the experimental values. To conclude, global models perform similarly to local models. In particular, this means that the mixtures well predicted with global models are also well predicted with local models; this is also true for poorly predicted mixtures.
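The cumulative percentages quoted for Figure 6 correspond to counting predictions whose relative deviation from experiment stays below a tolerance. A minimal sketch follows, assuming the deviation is taken relative to the experimental S*; the function name and the four data pairs are hypothetical and are not values from the paper.

```python
def fraction_within(y_exp, y_pred, tolerance):
    """Share of predictions whose relative deviation from experiment is <= tolerance."""
    hits = sum(
        1 for e, p in zip(y_exp, y_pred) if abs(p - e) <= tolerance * abs(e)
    )
    return hits / len(y_exp)

# Hypothetical experimental and predicted S* values with deviations of
# roughly 5%, 15%, 25% and 35%.
experimental = [1.0, 2.0, 2.0, 1.0]
predicted = [1.05, 2.3, 2.5, 1.35]
for tol in (0.10, 0.20, 0.30):
    print(tol, fraction_within(experimental, predicted, tol))
```

A relative criterion becomes unstable when the experimental S* is close to zero, which can happen because S* is a logarithm; an absolute tolerance on S* would avoid that issue.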

CONCLUSIONS AND PERSPECTIVES

In this paper, we report predictive QSPR models built on a data set containing S* values for AOS, IOS, AES, and AGES surfactant mixtures in reference conditions (n-dodecane, brine (water + NaCl), T = 313 K and P = 0.1 MPa). Three types of descriptors (FGCD, SMF and CMD) coupled with three machine-learning methods (SVM, PLS and RS) were used. Two types of models were obtained: global models built on the entire data set and local models built on subsets containing families of structurally similar surfactants (AOS/IOS and AES/AGES). The performance of the models has been assessed in external 5-fold cross-validation. The prediction error of the best models is about 0.2 log Sopt units, which is comparable with the experimental error in determining Sopt. Models based on SMF descriptors outperform those using CMD and FGCD descriptors, whatever machine-learning method is used. SVM and PLS models perform similarly and always better than RS models. It has been demonstrated that global models built on four families of surfactants lead to predictions similar to those of the local AOS/IOS and AES/AGES models. However, the applicability domain of global models is larger than that of local models and, therefore, global models are recommended as a prediction tool for a wide variety of surfactants. Compared to the global model based on a genetic algorithm approach coupled with FGCD (GFA-FGCD) reported by Moreau et al.,13 we have found similar RMSE values. It should, however, be noted that the Moreau et al. models were validated on a single test set containing specifically selected compounds, whereas we used a much more statistically robust external cross-validation procedure in which each molecule from the parent set was predicted. This work is a solid step toward the in silico determination of the optimal salinity of promising candidate surfactants providing low to ultra-low interfacial tension. Further progress in this area depends on extending the surfactant families considered and on accounting for temperature and brine composition as variables in QSPR models for the optimal salinity of surfactants.

Corresponding Author

* To whom correspondence should be addressed. E-mail: [email protected]

Present Addresses

† Present address: TNO Technical Sciences / Industrial Innovation, PO Box 6235, 5600 HE, Eindhoven, The Netherlands.

ACKNOWLEDGMENT

The authors gratefully thank Drs. Benjamin Herzhaft, Patrick Moreau, Mikel Morvan, Fanny Destremaut-Oukhemanou, Gilles Marcou, and Yannick Peysson for fruitful discussions.

ACRONYMS AND ABBREVIATIONS

3D, Three Dimensional; A, Atoms; AB, Atoms and Bonds; AP, Atom Pairs; AD, Applicability Domain; AOS, α-Olefin Sulfonate; ASP, Alkaline/Surfactant/Polymer; B, Bonds; CMD, Codessa Molecular Descriptors; CV, Cross-Validation; DPSA, Differential charged surface area; EO, Ethylene Oxide; EOR, Enhanced Oil Recovery; FGCD, Functional Group Count Descriptors; GFA, Genetic Function Approximation; H, Head; IFT, Interfacial Tension; IOS, Internal Olefin Sulfonate; MW, Molecular Weight; PC, Principal Components; PCA, Principal Component Analysis; PLS, Partial Least Squares; PO, Propylene Oxide; QSPR, Quantitative Structure-Property Relationship; R2, coefficient of determination; RMSE, Root Mean Square Error; RS, Random Subspace; S, Surfactant; SMARTS, SMILES Arbitrary Target Specification; SMF, Substructural Molecular Fragment; SMILES, Simplified Molecular Input Line Entry Specification; Sopt, Optimal Salinity; S*, natural logarithm of Sopt, with Sopt in g/100 mL; SP, Surfactant/Polymer; SVM, Support Vector Machines; T, Tail.

Figure 1. Generic molecular structures of AOS (A) and IOS (B) alkene species, and of AES (C) and AGES (D), used in this study. R1 and R2 are alkyl chains; n and m vary from 0 to 14. All compounds in this study are based on one of the proposed scaffolds.

Figure 2. Distribution of S* values in AOS/IOS and AES/AGES subsets.

[Figure 2 plot: histogram of Frequency (y-axis) versus S* bins (x-axis); series: AOS/IOS and AES/AGES.]

Figure 3. Example of SMF fragments with a length of 3 atoms. Sequences and extended augmented atoms are generated for the blue paths, starting from the carbon atom highlighted in blue.

Figure 4. The modeling workflow used in this work. An individual QSPR model is built using a selected machine-learning method (PLS, RS or SVM) and descriptor type (FGCD, SMF or CMD) on the training set of each fold, in both the internal (10-CV) and external (5-CV) cross-validation procedures, and is then validated on the test set of that fold.
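The fold-wise build-and-validate loop of this workflow can be sketched in a few lines. The example below is an assumption-laden illustration, not the authors' implementation: a one-descriptor least-squares line stands in for PLS, RS or SVM, the data are hypothetical, the errors are pooled over all folds (averaging per-fold RMSE values is an equally plausible reading of "mean RMSE"), and all function names are invented for the sketch.

```python
import math
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and split them into k disjoint test folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b (placeholder for PLS, RS or SVM)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, mean_y - slope * mean_x


def cross_validated_rmse(xs, ys, k=5, seed=0):
    """Predict every item once, from a model trained without it, then pool the errors."""
    predictions = [None] * len(xs)
    for test_fold in k_fold_indices(len(xs), k, seed):
        held_out = set(test_fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in test_fold:
            predictions[i] = slope * xs[i] + intercept
    squared = [(p - y) ** 2 for p, y in zip(predictions, ys)]
    return math.sqrt(sum(squared) / len(squared)), predictions


# Hypothetical one-descriptor data set, for illustration only.
xs = [float(i) for i in range(20)]
ys = [0.1 * x + 1.0 for x in xs]
rmse, preds = cross_validated_rmse(xs, ys, k=5)
print(f"pooled RMSE over 5 folds: {rmse:.3g}")
```

The point of the design is that each item is predicted exactly once by a model that never saw it during training, which is what makes the resulting RMSE an estimate of predictive rather than fitting performance.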

Figure 5. Scatter plot of S* predicted using the global SVM-SMF (RMSE = 0.216) models vs. experimental S*. Database denotes the 70 mixtures, each mixture belonging to one of the five test sets.

[Figure 5 plot: Pred. S* (y-axis) versus Exp. S* (x-axis), both axes from 0 to 3; series: Database and Discarded mixtures; labeled points: #24, #53, #69, #70 and #71.]

Figure 6. Histogram of frequencies versus 10% S* error bins; errors were calculated using S* predicted with the global SVM-SMF models.

[Figure 6 plot: Frequency (%) (y-axis, 0 to 70) versus Error on S* (%) (x-axis, bins [0 to 10], ]10 to 20], ]20 to 30], ]30 to 40], ]40 to 50], ]50 to 60] and ]60 to 70]).]

Table 1. Experimental S* values for AOS, IOS, AES and AGES surfactants used in our database.

Surfactant      S*       Surfactant      S*
AOS/IOS
mixture 1       2.54     mixture 19      2.30
mixture 2       2.74     mixture 20      2.67
mixture 3       2.81     mixture 21      2.53
mixture 4       2.60     mixture 22      2.08
mixture 5       2.35     mixture 23      1.87
mixture 6       2.14     mixture 24 a    -0.10
mixture 7       1.72     mixture 25      1.87
mixture 8       1.54     mixture 26      1.87
mixture 9       1.79     mixture 27      1.87
mixture 10      1.70     mixture 28      1.70
mixture 11      2.20     mixture 29      1.72
mixture 12      2.35     mixture 30      1.70
mixture 13      1.70     mixture 31      1.65
mixture 14      1.39     mixture 32      1.81
mixture 15      1.74     mixture 33      1.90
mixture 16      2.71     mixture 34      1.84
mixture 17      2.40     mixture 35      1.79
mixture 18      2.35     mixture 36      1.87
AES
mixture 37      2.38     mixture 49      1.65
mixture 38      2.27     mixture 50      2.10
mixture 39      2.26     mixture 51      2.54
mixture 40      2.47     mixture 52      2.23
mixture 41      2.35     mixture 53 a    1.39
mixture 42      2.43     mixture 54      1.65
mixture 43      0.45     mixture 55      1.33
mixture 44      0.26     mixture 56      1.03
mixture 45      1.76     mixture 57      1.79
mixture 46      1.31     mixture 58      1.43
mixture 47      1.03     mixture 59      1.65
mixture 48      2.10     mixture 60      1.57
AGES
mixture 61      2.54     mixture 69 a    2.83
mixture 62      2.33     mixture 70 a    2.80
mixture 63      2.85     mixture 71 a    2.65
mixture 64      2.59     mixture 72      2.13
mixture 65      2.63     mixture 73      1.79
mixture 66      2.75     mixture 74      1.63
mixture 67      2.21     mixture 75      1.39
mixture 68      2.78

a Discarded mixtures.

Table 2. List of the functional group count descriptors used to characterize surfactants, and their associated symbols. SMARTS codes are also provided.

Group              Symbol   SMARTS code
H                  X1       [H]
C                  X2       [C]
O                  X3       [O]
-SO4               X4       [OX2H0]-[SX4H0](=[O])(=[O])[OX1H0]
-SO3               X5       [!O]-[SX4H0](=[O])(=[O])[OX1H0]
-CH3               X6       [CX4H3]
-CH2-              X7       [CX4H2]
>CH-               X8       [CX4H1]
>C<                X9       [CX4H0]
>C=C<              X10      [CX3H0]=[CX3H0]
>C=CH-             X11      [CX3H0]=[CX3H1]
-HC=CH-            X12      [CX3H1]=[CX3H1]
-H2C-O-CH2-        X13      [CX4H2]-[OX2H0]-[CX4H2]
>HC-O-CH2-         X14      [CX4H1]-[OX2H0]-[CX4H2]
>HC-O-CH<          X15      [CX4H1]-[OX2H0]-[CX4H1]
>C-OH              X16      [CX4H1]-[OX2H1]
nbPO               X17      [OX2H0]-[CX4H2]-[CX4H1](-[CX4H3])
nbEO               X18      [OX2H0]-[CX4H2]-[CX4H2]-[OX2H0]
Me2Me-1            X19      [CX4H3]-*-[CX4H3]
Me2Me-2            X20      [CX4H3]-*~*-[CX4H3]
Me2Me-3            X21      [CX4H3]-*~*~*-[CX4H3]
Me2Me-4            X22      [CX4H3]-*~*~*~*-[CX4H3]
Me2Me-5            X23      [CX4H3]-*~*~*~*~*-[CX4H3]
Me2Me-6            X24      [CX4H3]-*~*~*~*~*~*-[CX4H3]
Me2Me-7            X25      [CX4H3]-*~*~*~*~*~*~*-[CX4H3]
Me2Me-8            X26      [CX4H3]-*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-9            X27      [CX4H3]-*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-10           X28      [CX4H3]-*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-11           X29      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-12           X30      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-13           X31      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-14           X32      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-15           X33      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-16           X34      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-17           X35      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-18           X36      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-19           X37      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Me2Me-20           X38      [CX4H3]-*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*-[CX4H3]
Molecular weight   MW       a

a No SMARTS code has been used to calculate the molecular weight.
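The twenty Me2Me-n entries of Table 2 appear to follow a single pattern: two methyl groups separated by n atoms of any type, the first written as `*` and each further one as `~*`. A minimal sketch that regenerates these strings is given below; the function name `me2me_smarts` and the dictionary `me2me_table` are illustrative assumptions, and the code only builds the pattern strings without performing any substructure matching.

```python
METHYL = "[CX4H3]"


def me2me_smarts(n_between):
    """SMARTS for two methyl groups separated by n_between atoms (Me2Me-n)."""
    if n_between < 1:
        raise ValueError("n_between must be at least 1")
    # The first intervening atom is written '*'; each further one adds '~*'
    # (any bond followed by any atom), as in the Table 2 entries.
    chain = "*" + "~*" * (n_between - 1)
    return f"{METHYL}-{chain}-{METHYL}"


# Symbols X19..X38 correspond to Me2Me-1..Me2Me-20 in Table 2.
me2me_table = {f"X{18 + n}": me2me_smarts(n) for n in range(1, 21)}

for symbol in ("X19", "X20", "X21"):
    print(symbol, me2me_table[symbol])
```

Generating the patterns programmatically avoids transcription errors in the longest entries, where counting the wildcard atoms by eye is unreliable.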

Table 3. Internal performance (RMSE) of selected local and global models for S* as a function of machine-learning methods and descriptors. Local models 1 were built on the set of AOS/IOS surfactants (35 mixtures), local models 2 on the set of AES/AGES surfactants (35 mixtures), and global models on the entire parent set (70 mixtures).

              PLS                          RS                           SVM
Descriptors   Local 1  Local 2  Global     Local 1  Local 2  Global     Local 1  Local 2  Global
SMF           0.167    0.205    0.197      0.191    0.237    0.216      0.161    0.186    0.199
FGCD          0.195    0.268    0.228      0.201    0.338    0.261      0.185    0.251    0.229
CMD           0.213    0.293    0.291      0.195    0.347    0.273      0.196    0.292    0.284

Table 4. External performance (RMSE) of selected local and global models for S* as a function of machine-learning methods and descriptors. Local models 1 were built on the set of AOS/IOS surfactants (35 mixtures), local models 2 on the set of AES/AGES surfactants (35 mixtures), and global models on the entire parent set (70 mixtures).

              PLS                          RS                           SVM
Descriptors   Local 1  Local 2  Global     Local 1  Local 2  Global     Local 1  Local 2  Global
SMF           0.159    0.198    0.215      0.216    0.190    0.224      0.164    0.180    0.216
FGCD          0.226    0.254    0.253      0.224    0.299    0.285      0.203    0.256    0.249
CMD           0.218    0.298    0.328      0.203    0.313    0.323      0.196    0.333    0.302

REFERENCES AND NOTES

(1) Alvarado, V.; Manrique, E. Enhanced Oil Recovery. Field Planning and Development Strategies. Gulf Professional Pub./Elsevier: Burlington, USA, 2010.

(2) Sheng, J. Modern Chemical Enhanced Oil Recovery. Gulf Professional Pub./Elsevier: Burlington, USA, 2010.

(3) Zhao, P.; Jackson, A. C.; Britton, C.; Kim, D. H.; Britton, L. N.; Levitt, D. B.; Pope, G. A. Development of high-performance surfactants for difficult oils. SPE International 2008, 113432.

(4) Hirasaki, G. J.; Miller, C. A.; Puerto, M. Recent advances in surfactant EOR. SPE International 2008, 115386.

(5) Buijse, M. A.; Prelicz, R. M.; Barnes, J. R.; Cosmo, C. Application of internal olefin sulfonates and other surfactants to EOR. Part 2: The design and execution of an ASP field test. SPE International 2010, 129769.

(6) Rubingh, D. N.; Holland, P. M. Cationic Surfactants: Physical Chemistry, Surfactant Science Series, vol. 37; Dekker: New York, 1991.

(7) Huibers, P. D. T.; Shah, D. O. Evidence for synergism in nonionic surfactants mixtures: Enhancement of solubilization in water-in-oil microemulsions. Langmuir 1997, 13, 5762-5765.

(8) Creton, B.; Nieto-Draghi, C.; Pannacci, N. Prediction of surfactants' properties using multiscale molecular modelling tools: A review. Oil Gas Sci. Technol. 2012, 67, 969-982.

(9) Zhu, Y.; Hou, Q.; Wang, Z.; Jian, G.; Zhang, Q. Can we use some methods to design surfactants with ultralow oil/water interfacial tension? SPE International 2013, 164100.

(10) Rosen, M. J. Surfactants and Interfacial Phenomena. Third Edition. John Wiley & Sons: Hoboken, NJ, USA, 2004.

(11) Winsor, P. A. Solvent Properties of Amphiphilic Compounds. Butterworths: London, 1954.

(12) Barnes, J. R.; Dirkzwager, H.; Smit, J. P.; On, A.; Navarrete, R. C.; Ellison, B. H.; Buijse, M. A. Application of internal olefin sulfonates and other surfactants to EOR. Part 1: Structure-performance relationship for selection at different reservoir conditions. SPE International 2010, 129766.

(13) Moreau, P.; Oukhemanou, F.; Maldonado, A.; Creton, B. Application of quantitative structure-property relationship (QSPR) method for chemical EOR surfactant selection. SPE International 2013, 164091.

(14) Morvan, M.; Koetitz, R.; Moreau, P.; Pavageau, B.; Rivoal, P.; Roux, B. A combinatorial approach for identification of performance EOR surfactants. SPE International 2008, 113705.

(15) Salager, J. L.; Morgan, J. C.; Schechter, R. S.; Wade, W. H.; Vasquez, E. Optimum formulation of surfactant/water/oil systems for minimum interfacial tension or phase behavior. Soc. Petrol. Eng. J. 1979, 19, 107-115.

(16) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Creton, B. Prediction of flash points for fuel mixtures using machine learning and a novel equation. Energy Fuels 2013, 27, 3811-3820.

(17) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Creton, B. On the rational formulation of alternative fuels: melting point and net heat of combustion predictions for fuel compounds using machine learning methods. SAR and QSAR in Environmental Research 2013, 24, 259-277.

(18) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Ferrando, N.; Creton, B. Prediction of density and viscosity of biofuel compounds using machine learning methods. Energy Fuels 2012, 26, 2416-2426.

(19) Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Pidol, L.; Jeuland, N.; Creton, B. Flash point and cetane number prediction for fuel compounds using quantitative structure property relationship (QSPR) methods. Energy Fuels 2011, 25, 3900-3908.

(20) SMARTS - A Language for Describing Molecular Patterns; Daylight Chemical Information Systems Inc.: Laguna Niguel, CA, http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html (accessed in 2012).

(21) Freely available at http://infochim.u-strasbg.fr/

(22) Varnek, A.; Fourches, D.; Hoonakker, F.; Solov'ev, V. P. Substructural fragments: an universal language to encode reactions, molecular and supramolecular structures. J. Comput. Aid. Mol. Des. 2005, 19 (9-10), 693-703.

(23) Ruggiu, F.; Marcou, G.; Varnek, A.; Horvath, D. ISIDA Property-Labelled Fragment Descriptors. Mol. Inf. 2010, 29 (12), 855-868.

(24) Hypercube, Inc. HyperChem for Windows (Molecular Modeling System), Version 7.5, 2002, http://www.hyper.com/

(25) Hocquet, A.; Langgårt, M. An evaluation of the MM+ force field. J. Mol. Model. 1998, 4, 94-112.

(26) Codessa, Version 2.642, 1994, http://www.codessa-pro.com/

(27) Katritzky, A. R.; Lobanov, V. S.; Karelson, M.; Murugan, R.; Grendze, M. P.; Toomey, J. E. Comprehensive descriptors for structural and statistical analysis. 1. Correlations between structure and physical properties of substituted pyridines. Rev. Roum. Chim. 1996, 41, 851-867.

(28) Solov'ev, V. P.; Oprisiu, I.; Marcou, G.; Varnek, A. Quantitative structure-property relationship (QSPR) modeling of normal boiling point temperature and composition of binary azeotropes. Ind. Eng. Chem. Res. 2011, 50, 14162-14167.

(29) Roy, K.; Kar, S.; Ambure, P. On a simple approach for determining applicability domain of QSAR models. Chemom. Intell. Lab. Syst. 2015, 145, 22-29.

(30) Boulesteix, A. L.; Strimmer, K. Partial least squares: a versatile tool for the analysis of high-dimensional genomic data. Brief. Bioinform. 2007, 8 (1), 32-44.

(31) Hall, M.; Frank, E.; Holmes, G.; Pfahringer, B.; Reutemann, P.; Witten, I. H. The WEKA Data Mining Software: An Update. SIGKDD Explorations 2009, 11 (1), 10-18.

(32) Ho, T. K. The Random Subspace Method for Constructing Decision Forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20 (8), 832-844.

(33) Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods; Cambridge University Press: Cambridge, United Kingdom, 2000.

(34) Chang, C.-C.; Lin, C.-J. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2 (3), 27:1-27:27.

(35) Gramatica, P. Principles of QSAR models validation: internal and external. QSAR Comb. Sci. 2007, 5, 694-701.

(36) Muratov, E. N.; Varlamova, E. V.; Artemenko, A. G.; Polishchuck, P. G.; Kuz'min, V. E. Existing and developing approaches for QSAR analysis of mixtures. Mol. Inf. 2012, 31, 202-221.
