Discriminating Lacustrine and Marine Organic Matter Depositional

Article

Discriminating lacustrine and marine organic matter depositional paleoenvironments of Brazilian crude oils using comprehensive two-dimensional gas chromatography – quadrupole mass spectrometry (GC×GC-QMS) and supervised classification chemometric approaches

Guilherme Lionello Alexandrino, Paloma Santana Prata, and Fabio Augusto

Energy Fuels, Just Accepted Manuscript • DOI: 10.1021/acs.energyfuels.6b01925 • Publication Date (Web): 30 Nov 2016

Downloaded from http://pubs.acs.org on December 11, 2016

Energy & Fuels is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Energy & Fuels

Discriminating lacustrine and marine organic matter depositional paleoenvironments of Brazilian crude oils using comprehensive two-dimensional gas chromatography – quadrupole mass spectrometry (GC×GC-QMS) and supervised classification chemometric approaches

Guilherme L. Alexandrino*, Paloma S. Prata and Fabio Augusto

Institute of Chemistry, State University of Campinas, Cidade Universitária Zeferino Vaz, 13083-970, Campinas – SP, Brazil.

*Corresponding author: Guilherme L. Alexandrino, D.Sc. Institute of Chemistry – State University of Campinas P.O. Box 6154 13084-971 Campinas, SP, Brazil Phone: +55 19 3521-3105 FAX: +55 19 3521-3023 E-mail: [email protected]

1 ACS Paragon Plus Environment

Abstract

Knowledge of the predominant organic matter depositional paleoenvironment from which crude oils originated is essential for a thorough understanding of their geochemical features. Determining it is laborious and is conventionally done through an extensive search for classical biomarkers in specific SARA (saturates, aromatics, resins and asphaltenes) fractions of the crude oils. In this work, the well-established analytical technique of comprehensive two-dimensional gas chromatography – quadrupole mass spectrometry (GC×GC-QMS) was used to analyze the first two fractions (maltenes) of crude oils, and the chromatographic data were treated with chemometrics to evaluate whether the organic matter depositional paleoenvironment of the crude oils was predominantly lacustrine or marine. The extraction of the target information contained in the GC×GC-QMS data for discriminating between crude oils derived from lacustrine or marine organic matter environments was evaluated using the supervised classification methods k-nearest neighbors (k-NN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), partial least squares – discriminant analysis (PLSDA) and support vector machines – discriminant analysis (SVMDA). The methods were compared on the prediction of external samples using double cross-validation, which is a more appropriate approach for assessing the performance of different methods. Additionally, the main advantages and pitfalls of the linear and non-linear classification approaches for handling the GC×GC-QMS data when classifying the crude oils were discussed in detail, considering the individual prediction uncertainties of the samples provided by the double cross-validation results. The most important variable for discrimination was obtained by interpreting the Y-correlated loadings from a
representative orthogonal-PLSDA model built with the dataset. SVMDA outperformed the remaining methods in correctly classifying the crude oils (performance rank: SVMDA > PLSDA > QDA > LDA > k-NN), because the variance in the GC×GC-QMS data that is relevant for discriminating the crude oils according to their predominant organic matter paleoenvironments was not effectively explained by multilinear approaches alone.

Keywords: comprehensive two-dimensional gas chromatography - mass spectrometry, crude oils, organic matter depositional paleoenvironment, supervised classification, chemometrics

1. Introduction

Knowledge of the organic matter depositional paleoenvironments in which the genesis of crude oils occurred is essential to understand the chemical characteristics of these oils and, consequently, their potential for economic exploitation. Lacustrine paleoenvironments are smaller and receive proportionally more terrigenous organic input than marine environments. Likewise, sedimentation rates in lakes usually exceed those in the oceans, resulting in higher primary production when exploring crude oils therein [1]. Inferring the organic matter depositional paleoenvironment of a crude oil is commonly an extensive task that depends on geological information about the source rock that originated
such crude oil, as well as on the interpretation of the chemical fingerprint of its compounds (biomarkers), which are conventionally analyzed using gas chromatography – mass spectrometry (GC-MS). Mello and coworkers performed an extensive characterization, using GC-MS, of the original organic matter depositional paleoenvironments from which the crude oils of the major Brazilian offshore basins originated. The depositional paleoenvironments were inferred from the overall distribution of the n-alkanes, isoprenoids, hopanes and steranes, as well as from established geochemical parameters calculated from the relative abundance of specific biomarkers in the crude oils, e.g. 4-methylsteranes, 18α(H)-oleanane, gammacerane, β-carotane, tricyclic terpanes, acyclic isoprenoids, 28,30-bisnorhopane and 25,28,30-trisnorhopane [2,3].

When studying petroleum geochemistry using conventional one-dimensional (1D) capillary GC-MS, the saturates fraction usually has to be separated first from the crude oils using liquid chromatography with a pre-activated silica stationary phase, and then the biomarkers in the oils are identified [4]. The overwhelmingly complex mixture that still occurs in this fraction of the crude oils also requires GC-MS working in selected ion monitoring (SIM) mode or GC-MS/MS working in metastable reaction monitoring (MRM) mode to selectively detect different coeluting biomarkers [5,6], although limitations on the chromatographic resolution still remain. However, the resolution of the compounds in crude oils can be improved with the more sophisticated comprehensive two-dimensional gas chromatography – mass spectrometry (GC×GC-MS). In this technique, the eluate coming from the first conventional capillary column (first dimension, ¹D) is concentrated and then reinjected through a modulator into a second fast GC-type capillary column (second dimension, ²D) before reaching the MS detector. The orthogonal separation
capabilities of the columns in this arrangement imply that peaks overlapped in ¹D can be satisfactorily resolved in the two-dimensional chromatographic space. The enhanced chromatographic power of GC×GC-MS has been exploited to identify specific biomarkers in crude oils that correlate with their organic matter depositional paleoenvironment. Kiepper and coworkers identified unusual biomarkers at trace levels in Brazilian crude oils using GC×GC-TOFMS, proposing a new geochemical index to distinguish between crude oils whose organic matter derived from lacustrine and marine environments [7]. Likewise, the compositional heterogeneity of branched-cyclic hydrocarbon biomarkers in crude oils from different organic matter depositional paleoenvironments was studied using GC×GC-TOFMS by Oliveira and coworkers [8], and Casilli and coworkers could distinguish highly similar lacustrine organic matter crude oils from the same basin by identifying minor biomarkers that were successfully resolved only when using GC×GC-TOFMS [9]. More recently, Mogollón et al. [10], using GC×GC-QMS and -MS/MS (based on fast quadrupole mass analyzers instead of the more expensive TOFMS) combined with simplified sample preparation, were able to detect and identify biomarkers such as the C(14α)homo-26-nor-17α-hopane series, diamoretanes, nor-spergulanes, C19–C26 A-nor-steranes and 4α-methylsteranes, as well as other compounds related to the origin, maturity and biodegradation of the crude oil samples.

Chemometrics-assisted data treatment has been demonstrated to be an interesting strategy for extracting geochemical information from crude oils. The large amount of (multivariate) data generated by modern instruments when analyzing petrochemical samples, such as crude oils and/or their fractions, has been exploited for exploratory, classification and quantification purposes. Principal Component Analysis (PCA) is
undoubtedly the most widespread chemometric tool for the exploratory analysis of crude oils, having achieved successful results in finding compositional similarities among crude oils from different sources after screening using GC×GC-FID [11] and GC×GC-MS [12], and in the forensic investigation of oil spill fingerprints obtained from GC-MS data [13]. Multivariate supervised classification approaches have been used to group Brazilian crude oils obtained from different fields and reservoirs of the pre- and post-salt layers, using time-domain ¹H-NMR data and linear discriminant analysis (LDA) as the chemometric tool [14]. Likewise, principal components discriminant analysis (PCDA) has been used to classify highly similar crude oils extracted from the same oil field using GC×GC-FID [15]. Filgueiras and coworkers quantified the saturates, aromatics and polar compounds in Brazilian crude oils from ¹³C-NMR data, using variable selection prior to support vector regression (SVR) as the multivariate regression tool [16].

Because inferring the predominant organic matter depositional paleoenvironment of crude oils using GC is laborious, i.e. it requires the conventional inspection of the abundance of several biomarkers identified by GC-MS in the saturates fraction of the oils, this work explored the classification of Brazilian crude oils as deriving from lacustrine or marine organic matter paleoenvironments by applying supervised multivariate classification approaches to the overall GC×GC-(TIC)MS data obtained from the maltenes (i.e. saturates and aromatics) fraction of the oils. Therefore, the whole geochemical information about the profile of all compounds and biomarkers contained in the maltenes fraction of the crude oils, enhanced here by the chromatographic advantages of GC×GC, was considered to develop classification models that aimed to properly group the oils according to their predominant lacustrine/marine
organic matter depositional paleoenvironment. Prior to the development of the classification models, the predominant organic matter depositional environment of each crude oil sample was assigned as marine or lacustrine by analyzing specific biomarkers using conventional GC-MS and GC-MRM-MS. Next, the class-modelling of the GC×GC-(TIC)MS dataset was performed starting from the simpler multivariate approaches k-nearest neighbors (k-NN), linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA) [17,18], and moving to the more sophisticated partial least squares – discriminant analysis (PLSDA) and support vector machines – discriminant analysis (SVMDA) [19], since a considerable improvement in model performance was obtained when handling such complex data. The models were double cross-validated; therefore, every sample in the dataset is predicted only when it belongs to a test set, and since the models are built in the complete absence of the test set, the class-label predictions and the model optimization steps are performed totally independently [20]. Contrary to single cross-validation, which requires fixed training and test sets for the modelling, double cross-validation provides more realistic predictions for the dataset, since the overall performances of the chemometric approaches are compared using models that are less prone to overfitting. Moreover, better knowledge about the role of each sample in the whole dataset during the discriminant analysis is obtained by inspecting the prediction uncertainties of the individual samples.

Experimental

1.1. Samples

A total of 24 samples of non-biodegraded crude oils from predominantly lacustrine (11 samples) and marine (13 samples) organic matter depositional
paleoenvironments, available in the inventory of our department's Organic Geochemical Research Laboratories (Institute of Chemistry, State University of Campinas), were used as received.

1.2. Sample fractionation for conventional chromatographic analysis

Aliquots of crude oils (0.1000 ± 0.0005 g) were fractionated using conventional liquid chromatography with pre-activated packed silica (30 g, 320 ºC, 4 h) as the stationary phase; the fraction corresponding to the saturated hydrocarbons was eluted using 60 mL of n-hexane. The collected eluate was evaporated to dryness under a gentle N2(g) stream and redissolved in n-hexane to give a 30 mg mL-1 solution.

1.3. Simplified sample preparation for GC×GC-QMS analysis

The maltenes fraction of the crude oils was isolated according to ref. [21]: 0.1000 (± 0.005) g of the crude oil was suspended in 7 mL of n-pentane and centrifuged at 300 rpm for 5 min, and the supernatant was collected. This extraction procedure was repeated 5 times. The solvent was evaporated under a gentle stream of N2(g), and the resulting maltenes fraction was redissolved in n-hexane to 20 mg mL-1.

2. GC

2.1. Conventional GC-MS and GC-MS/MS

The saturates fraction in n-hexane was analyzed by GC-MS using a Shimadzu GC-TQ8030 instrument fitted with a split/splitless injector and a capillary column (30 m × 0.25 mm i.d. × 0.25 µm df) with 5% phenyl methyl polysiloxane as the stationary phase (RTX-5ms). The GC conditions were: injection volume 1 µL, split ratio 1:20, at 300 ºC and
purge flow 2.0 mL min-1; the oven was ramped from 70 ºC to 325 ºC at 3 ºC min-1, using hydrogen as the carrier gas at 1.0 mL min-1. The transfer line to the triple quadrupole MS was set to 280 ºC and the ion source to 270 ºC, with electron ionization at 70 eV, a mass scanning range of m/z 40 to 582 and an acquisition rate of 5 Hz. In the GC-MS/MS analyses operated in MRM mode, the GC split ratio was set to 1:10 and specific biomarkers were identified by monitoring the following mass transitions (acquisition rate of 10 Hz): m/z 398 → 191, 412 → 191, 426 → 191 and 440 → 191 (C29-C32 αβ-hopanes and gammacerane), and 372 → 217, 386 → 217 and 400 → 217 (C27-C29 steranes).

2.2. GC×GC-QMS

The maltene fractions of the crude oils in n-hexane were analyzed in duplicate using a lab-made GC×GC–QqQMS prototype, based on a Shimadzu GC-TQ8030 chromatograph and a two-stage cryogenic loop-type modulator programmed to provide a 4.0 s cold jet (N2(liq)) followed by a 2.0 s hot jet (T = 350 ºC) in each modulation period (6.0 s). The in-house software and hardware used to control the modulation have been validated, and their applications are described in the literature [22-24]. The column set consisted of a capillary column (30 m × 0.25 mm i.d. × 0.25 µm df) with 5% phenyl methyl polysiloxane as the stationary phase (RTX-5ms), coupled through a capillary HP5 column (1.0 m × 0.25 mm i.d. × 0.25 µm df), which acts as the modulator, to a second-dimension column (1.0 m × 0.15 mm i.d. × 0.15 µm df) with 50% diphenyl-50% dimethyl polysiloxane as the stationary phase (Rxi®-17Sil MS). The GC conditions were: injection volume 1 µL, split ratio 1:20, at 300 ºC, purge flow 2.0 mL min-1 and oven ramp from 70 ºC to 325 ºC at
3 ºC min-1, using hydrogen as the carrier gas at 0.8 mL min-1. The transfer line to the triple quadrupole MS was set to 280 ºC and the ion source to 270 ºC, with electron ionization at 70 eV, a mass scanning range of m/z 40 to 582 and an acquisition rate of 33.33 Hz.

3. Data treatment

3.1. Data acquisition and preprocessing

Data acquisition and processing were performed using the GCMS solution software v.4.20 (Shimadzu Corp., Kyoto, Japan). The GC×GC-QMS total ion chromatograms of all replicates were combined, converted to .txt files and imported into Matlab (MathWorks, Natick – MA, USA) as an unfolded row-wise data matrix X(48, 164000), where 48 is the number of replicates and 164000 is the total number of variables associated with the retention times in the first (¹D) and second (²D) dimensions. Next, X was preprocessed by baseline subtraction, and an in-house Matlab routine was written to perform piecewise peak alignment in X using the icoshift v.1.2.3 algorithm [25]. The supervised classification methods k-NN, PLSDA and SVMDA were run using PLS_Toolbox v.8.1.1 for Matlab (Eigenvector Research Inc., Wenatchee – WA, USA), while LDA and QDA were run using built-in Matlab functions. X was mean-centered and then compressed using PCA (maximum number of PCs = 10) prior to building the k-NN, LDA and QDA classification models, while the PLS approach was used for data compression in PLSDA and SVMDA.
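
As a rough consistency check on the dimensions of X, and to illustrate what a row-wise unfolded matrix is, the following Python sketch may help. It is a stand-in and not the authors' Matlab routine. It assumes that each 6.0 s modulation is stored as a contiguous block of scans acquired at 33.33 Hz; the names `unfold`, `fold` and `mean_center` and the demo values are hypothetical.

```python
# Illustrative sketch only (Python stand-in, not the authors' Matlab code).
MODULATION_PERIOD_S = 6.0      # modulation period, section 2.2
ACQUISITION_RATE_HZ = 33.33    # MS acquisition rate, section 2.2
N_VARIABLES = 164000           # number of columns of X, section 3.1

# Assumption: one modulation corresponds to one contiguous block of scans.
points_per_modulation = round(MODULATION_PERIOD_S * ACQUISITION_RATE_HZ)
n_modulations = N_VARIABLES // points_per_modulation
covered_time_min = n_modulations * MODULATION_PERIOD_S / 60
oven_ramp_min = (325 - 70) / 3   # 70 to 325 ºC at 3 ºC per minute


def unfold(chromatogram):
    """Concatenate the modulations (rows) of one 2D chromatogram into one row."""
    return [value for modulation in chromatogram for value in modulation]


def fold(row, points):
    """Inverse of unfold: cut a row back into modulations of `points` scans."""
    return [row[i:i + points] for i in range(0, len(row), points)]


def mean_center(X):
    """Subtract the column mean from every column of a row-wise data matrix."""
    n_rows = len(X)
    means = [sum(column) / n_rows for column in zip(*X)]
    return [[value - m for value, m in zip(row, means)] for row in X]


# Tiny hypothetical example: 2 replicates, 2 modulations of 3 scans each.
replicate_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
replicate_b = [[3.0, 2.0, 1.0], [6.0, 5.0, 4.0]]
X_demo = [unfold(replicate_a), unfold(replicate_b)]
X_centered = mean_center(X_demo)
```

Under this assumption, the 164000 columns would correspond to 820 modulations of 200 scans each, i.e. 82 min of the 85 min oven ramp. The text does not state which part of the run was retained, so this is only a plausibility check.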

3.2. Double cross-validation

For the double cross-validation of the models, the 3-fold outer loops (CV2) were structured with 16 samples (32 replicates, including the duplicates) defining the training set and the remaining 8 samples (16 replicates) defining the test set, so that samples from both classes (i.e. lacustrine and marine organic matter depositional paleoenvironments) were included in the training and test sets. The inner loops (CV1), used for model optimization with the training set only, were 16-fold venetian-blind cross-validated (except for k-NN), with the duplicates of each sample removed from the training set simultaneously during this process to avoid fitting over-optimistic models. The misclassification rate (i.e. number of misclassified replicates / total number of replicates) was the metric used both in the CV1 loops (model optimization, where the lowest rate was sought) and in the CV2 loops to assess the overall performance of the chemometric methods (after a total of 1000 iterations) for classifying the crude oils. The optimization procedure used to build the models (CV1 inner loops) is detailed below for each chemometric approach:

- k-NN: the number of PCs extracted from the PCA-compressed X was varied from 1 to 10, and the number of neighbors k from 2 to 5.

- LDA and QDA: the number of PCs extracted from the PCA-compressed X was varied from 1 to 10, using the pseudo-linear and pseudo-quadratic functions for LDA and QDA, respectively, available in the “fitcdiscr” function of Matlab, to perform the discrimination of the samples.

- PLSDA: the number of LVs extracted from X was varied from 1 to 10.

- SVMDA: X was initially compressed using PLSDA, taking the optimum number of LVs found in the previous PLSDA models. The models were computed by the “c-svm”
type function using the RBF kernel, available in the libsvm library of PLS_Toolbox. Prior to the double cross-validation of the SVMDA models, the C and ɣ parameters of the kernel function were optimized through an additional, shorter double cross-validation procedure (100 iterations in total), using a search grid formed by 11 (0.001 ≤ C ≤ 100) × 15 (1 × 10⁻⁶ ≤ ɣ ≤ 10) log-spaced points in each iteration. The optimized C and ɣ values corresponded to the medians of their respective distributions obtained after this procedure.
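
The structure of one double cross-validation iteration can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' Matlab/PLS_Toolbox implementation: a nearest-class-mean classifier stands in for the chemometric methods, the number of features `p` stands in for the number of PCs or LVs, the outer split is random rather than class-balanced, and the data are synthetic. All function names are hypothetical.

```python
import random


def nearest_mean_predict(train_X, train_y, x, p):
    """Stand-in classifier: pick the class whose mean is closest to x,
    using only the first p features (p plays the role of the number of PCs)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r[:p] for r, y in zip(train_X, train_y) if y == label]
        mean = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x[:p], mean))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def misclassification_rate(X, y, groups, train_ids, test_ids, p):
    """Misclassified replicates / total replicates in the held-out samples."""
    train = [i for i, g in enumerate(groups) if g in train_ids]
    test = [i for i, g in enumerate(groups) if g in test_ids]
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    wrong = sum(nearest_mean_predict(train_X, train_y, X[i], p) != y[i]
                for i in test)
    return wrong / len(test)


def double_cv(X, y, groups, n_test=8, p_grid=(1, 2, 3), seed=0):
    """One iteration of double cross-validation with replicates kept together."""
    rng = random.Random(seed)
    ids = sorted(set(groups))
    rng.shuffle(ids)
    outer_folds = [ids[k:k + n_test] for k in range(0, len(ids), n_test)]
    results = []
    for test_ids in outer_folds:                    # CV2: outer loop
        train_ids = [s for s in ids if s not in test_ids]

        def inner_error(p):                         # CV1: inner loop
            rates = []
            for held_out in train_ids:              # both replicates leave together
                kept = [s for s in train_ids if s != held_out]
                rates.append(misclassification_rate(
                    X, y, groups, kept, [held_out], p))
            return sum(rates) / len(rates)

        best_p = min(p_grid, key=inner_error)       # chosen without the test set
        rate = misclassification_rate(X, y, groups, train_ids, test_ids, best_p)
        results.append((test_ids, best_p, rate))
    return results


# Synthetic data: 24 samples (11 "L", 13 "M"), 2 replicates, 3 features.
data_rng = random.Random(1)
X, y, groups = [], [], []
for sample_id in range(24):
    label = "L" if sample_id < 11 else "M"
    center = 0.0 if label == "L" else 10.0
    for _ in range(2):
        X.append([center + data_rng.uniform(-1, 1),
                  data_rng.uniform(-1, 1), data_rng.uniform(-1, 1)])
        y.append(label)
        groups.append(sample_id)

results = double_cv(X, y, groups)
```

The key design point is that samples, not replicates, are assigned to folds, so the duplicates of an oil never sit on both sides of a split, and the hyperparameter is selected inside the outer training set only. In the procedure described above, such an iteration would be repeated with different random splits (1000 times in the text) to obtain a distribution of CV2 misclassification rates.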

3.3. Class-label predicted probability

After the double cross-validation procedure, the performances of the classification approaches were evaluated by computing the class-label predicted probabilities for each replicate (CV2 outer loops), averaged over the 1000 iterations, so that the class labels were assigned taking their respective uncertainties into account. Prediction probabilities were computed using Bayesian approaches for each chemometric method separately:

- k-NN: the probability that a sample belongs to each class was calculated as the fraction of its nearest neighbors that belong to that class.

- LDA and QDA: the predicted probability of the i-th sample belonging to class k, P(k | i), was calculated in Matlab as P(k | i) = P(i | k)·P(k) / P(i), where P(i | k) is the probability density of the i-th sample under the Gaussian distribution function of the training samples belonging to class k, P(k) is the prior probability of class k, which corresponds to the proportion of training samples belonging to class k, and P(i) is the sum over k of P(i | k)·P(k).

- PLSDA: the predicted probability of a sample belonging to class 1 was calculated as P(y | 1) / (P(y | 1) + P(y | 2)), where y is the value predicted by the PLSDA model for this sample, and P(y | 1) and P(y | 2) are the probabilities of y belonging to classes 1 and 2, respectively, obtained from the Gaussian distribution function of each class calculated from the training set [26].

- SVMDA: the predicted probabilities were calculated by the libsvm package included in PLS_Toolbox, through the calculation of pairwise class probabilities using the training samples [27].
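
The k-NN, LDA/QDA and PLSDA probability rules listed above can be illustrated with a short Python sketch. It is a univariate simplification with hypothetical class parameters (the actual models work on multivariate scores), and the function names are not taken from Matlab or PLS_Toolbox.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def knn_probability(neighbor_labels, target):
    """Fraction of the nearest neighbors that carry the target label."""
    return neighbor_labels.count(target) / len(neighbor_labels)


def bayes_posterior(x, class_params, priors):
    """P(k | x) = P(x | k) P(k) / sum over k of P(x | k) P(k)."""
    joint = {k: gaussian_pdf(x, mean, std) * priors[k]
             for k, (mean, std) in class_params.items()}
    evidence = sum(joint.values())
    return {k: value / evidence for k, value in joint.items()}


# Hypothetical class models (mean, std) along a single discriminant score.
class_params = {"lacustrine": (0.0, 1.0), "marine": (2.0, 1.0)}

# LDA/QDA-style priors: training-set proportions, e.g. 11 and 13 of 24 samples.
priors = {"lacustrine": 11 / 24, "marine": 13 / 24}
posterior_mid = bayes_posterior(1.0, class_params, priors)

# The PLSDA rule P(y | 1) / (P(y | 1) + P(y | 2)) is the same formula
# evaluated with equal priors.
equal_priors = {"lacustrine": 0.5, "marine": 0.5}
plsda_like = bayes_posterior(0.2, class_params, equal_priors)

knn_prob = knn_probability(["L", "L", "M"], "L")
```

At the midpoint between two classes with equal spread, the likelihoods cancel and the posterior reduces to the priors, which shows how the class proportions of the training set can tip borderline replicates.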

4. Results and Discussion

4.1. Organic matter depositional paleoenvironment of the crude oils using GC-MS and GC-MRM-MS

Before applying the chemometric tools, the predominant depositional paleoenvironment of each crude oil sample was confirmed by estimating well-established geochemical parameters traditionally adopted to discriminate between samples from predominantly lacustrine or marine organic matter depositional environments; these parameters are usually calculated from conventional GC-MS and GC-MS/MS data. Oils generated in lacustrine environments have a higher abundance of pristane (Pr) than phytane (Ph); the opposite (i.e. Ph more abundant than Pr) is typical of oils from marine sources. Additionally, the Pr/n-C17 and Ph/n-C18 ratios can also be useful to assign the organic matter depositional paleoenvironment of the oil, notably the latter (Table 1). However, geochemical information obtained from such parameters that depend exclusively on
paraffinic compounds may be biased; other properties of the samples (not necessarily correlated with the depositional paleoenvironment, such as thermal maturation and biodegradation levels) also affect the concentration of these compounds. Therefore, other saturated biomarkers were also inspected to confirm the lacustrine or marine origin of the organic matter depositional paleoenvironment of the crude oils: the C27-C29 steranes, C29-C32 hopanes and gammacerane were analyzed using GC-MS/MS (MRM) of the saturates fractions of the crude oils. Mello and coworkers successfully discriminated between lacustrine and marine organic matter depositional paleoenvironments of Brazilian crude oils by analyzing the abundance of the above-mentioned biomarkers [2,3]. Gammacerane is a common biomarker related to stratified water columns in the depositional paleoenvironment, commonly found in higher quantity in petroleum from saline-to-hypersaline paleoenvironments. Therefore, this biomarker is usually found in higher concentration in crude oils from marine evaporitic paleoenvironments than in crude oils from lacustrine sources [7,28], which resulted in higher gammacerane/C30 αβ-hopane ratios for the crude oils from marine organic matter depositional environments used in this study. Likewise, marine organic matter depositional paleoenvironments present a predominance of steranes over hopanes [1,8,9], which is also in accordance with the predominantly lacustrine or marine origins assigned to the paleoenvironments of the samples (Table 1).
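
The diagnostic ratios discussed in this section can be computed from integrated peak areas as in the Python sketch below. The peak areas are invented for illustration and are not taken from Table 1, the function name is hypothetical, and summing the monitored C27-C29 steranes and C29-C32 hopanes is an assumption about how the steranes/hopanes comparison might be expressed.

```python
def biomarker_ratios(areas):
    """Diagnostic ratios from integrated peak areas (any consistent unit)."""
    steranes = sum(areas[name] for name in
                   ("C27 sterane", "C28 sterane", "C29 sterane"))
    hopanes = sum(areas[name] for name in
                  ("C29 hopane", "C30 hopane", "C31 hopane", "C32 hopane"))
    return {
        "Pr/Ph": areas["pristane"] / areas["phytane"],
        "Pr/n-C17": areas["pristane"] / areas["n-C17"],
        "Ph/n-C18": areas["phytane"] / areas["n-C18"],
        "gammacerane/C30 hopane": areas["gammacerane"] / areas["C30 hopane"],
        "steranes/hopanes": steranes / hopanes,
    }


# Hypothetical peak areas for one oil (illustration only).
hypothetical_oil = {
    "pristane": 120.0, "phytane": 80.0, "n-C17": 400.0, "n-C18": 320.0,
    "gammacerane": 10.0, "C30 hopane": 100.0,
    "C29 hopane": 60.0, "C31 hopane": 50.0, "C32 hopane": 30.0,
    "C27 sterane": 20.0, "C28 sterane": 15.0, "C29 sterane": 25.0,
}
ratios = biomarker_ratios(hypothetical_oil)
```

Following the criteria stated in the text, a Pr/Ph above 1 together with low gammacerane/C30 hopane and steranes/hopanes values would be consistent with a lacustrine assignment, although, as noted above, maturation and biodegradation can bias the paraffin-based ratios.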

4.2. Supervised classification of the crude oils using GC×GC-QMS(TIC) data and chemometrics

The overall complexity and performance of the double cross-validated models for the classification of the crude oils according to their organic matter depositional paleoenvironments (i.e. marine or lacustrine) are given in Table 2, and the histograms of their respective misclassification errors (CV2 loops) are plotted in Figure 1. The improvement in the performance of the methods is seen when comparing the prediction errors in CV2 and the histograms, in the order k-NN (18.0) < LDA (26.8) < QDA (33.9) < PLSDA (50.6) < SVMDA (77.8), where the values in parentheses are the percentages of models with misclassification rates ≤ 10 % in the CV2 predictions. Note that only after double cross-validating the models was it possible to conclude that none of the methods' performances (which can be compared by ranking the lowest misclassification rates) was due to chance, since less than 1% of the models would be expected to provide a misclassification rate ≤ 10 % if the crude oils had been classified totally at random [29].
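
The order of magnitude of this chance level can be checked with a simple binomial calculation, sketched below in Python. This is not necessarily how the expectation cited from ref. 29 was derived; it assumes that a random classifier is a fair coin and considers two limiting cases for how the duplicates behave.

```python
from math import comb


def prob_at_most(n_trials, max_errors, p_error=0.5):
    """Binomial probability of observing at most max_errors errors."""
    return sum(comb(n_trials, e) * p_error ** e * (1 - p_error) ** (n_trials - e)
               for e in range(max_errors + 1))


N_TEST_REPLICATES = 16   # 8 test samples analyzed in duplicate (section 3.2)
N_TEST_SAMPLES = 8

# A misclassification rate <= 10 % allows at most 1 wrong replicate out of 16.
max_wrong_replicates = int(0.10 * N_TEST_REPLICATES)

# Assumption A: every replicate is labelled by an independent fair coin.
p_replicates = prob_at_most(N_TEST_REPLICATES, max_wrong_replicates)

# Assumption B: both replicates of a sample always receive the same label,
# so one wrong sample already gives 2/16 = 12.5 %; all 8 must be right.
p_samples = prob_at_most(N_TEST_SAMPLES, 0)
```

Under assumption A the probability is 17/65536 (about 0.03 %), and under assumption B it is 1/256 (about 0.4 %); both are below the 1 % bound quoted in the text.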

4.2.1. k-NN, LDA and QDA

The great advantage of the supervised classification methods k-NN, LDA and QDA is that they are simpler to compute and less prone to overfitting than PLSDA, since the models are more closely related to the inherent multivariate statistical properties of the data [30]. Moreover, a better classification performance can be obtained by computing non-linear boundaries when using QDA. Although k-NN provided the lowest averaged misclassification rate in the CV1 inner loops (model optimization), with the data being compressed using mostly 4 PCs prior to the k-NN modeling (see Table 2), this is actually an over-optimistic result. The occurrence of replicates in the dataset implied that most of the
optimized k-NN models were computed considering only 2 neighbors, because the assignment of the class labels of the samples in the training set was biased by the fact that replicates of the same sample tend to be closer to each other in the multivariate space. Indeed, the overfitting of the k-NN models is confirmed by their poorer CV2 predictions (see Table 2). The analysis of the class-label predicted probabilities computed from the double cross-validation results reveals that samples M1, L7 and L8 were misclassified (on average, considering all the replicates) by the k-NN models, with large probability standard deviations crossing the classification boundary for many other samples (Figure 2). The failure of the k-NN method to properly classify many replicates means that the crude oils are not easily distinguished in the reduced multivariate space. A slight decrease in the number of misclassified samples was observed when plotting the class-label predicted probabilities computed from the LDA (Figure 3) and QDA (Figure 4) models, in which L8 is misclassified on average by both LDA and QDA, and L11 is also misclassified by LDA. Considering that a truly misclassified crude oil required both replicates to be misclassified on average in the double cross-validated models, a (slightly) better classification of the crude oils could be inferred when using QDA. However, the performances of LDA and QDA are actually very similar when comparing their respective histograms in Fig. 1 and their CV2 misclassification rates in Table 2. Additionally, it is reasonable to conclude that most of the crude oils derived from marine environments were correctly classified, according to their respective low predicted probabilities of being lacustrine-derived in Fig. 3 (LDA) and Fig. 4 (QDA). The converse, i.e. that most of the crude oils derived from lacustrine environments are correctly classified with high probabilities, does not hold. Indeed, lacustrine-derived crude oils such as L1, L3,
L4, L8, L10 and L11 are more prone to be misclassified when using LDA and/or QDA, along with the marine oils M1, M11, M12 and M13. In the geochemical point of view, despite the depositional environments of the above-mentioned oils conventionally assigned according to section 4.1 (Table 1), lacustrine oils containing a more pronounced influences of marine sources, and vice-versa, e.g.; typifying lacustrine saline and restricted marine environments1, can explain such oils more liable to be misclassified. Therefore, the exploiting of more sophisticated classification approaches able to handle such crude oils geochemical complexity, such as PLSDA and SVMDA, is interesting and a valuable strategy.
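The replicate-level reading of the predicted probabilities (Figures 2 to 4) can be sketched as follows. This is only a minimal illustration and not the authors' code: the probability values are invented placeholders, the 0.5 decision threshold is an assumption (the figures mark the threshold with a red line, but its value is not stated in the text), and the helper name `summarize_sample` is hypothetical.

```python
from statistics import mean, pstdev

THRESHOLD = 0.5  # assumed decision boundary for P(lacustrine); not given in the text


def summarize_sample(true_class, probs, threshold=THRESHOLD):
    """Summarize CV2 predicted P(lacustrine) pooled over all replicates of one oil."""
    avg = mean(probs)
    sd = pstdev(probs)
    predicted = "lacustrine" if avg >= threshold else "marine"
    return {
        "mean": avg,
        "sd": sd,
        # misclassified "on average": the mean probability is on the wrong side
        "misclassified": predicted != true_class,
        # uncertain: the mean +/- one standard deviation band crosses the boundary
        "crosses_boundary": (avg - sd) < threshold < (avg + sd),
    }


# Hypothetical probabilities for illustration only (not values from the paper).
example = {
    "oil_A": ("lacustrine", [0.92, 0.88, 0.95, 0.90]),
    "oil_B": ("lacustrine", [0.35, 0.55, 0.30, 0.40]),
    "oil_C": ("marine", [0.10, 0.45, 0.62, 0.20]),
}

summary = {name: summarize_sample(cls, p) for name, (cls, p) in example.items()}
for name, s in summary.items():
    print(name, round(s["mean"], 3), round(s["sd"], 3),
          s["misclassified"], s["crosses_boundary"])
```

In this toy case, `oil_B` is misclassified on average, whereas `oil_C` is classified correctly but with a standard deviation band that crosses the assumed boundary, which mirrors the two situations distinguished in the discussion above.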

4.2.2. PLSDA

PLSDA takes advantage of the PLS-regression approach to perform the discriminant analysis by maximizing the covariance between the dataset X and the Y-responses, the latter assigning the class-labels of the samples (ref 17). The enhanced performance of PLSDA for classifying the crude oils was confirmed not only by the lower prediction errors (i.e. misclassification-CV2) in Table 2, but also by the more accurate predicted probabilities (Figure 5) compared with the results obtained from LDA and QDA. Indeed, PLSDA succeeded in distinguishing highly similar crude oils that could not be well discriminated by the LDA and QDA models. However, PLSDA was unable to correctly classify the P1 oil and also provided larger uncertainty when classifying the M1 and L11 replicates (compared with the remaining oils), which might have been responsible for the computed mean misclassification ratio obtained from the double cross-validated models. The above-mentioned oils might have high relative importance for distinguishing the classes in the models, suggesting that, as good practice, they should be part of the training set when fitting a properly representative PLSDA model aimed at future predictions. Therefore, the double cross-validation procedure allowed the identification of crude oils that are more prone to be misclassified when they do not belong to the training set, and whose geochemical properties can be further investigated from their respective GC×GC-MS data.

To identify the most important chromatographic signals for the discrimination of the crude oils according to their respective organic matter depositional paleoenvironments, a single PLSDA model containing orthogonality restrictions, the so-called orthogonal-PLSDA (OPLSDA) model, was fitted and validated after defining training (32 replicates) and test (16 replicates) sets containing representative samples from both classes (marine and lacustrine). Bylesjö and coworkers pointed out the advantages of choosing OPLSDA instead of PLSDA when interpreting the model loadings, since both methods provide the same predictions for the same dataset when using the same number of LVs. Because the Y-correlated variance in the dataset is separated from the Y-orthogonal variance into a single LV in OPLSDA, the loadings from the Y-correlated LV carry the relative importance of the variables exclusively for discriminating the samples, contrary to PLSDA (ref 31). The OPLSDA model was built in PLS_Toolbox using 5 LVs, which gave the minimum 16-fold venetian blinds cross-validation error (similarly to the CV1 inner loops in the previous double cross-validation procedures). The single Y-correlated LV explained 7.93% and 88.36% of the total variance in X and Y, respectively, while the other 4 Y-orthogonal LVs explained 88.26% of the variance in X. That only a small portion of the dataset variance (approximately 8% of X) explains most of the class-labels of the crude oils (approximately 88% of Y) is due to the large amount of information in the GC×GC-MS data that is not necessarily correlated with the property of interest (herein, the organic matter depositional paleoenvironment of the crude oils), which is typical of omics-based approaches. The chromatographic loadings from the Y-correlated LV, along with the OPLSDA predictions, are depicted in Figure 6. Because of the large number of variables compared with the number of samples, the OPLSDA model was also validated using a permutation test (100,000 iterations) to guarantee that the performance of this model was not due to chance (refs 32 and 33; Table 3). Since the variance of the dataset is dominated by chromatographic signals from the paraffinic compounds contained in the crude oils, the oils were discriminated according to the relatively higher amount of n-alkanes, mainly the C19-C27 n-alkanes, found in the crude oils from lacustrine environments (negative loadings in Figure 6), while the oils from marine environments presented higher amounts of C14-C18 isoprenoids (positive loadings in Figure 6). This loadings profile is consistent with the geochemical characteristics of these oils described previously (see section 1). The OPLSDA model correctly predicted the class-labels of the samples belonging to the test set; such optimistic results are attributed to the well-representative training set, which was built using the most complex samples from the dataset, previously identified when double cross-validating the models.
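The logic of validating a classifier by label permutation can be illustrated with a short sketch. It is a simplified stand-in under stated assumptions: a nearest-centroid classifier scored by leave-one-out accuracy replaces the OPLSDA model, the one-dimensional toy data are hypothetical, only 999 permutations are drawn, and the p-value is the plain fraction of permuted models that perform at least as well as the real one. The Wilcoxon, sign and randomization t-tests reported in Table 3 are not reproduced here.

```python
import random


def loo_accuracy(x, y):
    """Leave-one-out accuracy of a nearest-centroid classifier on 1-D data."""
    hits = 0
    for i in range(len(x)):
        train = [(xv, yv) for j, (xv, yv) in enumerate(zip(x, y)) if j != i]
        centroids = {}
        for label in set(yv for _, yv in train):
            vals = [xv for xv, yv in train if yv == label]
            centroids[label] = sum(vals) / len(vals)
        predicted = min(centroids, key=lambda lab: abs(x[i] - centroids[lab]))
        hits += predicted == y[i]
    return hits / len(x)


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Compare the real score against scores obtained with shuffled class-labels."""
    rng = random.Random(seed)
    observed = loo_accuracy(x, y)
    shuffled = list(y)
    as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if loo_accuracy(x, shuffled) >= observed:
            as_good += 1
    # add-one correction so the estimated p-value is never exactly zero
    return observed, (as_good + 1) / (n_perm + 1)


# Hypothetical toy scores: two well-separated groups of six samples each.
x = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 5.0, 5.2, 5.4, 5.6, 5.8, 6.0]
y = ["lacustrine"] * 6 + ["marine"] * 6

observed, p_value = permutation_p_value(x, y)
print(f"observed accuracy = {observed:.2f}, permutation p-value = {p_value:.3f}")
```

A small p-value indicates that shuffled class-labels rarely reach the performance of the real labels, which is the sense in which a model is said not to be due to a chance event.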

4.2.3. SVMDA

The crude oils were discriminated using PLSDA-scores (6 LVs) previously obtained from the dataset; this number of LVs is the closest integer to the average number of LVs chosen for predictions in the double cross-validated PLSDA models (section 4.2.2). The use of PLSDA-scores, instead of PCA-scores, as compressed-data input for SVMDA modelling takes advantage of the maximized covariance between X and Y for better classification in the higher-dimensional space. Contrary to the PLSDA models, none of the crude oils was misclassified on average (when considering the classification of both replicates from each sample) in the SVMDA models (Figure 7). Additionally, there is an evident decrease in the uncertainties of the SVMDA predictions for the crude oils that were harder to classify by PLSDA (i.e. the L8, M1 and L11 replicates). The better capability of the SVMDA models to correctly predict the class-labels of the crude oils is strongly related to the non-linear correlation between X and Y in the classification. Applying k-NN, LDA and QDA to the PCA-compressed dataset meant that only strictly multilinear information was considered for the discrimination of the samples, which in this case resulted in the worst performance for predicting the class-labels of the samples (CV2 outer loops in the double cross-validation). Additionally, although the (linear) PLSDA outperformed k-NN, LDA and QDA in the external predictions, the fact that SVMDA outperformed PLSDA means that the major variance of the paraffinic compounds in the dataset is indeed not strictly linear with the depositional paleoenvironment fingerprint of all the crude oils. Consequently, the most effective discriminant analysis is only achieved through the projection of the data into the more complex higher-dimensional space. The need for a more complex (non-linear) chemometric approach for the most effective discrimination of the crude oils is understandable considering the geochemical aspects of the crude oils: apart from the fact that some crude oils may be influenced by both lacustrine and marine sources, other inherent properties of the crude oils, such as slight differences in their biodegradation and thermal maturity levels, also disturb the distribution profile of the paraffinic compounds in the oils (ref 1) and may add non-linear features to the data.
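Why a projection into a higher-dimensional space can help is easiest to see in a toy case. The sketch below is not the SVMDA model used in this work: it uses invented one-dimensional data and an explicit feature map (x to x squared) in place of a kernel, and the helper `best_threshold_accuracy` is a hypothetical name. It only shows that classes which no single linear cut-off can separate may become separable after a non-linear mapping.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of a single cut-off on one feature, in either orientation."""
    n = len(values)
    best = 0.0
    for cut in sorted(set(values)):
        agree = sum((v >= cut) == bool(lab) for v, lab in zip(values, labels))
        # (n - agree) / n is the accuracy of the same cut with the classes swapped
        best = max(best, agree / n, (n - agree) / n)
    return best


# Toy data: class 1 sits at both extremes of x, class 0 in the middle.
x = [-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0]
labels = [1, 1, 1, 0, 0, 0, 1, 1, 1]

linear_acc = best_threshold_accuracy(x, labels)                    # cut-off on x
mapped_acc = best_threshold_accuracy([v * v for v in x], labels)   # cut-off on x**2

print(f"best linear cut-off accuracy: {linear_acc:.2f}")
print(f"best cut-off accuracy after mapping x -> x**2: {mapped_acc:.2f}")
```

A kernel-based support vector classifier performs this kind of mapping implicitly, which is one plausible reading of why it can outperform linear discriminants when the class information is not linearly related to the measured variables.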


5. Conclusions

This work demonstrates the advantages of exploiting the enhanced analytical performance of GC×GC-QMS and multivariate statistical approaches in petroleum fingerprinting to discriminate between crude oils from lacustrine or marine organic matter depositional paleoenvironments, aiming to provide an analytical strategy alternative to the laborious conventional procedure. The inherent chemical complexity of the crude oils could be further investigated by evaluating the limitations of the chemometric methods in correctly classifying the crude oils, since the double cross-validation strategy allowed external (independent) predictions of the entire dataset. Knowledge of the complexity of the dataset is important because it allows a suitable supervised classification model to be built for future predictions of “unknown” samples. Indeed, samples that are hard to classify may be indicative of crude oils containing both lacustrine and marine influences in their geochemical profiles, differing from the typical (lacustrine or marine) oils more easily distinguished by the chemometric models, as well as of the effect of different maturation and biodegradation levels on their chromatographic profiles. Although these properties impart higher complexity to the data, the SVMDA models were able to extract the target information about source more efficiently from the GC×GC-QMS data, resulting in the best discrimination of the crude oils among the methods. PLSDA provided the best performance among the linear classification methods, with the advantage that the most important chromatographic signals (variables) exclusively correlated to the discrimination of the crude oils (i.e. the Y-correlated loadings) could be easily extracted and interpreted when choosing OPLSDA modelling. Although chemometrics proved to be a valuable analytical tool to discriminate the crude oils according to their predominant lacustrine or marine organic matter depositional paleoenvironment, it is important to highlight that the discrimination among subgroups of depositional paleoenvironment types (e.g. lacustrine freshwater, lacustrine saline, marine evaporitic, marine deltaic, etc.) was not considered when building the models. However, this goal can be achieved with a proper sample set and a chromatographic technique able to capture this property more efficiently from the crude oils, such as GC×GC-MS.

Acknowledgments

The authors are thankful to the Coordination for the Improvement of Higher Education Personnel – Ministry of Education (CAPES) and the São Paulo Research Foundation (research grant 2015/08201-0) for providing financial support.

6. References

(1) Peters, K. E.; Walters, C. C.; Moldowan, J. M. The Biomarker Guide, 2nd ed.; Cambridge University Press: New York, 2005; Vol. 2, Biomarkers and Isotopes in Petroleum Systems and Earth History.
(2) Mello, M. R.; Gaglianone, P. C.; Brassell, S. C.; Maxwell, J. R. Mar. Pet. Geol. 1988, 5 (3), 205–223. http://dx.doi.org/10.1016/0264-8172(88)90002-5
(3) Mello, M. R.; Telnaes, N.; Gaglianone, P. C.; Chicarelli, M. I.; Brassell, S. C.; Maxwell, J. R. Org. Geochem. 1988, 13 (1–3), 31–45. http://dx.doi.org/10.1016/0146-6380(88)90023-X
(4) Panda, S. K.; Andersson, J. T.; Schrader, W. Anal. Bioanal. Chem. 2007, 389 (5), 1329–1339. http://dx.doi.org/10.1007/s00216-007-1583-6
(5) Chiaberge, S.; Fiorani, T.; Cesti, P. Fuel Process. Technol. 2011, 92 (11), 2196–2201. http://dx.doi.org/10.1016/j.fuproc.2011.07.011
(6) Liang, Q.; Xiong, Y.; Fang, C.; Li, Y. Org. Geochem. 2012, 43, 83–91. http://dx.doi.org/10.1016/j.orggeochem.2011.10.008
(7) Kiepper, A. P.; Casilli, A.; Azevedo, D. A. Org. Geochem. 2014, 70, 62–75. http://dx.doi.org/10.1016/j.orggeochem.2014.03.005
(8) Oliveira, C. R.; Ferreira, A. A.; Oliveira, C. J. F.; Azevedo, D. A.; Santos Neto, E. V.; Aquino Neto, F. R. Org. Geochem. 2012, 46, 154–164. http://dx.doi.org/10.1016/j.orggeochem.2012.03.002
(9) Casilli, A.; Silva, R. C.; Laakia, J.; Oliveira, C. J. F.; Ferreira, A. A.; Loureiro, M. R. B.; Azevedo, D. A.; Aquino Neto, F. R. Org. Geochem. 2014, 68, 61–70. http://dx.doi.org/10.1016/j.orggeochem.2014.01.009
(10) Mogollón, N. G. S.; Prata, P. S.; dos Reis, J. Z.; Neto, E. V. dos S.; Augusto, F. J. Sep. Sci. 2016, 39 (17), 3384–3391. http://dx.doi.org/10.1002/jssc.201600418
(11) Ventura, G. T.; Hall, G. J.; Nelson, R. K.; Frysinger, G. S.; Raghuraman, B.; Pomerantz, A. E.; Mullins, O. C.; Reddy, C. M. J. Chromatogr. A 2011, 1218 (18), 2584–2592. http://dx.doi.org/10.1016/j.chroma.2011.03.004
(12) Zhang, W.; Zhu, S.; He, S.; Wang, Y. J. Chromatogr. A 2015, 1380, 162–170. http://dx.doi.org/10.1016/j.chroma.2014.12.068
(13) Christensen, J. H.; Tomasi, G. J. Chromatogr. A 2007, 1169 (1–2), 1–22. http://dx.doi.org/10.1016/j.chroma.2007.08.077
(14) Barbosa, L. L.; Sad, C. M. S.; Morgan, V. G.; Santos, M. F. P.; Castro, E. V. R. Energy & Fuels 2013, 27 (11), 6560–6566. http://dx.doi.org/10.1021/ef4015313
(15) van Mispelaar, V. G.; Smilde, A. K.; de Noord, O. E.; Blomberg, J.; Schoenmakers, P. J. J. Chromatogr. A 2005, 1096 (1–2), 156–164. http://dx.doi.org/10.1016/j.chroma.2005.09.063
(16) Filgueiras, P. R.; Portela, N. A.; Silva, S. R. C.; Castro, E. V. R.; Oliveira, L. M. S. L.; Dias, J. C. M.; Neto, A. C.; Romão, W.; Poppi, R. J. Energy & Fuels 2016, 30 (3), 1972–1978. http://dx.doi.org/10.1021/acs.energyfuels.5b02377
(17) Marini, F. Curr. Anal. Chem. 2010, 6 (1), 72–79. http://dx.doi.org/10.2174/157341110790069592
(18) Dixon, S. J.; Brereton, R. G. Chemom. Intell. Lab. Syst. 2009, 95 (1), 1–17. http://dx.doi.org/10.1016/j.chemolab.2008.07.010
(19) Brereton, R. G.; Lloyd, G. R. Analyst 2010, 135 (2), 230–267. http://dx.doi.org/10.1039/B918972F
(20) Szymańska, E.; Saccenti, E.; Smilde, A. K.; Westerhuis, J. A. Metabolomics 2012, 8, 3–16. http://dx.doi.org/10.1007/s11306-011-0330-3
(21) Gürgey, K. Org. Geochem. 1998, 29 (5–7), 1139–1147. http://dx.doi.org/10.1016/S0146-6380(98)00134-X
(22) Rivellino, S. R.; Hantao, L. W.; Risticevic, S.; Carasek, E.; Pawliszyn, J.; Augusto, F. Food Chem. 2013, 141 (3), 1828–1833. http://dx.doi.org/10.1016/j.foodchem.2013.05.003
(23) Santos, T. G.; Fukuda, K.; Kato, M. J.; Sartorato, A.; Duarte, M. C. T.; Ruiz, A. L. T. G.; de Carvalho, J. E.; Augusto, F.; Marques, F. A.; Sales Maia, B. H. L. N. Microchem. J. 2014, 115, 113–120. http://dx.doi.org/10.1016/j.microc.2014.02.014
(24) de Lima, P. F.; Furlan, M. F.; de Lima Ribeiro, F. A.; Pascholati, S. F.; Augusto, F. J. Sep. Sci. 2015, 38 (11), 1924–1932. http://dx.doi.org/10.1002/jssc.201401404
(25) Tomasi, G.; Savorani, F.; Engelsen, S. B. J. Chromatogr. A 2011, 1218 (43), 7832–7840. http://dx.doi.org/10.1016/j.chroma.2011.08.086
(26) Pérez, N. F.; Ferré, J.; Boqué, R. Chemom. Intell. Lab. Syst. 2009, 95 (2), 122–128. http://dx.doi.org/10.1016/j.chemolab.2008.09.005
(27) Wu, T.-F.; Lin, C.-J.; Weng, R. C. J. Mach. Learn. Res. 2004, 5, 975–1005.
(28) Sousa Júnior, G. R.; Santos, A. L. S.; de Lima, S. G.; Lopes, J. A. D.; Reis, F. A. M.; Santos Neto, E. V.; Chang, H. K. Org. Geochem. 2013, 63, 94–104. http://dx.doi.org/10.1016/j.orggeochem.2013.07.009
(29) Brereton, R. G. TrAC Trends Anal. Chem. 2006, 25 (11), 1103–1111. http://dx.doi.org/10.1016/j.trac.2006.10.005
(30) Brereton, R. G.; Lloyd, G. R. J. Chemom. 2014, 28 (4), 213–225. http://dx.doi.org/10.1002/cem.2609
(31) Bylesjö, M.; Rantalainen, M.; Cloarec, O.; Nicholson, J. K.; Holmes, E.; Trygg, J. J. Chemom. 2006, 20 (8–10), 341–351. http://dx.doi.org/10.1002/cem.1006
(32) Westerhuis, J. A.; Hoefsloot, H. C. J.; Smit, S.; Vis, D. J.; Smilde, A. K.; van Velzen, E. J. J.; van Duijnhoven, J. P. M.; van Dorsten, F. A. Metabolomics 2008, 4 (1), 81–89. http://dx.doi.org/10.1007/s11306-007-0099-6
(33) Triba, M. N.; Le Moyec, L.; Amathieu, R.; Goossens, C.; Bouchemal, N.; Nahon, P.; Rutledge, D. N.; Savarin, P. Mol. Biosyst. 2015, 11 (1), 13–19. http://dx.doi.org/10.1039/c4mb00414k
(34) Thomas, E. V. J. Chemom. 2003, 17 (12), 653–659. http://dx.doi.org/10.1002/cem.833
(35) van der Voet, H. Chemom. Intell. Lab. Syst. 1994, 25 (2), 313–323. http://dx.doi.org/10.1016/0169-7439(94)85050-X


Table 1. Some conventional geochemical parameters to distinguish between predominant lacustrine or marine organic matter depositional paleoenvironments of crude oils.

Sample  Pr/Ph (1)  Pr/(n-C17) (1)  Ph/(n-C18) (1)  Gam/H30 (2)  Ste/Hop (2)  Depositional paleoenvironment
L1      1.52       0.75            0.63            0.13         0.35         LACUSTRINE
L2      1.11       0.41            0.43            0.15         0.34         LACUSTRINE
L3      1.31       0.59            0.54            0.12         0.37         LACUSTRINE
L4      1.44       0.75            0.62            0.06         0.27         LACUSTRINE
L5      1.32       0.47            0.40            0.06         0.31         LACUSTRINE
L6      1.26       0.38            0.32            0.13         0.34         LACUSTRINE
L7      0.75       0.18            0.26            0.32         0.54         LACUSTRINE
L8      1.17       0.67            0.72            0.13         0.28         LACUSTRINE
L9      1.25       0.55            0.54            0.07         0.33         LACUSTRINE
L10     2.00       0.63            0.30            0.27         0.16         LACUSTRINE
L11     1.67       0.31            0.19            0.35         0.64         LACUSTRINE
M1      0.69       0.44            1.03            0.24         3.14         MARINE
M2      0.44       1.11            3.22            0.48         2.52         MARINE
M3      0.47       1.10            3.08            0.51         2.31         MARINE
M4      0.92       0.67            1.04            0.30         2.97         MARINE
M5      0.82       0.62            1.18            0.32         2.56         MARINE
M6      0.83       0.66            1.14            0.29         3.00         MARINE
M7      1.16       0.51            0.67            0.14         6.27         MARINE
M8      0.86       0.64            1.20            0.27         2.59         MARINE
M9      0.86       0.67            1.16            0.19         2.89         MARINE
M10     1.09       0.58            0.77            0.26         4.13         MARINE
M11     1.16       0.53            0.82            0.24         3.27         MARINE
M12     0.92       0.67            1.01            0.34         6.81         MARINE
M13     1.20       0.50            0.75            0.30         4.23         MARINE

Pr, pristane; Ph, phytane; Gam, gammacerane; H30, C30 αβ-hopane; Hop, ∑ C29-C32 (S + R) hopanes; Ste, ∑ ααα + αββ C27-C29 (S + R) steranes.
(1) Parameters computed using GC-MS (EIC): m/z 85 (ref 1).
(2) Parameters computed using GC-MS/MS (MRM): Gam and H30 (m/z 412 → 191), Ste [m/z 372 → 217 (C27), 386 → 217 (C28), 400 → 217 (C29)], Hop [m/z 398 → 191 (C29), 412 → 191 (C30), 426 → 191 (C31), 440 → 191 (C32)] (ref 1).
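As a reading aid for Table 1, the short sketch below checks whether the value ranges of a parameter overlap between the two classes. It is an illustration only: the Pr/Ph and Ste/Hop values are transcribed from Table 1, the helper `ranges_overlap` is a hypothetical name, and a range comparison is a deliberately crude criterion that is not part of the conventional geochemical assignment.

```python
# Pr/Ph and Ste/Hop values transcribed from Table 1 (L1-L11, then M1-M13).
pr_ph = {
    "lacustrine": [1.52, 1.11, 1.31, 1.44, 1.32, 1.26, 0.75, 1.17, 1.25, 2.00, 1.67],
    "marine": [0.69, 0.44, 0.47, 0.92, 0.82, 0.83, 1.16, 0.86, 0.86, 1.09, 1.16,
               0.92, 1.20],
}
ste_hop = {
    "lacustrine": [0.35, 0.34, 0.37, 0.27, 0.31, 0.34, 0.54, 0.28, 0.33, 0.16, 0.64],
    "marine": [3.14, 2.52, 2.31, 2.97, 2.56, 3.00, 6.27, 2.59, 2.89, 4.13, 3.27,
               6.81, 4.23],
}


def ranges_overlap(values_by_class):
    """True if the [min, max] intervals of the two classes intersect."""
    (lo_a, hi_a), (lo_b, hi_b) = [(min(v), max(v)) for v in values_by_class.values()]
    return lo_a <= hi_b and lo_b <= hi_a


for name, table in (("Pr/Ph", pr_ph), ("Ste/Hop", ste_hop)):
    for cls, values in table.items():
        print(f"{name:8s}{cls:11s}min = {min(values):.2f}  max = {max(values):.2f}")
    print(f"{name}: class ranges overlap = {ranges_overlap(table)}")
```

For the oils in Table 1, the Ste/Hop ranges of the two classes do not intersect, whereas the Pr/Ph ranges do, so no single Pr/Ph cut-off separates all of these samples.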

Table 2. Parameters related to the complexity (nPC or nLV) and performance (error CV1 and error CV2) of the double cross-validated models computed using k-NN, LDA, QDA, PLSDA and SVMDA, while classifying the crude oils between marine or lacustrine organic matter depositional paleoenvironments.

Method  nPC / nLV (1)  misclassification - CV1 (2)  misclassification - CV2 (2)
k-NN    4 (2) (3)      5.61 (4.08)                  18.82 (10.64)
LDA     7              10.38 (5.25)                 18.35 (12.04)
QDA     5              11.07 (6.02)                 16.30 (11.63)
PLSDA   5              6.26 (4.26)                  10.00 (9.79)
SVMDA   6 (6/6) (4)    8.47 (6.33) (5)              5.51 (8.07)

(1) median of the number of PCs (k-NN, LDA and QDA) and LVs (OPLSDA and SVMDA) obtained in the CV1 inner loops.
(2) mean misclassification ratio, in percent, for the CV1 or CV2 double cross-validation loops and the corresponding standard deviation (in parentheses).
(3) median of the optimum number of k neighbors obtained in the CV1 inner loops (value in parentheses).
(4) median of the number of support vectors used for each class (in parentheses) in the optimized SVMDA models, during the double cross-validation procedure.
(5) mean misclassification ratio, in percent, for the CV1 inner loops (100 iterations) during the optimization of the SVMDA models, along with the standard deviation (in parentheses).
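The roles of the CV1 inner loops and CV2 outer loops reported in Table 2 can be sketched generically as follows. This is a minimal illustration under stated assumptions rather than the procedure used in this work: the classifier is a plain k-NN, the toy two-dimensional data are invented, the grid of k values is arbitrary, and both loops leave out one whole sample (all of its replicates) at a time so that replicates of the same oil never sit on both sides of a split.

```python
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query))
    )[:k]
    return Counter(label for _, label, _ in nearest).most_common(1)[0][0]


def error_rate(train, test, k):
    wrong = sum(knn_predict(train, x, k) != label for x, label, _ in test)
    return wrong / len(test)


def split_by_group(rows, held_out):
    """Keep every replicate of the held-out sample together in the test part."""
    train = [r for r in rows if r[2] != held_out]
    test = [r for r in rows if r[2] == held_out]
    return train, test


def double_cross_validation(rows, k_grid=(1, 3, 5)):
    """Outer loop (CV2) estimates the error; inner loop (CV1) only tunes k."""
    groups = sorted({r[2] for r in rows})
    cv2_errors, chosen_k = [], []
    for outer in groups:
        outer_train, outer_test = split_by_group(rows, outer)
        inner_groups = sorted({r[2] for r in outer_train})
        cv1 = {}
        for k in k_grid:
            errs = [error_rate(*split_by_group(outer_train, g), k) for g in inner_groups]
            cv1[k] = sum(errs) / len(errs)
        best_k = min(k_grid, key=lambda k: cv1[k])
        chosen_k.append(best_k)
        cv2_errors.append(error_rate(outer_train, outer_test, best_k))
    return 100 * sum(cv2_errors) / len(cv2_errors), chosen_k


# Hypothetical rows: (features, class-label, sample id); two replicates per sample.
rows = [
    ((0.0, 0.1), "L", "s1"), ((0.1, 0.0), "L", "s1"),
    ((0.3, 0.2), "L", "s2"), ((0.2, 0.3), "L", "s2"),
    ((0.1, 0.4), "L", "s3"), ((0.0, 0.5), "L", "s3"),
    ((2.0, 2.1), "M", "s4"), ((2.1, 2.0), "M", "s4"),
    ((2.3, 2.2), "M", "s5"), ((2.2, 2.3), "M", "s5"),
    ((2.1, 2.4), "M", "s6"), ((2.0, 2.5), "M", "s6"),
]

cv2_percent, chosen_k = double_cross_validation(rows)
print(f"mean CV2 misclassification = {cv2_percent:.2f}%, chosen k per outer loop = {chosen_k}")
```

Because the outer test samples never take part in the choice of k, the CV2 error is an external estimate, which is why it is typically larger than the CV1 error for models that overfit.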


Table 3. Permutation test results from the cross-validated OPLSDA model while classifying the crude oils between marine or lacustrine organic matter depositional paleoenvironments.

Statistical test          p-value*
Wilcoxon test (ref 34)    0.028
Sign test (ref 34)        0.015
Random t-test (ref 35)    0.012

* A p-value ≤ 0.05 implies rejecting the null hypothesis (OPLSDA model predictions are not significantly different from random classifications) at the 95% confidence level.


Figure 1. Histograms of the misclassification error in the outer loops (CV2) for the double cross-validated models obtained using k-NN, LDA, QDA, PLSDA and SVMDA.


Figure 2. Predicted probability of the crude oils from lacustrine (yellow) and marine (green) organic matter depositional paleoenvironments to belong to the lacustrine environment, computed using double cross-validated k-NN models. The red line denotes the probability threshold above which the samples are classified as lacustrine.


Figure 3. Predicted probability of the crude oils from lacustrine (yellow) and marine (green) organic matter depositional paleoenvironments to belong to the lacustrine environment, computed using double cross-validated LDA models. The red line denotes the probability threshold above which the samples are classified as lacustrine.


Figure 4. Predicted probability of the crude oils from lacustrine (yellow) and marine (green) organic matter depositional paleoenvironments to belong to the lacustrine environment, computed using double cross-validated QDA models. The red line denotes the probability threshold above which the samples are classified as lacustrine.


Figure 5. Predicted probability of the crude oils from lacustrine (yellow) and marine (green) organic matter depositional paleoenvironments to belong to the lacustrine environment, computed using double cross-validated PLSDA models. The red line denotes the probability threshold above which the samples are classified as lacustrine.


Figure 6. OPLSDA loadings from the Y-correlated LV when classifying the crude oils between their original lacustrine or marine organic matter depositional paleoenvironments.


Figure 7. Predicted probability of the crude oils from lacustrine (yellow) and marine (green) organic matter depositional paleoenvironments to belong to the lacustrine environment, computed using double cross-validated SVMDA models. The red line denotes the probability threshold above which the samples are classified as lacustrine.
