Classification of Diesel Fuel Using Two-Dimensional Fluorescence

Article

Classification of Diesel fuel using 2D Fluorescence Spectroscopy Lucas Ranzan, Cassiano Ranzan, Luciane Ferreira Trierweiler, and Jorge Otávio Trierweiler Energy Fuels, Just Accepted Manuscript • DOI: 10.1021/acs.energyfuels.7b00954 • Publication Date (Web): 03 Aug 2017 Downloaded from http://pubs.acs.org on August 20, 2017

Energy & Fuels is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Energy & Fuels

Classification of Diesel fuel using 2D Fluorescence Spectroscopy

Lucas Ranzan¹*, Cassiano Ranzan², Luciane F. Trierweiler¹, Jorge O. Trierweiler¹

¹Group of Intensification, Modeling, Simulation, Control, and Optimization of Process (GIMSCOP), Chemical Engineering Department, Federal University of Rio Grande do Sul, 90040-040, Porto Alegre, RS, Brazil (lucas.ranzan,luciane,[email protected])
²EQA, School of Chemistry and Food, FURG, Federal University of Rio Grande, Santo Antônio da Patrulha, RS, Brazil ([email protected])

KEYWORDS: Diesel, 2D Fluorescence Spectroscopy, Classification, PCA, PLS-DA, Random Forest.

ABSTRACT: Air pollution is a serious problem, and to decrease the emission of pollutants, governments have established environmental rules that limit the concentration of sulfur in diesel fuel. Controlling the sulfur content of diesel streams demands on-line measurement of this component, with a low pure time delay and easy operation. For this purpose, 2D Fluorescence Spectroscopy is a promising choice for soft-sensor development. Despite the advantages of fluorescence, translating fluorescence spectroscopic data into process knowledge is not a simple task and demands multivariate analysis. This work aims to evaluate the applicability of 2D Fluorescence Spectroscopy for monitoring Ultra Low Sulfur Diesel (ULSD) streams (Diesel S10) and the
capability of differentiation between ULSD and diesel streams with a higher sulfur concentration (Diesel S100). Evaluation of the fluorescence spectra of the different sample groups made clear the ability of the technique to capture changes in composition between Diesel S10 and Diesel S100. The results obtained using Partial Least Squares Discriminant Analysis (PLS-DA) and Random Forest (RF) show that fluorescence spectroscopy can be applied to the classification of Diesel S10 and Diesel S100 test samples. Models calibrated with both methodologies achieved 100% correct classification. The RF implementation was better at selecting a small number (four) of specific excitation/emission pairs that could be used in the development of a customized sensor for diesel fuel classification.

1. Introduction

In recent decades, concern about human and environmental health has increased significantly. All big cities are affected by serious problems caused by soil, water, and air pollution. The introduction of contaminants into the natural environment is highly correlated with the use of fossil fuels. This creates the need to produce fuels with lower concentrations of certain pollutants, such as sulfur1, 2. Several countries have developed legislation to promote the production of Ultra-Low Sulfur Diesel (ULSD), in an attempt to improve fuel quality and minimize the air pollution caused by combustion. ULSD is defined by the U.S. Environmental Protection Agency as diesel fuel with a sulfur content lower than 15 ppm3. In Brazil, since December 2013, Resolution No. 50 of the National Agency of Oil, Gas, and Biofuels (ANP) has regulated the quality of diesel fuel sold in the national territory. This regulation establishes that all metropolitan diesel must contain less than 10 ppm of sulfur
(also known as Diesel S10), which qualifies it as ULSD. This regulation forces refineries to treat their products to meet specific sulfur concentration requirements before sale4. Conventional hydrodesulfurization (HDS) is a catalytic chemical process used in refineries for the efficient elimination of sulfur compounds, particularly light compounds such as thiophenes and benzothiophenes5. The capability to measure sulfur content in process streams has evolved side by side with desulfurization technology. Today, sulfur monitoring is carried out according to standard methods published by the American Society for Testing and Materials (ASTM), specifically ASTM D2622, ASTM D5453, and ASTM D7039. These standards require sample preparation by specialized personnel and high-cost instrumentation. The measurements are infrequent and are associated with considerable time delays. For a standard diesel stream in a refinery, it may take as long as 24 hours to obtain the total sulfur concentration, which limits the use of on-line control and optimization6. The development of an on-line sensor for sulfur content in diesel streams is crucial from the economic and technological points of view, since it can considerably improve plant performance. Additionally, such sensors are attractive to environmental regulation agencies and to consumers in general7. Owing to characteristics such as high robustness, non-invasiveness, fast response, and high resolution8, optical sensors based on spectroscopic principles are a suitable and promising option for developing a sulfur content analyzer. Near Infrared Spectroscopy (NIR) is one of the most studied methods for the determination of diesel characteristics, including total sulfur. As shown by Breitkreitz et al.9, this technique can satisfactorily quantify total sulfur concentration in the range between 0.07 and 0.33% (w/w), with results as good as those of standard procedures. Rocha et al.10 used MIR and NIR associated with PLS
with variable selection to determine sulfur content in petroleum fractions, with limits of detection and quantification of 0.0234 and 0.0781 (weight percent), respectively. Ferrão et al.11 were also able to satisfactorily predict (R² = 0.9995) sulfur content in biodiesel/diesel blends using HATR-FTIR spectra and PLS/iPLS/siPLS for samples with sulfur concentrations between 312 and 1351 ppm. For sulfur contents lower than 15 ppm (the case of ULSD), these methods did not yield good results, which makes them unsuitable for S10 Diesel characterization12. Among the spectroscopic techniques for the analytical determination of small concentrations of sulfur, inductively coupled plasma optical emission spectrometry has been reported as a precise and powerful methodology. The samples are excited by inductively coupled plasma, and the excited atoms emit radiation at wavelengths characteristic of a particular element (in this case, sulfur). Corazza et al.13 and Mello et al.14 achieved good results for the quantification of samples with less than 20 ppm of total sulfur. The technique, although precise, is very time-consuming and destructive and requires expensive equipment. Fluorescence spectroscopy appears as a promising alternative for sulfur quantification in diesel streams. This method shows excellent sensitivity and limits of detection up to three orders of magnitude better than those of absorption spectroscopies. As many sulfur compounds in diesel emit fluorescence, it seems natural to apply the fluorescence principle to develop an optical sensor5. Despite the apparent viability of fluorescence spectroscopy for sulfur quantification in diesel, many technical issues must be solved before sensors using this technology can be built, such as the development of precise and robust models, the selection of the specific Ex/Em
pairs related to sulfurous compounds, and the development of a methodology for on-line sampling directly in the refinery streams. The primary objective of this paper is to determine whether fluorescence spectral data from process streams can be used directly for qualitative characterization. In related research on sulfur quantification in diesel15, our research group developed two specific models to predict total sulfur using chemometric modeling and fluorescence data: (i) one for samples with less than 10 ppm and (ii) one for samples with about 100 ppm. The present work was motivated by the need for a classification methodology capable of correctly assigning diesel samples to the subgroups Diesel S10 and Diesel S100 using only fluorescence spectral data. Aburto et al.5 proposed pretreating the samples by enzymatic oxidation before measuring the fluorescence spectrum, which resulted in an excellent method for the quantification of total sulfur content. Another possibility is to apply chemometric modeling techniques directly to the 2D fluorescence spectrum without any sample pretreatment. This paper explores the direct use of the fluorescence spectrum to classify diesel fuel according to its sulfur content. The works of Hua et al.16 and Wang et al.17 support the application of fluorescence spectroscopy to qualify sulfur content in diesel fuel. After hydrodesulfurization, almost all nonaromatic sulfur-containing compounds are removed, leaving only sulfurous polycyclic aromatic compounds, which emit fluorescence. Initially, an Unsupervised Learning Technique (ULT) was used to detect groups in the measured data set. ULTs do not require dependent variables for modeling and search for patterns in the independent variables. Groups of samples can be separated according to the structure of the independent variables. In this work, Principal Component Analysis (PCA) was applied to
identify a possible qualitative segregation between Diesel S100 and S10 based only on spectral measurements18. Additionally, two supervised pattern recognition techniques, Partial Least Squares Discriminant Analysis (PLS-DA) and Random Forest (RF), were applied. PLS-DA is one of the most commonly used classification methods and is particularly suited to ill-conditioned data matrices (such as fluorescence spectra)19. The Random Forest (RF) method is an ensemble of multiple decision trees that creates nonparametric predictive models based on the decisions of a collection of classification or regression trees20.

2. Materials and Methods

Samples of Diesel S10 (ULSD) and Diesel S100, previously characterized according to their sulfur content, were analyzed using 2D fluorescence spectroscopy measurements. All samples were provided and certified by a Brazilian petroleum refinery. Owing to availability, the number of Diesel S100 samples (51) was more than four times the number of Diesel S10 samples (12).

2.1 Diesel S10 samples - ULSD

Samples were certified according to the ASTM D-7039 test, using a Sindie® 7039 bench analyzer by XOS®. Twelve S10 samples were used in this study. Eleven of them had a sulfur content in the range between 5.1 ppm and 6.4 ppm, with an average of 5.8 ppm, and one sample had a sulfur content below 1.0 ppm, which is the threshold of the ASTM D-5453 test used to measure this specific sample (ASTM D-7039 has a threshold of 3 ppm). The limits of detection and quantification for the ASTM tests D-5453 and D-7039 were 1 – 3 and 8000 ppm, respectively.

2.2 Diesel S100 samples

A total of 51 samples from mild HDS treatment were used. Each sample was characterized for total sulfur content according to the ASTM D-4294 test, using a LABX-3000® produced by Oxford®. One sample was collected every three days over a period of five months. This sampling interval strengthens the results obtained in this paper, because the type and origin of the petroleum differ among samples, and therefore the methodology was not conditioned on one particular type of oil (which can strongly influence the final product characteristics). The S100 sample group had a sulfur content between 73.7 ppm and 138.6 ppm, with an average of 100.3 ppm. Since the average sulfur concentration in this group was close to 100 ppm, the group was named Diesel S100, even though some of its samples contain more than 100 ppm of total sulfur.

2.4 2D Fluorescence Spectroscopy

The fluorescence spectra were measured with a HORIBA Fluoromax-4® equipped with a 150 W xenon lamp. Measurements covered excitation wavelengths between 260 nm and 600 nm and emission wavelengths between 290 nm and 850 nm. The measurement geometry was 90 degrees. Both excitation and emission wavelengths were scanned in increments of 10 nm. With this arrangement, each fluorescence spectrum was obtained as a 57x35 matrix (57 emission by 35 excitation wavelengths), containing the fluorescence intensity of 1995 excitation/emission pairs. The fluorescence spectrum of each sample was later unfolded into a vector of
dimension 1x1995, in which the row corresponds to the sample and each column contains the fluorescence intensity of one excitation/emission pair. The fluorescence measurements were made using a remote optical fiber accessory and glass vials, in a specially designed dark-room support21. Before the spectroscopic measurements, samples were stabilized at 25°C using a thermostatic bath. All measurements were made in triplicate and then combined into one final spectrum by taking the arithmetic mean of the three measurements.
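
As an illustration of the averaging and unfolding steps described above, the following Python sketch builds the wavelength grids implied by the stated ranges and reduces a triplicate to a single 1x1995 row. It is a minimal sketch on synthetic data: the function name `average_and_unfold`, the emission-by-excitation orientation of the 57x35 matrix, and the row-major unfolding order are assumptions, since the text does not specify them.

```python
import numpy as np

# Wavelength grids implied by the text (10 nm increments).
excitation_nm = np.arange(260, 601, 10)  # 35 excitation wavelengths
emission_nm = np.arange(290, 851, 10)    # 57 emission wavelengths


def average_and_unfold(replicates):
    """Average replicate spectra and unfold the mean into one row vector.

    replicates: array of shape (n_replicates, 57, 35), assumed to be
    emission x excitation. Returns an array of shape (1, 1995).
    """
    replicates = np.asarray(replicates, dtype=float)
    mean_spectrum = replicates.mean(axis=0)   # arithmetic mean of the replicates
    return mean_spectrum.reshape(1, -1)       # row-major unfolding (assumption)


# Synthetic triplicate standing in for real measurements.
rng = np.random.default_rng(0)
triplicate = rng.random((3, emission_nm.size, excitation_nm.size))
row = average_and_unfold(triplicate)
print(row.shape)
```

Stacking one such row per sample gives the sample-by-pair matrices used in the chemometric analysis.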

2.5 Chemometric Analysis

Attribute selection is an important topic in sensor evaluation. It aims to (i) identify subsets of non-redundant attributes that describe products and (ii) find attributes that lead to good discrimination between products22. Principal Component Analysis (PCA) was applied as a tool for qualitative analysis of the fluorescence data collected from the diesel samples. PCA is an unsupervised learning technique and requires only the coordinates of the data points (independent variables). It maps data points from a high-dimensional space to a low-dimensional space, performing a linear dimensionality reduction while keeping the most significant linear structure intact23. PCA is therefore an appropriate tool for reducing the original space to a low-dimensional subspace that captures most of the information in the data24. For the PCA analyses, three different datasets were created. The first two datasets each contain the information about one sample group: (i) Diesel S10: twelve rows, each related to one sample, and 1995 columns, each corresponding to
the fluorescence intensity of one excitation/emission pair; (ii) Diesel S100: fifty-one rows organized in the same way as for Diesel S10. For the analysis of the two sample groups together, a matrix was created with sixty-three rows, the first twelve being the Diesel S10 samples and the other fifty-one the Diesel S100 samples, and 1995 columns, each containing the fluorescence intensity of one excitation/emission pair. The visualization provided by PCA is useful for qualitative analysis or grouping of data samples. However, it does not define clusters explicitly18.

The Classification Toolbox for MATLAB version 3.6, released by the Milano Chemometrics and QSAR Research Group19, was used to perform the PLS-DA studies presented in this paper. As the name suggests, the algorithm is based on PLS regression, which searches for latent variables with maximum covariance with the Y-variables18. PLS-DA returns an estimated value $\hat{y}_{ig}$ for each i-th sample and each g-th class. The estimated values are not perfectly binary. To make a class assignment, the probability that a sample belongs to a specific class is calculated from the estimated class values, and a threshold is defined for each class: if $\hat{y}_{ig}$ is greater than the threshold defined for the g-th class, the i-th sample is assigned to the g-th class; otherwise it is not. A sample may have estimated values above the thresholds of more than one class, or below the thresholds of all classes. Such a sample would be assigned to more than one class or to no class and is therefore labeled as “not assigned”. With the calibrated PLS-DA model, new test samples (not used for calibration) can be evaluated to estimate the real classification capability of the model.
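
The threshold-based assignment rule can be sketched as follows. This is an illustrative Python sketch, not the Classification Toolbox implementation: the function name `assign_classes`, the example estimated values, and the thresholds of 0.5 are hypothetical, and the thresholds are simply taken as given rather than derived from class probabilities.

```python
import numpy as np


def assign_classes(y_hat, thresholds, labels):
    """Assign each sample to the single class whose threshold it exceeds.

    y_hat: (n_samples, n_classes) estimated class values.
    thresholds: one threshold per class.
    Samples exceeding no threshold, or more than one, are "not assigned".
    """
    y_hat = np.asarray(y_hat, dtype=float)
    above = y_hat > np.asarray(thresholds, dtype=float)
    assigned = []
    for row in above:
        hits = np.flatnonzero(row)  # indices of classes whose threshold is exceeded
        assigned.append(labels[hits[0]] if hits.size == 1 else "not assigned")
    return assigned


# Hypothetical estimated values for four samples and two classes.
y_hat = np.array([[0.92, 0.08],   # clearly the first class
                  [0.10, 0.90],   # clearly the second class
                  [0.60, 0.70],   # above both thresholds
                  [0.20, 0.30]])  # below both thresholds
result = assign_classes(y_hat, thresholds=[0.5, 0.5], labels=["S10", "S100"])
print(result)
```

The last two samples illustrate the two ways in which a sample can end up as "not assigned".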

Initially, the sample groups were divided into training and test subsets. The division was carried out so that the training set would retain 75% of the total samples, leaving 25% for the prediction test routine, as follows: (i) Training set: [S100 samples – 38; S10 samples – 9] and (ii) Test set: [S100 samples – 13; S10 samples – 3]. As a preliminary procedure, we tried to reduce the number of variables in the matrices. Using all excitation/emission pairs, although feasible, limits the application of the methodology. Additionally, the fewer the variables needed to separate the classes, the easier it would be to design a customized sensor for the proposed methodology. The selection of pairs was based on a screening of the discrimination power of the variables using Wilks’ Lambda25. Wilks’ Lambda (WL) is related to the likelihood criterion and ranges between 0 and 1. Values close to 0 indicate that the class means are different. Consequently, variables with low Wilks’ Lambda can be considered good for separating classes. The WL of each variable in the full fluorescence data matrix of the training set was calculated, as shown in Figure 1 (left). Based on the WL values for each variable (and a combination of various tests), 27 fluorescence pairs with low WL values were selected (Figure 1 (right)), and the final training and test matrices were 47x27 and 16x27, respectively. The number of pairs was determined by trial and error: the final selection resulted from dozens of tests combining pairs with low WL in the test and calibration groups individually and in the whole set. The 27 selected Ex/Em pairs were: Ex270/Em420, Ex280/Em440, Ex280/Em450, Ex280/Em460, Ex280/Em480, Ex290/Em840, Ex410/Em500, Ex410/Em520, Ex410/Em530, Ex410/Em540, Ex410/Em550, Ex420/Em520, Ex430/Em520, Ex430/Em540, Ex480/Em490,
Ex480/Em500, Ex480/Em530, Ex480/Em540, Ex490/Em510, Ex490/Em520, Ex490/Em550, Ex500/Em530, Ex500/Em540, Ex500/Em550, Ex510/Em520, Ex520/Em530, Ex530/Em540. The PLS-DA model was calibrated using only the training set. First, the number of latent variables (LVs) was selected by cross-validation, as the number that minimizes the error rate (misclassification of samples). The training set was divided into five cross-validation groups (according to the venetian blinds approach19), and the cross-validation error rate and the percentage of not-assigned samples were calculated as a function of the number of latent variables. After the optimal number of LVs was selected, the PLS-DA model was calibrated. Several features were evaluated to analyze the classification performance. First, the confusion matrices obtained in fitting and cross-validation were examined. The confusion matrices show the number of true and false negatives, true and false positives, and not-assigned samples. Besides these numerical parameters, ROC (Receiver Operating Characteristic) curves were used to evaluate the classification capability of the model. They are a graphical representation of the relationship between the false positive and true positive rates. ROC curves are calculated separately for each class by plotting sensitivity versus (1 - specificity) for a binary classification system as the discrimination threshold changes. Sensitivity ranges from 0 to 1 and describes the ability of the model to correctly recognize samples belonging to a class. Specificity also ranges from 0 to 1 and describes the ability of the model to reject samples from all other classes. To evaluate the true predictive performance of the classification model, the classification parameters of a test set (with new samples not used in the calibration step) were
analyzed. If the test set yields results similar to those of the training set, the model can be considered reliable and stable. In this research, we applied the so-called unfold PCA and unfold PLS-DA techniques (first order), since we unfolded our second-order data to first order (the Ex/Em matrix was transformed into a vector of Ex/Em pairs). Although many authors recommend second-order chemometric methods for second-order data (such as PARAFAC and N-PLS-DA26), unfold techniques are still the preferred method in a variety of works dealing with spectral data and achieve similar and satisfactory results27-29. All calculations for PCA and PLS-DA were performed using MATLAB® version 8.0.0.783 (R2012b).

Random Forest is a classifier consisting of a collection of tree-structured classifiers. Each tree relies on the CART (Classification and Regression Trees) procedure, in which the feature space is partitioned into disjoint regions (nodes) and the output in each region is estimated by a simple model. The splitting algorithm is hierarchical and binary20. In each node, an optimal splitting variable $x_j$ is selected among all input variables and the best split point $s$ is calculated to minimize the residual sum of squares (RSS, Equation 1)9:

$$\min_{j \in \{1,\dots,p\},\, s} \mathrm{RSS}(j,s), \qquad \mathrm{RSS}(j,s) = \sum_{x_i \in R_1(j,s)} (y_i - \bar{y}_1)^2 + \sum_{x_i \in R_2(j,s)} (y_i - \bar{y}_2)^2 \qquad \text{(Eq. 1)}$$

where $R_1 = \{x \mid x_j < s\}$ and $R_2 = \{x \mid x_j > s\}$ are the two regions that result from splitting the current space at point $s$ along the axis $x_j$, $p$ is the number of input variables, and $\bar{y}_1$ and $\bar{y}_2$ are the corresponding mean values of the response (y) in these regions. This procedure is applied recursively to the resulting sub-regions until a full tree is created, yielding the best possible separation of the input data30. For the Random Forest implementation, the SciKit-Learn module for Python was used31.
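
To make Equation 1 concrete, the sketch below searches for the split point of a single input variable that minimizes the RSS. It is a simplified illustration, not the SciKit-Learn implementation: the function name `best_split`, the use of midpoints between consecutive distinct values as candidate split points, and the toy data are assumptions.

```python
import numpy as np


def best_split(x, y):
    """Return (split_point, rss) minimizing the RSS of Eq. 1 for one variable."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    values = np.unique(x)                          # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2.0  # midpoints (assumption)
    best_s, best_rss = None, np.inf
    for s in candidates:
        left, right = y[x < s], y[x > s]           # responses in R1 and R2
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best_rss:
            best_s, best_rss = s, rss
    return best_s, best_rss


# Toy data: the response changes between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
split_point, rss = best_split(x, y)
print(split_point, rss)
```

A full tree would repeat this search over all candidate variables at each node and recurse on the two resulting regions.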

In the Random Forest, each tree in the ensemble is built from a sample drawn with replacement (a bootstrap sample) from the training set. This procedure is called bagging: given a standard training set D of size n, m new training sets of size n’ are generated by sampling from D uniformly and with replacement32. In addition, when a node is split during the construction of a tree, the split is not chosen among all features. Instead, the split that is picked is the best split among a random subset of the features of size z, set by the user through the ‘max_features’ parameter of the algorithm. As a result of this randomness, the bias of the forest usually increases slightly (with respect to the bias of a single non-random tree), but, owing to averaging, its variance decreases, usually more than compensating for the increase in bias and hence yielding an overall better model. In the end, the classifiers are combined by averaging their probabilistic predictions, instead of letting each classifier vote for a single class31.

The first step in building a Random Forest classifier is the selection of the parameters that yield the best classification results. The selected parameters were 2000 estimators and a ‘max_features’ value (the number of randomly selected features considered at each split) fixed according to the heuristic suggested by Breiman32 for classification: the square root of the total number of input features, $\sqrt{n_{\text{features}}}$. As a further randomization procedure, before fitting any RF, the samples were randomly divided into training and test groups, leaving 30% of the total samples as test samples, to evaluate the true predictive power of the proposed methodology. The samples were divided using the Stratified Shuffle Split implementation in SciKit-Learn, which keeps the proportion of S10 and S100 samples in both subsets. Twenty training/test subsets were created.
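
A configuration along the lines described above could be set up as in the following sketch, which assumes a recent SciKit-Learn version. The data are synthetic stand-ins for the fluorescence matrix, the helper `evaluate_forest` and the random seeds are hypothetical, and the usage call deliberately uses fewer trees and splits than the 2000 estimators and twenty subsets reported in the text so that it runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedShuffleSplit


def evaluate_forest(X, y, n_estimators=2000, n_splits=20, test_size=0.30, seed=0):
    """Fit one forest per stratified train/test split; return test accuracies."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    accuracies = []
    for train_idx, test_idx in splitter.split(X, y):
        forest = RandomForestClassifier(
            n_estimators=n_estimators,
            max_features="sqrt",  # square root of the number of input features
            bootstrap=True,       # each tree is grown on a bootstrap sample
            random_state=seed,
        )
        forest.fit(X[train_idx], y[train_idx])
        accuracies.append(forest.score(X[test_idx], y[test_idx]))
    return np.array(accuracies)


# Synthetic stand-in: 12 "S10" rows and 51 "S100" rows with 40 features each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (12, 40)), rng.normal(3.0, 1.0, (51, 40))])
y = np.array([0] * 12 + [1] * 51)

# Reduced settings for a quick run; the defaults mirror the values in the text.
accuracies = evaluate_forest(X, y, n_estimators=50, n_splits=3)
print(accuracies.mean())
```

After fitting, the `feature_importances_` attribute of a forest is one way to rank the input variables, which may help when looking for a small set of informative excitation/emission pairs.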

3. Results and Discussions

3.1 2D Fluorescence Spectroscopy measurements

Diesel S10 and S100 show a fluorescence signal without any sample pretreatment, mostly because of their aromatic molecules. Such measurements have been explored for analytical purposes by many researchers5, 9, 24, 33; however, their selectivity toward the analytes of interest is often reduced, mainly because of extensive spectral overlap or the presence of quenching interferences5. Figure 2 shows typical 2D fluorescence spectra of Diesel S10 and S100 samples (panels (a) and (b), respectively). The data were normalized using Standard Normal Variate (SNV), so that each spectrum, after scaling, had a mean equal to zero and a standard deviation equal to 1 (ref 33). SNV is a commonly applied scaling method for optical data, especially for NIR spectra9 and fluorescence spectroscopy measurements15. The main difference between the two groups is related to the relative intensity of the fluorescence peaks and their location. Diesel S10 samples showed lower fluorescence intensity than Diesel S100 samples, indicating that the more aggressive hydrodesulfurization process removed molecules that emit fluorescence5. Previous analyses showed that, despite this quantitative difference between the fluorescence data of the two diesel types, this information (fluorescence peak intensity and location) must be complemented in order to develop a sensor for diesel streams with real applicability.
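
The SNV scaling mentioned above can be sketched in a few lines of Python. This is a minimal sketch on synthetic data: the function name `snv` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used.

```python
import numpy as np


def snv(spectra):
    """Standard Normal Variate: scale each row to zero mean and unit std."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)  # sample std (assumption)
    return (spectra - mean) / std


# Synthetic unfolded spectra: 5 samples x 1995 excitation/emission pairs.
rng = np.random.default_rng(0)
spectra = rng.random((5, 1995)) * 1000.0
scaled = snv(spectra)
print(scaled.mean(axis=1), scaled.std(axis=1, ddof=1))
```

Because the scaling is applied row by row, each spectrum is normalized independently of the other samples.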

3.2 Chemometric Analysis

The chemometric analysis has two main purposes: (i) to provide useful qualitative information about the data and (ii) to propose chemometric models for the classification of diesel samples that have undergone different hydrodesulfurization processes.

The first objective, i.e., the exploratory analysis, is carried out by PCA. This technique is first applied to each sample group individually and then to both groups as a single dataset. The quality of the condensed information provided by the fluorescence data can be better evaluated for each diesel class using a Scores Plot, in which each diesel sample is located in a Cartesian plane whose axes correspond to the first and second PCs. This plot allows visualization of the dispersion of the samples as a function of the corresponding PCs. Figure 3 shows the Scores Plots for Diesel S10 and S100, evaluated individually and combined. Diesel S100 has higher dispersion than Diesel S10, whose samples concentrate along the score axes (in these plots all triplicate points are shown). As can be seen in the individual Scores Plots for Diesel S10 and Diesel S100, there are intensity differences between the samples, represented by the distances between the samples in the plot. The large differences between S100 samples are probably due to variations in sample composition caused by a wider concentration range of polyaromatic molecules. In contrast, Diesel S10 samples showed little dispersion, confirming that after thorough desulfurization, diesel streams have a more uniform composition. The combined plot confirms that the information contained in the fluorescence data differs between the two classes of diesel, as two regions can be clearly identified. Since the qualitative information contained in each sample group is significantly different, classification of Diesel S10 and Diesel S100 samples using 2D fluorescence spectroscopy measurements appears viable.
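
The scores used in such a plot can be computed as in the following sketch, which applies PCA through a singular value decomposition of the mean-centered data. It is an illustration on synthetic data, not the MATLAB routine used in this work: the function name `pca_scores`, the choice of mean-centering as the only column preprocessing, and the simulated group offset are assumptions.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Return PCA scores and explained-variance fractions via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # mean-center each column
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates on the PCs
    explained = S ** 2 / (S ** 2).sum()              # fraction of variance per PC
    return scores, explained[:n_components]


# Synthetic combined dataset: 12 "S10" rows followed by 51 "S100" rows.
rng = np.random.default_rng(0)
s10 = rng.normal(0.0, 1.0, (12, 1995))
s100 = rng.normal(2.0, 1.0, (51, 1995))
scores, explained = pca_scores(np.vstack([s10, s100]))
print(scores.shape, explained)
```

Plotting the first column of `scores` against the second, with one marker style per group, gives a scores plot of the kind discussed above.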

The first step in calibrating the PLS-DA model is selecting the number of Latent Variables (LVs) that minimizes the error rate. This is carried out by cross-validation, dividing the training set into 5 cross-validation groups. Figure 4 shows the error rate in cross-validation as a function of the number of LVs. As can be seen in Figure 4, using only 2 LVs it is possible to fit models that have a 0% error rate and zero 'not assigned' samples. Using more LVs did not improve the classification power of the methodology and actually made the predictions worse, indicating that any LV beyond the second was introducing noise into the model. Cross-validated and fitting results are similar, indicating that the PLS-DA classification model can be regarded as reliable and stable, since the classification performance is not affected by samples being taken out of the calibration set during the cross-validation procedure. The two selected latent variables explain almost 99% of the data variance (LV1: 90.72%; LV2: 7.86%) of the twenty-seven independent variables. The confusion matrix obtained in cross-validation can be seen in Table 1. As the confusion matrix confirms, there were no 'not assigned' samples in the cross-validation routine, and all samples were correctly classified.

For a graphical evaluation of the classification model parameters, the ROC curves and the sensitivity versus specificity plots (as a function of the class thresholds) are analyzed. Figure 5 shows these parameters for the fitted model. A ROC curve plots the true positive rate against the false positive rate of a classifier as a classification parameter changes (in this case, the model threshold). The space is divided into two major regions: the area above the diagonal and the area below it. Any point on the diagonal represents a classifier with the same accuracy as a random guess for the binary classification problem. A point below the diagonal represents a classifier that performs worse than a random guess, so a classifier with those parameters would be useless. A point above the diagonal represents a classifier whose accuracy is better than a random guess. The ideal classifier is the one that achieves 100% true positives with zero false positives, correctly classifying all the data. This ideal classifier is represented in the ROC space as a point in the upper left corner, with a True Positive Rate of 1 and a False Positive Rate of 0. As can be seen in the ROC curves (Figure 5), the classification method was very successful, yielding a point in the upper left corner that represents maximum sensitivity and specificity. The PLS-DA model was able to solve the binary classification problem with perfect accuracy.

The selected discriminating thresholds for the Diesel S100 and Diesel S10 classes were 0.1982 and +0.1982, respectively. Plotting the calculated response ŷ for each sample and each class shows how the samples were distributed relative to the thresholds (Figure 6).

As a final validation step of the fitted PLS-DA model, a test set of samples not used during calibration was used to evaluate the true predictive capability of the model. Sixteen samples (13 for Diesel S100 and 3 for Diesel S10) had their ŷ calculated for each of the classes and were assigned to a class based on the previously calculated thresholds. Figure 7 shows the calculated responses of both the test and the training sets, along with the class thresholds. As before, there were no misclassifications in the test set, and both sensitivity and specificity were maximized. Classification parameters derived from the test set confirm the classification performance previously achieved by internal validation on the training samples. Therefore, the calibrated PLS-DA model can be considered reliable and stable, since its performance on future samples is expected to be comparable to that achieved on the training samples.

Once the PLS-DA methodology had been evaluated, the non-parametric RF classification models were tested. The RF classifiers were calibrated on all 20 training subsets and tested on the 20 random testing subsets, using 2000 estimators. Initially, all available fluorescence pairs were used as input variables for the classification algorithm, and their predictive performance was evaluated by the number of wrongly assigned test samples. In this first attempt, all test samples from all testing subsets were correctly classified, as can be seen in the confusion matrix (Table 2). This motivated a further study to evaluate whether the number of input variables could be reduced while maintaining the good results.

The RF implementation for Python used here has a feature that measures the importance of input variables, based on the variables selected as best splitting variables by all estimators in the forest and on the final score of the calibrated trees using those variables. The more often a variable is selected as the best splitting variable among trees, and the better the final score of the trees using this variable, the greater the relative importance of that variable. This 'relative importance' is normalized to a vector of total length one, and each variable receives a value that represents its importance among all others. The subsequent experiments used decreasing numbers of input variables, at each attempt selecting around 90% of the top previous inputs, ranked by their importance. After some iterations, a group of inputs with almost the same relative importance was revealed, still yielding perfect classification results. Without a single variable standing out in importance, it is reasonable to assume that, in this particular case, the classification of diesel samples from different streams does not rely specifically on one or two excitation/emission pairs; rather, many pairs throughout the whole spectrum are capable of correctly classifying all calibration and test samples. Figure 8 presents the twelve Ex/Em pairs selected at the iteration where the relative importance is almost the same for all variables, meaning that they were selected equally often as best splitting variables among trees and that the trees calibrated with these variables achieved similarly high scores. Finally, four of these twelve pairs were selected for visual representation. Figure 9 illustrates the fluorescence intensity for the different Ex/Em pairs, relative to the sulfur content of the samples in ppm. As can be seen, any of those pairs could be used as an input variable for the classification of samples with perfect results.

The selection of specific fluorescence excitation/emission pairs for diesel sample classification is one of the most significant findings of this study. Although we showed that it is possible to work with the full fluorescence spectrum to create an online sensor for diesel classification, the use of more than 1500 variables, collected by expensive and fragile equipment that requires specialized care, does not represent the optimum case. Reducing the number of input variables in the models also reduces the time required for data acquisition, the complexity of the equipment, the data noise, and the computational time. The selection of variables is also a major step in the development of a customized sensor designed specifically for this classification problem. With the four variables selected by the RF methodology, we can now advance to the construction of a customized sensor based only on these four excitation/emission pairs for the classification of diesel samples. Comparing the variable selection power of the two classification methodologies we applied, the RF implementation was superior to and easier to use than PLS-DA.
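
The importance-driven reduction of inputs described above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes scikit-learn's `RandomForestClassifier` and its impurity-based `feature_importances_` attribute, which may differ in detail from the importance measure described in the text. The helper `prune_by_importance`, its parameters `keep_fraction` and `min_features`, the synthetic data, and the use of 200 trees (instead of the 2000 estimators reported above) are all hypothetical choices made to keep the example small.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def prune_by_importance(X, y, keep_fraction=0.9, min_features=4,
                        n_estimators=200, seed=0):
    """Repeatedly refit a random forest, keeping the top-ranked inputs each round."""
    selected = np.arange(X.shape[1])             # indices of the inputs still in use
    while True:
        forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        forest.fit(X[:, selected], y)
        if len(selected) <= min_features:
            return selected, forest
        # Keep roughly 90% of the current inputs, never fewer than min_features.
        n_keep = max(min_features, int(keep_fraction * len(selected)))
        order = np.argsort(forest.feature_importances_)[::-1]  # most important first
        selected = selected[order[:n_keep]]

# Synthetic data: 40 samples, 30 inputs, only the first five separate the classes.
rng = np.random.default_rng(0)
y = np.array([0] * 30 + [1] * 10)
X = rng.normal(size=(40, 30))
X[:, :5] += 4.0 * y[:, None]

selected, forest = prune_by_importance(X, y)
train_accuracy = forest.score(X[:, selected], y)  # training accuracy, optimistic by design
print(sorted(selected.tolist()), train_accuracy)
```

In a real study each reduced input set would be scored on held-out test subsets, as in the procedure described above, rather than on the training data used here for brevity.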

4. Conclusion

Determining the sulfur content of diesel samples is an obstacle to process optimization, since these measurements are usually time-consuming and require specialized personnel and equipment. Therefore, the development of a new sensor and/or the application of a more efficient technique, able to satisfactorily qualify diesel streams on-line, at low cost and with easy operation, is a significant advance for the refining industry.

The 2D fluorescence spectroscopy measurements of Diesel S10 and Diesel S100 samples gave good results as a source of information for classifying two streams of diesel fuel with different sulfur content. PCA results showed that fluorescence measurements of Diesel S10 samples differ significantly, in qualitative terms, from those of Diesel S100. These qualitative differences allow sensors to be proposed for the classification of diesel samples. This conclusion resulted from the calibration of PLS-DA and RF models with 100% correct classification of Diesel S100 and S10 samples, confirmed by a test set of samples not used for calibration, using only twenty-seven Excitation/Emission pairs (PLS-DA) and fewer than four for RF. As the test set achieved classification performance similar to that of the internal validation on the training samples, the calibrated models can be considered reliable and stable, since the expected performance on future samples should be comparable to that achieved on the training samples.

This result represents a significant advance in diesel characterization, especially for control purposes. Although the methodology does not allow the determination of which fluorescence pairs (Excitation/Emission) are most correlated with the sulfur content of diesel samples, it is possible to confirm that fluorescence data can be applied to develop an on-line sensor able to correctly classify diesel streams into subgroups: one that is within specification regarding sulfur concentration (as Diesel S10) and another that is not (S100). Although sulfur concentration is our main concern and was the only analytical measurement we had, this classification is based on all the changes that occur during hydrodesulfurization (which removes the fluorescence-emitting molecules present, many of which are not sulfurous) and not only on the concentration changes of sulfurous compounds. Therefore, we cannot confirm that the selected Ex/Em pairs are directly correlated with sulfurous compounds rather than with any other fluorophore removed during HDS.

Finally, both classification methodologies applied were able to drastically reduce the number of fluorescence pairs needed for sample classification. The Random Forest implementation gave better results for variable selection, easily selecting fewer than 12 Excitation/Emission pairs that could individually classify samples based on the different fluorescence intensities between groups. The selection of roughly 0.5% of the original variable dataset is an important step in the construction of a customized sensor for online diesel stream monitoring.

AUTHOR INFORMATION

Corresponding Author
*Email address: [email protected] (L. Ranzan)

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

ACKNOWLEDGMENT

Authors are grateful to Refinery Alberto Pasqualini and Petrobras for supplying the samples and to their employees for specialized support.

ABBREVIATIONS

LV, latent variable; PCA, Principal Component Analysis; PC, principal component; PLS-DA, Partial Least Squares Discriminant Analysis; ROC, Receiver Operating Characteristics; RF, Random Forest; ULSD, ultra-low sulfur diesel; S10, sample group with less than 10 ppm of sulfur content; S100, sample group with average 100 ppm of sulfur content; WL, Wilks’ Lambda.

Table 1. PLS-DA confusion matrix obtained in cross validation with 5 groups split using the venetian blinds procedure.

                    Experimental Class
Predicted Class     Diesel S100    Diesel S10
Diesel S100         38             0
Diesel S10          0              9
‘not assigned’      0              0

Table 2. Confusion matrix obtained for all testing subsets for the Random Forest classifier.

                    Experimental Class
Predicted Class     Diesel S100    Diesel S10
Diesel S100         15             0
Diesel S10          0              4
‘not assigned’      0              0

Figure 1. Wilks’ Lambda values for each of the 1995 variables of the original training set of fluorescence spectral data. As the number of variables is too large to display below the graphic, this information was suppressed (left). Wilks’ Lambda values of the 27 selected variables (right).
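
For readers who want to reproduce a ranking like the one in Figure 1, the sketch below computes a per-variable Wilks' Lambda. It assumes the univariate form, in which Lambda is the within-class sum of squares divided by the total sum of squares, so values near 0 indicate strong class separation and values near 1 indicate none. Whether the authors used exactly this form is an assumption; the function name `wilks_lambda` and the synthetic data are hypothetical.

```python
import numpy as np

def wilks_lambda(X, y):
    """Univariate Wilks' Lambda per column: within-class SS divided by total SS."""
    total_ss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
    within_ss = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xk = X[y == label]                                   # samples of one class
        within_ss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return within_ss / total_ss

# Synthetic data: 6 variables, only the first two differ between the classes.
rng = np.random.default_rng(2)
y = np.array([0] * 20 + [1] * 20)
X = rng.normal(size=(40, 6))
X[:, :2] += 5.0 * y[:, None]

lam = wilks_lambda(X, y)
ranked = np.argsort(lam)          # most discriminating variables first
print(lam.round(3), ranked[:2])
```

Selecting the variables with the lowest values, for example `ranked[:27]` on a full spectrum, would give a reduced input set of the kind shown in the right panel.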

Figure 2. Average 2D fluorescence spectra for Diesel S10 (a) and Diesel S100 (b), normalized.

Figure 3. Scores Plot of Diesel S10 and Diesel S100 evaluated individually (left) and combined (right).

Figure 4. Error rate as a function of the number of LVs in cross-validation with 5 cross-validation groups.

Figure 5. ROC curve (left) and plots of sensitivity (blue) and specificity (red) values as the class threshold is changed (right), for Diesel S100 (upper) and Diesel S10 (lower) classes.
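
The quantities plotted in Figure 5 can be computed as in the following sketch, which sweeps a class threshold over a set of calculated responses and records sensitivity, specificity, and the corresponding ROC points. The response values, the threshold grid, and the function names `sensitivity_specificity` and `roc_points` are hypothetical and serve only to illustrate the definitions.

```python
import numpy as np

def sensitivity_specificity(y_true, y_score, threshold):
    """Sensitivity and specificity when samples with score > threshold are called positive."""
    pred = y_score > threshold
    pos, neg = y_true == 1, y_true == 0
    sensitivity = np.sum(pred & pos) / np.sum(pos)    # true positive rate
    specificity = np.sum(~pred & neg) / np.sum(neg)   # true negative rate
    return sensitivity, specificity

def roc_points(y_true, y_score, thresholds):
    """(false positive rate, true positive rate) pairs, one per threshold."""
    points = []
    for t in thresholds:
        sens, spec = sensitivity_specificity(y_true, y_score, t)
        points.append((1.0 - spec, sens))
    return points

# Hypothetical calculated responses for a perfectly separable two-class problem.
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_score = np.array([-0.9, -0.7, -0.6, -0.4, 0.5, 0.8, 1.1])
thresholds = np.linspace(-1.5, 1.5, 13)

points = roc_points(y_true, y_score, thresholds)
sens, spec = sensitivity_specificity(y_true, y_score, 0.0)
print(sens, spec, points[0], points[-1])
```

For separable classes such as these, any threshold lying between the two groups of responses gives the upper left corner of the ROC space (false positive rate 0, true positive rate 1), which is the situation reported for the fitted model.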

Figure 6. Calculated response for class 2 (Diesel S10) versus calculated response for class 1 (Diesel S100), with each class threshold discriminated.

Figure 7. Calculated response for class 2 (Diesel S10) versus calculated response for class 1 (Diesel S100) for both training and test sets, with each class threshold discriminated.

Figure 8. Relative importance for the twelve Excitation/Emission pairs selected as input variables for the RF classification procedure.

Figure 9. The final four pairs selected by the Random Forest methodology based on fluorescence spectroscopy (Diesel S10: black; Diesel S100: blue).
