Visualization of Solar Cell Library Space by Dimensionality Reduction

Nov 28, 2018 - ... *H.S. e-mail: [email protected]. This article is part of the Materials Informatics special issue. Cite this:J. Chem. Inf. Mode...

Visualization of Solar Cells Libraries Space by Dimension Reduction Methods Omer Kaspi, Abraham Yosipof, and Hanoch Senderowitz J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00552 • Publication Date (Web): 28 Nov 2018 Downloaded from http://pubs.acs.org on December 1, 2018



Journal of Chemical Information and Modeling

Visualization of Solar Cells Libraries Space by Dimension Reduction Methods

Omer Kaspi,a Abraham Yosipof,b* and Hanoch Senderowitza*

Affiliations:
a Department of Chemistry, Bar-Ilan University, Ramat-Gan 5290002, Israel
b Department of Information Systems, College of Law & Business, Ramat-Gan, P.O.Box 852 Bnei Brak 5110801, Israel

* Corresponding authors
Abraham Yosipof Email: [email protected]
Hanoch Senderowitz Email: [email protected]

Abstract

Visualizing high dimensional data by projecting them into a 2D or 3D space is a popular approach in many scientific fields including computer aided drug design and cheminformatics. In contrast, dimensionality reduction techniques have been far less explored in materials informatics. Yet, similar to their usefulness in analyzing the space of, e.g., drug-like molecules, such techniques could provide useful insights into the material space, including an intuitive grasp of the overall distribution of samples, the identification of interesting trends (such as the formation of material clusters and the presence of activity cliffs and outliers) and the rational navigation through this space in search of new materials. In this work we present the first application of four dimensionality reduction techniques, principal component analysis (PCA), Kernel PCA, Isomap and Diffusion map, for visualizing and analyzing a part of the materials space populated by solar cells made of metal oxides. Solar cells in general and metal-oxide based solar cells in particular hold the promise of contributing to the world’s search for clean and affordable energy resources. With the exception of PCA, these methods were seldom used for the visualization of the chemistry space and almost never for the visualization of the material space. For this purpose, we integrated five metal oxide-based solar cell libraries into a uniform database and subjected it to dimensionality reduction by all methods, comparing their performances using various criteria such as maintaining the local environment of samples and the clustering structure in the low dimension space. We also looked at the number of outliers produced by each method and analyzed common outliers. We found that PCA performs best in terms of the ability to correctly maintain the local environment of samples, whereas Isomap does the best job in assigning class membership based on the identity of nearest neighbors, i.e., it is the best classifier. We also found that many of the outliers identified by all methods could be rationalized. We suggest that the methods used in this work could be extended to study other types of solar cells, thereby setting the ground for further analysis of the photovoltaic space as well as other regions of the material space.

Introduction

The ability to synthesize large collections of compounds / materials (e.g., by combinatorial methods), measure their activities and characterize them, either experimentally or computationally, by multiple parameters (i.e., descriptors) has resulted in multiple compound / material collections that form high-dimensional spaces.1 While undoubtedly information rich, such collections are often challenging to analyze. First, many techniques for data analysis, including similarity search, diversity analysis, outlier removal and clustering, require the calculation of distances between the different samples that make up the space. However, the concept of distance becomes less meaningful as the dimensionality of the space increases.2 Next, machine-learning models for predicting “activities” from descriptors are typically constructed based on a subset of the descriptors. This is done to prevent over-fitting and / or chance correlation. As the dimensionality of the space increases, so does the number of subsets. Thus, the search for the optimal subset (i.e., the subset that would lead to the most predictive model) becomes increasingly difficult and less likely to converge. Finally, data visualization is clearly not feasible in a high dimensional space. Yet visualization is of paramount importance for providing an intuitive understanding of the overall data structure, including the formation of clusters and the identification of outliers and activity cliffs.3-7


To cope with these problems, multiple dimensionality reduction methods have been developed. The goal of these methods is to project a high-dimensional space into a lower-dimensional (typically 2-dimensional (2D) or 3-dimensional (3D)) space with minimal loss of information. By minimal loss of information we mean that the method should strive to preserve as much of the structure of the high-dimensional data as possible in the low-dimensional embedding. As described below, different methods describe the data structure in different ways and achieve this goal using different algorithms. The traditional approach for dimensionality reduction is Principal Component Analysis (PCA),8 which assumes linear correlation between the dimensions and therefore cannot adequately handle complex nonlinear data. In the last two decades, a large number of nonlinear techniques for dimensionality reduction have been proposed, such as Kernel PCA,9 Isomap,10 Diffusion maps,11 Locally Linear Embedding (LLE),12 Self-Organized Map (SOM)13 and Generative Topographic Map (GTM)14 to name but a few. In contrast to traditional linear techniques, the nonlinear techniques have the ability to deal with complex nonlinear data.

Dimensionality reduction methods have found many uses in manipulating compound collections related to drug discovery. For example, Kireeva et al.15 suggested that mapping various databases to a global chemical space by using dimensionality reduction methods may help chemists to choose compounds to be purchased or synthesized in order to enrich “in-house” databases, to select subsets for screening campaigns, and to assess the overlap of different databases. However, far fewer examples are available for the usage of dimensionality reduction methods in the field of materials informatics. Srinivasan et al.16 used the Isomap method for the design of multi-component alloys. Yosipof et al.17 used PCA for the visualization and removal of outliers in solar cell libraries. In another study, Yosipof et al.4 used PCA and SOM to compare two solar cell libraries made of metal oxides which differed only by the presence of a MoO3 layer in one of them. Experimentally it was observed that the addition of this layer had an overall favorable effect on the photovoltaic profile. However, PCA identified a sub-population of cells which were indifferent to the addition of MoO3. Finally, Ulaczyk et al.18 used PCA for clustering thin film photovoltaic (PV) cells.


The main objective of the present study is to establish the usefulness of data reduction methods in materials informatics and in particular in the field of solar cells. Solar cells are promising devices for meeting a significant part of the world’s demand for clean and affordable energy. At present, solar cells on the market are primarily based on silicon, yet new alternatives are constantly emerging, including organic solar cells,19 dye sensitized solar cells20 and solar cells based on metal oxides (MOs).21 This last class of solar cells features multiple favorable properties including natural abundance of the constituent materials, ease of fabrication and long-term stability. Yet to date such cells do not demonstrate sufficient efficiency in converting sunlight to electricity, thereby requiring the development of new MOs.21 Over the years we have assembled a reasonably large data set of MO-based solar cells largely prepared by the same method and by the same lab (see below).22-26 This is important for ensuring data uniformity. Thus, we chose to focus on such cells in the present work.

MO-based solar cells are manufactured by depositing multiple layers of metal oxides, either uniformly or with a thickness gradient, on a solid support.21 Hence we use the notation X||Y||Z… where X, Y, and Z are the window layer, the absorber layer (which could potentially be made of several metal oxides) and the recombination layer, respectively. Following manufacturing, the cells are characterized by several photovoltaic (PV) “activities” as described below.

In this work, we focus on data visualization and posit that the same well-developed rationale for the usage of this approach in computer aided drug design and cheminformatics applications (e.g., understanding overall data structure and identifying interesting trends, observing activity cliffs and outliers and classifying new samples into existing groups)6, 7, 15 is equally applicable to the solar cell (i.e., photovoltaic; PV) space. As another potential application, one can envision using the reduced solar cell space to select for characterization an already available or synthetically feasible cell that is proximate to a virtual cell which was designed to have optimal PV properties yet is too expensive / complex to be manufactured.

The specific goals of the present study are therefore to: (1) Compare different dimension reduction methods in terms of their ability to preserve the overall data structure of the original, high-dimensional space; (2) Analyze the resulting low-dimensional space; (3) Set the ground for further analysis of the PV space using dimension reduction methods. In order to meet these goals we considered five different metal oxide based PV libraries, namely a TiO2||Cu-O library reported by Anderson et al.,26 a TiO2||Cu2O library reported by Pavan et al.,25 a TiO2||Co3O4||MoO3 library from Majhi et al.,22 a TiO2||Co3O4 library from Majhi et al.,23 and a TiO2||CuO-NiO-In2O3 library.24, 27 Each of these libraries was characterized by seven PV properties. These libraries were integrated into a uniform database which was subsequently projected into a 3-dimensional space using four data reduction techniques, namely, PCA, Kernel PCA, Isomap and Diffusion map. We demonstrate that: (1) All methods led to a good separation between the libraries in the reduced space; (2) PCA performed the best in terms of the ability to maintain the local environment of samples; (3) The non-linear Isomap algorithm performed the best in terms of the ability to correctly assign class membership based on the identity of nearest neighbors; (4) Many of the outliers observed by all methods could be rationalized. The results presented in this work may set the ground for further analysis of the PV space.


Methods

Workflow

In order to meet the research objectives, a knowledge discovery process was developed for the integration of the libraries and for the data mining procedure (Figure 1). The process is divided into four main sections: (1) Data integration; (2) Data curation; (3) Data mining; (4) Performance evaluation via data analysis.

Figure 1: Knowledge discovery process

Data Integration

Data integration is the process of joining together data from various sources that may not share the same data structure. In this stage the user needs to define which data features to maintain and which to discard. The goal of this stage is to maintain as many common features as possible; keeping features that are not common to all samples may lead to the analysis disregarding the samples that lack them. In this work five metal oxide solar cell libraries were integrated (see Table 1), namely: a TiO2||Cu-O library reported by Anderson et al.26 which contained 169 cells, a TiO2||Cu2O library reported by Pavan et al.25 which contained 338 cells, a TiO2||Co3O4||MoO3 library from Majhi et al.22 which contained 169 cells, a TiO2||Co3O4 library from Majhi et al.23 which contained 338 cells and a TiO2||CuO-NiO-In2O3 library24, 27 which contained 338 cells, for a total of 1,352 cells. The common features that were kept are the seven experimentally measured photovoltaic properties (i.e., PV activities):

(1) The short circuit photocurrent density (JSC). This is the current through the solar cell when the voltage across the cell is zero, that is, when the cell’s termini are connected and the resistance is zero.
(2) The open circuit photovoltage (VOC). This is the maximum voltage a solar cell can provide to an external circuit.
(3) The internal quantum efficiency (IQE). IQE is calculated by dividing JSC by the maximum theoretical calculated photocurrent.
(4) The maximum photovoltaic power producible by a solar cell (Pmax). Pmax is the power at the point on the I-V curve at which maximum power is being produced by the cell.
(5) The fill factor (FF), which is the available power at the maximum power point divided by the product of VOC and JSC.
(6) The series resistance (Rs).
(7) The shunt resistance (Rsh).
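As a small illustration of definitions (3) and (5) above, the following Python sketch computes FF and IQE from the other quantities. The function names and the numerical values are hypothetical and are not taken from the libraries; the only assumption is that consistent units are used.

```python
def fill_factor(v_oc, j_sc, p_max):
    """Return FF = Pmax / (Voc * Jsc) as a fraction.

    Units must be consistent, e.g. v_oc in V, j_sc in A/cm^2 and
    p_max in W/cm^2.
    """
    if v_oc <= 0 or j_sc <= 0:
        raise ValueError("v_oc and j_sc must be positive for a working PV cell")
    return p_max / (v_oc * j_sc)


def internal_quantum_efficiency(j_sc, j_max_theoretical):
    """Return IQE = Jsc / (maximum theoretical photocurrent density)."""
    if j_max_theoretical <= 0:
        raise ValueError("j_max_theoretical must be positive")
    return j_sc / j_max_theoretical


# Hypothetical cell (illustrative numbers only):
# Voc = 0.40 V, Jsc = 200 uA/cm^2, Pmax = 32 uW/cm^2.
ff = fill_factor(v_oc=0.40, j_sc=200e-6, p_max=32e-6)
print(f"FF = {100 * ff:.1f} %")
```

Multiply the returned fractions by 100 to compare with percentage values such as those reported for FF and IQE.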

For a more detailed discussion of these parameters and how they are measured, see references 25 and 26. The ranges of the seven PV activities are presented in Table 1, their correlations in Table 2 and their distributions as box plots in Figure 2.

Table 1: Activity ranges of the seven experimentally measured photovoltaic properties

Property          TiO2||Cu-O         TiO2||Cu2O         TiO2||Co3O4||MoO3    TiO2||Co3O4          TiO2||CuO-NiO-In2O3
VOC [mV]          31-380             6.6-354            24-620               172-443              111-509
JSC [µA cm-2]     73-290             13.9-406           10-25                5.5-11               7-54
Pmax [µW cm-2]    0.02-1.02          0-1.26             0.0018-0.11          0.016-0.042          0.015-0.11
FF [%]            23-63              0.16-40            23-41                32-53                26-41
Rs [Ohm cm2]      18-23x10^3         9-0.5x10^6         5x10^3-0.6x10^6      8x10^3-0.5x10^6      5x10^3-0.6x10^6
Rsh [Ohm cm2]     10^3-1.9x10^6      5x10^3-0.6x10^6    0.7x10^6-3.3x10^6    2.8x10^6-12x10^6     0.15x10^6-2.6x10^6
IQE [%]           0.57-1.16          0.1-2.67           0.05-0.3             0.06-0.32            0.02-0.47


Table 2: Correlation between activities

         VOC      JSC      Pmax     FF       Rs       Rsh      IQE
VOC      1
JSC      -0.18    1
Pmax     -0.00    0.91     1
FF       0.23     0.05     0.35     1
Rs       0.42     -0.49    -0.44    -0.20    1
Rsh      0.30     -0.52    -0.41    0.43     0.36     1
IQE      -0.23    0.97     0.84     -0.03    -0.5     -0.51    1


Figure 2: Box plot of the 7 PV activities per library: Voc, Rsh, Rs, Pmax, IQE, FF and Jsc in panels A-G, respectively. The box plots show the median values (solid horizontal line), 50th percentile values (box outline), the lower and upper quartile (whiskers, vertical lines) and outlier values (red +).


Data Curation

Once the data had been integrated into a single, standardized format, the data itself had to be curated. Data curation is crucial for two main reasons: (i) both publicly available and in-house data sets may contain multiple errors; (ii) even a small number of errors may compromise the quality of the models obtained from the data.28-32 In this study, data curation involved the removal of cells with missing data, non-PV cells, and cells which suffer from errors caused by the measurement process, such as negative resistance or unrealistic values. The number of remaining cells in each library and the total number of cells in the database can be seen in Table 3. The data curation stage led to the removal of 187 cells (13.83% of the integrated database), leaving a total of 1,165 valid cells (86.17%).

Table 3: Library membership before and after data curation.

Library                # cells    # cells removed    # cells retained    % data retained
TiO2||Co3O4||MoO3      169        19                 150                 88.76
TiO2||Co3O4            338        74                 264                 78.11
TiO2||Cu-O             169        27                 142                 84.02
TiO2||Cu2O             338        10                 328                 97.04
TiO2||CuO-NiO-In2O3    338        57                 281                 83.14
Total                  1352       187                1165                86.17
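The curation rules described above can be sketched as a simple filter. This is only an illustration under stated assumptions: the field names are hypothetical, a "non-PV cell" is assumed here to mean a cell with non-positive VOC or JSC, and the "unrealistic values" criterion is omitted because the text does not define it.

```python
import math

# Hypothetical field names; the source does not specify a schema.
FEATURES = ("Voc", "Jsc", "IQE", "Pmax", "FF", "Rs", "Rsh")


def is_valid_cell(cell):
    """Return True if a cell (dict of PV activities) passes the curation rules."""
    values = [cell.get(name) for name in FEATURES]
    # Rule 1: drop cells with missing data (None or NaN).
    if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
        return False
    # Rule 2 (assumed definition): a non-PV cell produces no voltage or current.
    if cell["Voc"] <= 0 or cell["Jsc"] <= 0:
        return False
    # Rule 3: a negative resistance indicates a measurement error.
    if cell["Rs"] < 0 or cell["Rsh"] < 0:
        return False
    return True


def curate(cells):
    """Split cells into (retained, removed) lists."""
    retained = [c for c in cells if is_valid_cell(c)]
    removed = [c for c in cells if not is_valid_cell(c)]
    return retained, removed


# Made-up records, one per rule.
cells = [
    {"Voc": 380, "Jsc": 290, "IQE": 1.1, "Pmax": 1.0, "FF": 40, "Rs": 2.0e4, "Rsh": 1.0e6},
    {"Voc": 120, "Jsc": 80, "IQE": 0.6, "Pmax": 0.1, "FF": 30, "Rs": -5.0, "Rsh": 1.0e5},
    {"Voc": 0, "Jsc": 0, "IQE": 0.0, "Pmax": 0.0, "FF": 0, "Rs": 1.0e4, "Rsh": 1.0e5},
    {"Voc": 200, "Jsc": None, "IQE": 0.3, "Pmax": 0.2, "FF": 35, "Rs": 1.0e4, "Rsh": 1.0e5},
]
retained, removed = curate(cells)
print(f"retained {len(retained)} of {len(cells)} cells "
      f"({100 * len(retained) / len(cells):.2f} %)")
```

Applied per library, such a filter would yield the kind of before/after counts reported in Table 3.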

Dimension reduction methods

This research discusses the application and comparison of four data reduction techniques for the purposes of visualization and of uncovering underlying patterns in the PV space of the integrated database.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA)8 is a common linear technique for dimension reduction and visualization. PCA reduces the dimensionality of a data set while retaining as much as possible of its original variance. PCA assumes that there exists a linear subspace d that can capture the original data using new variables, and identifies this space by transforming the original (potentially correlated) variables into a new set of orthogonal variables called Principal Components (PCs). PCs are typically produced in an ordered manner so that the first PC retains the largest portion of the variance of the original set while subsequent PCs retain increasingly smaller portions not accounted for by the previous PCs. In this work we used PCA as implemented in MATLAB version R2017b.

Kernel Principal Component Analysis (Kernel PCA)

Kernel PCA9 is a reformulation of traditional linear PCA in a high-dimensional space that is constructed using a kernel function. The main limitation of “classical” PCA is that it seeks a linear subspace d which can represent the samples originating from the high dimensional space. However, if the original data cannot be represented linearly, PCA will fail to represent it correctly. To handle this problem, Kernel PCA maps the original data into a space Φ (whose number of dimensions is larger than that of the original space) by using a kernel function. After the conversion, the application of PCA to the new (kernel) space provides Kernel PCA with the ability to construct a nonlinear mapping. In this work we used Kernel PCA as implemented in the MATLAB Toolbox for Dimensionality Reduction.3

Isomap

Isomap10 is a convex nonlinear dimension reduction method that preserves pair-wise Geodesic distances between data points. The traditional dimension reduction method, PCA, mainly aims to maintain the original pair-wise Euclidean distances, but does not take into account the distribution of the neighboring data points. If the distribution of the data points in the original space takes the form of a non-linear manifold, traditional PCA that relies on Euclidean distances will misrepresent the true distances between any two points. Figure 3 illustrates a classic theoretical case of a space that is a non-linear manifold, also known as a “Swiss roll”. The red arrow represents the Euclidean distance between two points, A and B, whereas the green arrow represents the actual distance between them along the manifold (note that one cannot get from point A to point B by cutting straight across the manifold). The distance in green is the Geodesic distance. The Geodesic distance is computed under the assumption that the distance between two adjacent points is linear, i.e., that the curvature of the manifold between them is negligible. However, Euclidean and Geodesic distances between samples that are not immediate neighbors may vary greatly. The Isomap method includes two steps. First, a connectivity table is generated which contains the Geodesic distance between any two points. Second, using this connectivity table, dimensionality reduction is performed by Multidimensional scaling (MDS).33 In this work we used the Isomap algorithm as implemented in the MATLAB Toolbox for Dimensionality Reduction.3
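The two Isomap steps described above can be sketched in Python with NumPy. This is a minimal textbook-style sketch, not the MATLAB toolbox implementation used in this work; the function name, the k-nearest-neighbor graph construction and the Floyd-Warshall shortest-path step are assumptions chosen for brevity.

```python
import numpy as np


def isomap(X, n_neighbors=10, n_components=3):
    """Embed the rows of X with a basic Isomap (kNN graph + classical MDS)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Pairwise Euclidean distances in the original space.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    # Step 1a: keep only edges to each sample's k nearest neighbors.
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nearest = np.argsort(D, axis=1)[:, 1:n_neighbors + 1]
    rows = np.repeat(np.arange(n), n_neighbors)
    G[rows, nearest.ravel()] = D[rows, nearest.ravel()]
    G = np.minimum(G, G.T)  # make the graph symmetric

    # Step 1b: Geodesic distances as shortest paths (Floyd-Warshall).
    for k in range(n):
        G = np.minimum(G, G[:, k:k + 1] + G[k:k + 1, :])
    if np.isinf(G).any():
        raise ValueError("neighborhood graph is disconnected; increase n_neighbors")

    # Step 2: classical MDS on the Geodesic distance matrix.
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (G ** 2) @ J              # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


# Toy check: 30 points on a unit semicircle (a curved 1D manifold in 2D).
t = np.linspace(0.0, np.pi, 30)
arc = np.column_stack([np.cos(t), np.sin(t)])
Y = isomap(arc, n_neighbors=2, n_components=1)
end_to_end = abs(Y[0, 0] - Y[-1, 0])
print(end_to_end)  # close to the arc length (about pi), not the chord length 2
```

In the toy example the two end points are separated by a Euclidean distance of 2, whereas the embedding recovers approximately the distance along the curve, which is the behavior illustrated by the green arrow in Figure 3.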

Figure 3: A synthetic example of a manifold subspace.

Diffusion map

Diffusion map11 offers an extension of Isomap for reducing the dimensions of samples placed on a manifold. Both algorithms rely on the neighbors and their density in order to determine the samples’ distribution. This allows for the calculation of a Geodesic distance, thereby removing the assumption that the manifold is linear. The difference between the two methods is that Isomap may “short-circuit” samples, that is, erroneously determine that two samples are neighbors due to their relative proximity. As an example, let us consider whether sample B (located at the end of the red arrow in Figure 3) is a neighbor of sample A (located at the beginning of the red arrow). Isomap will examine the proximity of B to A and, assuming that all other samples on the manifold are located further away from A than B, will declare B a neighbor of A. Diffusion map, on the other hand, will also examine whether or not the neighbors of A are also neighbors of B and declare B and A neighbors only if they have common neighbors. This is done by applying a Random-Walk Markov chain from which a stable transition probability matrix can be deduced. Neighbors of each sample are those for which the transition probabilities from this sample are the highest. This leads to a new distribution of samples that is more robust to noise and “short-circuits”. Importantly, non-linear manifolds are correctly represented. In this work we used Diffusion map as implemented in the MATLAB Toolbox for Dimensionality Reduction.3

Dimension Reduction Methods Comparison

Figure 4 provides a comparison between the characteristics of the four dimension reduction methods considered in this work and highlights the relations between them. PCA is the simplest, among the earliest and one of the most common methods for dimensionality reduction. Kernel PCA is based on PCA, and its added value comes from using the kernel function to relax limitations of PCA, such as its inability to handle nonlinear data. In a similar manner, Isomap is less restrictive than Kernel PCA. While both methods can be applied to non-linear data, Isomap uses Geodesic rather than Euclidean distances between points. Geodesic distances preserve the neighboring distribution and therefore are more suitable for non-linear manifolds, but are computationally more expensive and could fail if the manifold is non-convex. Finally, the Diffusion map uses Markov chains to calculate the neighbor geometry and could produce better results than Isomap, for example in cases involving “short-circuits”.
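To make the relations between PCA, Kernel PCA and Diffusion map more concrete, the following NumPy sketch implements textbook versions of the three methods. It is not the MATLAB code used in this work: the RBF kernel, the parameters `gamma`, `sigma` and `t`, the z-scoring of the features and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np


def pca(X, n_components=3):
    """Linear PCA: project centered data onto the top covariance eigenvectors."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    explained = eigvals[order] / eigvals.sum()        # variance fraction per PC
    return Xc @ eigvecs[:, order], explained


def kernel_pca(X, n_components=3, gamma=1.0):
    """Kernel PCA with an RBF kernel (the kernel choice is an assumption)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                                    # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))


def diffusion_map(X, n_components=3, sigma=1.0, t=1):
    """Diffusion map: eigenvectors of a random-walk (Markov) matrix on the data."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    d = K.sum(axis=1)
    # The symmetric matrix S shares its eigenvalues with the Markov matrix P = D^-1 K.
    S = K / np.sqrt(np.outer(d, d))
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    psi = eigvecs / np.sqrt(d)[:, None]               # right eigenvectors of P
    # Skip the trivial first eigenvector (eigenvalue 1, constant entries).
    return psi[:, 1:n_components + 1] * eigvals[1:n_components + 1] ** t


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 7))                  # random stand-in for 7 PV activities
X[:, 1] = X[:, 0] + 0.1 * X[:, 1]             # make two features correlated
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # z-scoring is an assumption
Y_pca, explained = pca(Z)
Y_kpca = kernel_pca(Z, gamma=1.0 / Z.shape[1])
Y_dm = diffusion_map(Z, sigma=2.0)
print(explained)
```

All three functions reduce to an eigendecomposition; they differ only in the matrix being decomposed (covariance, centered kernel matrix, or Markov transition matrix), which mirrors the progression described in the comparison above.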


Figure 4: Dimensionality reduction techniques pyramid.

Statistical parameters

The ability of the different methods to reduce the dimensionality of the data while preserving their original distribution in the high dimensionality space was measured in two ways:

(1) Measuring the success rate for identifying a sample’s original library based on its 1-nearest neighbor (1NN), 3-nearest neighbors (3NN) and 5-nearest neighbors (5NN) in the low-dimensional representation. The averaged percentage over all samples is computed. These measures evaluate the ability of the low-dimensional projection to preserve the clustered structure of the data in the original space.5

(2) The Trust measure,34, 35 which evaluates the ability of the low dimensional representation to preserve the high-dimensional data structure. This measure (with values ranging between 0 and 1) compares the sample’s neighborhood in the high dimension and the low dimension spaces. High Trust values occur when the neighborhood in the low dimensional representation is similar to that in the high dimension. The Trust measure is given by Equation 1:

$$\mathrm{Trust} = \frac{1}{n} \cdot \frac{1}{k} \sum_{i=1}^{n} \sum_{j=1}^{k} \delta_j\left(s_{i,j}, x_i\right) \qquad (1)$$

Where $n$ represents the number of samples, $k$ the number of nearest neighbors, $s_{i,j}$ represents neighbor $j$ of sample $i$ in the low dimension representation, and $x_i$ is the vector of the neighbors of sample $i$ in the high dimension. $\delta_j$ is defined to be 1 if $s_{i,j}$ is found in $x_i$ and 0 otherwise. In this work, the neighborhood is defined as the 100 nearest neighbors.

Outlier removal

One important application of dimension reduction is the identification and removal of outliers. Arguably, this is better performed in the low dimensional space where outliers can be easily and intuitively visualized and verified. In this work, the outlier removal procedure is based on the k Nearest Neighbor (kNN) algorithm. For this purpose, we defined the outliers using a threshold distance $D_T$ between a query sample and its nearest neighbors from the same class (i.e., library), calculated as follows (Equation 2):

$$D_T = \bar{y} + Z\sigma \qquad (2)$$

Where $\bar{y}$ is the average Euclidean distance between each sample and its $k$ nearest neighbors from the same class (library), $\sigma$ is the standard deviation of the $k$ nearest neighbors Euclidean distances, and $Z$ is an arbitrary parameter that controls the significance level. The number of nearest neighbors ($k$) must be chosen carefully; it must be small enough to reflect only the closest neighbors, while on the other hand, too small a number may not properly identify outliers that are within a cluster. In this work the number of neighbors used to determine outliers was set to 10 (10 samples are 3%-7% of the samples from the individual libraries). We set the value of $Z$ to 3, which formally places the distance threshold at the mean plus three standard deviations (a common definition of an outlier in statistics). If a sample’s distance from its nearest neighbor from the same class exceeds the threshold distance, the sample is considered an outlier.
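Equations 1 and 2 can be sketched in Python with NumPy as follows. The function names are hypothetical, and the outlier test implements only one possible reading of the text: for each library, the mean distance of every sample to its k nearest same-library neighbors is compared with the library-wide mean plus Z standard deviations of these values. The text leaves some room for interpretation here, so this should not be taken as the exact procedure used in this work.

```python
import numpy as np


def knn_indices(X, k):
    """Return (indices of the k nearest neighbors per row, distance matrix)."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)          # a sample is not its own neighbor
    return np.argsort(d, axis=1)[:, :k], d


def trust(X_high, X_low, k=100):
    """Equation 1: mean fraction of low-dimensional neighbors s_ij found in x_i."""
    low_nn, _ = knn_indices(np.asarray(X_low, dtype=float), k)
    high_nn, _ = knn_indices(np.asarray(X_high, dtype=float), k)
    hits = [len(set(low_nn[i]) & set(high_nn[i])) for i in range(len(low_nn))]
    return sum(hits) / (len(low_nn) * k)


def outlier_mask(X_low, labels, k=10, z=3.0):
    """Equation 2 (one reading): flag samples whose mean distance to their k
    nearest same-class neighbors exceeds D_T = mean + z * std for that class.
    Every class must contain more than k samples."""
    X_low = np.asarray(X_low, dtype=float)
    labels = np.asarray(labels)
    mask = np.zeros(len(X_low), dtype=bool)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        _, d = knn_indices(X_low[idx], k)
        knn_dist = np.sort(d, axis=1)[:, :k].mean(axis=1)
        threshold = knn_dist.mean() + z * knn_dist.std()
        mask[idx] = knn_dist > threshold
    return mask


# Toy data: one tight cluster plus a single far-away sample.
rng = np.random.default_rng(1)
cluster = rng.normal(size=(30, 3))
X_low = np.vstack([cluster, [[50.0, 50.0, 50.0]]])
labels = ["lib"] * 31
mask = outlier_mask(X_low, labels, k=10, z=3.0)
self_trust = trust(cluster, cluster, k=5)   # identical spaces give Trust = 1
print(mask.sum(), self_trust)
```

In the toy data only the far-away sample is flagged, and comparing a space with itself gives the maximal Trust value, which is a quick sanity check for both functions.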


Results and Discussion The five datasets were integrated into a single database as explained in the methods section. In addition, each library was assigned with a class label in the integrated database. The correlation coefficients between all PV properties calculated over the entire curated database are presented in Table 2 and suggest that except of three significant correlations between Pmax -JSC , JSC –IQE and Pmax-IQE the other properties are totally non-correlated. This suggests that a complete visualization of this 6-7 dimension dataset would require a total of 15-21 2D plots or 20-35 3D plots which is unfeasible. Thus, the integrated 7-dimensions database was subjected to the different dimension reduction methods. The resulting PV 3D spaces produced by each method are presented in Figure 5A (PCA), 5B (Kernel PCA), 5C (Isomap) and 5D (Diffusion map). As the results in Figure 5 clearly indicate, all methods were able to separate the different libraries in the reduced space. In order to provide more quantitative estimates, the results for the Trust, 1NN, 3NN and 5NN measures are presented in Table 4. The resulting Trust measure of the low dimensional embedding was found to be between 0.70-0.82. The best dimension reduction method based on this criterion is PCA (with Trust =0.82) followed by Diffusion map and Kernel PCA (both with Trust =0.78), while the Isomap (with Trust =0.70) lags behind. These results indicate a good preservation of the structure and of the local information (i.e., neighborhood formation) of the high dimensional data in the low embedding for the four methods. Another way to assess the quality of the dimension reduction methods is by comparing the 1NN, 3NN and 5NN metrics before and after dimensionality reduction. In the original space values of 0.95, 0.95 and 0.94 were obtained for 1NN, 3NN and 5NN, respectively. 
Following dimensionality reduction, the results for 1NN were found to be between 0.91 and 0.92, for 3NN between 0.90 and 0.93, and for 5NN between 0.90 and 0.93. On average, the three Nearest Neighbors results (1NN, 3NN and 5NN) were found to be between 0.91 and 0.93. Thus, all dimensionality reduction methods largely maintain the original class structure and do not artificially generate new clusters. The best dimension reduction method based on this criterion is Isomap, with an average of 92.7% of the cells correctly assigned to their original library. The second is PCA (average Nearest Neighbors of 91.2%), the third is Diffusion map (average Nearest Neighbors of 91.0%), and the last is Kernel PCA (average Nearest Neighbors of 90.5%). To ensure that the dimensionality reduction methods did not, by chance, organize the samples according to their original libraries (i.e., generate “artificial” clusters), kNN classification was also performed in the high dimensional space. As Table 4 suggests, and as expected, the high dimensional space allows for the best classification of samples into their parent libraries based on 1, 3 and 5 Nearest Neighbors. Based on these two parameters, it is evident that Isomap best maintains the initial clustering structure of the data (although all methods do a decent job in this respect).
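The plot counts quoted earlier in this section (15-21 2D plots, 20-35 3D plots) follow from counting the distinct pairs and triplets of properties for six or seven dimensions. A minimal Python check is sketched below; the function name is ours and is used for illustration only.

```python
from math import comb


def n_axis_combinations(n_properties: int, plot_dim: int) -> int:
    """Number of distinct plots needed to show every combination of axes."""
    return comb(n_properties, plot_dim)


# Six or seven properties, shown two or three at a time.
for n in (6, 7):
    print(n, "properties:",
          n_axis_combinations(n, 2), "2D plots,",
          n_axis_combinations(n, 3), "3D plots")
```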

Figure 5: 3D representation of the integrated database after different dimensionality reduction methods. Subfigures A-D represent the different libraries plotted in the reduced space following PCA, Kernel PCA, Isomap and Diffusion map, respectively.


Table 4: Statistical parameters for the dimension reduction methods.

Method           Trust   1NN Classification   3NN Classification   5NN Classification
PCA              0.82    0.92                 0.92                 0.90
Kernel PCA       0.78    0.91                 0.90                 0.90
Isomap           0.70    0.92                 0.93                 0.93
Diffusion map    0.78    0.92                 0.91                 0.90
Original space   ----    0.95                 0.95                 0.94
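The kind of measures reported in Table 4 can be illustrated with a short sketch. It assumes the commonly used rank-based definition of trustworthiness and a leave-one-out, majority-vote kNN classifier; the exact definitions used in this work are given in the Methods section and may differ. The data are synthetic (three Gaussian "libraries" in seven dimensions), PCA is done by a plain SVD, and all function names are hypothetical.

```python
import numpy as np


def pairwise_distances(X):
    """Euclidean distance matrix with +inf on the diagonal (no self-neighbors)."""
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)
    return d


def trustworthiness(X_high, X_low, k=5):
    """Rank-based trustworthiness: 1.0 means no false neighbors in the embedding."""
    n = X_high.shape[0]
    rows = np.arange(n)[:, None]
    order_high = np.argsort(pairwise_distances(X_high), axis=1)
    rank_high = np.empty((n, n), dtype=int)
    rank_high[rows, order_high] = np.arange(1, n + 1)  # rank 1 = nearest neighbor
    nn_low = np.argsort(pairwise_distances(X_low), axis=1)[:, :k]
    # Penalize low-dimensional neighbors that are not among the k nearest originally.
    penalty = np.clip(rank_high[rows, nn_low] - k, 0, None).sum()
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))


def knn_loo_accuracy(X, labels, k):
    """Leave-one-out accuracy of a majority-vote kNN classifier."""
    nn = np.argsort(pairwise_distances(X), axis=1)[:, :k]
    predicted = np.array([np.bincount(votes).argmax() for votes in labels[nn]])
    return float((predicted == labels).mean())


# Synthetic stand-in for the integrated database (not the paper's data).
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 7, [8.0] * 7, [-8.0] * 7])
labels = np.repeat(np.arange(3), 40)
X = centers[labels] + rng.normal(size=(120, 7))

# PCA to three dimensions via SVD of the centered data.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:3].T

print("Trust (k=5):", round(trustworthiness(X, X_pca, k=5), 3))
for k in (1, 3, 5):
    print(f"{k}NN accuracy: original={knn_loo_accuracy(X, labels, k):.2f}, "
          f"reduced={knn_loo_accuracy(X_pca, labels, k):.2f}")
```

Comparing the kNN accuracy before and after reduction, as done in the loop above, mirrors the "Original space" row of Table 4: if the reduced space scored markedly higher than the original one, the embedding would be suspected of creating artificial clusters.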

To provide additional insight into the PV space, hierarchical clustering (Ward's method) was performed on the low dimension spaces produced by all methods. Since the results of this analysis were largely similar, we only present those obtained for the space generated by Diffusion map. As Figure 6 suggests, the five libraries clustered into four clusters. Analysis of these clusters in terms of the IQE parameter revealed that cluster 2 (in green) is composed of solar cells with high IQE values from two libraries, TiO2||Cu2O and TiO2||Cu-O. This cluster is characterized by cells having a thick absorber layer (i.e., thickness of the Cu-O and Cu2O layers). Cluster 3 (in blue) is composed of cells with medium IQE values, again from the same two libraries. This cluster is characterized by a thin absorber layer. Cluster 1 (in red) and cluster 4 (in cyan) are composed of cells with low IQE values and include cells from the TiO2||Co3O4||MoO3, TiO2||Co3O4 and TiO2||CuO-NiO-In2O3 libraries. This insight can be highly useful for predicting the IQE of new solar cells.
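The clustering step described above can be sketched as follows. This is a deliberately naive implementation of Ward's criterion (merge the pair of clusters whose union increases the within-cluster sum of squares the least), written for clarity rather than speed; it is not the code used in this work. The 3D coordinates and the "IQE" values are synthetic and hypothetical.

```python
import numpy as np


def ward_clusters(X, n_clusters):
    """Greedy agglomeration using Ward's minimum-variance merge cost."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                gap = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                # Increase in within-cluster sum of squares if a and b are merged.
                cost = na * nb / (na + nb) * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


# Synthetic low-dimensional embedding with four well-separated groups.
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
true_group = np.repeat(np.arange(4), 15)
X_low = centers[true_group] + rng.normal(scale=0.3, size=(60, 3))
# Hypothetical property values used only to summarize each cluster.
iqe = np.array([5.0, 20.0, 40.0, 60.0])[true_group] + rng.normal(size=60)

labels = ward_clusters(X_low, n_clusters=4)
for c in np.unique(labels):
    members = labels == c
    print(f"cluster {c}: n={members.sum()}, mean IQE={iqe[members].mean():.1f}")
```

For real datasets an optimized library routine (for example, SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`) would normally be preferred over this cubic-time loop.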


Figure 6: Hierarchical clustering analysis on the low dimensionality space produced by the diffusion map method.

One of the key applications of dimension reduction is the use of the low dimension space for outlier detection and removal. Following dimension reduction, the reduced representations were subjected to an outlier removal procedure as described in the Methods section. Figures 7A, 7C, 7E and 7G show the outliers (marked with “x”) identified in the reduced PV space generated by PCA, Kernel PCA, Isomap and Diffusion map, respectively. The number and percentage of outliers found and removed by each dimension reduction method are presented in Table 5. The outlier removal procedure removed between 84 and 133 cells (7.2%-11.4% of the integrated database), depending on the dimension reduction method. This percentage of outliers is similar to that observed in other studies performed on the same data, which used different outlier removal methods. Thus, Yosipof et al.,36 Nahum et al.37 and Kaspi et al.38 removed between 6% and 17% outliers from the libraries composing the integrated database.


Figure 7: Integrated database after different dimensionality reduction methods. Subfigures A, C, E and G represent the integrated database plotted in the reduced space produced by PCA, Kernel PCA, Isomap and Diffusion map, respectively, with the outliers identified in each case marked with an “x”. Subfigures B, D, F and H represent the different libraries after their outliers have been removed (see text for more details).


The four dimension reduction methods identified different numbers of outliers, with Kernel PCA identifying the largest number (133, high sensitivity to outliers), followed by PCA (102), Isomap (89) and Diffusion map (84, low sensitivity to outliers) (Table 5). The numerical activity data of these outliers differ from those of the bulk of the samples, yet a scientific rationalization of their divergent behavior is not always possible. In this respect, it is more interesting to look at the outliers identified by all methods, under the assumption that their divergence from the bulk is not the consequence of numerical peculiarities of any specific method but may indeed be rooted in their physical/chemical characteristics. Analyzing the outliers removed by each method, we found 27 outliers common to all four methods and 82 outliers common to at least three methods. Six outliers identified by all methods in the TiO2||Cu-O library are located at the edge of the library and have high IQE values. This region of the library is characterized by a thin absorber layer (thickness of the Cu-O layer < 100 nm), whereas for this library high IQE values are typically associated with a thick absorber layer. The six outliers are presented in Figure 8A and can be seen to belong to two different clusters, one of four outliers and the other of two. Two of these six outliers were previously reported by Yosipof et al.17 For the TiO2||Co3O4||MoO3 library, four of the six common outliers were found to have a statistically significantly lower VOC (average of 245 mV; P-value < 0.05, two-tailed Student's t-test) than their neighboring cells of similar layer thickness (average VOC of 475 mV; see Figure 8B). These outliers could therefore be classified as “activity cliffs”.39 For the TiO2||Co3O4 library (Figure 8D), one common outlier likely has a measurement error in Rsh (a value two orders of magnitude higher than the rest of the data). For the TiO2||Cu2O library (Figure 8C), three common outliers were found to be markedly different from the bulk of the library. These outliers were found to have a significantly low FF (average of 1.36 ± 2) and a Pmax of zero, while the rest of the library cells have an average FF of 27.4 ± 3.9 and an average Pmax of 0.35 ± 0.33. These results may indicate a measurement error or, alternatively, suggest that these cells are not photovoltaic. This last example highlights the potential role visualization can play in data curation. Clearly, non-photovoltaic cells should have been removed from the analysis as part of the curation stage but somehow slipped through the cracks. Thus, overall we could at least partially rationalize 14 of the 27 common outliers (six, four, one and three from the TiO2||Cu-O, TiO2||Co3O4||MoO3, TiO2||Co3O4 and TiO2||Cu2O libraries, respectively).
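The notion of outliers "common to all four methods" and "common to at least three methods" amounts to simple set operations on the per-method outlier lists. The sketch below uses made-up cell identifiers purely for illustration; they are not the outlier lists of this work.

```python
from collections import Counter

# Hypothetical cell identifiers flagged by each method (illustrative only).
outliers = {
    "PCA": {3, 7, 12, 18, 25},
    "Kernel PCA": {3, 7, 12, 18, 30, 41},
    "Isomap": {3, 7, 18, 25},
    "Diffusion map": {3, 7, 12, 44},
}

# Cells flagged by every method.
common_to_all = set.intersection(*outliers.values())

# Count how many methods flagged each cell, then keep those flagged by >= 3.
votes = Counter(cell for flagged in outliers.values() for cell in flagged)
in_at_least_three = {cell for cell, n in votes.items() if n >= 3}

print("common to all methods:", sorted(common_to_all))
print("common to at least three methods:", sorted(in_at_least_three))
```

By construction, the cells common to all methods are a subset of those common to at least three, which is consistent with the 27 and 82 outliers reported above.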


Figure 8: A diffusion map-based 3D representation of the A) TiO2||Cu-O; B) TiO2||Co3O4||MoO3; C) TiO2||Cu2O and D) TiO2||Co3O4 libraries. Common outliers (i.e., outliers identified in the low dimension representations produced by all methods) are marked by “x”.

Table 5: Summary results for the outlier removal procedures.

Method          # Outliers   # Cells after outlier removal   % of the database removed
PCA             102          1063                            8.75%
Kernel PCA      133          1032                            11.41%
Isomap          89           1076                            7.63%
Diffusion map   84           1081                            7.21%
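The entries of Table 5 can be cross-checked with a few lines of arithmetic. The database size used below (1,165 cells) is not stated in this section; it is inferred here as the sum of the removed and remaining cells in each row, and the recomputed percentages agree with the tabulated ones to within 0.01 percentage points.

```python
# (number of outliers, cells remaining after removal), copied from Table 5.
table5 = {
    "PCA": (102, 1063),
    "Kernel PCA": (133, 1032),
    "Isomap": (89, 1076),
    "Diffusion map": (84, 1081),
}

# Inferred database size: removed + remaining should be identical for all rows.
totals = {m: n_out + n_left for m, (n_out, n_left) in table5.items()}
percent_removed = {m: 100 * n_out / totals[m] for m, (n_out, _) in table5.items()}

for method, pct in percent_removed.items():
    print(f"{method}: total={totals[method]}, removed={pct:.2f}%")
```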

Outliers identified by each method were removed, in turn, from the original seven-dimensional space to produce four new “clean” integrated databases with a total of 1,063, 1,032, 1,076 and 1,081 cells for PCA, Kernel PCA, Isomap and Diffusion map, respectively (see Table 5). Each of the clean databases was subjected to the corresponding dimension reduction technique, and the resulting low-dimensional spaces are presented in Figures 7B, 7D, 7F and 7H. The statistical parameters (Trust, 1NN, 3NN and 5NN classification) were recalculated for the clean datasets and are presented in Table 6. The Trust results demonstrate a statistically significant improvement upon outlier removal (P-value < 0.05 by paired-sample t-test; compare the results of Tables 4 and 6). This result indicates that removing the outliers from the integrated database had a significant positive effect on the ability of the dimension reduction methods to preserve the high-dimensional data structure in the low dimensional embedding. Similar to the Trust measure, the 1NN, 3NN and 5NN metrics following outlier removal (Table 6) were also found to be statistically significantly higher than prior to removing the outliers (P-value < 0.01 by paired-sample t-test; compare the results of Tables 4 and 6). This result indicates that removing the outliers increased the ability of the dimension reduction methods to maintain the cluster structure of the data in the low dimension representation. After the outliers have been removed, Isomap remains the method that best differentiates between the original libraries, while PCA remains the method producing the highest Trust value, i.e., the method that best maintains the local topology of the original dataset. To ensure that the above-discussed improvement in performance resulted from the removal of outliers rather than from the arbitrary removal of samples, we repeated the sample removal procedure and the subsequent analyses 10 times, each time removing at random a number of samples equal to the number of outliers. No statistically significant differences in performance were observed for any of the dimensionality reduction techniques or for any of the metrics.

Table 6: Statistical parameters for the dimension reduction methods following outlier removal.

Method          Trust   1NN Classification   3NN Classification   5NN Classification
PCA             0.83    0.92                 0.93                 0.92
Kernel PCA      0.79    0.93                 0.92                 0.92
Isomap          0.71    0.93                 0.94                 0.94
Diffusion map   0.80    0.92                 0.92                 0.92
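As a rough illustration of the paired-sample t-test mentioned above, the Trust values of Tables 4 and 6 can be paired by method. The sketch below works from the two-decimal values printed in the tables, so it is only an approximate check and not a reproduction of the authors' computation; the critical value 3.182 is the standard two-tailed value for alpha = 0.05 and 3 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(before, after):
    """t statistic of a paired-sample t-test (after minus before)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Order: PCA, Kernel PCA, Isomap, Diffusion map.
trust_before = [0.82, 0.78, 0.70, 0.78]  # Table 4
trust_after = [0.83, 0.79, 0.71, 0.80]   # Table 6

t_trust = paired_t(trust_before, trust_after)
T_CRIT_DF3 = 3.182  # two-tailed critical value, alpha = 0.05, df = 3

print("t =", round(t_trust, 2), "| significant at 0.05:", t_trust > T_CRIT_DF3)
```

With only four pairs the test has three degrees of freedom, so significance here rests on the improvement being small but very consistent across the four methods.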

The Isomap method was found to outperform the other dimensionality reduction methods in terms of its ability to maintain the cluster structure of the data in the low dimension representation. However, a limitation of both the Isomap and the Diffusion map methods is the lack of an explicit mapping function between the high and low dimensional spaces. This limitation prevents one from placing new data on an already existing map. Thus, when new data become available, the map must be rebuilt from scratch. Due to their characteristics, Isomap and Diffusion map are expected to perform better than PCA and Kernel PCA on data located on a non-linear manifold. If, for a particular dataset, all methods perform roughly the same, one can assume that the original data are linear. In this case, however, linearity in the sense that all original data points lie in only one or two dimensions is ruled out, since the eigenvalue analysis yielded three principal components that together explain 84% of the original variance (PC1 = 45%, PC2 = 22% and PC3 = 17%). Table 7 provides a comparison between the different metrics for all methods considered in this work.

Table 7: A comparison between the different metrics for all dimensionality reduction methods considered in this work. Outlier sensitivity refers to the number of outliers identified by each method.

Method          Trustworthiness   Classification Accuracy   Outlier Sensitivity
PCA             High              High (>90%)               Medium
Kernel PCA      Medium            High (>90%)               High
Isomap          Low               High (>90%)               Medium
Diffusion Map   Medium            High (>90%)               Low
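Two points from the discussion preceding Table 7 can be made concrete with PCA: the explained-variance ratios of the leading components, and the explicit linear map that lets a new sample be placed on an existing embedding (the property that Isomap and Diffusion map lack). The sketch below uses synthetic seven-dimensional data and assumes the properties are standardized before PCA, which may differ from the preprocessing used in this work; `project` is a hypothetical helper name.

```python
import numpy as np

# Arithmetic check of the reported figures: PC1 + PC2 + PC3 = 84%.
reported = [45, 22, 17]
print("reported cumulative variance:", sum(reported), "%")

# Synthetic stand-in for a seven-property table (not the paper's data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 1.5])
X = latent @ rng.normal(size=(3, 7)) + 0.3 * rng.normal(size=(200, 7))

# Standardize each property, then obtain principal axes from the SVD.
mean, std = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / std
_, s, Vt = np.linalg.svd(Z, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)  # explained-variance ratio per component
components = Vt[:3]                  # explicit linear map from 7-D to 3-D


def project(x_new):
    """Place a new 7-D sample on the existing 3-D PCA map."""
    return ((x_new - mean) / std) @ components.T


print("explained variance of first 3 PCs:", np.round(explained[:3], 2))
print("cumulative:", round(float(explained[:3].sum()), 2))
print("new sample ->", np.round(project(X[0] + 0.1), 2))
```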


Conclusions

To the best of our knowledge, this work presents the first comparison between different dimensionality reduction methods for the visualization and analysis of the PV space. We found that the older PCA technique outperformed more modern techniques in its ability to correctly maintain the local environment of samples, as manifested in the “Trust” parameter. This is perhaps not surprising since even a non-linear dataset may exhibit local linearity. In contrast, the non-linear Isomap method does the best job in assigning class membership based on the identity of nearest neighbors, i.e., its embedding supports the best kNN classification. In addition, Diffusion map was found to be the least sensitive to outliers (i.e., it identified the smallest number of outliers). The visualization of data in a low dimension space, following reliable dimensionality reduction, allows for the “intuitive” identification of interesting trends that are not easily deduced from numeric values obtained from more quantitative data mining tools. When applied to the PV space, this may open new opportunities for understanding structure-function relationships, which might prove useful for unveiling the factors affecting solar cell performance and for designing new solar cells. Furthermore, the ability to visualize both real cells and virtual cells on the same chart may suggest ways by which previous knowledge can direct future developments. Moreover, there could be cases where dimensionality reduction would lead to better separation between groups. This separation may not necessarily be more “true” than the separation obtained in the original space but may be more useful for further analyses. The lowering of the Trust value coupled with the minimal change in the 1, 3 and 5NN values suggests that this might be the case here.

While in the present work we focused on solar cells made entirely from metal oxides and characterized by a small number of descriptors (seven, all experimentally measured), these techniques could be readily used to study other types of solar cells characterized by many more parameters (either measured or calculated), including organic solar cells, dye-sensitized solar cells (DSSCs) and perovskites. High-dimensional spaces for organic cells or DSSCs are particularly achievable since the PV performance of such cells is largely determined by the characteristics of the organic molecule, for which a large number of descriptors are readily calculable. One can also imagine studying different types of solar cells together, thereby setting the grounds for large scale analysis. Finally, we would like to stress, as we have done before, the importance of conducting such analyses in close collaboration with experimentalists in order to provide physics/chemistry-based explanations for the observed trends and to capitalize on the results. We expect that the tools and methods employed in this work will find further use in materials science research.

Acknowledgments

The authors acknowledge COST action CA16235 "Performance and Reliability of Photovoltaic Systems: Evaluations of Large-Scale Monitoring Data" (PEARL-PV) for travel support.

Note

The authors declare no competing financial interest.

References

1. Hill, J.; Mulholland, G.; Persson, K.; Seshadri, R.; Wolverton, C.; Meredig, B., Materials science with large-scale data and informatics: Unlocking new opportunities. MRS Bulletin 2016, 41, 399-409.
2. Aggarwal, C. C.; Hinneburg, A.; Keim, D. A. On the Surprising Behavior of Distance Metrics in High Dimensional Space. Springer: Berlin, Heidelberg, 2001; pp 420-434.
3. Van Der Maaten, L.; Postma, E.; Van den Herik, J., Dimensionality reduction: a comparative review. J. Mach. Learn. Res. 2009, 10, 66-71.
4. Yosipof, A.; Kaspi, O.; Majhi, K.; Senderowitz, H., Visualization Based Data Mining for Comparison Between Two Solar Cell Libraries. Mol. Inf. 2016, 35, 622-628.
5. Sanguinetti, G., Dimensionality Reduction of Clustered Data Sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 2008, 30, 535-540.
6. Ivanenkov, Y. A.; Savchuk, N. P.; Ekins, S.; Balakin, K. V., Computational mapping tools for drug discovery. Drug Discovery Today 2009, 14, 767-775.
7. Balakin, K. V., Pharmaceutical data mining: approaches and applications for drug discovery. John Wiley & Sons: 2009; Vol. 6.
8. Jolliffe, I. T. Principal Component Analysis and Factor Analysis. In Principal Component Analysis; Springer New York: 2002.
9. Schölkopf, B.; Smola, A.; Müller, K.-R., Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation 1998, 10, 1299-1319.
10. Tenenbaum, J. B.; Silva, V. d.; Langford, J. C., A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319-2323.
11. Lafon, S.; Lee, A. B., Diffusion maps and coarse-graining: a unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006, 28, 1393-1403.
12. Roweis, S. T.; Saul, L. K., Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323-2326.
13. Kohonen, T., The self-organizing map. Proceedings of the IEEE 1990, 78, 1464-1480.
14. Bishop, C. M.; Svensén, M.; Williams, C. K. I., GTM: The Generative Topographic Mapping. Neural Computation 1998, 10, 215-234.
15. Kireeva, N.; Baskin, I. I.; Gaspar, H. A.; Horvath, D.; Marcou, G.; Varnek, A., Generative Topographic Mapping (GTM): Universal Tool for Data Visualization, Structure-Activity Modeling and Dataset Comparison. Mol. Inf. 2012, 31, 301-312.
16. Srinivasan, S.; Broderick, S. R.; Zhang, R.; Mishra, A.; Sinnott, S. B.; Saxena, S. K.; LeBeau, J. M.; Rajan, K., Mapping Chemical Selection Pathways for Designing Multicomponent Alloys: an informatics framework for materials design. Sci. Rep. 2015, 5, 17960.
17. Yosipof, A.; Nahum, O. E.; Anderson, A. Y.; Barad, H.-N.; Zaban, A.; Senderowitz, H., Data Mining and Machine Learning Tools for Combinatorial Material Science of All-Oxide Photovoltaic Cells. Mol. Inf. 2015, 34, 367-379.
18. Ulaczyk, J.; Morawiec, K.; Zabierowski, P.; Drobiazg, T.; Barreau, N., Finding Relevant Parameters for the Thin-film Photovoltaic Cells Production Process with the Application of Data Mining Methods. Mol. Inf. 2017, 36, 1600161.
19. Ameri, T.; Dennler, G.; Lungenschmied, C.; Brabec, C. J., Organic tandem solar cells: A review. Energy Environ. Sci. 2009, 2, 347-363.
20. Hagfeldt, A.; Boschloo, G.; Sun, L.; Kloo, L.; Pettersson, H., Dye-Sensitized Solar Cells. Chem. Rev. 2010, 110, 6595-6663.
21. Rühle, S.; Anderson, A. Y.; Barad, H.-N.; Kupfer, B.; Bouhadana, Y.; Rosh-Hodesh, E.; Zaban, A., All-Oxide Photovoltaics. J. Phys. Chem. Lett. 2012, 3, 3755-3764.
22. Majhi, K.; Bertoluzzi, L.; Rietwyk, K. J.; Ginsburg, A.; Keller, D. A.; Lopez-Varo, P.; Anderson, A. Y.; Bisquert, J.; Zaban, A., Thin-Film Photovoltaics: Combinatorial Investigation and Modelling of MoO3 Hole-Selective Contact in TiO2|Co3O4|MoO3 All-Oxide Solar Cells. Adv. Mater. Interfaces 2016, 3, 1500405.
23. Majhi, K.; Bertoluzzi, L.; Keller, D. A.; Barad, H.-N.; Ginsburg, A.; Anderson, A. Y.; Vidal, R.; Lopez-Varo, P.; Mora-Sero, I.; Bisquert, J.; Zaban, A., Co3O4 Based All-Oxide PV: A Numerical Simulation Analyzed Combinatorial Material Science Study. J. Phys. Chem. C 2016, 120, 9053-9060.
24. Shimanovich, K., Combinatorial approach for development of new metal oxides materials for all oxide photovoltaics. arXiv preprint arXiv:1508.04626 2015.
25. Pavan, M.; Rühle, S.; Ginsburg, A.; Keller, D. A.; Barad, H.-N.; Sberna, P. M.; Nunes, D.; Martins, R.; Anderson, A. Y.; Zaban, A.; Fortunato, E., TiO2/Cu2O all-oxide heterojunction solar cells produced by spray pyrolysis. Sol. Energy Mater. Sol. Cells 2015, 132, 549-556.
26. Anderson, A. Y.; Bouhadana, Y.; Barad, H.-N.; Kupfer, B.; Rosh-Hodesh, E.; Aviv, H.; Tischler, Y. R.; Rühle, S.; Zaban, A., Quantum Efficiency and Bandgap Analysis for Combinatorial Photovoltaics: Sorting Activity of Cu–O Compounds in All-Oxide Device Libraries. ACS Comb. Sci. 2014, 16, 53-65.
27. Yosipof, A.; Shimanovich, K.; Senderowitz, H., Materials Informatics: Statistical Modeling in Material Science. Mol. Inf. 2016, 35, 568-579.
28. Fourches, D.; Muratov, E.; Tropsha, A., Trust, But Verify: On the Importance of Chemical Structure Curation in Cheminformatics and QSAR Modeling Research. J. Chem. Inf. Model. 2010, 50, 1189-1204.
29. Olah, M.; Rad, R.; Ostopovici, L.; Bora, A.; Hadaruga, N.; Hadaruga, D.; Moldovan, R.; Fulias, A.; Mracec, M.; Oprea, T. I. WOMBAT and WOMBAT-PK: Bioactivity Databases for Lead and Drug Discovery. In Chemical Biology; Wiley-VCH Verlag GmbH: 2008; pp 760-786.
30. Olah, M.; Mracec, M.; Ostopovici, L.; Rad, R.; Bora, A.; Hadaruga, N.; Olah, I.; Banda, M.; Simon, Z.; Mracec, M., WOMBAT: world of molecular bioactivity. Chemoinformatics in drug discovery 2004, 223-239.
31. Young, D.; Martin, T.; Venkatapathy, R.; Harten, P., Are the Chemical Structures in Your QSAR Correct? QSAR Comb. Sci. 2008, 27, 1337-1345.
32. Isayev, O.; Fourches, D.; Muratov, E. N.; Oses, C.; Rasch, K.; Tropsha, A.; Curtarolo, S., Materials Cartography: Representing and Mining Materials Space Using Structural and Electronic Fingerprints. Chem. Mater. 2015, 27, 735-743.
33. Torgerson, W. S., Multidimensional scaling: I. Theory and method. Psychometrika 1952, 17, 401-419.
34. Venna, J.; Kaski, S. Visualizing gene interaction graphs with local multidimensional scaling. In ESANN, 2006; Vol. 6; pp 557-562.
35. Yosipof, A.; Guedes, R. C.; García-Sosa, A. T., Data Mining and Machine Learning Models for Predicting Drug Likeness and Their Disease or Organ Category. Front. Chem. 2018, 6.
36. Yosipof, A.; Senderowitz, H., k-Nearest neighbors optimization-based outlier removal. J. Comput. Chem. 2015, 36, 493-506.
37. Nahum, O. E.; Yosipof, A.; Senderowitz, H., A Multi-Objective Genetic Algorithm for Outlier Removal. J. Chem. Inf. Model. 2015, 55, 2507-2518.
38. Kaspi, O.; Yosipof, A.; Senderowitz, H., RANdom SAmple Consensus (RANSAC) algorithm for material-informatics: application to photovoltaic solar cells. J. Cheminf. 2017, 9, 34.
39. Maggiora, G. M., On Outliers and Activity Cliffs - Why QSAR Often Disappoints. J. Chem. Inf. Model. 2006, 46, 1535-1535.
