Text Mining Metal–Organic Framework Papers - Journal of Chemical


Article

Text Mining Metal-Organic Framework Papers
Sanghoon Park, Baekjun Kim, Sihoon Choi, Peter G. Boyd, Berend Smit, and Jihan Kim
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00608 • Publication Date (Web): 11 Dec 2017
Downloaded from http://pubs.acs.org on December 13, 2017


Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Chemical Information and Modeling

Text Mining Metal-Organic Framework Papers

Sanghoon Park†, Baekjun Kim†, Sihoon Choi†, Peter G. Boyd*, Berend Smit*, and Jihan Kim†

† Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea

* Laboratory of Molecular Simulation, Institut des Sciences et Ingénierie Chimiques, Valais, Ecole Polytechnique Fédérale de Lausanne (EPFL), Rue de l’Industrie 17, CH-1951 Sion, Switzerland

ABSTRACT

We have developed a simple text mining algorithm that identifies the surface areas and pore volumes of metal-organic frameworks (MOFs) using manuscript html files as inputs. The algorithm searches for common units (e.g. m2/g, cm3/g) associated with these two quantities to facilitate the search. From the sample set data of over 200 MOFs, the algorithm identified 90% and 88.8% of the correct surface area and pore volume values, respectively. Further application to a test set of randomly chosen MOF html files yielded accuracies of 73.2% and 85.1% for the two respective quantities. Most of the errors stem from unorthodox sentence structures that made it difficult to identify the correct data, as well as from bolded notations of MOFs (e.g. 1a) that made it difficult to identify their real names. These types of tools will become useful when it comes to discovering


structure-property relationships amongst MOFs as well as collecting a large set of reference data.

Introduction

Metal-organic frameworks (MOFs) are composed of metal nodes and organic linkers that can connect in many different ways to form crystalline lattice structures1–3. Due to their exceptional properties (e.g. tunability, flexibility, facile synthesis procedures, large surface area), the amount of research devoted to developing new MOFs has increased rapidly in the past decade. The number of experimentally synthesized MOFs has surpassed 70,000, with many of the structures being deposited in the CCDC database and available as public information4. In 2014, Chung et al. developed a MOF database called CoRE MOF, in which over 4,700 MOFs were aggregated and cleaned up from the CCDC database structures, making it easier for researchers to access these materials5. More recently, Moghadam et al. have expanded upon the initial CoRE MOF database with a new MOF database that contains close to 70,000 structures4. The growing number of MOF structures parallels the large number of scientific papers being published on different MOFs, which implies that a significant amount of information about MOFs exists in the form of text data in these papers and has yet to be fully explored. In this work, we have initiated an attempt to data-mine thousands of MOF papers via text mining, where text mining is defined as a methodology that yields high quality data from texts.

Although used extensively in fields such as computer science6,7, biology8–10, and biomedicine9,11,12, text mining has, to the best of our knowledge, not been used as frequently in materials science, and as far as we know, there have not been any other efforts to text mine MOF


papers. Given the advancement of the Materials Genome Initiative launched by the White House in 2011, data in the form of text have the potential to provide additional valuable information to the research community13. In this paper, as a first attempt at using text mining to analyze MOF papers, we have concentrated on accurately identifying two basic properties that are reported in most published MOF papers: surface area (SA) and pore volume (PV). There are a few important reasons why these two quantities were targeted in our analysis: (1) they are highly important quantities that relate to the adsorption properties of MOFs14–16; (2) they are among the most basic quantities obtained from most MOF experiments, and as such, a large amount of SA and PV data is available, as opposed to more specialized properties (e.g. heat capacity, bulk modulus, and adsorption isotherm data of a specific gas); and (3) they have distinct signatures (e.g. the unit m2/g for SA and cm3/g for PV) that make them relatively easy to identify compared to other quantities (e.g. decomposition temperature, whose unit, Kelvin or K, often appears in many different contexts within a paper). With this in mind, the initial goal of this work was to evaluate whether we can construct a general framework that can identify these quantities, which can potentially lead to the loftier goal of collecting arbitrary text data of interest regarding MOFs. We anticipate that this type of effort can (1) help researchers in the field elucidate structure-property relationships from a large available dataset, (2) provide a standardized data set against which to check computational models, and (3) serve as a platform for pursuing new ideas and future directions in MOF research. With regard to points (1) and (2), there have been several computational screening papers on MOFs17–21 for which a wealth of experimental data for comparison purposes would indeed be helpful.


The paper is divided into the following sections. In the Method section, we discuss the algorithm that we have used to obtain SA and PV. In the Results section, the accuracy of our code is verified against a sample set of 100 SA and 107 PV data. Finally, in the Conclusions section, we discuss some of the future work that text mining can open up for MOFs.

Method

Natural language processing (NLP)22,23 is a field of computer science related to information science, linguistics, mathematics, artificial intelligence and robotics; it explores how a computer can be used to understand and manipulate human natural language or speech. In general, concepts from NLP are used in text mining to facilitate the handling of enormous amounts of text. To accomplish this task, raw materials written in human language need to be converted to a format that the computer can process. Subsequently, morphological, lexical, syntactic, semantic, pragmatic and other analyses are used to reach the user-defined goal of the text mining. As this goal varies from one study to another, analyses can be added to or removed from the general NLP paradigm. For our task, with just a couple of pre-selected MOF features (i.e. SA and PV) being mined, only a few NLP components (lexical and syntactic analysis) were used to derive this information. Specifically, our algorithm does not require any artificial intelligence learning, but only needs a mechanical text analysis. The NLP components used in this research are summarized schematically in Figure 1.


Figure 1. The overall text mining scheme used in this work. (a) The input MOF paper in html format is converted to plain text. (b) Tokenization divides the text at each ‘period’ mark, creating sentences. (c) Sentence tokenization further divides each sentence at ‘blank’ spaces. These word-level tokens are stored in a two-dimensional array. (d) The tokens are classified and the useful ones are retained. (e) Our matching algorithm is applied to generate accurate text mining data sets as outputs.
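As a rough illustration of steps (a) to (c) in Figure 1, the sketch below strips html markup and builds the [sentence, token] array. It is a minimal sketch under stated assumptions, not the authors' code: it uses the html.parser module of the Python standard library as a stand-in for a dedicated html library, and the names SKIPPED_TAGS, TextExtractor, html_to_text and tokenize are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tags whose contents are dropped entirely (an illustrative choice,
# not the authors' list of html-related texts).
SKIPPED_TAGS = {"script", "style", "noscript"}


class TextExtractor(HTMLParser):
    """Collect the visible text of an html document, ignoring markup."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Step (a): convert html to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def tokenize(text):
    """Steps (b) and (c): return a list indexed as [sentence][token]."""
    # Split at a period followed by whitespace, so that decimals such as
    # 0.17 stay intact. Abbreviations such as "e.g. " are still split,
    # which a more careful sentence splitter would have to handle.
    sentences = re.split(r"\.\s+", text.strip())
    return [s.rstrip(".").split() for s in sentences if s.strip()]
```

Because superscript tags are removed while their contents are kept, a unit written as m<sup>2</sup>/g in the html ends up as the single token m2/g, which is the form the unit search relies on.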

For this research, the raw texts all come from html files, and as such, the Python 3 library ‘Beautiful Soup 4.0’24,25 was used to eliminate meaningless html-related texts (e.g. ‘ul’, ‘li’, ‘noscript’, ‘a’) that disrupt proper recognition of the texts. Next, the html file was tokenized, where tokenization refers to splitting the text into small word tokens that are considered to be the smallest units that carry meaning. Tokens are stored in a two-dimensional array with the format [sentence, token], with the ‘sentence’ information retained to facilitate the search for PV and SA (explained below). Finally, several lexicon tools (metal or linker name used in MOFs from


the Cambridge Structural Database (CSD)26) or dictionary resources (Python 3 NLTK library27) were used to categorize each of the tokens and to disregard the useless ones. Specifically, four categories (‘MOF name’, ‘Units’, ‘Numerical value’, ‘Keyword’) were chosen as they were the most relevant in accurately obtaining SA and PV (for a more exhaustive text mining procedure, all of the tokens might need to be categorized). (1) ‘MOF name’ refers to any token that might be interpreted as being the name of a MOF, where the rules used to determine a MOF name are described in the Supporting Information (see Figure S1 for schematics). (2) ‘Units’ are tokens that indicate units commonly used for SA and PV, with the majority of the instances being ‘m2/g’ and ‘cm3/g’ for SA and PV, respectively. Different units (e.g. mL/g or cm3/cm3 for pore volume) and other text formats of the same units (e.g. m2•g-1 for surface area, cm3•g-1 for pore volume) are recognized as proper ‘Units’ as well to make the search more complete. In all cases encountered in this work, the units for SA are unique to the surface area and as such, we do not need to contextualize their occurrence. On the other hand, some units that are commonly used in other contexts (e.g. cm3/cm3 for gas uptake) are, in rare cases, used in an unorthodox manner by a few authors to describe pore volume. To differentiate between these contexts, we take advantage of the tendency of authors to use the words ‘pore volume’ (or similar terms such as ‘micropore volume’, ‘mesopore volume’, ‘free volume’ and ‘permanent porosity’) when reporting the pore volume, and as such, our algorithm searches for these words for identification purposes. If none of these words is found, the data are simply disregarded. (3) ‘Numerical value’ refers to numbers that precede the ‘Units’. If more than one numerical value exists in the sequence of tokens preceding the ‘Units’ token, all of these numbers are collected, given that multiple values of SA/PV can conceivably be found in a single sentence. (4) ‘Keyword’ is relevant only to SA and it allows us to decide between either the


Langmuir28 (monolayer) or the Brunauer-Emmett-Teller (BET)29 (multi-layer) type, as differentiation between the two is important.

At the end of the tokenization and classification process, the MOF name, the numerical values, and the PV/SA all need to be accurately matched with one another. Our matching algorithm can be summarized as follows.

(1) Traverse all the tokens and identify a token that is classified under the category of SA or PV ‘Unit’.

(2) In the sentence from (1), find the numerical value(s) that precede the ‘Unit’ token (e.g. 300 in 300 m2/g). If there is more than one numerical value (usually separated by a comma or the word ‘and’), all of the values are stored.

(3) In the case of an SA unit, traverse backward from the current sentence to the past sentences to search for the keywords ‘BET’ and/or ‘Langmuir’. If neither of these two keywords (or their variants) appears, the data are discarded and the algorithm proceeds to the next sentence and back to step (1). If only one type of keyword (e.g. ‘BET’) appears, the number of MOF names is taken to be the same as the number of numerical values identified in (2); if both types of keywords appear, the number of numerical values is taken to be twice the number of MOF names. Furthermore, when there are two or more MOF names with both BET and Langmuir keywords being used, proper assignment becomes crucial. For example, for two MOFs with BET and Langmuir types, two cases can occur (B = BET surface area value, L = Langmuir surface area value): (i) the BBLL format, [B(for MOF1) B(for MOF2) L(for MOF1) L(for MOF2)], or (ii) the BLBL format, [B(for MOF1) L(for MOF1) B(for MOF2) L(for MOF2)]. In our algorithm, the number of words that separate the keywords ‘BET’ and ‘Langmuir’ is determined and used to differentiate the two cases. In general, if this separation is less than four (e.g. ‘BET and Langmuir surface area of MOFs are …’), the sentence is categorized as having the BLBL format. Otherwise (e.g. ‘BET surface area of MOF is … and Langmuir surface area of MOF is …’), it is viewed as having the BBLL format. The number four was empirically determined to yield the most accurate results.

(4) Next, the tokens that represent names of the MOFs are identified to properly map the numerical values from (3) to the appropriate MOF names. As mentioned previously, the criteria that determine whether a given token is a MOF name are described in the Supporting Information. In addition, many of the papers replace MOF names with bold symbols such as 1, 2, and 1a, where these symbols are assigned to a specific MOF name early in the paper and the simpler symbols are used in the rest of the manuscript. Our code can identify some of these substitutions and accurately retrace the symbol to the original MOF name. It turns out that the MOF name is often not found in the same sentence as the SA/PV data and as such, four additional sentences (two forward and two backward from the current sentence) are traversed to find the MOF name. Although this choice is somewhat arbitrary, we have found that beyond these four sentences the proper MOF name is not found and, moreover, erroneous MOF names start to appear in the identification process the further we move from the SA/PV data.

(5) Once the MOF name is properly identified, this particular dataset is collected and stored, and the algorithm moves on to the next sentence and proceeds again from step (1).

Steps (1)–(5) are repeated until we reach the end of the manuscript file.
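The BET/Langmuir ordering rule of step (3) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function name assign_surface_areas and its arguments are hypothetical, keyword variants (such as the spelled-out ‘Brunauer-Emmett-Teller’) are not handled, and treating a sentence in which ‘Langmuir’ precedes ‘BET’ symmetrically is an assumption that the text does not state.

```python
def assign_surface_areas(tokens, values, mof_names, max_gap=4):
    """Map SA values to (MOF name, keyword, value) triples.

    tokens    -- word tokens of the text span that holds the keywords
    values    -- numerical values in their order of appearance
    mof_names -- MOF names in their order of appearance
    max_gap   -- separations below this count as the interleaved format
    """
    lowered = [t.strip(".,;:()").lower() for t in tokens]
    has_bet = "bet" in lowered
    has_langmuir = "langmuir" in lowered

    if not (has_bet or has_langmuir):
        return []  # no keyword found: the data are discarded

    if has_bet != has_langmuir:
        # Only one keyword: one value is expected per MOF name.
        keyword = "BET" if has_bet else "Langmuir"
        if len(values) != len(mof_names):
            return []
        return [(m, keyword, v) for m, v in zip(mof_names, values)]

    # Both keywords: two values are expected per MOF name.
    n = len(mof_names)
    if len(values) != 2 * n:
        return []
    i_bet, i_lang = lowered.index("bet"), lowered.index("langmuir")
    first, second = ("BET", "Langmuir") if i_bet < i_lang else ("Langmuir", "BET")
    gap = abs(i_bet - i_lang) - 1  # number of words between the keywords

    result = []
    if gap < max_gap:
        # Interleaved format (BLBL): both values of a MOF are adjacent.
        for k, m in enumerate(mof_names):
            result.append((m, first, values[2 * k]))
            result.append((m, second, values[2 * k + 1]))
    else:
        # Grouped format (BBLL): all values of one type come first.
        for k, m in enumerate(mof_names):
            result.append((m, first, values[k]))
        for k, m in enumerate(mof_names):
            result.append((m, second, values[n + k]))
    return result
```

The consistency checks on the number of values are what let the sketch refuse to guess: whenever the counts do not match the rule in step (3), it returns nothing rather than a possibly wrong assignment.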


To elucidate the overall process, several examples taken from text-mining real MOF papers are shown in Figure 2.

Figure 2. Examples that demonstrate the identification schemes of our text mining algorithm. Example 1 is a representative case of one of the simplest sentence structures that yields the SA data. Example 2 shows that two MOFs with both BET and Langmuir surface areas can be correctly assigned. Example 3 shows a mixture of SA and PV, with the SA data further subdivided into BET and Langmuir types.
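For a simple sentence of the kind described in Example 1, steps (1) and (2) of the matching algorithm can be sketched as below. This is a minimal sketch with several assumptions: the unit sets are illustrative rather than the authors' full lists, the pore-volume terms are searched for within the same sentence, and extract_values and the module-level constants are hypothetical names.

```python
import re

# Illustrative unit spellings (lower case), not an exhaustive list.
SA_UNITS = {"m2/g", "m2g-1", "m2·g-1", "m2•g-1"}
PV_UNITS = {"cm3/g", "cm3g-1", "cm3·g-1", "cm3•g-1", "ml/g", "cm3/cm3"}
PV_TERMS = ("pore volume", "micropore volume", "mesopore volume",
            "free volume", "permanent porosity")
# Integers or decimals, optionally with thousands separators (e.g. 3,800).
NUMBER = re.compile(r"^\d+(?:,\d{3})*(?:\.\d+)?$")
CONNECTORS = {"and", ""}  # tokens allowed between listed values


def extract_values(tokens):
    """Return (kind, values) pairs for each SA or PV unit token in a sentence."""
    sentence = " ".join(tokens).lower()
    found = []
    for i, token in enumerate(tokens):
        unit = token.strip(".,;:()").lower()
        if unit in SA_UNITS:
            kind = "SA"
        elif unit in PV_UNITS and any(term in sentence for term in PV_TERMS):
            kind = "PV"  # PV units count only with a pore-volume term nearby
        else:
            continue
        values = []
        j = i - 1
        while j >= 0:  # walk backward from the unit token
            word = tokens[j].strip(",;:()")
            if NUMBER.match(word):
                values.append(float(word.replace(",", "")))
            elif word.lower() not in CONNECTORS:
                break  # first token that is neither a number nor a connector
            j -= 1
        if values:
            found.append((kind, values[::-1]))  # restore reading order
    return found
```

Requiring a pore-volume term before accepting a PV unit mirrors the contextual check described for units such as cm3/cm3, which also appear in gas uptake data.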

Some of the SA/PV data are not found in the main text of the manuscript, but are embedded inside an html table. The table data within the html files can be reformatted as a two-dimensional data array, and simple rules are applied to find the appropriate columns (or rows) that refer to the SA or PV data. Once identified, these values are stored in the same format as the SA/PV data discovered in the main text. Finally, we want to point out that there can be two sources of errors: one, the inability of our code to find the MOF SA or PV data even if they exist, and two, the erroneous assignment of the values identified by our code. In this analysis, we do not deal with the first type of error, as it is very difficult to verify the existence of these omissions (especially for hundreds of papers).
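As an illustration of the table handling mentioned above, the sketch below converts an html table into a two-dimensional array and looks for header cells that contain an SA or PV unit. It is only a sketch: the column rule shown here (a unit string appearing in the first row) is an assumed stand-in for the authors' simple rules, and TableParser, table_to_array and find_columns are hypothetical names built on the html.parser module of the Python standard library.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the rows and cells of html tables into a 2D list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_array(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def find_columns(rows, units=("m2/g", "cm3/g")):
    """Return {unit: column index} for header cells that mention a unit."""
    header = rows[0] if rows else []
    columns = {}
    for index, cell in enumerate(header):
        compact = cell.replace(" ", "").lower()
        for unit in units:
            if unit in compact:
                columns[unit] = index
    return columns
```

Tables that list MOFs in columns rather than rows would need the same search applied to the first cell of every row.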

We also believe that the second type of error is the more serious issue, as it conveys false information to the users, and as such, this work is focused on minimizing these situations.

Results

To verify the accuracy of our text mining algorithm, two review papers (M. P. Suh, et al.30 and K. Sumida, et al.31) with hundreds of compiled MOF SA and PV values were selected as the sample set. From these review papers, 100 SA and 107 PV reference html files were obtained and our text mining code was applied to each of these input files. The output data contained text mining information in the following formats: [MOF name, Keyword, Numerical value] for SA and [MOF name, Numerical value] for PV, where ‘Keyword’ refers to BET and/or Langmuir. Manual comparisons between the text mined data and the data taken from the review papers indicate that 90% of the SA (90/100) and 88.8% of the PV (95/107) values yielded exact agreement. The cases where our algorithm failed to retrieve the correct data are differentiated and summarized in Table 1.


Table 1. Breakdown of different types of errors from the sample set data.

Error type                                      SA        PV
(1) Reference error
    Data in the Supporting Information file     3         3
    Data in the figures (i.e. N2 isotherm)      2         3
    Typo error                                  0         1
(2) Complex sentence structure error            4         4
(3) MOF name error
    Simple MOF misleading error                 0         0
    Bold MOF format error                       1         1
Total number of errors                          10/100    12/107

In the case of reference errors, the SA/PV data were not found inside the main manuscript of the html files; these accounted for the largest share of the total errors for both SA (5 of 10) and PV (7 of 12). Further breakdown of the reference errors shows that in 3 cases each for SA and PV, the data were located in the Supporting Information files, while 2 (SA) and 3 (PV) errors come from what we interpret as the review papers having obtained the data from the N2 isotherm figures. In this scenario, the addition of image mining could potentially help retrieve the data. Finally, in a single case for PV, there was a typo in the review paper itself, as the value was incorrectly written as 0.017 (as opposed to the correct value of 0.17). Next, there were errors associated with the highly complex nature of some sentences, which made it difficult for our code to follow the logic of the papers (we categorize these as “complex sentence structure error” in Table 1). Some examples are shown in Figure 3.


Figure 3. Case 1: Cu and Ni catecholates are not recognized as MOF names and thus the SA values of 425 and 490 m2/g are assigned to the wrong MOFs. Case 2: the MOF name PCN-56 is properly identified, but our code did not understand that ‘PCN-56 to -59’ refers to PCN-56, PCN-57, PCN-58, and PCN-59. Case 3: several MOFs share one numerical value, but our code did not understand that all three MOFs (DUT-6, DUT-9 and PCN-61) are assigned to 3000 m2 g-1.
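The range expression in Case 2 suggests one possible preprocessing step, sketched below under clear assumptions: it is not part of the authors' code, the pattern only covers names of the form ‘prefix-number to -number’, and expand_name_ranges is a hypothetical name.

```python
import re

# Matches e.g. "PCN-56 to -59": a name prefix ending in a hyphen, a start
# number, the word "to", and a hyphenated end number.
RANGE = re.compile(r"\b([A-Za-z][A-Za-z0-9]*-)(\d+)\s+to\s+-(\d+)\b")


def expand_name_ranges(text):
    """Rewrite 'PCN-56 to -59' as an explicit comma-separated list of names."""

    def repl(match):
        prefix = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        if end < start:
            return match.group(0)  # leave unexpected ranges untouched
        return ", ".join(f"{prefix}{n}" for n in range(start, end + 1))

    return RANGE.sub(repl, text)
```

Applied before tokenization, such a rewrite would expose every member of the series as its own token, at the cost of one more hand-written rule to maintain.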

Finally, in the case of MOF name errors (Table 1 (3)), our MOF name criteria failed to identify the correct MOF names within the paper. Both of the instances here came from the authors using bold substitute symbols (instead of the actual MOF name) to refer to the MOF.
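To make the bold-symbol problem concrete, the sketch below resolves labels in the simplest situation, where a name is immediately followed by its bold label in parentheses, e.g. ‘PCN-100 (<b>1</b>)’. Both the definition pattern and the example name are assumptions made for illustration; the authors' actual rules are given in their Supporting Information, and resolve_bold_labels is a hypothetical name. The function works on the raw html, before the markup is stripped.

```python
import re

# A name token directly followed by "(<b>label</b>)" defines the label.
DEFINITION = re.compile(r"([A-Za-z][\w-]*)\s*\(\s*<b>([^<]+)</b>\s*\)")
BOLD = re.compile(r"<b>([^<]+)</b>")


def resolve_bold_labels(html):
    """Replace bold labels by the MOF names they were defined with.

    Returns the rewritten html and the label-to-name dictionary.
    Labels without a matching definition are left unchanged.
    """
    labels = {}
    for name, label in DEFINITION.findall(html):
        labels.setdefault(label.strip(), name)  # keep the first definition

    def substitute(match):
        return labels.get(match.group(1).strip(), match.group(0))

    return BOLD.sub(substitute, html), labels
```

A label that is never defined in this exact form stays unresolved, which is the situation that produces the MOF name errors discussed above.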


Given that our text mining algorithm was largely tuned and optimized on the initial sample set, it is important to test the algorithm on a new test set. As such, html files for the CoRE MOF database were collected using our in-house code, which obtained the DOIs from the CCDC database using the Cambridge Structural Database (CSD) Python API and then accessed each article through the publisher’s website. Using the code, 4763 crystallographic information files (cif) and 4008 html files were collected. After eliminating duplicate html files, 2315 CoRE MOF html files remained, and the text mining code was applied to all of these files, with the text mining data summarized in Table 2. We refer the reader to the Supporting Information Excel file, which has all the compiled data.

Table 2. Summarized results from our text mining code on a large set of MOF html files.

                                      Surface Area                       Pore Volume
Format of data set                    [MOF, Keyword, Numerical value]    [MOF, Numerical value]
(1) # of total data                   999                                583
(2) # of data from random 50 papers   183                                235
(3) Accuracy test for (2)             73.22% (134/183)                   85.11% (200/235)

From the 2315 html files, SA and PV data were found in 490 and 250 files (with some files having multiple data), respectively, implying that in the majority of the manuscripts, either the SA/PV data were unavailable or our code could not identify them. The number of retrieved SA data is larger than that of PV (999 > 583), which can be due to our method having more difficulty collecting the PV data or (more likely) to SA being reported more often than PV. Given the large dataset, it is impractical to manually verify all of our text mined results and as such, a subset of 50 randomly selected papers was used to determine the accuracy of our code.


Manual comparisons indicate that for SA, 134/183 (73.2%) of the results were accurate, while for PV, 200/235 (85.1%) were accurate. Overall, these percentages are lower than what was obtained from the review paper sample set (90% for SA, 88.8% for PV), with the simplest explanation being that the rules/criteria were constructed with the original sample data set and as such, a reduction in accuracy for a new set was to be expected. As before, the types of errors were further broken down to analyze the problems (Table 3).

Table 3. Error distribution of the test set data used for double checking.

Error type                                      SA        PV
(1) Reference error
    Data in the Supporting Information file     2         2
    Data in the figures (i.e. N2 isotherm)      1         2
    Typo error                                  0         0
(2) Complex sentence structure error            31        10
(3) MOF name error
    Simple MOF misleading error                 1         2
    Bold MOF format error                       14        19
Total number of errors                          49/183    35/235

For SA, the 49 errors were categorized as 3 reference errors, 31 complicated sentence errors, and 15 MOF name errors; the 35 PV errors are divided into 4 reference errors, 10 complicated sentence errors and 21 MOF name errors. The large discrepancy between SA and PV comes from the complicated sentence errors. It turns out that our algorithm does a poor job of identifying cases where there are multiple instances of the keywords BET and Langmuir, or where other, new types of surface area appear. Some examples are shown in Figure 4.
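The error counts and accuracies quoted for the sample set and the test set can be cross-checked with a few lines of arithmetic. The numbers below are copied from Tables 1 and 3, grouped as (reference, complex sentence, MOF name) errors; the helper name accuracy is hypothetical.

```python
def accuracy(errors, total):
    """Percentage of data points without any error."""
    return 100.0 * (total - sum(errors)) / total


# (reference, complex sentence, MOF name) error counts and number of data.
data_sets = {
    "sample set SA": ((5, 4, 1), 100),
    "sample set PV": ((7, 4, 1), 107),
    "test set SA": ((3, 31, 15), 183),
    "test set PV": ((4, 10, 21), 235),
}

for label, (errors, total) in data_sets.items():
    print(f"{label}: {sum(errors)} errors, {accuracy(errors, total):.1f}% accurate")
```

The sums reproduce the totals of 10, 12, 49 and 35 errors, and the resulting percentages match the reported 90%, 88.8%, 73.2% and 85.1%.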


Figure 4. Case 1: zeolite Y is not recognized as a MOF name, leading to incorrect assignments. Moreover, our algorithm cannot assign a range expression (500-1160) to a single MOF. Case 2: an unrecognized surface area type defined inside the paper (Smesopore) leads to incorrect assignments.

When the keywords are completely excluded from our search, with the output data format of SA changed to [MOF name, Numerical value], the number of complicated sentence errors decreases from 31 to 11, giving SA a level of accuracy comparable to that of PV. Results from Table 3 also indicate that our code does a poor job of identifying MOFs that are referred to by bold symbols. That said, there is a limit to how much we can fix this issue due to the variety of ways in which authors represent MOFs in these bold texts (ranging from the simplest types such as ‘1’, ‘1a’ or ‘1′’ to expressions such as ‘Sm-Co’, ‘3-Ni’, ‘1·x(Guest)’ or ‘1a–EtOH·EtOH’). The code can detect and resubstitute these complex expressions when the original MOF name and the bold expression are exactly matched at some point in the manuscript, but we have found that this is often not the case, which makes it a challenge.

Next, the text mined data were compared with simulation results derived from Zeo++ (a probe size of 1.86 Å was used for the SA and PV calculations)32. Since it is difficult to match the commonly used MOF names (e.g. MOF-5, HKUST-1) with their CCDC REFCODEs, only the few cases where the common name was successfully matched to a REFCODE (38 for SA and 27 for PV) are compared and plotted in Figure 5.

Figure 5. Comparisons between the text mined data (y-axis) and the data calculated using the geometrical method from Zeo++ (x-axis). Both the surface area (Figure 5(a)) and the pore volume (Figure 5(b)) are compared. Each data point represents a different MOF and is labeled with its REFCODE.

In general, the agreement between the text mined and the Zeo++ data is good, serving as a consistency check on both methods. However, especially for SA, and similar to what was observed by Goldsmith et al.33, the simulated data tend to overestimate the experimental data, which can be due to incomplete removal of solvents from the synthesized MOFs and/or structural distortion during solvent removal. There is also an anomalous data point, ‘NIBHOW’ (Figure 5(b)), with the common name PCN-6’. The original html file34 contains the phrase “The N2 adsorption isotherm of PCN-6’ (Figure 2a) indicates typical Type-I sorption behavior, with a Langmuir surface area of 2700 m2/g (pore volume 1.045 mL/g).”, and as such, the text mined data from the code seem to be accurate. However, it turns out that PCN-6’ is an interpenetrated structure, whereas the CoRE MOF structure database contains a non-interpenetrated file for it. This indicates that one needs to take into account the presence of interpenetration when comparing experimental data with results from simulated structures. Finally, given the difficulty in matching all of the MOF names, the distributions of the 4008 text mined data and the simulated data were compared, with the histograms of SA and PV shown in Figure 6. It can be seen from Figure 6 that the Zeo++ results have a larger relative proportion of data points in the small SA and PV intervals, which can be due to a bias whereby experimentalists might not report these data in the main manuscripts when the numbers are small. That said, the general shapes of the curves are similar between the two sets of data, indicating congruency between the experimental data for SA and PV and the simulated data as a whole.


Figure 6. Histograms of surface area (Figure 6(a)) and pore volume (Figure 6(b)) for the code output (blue line) and for data calculated using the geometrical method (red line).
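Because the text-mined and simulated sets differ in size, a comparison of this kind is most naturally made on relative proportions rather than raw counts. The following Python sketch shows one way to compute such normalized histograms. The function name `relative_histogram`, the bin edges, and all sample values are hypothetical and chosen only for illustration; they are not data from this work.

```python
def relative_histogram(values, bin_edges):
    """Fraction of values falling in each half-open bin [edge_i, edge_{i+1})."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break  # each value belongs to at most one bin
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


# Hypothetical surface areas in m2/g, for illustration only.
edges = [0, 1000, 2000, 3000, 4000]
mined = [450, 1200, 1500, 2700, 3100]
simulated = [100, 300, 800, 1600, 2900, 3500]

print(relative_histogram(mined, edges))
print(relative_histogram(simulated, edges))
```

In this toy input the simulated set has a larger share of values in the lowest bin, which is the kind of difference the normalized curves make visible.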

Conclusions

We have developed a Python code that automatically extracts SA and PV data from a given MOF html file. After tuning our algorithm to optimize its accuracy on a sample set, we applied the code to a randomly selected test set, obtaining 73.2% and 85.1% accuracy for SA and PV, respectively. We intend to extend the text mining code to identify other important quantities beyond SA and PV, so that the wealth of information embedded within thousands of manuscripts can be unearthed. With advances in artificial intelligence (AI) and machine learning being applied to nanomaterials, text mining should serve as an important tool to bridge the gap between AI and materials research.
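As a rough illustration of what extracting SA and PV from an html file involves, the sketch below strips the markup and applies two regular expressions. It is a deliberately simplified stand-in and not the rule-based algorithm developed in this work: the patterns, the helper names `html_to_text` and `extract_sa_pv`, and the `<sub>`/`<sup>` markup in the sample are assumptions. It uses only the Python standard library, whereas the actual code relies on other tools.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect the text content of an html string, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Illustrative patterns only; real manuscripts use many more phrasings and units.
NUMBER = r"(\d+(?:\.\d+)?)"
SA_PATTERN = re.compile(
    r"surface area of\s+" + NUMBER + r"\s*m2\s*/\s*g", re.IGNORECASE)
PV_PATTERN = re.compile(
    r"pore volume(?: of)?\s+" + NUMBER + r"\s*(?:mL|cm3|cc)\s*/\s*g", re.IGNORECASE)


def extract_sa_pv(html):
    """Return (surface areas in m2/g, pore volumes in mL/g) found in the text."""
    text = html_to_text(html)
    sa = [float(x) for x in SA_PATTERN.findall(text)]
    pv = [float(x) for x in PV_PATTERN.findall(text)]
    return sa, pv


# Sentence quoted in the text for PCN-6'; the html tags are assumed.
sample = ("<p>The N<sub>2</sub> adsorption isotherm of PCN-6' (Figure 2a) indicates "
          "typical Type-I sorption behavior, with a Langmuir surface area of "
          "2700 m<sup>2</sup>/g (pore volume 1.045 mL/g).</p>")

print(extract_sa_pv(sample))
```

A pattern-only approach like this does not associate a value with a MOF name or distinguish BET from Langmuir surface areas, which is part of what makes the full problem harder.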

Supporting Information


The Supporting Information contains one PDF file and one Excel file. The PDF file contains additional information on the rules used in this research, and the Excel file contains the MOF surface area and pore volume database produced by our code.

The text mining code is available online at https://github.com/sanghooni/Textmining

The following files are available free of charge.

Additional description of the criteria used to define MOF names (PDF)

Overall code output data: both surface area and pore volume, including the MOF names extracted by our code (XLSX)

Corresponding Author

Berend Smit ([email protected]) and Jihan Kim ([email protected])

Acknowledgements

J.K. and B.S. acknowledge the support of the Korean-Swiss Science and Technology Programme (KSSTP), grant number 162130 of the Swiss National Science Foundation (SNSF). P.G.B. would like to acknowledge helpful discussions with Amy Sarjeant and Paul Sanschagrin from the CSD in searching for the article DOIs associated with each CoRE MOF. This research was also supported by the International Research & Development Program of the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (Grant number: 2015K1A3A1A14003244).


References

1. Li, H., Eddaoudi, M., O’Keeffe, M., Yaghi, O. M. Design And Synthesis Of An Exceptionally Stable And Highly Porous Metal-organic Framework. Nature 402, 276–279 (1999).

2. Li, J. R., Kuppler, R. J., Zhou, H. C. Selective Gas Adsorption And Separation In Metal-organic Frameworks. Chem. Soc. Rev. 38, 1477–1504 (2009).

3. Zhou, H.-C., Long, J. R., Yaghi, O. M. Introduction To Metal–Organic Frameworks. Chem. Rev. 112, 673–674 (2012).

4. Moghadam, P. Z., Li, A., Wiggin, S. B., Tao, A., Maloney, A. G. P., Wood, P. A., Ward, S. C., Fairen-Jimenez, D. Development Of A Cambridge Structural Database Subset: A Collection Of Metal-Organic Frameworks For Past, Present, And Future. Chem. Mater. 29, 2618–2625 (2017).

5. Chung, Y. G., Camp, J., Haranczyk, M., Sikora, B. J., Bury, W., Krungleviciute, V., Yildirim, T., Farha, O. K., Sholl, D. S., Snurr, R. Q. Computation-ready, Experimental Metal-organic Frameworks: A Tool To Enable High-throughput Screening Of Nanoporous Crystals. Chem. Mater. 26, 6185–6192 (2014).

6. Hotho, A., Staab, S., Stumme, G. Ontologies Improve Text Document Clustering. Third IEEE Int. Conf. Data Min. 2–5 (2003).


7. Larsen, B., Aone, C. Fast And Effective Text Mining Using Linear-time Document Clustering. Proc. Fifth ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. - KDD ’99 16–22 (1999).

8. Kim, J. D., Ohta, T., Tateisi, Y., Tsujii, J. GENIA Corpus - A Semantically Annotated Corpus For Bio-textmining. Bioinformatics 19, 180–182 (2003).

9. Tobergte, D. R., Curtis, S. Text Mining For Biology And Biomedicine (Book). J. Chem. Inf. Model. 53, 1689–1699 (2013).

10. Ananiadou, S., Kell, D. B., Tsujii, J. Text Mining And Its Potential Applications In Systems Biology. Trends Biotechnol. 24, 571–579 (2006).

11. Cohen, A. M., Hersh, W. R. A Survey Of Current Work In Biomedical Text Mining. Brief. Bioinform. 6, 57–71 (2005).

12. Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., Weinstein, J. N. MedMiner: An Internet Text-mining Tool For Biomedical Information, With Application To Gene Expression Profiling. Biotechniques 27, 1210–1214 (1999).

13. Jain, A., Ong, S. P., Hautier, G., Chen, W., Richards, W. D., Dacek, S., Cholia, S., Gunter, D., Skinner, D., Ceder, G., Persson, K. A. Commentary: The Materials Project: A Materials Genome Approach To Accelerating Materials Innovation. APL Mater. 1 (2013).

14. Ongari, D., Boyd, P. G., Barthel, S., Witman, M., Haranczyk, M., Smit, B. Accurate Characterization Of The Pore Volume In Microporous Crystalline Materials. Langmuir (2017).


15. Bruinsma, R. F., Gennes, P. G. De, Freund, J. B., Levine, D. Letters To Nature. Nature 427, 523–527 (2004).

16. Eddaoudi, M. Systematic Design Of Pore Size And Functionality In Isoreticular MOFs And Their Application In Methane Storage. Science 295, 469–472 (2002).

17. Wilmer, C. E., Leaf, M., Lee, C. Y., Farha, O. K., Hauser, B. G., Hupp, J. T., Snurr, R. Q. Large-scale Screening Of Hypothetical Metal–organic Frameworks. Nat. Chem. 4, 83–89 (2011).

18. Nazarian, D., Camp, J. S., Chung, Y. G., Snurr, R. Q., Sholl, D. S. Large-Scale Refinement Of Metal–Organic Framework Structures Using Density Functional Theory. Chem. Mater. 29, 2521–2528 (2017).

19. Simon, C. M., Mercado, R., Schnell, S. K., Smit, B., Haranczyk, M. What Are The Best Materials To Separate A Xenon/Krypton Mixture? Chem. Mater. 27, 4459–4475 (2015).

20. Jeong, W., Lim, D.-W., Kim, S., Harale, A., Yoon, M., Suh, M. P., Kim, J. Modeling Adsorption Properties Of Structurally Deformed Metal–organic Frameworks Using Structure–property Map. Proc. Natl. Acad. Sci. 114, 7923–7928 (2017).

21. Thornton, A. W., Simon, C. M., Kim, J., Kwon, O., Deeg, K. S., Konstas, K., Pas, S. J., Hill, M. R., Winkler, D. A., Haranczyk, M., Smit, B. Materials Genome In Action: Identifying The Performance Limits Of Physical Hydrogen Storage. Chem. Mater. 29, 2844–2854 (2017).

22. Chowdhury, G. G. Natural Language Processing. Annu. Rev. Appl. Linguist. 37, 51–89 (2003).


23. Kao, A., Poteet, S. R. Natural Language Processing And Text Mining. (2007).

24. Smedt, T. De, Daelemans, W. Pattern For Python. J. Mach. Learn. Res. 13, 2063–2067 (2012).

25. Richardson, L. Beautiful Soup Documentation. 1–72 (2016).

26. Allen, F. H. The Cambridge Structural Database: A Quarter Of A Million Crystal Structures And Rising. Acta Crystallogr. Sect. B Struct. Sci. 58, 380–388 (2002).

27. Bird, S., Klein, E., Loper, E. Natural Language Processing With Python: Analyzing Text With The Natural Language Toolkit. (O’Reilly Media, Inc., 2009).

28. Hanaor, D. A. H., Ghadiri, M., Chrzanowski, W., Gan, Y. Scalable Surface Area Characterization By Electrokinetic Analysis Of Complex Anion Adsorption. Langmuir 30, 15143–15152 (2014).

29. Brunauer, S., Emmett, P. H., Teller, E. Adsorption Of Gases In Multimolecular Layers. J. Am. Chem. Soc. 60, 309–319 (1938).

30. Suh, M. P., Park, H. J., Prasad, T. K., Lim, D.-W. Hydrogen Storage In Metal–Organic Frameworks. Chem. Rev. 112, 782–835 (2012).

31. Sumida, K., Rogow, D. L., Mason, J. A., McDonald, T. M., Bloch, E. D., Herm, Z. R., Bae, T. H., Long, J. R. Carbon Dioxide Capture In Metal-organic Frameworks. Chem. Rev. 112, 724–781 (2012).


32. Willems, T. F., Rycroft, C. H., Kazi, M., Meza, J. C., Haranczyk, M. Algorithms And Tools For High-throughput Geometry-based Analysis Of Crystalline Porous Materials. Microporous Mesoporous Mater. 149, 134–141 (2012).

33. Goldsmith, J., Wong-Foy, A. G., Cafarella, M. J., Siegel, D. J. Theoretical Limits Of Hydrogen Storage In Metal-organic Frameworks: Opportunities And Trade-offs. Chem. Mater. 25, 3373–3382 (2013).

34. Ma, S., Sun, D., Ambrogio, M., Fillinger, J. A., Parkin, S., Zhou, H. C. Framework-catenation Isomerism In Metal-organic Frameworks And Its Impact On Hydrogen Uptake. J. Am. Chem. Soc. 129, 1858–1859 (2007).
