Article pubs.acs.org/JPCA

Computational Chemistry Data Management Platform Based on the Semantic Web

Published as part of The Journal of Physical Chemistry virtual special issue “Mark S. Gordon Festschrift”.

Bing Wang,† Paul A. Dobosh,† Stuart Chalk,‡ Mirek Sopek,† and Neil S. Ostlund*,†

†Chemical Semantics Inc., 2772 NW 43rd Street, Suite B1, Gainesville, Florida 32606, United States
‡Department of Chemistry, University of North Florida, Jacksonville, Florida 32224, United States



ABSTRACT: This paper presents a formal data publishing platform for computational chemistry using semantic web technologies. This platform encapsulates computational chemistry data from a variety of packages in an Extensible Markup Language (XML) file called CSX (Common Standard for eXchange). On the basis of the Gainesville Core (GC) ontology for computational chemistry, a CSX XML file is converted into the JavaScript Object Notation for Linked Data (JSON-LD) format using an Extensible Stylesheet Language Transformations (XSLT) file. Ultimately the JSON-LD file is converted to subject−predicate−object triples in a Turtle (TTL) file and published on the web portal. By leveraging semantic web technologies, we are able to place computational chemistry data onto web portals as a component of a Giant Global Graph (GGG) such that computer agents, as well as individual chemists, can access the data.



Received: October 17, 2016
Revised: December 9, 2016
Published: December 12, 2016
DOI: 10.1021/acs.jpca.6b10489 J. Phys. Chem. A XXXX, XXX, XXX−XXX

INTRODUCTION

Computational chemistry is an indispensable tool for scientific discovery and innovation. With access to massively parallel and highly distributed computer systems, researchers can now perform very sophisticated calculations on large systems. For example, with accurate and expensive coupled cluster (CC) methods,1 one can routinely study molecules composed of around 100 atoms. For molecular mechanical (MM) methods, recent improvements in both hardware and software have pushed molecular dynamics (MD) simulations past the limit of a microsecond per day on million-atom systems.2 These calculations have demonstrated their utility in (i) producing results comparable to experimental observables3 and (ii) allowing sampling of molecular states that were previously inaccessible.4 Thus, new insights into biological and material phenomena have been obtained. The achievements in modern computational chemistry have been highlighted in the 1998 and 2013 Nobel Prizes in Chemistry.5,6

Along with the developments above comes an explosive growth in the volume of data generated by computations, making it a daunting task to archive, manage, and share this valuable data. In addition, current practices in data management are generally misaligned with the evolving community guidelines of scientific data publishing as embodied in the Findability, Accessibility, Interoperability, and Reusability (FAIR) perspective.7 The principal barriers to managing/reusing computational chemistry data in a FAIR way follow: (1) The data is intrinsically heterogeneous because of an abundance of complicated theoretical models and algorithms across molecular mechanics, quantum mechanics with semiempirical methods, density functional theory (DFT), and ab initio methods. (2) Heterogeneity in the way data is presented in input/output files (and associated metadata) hinders computational chemistry data interoperability and reproducibility between different software packages. (3) In the current scientific publication paradigm, only the raw data that researchers consider “important” is processed/reported, and thus, a large portion of valuable data/metadata is lost during this process. (4) When authors attach original computational data in the Supporting Information section of a paper, it is usually in a format (image, PDF) that makes the data difficult or almost impossible to reuse.

To address these issues, several efforts have been directed toward online repositories of computational chemistry data. The National Institute of Standards and Technology (NIST) Computational Chemistry Comparison and Benchmark DataBase (CCCBDB) includes a collection of quantum mechanically calculated electrostatic, geometrical, thermochemical, and vibrational data and their corresponding experimental values for a carefully selected set of gas-phase atoms and small molecules.8 The Benchmark Energy and Geometry DataBase (BEGDB) contains high-level QM calculations, including the CCSD(T)/CBS method, for intermolecular complexes.9 The Quixote project, initiated by Peter Murray-Rust, aims to create a federated infrastructure for quantum chemistry data management, which takes advantage of Chemical Markup Language (CML) to convert output files into XML.10 A closely related platform is ioChem-BD, which focuses on the generation of graphics for manuscripts from output files of QM calculations.11 Finally, the iBIOMES platform12 manages and shares biomolecular simulations data in a distributed environment, which requires users to install the Integrated Rule Oriented Data System (iRODS).13

In this paper, we propose a new solution to address these issues through our efforts to publish computational chemistry data (including metadata) using a new XML standard for output data normalization and formal semantic web technologies. Using this mechanism, researchers will be able to share, receive credit for, and reuse each other’s data and thus streamline their scientific endeavor. As the inventor of the World Wide Web, Sir Tim Berners-Lee wrote:14

The established system of journals for communicating the results of scientific research is already challenged by the existence of the web. But we are only at the early day of a new Internet revolution, one which will have a deeper and more disruptive impact on scientific, and other, web publishing, and have profound implications for the web itself. An emerging successor for the web, the Semantic Web, will likely profoundly change the very nature of how scientific knowledge is produced and shared, in ways that we can now barely imagine.

In a nutshell, the semantic web is a collection of technologies and standards that allow machines to understand the meaning (semantics) of information on the Internet. Instead of publishing plain text documents readable and understandable only by humans, it allows semantic annotation of data and metadata through use of subject−predicate−object “triples” encoded in Resource Description Framework (RDF) format.15 RDF can be encoded in many ways and can be integrated into existing Web sites so that computers can understand published data (its meaning and relation to other data), retrieve it, and extend it using inference. The World Wide Web Consortium (W3C)16 has published the RDF standard (essentially a graph data model) for storing the data, the Web Ontology Language (OWL standard)17 for defining the data, and the SPARQL Protocol and RDF Query Language (SPARQL standard)18 for creating software to find and process data.

How are we able to harness the power of the semantic web for heterogeneous computational chemistry data publishing? Our solution is demonstrated in Scheme 1. First, all important data/metadata are extracted, either directly from computational chemistry packages or from the output files, and encoded in the Common Standard for eXchange (CSX) file format, our open format of eXtensible Markup Language (XML) specifically designed for computational chemistry data. Second, based on the Gainesville Core (GC) Ontology, our ontology for computational chemistry, a CSX file is transformed into a JavaScript Object Notation for Linked Data (JSON-LD) file using an Extensible Stylesheet Language Transformations (XSLT) file. Finally, the JSON-LD file is converted into a Terse RDF Triple Language (Turtle) file, a standard serialization of RDF, which is subsequently ingested and published on our portal. The description of each process is presented in the following sections.

Scheme 1. Our Workflow of Semantic Data Publishing for Computational Chemistry



COMMON STANDARD FOR EXCHANGE (CSX)

Computational chemistry data publishing should not be simply placing the output files generated by software packages onto a Web page. As mentioned earlier, these output files have different formats and syntaxes for reporting computational results which require a lot of domain knowledge to interpret, making them human-readable, but not machine-readable. Consequently, it is difficult to extract the context (semantic meaning) regarding a computation from these files using current semantic web technologies.

One early response to this problem was the creation of the Chemical Markup Language (CML),19 promoted over the past decade by Murray-Rust. However, CML has several drawbacks: (1) It has no facility for identifying the residues of proteins or nucleic acids (or other monomers) that are fundamental entities in large biomolecules. (2) The placement of computational results such as wave functions, the calculated vibrational analysis, etc. into a CML file has not been fully defined, although there have been efforts to encode computational chemistry results into a CML file. (3) It lacks the structure for data stewardship. (4) The development of CML seems to have stalled, with the latest stable release, Schema 3.0, unchanged since 2006 and too relaxed in its logic implementation.

Therefore, we are proposing a new data standard, CSX, for capturing and normalizing computational chemistry data and its contextual metadata (more information on the structure of the CSX file format will be published in a future paper). The CSX file structure includes four sections: molecular publication, molecular system, molecular calculation, and molecular collection, as defined in the schema file for its specification.

In the molecular publication section (Figure 1), information about the data being published such as its title, authors (with their organization and email addresses), and abstract is stored. Additionally, four contextual flags can be stored in this section: visibility, status, category, and tags. The user can select the visibility as “private, protected, or public”, and the status as “preliminary, draft, or final”. A private publication can be accessed only by the author, whereas a protected publication is associated with an access key, and anyone who has the key (e.g., a collaborator) can see the data publication. Public publications can be seen by anyone accessing the portal where the publication is hosted. Since each publication is associated with a unique Uniform Resource Identifier (URI), an author can invite anyone to see the public publication by passing its dereferenceable URI (accessible pointer to the publication). Category indicates the area of chemistry such as biochemistry, inorganic chemistry, organic chemistry, physical chemistry, etc. that the data is based on.

Figure 1. Example of the molecular publication section in a CSX file.

The molecular system section (Figure 2) comprises molecules, residues, groups, and atoms with coordinates and bonds as well as some intrinsic molecular properties such as charge, chirality, and multiplicity. An InChIKey20 is used to identify each molecule and connect to other chemical databases such as ChEBI21 and ChemSpider.22 If the InChIKey is not available for inclusion in a CSX file when it is created, it will be generated automatically on our portal.

Figure 2. Example of the molecular system section for a CSX file.

In the molecular calculation section, various computational chemistry methods are organized into a hierarchical structure according to their reference state and determinant, as shown in Figure 3. This classification is for commonly used calculations and will be extended to cover additional methods in the future. The CSX schema, located at http://chemicalsemantics.com/schema/csx2.xsd, is extendable so new methods and calculated properties can be added to the schema by the addition of new elements and attributes (while also being backward compatible).

Figure 3. Example of the molecular calculation section of a CSX file.

If a computation is a multistep calculation such as an intrinsic reaction coordinate (IRC),23 we use the molecular collection concept to cover it. As shown in Figure 4, a molecular collection is just a set of internal pointers to the corresponding molecular system(s) and molecular calculation(s). This implementation is flexible enough that complicated multistep computational chemistry calculations can be represented.

Figure 4. Example of the molecular collection section of a CSX file.

Two approaches to the generation of CSX files from calculations performed using computational chemistry software have been developed. One approach has been to code data publication modules for particular quantum chemistry packages so that these modules are able to collect all data based on the CSX schema during a calculation (including metadata that is optionally added to output files) and create a CSX file at the end of a calculation that is equivalent to (or more comprehensive than) the normal package output. The second approach has been to develop a standalone program for parsing output files from computational chemistry calculations into CSX files, useful for legacy data.

Both approaches have pros and cons. The first approach is independent of the format of output files, and thus complete and robust calculation results and metadata are obtainable (possibly including the calculation input file). Most computational chemistry packages, however, have a poor Application Programming Interface (API) to access their computed data, configuration variables, and/or runtime environment, making it challenging to develop a CSX generation module without a significant modification of their source codes. Therefore, the involvement of the original package developers is required. Conversely, it is relatively straightforward to develop a program to parse the output files or structured temporary files (assuming there is good documentation of the output format available), but a simple change in the file format can lead to the failure of the parser. Moreover, the output and/or the temporary files do not always have the complete set of data/metadata. For instance, the molecular orbital information is not always included in an output file by default in many packages. Also, output data is sometimes contained in multiple files. Nevertheless, the parsing approach is preferable when many valuable legacy output files exist and the calculations will not or cannot be repeated.

Figure 5. Part of the CSX publication information.

Figure 6. Part of the XSLT file used to write out JSON-LD.

In this work, both approaches have been adopted to generate CSX files, depending on the availability and complexity of the computational chemistry packages. Psi4 is an open-source suite of ab initio quantum chemistry packages written in C++ and Python and has a good interface that allows external modules to hook into the main program.24 Therefore, an external module called csx4psi was developed to generate a CSX file for a Psi4 computation.

NWChem is another popular open-source computational chemistry package.25 In order to generate a CSX file from an NWChem calculation, its RunTime DataBase (RTDB) has been utilized. The RTDB, based on Berkeley DB, is a persistent data storage mechanism in NWChem. It is designed to hold specific calculation information for its high-level programming modules. NWChem developers have already developed a Python interface to read/write the data from an RTDB file. Therefore, we collaborated with the NWChem developers to add the missing computational chemistry data into the RTDB so that a complete CSX file could be successfully generated from an NWChem calculation.

Figure 7. XSLT output of the CSX data represented in JSON-LD.

Figure 8. JSON-LD context file for the CSX “molecularPublication” section: title, abstract, author metadata.

Figure 9. JSON-LD context file for the CSX “molecularPublication” section: GC ontology-defined metadata.

Figure 10. RDF triples generated from the JSON-LD file in Figure 7.

Gamess is a general ab initio quantum chemistry package, developed and maintained by the Gordon research group.26 To generate CSX files from Gamess, cclib,27 an open-source Python library, was used to parse Gamess output files. Cclib is able to parse output files from 11 other QC packages in addition to Gamess, and so this was an effective approach to expand CSX file generation. All computational data parsed by cclib (the latest version 1.4) for each package can be found at http://cclib.github.io/data.html. An internal module for cclib was developed to write out a CSX file after cclib completed parsing of a QC output file. This has the side benefit of allowing NMR chemical shifts and molecular polarizabilities to be parsed from computational chemistry data. Code additions will be submitted to the cclib GitHub repository so that the computational chemistry community can generate a CSX file using cclib.
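The four-section layout and the visibility rules described for CSX can be sketched in a few lines of Python. This is only an illustration under stated assumptions: apart from "molecularPublication", which appears in the figure captions, every element name here (`csx`, `molecularSystem`, `molecularCalculation`, `molecularCollection`, `accessKey`, and the flag elements) is a hypothetical stand-in and is not taken from the actual csx2.xsd schema. The helper names `build_csx` and `can_view` are likewise invented for this sketch.

```python
import xml.etree.ElementTree as ET

# Allowed flag values as listed in the text.
VISIBILITY = ("private", "protected", "public")
STATUS = ("preliminary", "draft", "final")

# Hypothetical element names for the four CSX sections (assumption).
SECTIONS = ("molecularPublication", "molecularSystem",
            "molecularCalculation", "molecularCollection")


def build_csx(title, author, visibility="private", status="draft",
              access_key=None):
    """Build a skeleton document holding the four sections and the flags."""
    if visibility not in VISIBILITY:
        raise ValueError("unknown visibility: %r" % visibility)
    if status not in STATUS:
        raise ValueError("unknown status: %r" % status)
    root = ET.Element("csx")
    for name in SECTIONS:
        ET.SubElement(root, name)
    pub = root.find("molecularPublication")
    ET.SubElement(pub, "title").text = title
    ET.SubElement(pub, "author").text = author
    ET.SubElement(pub, "visibility").text = visibility
    ET.SubElement(pub, "status").text = status
    if access_key is not None:
        # Hypothetical element: the text only says a key is associated.
        ET.SubElement(pub, "accessKey").text = access_key
    return root


def can_view(root, user, key=None):
    """Apply the private / protected / public rules from the text."""
    pub = root.find("molecularPublication")
    if user == pub.findtext("author"):
        return True  # the author can always see the publication
    visibility = pub.findtext("visibility")
    if visibility == "public":
        return True
    if visibility == "protected":
        stored = pub.findtext("accessKey")
        return stored is not None and key == stored
    return False  # private: author only


if __name__ == "__main__":
    doc = build_csx("Water MP2 test", "alice",
                    visibility="protected", access_key="k1")
    print(ET.tostring(doc, encoding="unicode"))
    print(can_view(doc, "bob", key="k1"))
```

In a real deployment the access check would live on the portal rather than in the file itself; the sketch only shows how the three visibility levels differ.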

CONVERSION OF CSX TO JSON-LD

In the data publishing process, CSX serves as a structured data container. Data and metadata are annotated using the element definitions in the CSX schema. However, the XML-encoded data in CSX carries no semantic meaning by itself, since it has no connection to the meanings defined in the Gainesville Core (GC) ontology. Therefore, we convert a CSX file into a JSON-LD file by the use of an Extensible Stylesheet Language Transformations (XSLT) file. The transformation integrates the computational chemistry data annotated in CSX with the semantic meanings described in the GC ontology.

JavaScript Object Notation (JSON) is becoming the default standard for moving data around the web. It is a compact, lightweight format, which makes it better suited than XML for transmitting/retrieving data in terms of size, human readability, and integration into many programming languages. The JSON format uses key−value pairs as the basis for identification of data, and it supports arrays as well as objects. In 2014, the W3C published a linked data specification built on JSON, i.e., JSON for Linked Data or JSON-LD.28 The implementation defines special keywords (names that start with “@”) to construct documents that can alias the definition of the name of a key−value pair, the data type of the value, URIs, and unique identifiers. As a result, a JSON-LD file (which is still valid JSON) can store data/metadata together with their semantic meaning and allows automated conversion to Resource Description Framework (RDF) triples.

An XSLT file is an XML file that contains instructions from the XSLT specification (in this case XSLT v 2.0)29 and applies them to the input XML file. In order to access data in the XML file, the XSLT uses the XPath specification, which addresses data in a manner similar to a directory path. The XSLT extracts data from the CSX file and outputs the data and names as JSON name−value pairs in a JSON-format text file. The format of the JSON is in alignment with JSON-LD context files that have been written to identify the name−value pairs, objects, and arrays in the JSON-LD file and refer to the meaning of those elements (based on the GC ontology).

In order to explain the process of conversion of information from CSX to RDF triples, Figure 5 shows some of the publication metadata in a CSX file, and Figure 6 shows part of the XSLT used to convert the CSX in Figure 5 into JSON-LD. Figure 7 shows the resultant JSON-LD of publication metadata. Figures 8 and 9 show the context file that defines the semantic meaning of the JSON-LD, and finally Figure 10 shows the RDF produced upon semantic processing of the JSON-LD. A series of papers on the stages of this conversion is planned in order to describe this process in more detail.
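The CSX to JSON-LD to RDF chain can be illustrated with a minimal Python sketch. Several things here are assumptions rather than the platform's actual implementation: Python's ElementTree stands in for the XSLT step, the XML element names are hypothetical, the context maps keys to Dublin Core terms instead of GC ontology terms, and the `http://example.org/` identifier is a placeholder. The expansion handles only a flat document and is far simpler than the full JSON-LD processing algorithm.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical CSX-like fragment (element names are assumptions).
CSX_SNIPPET = """<csx><molecularPublication>
  <title>Dipole moment of chlorobenzene</title>
  <abstract>MP2 test calculation.</abstract>
</molecularPublication></csx>"""

# A JSON-LD context maps short key names to IRIs that carry the meaning.
# Dublin Core terms are used here as a stand-in for GC ontology terms.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "abstract": "http://purl.org/dc/terms/abstract",
}


def csx_to_jsonld(xml_text, doc_id):
    """Extract publication metadata into a flat JSON-LD document."""
    pub = ET.fromstring(xml_text).find("molecularPublication")
    doc = {"@context": CONTEXT, "@id": doc_id}
    for child in pub:
        if child.tag in CONTEXT:  # keep only keys the context defines
            doc[child.tag] = (child.text or "").strip()
    return doc


def jsonld_to_triples(doc):
    """Expand each key through the context into (subject, predicate, object)."""
    context = doc["@context"]
    subject = doc["@id"]
    triples = []
    for key, value in doc.items():
        if key.startswith("@"):
            continue  # JSON-LD keywords are not data properties
        triples.append((subject, context[key], value))
    return triples


def to_ntriples(triples):
    """Serialize as N-Triples lines, which are also valid Turtle."""
    # json.dumps is used as a simple way to quote and escape the literal.
    return "\n".join("<%s> <%s> %s ." % (s, p, json.dumps(o))
                     for s, p, o in triples)


if __name__ == "__main__":
    document = csx_to_jsonld(CSX_SNIPPET, "http://example.org/publication/1")
    print(json.dumps(document, indent=2))
    print(to_ntriples(jsonld_to_triples(document)))
```

The point of the sketch is the role of the context: the JSON body keeps short, readable keys, while the context supplies the IRIs that turn each key−value pair into a triple.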



GAINESVILLE CORE (GC) ONTOLOGY

The development of the Gainesville Core (GC, named after the well-known Dublin Core) ontology is the core part of the semantic data publishing of computational chemistry data/metadata. It formalizes the meaning of terms used in the computational chemistry domain, and provides clear human-readable definitions that disambiguate term usage, along with logical axioms that allow automated reasoning. It enables consistency checking, classification, and querying over the knowledge landscape of the computational chemistry domain. It is also a cross-mapping hub for other chemistry ontologies such as ChEBI.21 The current GC ontology can be found at http://purl.org/gc.

Version 0.7 of GC provides semantic meaning of molecular system information including the definition of atoms (both type and position), bonds (including order and type), and molecular system properties such as temperature, charge, and multiplicity. In terms of molecular calculations, a detailed representation of technologies and methodologies is possible, and spin types, basis sets, and common calculated properties (e.g., energies, atomic and molecular properties, wave functions, orbitals and spectra) are systematically defined. Although it covers most canonical QC theoretical methods and calculated molecular properties, it is still incomplete. GC is open for the computational chemistry community to use, and the community is encouraged to contribute to ongoing improvement of the ontology as the de facto knowledge representation standard for computational chemistry. We will change the GC repository permission on our GitHub account so that public users can submit pull requests.

To demonstrate the semantic data publishing for computational chemistry, we have published 47 standard test calculations in Gamess. All test files have been successfully published on our portal except exams 7, 13, 18, 20, and 30. These highly specific input files have user-defined basis functions, which our current CSX schema does not support. A future CSX file will not only have standard basis set names, but also allow specification of exponents and coefficients for user-specific basis function definitions. Figure 11 shows one of the localized orbitals of H3PO from exam26.

Figure 11. One of the localized orbitals of H3PO from Gamess standard example 26.

SPARQL QUERYING OF SEMANTIC DATA

Once computational chemistry data is published in the RDF format, which is machine-readable, users are able to locate specific information efficiently using SPARQL queries. Unlike a conventional web search engine, which returns a collection of Web pages that might contain the requested answer, a SPARQL query can deliver accurate data directly. For example, imagine performing a set of calculations for chlorobenzene to study the basis set effect on dipole moments. Typically, one might want to ask a question such as “What are the dipole moment values calculated at the MP2 level using different basis sets for chlorobenzene?” There is no need to open each output file to grab values, if those output files are published and converted to RDF on our portal. Instead, a simple SPARQL query, as shown in Figure 12, produces the results shown in Figure 13. The unit for dipole moments on our portal is not just a literal string “Debye”, but is defined by the Quantities, Units, Dimensions, and Data Types (QUDT) ontology.30 The user can click the link to see the precise definition of “Debye”. Similarly, our basis sets are linked to the EMSL Basis Set Exchange library.31

Figure 12. Example SPARQL query of dipole moments calculated at the MP2 level using different basis sets for chlorobenzene.

FUTURE PLANS

Digital object identifiers (DOIs) will be assigned to each data publication on the portal, so that researchers will be able to cite their data publications. It will also allow journal publishers to link computational chemistry data used in a published research paper via DOIs, instead of the current practice of providing unusable PDF versions of research data. Currently, this work focuses primarily on quantum chemistry data. In the near future, the integration of CSX with molecular simulation packages such as AMBER, Charmm, VMD, and Gromacs will allow molecular dynamics calculations to be semantically published as well.

CONCLUSION

Data created by computational chemists, as output from a variety of computational packages, is essentially lost to the broad community of other scientists. The data is neither structured nor put into a form that other scientists can easily use. Scientific funding agencies are moving toward requiring scientists to have a data plan that corrects this problem. The semantic web and portals such as the one described here are one possible solution. We have created such a functioning portal for computational chemistry and are now exploring its utility.
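As a rough sketch of the SPARQL querying described in this paper, the Python below matches a basic graph pattern against a handful of in-memory triples, which is the core operation a SPARQL engine performs. Everything specific is an assumption made for illustration: the `http://example.org/` predicates are hypothetical and are not GC ontology terms, the basis set labels are placeholders, and the dipole values are dummy numbers, not computed results. The query text is included only to show the corresponding SPARQL form; it is not the query from Figure 12.

```python
EX = "http://example.org/"  # hypothetical namespace, not the GC ontology

# SPARQL form of the question, using the hypothetical predicates above.
QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?basis ?dipole WHERE {
  ?calc ex:molecule "chlorobenzene" ;
        ex:method "MP2" ;
        ex:basisSet ?basis ;
        ex:dipoleMoment ?dipole .
}
"""

# Placeholder data: the numbers are dummies, not calculated dipole moments.
TRIPLES = [
    (EX + "calc1", EX + "molecule", "chlorobenzene"),
    (EX + "calc1", EX + "method", "MP2"),
    (EX + "calc1", EX + "basisSet", "basis-A"),
    (EX + "calc1", EX + "dipoleMoment", 1.0),
    (EX + "calc2", EX + "molecule", "chlorobenzene"),
    (EX + "calc2", EX + "method", "MP2"),
    (EX + "calc2", EX + "basisSet", "basis-B"),
    (EX + "calc2", EX + "dipoleMoment", 2.0),
    (EX + "calc3", EX + "molecule", "chlorobenzene"),
    (EX + "calc3", EX + "method", "HF"),
    (EX + "calc3", EX + "basisSet", "basis-A"),
    (EX + "calc3", EX + "dipoleMoment", 3.0),
]

# The same graph pattern as QUERY; strings starting with "?" are variables.
PATTERNS = [
    ("?calc", EX + "molecule", "chlorobenzene"),
    ("?calc", EX + "method", "MP2"),
    ("?calc", EX + "basisSet", "?basis"),
    ("?calc", EX + "dipoleMoment", "?dipole"),
]


def is_variable(term):
    return isinstance(term, str) and term.startswith("?")


def match(triples, patterns):
    """Return every variable binding that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        extended = []
        for binding in solutions:
            for triple in triples:
                candidate = dict(binding)
                for term, value in zip(pattern, triple):
                    if is_variable(term):
                        # Bind the variable, or check an existing binding.
                        if candidate.setdefault(term, value) != value:
                            break
                    elif term != value:
                        break
                else:
                    extended.append(candidate)
        solutions = extended
    return solutions


if __name__ == "__main__":
    for row in match(TRIPLES, PATTERNS):
        print(row["?basis"], row["?dipole"])
```

Because every pattern shares the `?calc` variable, only calculations that satisfy all four conditions survive, which is why the Hartree−Fock entry is filtered out without any file ever being opened.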



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Stuart Chalk: 0000-0002-0703-7776
Neil S. Ostlund: 0000-0001-7705-4873

Notes

The authors declare no competing financial interest.

Figure 13. SPARQL results from Figure 12.

ACKNOWLEDGMENTS

Financial support for this work derives from a Department of Energy (DOE) SBIR Grant DE-SC0011735.

REFERENCES

(1) Bartlett, R.; Musial, M. Coupled-cluster Theory in Quantum Chemistry. Rev. Mod. Phys. 2007, 79, 291−352.
(2) Dror, R.; Dirks, R.; Grossman, J.; Xu, H.; Shaw, D.; Rees, D. Biomolecular Simulation: A Computational Microscope for Molecular Biology. Annu. Rev. Biophys. 2012, 41, 429−452.
(3) Rezac, J.; Hobza, P. Benchmark Calculations of Interaction Energies in Noncovalent Complexes and Their Applications. Chem. Rev. (Washington, DC, U. S.) 2016, 116, 5038−5071.
(4) Dror, R.; Arlow, D.; Borhani, D.; Jensen, M.; Piana, S.; Shaw, D. Identification of Two Distinct Inactive Conformations of the Beta(2)-Adrenergic Receptor Reconciles Structural and Biochemical Observations. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 4689−4694.
(5) The Nobel Prize in Chemistry 1998. http://www.nobelprize.org/nobel_prizes/chemistry/laureates/1998/ (accessed December 1, 2016).
(6) The Nobel Prize in Chemistry 2013. http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/ (accessed December 1, 2016).
(7) Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J. W.; da Silva Santos, L. B.; Bourne, P. E.; et al. The FAIR Guiding Principles for Scientific Data Management and Stewardship. Sci. Data 2016, 3, 160018.
(8) NIST Computational Chemistry Comparison and Benchmark Database. http://cccbdb.nist.gov/ (accessed December 1, 2016).
(9) Rezac, J.; Jurecka, P.; Riley, K.; Cerny, J.; Valdes, H.; Pluhackova, K.; Berka, K.; Rezac, T.; Pitonak, M.; Vondrasek, J.; Hobza, P. Quantum Chemical Benchmark Energy and Geometry Database for Molecular Clusters and Complex Molecular Systems (http://www.begdb.com): A Users Manual and Examples. Collect. Czech. Chem. Commun. 2008, 73, 1261−1270.
(10) Adams, S.; de Castro, P.; Echenique, P.; Estrada, J.; Hanwell, M.; Murray-Rust, P.; Sherwood, P.; Thomas, J.; Townsend, J. The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age. J. Cheminf. 2011, 3, 38.
(11) Alvarez-Moreno, M.; de Graaf, C.; Lopez, N.; Maseras, F.; Poblet, J.; Bo, C. Managing the Computational Chemistry Big Data Problem: The ioChem-BD Platform. J. Chem. Inf. Model. 2015, 55, 95−103.
(12) Thibault, J.; Facelli, J.; Cheatham, T. iBIOMES: Managing and Sharing Biomolecular Simulation Data in a Distributed Environment. J. Chem. Inf. Model. 2013, 53, 726−736.
(13) The Integrated Rule-Oriented Data System. http://irods.org/ (accessed December 1, 2016).
(14) Berners-Lee, T.; Hendler, J. Publishing on the Semantic Web: The Coming Internet Revolution will Profoundly Affect Scientific Information. Nature 2001, 410, 1023−1024.
(15) Resource Description Framework. https://www.w3.org/RDF/ (accessed December 1, 2016).
(16) The World Wide Web Consortium. https://www.w3.org/ (accessed December 1, 2016).
(17) Web Ontology Language. https://www.w3.org/OWL/ (accessed December 1, 2016).
(18) SPARQL Protocol and RDF Query Language. https://www.w3.org/TR/rdf-sparql-query/ (accessed December 1, 2016).
(19) Murray-Rust, P.; Rzepa, H. Chemical markup, XML, and the Worldwide Web. 1. Basic principles. J. Chem. Inf. Comput. Sci. 1999, 39, 928−942.
(20) The IUPAC International Chemical Identifier. http://www.inchi-trust.org/ (accessed December 1, 2016).
(21) Hastings, J.; de Matos, P.; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M.; Steinbeck, C. The ChEBI Reference Database and Ontology for Biologically Relevant Chemistry: Enhancements for 2013. Nucleic Acids Res. 2013, 41, D456−D463.
(22) ChemSpider. http://www.chemspider.com/ (accessed December 1, 2016).
(23) Fukui, K. The Path Of Chemical-Reactions - The IRC Approach. Acc. Chem. Res. 1981, 14, 363−368.
(24) Turney, J.; Simmonett, A.; Parrish, R.; Hohenstein, E.; Evangelista, F.; Fermann, J.; Mintz, B.; Burns, L.; Wilke, J.; Abrams, M.; et al. PSI4: an Open-Source Ab Initio Electronic Structure Program. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2012, 2, 556−565.
(25) Valiev, M.; Bylaska, E. J.; Govind, N.; Kowalski, K.; Straatsma, T. P.; Van Dam, H. J. J.; Wang, D.; Nieplocha, J.; Apra, E.; Windus, T. L.; de Jong, W. NWChem: A Comprehensive and Scalable Open-Source Solution for Large Scale Molecular Simulations. Comput. Phys. Commun. 2010, 181, 1477−1489.
(26) Gordon, M. S.; Schmidt, M. W. Advances in Electronic Structure Theory: Gamess a Decade Later. In Theory and Applications of Computational Chemistry: The First Forty Years; Dykstra, C. E., Frenking, G., Kim, K. S., Scuseria, G. E., Eds.; Elsevier: Amsterdam, 2005; pp 1167−1189.
(27) O’Boyle, N.; Tenderholt, A.; Langner, K. cclib: A Library for Package-Independent Computational Chemistry Algorithms. J. Comput. Chem. 2008, 29, 839−845.
(28) JavaScript Object Notation for Linked Data. http://json-ld.org/ (accessed December 1, 2016).
(29) Extensible Stylesheet Language Transformations. https://www.w3.org/standards/xml/transformation (accessed December 1, 2016).
(30) Quantities, Units, Dimensions and Data Types Ontologies. http://www.qudt.org/ (accessed December 1, 2016).
(31) Schuchardt, K.; Didier, B.; Elsethagen, T.; Sun, L.; Gurumoorthi, V.; Chase, J.; Li, J.; Windus, T. Basis Set Exchange: A Community Database for Computational Sciences. J. Chem. Inf. Model. 2007, 47, 1045−1052.