Searching Online Chemical Data Repositories via the ChemAgora Portal

Nov 30, 2017 - ACS AuthorChoice - This is an open access article published under a Creative Commons Attribution (CC-BY) License, which permits unrestr...
Application Note Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

pubs.acs.org/jcim

Searching Online Chemical Data Repositories via the ChemAgora Portal

Antonella Zanzi* and Clemens Wittwehr

European Commission, Joint Research Centre (JRC), Via Enrico Fermi 2749, 21027 Ispra (VA), Italy

ABSTRACT: ChemAgora, a web application designed and developed in the context of the “Data Infrastructure for Chemical Safety Assessment” (diXa) project, provides search capabilities over chemical data from resources available online, enabling users to cross-reference their search results with both regulatory chemical information and public chemical databases. ChemAgora, through an on-the-fly search, reports whether a chemical is known or not in each of the external data sources and provides clickable links leading to the third-party web site pages containing the information. The original purpose of the ChemAgora application was to correlate studies stored in the diXa data warehouse with available chemical data. Since the end of the diXa project, ChemAgora has evolved into an independent portal, currently accessible directly through the ChemAgora home page, with improved search capabilities over online data sources.





INTRODUCTION

ChemAgora has been designed and developed in the context of the diXa project, a three-year project (started in 2011 and ended in 2014) funded by the European Union Seventh Framework Programme (FP7) for Research and Technological Development to provide a single resource to capture the data produced by toxicogenomics studies (toxicogenomics being the application of “-omics” technologies to the risk assessment of chemical toxicity). The main result of the diXa project is an infrastructure1 storing toxicogenomics data, consisting of a central data warehouse accessible through a web portal; the data warehouse is complemented by analytical resources and links to chemical/toxicological information and human disease data. ChemAgora provides direct access, for each chemical substance in the diXa data warehouse, to chemical information available on third-party databases. The web application, through an on-the-fly search, reports whether a chemical is known or not in each of the external data sources and provides clickable links leading to the third-party web site pages containing the information.

Some of the third-party data sources contain regulatory chemical information, typically using the CAS Registry Number (CASRN),2 a registered trademark of the American Chemical Society, as the substance identifier. While diXa itself does not use CASRNs but InChIKeys3 as chemical structure identifiers, ChemAgora maps the InChIKeys to CASRNs, then also searches CASRN-based repositories and thereby makes a much wider range of data available to diXa users. The ChemAgora search functionalities are available through the home page of the application (http://chemagora.jrc.ec.europa.eu) and, for the diXa infrastructure users, through a call performed by the diXa portal (http://www.dixa-fp7.eu).

CHEMAGORA

Enabling users to cross-reference their search results with both regulatory chemical information and public chemical databases was the main target of the portal design. To reach this goal, we chose to exploit synergies with eChemPortal (i.e., 30 data sources containing regulatory information from institutional bodies around the world) to have data from the regulatory world, and with public repositories that contain data of interest for those who work in the toxicological field and that allow online search functionalities.

eChemPortal4−6 provides free public access to information on chemical properties and direct links to collections of information prepared for government chemical review programmes at national, regional, and international levels. eChemPortal is an effort of the Organisation for Economic Co-operation and Development (OECD) in collaboration with the European Commission (EC), the European Chemicals Agency (ECHA), the United States Environmental Protection Agency (EPA), Health Canada, the Japanese Ministry of Economy, Trade and Industry (METI), the Japanese National Institute of Technology and Evaluation (NITE), the International Council of Chemical Associations (ICCA), the Business and Industry Advisory Committee (BIAC), the World Health Organization’s (WHO) International Programme on Chemical Safety (IPCS), the United Nations Environment Programme (UNEP), and environmental nongovernmental organizations.

In addition to eChemPortal, ChemAgora links to the following public data sources: ChemIDplus, CCRIS, GENETOX, HSDB, IRIS, and ITER from the TOXicology Data NETwork (TOXNET);7 ChEMBL;8 ChEBI;9 CompTox

Received: February 14, 2017


DOI: 10.1021/acs.jcim.7b00086 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Journal of Chemical Information and Modeling

Figure 1. Result page for an InChIKey search. [The European Commission has granted permission to the publisher to reproduce their logo contained in the screenshots of the ChemAgora portal.]

Chemistry Dashboard;10 ChemSpider;11 PubChem;12 DrugBank;13 ConsensusPathDB;14 Common Chemistry;15 Comparative Toxicogenomics Database (CTDbase);16 CREST;17 and the OECD Adverse Outcome Pathway (AOP) Wiki.18

Search Functionalities. ChemAgora provides an identifier-based search and a structure-based search. With the first option, the identifiers accepted by the portal are InChIKeys, CASRNs, and names, which can be partial. Each search request based on an InChIKey is also performed with the corresponding CASRN(s): more than one CASRN can correspond to an InChIKey, and in such a case all the CASRNs found are used to perform the search. Each search request based on a CASRN is also performed with the corresponding InChIKey. Finally, to perform a search with a name, the name provided as input is mapped to both the corresponding InChIKey and CASRN(s); the search is then executed for the name, the InChIKey, and all the CASRNs found. To map among chemical identifiers, ChemAgora relies on the information provided by the NCI/CADD Chemical Identifier Resolver.

For each identifier and for each third-party data source, the search result page (an example is shown in Figure 1) reports whether the requested identifier has been found and provides clickable links leading to the page where information about the chemical can be found. The result page also shows the molecular formula, the InChI, and an image of the chemical structure collected from the external resources on which ChemAgora relies (as detailed in the section External Resources and Local Repository). In the result page, a link to a list of synonyms found through the NCI/CADD Chemical Identifier Resolver is also provided. In addition, two buttons are displayed: one starts an InChIKey skeleton search and the other, when the InChI is available, opens the chemical structure editor and imports the structure of the substance.

The functionality “search by name” includes the option to use partial names with the wildcard character “*”; more than one wildcard character can be present in the input string. When a search-by-name request contains no wildcard character, a search by name with an exact match is executed. However, if wildcard characters are detected in the input string, the result page (Figure 2) shows the list of chemical names found in the local database that match the requested name pattern along with, if available, the molecular formula, the IUPAC name, and the chemical structure image. From the list, the user can select one of the names and start a search in the third-party repositories.

An InChIKey skeleton search functionality is also provided, and it can be activated from both the search page and the result page. The result page of an InChIKey skeleton search shows the list of the InChIKeys found, as in the case of the search by partial name.

With the structure search option, a search is carried out using a chemical structure drawn by the user using Ketcher,19 an open source web-based chemical structure editor written in


Figure 2. Result page for a search by partial name. [The European Commission has granted permission to the publisher to reproduce their logo contained in the screenshots of the ChemAgora portal.]

JavaScript that has been integrated in ChemAgora (Figure 3). A chemical structure can also be loaded into the editor from a structure file (MOL and SDF formats are accepted) or by providing an InChI. Using the chemical structure editor, three types of search can be performed: an exact match of the provided structure, a similarity search with a Tanimoto similarity cutoff, or a search using the InChIKey skeleton of the provided structure. In the first case, the chemical structure supplied through the editor is mapped to an InChIKey, which is used to carry out the search on the third-party repositories; the second option starts a similarity search based on the ChEMBL Data Web Services with a choice of 90%, 80%, or 70% Tanimoto similarity cutoff; the third option, after the mapping of the structure into an InChIKey, starts a search using the corresponding InChIKey skeleton.

Accessing ChemAgora Search Functionalities. As described above, users accessing the home page of the ChemAgora portal can start a search using an identifier (InChIKey, CASRN, or name) or by drawing a chemical structure with the editor. Furthermore, in order to start a search without accessing the portal search page (e.g., if a third-party application wants to reuse ChemAgora capabilities), ChemAgora provides the following uniform resource locators (URLs):

• The URL to start a search based on an InChIKey has the form “http://chemagora.jrc.ec.europa.eu/chemagora/inchikey/” plus an InChIKey.
• The URL to start a search based on a CASRN has the form “http://chemagora.jrc.ec.europa.eu/chemagora/casrn/” plus a CASRN.
• The URL to start a search with a name has the form “http://chemagora.jrc.ec.europa.eu/chemagora/name/” plus a chemical name.

For example, the following URLs can be used in order to search for caffeine information:

• http://chemagora.jrc.ec.europa.eu/chemagora/inchikey/RYYVLZVUVIJVGH-UHFFFAOYSA-N
• http://chemagora.jrc.ec.europa.eu/chemagora/casrn/58-08-2
• http://chemagora.jrc.ec.europa.eu/chemagora/name/caffeine

These URLs provide an easy way to use some of the ChemAgora search functionalities from outside the application. They can be typed directly into the address bar of an Internet browser, or they can be embedded in web applications. This second approach has been used to connect the diXa software infrastructure with the ChemAgora portal from the page of the diXa portal listing the chemical substances present in the diXa repository. For each of the chemicals listed in the diXa page, there is a button linking to the ChemAgora application activating, for the selected chemical, the ChemAgora search functionality for the


Figure 3. Search by structure with Ketcher, the chemical structure editor integrated into ChemAgora. [The European Commission has granted permission to the publisher to reproduce their logo contained in the screenshots of the ChemAgora portal.]

Figure 4. Accessing the ChemAgora search functionality from the diXa portal. [The European Commission has granted permission to the publisher to reproduce their logo contained in the screenshots of the ChemAgora portal.]

corresponding InChIKey; the data flow from the diXa portal to the search functionality in ChemAgora is shown in Figure 4. For InChIKeys and CASRNs, the option is also provided to receive the search results in XML format through the following URLs:

• “http://chemagora.jrc.ec.europa.eu/chemagora/xml/inchikey/” plus an InChIKey
• “http://chemagora.jrc.ec.europa.eu/chemagora/xml/casrn/” plus a CASRN

External Resources and Local Repository. The main advantages of accessing the third-party data sources on-the-fly, instead of storing the information about the data availability of chemicals for each external resource in a local database, are that the result of a search is always up-to-date with respect to the content of the third-party repositories and that there is no need to


periodically update the local database. However, as a consequence of this approach, ChemAgora cannot provide any information about the availability of a chemical substance in a third-party repository if the external data source is temporarily not available.

ChemAgora also relies on third-party services in order to convert from one chemical identifier to another, to implement the similarity search, and to provide chemical information, such as molecular formulas, InChIs, and IUPAC names; the external services exploited are the following:

• The NCI/CADD Chemical Identifier Resolver (CIR)20 is a conversion service for different chemical structure identifiers provided by the NCI/CADD Group at the U.S. National Cancer Institute. ChemAgora relies on the NCI/CADD CIR to map among chemical identifiers. It is also one of the sources of molecular formulas, InChIs, IUPAC names, synonyms, and structure images.
• PubChem’s PUG REST service,21 which is maintained by the National Center for Biotechnology Information (NCBI) of the U.S. National Library of Medicine, provides access to PubChem data through HTTP requests. This service is another source of molecular formulas, InChIs, and structure images.
• ChemIDplus,22 which is also maintained by the U.S. National Library of Medicine, is the source of some of the structure images displayed on ChemAgora.
• ChEMBL Data Web Services,23 which are provided by the European Bioinformatics Institute (EBI) of the European Molecular Biology Laboratory (EMBL), are used by ChemAgora to implement the similarity search.

This approach reduces the maintenance work required to manage the portal, but it has the drawback that the ChemAgora search functionalities are partially disrupted if the external services are temporarily not available.

To implement the search by partial name (i.e., using a wildcard character), a local repository has nevertheless been set up and populated with names of chemical substances collected from the ECHA REACH (Registration, Evaluation, Authorisation, and Restriction of Chemicals) data repository.24

Technical Details. ChemAgora is a Java servlet-based web application; the local repository has been implemented with the Oracle database system. The following software infrastructure has been set up to host the ChemAgora portal: (i) a virtual system hosted on a VMware infrastructure with two virtual CPUs and 4 GB of RAM, (ii) the installed operating system is Linux CentOS (version 6), (iii) the application server used is Apache Tomcat (version 8), and (iv) the Oracle DBMS (version 11) is installed on a database server. Third-party data sources are accessed by ChemAgora through a simulated web form submission or, when available, via REST APIs. Table 1 summarizes, for each third-party repository, the approach used to access it along with the identifiers used during the searches.

Table 1. Approach Used To Access Each Third-Party Repository and Identifiers Used

                                                                                          searched by identifier
third-party repository                                  accessing approach               InChIKey   CASRN   name
eChemPortal                                             simulated web form submission    −          ×       ×
TOXNET: ChemIDplus, CCRIS, GENETOX, HSDB, IRIS, ITER    simulated web form submission    −          ×       ×
ChEMBL                                                  REST API                         ×          −       −
ChEBI                                                   simulated web form submission    ×          −       −
ChemSpider                                              simulated web form submission    ×          ×       −
PubChem                                                 simulated web form submission    ×          ×       −
ConsensusPathDB                                         simulated web form submission    −          ×       ×
CompTox Chemistry Dashboard                             simulated web form submission    ×          ×       ×
CREST                                                   simulated web form submission    ×          −       −
CTDbase                                                 simulated web form submission    −          ×       ×
DrugBank                                                simulated web form submission    ×          ×       ×
Common Chemistry                                        simulated web form submission    −          ×       −
AOP Wiki                                                REST API                         −          ×       −
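
The trade-off of on-the-fly access described under External Resources and Local Repository can be illustrated with a minimal Python sketch. It is not the ChemAgora implementation, which is a Java servlet application; the names `Availability`, `check_sources`, and the two stub lookup functions are hypothetical, and a real lookup would query a repository through a simulated web form submission or a REST API. The sketch only shows why a third outcome is needed besides found and not found: when a source cannot be reached, nothing can be said about the chemical.

```python
from enum import Enum
from typing import Callable, Dict


class Availability(Enum):
    FOUND = "found"
    NOT_FOUND = "not found"
    UNAVAILABLE = "source unavailable"


# Hypothetical lookup signature: takes an identifier, returns True when the
# source knows the chemical, and raises an exception when the source cannot
# be reached.
Lookup = Callable[[str], bool]


def check_sources(identifier: str, sources: Dict[str, Lookup]) -> Dict[str, Availability]:
    """Query every source at request time and record one status per source."""
    results: Dict[str, Availability] = {}
    for name, lookup in sources.items():
        try:
            known = lookup(identifier)
        except Exception:
            # A failed request says nothing about the chemical itself.
            results[name] = Availability.UNAVAILABLE
        else:
            results[name] = Availability.FOUND if known else Availability.NOT_FOUND
    return results


# Stub sources standing in for real repositories (illustrative only).
def stub_knows_caffeine(identifier: str) -> bool:
    return identifier == "58-08-2"


def stub_offline(identifier: str) -> bool:
    raise ConnectionError("source temporarily not available")


if __name__ == "__main__":
    report = check_sources(
        "58-08-2",
        {"source A": stub_knows_caffeine, "source B": stub_offline},
    )
    for source, status in report.items():
        print(f"{source}: {status.value}")
```

Because nothing is cached, each call reflects the current content of every source that answers, which is the advantage stated in the text; the cost is that an unreachable source yields no information for that request.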

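The URL forms listed under Accessing ChemAgora Search Functionalities can be assembled programmatically by a third-party application. The following Python sketch builds them from the base addresses given in the text. The function name `chemagora_url` and the percent-encoding of the identifier are assumptions made for illustration (the text only states that the identifier is appended to the base address), and the sketch does not send any request.

```python
from urllib.parse import quote

# Base address taken from the URL forms given in the text.
BASE = "http://chemagora.jrc.ec.europa.eu/chemagora"

# Identifier types described in the text for result pages and for XML output.
HTML_KINDS = {"inchikey", "casrn", "name"}
XML_KINDS = {"inchikey", "casrn"}


def chemagora_url(kind: str, identifier: str, xml: bool = False) -> str:
    """Return the search URL for one identifier (hypothetical helper)."""
    kind = kind.lower()
    allowed = XML_KINDS if xml else HTML_KINDS
    if kind not in allowed:
        raise ValueError(f"unsupported identifier type for this output: {kind!r}")
    prefix = f"{BASE}/xml" if xml else BASE
    # Assumption: characters that are unsafe in a URL path are percent-encoded.
    return f"{prefix}/{kind}/{quote(identifier, safe='')}"


if __name__ == "__main__":
    # Caffeine, identified by InChIKey, CASRN, and name.
    print(chemagora_url("inchikey", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"))
    print(chemagora_url("casrn", "58-08-2"))
    print(chemagora_url("name", "caffeine"))
    print(chemagora_url("casrn", "58-08-2", xml=True))
```

Rejecting a name-based XML request reflects only what the text describes: the XML option is mentioned for InChIKeys and CASRNs.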

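The search by partial name matches a pattern containing the wildcard character “*” against the chemical names held in the local repository. How ChemAgora performs this match inside its Oracle database is not described; the sketch below only illustrates the idea in Python, assuming case-insensitive matching in which each “*” stands for any run of characters (including none). The function names and the sample names are hypothetical.

```python
import re
from typing import Iterable, List


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a name pattern with '*' wildcards into a compiled regex."""
    # Escape the literal pieces so that characters such as '(' or '.' in a
    # chemical name are matched literally, then join the pieces with '.*'.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def search_names(pattern: str, names: Iterable[str]) -> List[str]:
    """Exact match when no wildcard is present, pattern match otherwise."""
    if "*" not in pattern:
        return [n for n in names if n.lower() == pattern.lower()]
    regex = wildcard_to_regex(pattern)
    return [n for n in names if regex.fullmatch(n)]


if __name__ == "__main__":
    # Illustrative stand-in for the local repository of chemical names.
    local_names = ["caffeine", "Caffeine citrate", "theobromine", "benzene"]
    print(search_names("caff*", local_names))
    print(search_names("*ine", local_names))
    print(search_names("caffeine", local_names))
```

The two branches mirror the behaviour described in the text: an input without wildcards triggers an exact-match name search, while an input with wildcards returns a list of candidate names from which the user selects one.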

RELATED WORK

A number of chemical data repositories provide search capabilities on their data through web interfaces. In addition, the online platforms providing conversion services among chemical identifiers base the conversion functionalities on data previously downloaded from the third-party databases and stored locally. The only web site that we are currently aware of that provides on-the-fly search functionalities is iScienceSearch,25 which was released after the diXa project started. iScienceSearch, using web APIs or web services exposed by the various data sources, allows users to search free chemistry databases and scientific journals on the Internet by structure, CAS Registry Number, name, and free text. The structural search functionality is provided through the chemical structure editor JSDraw from Scilligence Corporation. iScienceSearch is provided by AKos Consulting & Solutions Deutschland GmbH (AKos GmbH), a company selling compounds, software, and databases for chemical and pharmaceutical research and development. AKos GmbH provides access to iScienceSearch for free, stating on its web site that the work is financed through advertising.



CONCLUSIONS

Enabling users to cross-reference their search results with both regulatory chemical information and public chemical repositories was the main target of the portal design. To reach this goal, we chose to exploit synergies with the data sources available through the OECD eChemPortal to cover the toxicological regulatory environment; in addition, we selected some public repositories that share data of interest for the more scientifically oriented toxicological field and that allow online search functionalities. The web application developed on the basis of this design provides functionalities to search and access chemical data on third-party online repositories, providing a first step in bridging the gap between scientific and regulatory environments. Currently, ChemAgora executes a search for the user request across 17 third-party online repositories and, through eChemPortal, across more than 30 data sources containing regulatory information; moreover, the search functionality is easily extensible to other third-party data sources. The development of ChemAgora has resulted in a positive outcome for eChemPortal users: both the search by structure functionality and the search performed on InChIKey provided by ChemAgora extend the search options on eChemPortal, thus increasing the number of users that can benefit from the services provided by eChemPortal.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Antonella Zanzi: 0000-0002-3567-9709
Clemens Wittwehr: 0000-0003-2760-7702

Funding

This work was supported by the Seventh Framework Programme (FP7) of the European Union, under grant agreement number RI-283775.

Notes

The authors declare no competing financial interest.

REFERENCES

(1) Hendrickx, D. M.; Aerts, H. J. W. L.; Caiment, F.; Clark, D.; Ebbels, T. M. D.; Evelo, C. T.; Gmuender, H.; Hebels, D. G. A. J.; Herwig, R.; Hescheler, J.; Jennen, D. G. J.; Jetten, M. J. A.; Kanterakis, S.; Keun, H. C.; Matser, V.; Overington, J. P.; Pilicheva, E.; Sarkans, U.; Segura-Lepe, M. P.; Sotiriadou, I.; Wittenberger, T.; Wittwehr, C.; Zanzi, A.; Kleinjans, J. C. S. diXa: a Data Infrastructure for Chemical Safety Assessment. Bioinformatics 2015, 31, 1505−1507.
(2) Weisgerber, D. W. Chemical Abstracts Service Chemical Registry System: History, Scope, and Impacts. J. Am. Soc. Inf. Sci. 1997, 48, 349−360.
(3) Heller, S.; McNaught, A.; Stein, S.; Tchekhovskoi, D.; Pletnev, I. InChI: The Worldwide Chemical Structure Identifier Standard. J. Cheminf. 2013, 5, 7.
(4) Wittwehr, C. eChemPortal: Neuer Zugang zu Chemikalien-Daten. Chem. Unserer Zeit 2011, 45, 122−125.
(5) De Marcellus, S. eChemPortal−The Global Portal to Information on Chemical Substances. In Encyclopedia of Toxicology, 3rd ed.; Wexler, P., Ed.; Vol. 3; Elsevier Inc., Academic Press, 2014; pp 655−662.
(6) Wexler, P.; Judson, R.; De Marcellus, S.; De Knecht, J.; Leinala, E. Health Effects of Toxicants: Online Knowledge Support. Life Sci. 2016, 145, 284−293.
(7) Wexler, P. TOXNET: an Evolving Web Resource for Toxicology and Environmental Health Information. Toxicology 2001, 157, 3−10.
(8) Warr, W. ChEMBL. An Interview with John Overington, Team Leader, Chemogenomics at the European Bioinformatics Institute Outstation of the European Molecular Biology Laboratory (EMBL-EBI). J. Comput.-Aided Mol. Des. 2009, 23, 195−198.
(9) Degtyarenko, K.; De Matos, P.; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcántara, R.; Darsow, M.; Guedj, M.; Ashburner, M. ChEBI: a Database and Ontology for Chemical Entities of Biological Interest. Nucleic Acids Res. 2008, 36, D344−D350.
(10) Chemistry Dashboard. U.S. Environmental Protection Agency. https://comptox.epa.gov/dashboard (accessed July 21, 2017).
(11) Pence, H. E.; Williams, A. ChemSpider: an Online Chemical Information Resource. J. Chem. Educ. 2010, 87, 1123−1124.
(12) Wang, Y.; Xiao, J.; Suzek, O. T.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: a Public Information System for Analyzing Bioactivities of Small Molecules. Nucleic Acids Res. 2009, 37, W623−W633.
(13) Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; Tang, A.; Gabriel, G.; Ly, C.; Adamjee, S.; Dame, Z. T.; Han, B.; Zhou, Y.; Wishart, D. S. DrugBank 4.0: Shedding New Light on Drug Metabolism. Nucleic Acids Res. 2014, 42, D1091−D1097.
(14) Kamburov, A.; Stelzl, U.; Lehrach, H.; Herwig, R. The ConsensusPathDB Interaction Database: 2013 Update. Nucleic Acids Res. 2013, 41, D793−D800.
(15) American Chemical Society. Common Chemistry. http://www.commonchemistry.org/ (accessed July 21, 2017).
(16) Davis, A. P.; Grondin, C. J.; Johnson, R. J.; Sciaky, D.; King, B. L.; McMorran, R.; Wiegers, J.; Wiegers, T. C.; Mattingly, C. J. The Comparative Toxicogenomics Database: Update 2017. Nucleic Acids Res. 2017, 45, D972−D978.
(17) CREST. http://www.rmeonline.net/CREST/ (accessed July 21, 2017).
(18) AOP-Wiki. https://aopwiki.org (accessed July 21, 2017).
(19) Karulin, B.; Kozhevnikov, M. Ketcher: Web-based Chemical Structure Editor. J. Cheminf. 2011, 3, P3.
(20) Chemical Identifier Resolver. NCI/CADD Group. https://cactus.nci.nih.gov/chemical/structure (accessed July 21, 2017).
(21) PUG REST Tutorial. PubChem. http://pubchem.ncbi.nlm.nih.gov/pug_rest (accessed July 21, 2017).
(22) ChemIDplus. U.S. National Library of Medicine. NIH. https://chem.nlm.nih.gov/chemidplus/ (accessed July 21, 2017).
(23) CHEMBL. EMBL-EBI. https://www.ebi.ac.uk/chembl/ws (accessed July 21, 2017).
(24) REACH. ECHA. https://echa.europa.eu/regulations/reach/understanding-reach (accessed July 21, 2017).
(25) iScienceSearch. http://isciencesearch.com (accessed July 21, 2017).
