Chapter 2

Complexity and Heterogeneity of Data for Chemical Information Science

Downloaded by CORNELL UNIV on October 11, 2016 | http://pubs.acs.org Publication Date (Web): October 5, 2016 | doi: 10.1021/bk-2016-1222.ch002

Jürgen Bajorath* Department of Life Science Informatics, B-IT, LIMES Program Unit Chemical Biology and Medicinal Chemistry, Rheinische Friedrich-Wilhelms-Universität, Dahlmannstr. 2, D-53113 Bonn, Germany *E-mail: [email protected]

In biology and bioinformatics, the ‘big data’ issue has been high on the agenda for several years, presenting significant challenges for data management, analysis, and knowledge extraction. The big data wave is also beginning to hit chemistry, especially in the context of pharmaceutical R&D, albeit still at a much lower magnitude than in biology. Nonetheless, the characteristics of rapidly growing amounts of chemical data need to be re-evaluated and the criteria for their analysis re-defined. Various databases are evolving to store and organize large numbers of compounds and volumes of activity data. So much compound and activity information is already available in the public domain that it would be rather negligent for chemical or pharmaceutical companies to ignore it in their own R&D efforts. Importantly, not only are data volumes growing at unprecedented rates, but the complexity of chemical data and their heterogeneity across different databases are also increasing. Despite these challenges, the big data era also provides many opportunities for chemical information science at the interface with experimental disciplines. The following discussion concentrates on bioactive compounds and activity data.

© 2016 American Chemical Society Bienstock et al.; Frontiers in Molecular Design and Chemical Information Science - Herman Skolnik Award Symposium 2015: ... ACS Symposium Series; American Chemical Society: Washington, DC, 2016.

Introduction

In many scientific fields, current research is more data-intensive than ever before, in both data generation and knowledge extraction. This does not constitute a ‘big data’ scenario per se, but there is no doubt that biology and the life sciences have entered the big data era, and similar trends can now be observed in chemistry, especially in chemical biology and medicinal chemistry (1, 2) or, in a broader sense, in chemistry in highly interdisciplinary settings such as drug discovery (where the focus is on bioactive molecules). Accordingly, it is not surprising that large databases are evolving in the public domain to archive and organize rapidly growing amounts of chemical structures and associated activity data (similar initiatives, on a somewhat smaller scale, must be supported internally by large pharmaceutical companies for their own proprietary data). Prominent data repositories currently include PubChem (3, 4), UniChem (5), ChemSpider (6), the Chemical Structure Lookup Service (CSLS) (7), and ZINC (8), which often store overlapping yet distinct collections of chemical structures and data. In addition, there are databases with a particular focus on bioactive compounds and drug discovery, including ChEMBL (9, 10), the major current repository for compounds and activity data originating from medicinal chemistry sources; BindingDB (11, 12), with a current focus similar to ChEMBL; Open PHACTS (13), which reports biological targets and/or activities for given compounds (in the form of pharmacological records); and DrugBank (14), a major source of approved and experimental drugs. A major challenge facing the drug discovery field is how best to make use of the large volumes of compound activity data available in the public domain. It is generally recognized that disregarding this information in internal discovery efforts would be careless.

Data mining and merging of data from different sources typically fall into the domain of chemoinformatics, which was first described in 1998 as a discipline evolving in drug discovery environments (15). Chemoinformatics includes a large spectrum of computational approaches and infrastructures for analysis, modeling, and design (16), and the field is still evolving. However, the scientific roots of chemoinformatics go back many years, at least to the 1950s and 1960s, long before the term was coined (16–18). Considering core tasks such as chemical structure classification, data mining, information extraction, or derivation of predictive models for chemical properties, chemoinformatics can well be considered in a broader context as part of chemical information science (18), a view adopted herein for the discussion of compound activity data and their characteristics.

Data Volumes

Considering only some of the major chemical databases introduced above, recent growth in chemical structure data alone has been nothing short of astonishing and would not have been predicted just a few years ago. As summarized in Table 1, tens of millions to about 100 million compound or structure entries were available in August 2015 in various databases, including collections of synthetic and drug-like molecules (e.g. ZINC), repositories of screening data sets, substances, and compounds (PubChem), and databases establishing links between different repositories (e.g. UniChem). Regardless of specific database design philosophies and architectures, it is evident that organizing and mining such large amounts of chemical structure data requires elaborate and efficient computational infrastructures.

Table 1. The number of compound (CPD) or chemical structure entries (in millions) in seven major public repositories is reported.

Public Database    Organization        CPDs/Structures (millions)
UniChem            EMBL-EBI            72
CSLS               NCI-NIH             46
ChemSpider         Royal Soc. Chem.    35
ZINC               UCSF                23
PubChem            NCBI-NIH            61
CAS Registry       Amer. Chem. Soc.    105
Reaxys             Elsevier            55

Similar trends are observed for compound activity data. Table 2 lists entries in databases that collect active compounds, drugs, targets, and activity records and are of particular interest for drug discovery. Large volumes of compound activity data are currently available that were, again, unimaginable just a few years ago. For example, release 20 of ChEMBL, the major public repository of bioactive compounds and activity data mainly curated from medicinal chemistry literature and patents, contains nearly 1.5 million compounds with activity against more than 10,000 targets and a total of more than 13 million activity records. In addition, PubChem, the major repository of biological screening data, currently stores more than one million assays of different experimental type and design, including more than 200,000 confirmatory assays (which re-investigate primary screening hits). Taken together, these compound activity data from medicinal chemistry and biological screening alone represent a rich knowledge base for drug discovery efforts. Considering the recent growth rates of ChEMBL and PubChem, there is no end in sight.

Table 2. Compound and activity records available in databases focusing on bioactive compounds (or drugs) are summarized.

Database (organization)                            Entry type                             Count
ChEMBL (release 20), EMBL-EBI                      # Compounds                            1,463,270
                                                   # Activity annotations                 13,520,737
                                                   # Biological targets                   10,774
BindingDB (accessed in Aug. 2015), UCSD            # Compounds                            495,128
                                                   # Binding records                      1,141,421
                                                   # Protein targets                      7030
PubChem (accessed in Aug. 2015), NCBI-NIH          # Compounds                            60,770,909
                                                   # Assays                               1,154,350
                                                   # Confirmatory assays                  206,541
DrugBank (version 4.3), U Alberta                  # Drugs                                7759
                                                   # FDA-approved small molecule drugs    1602
                                                   # Protein targets                      4300
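As a rough illustration of how densely compounds are annotated, the counts in Table 2 can be turned into an average number of records per compound. The following sketch is not part of the original analysis; the dictionary layout and the function name `records_per_compound` are assumptions made for illustration, and only the numbers are taken from Table 2.

```python
# Counts copied from Table 2 (ChEMBL release 20; BindingDB accessed Aug. 2015).
# The data layout and the function name are illustrative assumptions.
TABLE_2 = {
    "ChEMBL": {"compounds": 1_463_270, "records": 13_520_737},
    "BindingDB": {"compounds": 495_128, "records": 1_141_421},
}


def records_per_compound(entry):
    """Average number of activity or binding records per compound."""
    return entry["records"] / entry["compounds"]


if __name__ == "__main__":
    for name, entry in TABLE_2.items():
        print(f"{name}: {records_per_compound(entry):.2f} records per compound")
```

Such averages say nothing about how records are distributed over individual compounds, and the databases may define a record differently, so the ratios are at best a coarse indicator.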

Big Data Criteria

While current volumes of chemical structure and bioactivity data are truly impressive, big data characteristics go beyond mere volume. Views about big data phenomena frequently differ, but there is broad consensus that multiple criteria need to be taken into consideration. For example, in medicinal chemistry, five ‘Vs’ have been put forward as criteria for big data (2): Velocity, Variety, Veracity, and Value, in addition to Volume. These criteria are without doubt meaningful. The speed with which new chemical data are generated correlates with data volumes, and the variety of data is also steadily increasing (which immediately applies to the compound activity data reported in Table 2). The value of data is probably more subjective in nature and difficult to quantify. However, it is easy to rationalize, for example, that medicinal chemistry data might be of different relevance for drug discovery, depending on the targets that are investigated and their potential for therapeutic intervention. While the five ‘Vs’ provide a meaningful initial characterization of big data in medicinal chemistry, these criteria are probably not sufficient to fully account for big data phenomena when considering compound activities, as discussed in the following.

Complexity and Heterogeneity

Growth of compound activity data is also accompanied by increasing data complexity (i.e., increasing numbers of data attributes and modifications) and heterogeneity (i.e., variations across different databases). For example, varying levels of complexity can be detected in the assembly and organization of activity records (Table 2) or in the consideration of different assays and activity measurements to establish target annotations. Complexity and heterogeneity of activity data are illustrated using a simple example. Figure 1 compares the structures of two closely related anti-allergic agents, trimeprazine and promethazine, and Table 3 compares activity records and target annotations that are reported for these drugs in different databases.

Figure 1. Shown are two anti-allergic agents that are structural analogs.

Table 3. Activity records and target annotations for trimeprazine and promethazine contained in different compound databases are reported.

Database        Entry type                     Trimeprazine        Promethazine
DrugBank 4.3    # Target proteins              2                   14
ChEMBL 20       # All targets                  13                  149
ChEMBL 20       # Targets (high-confidence)    0                   22
BindingDB       # Activity records             3 (2 Ki; 1 IC50)    21 (14 Ki; 6 IC50; 1 EC50)

Although trimeprazine and promethazine are structural analogs with similar therapeutic indications, DrugBank reports significantly different numbers of targets for them: two and 14, respectively. Moreover, ChEMBL reports a total of 13 and 149 targets for trimeprazine and promethazine, respectively, whereas BindingDB contains only three and 21 activity records (on the basis of different activity measurements). However, when high-confidence criteria are applied to compound activity data in ChEMBL, as further discussed below, the number of target annotations for promethazine is drastically reduced from 149 to 22 and no target remains for trimeprazine. How can these differences be reconciled? They most likely result from the complexity of activity data taken from different publications (or other sources), which report different assays, experimental conditions, and types of activity measurements. The extent to which these data are accessed, the sources that are utilized, and the selection criteria that are applied largely determine the contents of activity records stored in various databases. In addition, internal data curation steps and different formats of activity records or annotations influence information retrieval from databases. Taken together, such differences result in substantial heterogeneity of compound activity data across different databases, as exemplified in Table 3. In essence, heterogeneity results from data complexity. In addition, there are currently no uniform standards for generating and representing compound activity records. It is worrisome that a ‘quick’ search for targets of promethazine will yield 149 in ChEMBL, but only 14 in DrugBank. At the least, this is likely to cause confusion among uninitiated users, and the situation becomes especially problematic if scientific conclusions are drawn on the basis of such ‘quick’ searches in a given database. Particular care must be taken to avoid treating compound data repositories as ‘black boxes’ that provide unambiguous answers to compound queries. It should be emphasized that examples such as trimeprazine and promethazine are not the exception but rather the rule; such examples are omnipresent and involve more or less all bioactive compounds (and drugs). Thus, heterogeneity of compound activity data across public repositories represents a substantial problem and requires consideration of an additional issue in the big data context: data confidence levels.
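The spread of target counts in Table 3 can be made explicit with a few lines of code. The sketch below simply tabulates the numbers from Table 3 and reports the lowest and highest target count per compound; the dictionary layout and the function name `count_spread` are illustrative assumptions, and BindingDB is left out because it reports activity records rather than targets.

```python
# Target counts copied from Table 3; layout and names are illustrative assumptions.
TARGET_COUNTS = {
    "trimeprazine": {
        "DrugBank 4.3": 2,
        "ChEMBL 20 (all targets)": 13,
        "ChEMBL 20 (high-confidence)": 0,
    },
    "promethazine": {
        "DrugBank 4.3": 14,
        "ChEMBL 20 (all targets)": 149,
        "ChEMBL 20 (high-confidence)": 22,
    },
}


def count_spread(counts):
    """Return the (lowest, highest) target count reported across sources."""
    values = list(counts.values())
    return min(values), max(values)


if __name__ == "__main__":
    for compound, counts in TARGET_COUNTS.items():
        low, high = count_spread(counts)
        print(f"{compound}: {low} to {high} targets, depending on source and criteria")
```

Even for this two-compound example, the answer to "how many targets?" depends on which source and which selection criteria are used.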

Data Confidence

There has been increasing awareness that the types of activity measurements that are considered (e.g. assay-dependent IC50 vs. (theoretically) assay-independent Ki values), as well as their experimental variance and confidence limits, might significantly affect the results of compound activity analysis (19–21). However, confidence assessment of activity records in databases requires the analysis of additional criteria at different levels (22). For example, revisiting Table 3, one should ask how 149 target annotations for promethazine in ChEMBL can be reduced to 22, and 13 target annotations for trimeprazine to zero (!), when only ‘high-confidence’ activity data are considered. What does ‘high-confidence’ mean and which selection criteria are applied? ChEMBL enables the specification of multiple data selection and confidence criteria (which are, however, not immediately obvious). Target classes (e.g. single protein) and organisms (e.g. Homo sapiens) can be selected, and assay types (e.g. direct binding or inhibition assays) and confidence scores can be specified. In addition, different types of activity measurements and their standardized units can be selected, and approximate values and records with inconsistent activity designations can be eliminated. Sequential application of such criteria reduces the number of activity records in a stepwise manner and increases data confidence levels. For example, when activity records are selected for human targets from ChEMBL (release 18) by exclusively focusing on direct assays against single protein targets and eliminating approximate or not well-defined activity measurements and ambiguous activity annotations, the total number of available activity records is reduced from ~1.3 million to ~148,000. Under these conditions, the number of promethazine targets in ChEMBL is drastically reduced from 149 (when considering all available activity records) to 22, and none of the 13 target annotations for trimeprazine remains, as discussed above. Of course, this does not mean that trimeprazine (for which DrugBank reports two targets) has no known target; rather, it means that for this specific combination of high-confidence data selection criteria, no human target qualifies in ChEMBL. Thus, care must also be taken when interpreting high-confidence data. From the discussion above, it is clear that confidence criteria must be carefully considered in the analysis of compound activity data. Differences in activity annotations across databases might not be resolvable at the user level. At the least, however, clear definition and consistent application of data confidence criteria ensure reproducibility of data analysis and limit discrepancies between activity annotations in many instances.
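The stepwise selection described above can be sketched as a chain of filters that logs how many records survive each criterion. This is a minimal illustration under stated assumptions, not the procedure used to obtain the numbers in this chapter: the `ActivityRecord` fields, the criteria list, and the toy records for the made-up compounds `cpd_A` and `cpd_B` are all hypothetical and do not reproduce any database schema or real data.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


@dataclass(frozen=True)
class ActivityRecord:
    """Hypothetical, simplified activity record (not an actual database schema)."""
    compound: str
    target: str
    target_type: str   # e.g. "single protein", "protein complex"
    organism: str      # e.g. "Homo sapiens"
    assay_type: str    # e.g. "binding", "functional"
    relation: str      # "=" for exact values; ">", "<", "~" for approximate ones
    ambiguous: bool    # True if the activity annotation is inconsistent


Criterion = Tuple[str, Callable[[ActivityRecord], bool]]

# One possible set of high-confidence criteria, applied in this order.
CRITERIA: List[Criterion] = [
    ("single protein targets", lambda r: r.target_type == "single protein"),
    ("human targets", lambda r: r.organism == "Homo sapiens"),
    ("direct binding assays", lambda r: r.assay_type == "binding"),
    ("exact measurements", lambda r: r.relation == "="),
    ("unambiguous annotations", lambda r: not r.ambiguous),
]


def apply_sequentially(records, criteria):
    """Apply each criterion in turn and log how many records remain."""
    remaining = list(records)
    trace = [("all records", len(remaining))]
    for label, keep in criteria:
        remaining = [r for r in remaining if keep(r)]
        trace.append((label, len(remaining)))
    return remaining, trace


def targets_of(records, compound) -> Set[str]:
    """Set of targets annotated for a compound in the given records."""
    return {r.target for r in records if r.compound == compound}


# Toy records for two made-up compounds; none of these are real data.
TOY_RECORDS = [
    ActivityRecord("cpd_A", "T1", "single protein", "Homo sapiens", "binding", "=", False),
    ActivityRecord("cpd_A", "T2", "single protein", "Homo sapiens", "binding", ">", False),
    ActivityRecord("cpd_A", "T3", "protein complex", "Homo sapiens", "binding", "=", False),
    ActivityRecord("cpd_B", "T4", "single protein", "Rattus norvegicus", "binding", "=", False),
    ActivityRecord("cpd_B", "T5", "single protein", "Homo sapiens", "functional", "=", False),
    ActivityRecord("cpd_B", "T6", "single protein", "Homo sapiens", "binding", "=", True),
]

if __name__ == "__main__":
    kept, trace = apply_sequentially(TOY_RECORDS, CRITERIA)
    for label, count in trace:
        print(f"{label}: {count}")
    print(sorted(targets_of(kept, "cpd_A")), sorted(targets_of(kept, "cpd_B")))
```

In this toy set, `cpd_B` has three annotated targets before filtering and none afterwards, which mirrors the trimeprazine case: an empty result reflects the chosen criteria, not the absence of known targets. Keeping the criteria in one explicit, ordered list is one way to make such a selection reproducible.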

Perspective for Chemical Information Science

Chemical information science is entering the big data era. Exponentially increasing volumes of chemical data and their increasing variety challenge data organization, analysis, and knowledge extraction. Computational frameworks for data analysis have become indispensable components of chemical research. Big data phenomena are particularly evident at interfaces between chemistry and other scientific fields such as pharmaceutical research. As discussed herein, compound structure and activity data are currently growing at astonishing rates. This situation has also catalyzed the development of public repositories of chemical structures and activity data. It has been argued herein that special attention must be paid to data complexity, heterogeneity, and confidence criteria in the analysis of compound activity data, which are often not sufficiently considered when drawing conclusions from data analysis. It is evident that data mining will provide substantial opportunities for pharmaceutical R&D going forward, given the increasingly large knowledge base accumulating in the public domain. This will provide many growth opportunities for chemical information science. It is also clear, however, that meaningful progress in knowledge extraction can only be made if high data confidence is consistently ensured. No doubt, for chemical information science, these are equally challenging and exciting times.

Acknowledgments

The author is grateful to members of former research groups in the USA and the current Life Science Informatics Department at the University of Bonn for their many contributions to chemical information science, computational medicinal chemistry, and interdisciplinary research. Special thanks to Ye Hu for her review of this chapter.

References

1. Hu, Y.; Bajorath, J. Learning from ‘Big Data’: Compounds and Targets. Drug Discovery Today 2014, 19, 357–360.
2. Lusher, S. J.; McGuire, R.; van Schaik, R. C.; Nicholson, C. D.; de Vlieg, J. Data-Driven Medicinal Chemistry in the Era of Big Data. Drug Discovery Today 2014, 19, 859–868.
3. Bolton, E.; Wang, Y.; Thiessen, P. A.; Bryant, S. H. PubChem: Integrated Platform of Small Molecules and Biological Activities. Annu. Rep. Comput. Chem. 2008, 4, 217–241.
4. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Zhou, Z.; Han, L.; Karapetyan, K.; Dracheva, S.; Shoemaker, B. A.; Bolton, E.; Gindulyte, A.; Bryant, S. H. PubChem’s BioAssay Database. Nucleic Acids Res. 2012, 40, D400–D412.
5. Chambers, J.; Davies, M.; Gaulton, A.; Hersey, A.; Velankar, S.; Petryszak, R.; Hastings, J.; Bellis, L.; McGlinchey, S.; Overington, J. P. UniChem: A Unified Chemical Structure Cross-Referencing and Identifier Tracking System. J. Cheminf. 2013, 5, 3.
6. Pence, H. E.; Williams, A. ChemSpider: An Online Chemical Information Resource. J. Chem. Educ. 2010, 87, 1123–1124.
7. Chemical Structure Lookup Service. http://cactus.nci.nih.gov/lookup/ (accessed Aug. 23, 2015).
8. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: A Free Tool to Discover Chemistry for Biology. J. Chem. Inf. Model. 2012, 52, 1757–1768.
9. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: A Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100–D1107.
10. Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42, D1083–D1090.
11. Chen, X.; Lin, Y.; Liu, M.; Gilson, M. K. The Binding Database: Data Management and Interface Design. Bioinformatics 2002, 18, 130–139.
12. Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; Gilson, M. K. BindingDB: A Web-Accessible Database of Experimentally Determined Protein-Ligand Binding Affinities. Nucleic Acids Res. 2007, 35, D198–D201.
13. Williams, A. J.; Harland, L.; Groth, P.; Pettifer, S.; Chichester, C.; Willighagen, E. L.; Evelo, C. T.; Blomberg, N.; Ecker, G.; Goble, C.; Mons, B. Open PHACTS: Semantic Interoperability for Drug Discovery. Drug Discovery Today 2012, 17, 1188–1198.
14. Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; Tang, A.; Gabriel, G.; Ly, C.; Adamjee, S.; Dame, Z. T.; Han, B.; Zhou, Y.; Wishart, D. S. DrugBank 4.0: Shedding New Light on Drug Metabolism. Nucleic Acids Res. 2014, 42, D1091–D1097.
15. Brown, F. K. Chemoinformatics: What is it and How Does it Impact Drug Discovery? Annu. Rep. Med. Chem. 1998, 33, 375–384.
16. Bajorath, J. Understanding Chemoinformatics: A Unifying Approach. Drug Discovery Today 2004, 9, 13–14.
17. Gasteiger, J. Chemoinformatics: A New Field with a Long Tradition. Anal. Bioanal. Chem. 2006, 384, 57–64.
18. Willett, P. From Chemical Documentation to Chemoinformatics: 50 Years of Chemical Information Science. J. Inf. Sci. 2008, 34, 477–499.
19. Stumpfe, D.; Bajorath, J. Assessing the Confidence Level of Public Domain Compound Activity Data and the Impact of Alternative Potency Measurements on SAR Analysis. J. Chem. Inf. Model. 2011, 51, 3131–3137.
20. Hu, Y.; Bajorath, J. Growth of Ligand-Target Interaction Data in ChEMBL is Associated with Increasing and Measurement-Dependent Compound Promiscuity. J. Chem. Inf. Model. 2012, 52, 2550–2558.
21. Kramer, C.; Kalliokoski, T.; Gedeck, P.; Vulpetti, A. The Experimental Uncertainty of Heterogeneous Public Ki Data. J. Med. Chem. 2012, 55, 5165–5173.
22. Hu, Y.; Bajorath, J. Influence of Search Parameters and Criteria on Compound Selection, Promiscuity, and Pan Assay Interference Characteristics. J. Chem. Inf. Model. 2014, 54, 3056–3066.