UNICON: A Powerful and Easy-to-Use Compound Library Converter

May 26, 2016 - Software availability: UNICON is available for Linux, OS X, and Windows from http://www.zbh.uni-hamburg.de/unicon for free for academic...
Application Note pubs.acs.org/jcim

Kai Sommer, Nils-Ole Friedrich, Stefan Bietz, Matthias Hilbig, Therese Inhester, and Matthias Rarey*
Center for Bioinformatics, Research Group for Computational Molecular Design, University of Hamburg, Bundesstraße 43, 20146 Hamburg, Germany

ABSTRACT: The accurate handling of different chemical file formats and the consistent conversion between them play important roles for calculations in complex cheminformatics workflows. Working with different cheminformatic tools often makes the conversion between file formats a mandatory step. Such a conversion might become a difficult task in cases where the information content substantially differs. This paper describes UNICON, an easy-to-use software tool for this task. The functionality of UNICON ranges from file conversion between standard formats SDF, MOL2, SMILES, PDB, and PDBx/mmCIF via the generation of 2D structure coordinates and 3D structures to the enumeration of tautomeric forms, protonation states, and conformer ensembles. For this purpose, UNICON bundles the key elements of the previously described NAOMI library in a single, easy-to-use command line tool.



INTRODUCTION Dealing with large small-molecule compound collections is at the heart of most cheminformatics tasks. Most applications are able to process large amounts of data, but not all standard file formats are usually accepted. Due to the different data content of chemical file formats, consistent interconversion is far from trivial.1 Extracting small molecules from a chemically sparsely annotated PDB file is extremely challenging, and many approaches for this problem have been developed in the past.2−8 Also, converting from SMILES to SDF can be extremely tricky in the case of SDF-processing applications that require proper 2D structure diagram coordinates or low-energy 3D structures.

Two-dimensional (2D) coordinates become important as soon as molecules are to be presented as structure diagrams. The corresponding problem, named structure diagram generation, has been addressed by cheminformaticians for more than 30 years.9 Although the problem can be considered solved for large portions of the drug-like chemical space,10 the drawing of highly congested chemical structures, complex ring systems, and macrocycles still leaves room for improvement.

Many modeling tools require 3D coordinates, most notably structure- and ligand-based virtual screening but also 3D-QSAR model builders. Early approaches for 3D structure generation are CONCORD,11 CORINA,12 and OMEGA,13 of which the latter two are probably the most widely used tools in the field today. Nowadays, many modeling and cheminformatics software packages contain 3D structure generation as an internal functionality. As with 2D coordinates, the generation of 3D structures is solved for many drug-like molecules but remains challenging for complex, bridged ring systems and macrocycles.

Since most chemical applications consider molecules as structurally invariant entities, inherent molecular variability must be considered in advance.
Most importantly, different tautomeric forms and protonation states must be enumerated. In those cases where 3D coordinates are required, conformational ensembles have to be created. Using only one possible valence bond structure may lead to false-negative results, whereas numerous possible energetically unfavorable structures result in an increased false-positive rate.14 Well-known tools for dealing with tautomers and/or protonation states are Ambit-Tautomer,15 MOE,16 QUACPAC from OpenEye,17 and others.18,19

The generation of low-energy conformer ensembles remains a challenge. Several well-performing tools like OMEGA,20 CONFECT,21 Frog,22 CAESAR,23 RDKit,24 or OpenBabel25 exist. The most difficult element is to find a good trade-off between ensemble size and accuracy. The number of conformations should be kept low since the computing time of subsequent calculations usually depends linearly on it. Therefore, sensitive scoring functions are needed to pick a collection of conformers that might occur in a biological context.

All these tasks can be addressed by most modeling and cheminformatics platforms. Most notably, graphical pipelining systems like Pipeline Pilot26 and KNIME27 allow us to quickly construct a required conversion workflow from small functional building blocks. Powerful toolkits like OpenBabel,25 CACTVS,28 RDKit,24 or OEChem13 can be used to custom-program conversion tools. All three approaches show a lot of flexibility; however, they require the installation of large software packages and startup time for the background computing engines or substantial programming skills. In a shell-driven computing environment, programs that can be used on the command-line level have substantial advantages. They have low startup time and nearly no installation hurdle, are easy to use, and can be combined with arbitrary scripting languages. In the following, we present UNICON, which encapsulates the most important conversion functionality, including conformer generation, from the NAOMI software library in a single command-line tool. The chemical models as well as the algorithms behind the individual NAOMI functions have been published elsewhere21,29−32 and are only briefly summarized below.

Received: February 19, 2016
DOI: 10.1021/acs.jcim.6b00069 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX
Journal of Chemical Information and Modeling

Figure 1. Some examples for the main features of UNICON. All molecules are taken from DrugBank.33,34 The first column shows 2D and 3D coordinate generation using default settings and considering hydrogen clashes. The second column shows examples for (top) tautomer enumeration and (bottom) protonation state enumeration. The column on the right shows two examples for conformation generation using default settings and considering hydrogen clashes during sampling (2D depiction with overlay of conformers). Example one has 16 conformations, whereas example two has 135 conformations.

In a second step, the overall layout is optimized by local changes to chains and ring systems. The optimization procedure searches for the layout with the least number of collisions, longest stretched chains, and the most uncluttered layout. In the postprocessing step, any remaining collisions are fixed by local changes to the angles and lengths of chain bonds. Finally, hydrogen 2D coordinates are calculated, and the diagram is rotated horizontally relative to either the largest ring system or the longest chain.



METHODS: UNICON TECHNOLOGY UNICON is a command line tool for Linux, OS X, and Windows systems. The fundamental concept behind UNICON is a consistent chemical model for the appropriate representation of molecules and an intuitive user interface. The internal molecule representation is the key feature for correctness, consistency, and performance in complex workflows. For example, a conversion between different file formats of about 4000 compounds, including error detection and correction, can be achieved in less than 1 s (0.3 ms/cmpd average) for each format. Detailed information about correctness and consistency can be found in the corresponding publications introducing NAOMI.29,31



TAUTOMER AND PROTONATION STATE GENERATION The generation of valid tautomers and protonation states is based on the valence state combination (VSC) model.31 Briefly described, the molecule is first partitioned into so-called tautomeric zones. Afterward, valid VSCs for each zone are generated and scored, and the best particular solutions are retained. The combination of all valid VSCs results in a list of alternative tautomeric forms and protonation states.
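The combination step can be illustrated with a small sketch: per-zone alternatives are filtered by score, and the retained ones are combined with a Cartesian product. The zone names, state labels, scores, and the `keep_within` threshold below are invented for illustration and are not taken from the VSC model or from NAOMI; the actual scoring scheme is described in ref 31.

```python
from itertools import product

# Hypothetical tautomeric zones, each with candidate valence state
# combinations as (label, score) pairs. Labels and scores are made up.
zones = {
    "zone_A": [("keto", 10.0), ("enol", 7.5), ("zwitterion", 2.0)],
    "zone_B": [("amide", 9.0), ("imidic_acid", 8.5)],
}

def retain_best(candidates, keep_within=3.0):
    """Keep candidates whose score is within `keep_within` of the zone's best."""
    best = max(score for _, score in candidates)
    return [(label, score) for label, score in candidates
            if best - score <= keep_within]

def combine_zones(zones, keep_within=3.0):
    """Combine the retained states of all zones into whole-molecule forms."""
    per_zone = [retain_best(c, keep_within) for c in zones.values()]
    forms = []
    for combo in product(*per_zone):
        labels = tuple(label for label, _ in combo)
        total = sum(score for _, score in combo)
        forms.append((labels, total))
    # Highest total score first; ties are broken by label for determinism.
    return sorted(forms, key=lambda f: (-f[1], f[0]))

forms = combine_zones(zones)
for labels, total in forms:
    print(labels, total)
```

Because each zone is filtered before the product is formed, the number of enumerated forms grows with the retained states per zone rather than with all candidates.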





CONFORMER ENSEMBLE GENERATION Methods, such as docking, 3D database searching, pharmacophore-based screening, and creation of 3D-QSAR models frequently require conformational ensembles to treat the flexibility of small molecules. UNICON can sample the accessible conformational space for a given molecule with a variant of the CONFECT algorithm21 for the generation of small highly accurate ensembles. Conformations are built by assigning low-energy torsion angles to rotatable bonds taken from a torsion library derived from crystallographic data.39 Precomputed, force field-optimized templates are used for sampling ring conformations. After applying an enumeration scheme, the computed conformers are clustered to form the ensemble. In our conformation generator, the concept of CONFECT is extended by an advanced set of rules40 and a new RMSD-based clustering algorithm further increasing the diversity of the resulting ensemble. Our method features three modes for conformer ensemble generation: Quality level 1 (“fast”) is optimized for speed, whereas quality level 3 (“best”) offers maximum accuracy. Quality level 2 (“standard”) has been derived as the most suitable setup for use with small molecules. A more detailed description of the method can be found in the original publication related to CONFECT.21
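To make the clustering idea concrete, the following sketch shows one simple way an RMSD threshold can thin out a list of conformers. It is a generic greedy scheme under stated assumptions (conformers are already superimposed, the threshold of 0.5 Å is arbitrary, and the coordinates are toy data); it is not the clustering algorithm implemented in UNICON.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equally sized coordinate lists.
    Assumes both conformers are already superimposed (no alignment is done)."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))

def greedy_cluster(conformers, threshold, max_size):
    """Keep a conformer only if it is at least `threshold` away from every
    representative kept so far; stop once `max_size` representatives exist."""
    kept = []
    for conf in conformers:
        if len(kept) >= max_size:
            break
        if all(rmsd(conf, rep) >= threshold for rep in kept):
            kept.append(conf)
    return kept

# Toy three-atom "conformers"; the coordinates are invented for illustration.
conformers = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)],
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.5, 0.0)],   # near-duplicate of the first
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, -1.4, 0.0)],  # third atom mirrored
]
ensemble = greedy_cluster(conformers, threshold=0.5, max_size=250)
print(len(ensemble))
```

A larger threshold yields smaller, more diverse ensembles; a smaller one keeps more near-duplicates, which is the size versus accuracy trade-off discussed in the text.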

CONVERSION UNICON supports the common cheminformatic file formats SDF, MOL, MOL2, and SMILES, as well as the import of small molecules from PDB and PDBx/mmCIF files. Great care was taken to make all conversions between different file formats consistent.29 To ensure the correct representation of drug-like molecules, errors and ambiguities in the input data are identified and resolved consistently or the corresponding molecule is discarded. For further details on how the different operations were implemented see Urbaczek et al.29



3D COORDINATE GENERATION Several cheminformatics workflows require low-energy conformations of the molecules to be processed. To accomplish the coordinate generation, UNICON represents the molecule as a tree structure where every noncyclic bond is represented as an edge. Each node of the tree represents an atom or a ring system. Starting from arbitrary coordinates for the root atom, the remaining atom coordinates are assigned in a recursive procedure. Each recursion consists of two steps. First, the relative spatial orientation of all child atoms is set according to VSEPR geometries.35 Second, torsion angles are applied to every rotatable bond of the current atom on the basis of statistically relevant torsion angles. For nodes representing rings and ring systems, precomputed energetically relevant ring conformations are applied for every ring with a maximum of eight heavy atoms. In case of clashing atoms, a subsequent cleaning procedure further improves spatial atomic arrangements. The described procedure is based on the initial 3D coordinate assignment used in CONFECT.21
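The recursive assignment can be illustrated with a standard internal-to-Cartesian conversion. The sketch below handles only acyclic atoms with one child per node, a single bond length, and a single bond angle; ring templates, VSEPR geometries for several neighbors, the torsion library, and clash cleanup are all omitted. Every name and numeric value (1.53 Å, 109.5°, the torsions) is an illustrative assumption, not a UNICON internal.

```python
import math

def _sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def _unit(u):
    n = math.sqrt(sum(x * x for x in u))
    return tuple(x / n for x in u)

def place_atom(a, b, c, bond, angle_deg, torsion_deg):
    """Position of a new atom bonded to c, given the bond length, the angle
    b-c-new, and the torsion a-b-c-new."""
    angle, torsion = math.radians(angle_deg), math.radians(torsion_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))   # normal of the plane a-b-c
    m = _cross(n, bc)                   # in-plane axis pointing toward a
    local = (-bond * math.cos(angle),
             bond * math.sin(angle) * math.cos(torsion),
             bond * math.sin(angle) * math.sin(torsion))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))

def build(tree, node, coords, grand, parent, bond=1.53, angle=109.5):
    """Recursively place the children of `node`. `parent` is the parent of
    `node` and `grand` its grandparent; each child carries its own torsion."""
    for child, torsion in tree.get(node, []):
        coords[child] = place_atom(coords[grand], coords[parent], coords[node],
                                   bond, angle, torsion)
        build(tree, child, coords, parent, node, bond, angle)

# Seed the first three atoms by hand, then grow the rest of the chain.
theta = math.radians(109.5)
coords = {
    "C1": (0.0, 0.0, 0.0),
    "C2": (1.53, 0.0, 0.0),
    "C3": (1.53 - 1.53 * math.cos(theta), 1.53 * math.sin(theta), 0.0),
}
tree = {"C3": [("C4", 180.0)], "C4": [("C5", 60.0)]}  # child and chosen torsion
build(tree, "C3", coords, "C1", "C2")
print(coords["C4"], coords["C5"])
```

With a torsion of 180° the fourth atom stays in the plane of the first three (an extended zigzag), whereas 60° moves the fifth atom out of that plane.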



RESULTS: USAGE OF UNICON In the following, a series of application scenarios for UNICON, organized in four areas, is described. It should be noted that UNICON is not bound to these or any specific list of workflows. All elementary steps described can be freely combined in arbitrary order. A list of available parameters can be found in Table S1 or by using the help parameter of the tool. Some results of the main features are shown in Figure 1.





2D COORDINATE GENERATION To create 2D coordinates for molecular structure depictions36 of molecules, UNICON employs the structure diagram generator (SDG) pipeline implemented in MONA.32,37 In the first step, 2D coordinates for the ring systems are calculated. Using the concept of unique ring families,38 the ring systems are split into smaller blocks. Coordinates for each block are calculated either directly or searched in a template database. For complicated blocks, this may fail, and the coordinates are generated as a last resort by a 2D force field.
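The order of attempts for a ring block (direct calculation, template lookup, force field as last resort) can be sketched as a simple fallback chain. The regular-polygon layout used for the direct case, the block dictionary, and the empty template table are assumptions made for this example; the text does not specify how MONA computes block coordinates directly, and the force field is only a placeholder here.

```python
import math

def regular_ring(n, bond=1.0):
    """Direct layout: place n ring atoms on a regular polygon whose edges
    have length `bond`."""
    radius = bond / (2.0 * math.sin(math.pi / n))
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

# Hypothetical template database: block key -> precomputed 2D coordinates.
TEMPLATES = {}

def force_field_layout(block):
    """Placeholder for the last-resort 2D force field (not part of this sketch)."""
    raise NotImplementedError("2D force field is outside the scope of this sketch")

def layout_block(block):
    """Try direct calculation, then a template lookup, then the force field."""
    if block["kind"] == "simple_ring":
        return regular_ring(block["size"])
    if block["key"] in TEMPLATES:
        return TEMPLATES[block["key"]]
    return force_field_layout(block)

hexagon = layout_block({"kind": "simple_ring", "size": 6, "key": "ring6"})
print(hexagon)
```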

USING UNICON AS A FILE FORMAT CONVERTER In the following, we use the DrugBank41 “all” data set, version 4.3,34 with 7003 entries to demonstrate elementary cheminformatics tasks. Erroneous entries are automatically corrected if possible or otherwise discarded (e.g., valence errors). Since UNICON is based on the NAOMI library, it is currently limited to organic molecules that can be represented by valence bond structures. Hence, organometallic compounds cannot be converted. Thus, converting the DrugBank leads to 6883 valid entries in about 3 s between all supported formats. Round-robin validation experiments converting from format A to B and back consistently result in the same chemical structure representation.

Task 1: Converting Molecules. Converting cheminformatic files with UNICON is a straightforward task. The file format is automatically determined by file extension. It is possible to provide several input files in different supported formats separated by spaces. For the sake of convenience and workflow performance, UNICON supports reading input from stdin as well as writing output to stdout. This facilitates the application of UNICON in complex workflows without having to write the output to temporary files. In order to read from stdin, the input parameter has to be replaced by the inputFormat flag. Writing results to stdout works analogously. All warnings and error messages that occur during conversion can simply be silenced with the verbosity flag or, if you are interested in the output, redirected to a file.

Task 2: Converting a File with Multiple Components into Multiple Entries. Sometimes files comprise multiple components in one compound entry, e.g., additional salts, which can be problematic in complex workflows. Multiple components are automatically identified and extracted by the NAOMI library. UNICON provides an option to convert only the largest component, which is the molecule of interest in most cases, or all components. Each component is then written to the output as a separate entry. In some cases, the output file may become very large. In order to cope with this issue, UNICON is able to split the output file into several smaller files. In the case of tautomer or conformer ensemble generation, the results for the given number of input entries are written to a single file. For very large input files, a compound range can be specified. This extraction process is very fast because the NAOMI library creates an index of all entries before actually processing them.

Task 3: Extracting Small Molecules from Protein Data Bank (PDB) Files.42,43 Reading PDB files is an error-prone task, as the molecules have to be recreated purely from 3D coordinates. UNICON uses a robust method to perceive small molecules from PDB files.30 By converting a PDB file, UNICON automatically extracts all small molecule instances that can then be converted in any supported output format. UNICON supports both the PDB and the PDBx/mmCIF format with the exception of small chemical component files as described by Westbrook et al.44 The conversion of the LigandExpo45,46 PDB data (1,123,204 entries) results in 933,136 converted entries. There are two kinds of common errors: First, file errors occur in cases of an invalid PDB format and entries that do not contain atom coordinates (∼148,000 errors). Second, initialization errors occur in cases of erroneous coordinates (atoms are too close or disconnected) and errors due to invalid valence state assignment of the molecules, which include unsupported organometallic compounds (∼32,000 errors).

USING UNICON AS A TAUTOMER OR PROTONATION STATE GENERATOR Tautomers and protonation states have considerable influence on physicochemical properties of the molecule. Dealing with tautomerism results in a variety of tasks, of which canonicalization and selection are the most relevant.47 Some data sets might contain a molecule in different tautomeric forms, which hinders duplicate removal and molecule identification. Some other data sets might contain only a single tautomeric form for each molecule, which is inappropriate for complex searching tasks like virtual screening.6

Task 4: Generating Tautomers and/or Protonation States. UNICON is able to create a set of reasonable tautomers and/or protonation states. In the process, the user can choose between two different settings. With default settings (topscoring), only the tautomeric states with the highest scores are included in the results. Using the ensemble argument results in a selection of molecular states, which are assumed to have reasonable probability of being stabilized under physiological conditions, e.g., by protein binding. Basically, this removes unusual tautomers that would unnecessarily increase runtime in subsequent workflows.

Task 5: Generating a Normalized Tautomeric Form. A default tautomeric form of a molecule can be useful to maintain consistency during consecutive workflows or library design. UNICON generates a tautomer that corresponds to the best tautomeric representation of the molecule. If multiple equivalent tautomeric forms exist, a canonization scheme is used to pick a unique solution. The same behavior applies for the protonation state generation procedure. Using the protonation option with the single flag generates a default protonation. The settings best and reasonable can be used accordingly.

USING UNICON TO GENERATE 2D AND 3D COORDINATES Task 6: Generating 2D/3D Coordinates. In order to be able to handle all common file formats, the generation of 3D coordinates is a prerequisite, especially for SMILES input. In the case of standardized evaluation procedures, it may also be very useful to use unbiased input coordinates. UNICON is able to create an initial coordinate set of a low-energy conformer based on statistically relevant torsion angles.39,40 Conformer generation basically follows an incremental build-up strategy described in CONFECT.21 The method was extended with respect to clash resolution: While the original approach considered heavy atoms only, UNICON may also consider hydrogen atoms if required. Furthermore, 2D coordinates for structure representation can be produced employing MONA’s structure drawing engine.

USING UNICON AS A CONFORMER ENSEMBLE GENERATOR Conformer ensembles are widely used in cheminformatic processes such as virtual screening techniques like docking and SAR analysis. Many tools, either commercial or freely available, are able to generate conformations of small molecules. In UNICON, we integrated an enhanced version of the CONFECT conformer generator designed to create small ensembles of bioactive compounds.21 Task 7: Generating a Set of Conformers. UNICON permits the user to switch between different parameter settings for conformer ensemble generation. The default settings generate at most 250 conformers with quality level 2. To simplify the usage of the tool, the interface only provides two different parameters. The user is able to influence the quality of the resulting ensemble with respect to accuracy and the maximum number of conformers. A higher quality level results in larger ensembles and longer computing times. For example, quality level 1 with at most 50 conformers can be used to prepare a large library for high-throughput processing.
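Task 2 notes that a compound range can be specified for very large input files. One way such ranges could be derived when a large library is to be split across several independent runs is sketched below. The helper is hypothetical, and the assumption that ranges are 1-based and inclusive is ours; the text does not state UNICON's indexing convention or exact command-line syntax, so no invocation is shown.

```python
def split_ranges(total, workers):
    """Split entries 1..total into at most `workers` contiguous (first, last)
    ranges, 1-based and inclusive, whose sizes differ by at most one."""
    base, extra = divmod(total, workers)
    ranges, start = [], 1
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more workers than entries: skip empty ranges
        ranges.append((start, start + size - 1))
        start += size
    return ranges

# Example: a library of 4,591,276 entries split into four ranges.
for first, last in split_ranges(4_591_276, 4):
    print(first, last)
```

Each range can then be handed to a separate process, and the per-range outputs concatenated afterwards.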



PERFORMANCE OF UNICON OPERATIONS All operations in UNICON process molecules sequentially such that data sets of arbitrary size can be handled. The conversion of 4,591,276 molecules of the ZINC48 clean lead data set from SMILES to SDF format, without coordinate generation, runs single-threaded in about 18 min (0.24 ms/cmpd) on an Intel Core i7-4790 with 3.6 GHz. Parallel execution can be easily achieved on the level of the operating system using the --from/--to options. The different conversion times can be found in Figure S1. Three different common data sets are used to show computing times of the aforementioned example processes available in UNICON. Results can be found in Table 1. The DrugBank41 and LigandExpo45,46 sets are used as input in SDF format. The data set from ZINC48 was converted from SMILES to SDF format beforehand. The conversion process was performed on the complete data sets. All other processes were simply run on 1000 randomly selected entries of each set. Calculations were performed on an Intel Core i7-4790 with 3.6 GHz. All computing times shown are averaged over five independent runs.

The quality of the NAOMI library functionality used in UNICON such as file conversion, tautomer and protonation state generation, and 2D and 3D coordinate generation has been discussed in previous publications.21,29,31,32 A few example outputs of UNICON are shown in Figure 1. Due to recent enhancements in the conformation ensemble generation process, additional results of the conformation generation process can be found in Table S2. Using standard parameters, our conformation generator produces conformer ensembles with a ratio of ensemble size and runtime to accuracy that is suitable for high-throughput screening. On the IRIDIUM-HT data set,49 an average RMSD of 0.59 Å with an average ensemble size of 91 conformers is achieved. For 84% of the test cases, the best conformation exhibits a deviation to the bioactive conformation below 1.0 Å. The higher quality level produces more conformations with up to 0.5 Å deviation from the bioactive one, while the performance at 1.0 Å does not differ significantly.

Table 1. Computing Times for Common UNICON Operations on Three Different Sets [a]

Task                          DrugBank_all (ref 41)   Ligand Expo (refs 45, 46)   ZINC (ref 48)
molecules in set              7003                    1,033,516                   4,591,276
valid molecules               6883                    957,948                     4,591,270
convert to SMILES             2.7 s                   1:34 min                    25:06 min
1000 compounds random subselection
generate tautomers            0.9 s                   66.6 s                      0.9 s
normalize tautomers           0.9 s                   24.1 s                      0.9 s
generate 3D coordinates       6:14 min                50.3 s                      1:47 min
generate 2D coordinates       17.5 s                  2.6 s                       3.5 s
generate conformations        1:05 h                  50:38 min                   32:38 min
generate conformations [b]    33:29 min               5:25 min                    4:01 min

[a] The conversion is performed on complete data sets. All other processes are performed only on subsets of each molecule set (1000 randomly selected entries). [b] Using quality level 1 and a maximum number of conformers of 50.

CONCLUSIONS UNICON is a software application addressing the most elementary tasks related to compound collections in flat files. The simple interface facilitates the usage for researchers as well as nonexperts. The described functionalities solve a variety of different tasks from file conversion to conformer ensemble generation. The stable underlying cheminformatics library and a straightforward process flow guarantee the consistency and performance of all operations. UNICON allows the combination of file format conversion and enumeration steps in a single run preventing piping of molecules through different tools, which minimizes the probability of processing errors. Future enhancements of the software will be guided by the performance principle and incorporate features supporting more of the daily tasks of life scientists, for example, duplicate removal or aligned coordinate generation. Also, support for chemical component dictionary (CCD)44 files could be added. Software availability: UNICON is available for Linux, OS X, and Windows from http://www.zbh.uni-hamburg.de/unicon for free for academic use and evaluation purposes. All feedback is greatly appreciated and supports the further development of UNICON.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00069. It contains a table that describes all available parameters in UNICON, a table with benchmark information for the conformation generation process, a figure showing the computing times for different conversion steps, and a list of example commands for previously described tasks (PDF).

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Author Contributions

All authors have given approval to the final version of the manuscript.

Notes

The authors declare the following competing financial interest(s): M.R. declares a potential financial interest in the event that the UNICON software is licensed for a fee to nonacademic institutions in the future.

ACKNOWLEDGMENTS The authors thank all members of the AMD group for their crucial contributions to the NAOMI library. Furthermore, we express our thanks to the developers of BioSolveIT for their valuable contributions to the NAOMI library, especially related to conformer generation.

ABBREVIATIONS CCD, chemical component dictionary; SAR, structure−activity relationship; SDG, structure diagram generator; PDB, Protein Data Bank; VSC, valence state combination

REFERENCES

(1) Karapetyan, K.; Batchelor, C.; Sharpe, D.; Tkachenko, V.; Williams, A. J. The Chemical Validation and Standardization Platform (CVSP): Large-Scale Automated Validation of Chemical Structure Datasets. J. Cheminf. 2015, 7 (1), 30.
(2) Meng, E. C.; Lewis, R. A. Determination of Molecular Topology and Atomic Hybridization States from Heavy Atom Coordinates. J. Comput. Chem. 1991, 12 (7), 891−898.

(3) Baber, J. C.; Hodgkin, E. E. Automatic Assignment of Chemical Connectivity to Organic Molecules in the Cambridge Structural Database. J. Chem. Inf. Model. 1992, 32 (5), 401−406.
(4) Hendlich, M.; Rippmann, F.; Barnickel, G. BALI: Automatic Assignment of Bond and Atom Types for Protein Ligands in the Brookhaven Protein Databank. J. Chem. Inf. Model. 1997, 37 (4), 774−778.
(5) Labute, P. On the Perception of Molecules from 3D Atomic Coordinates. J. Chem. Inf. Model. 2005, 45 (2), 215−221.
(6) Froeyen, M.; Herdewijn, P. Correct Bond Order Assignment in a Molecular Framework Using Integer Linear Programming with Application to Molecules Where Only Non-Hydrogen Atom Coordinates Are Available. J. Chem. Inf. Model. 2005, 45 (5), 1267−1274.
(7) Zhao, Y.; Cheng, T.; Wang, R. Automatic Perception of Organic Molecules Based on Essential Structural Information. J. Chem. Inf. Model. 2007, 47 (4), 1379−1385.
(8) Sayle, R. PDB: Cruft to Content (Perception of Molecular Connectivity from 3D Coordinates), 2001. http://www.daylight.com/meetings/mug01/Sayle/m4xbondage.html (accessed May 2016).
(9) Helson, H. E. Structure Diagram Generation. In Reviews in Computational Chemistry; John Wiley & Sons, Inc., 1999; pp 313−398.
(10) Clark, A. M. 2D Depiction of Fragment Hierarchies. J. Chem. Inf. Model. 2010, 50 (1), 37−46.
(11) Pearlman, R. S. Concord; Tripos International: St. Louis, MO.
(12) Gasteiger, J.; Rudolph, C.; Sadowski, J. Automatic Generation of 3D-Atomic Coordinates for Organic Molecules. Tetrahedron Comput. Methodol. 1990, 3 (6), 537−547.
(13) OEChem; OpenEye Scientific Software Inc.: Santa Fe, NM.
(14) Martin, Y. C. Let’s Not Forget Tautomers. J. Comput.-Aided Mol. Des. 2009, 23 (10), 693−704.
(15) Kochev, N. T.; Paskaleva, V. H.; Jeliazkova, N. Ambit-Tautomer: An Open Source Tool for Tautomer Generation. Mol. Inf. 2013, 32 (5−6), 481−504.
(16) SD File Processing with MOE Pipeline Tools. http://www.chemcomp.com/journal/sdtools.htm (accessed July 15, 2015).
(17) QUACPAC, version 1.6.3.1; OpenEye Scientific Software: Santa Fe, NM.
(18) Source code of the tautomer generation method by Sayle and Delany. http://www.daylight.com/meetings/emug99/Delany/tautomers/ (accessed July 15, 2015).
(19) Physico-chemical property predictors. http://www.chemaxon.com/products/calculator-plugins/property-predictors/ (accessed July 15, 2015).
(20) Hawkins, P. C. D.; Skillman, A. G.; Warren, G. L.; Ellingson, B. A.; Stahl, M. T. Conformer Generation with OMEGA: Algorithm and Validation Using High Quality Structures from the Protein Databank and Cambridge Structural Database. J. Chem. Inf. Model. 2010, 50 (4), 572−584.
(21) Schärfer, C.; Schulz-Gasch, T.; Hert, J.; Heinzerling, L.; Schulz, B.; Inhester, T.; Stahl, M.; Rarey, M. CONFECT: Conformations from an Expert Collection of Torsion Patterns. ChemMedChem 2013, 8 (10), 1690−1700.
(22) Leite, T. B.; Gomes, D.; Miteva, M. A.; Chomilier, J.; Villoutreix, B. O.; Tufféry, P. Frog: A FRee Online druG 3D Conformation Generator. Nucleic Acids Res. 2007, 35, W568−W572.
(23) Li, J.; Ehlers, T.; Sutter, J.; Varma-O’Brien, S.; Kirchmair, J. CAESAR: A New Conformer Generation Algorithm Based on Recursive Buildup and Local Rotational Symmetry Consideration. J. Chem. Inf. Model. 2007, 47 (5), 1923−1932.
(24) Landrum, G. A. RDKit: Open-Source Cheminformatics Software. http://rdkit.org (accessed July 15, 2015).
(25) O’Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011, 3 (1), 33.
(26) BIOVIA Pipeline Pilot. http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/ (accessed August 26, 2015).
(27) Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007); Springer, 2007.
(28) Oellien, F.; Cramer, J.; Beyer, C.; Ihlenfeldt, W.-D.; Selzer, P. M. The Impact of Tautomer Forms on Pharmacophore-Based Virtual Screening. J. Chem. Inf. Model. 2006, 46 (6), 2342−2354.
(29) Urbaczek, S.; Kolodzik, A.; Fischer, J. R.; Lippert, T.; Heuser, S.; Groth, I.; Schulz-Gasch, T.; Rarey, M. NAOMI: On the Almost Trivial Task of Reading Molecules from Different File Formats. J. Chem. Inf. Model. 2011, 51 (12), 3199−3207.
(30) Urbaczek, S.; Kolodzik, A.; Groth, I.; Heuser, S.; Rarey, M. Reading PDB: Perception of Molecules from 3D Atomic Coordinates. J. Chem. Inf. Model. 2013, 53 (1), 76−87.
(31) Urbaczek, S.; Kolodzik, A.; Rarey, M. The Valence State Combination Model: A Generic Framework for Handling Tautomers and Protonation States. J. Chem. Inf. Model. 2014, 54 (3), 756−766.
(32) Hilbig, M.; Urbaczek, S.; Groth, I.; Heuser, S.; Rarey, M. MONA - Interactive Manipulation of Molecule Collections. J. Cheminf. 2013, 5, 38.
(33) Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; Tang, A.; Gabriel, G.; Ly, C.; Adamjee, S.; Dame, Z. T.; Han, B.; Zhou, Y.; Wishart, D. S. DrugBank 4.0: Shedding New Light on Drug Metabolism. Nucleic Acids Res. 2014, 42, D1091−D1097.
(34) Drug & Drug Target Database. DrugBank. http://www.drugbank.ca (accessed November 30, 2015).
(35) Gillespie, R. J. The Valence-Shell Electron-Pair Repulsion (VSEPR) Theory of Directed Valency. J. Chem. Educ. 1963, 40 (6), 295.
(36) Brecher, J. Graphical Representation Standards for Chemical Structure Diagrams (IUPAC Recommendations 2008). Pure Appl. Chem. 2008, 80, 277−410.
(37) Hilbig, M.; Rarey, M. MONA 2: A Light Cheminformatics Platform for Interactive Compound Library Processing. J. Chem. Inf. Model. 2015, 55 (10), 2071−2078.
(38) Kolodzik, A.; Urbaczek, S.; Rarey, M. Unique Ring Families: A Chemically Meaningful Description of Molecular Ring Topologies. J. Chem. Inf. Model. 2012, 52 (8), 2013−2021.
(39) Schärfer, C.; Schulz-Gasch, T.; Ehrlich, H. C.; Guba, W.; Rarey, M.; Stahl, M. Torsion Angle Preferences in Druglike Chemical Space: A Comprehensive Guide. J. Med. Chem. 2013, 56 (5), 2016−2028.
(40) Guba, W.; Meyder, A.; Rarey, M.; Hert, J. Torsion Library Reloaded: A New Version of Expert-Derived SMARTS Rules for Assessing Conformations of Small Molecules. J. Chem. Inf. Model. 2016, 56, 1.
(41) Wishart, D. S.; Knox, C.; Guo, A. C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang, Z.; Woolsey, J. DrugBank: A Comprehensive Resource for in Silico Drug Discovery and Exploration. Nucleic Acids Res. 2006, 34, D668−D672.
(42) Berman, H. M. The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1), 235−242.
(43) Protein Data Bank. RCSB PDB. www.rcsb.org (accessed January 18, 2016).
(44) Westbrook, J. D.; Shao, C.; Feng, Z.; Zhuravleva, M.; Velankar, S.; Young, J. The Chemical Component Dictionary: Complete Descriptions of Constituent Molecules in Experimentally Determined 3D Macromolecules in the Protein Data Bank. Bioinformatics 2015, 31 (8), 1274−1278.
(45) Feng, Z.; Chen, L.; Maddula, H.; Akcan, O.; Oughtred, R.; Berman, H. M.; Westbrook, J. Ligand Depot: A Data Warehouse for Ligands Bound to Macromolecules. Bioinformatics 2004, 20 (13), 2153−2155.
(46) LigandExpo. http://ligand-expo.rcsb.org/ (accessed January 14, 2016).
(47) Sayle, R. A. So You Think You Understand Tautomerism? J. Comput.-Aided Mol. Des. 2010, 24 (6−7), 485−496.
(48) Irwin, J. J.; Shoichet, B. K. ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Model. 2005, 45 (1), 177−182.
(49) Warren, G. L.; Do, T. D.; Kelley, B. P.; Nicholls, A.; Warren, S. D. Essential Considerations for Using Protein-Ligand Structures in Drug Discovery. Drug Discovery Today 2012, 17 (23−24), 1270−1281.