Report

J. Chem. Educ. XXXX, XXX, XXX−XXX

pubs.acs.org/jchemeduc

The InChI Code

Paul J. Karol*,†

Department of Chemistry, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, United States

ABSTRACT: The International Union of Pure and Applied Chemistry has developed a computer algorithmic scheme for unambiguously coding molecular structure into relatively short alphabetic tags. These are now ubiquitous in many descriptions of molecular structure, yet the ease with which they are recognized and used is suppressed by their opaqueness even to the chemically knowledgeable reader. We present a short description of the structure, utility, advantages, and disadvantages of the code to ease familiarization.

KEYWORDS: General Public, Analytical Chemistry, Chemoinformatics, Internet/Web-Based Learning, Multimedia-Based Learning, Molecular Modeling, Nomenclature/Units/Symbols

Received: February 8, 2018
Revised: March 30, 2018
© XXXX American Chemical Society and Division of Chemical Education, Inc.

When looking up properties of chemicals on Wikipedia, for example, the list of molecule identifiers invariably includes something called “InChI” (rhymes with DaVinci). This paper’s brief discussion is an attempt to explain what InChI is, why it came about, and how it is used. In doing so, we have taken certain liberties in the interest of brevity and simplicity. Ultimately, the identifier scheme is not at all needed in chemistry curricula, nor is a detailed understanding an objective here. InChI is used by professional chemical database providers, publishers, chemistry software vendors, librarians, patent attorneys, and information specialists. It is not and should not be the subject of any general chemistry exam.

In March 2000 the International Union of Pure and Applied Chemistry (IUPAC) convened a meeting in Washington, DC, to look into the matter of chemical structure representation since, with the ever-increasing reliance on and potential of computer utility, it was clear that better schemes for nomenclature and/or symbols might be appropriate.

WHAT IS THIS NEW CHEMICAL IDENTIFIER?

InChI is an acronym for IUPAC International Chemical Identifier. It provides a string of characters representing a unique digital signature for a chemical substance. It is derived solely from a structural representation and is independent of the way that the structure appears. Development of the InChI algorithm and software took place at the US National Institute of Standards and Technology under the auspices of IUPAC. First publications of its implementation began a few years later (refs 1, 2). The most current in-depth description appeared in 2015 (ref 3).

Chemical identifiers can be uninformative, having no information about structure, as in the case of a Chemical Abstracts Service registry number. Or they can be illuminating, providing the means to deduce structure, as with a systematic name or a computer-generated 3D image. The 3D chemical structure of a compound is its true identifier, providing a complete representation of the molecule through atomic coordinates and connections between atoms. However, there has been no successful agreement on a unique computerized scheme that would be openly available relating unambiguous chemical structures via the Internet, especially for efficient searching and matching tasks. To sum up in one sentence: InChI uses structure to provide a precise, robust, IUPAC-approved tag for a chemical substance.

InChI was developed as a free, nonproprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of data compilations and unambiguous identification of chemical substances. The InChI algorithm converts the structure, in the form of its connection table, to a unique (indecipherable) string of letters, generating a machine-interpretable InChI coded key. InChI presently covers organic molecules and, with some limits, organometallics. It ignores bond orders except for analyzing stereochemistry and hydrogen atom migration as in keto−enol tautomers. It does not express positions of electrons. In that regard it is not a conventional method for representing molecular structure. However, it does secure the information needed for unique identification.

ILLUSTRATING THE CODE

A picture is worth a thousand words, so caffeine is used here as an illustration. The string, from Wikipedia for instance, will read as (ref 4):

InChI = 1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3

The first “layer” is the version number (1S in the example, where S marks a standard InChI); the second layer is the formula. C8 implies that, in the third layer for connectivity, /c, carbons are represented by numbers 1−8; hydrogens are ignored in this layer. N4 means atoms 9−12 will be nitrogens, and O2 means 13 and 14 will be oxygens. The connectivity layer shows the links between pairs of numbered atoms, with parentheses marking branches so that an atom can be linked to three neighbors. To this point, one can work backward from the InChI connectivity string for caffeine and generate the structural scheme shown in Figure 1.
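The layer description above can be traced in a few lines of Python. What follows is a minimal sketch and not the official InChI software. The function names (`split_layers`, `number_heavy_atoms`, `connections`) are illustrative choices, the sketch assumes a single-component string such as the caffeine example, and it relies on the simplified numbering rule given in the text, in which non-hydrogen atoms are numbered in the order their elements appear in the formula.

```python
import re

CAFFEINE = "InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3"


def split_layers(inchi):
    """Return (version, formula, {prefix letter: layer body})."""
    # Tolerate the spaced "InChI = " form used in the article's display lines.
    body = re.sub(r"^InChI\s*=\s*", "", inchi.strip())
    version, formula, *rest = body.split("/")
    return version, formula, {layer[0]: layer[1:] for layer in rest}


def number_heavy_atoms(formula):
    """Map atom number -> element symbol, skipping hydrogens."""
    labels = {}
    next_number = 1
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol == "H":
            continue  # hydrogens are not numbered in the /c layer
        for _ in range(int(count or 1)):
            labels[next_number] = symbol
            next_number += 1
    return labels


def connections(c_layer):
    """List the atom pairs linked in a single-component /c layer."""
    bonds = []
    branch_points = []  # atoms whose parenthesis is still open
    previous = None     # atom that the next number attaches to
    for token in re.findall(r"\d+|[(),]", c_layer):
        if token == "(":
            branch_points.append(previous)
        elif token == ",":
            previous = branch_points[-1]    # sibling branch, same origin
        elif token == ")":
            previous = branch_points.pop()  # resume the main chain
        else:
            atom = int(token)
            if previous is not None:
                bonds.append((previous, atom))
            previous = atom
    return bonds


version, formula, layers = split_layers(CAFFEINE)
labels = number_heavy_atoms(formula)
bonds = connections(layers["c"])

print(version, formula)                   # 1S C8H10N4O2
print(labels[1], labels[9], labels[13])   # C N O
print(len(bonds))                         # 15
```

For caffeine the sketch assigns carbons to 1−8, nitrogens to 9−12, and oxygens to 13 and 14, and it finds 15 links among the 14 numbered atoms, which is what a 14-atom skeleton closed into two rings requires. Atom 5 ends up linked to atoms 6, 7, and 10, an instance of the three-way linking that the parentheses make possible.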

Figure 1. Visualization of the InChI numbering and connection scheme for caffeine.

The next layer, /h, handles the fixed hydrogens of caffeine, placing one H at C4 and three each at C1−3. Additional layers, not discussed here, would accommodate isotopic composition, stereochemistry, tautomerism, and ionic charges, if necessary. Rearranging the appearance above generates the arrangement on the left below from which a moderately informed chemist can infer the structure on the right in Figure 2.

Figure 2. Rearrangement of caffeine’s numbered connectivity from Figure 1 is shown on the left, and its implied structure is shown on the right.

The string’s format is designed for compactness, not readability, in contrast to systematic nomenclature. The length of an identifier is roughly proportional to the number of atoms in the substance. Note that proceeding in the reverse direction, generating the InChI string from the molecular structure on the right above, would permit a variety of possible numbering arrangements. Computer codes assign the correct one using well-known mathematical procedures called canonicalization. In that canonicalization step, a set of numerical labels is algorithmically generated that does not depend on how the structure was initially drawn. The canonicalization process (ref 5) itself is well beyond the scope of this brief report. Among the disadvantages of InChI are that it is not effortlessly readable and that it can be long. It does not support additional stereochemistry such as octahedral and square planar geometries and does not retain bond order information.

UNLOCKING THE CODE

So far, the illustration with caffeine has dealt with just the InChI connectivity string. That string is read by computer software to spawn the final “key” code, which we now look at. The InChIKey is a condensed alphabetic code for the identifier string and, among other uses, expedites computer-based searching. By definition, the InChIKey length is always 27 characters: 25 uppercase English letters plus two dashes as separators. It is much shorter than a typical InChI connectivity string, the average length of which is 146 characters, and provides a (nearly unique; ref 6) representation of the parent InChI and indirectly of the parent molecule. The InChIKey for the caffeine illustration, generated by the open-source, freely available software (ref 7), is

InChIKey = RYYVLZVUVIJVGH-UHFFFAOYSA-N

Molecular skeletons like that above in the structure illustration are encoded by computer algorithm into the 14-letter opening block. That size basically caps molecular input for InChI at 1023 atoms. The string length from which it is generated scales roughly with the number of atoms, which, in the absence of a key, rapidly grows too large to be readable. The next block of 8 letters above encodes stereochemical and isotopic information, which we do not discuss in this document. The final “N” indicates the molecule is neutral: uncharged. Most, if not all, commercial molecular structure software programs provide InChI and InChIKey identifiers as both import and export options. A complete Technical Manual is available via the Internet (ref 4). Even small molecules have InChI identifier keys of this exact length. For example, the chemical with the IUPAC systematic name oxidane has identifiers

CAS registry number = 7732-18-5

InChI = 1S/H2O/h1H2

InChIKey = XLYOFNOQVPJJNP-UHFFFAOYSA-N

A final point is that InChI will not replace conventional nomenclature. Oxidane is also known as water.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Paul J. Karol: 0000-0002-3555-6899

Present Address

149 Bryant Street, Palo Alto, CA 94301−1104, United States.

Notes

The author declares no competing financial interest.

ACKNOWLEDGMENTS

The author acknowledges and appreciates the comments from the ACS Committee on Nomenclature, Terminology, and Symbols and its Education Subcommittee, of which the author is a member.

REFERENCES

(1) Coles, S. J.; Day, N. E.; Murray-Rust, P.; Rzepa, H. S.; Zhang, Y. Enhancement of the Chemical Semantic Web through the Use of InChI Identifiers. Org. Biomol. Chem. 2005, 3, 1832.
(2) McNaught, A. The IUPAC International Chemical Identifier: InChI - A New Standard for Molecular Informatics. Chem. Int. 2006, 28, 12.
(3) Heller, S. R.; McNaught, A.; Pletnev, I.; Stein, S.; Tchekhovskoi, D. InChI, the IUPAC International Chemical Identifier. J. Cheminf. 2015, 7, 23.
(4) English-language Wikipedia entry for the term Caffeine. https://en.wikipedia.org/wiki/Caffeine (accessed March 2018).
(5) Stein, S.; Heller, S.; Tchekhovskoi, D.; Pletnev, I. IUPAC International Chemical Identifier (InChI) InChI, version 1, software version 1.04 (2011) Technical Manual; http://www.inchi-trust.org/download/104/InChI_TechMan.pdf (accessed March 2018).
(6) The chances of InChIKey nonuniqueness are small but not zero.
(7) InChI Trust. Downloads of InChI Software. https://www.inchi-trust.org/downloads/ (accessed March 2018).

DOI: 10.1021/acs.jchemed.8b00090 J. Chem. Educ. XXXX, XXX, XXX−XXX