UNIQUE LABELS FOR COMPOUNDS - C&EN Global Enterprise (ACS

SCIENCE & TECHNOLOGY

UNIQUE LABELS FOR COMPOUNDS

IUPAC's chemical identifier project will create precise digital signatures for chemical structures

MICHAEL FREEMANTLE, C&EN LONDON

C&EN / DECEMBER 2, 2002 / HTTP://PUBS.ACS.ORG/CEN

THE INTERNATIONAL UNION OF Pure & Applied Chemistry (IUPAC) is engaged in a project to develop a set of mathematical rules for converting graphical representations of the structures of compounds to unique and unambiguous digital codes.

The codes, known as IUPAC Chemical Identifiers (IChIs), are alphanumeric text strings. The aim is to employ digitized versions of the strings as nonproprietary digital object identifiers (DOIs) of compounds that can be used in printed and electronic sources of chemical data and information.

The identifiers are derived in a three-step computer process from graphical input structural information using a set of algorithms. A single compound, whether known or as yet unknown, can have only one IChI regardless of how the structure is drawn.

"The emergence of computerized information-handling systems has had an enormous impact on chemistry and chemists," says Alan D. McNaught, a senior executive at the Royal Society of Chemistry in Cambridge, England, and president of IUPAC's Division of Chemical Nomenclature & Structure Representation.

"The ease with which chemical information can be shuttled around the world is phenomenal," he continues. "However, we are only just beginning to realize the potential of the computer for sharing and processing this information. A major stumbling block is the lack of agreement on standard ways of structuring and encoding molecular information, that is, chemical structures and properties. Progress in this area has been disappointingly slow."

The IChI project is the brainchild of Stephen R. Heller, guest researcher at the National Institute of Standards & Technology (NIST). The project originated at an IUPAC strategy meeting held in March 2000 at the National Academy of Sciences in Washington, D.C. The meeting brought together providers and users of chemical information to discuss future requirements for nomenclature and other ways of designating chemical compounds.

THE INITIAL WORK on the project is being carried out by Heller and NIST chemists Dmitrii V. Tchekhovskoi and Stephen E. Stein.

[Photo caption] STANDARD-BEARERS Programs developed by Stein (from left), Tchekhovskoi, and Heller will facilitate electronic communication of molecular structures.

Heller stresses that the project aims to develop public domain standard representation of chemical structures. The algorithms, when finalized, will be freely available to personal computer users from the IUPAC website and other websites. IUPAC, he says, is only undertaking work that commercial firms are not doing.

"The identifiers will be the equivalent of IUPAC systematic names but will be designed to be easily used by computers," Heller adds. "In particular, they are designed to link into desktop chemical-structure-drawing packages such as ChemDraw, ChemWindow, Chemistry 4-D Draw, ISIS/Draw, MDL/Draw, and ACD/ChemSketch."

The software development company Advanced Chemistry Development (ACD) is already supplying users with software from its website that connects its ACD/ChemSketch structure-drawing package directly to a test version of the IChI algorithm.

"ACD/Labs is supporting the IChI effort since we are committed to the distribution of systematic approaches to nomenclature," says Antony J. Williams, vice president of scientific development and marketing at ACD. He notes that ChemSketch freeware has been available for over four years and more than 210,000 copies are installed worldwide. More than 250 copies per day are taken from the ACD/Labs website (http://www.acdlabs.com/download/chemsk.html).

Heller points out that the IUPAC identifier system will be different from the Chemical Abstracts Service (CAS) and Beilstein registry systems and will serve a different purpose.

"Registry numbers, such as those used by CAS and Beilstein, imply that someone is registering chemicals and creating or storing a database," he explains. "As with the IUPAC nomenclature rules, no database is being created. The aim is to incorporate IChI facilities into structure-drawing programs so that users can easily convert drawn structures into identifiers and vice versa."

The need for a universally accepted standard structure-identification system has long been recognized at NIST, according to Stein. "The institute is involved in the development and distribution of evaluated reference data collections," he tells C&EN. "A long-standing, expensive, and vexing problem in this work has been the representation of compound identity. It is one of the oldest problems in chemistry.

"WHEN A NEW reference data value becomes available, an essential first step is to link it to replicate values, that is, to preexisting data for the same substance," he continues. "While chemical structures, in principle, provide the required information, directly by drawings and indirectly by chemical names, the extreme variability of these means of representation has required expensive expert effort to link the values."

"A scheme for generating a unique label from a structure was therefore developed at NIST to suit its needs for data collections," Stein adds. "However, the lack of an accepted standard for such labels has prevented its use for the important task of linking NIST data with external sources of chemical information," he adds.

In the first step of the IUPAC labeling process, a set of rules is used to remove redundant information from the graphical representation of a compound and produce an underlying structure.

"In this step, which we call 'normalization,' all the information that is not required for simple identification of a compound is stripped out," Stein explains. "It basically throws away, for example, all the information about the nature of the bond (whether it is a single, double, or triple bond) and the positions of the charge."

The underlying structure is then "canonicalized" to generate a unique set of atom labels. Canonicalization is a complicated mathematical procedure that requires a computer, according to Stein.

In the third and final step, the canonical labels are serialized to give the text string.

[Figure] GENERATOR Process converts input structure to unique identifier in three steps. Graphical structure input (sodium salicylate), then Normalization, then Canonicalization, then Serialization. IUPAC Chemical Identifier (IChI) output: C7H5O3.Na,8-7(9)5-3H-1H-2H-4H-6(5)10H/-1;+1

"The string is intended for digital representation of a molecule and doesn't depend on how the molecule is drawn," Stein says. "The number of characters in the string is more or less proportional to the size of the molecule."

At a meeting held in Cambridge, England, in August 2000, IUPAC decided to represent classes of structural information as separate layers in the IChI.

In the present test version, the first layer is the molecular formula of the compound, and the second represents the connections between the atoms. For example, the IChI for 2-nitrobutane is C4H9NO2,1H3-3H2-4H(2H3)-5(6)7, and the identifier for D-glucose is C6H12O6,7-1H-3H(9H)5H(11H)6H(12H)4H(10H)2H2-8H.

"The second layer is derived solely from the simple connectivity information represented in the input structure," Stein explains. "It ignores π-electrons and charge as well as stereochemical, tautomeric, and isotopic information."
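The three steps described above (normalization, canonicalization, serialization) can be illustrated with a toy sketch. This is not the IChI algorithm, whose canonicalization procedure the article does not detail. It is a minimal illustration under loud assumptions: hydrogens are left out, the canonical numbering is found by brute force over every renumbering (practical only for a handful of atoms), and the function names and output syntax are invented for the example.

```python
from collections import Counter
from itertools import permutations


def normalize(atoms, bonds):
    """Step 1: keep elements and connections; drop bond orders and charges."""
    elements = [element for element, _charge in atoms]
    edges = {tuple(sorted((i, j))) for i, j, _order in bonds}
    return elements, edges


def canonicalize(elements, edges):
    """Step 2: choose one atom numbering that does not depend on input order.

    Toy approach (an assumption, not IChI's method): try every renumbering
    and keep the lexicographically smallest description of the molecule.
    """
    best = None
    for perm in permutations(range(len(elements))):  # perm[old] is the new index
        relabeled = [None] * len(elements)
        for old, new in enumerate(perm):
            relabeled[new] = elements[old]
        new_edges = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in edges)
        candidate = (tuple(relabeled), tuple(new_edges))
        if best is None or candidate < best:
            best = candidate
    return best


def serialize(canonical):
    """Step 3: write a formula layer and a connection layer as one string."""
    elements, edges = canonical
    counts = Counter(elements)
    formula = "".join(el + (str(n) if n > 1 else "")
                      for el, n in sorted(counts.items()))
    connections = ";".join(f"{i + 1}-{j + 1}" for i, j in edges)
    return f"{formula},{connections}"


def identifier(atoms, bonds):
    """Run the three steps in order on one drawn structure."""
    return serialize(canonicalize(*normalize(atoms, bonds)))


# Acetate (heavy atoms only) drawn two ways: different atom order, and the
# double bond and negative charge placed on different oxygens.
# Atoms are (element, charge); bonds are (atom index, atom index, bond order).
drawing_a = ([("C", 0), ("C", 0), ("O", 0), ("O", -1)],
             [(0, 1, 1), (1, 2, 2), (1, 3, 1)])
drawing_b = ([("O", -1), ("O", 0), ("C", 0), ("C", 0)],
             [(2, 0, 1), (2, 1, 2), (2, 3, 1)])

print(identifier(*drawing_a))
print(identifier(*drawing_b))
```

Both drawings print the same string, which is the property Stein describes: the label depends on the underlying structure, not on how it was drawn. Because the sketch omits hydrogens, it distinguishes fewer compounds than a real identifier would.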

The layered approach will enable chemists to represent compounds at a level of detail of their choice. Each layer is output as a string of characters, and the layers are appended to one another in the order that they are computed. If there are insufficient data to generate a layer, that layer is omitted from the identifier.

Other layers in the test version include a charge layer and a tautomeric layer. The latter represents classes of compounds that can rapidly isomerize by hydrogen-atom migration.

"The order of the layers is partly dependent on how the user draws or wants to represent the molecule," Heller explains. "For example, tautomers will have to be dealt with before connectivity if the user draws a molecule as a tautomeric structure."

There is also an isotopic layer, in which different isotopically labeled atoms are distinguished from each other, and a stereochemical layer that includes conventional sp2 and sp3 stereochemistry.

"IUPAC released a PC-based executable version of an IChI test algorithm in March," McNaught notes. "This version was developed to deal with well-defined, covalently bonded neutral and ionic organic molecules. It was given to testers in a form that would accept structure input in a commonly used format and deliver data as tagged text."

The testing was carried out on structure databases such as the one at NIST. According to Stein, no examples of chemical structures that the program cannot handle have yet been found.

"Feedback from the testing was reviewed at a project meeting in Columbus, Ohio, in June, and no problems were reported," McNaught says. "The IChI was received enthusiastically when presented at the CAS/IUPAC Conference on Chemical Identifiers and XML for Chemistry in Columbus in July."

The text output from the IChI program is written in XML (extensible markup language), which is a universal format for structured documents and data on the Web. Like HTML (hypertext markup language), which is commonly used to display Web pages, it marks up text with tags, but the tags can be defined to suit the data.

Heller notes that the current version of the IChI software is yet to be finalized. An advanced version will be released early next year.

"Following the development of the IChI for organic covalent structures, we plan to extend the range of applicability to include, for example, organometallic and coordination compounds and polymers," McNaught says.

A test version of the IChI algorithm is available on request from Stein ([email protected]), and a demonstration of IChI generation with a freely available structure-drawing software package is available from McNaught ([email protected]). •
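The layered scheme and the tagged-text output described above can be sketched as follows. This is a minimal illustration, not the IChI program's actual output format. The separators are read off the examples printed in the article ("," before the connectivity layer, "/" before the charge layer), and the tag and attribute names `identifier`, `layer`, and `name` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Separator written before each layer. Inferred from the printed examples;
# an assumption, not a documented rule of the test version.
SEPARATORS = {"formula": "", "connectivity": ",", "charge": "/"}


def assemble(layers):
    """Append layers in the order given, omitting any layer with no data."""
    return "".join(SEPARATORS[name] + text for name, text in layers if text)


def to_tagged_text(layers):
    """Emit the same layers as XML; tag names here are hypothetical."""
    root = ET.Element("identifier")
    for name, text in layers:
        if text:  # a layer with insufficient data is simply left out
            ET.SubElement(root, "layer", name=name).text = text
    return ET.tostring(root, encoding="unicode")


# 2-Nitrobutane as quoted in the article: no charge layer can be generated.
nitrobutane = [
    ("formula", "C4H9NO2"),
    ("connectivity", "1H3-3H2-4H(2H3)-5(6)7"),
    ("charge", None),
]

print(assemble(nitrobutane))
print(to_tagged_text(nitrobutane))
```

Keeping each layer as a separate named field means a user can compare compounds at the level of detail they choose, for example by matching on the formula and connectivity layers only.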

