Article
A Novel Concept for the Search and Retrieval of the Derwent Markush Resource Database
Andreas Barth, Thomas Stengel, Edwin Litterst, Hans Kraut, Henry Matuszczyk, Franz Ailer, and Steve Hajkowski
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00082 • Publication Date (Web): 28 Apr 2016
Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Journal of Chemical Information and Modeling
A Novel Concept for the Search and Retrieval of the Derwent Markush Resource Database

Andreas Barth1*, Thomas Stengel1, Edwin Litterst1, Hans Kraut2, Henry Matuszczyk2, Franz Ailer2, Steve Hajkowski3

1 FIZ Karlsruhe – Leibniz Institute for Information Infrastructure, D-76344 Eggenstein-Leopoldshafen, Germany
2 InfoChem GmbH, D-81241 Munich, Germany
3 Thomson Reuters, London, EC1N 8JS, United Kingdom

* Corresponding author: Andreas Barth
[email protected]

Abstract

The representation of and search for generic chemical structures (Markush) remains a continuing challenge. Several research groups have addressed this problem and over time a limited number of practical solutions have been proposed. Today, there are two large commercial providers of Markush databases: Chemical Abstracts Service (CAS) and Thomson Reuters. The Thomson Reuters ‘Derwent’ Markush database is currently offered via the online services Questel and STN, and as a data feed for in-house use. The aim of this paper is to briefly review the existing Markush systems (databases plus search engines) and to describe our new approach for the implementation of the Derwent Markush Resource on STN. Our new approach demonstrates the integration of the Derwent Markush Resource database into the existing chemistry-focused STN platform without loss of detail. This provides compatibility with other structure and Markush databases on STN and at the same time it is possible to deploy the specific features and functions of the Derwent approach. It is shown that the different Markush languages developed by CAS and Derwent can be combined into a single general Markush description. In this concept the generic nodes are grouped together in a unique hierarchy where all chemical elements and fragments can be integrated. As a consequence both systems are searchable using a single structure query. Moreover, the presented concept could serve as a promising starting point for a common generalized description of Markush structures.

Keywords

Markush database; Information Retrieval of Markush Structures; Topological Concept; Structure Search
1. Introduction: Background and History

In 1924 Eugene Markush filed a patent at the United States Patent Office entitled Pyrazolone Dye and Process of Making the Same1 in which he claimed a rather large group of chemical compounds for the manufacturing of dyes. At that time the US patent system did not allow the term “OR” to be used to describe several options (variations) of an invention. Instead, Eugene Markush chose the expression “material selected from the group” to introduce a list of candidate compounds. After some dispute with the patent office his wording was accepted, and since then chemists have been rather inventive in broadening the scope of their patent claims by using very general descriptions for chemical structure fragments. These generic structures are commonly called Markush structures.

The representation of and search for Markush structures pose a real challenge. Several research groups have addressed this problem and over time a limited number of practical solutions have been proposed.2-9 Today, there are two large commercial providers of Markush databases: Chemical Abstracts Service (CAS)10 and Thomson Reuters11. CAS provides its CAS Markush database (MARPAT) via the online service STN and the end-user tool SciFinder. The Thomson Reuters file is currently offered via the online services Questel (named MMS: Merged Markush Service) and STN (named Derwent Markush Resource), and as a data feed for in-house use. The concepts of these databases and the corresponding retrieval systems differ significantly, and several comparisons of the two systems as well as customer expectations have been published in the literature.12-18 Some approaches have aimed to build a Markush search system for in-house use.19-21 However, these systems focus more on cheminformatics applications, e.g. combinatorial chemistry, and less on patent retrieval.
The focus of this paper is to briefly review the existing Markush online systems (databases plus search engines) and to describe our new approach for the implementation of the Derwent Markush Resource on STN (file label DWPIM: Derwent World Patents Index Markush). Our new approach shows how the Derwent Markush concept is integrated into the existing STN concept for structure searching without loss of detail. This provides compatibility with other structure databases and at the same time it is possible to deploy the specific features and functions of the existing Derwent approach.

The Derwent patent information system consists of three databases: the Derwent World Patents Index (DWPI) is document-based and covers all patent literature, the Derwent Chemistry Resource (DCR) contains all specific structures referenced in DWPI, and the Derwent Markush Resource (DWPIM) covers the Markush structures referenced in DWPI. It is important to note that DWPI itself does not contain chemical structures, but it references the structures in DCR and DWPIM. All three databases are integrated in a single content domain on STN, which enables simplified searching of combined text (DWPI) and structure terms (DWPIM and DCR) as well as easy projections of documents between the structure and the document databases.

2. Markush Description of Chemical Structures

Specific chemical substances are unique compounds and they are represented by a single chemical structure where all nodes correspond to specific chemical elements.23 A Markush structure can be generated from a specific structure by introducing generalizations and variations which are well known to chemists. Figure 1 shows how a Markush structure can be obtained from a specific structure by applying generalizations called Markush variations. In general, Markush structures in
real patents are described as a main structure (core structure or scaffold) together with a list of variations. The latter are normally grouped together in generic groups (G-groups), which may themselves form a hierarchy of G-groups. In DWPIM a Markush structure may contain up to 50 G-groups, each G-group may contain up to 50 variations, and G-groups can be nested to a maximum of 4 levels. The description of the substituents may include standard or nonstandard nomenclature, molecular formulas, or even free text. Generic nodes may include additional attributes or element counts. Provisos are also common in patents, e.g. expressions such as “optionally substituted by” or conditional logic such as “if R1 = CH3 then R2 = H”.
Figure 1. Example for a transformation from a specific chemical structure to a Markush structure.

The different types of possible variations have been classified24 as
• Substituent (s-) variation: a list of alternative chemical groups or elements
• Position (p-) variation: a variable point of attachment for a substituent
• Frequency (f-) variation: a variable repetition of a substituent
• Homology (h-) variation: a generic class of homological compounds usually expressed as a generic node.
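The G-group limits quoted above for DWPIM (50 G-groups per structure, 50 variations per G-group, 4 nesting levels) can be illustrated with a small data model. The following is only a sketch under stated assumptions: the class and function names (GGroup, MarkushStructure, check_limits) are hypothetical and not part of any Derwent or STN interface, fragments are reduced to plain strings, and nested G-groups are assumed to count towards the limit of 50.

```python
from dataclasses import dataclass, field
from typing import List, Union

# Limits quoted in the text for DWPIM.
MAX_GROUPS = 50      # G-groups per Markush structure
MAX_VARIATIONS = 50  # variations per G-group
MAX_DEPTH = 4        # nesting levels of G-groups


@dataclass
class GGroup:
    """Hypothetical G-group: a label plus its alternative variations."""
    label: str
    # Each variation is a fragment (plain string) or a nested G-group.
    variations: List[Union[str, "GGroup"]] = field(default_factory=list)


@dataclass
class MarkushStructure:
    """Hypothetical Markush structure: a core plus top-level G-groups."""
    core: str
    groups: List[GGroup] = field(default_factory=list)


def nested(group):
    """Return the G-groups that appear as variations of this G-group."""
    return [v for v in group.variations if isinstance(v, GGroup)]


def depth(group):
    """Nesting depth; a G-group without nested G-groups has depth 1."""
    return 1 + max((depth(g) for g in nested(group)), default=0)


def walk(groups):
    """Yield every G-group, including nested ones."""
    for g in groups:
        yield g
        yield from walk(nested(g))


def check_limits(structure):
    """Return a list of limit violations (empty if the structure is valid)."""
    problems = []
    groups = list(walk(structure.groups))
    if len(groups) > MAX_GROUPS:
        problems.append(f"{len(groups)} G-groups exceed the limit of {MAX_GROUPS}")
    for g in groups:
        if len(g.variations) > MAX_VARIATIONS:
            problems.append(f"{g.label} has more than {MAX_VARIATIONS} variations")
    for g in structure.groups:
        if depth(g) > MAX_DEPTH:
            problems.append(f"{g.label} is nested deeper than {MAX_DEPTH} levels")
    return problems


# Example: G1 -> G2 -> G3 -> G4 is exactly 4 levels deep and therefore valid.
g4 = GGroup("G4", ["F", "Cl"])
g3 = GGroup("G3", ["OH", g4])
g2 = GGroup("G2", ["CH3", g3])
g1 = GGroup("G1", ["H", g2])
example = MarkushStructure(core="pyridine", groups=[g1])
```

Calling check_limits(example) returns an empty list; wrapping G1 in one more G-group would produce a nesting violation.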
The homology variations are not uniquely defined, and the two commercial providers of Markush databases have chosen different sets of homology nodes. Chemical Abstracts Service uses a more formal concept with a small number of hierarchical generic nodes to describe generalized organic fragments.7 Derwent on the other hand has chosen to define a larger set of nodes which incorporates different chemical aspects.6 In general, the two concepts provide complementary views of Markush structures in patents. Markush structures in patents can become very complex, and over the years their complexity and generality have increased significantly.22 In real patents it is common practice to include as many specific structures as possible in a single Markush structure, using rather broad generalizations, in order to ensure maximal coverage of the invention. In general, Markush structures may contain many variations of the types shown in Figure 1, and this may result in a very large or even infinite number of specific structures. Markush structures can be enumerated only if they do not contain homology variations,
i.e. generic nodes such as carbon chain (Ak) or heterocycle (Hy). Even then the number of incorporated specific structures may be too large to manage them as individual structures in a database. If the Markush structure contains at least one generic node the number of possible specific structures is infinite.

3. Derwent Markush Resource

3.1. Derwent Chemical Content Domain

Chemical information is indexed by Thomson Reuters in 3 databases:
• Derwent World Patents Index (DWPI) covers all patent literature including chemistry
• Derwent Chemistry Resource (DCR) contains all specific structures referenced in DWPI
• Derwent Markush Resource (DWPIM) contains all Markush structures referenced in DWPI.
The chemistry data content comprises more than 3 million patent families with specific and/or Markush structures. Currently, there are over 1.9 million Markush structures available in DWPIM, and DCR contains about 2.5 million structures. Generic structures are indexed in the Derwent Markush Resource and the corresponding records are structure based, i.e. one structure record contains a single Markush structure with all variations. Not all Markush structures contain homology (generic) nodes. Some structures are defined with specific elements only and for historical reasons there are also some pure specific compounds (see section 3.2). The databases DWPI, DCR, and DWPIM are closely related and all structure keys are indexed in DWPI (see Figure 2). Hence, it is possible to crosslink between the structure databases (DCR, DWPIM) and the DWPI database using the structure keys. Both structure databases can be searched simultaneously as well as independently. The results of a substructure search can be refined with keywords within the DWPI database.
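The crosslinking via structure keys can be pictured as a simple join between structure hits and document records. The sketch below is an illustration only: the record layout, the accession numbers, the structure keys, and the function names are hypothetical and do not reflect the actual STN implementation.

```python
# Hypothetical DWPI records: accession number (AN) -> title (TI) and the
# structure keys of the DCR / DWPIM records referenced by the document.
DWPI = {
    "AN-1": {"TI": "Substituted pyridines as herbicides",
             "keys": {"DCR-1", "DWPIM-1"}},
    "AN-2": {"TI": "Pyrazolone dyes",
             "keys": {"DWPIM-2"}},
}


def documents_for(structure_hits, records=DWPI):
    """Project structure hits (keys from DCR and/or DWPIM) onto documents."""
    hits = set(structure_hits)
    return sorted(an for an, rec in records.items() if rec["keys"] & hits)


def refine_by_keyword(accession_numbers, keyword, records=DWPI):
    """Keep only documents whose title contains the keyword."""
    kw = keyword.lower()
    return [an for an in accession_numbers if kw in records[an]["TI"].lower()]
```

Because one document may reference several structure records (the 1:n relationship), a single structure hit is enough to retrieve the document, and the keyword refinement then operates on the document side only.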
Figure 2. Derwent Content Domain consisting of the databases DWPI, DCR and DWPIM. In DWPI, AN is the accession number and TI is the title. The structure keys are 2870380 for DCR and 1158-89001 for DWPIM.

DWPI documents and DCR or DWPIM structure records have a 1:n relationship, which means that a chemical invention may consist of one or more specific and/or Markush structures.

The Derwent Markush Resource is produced by Thomson Reuters and it contains Markush structures from patent documents covering 33 patent issuing authorities, including the back-file indexed by the French patent office (INPI) for the years 1961 to 1998. Data coverage for US, EP and WO patents begins in 1978, with further major countries covered starting from 1980 and 1987; Korean and Chinese patents are covered from 2008 onwards. The database covers all areas of chemistry, except polymers.11, 26

3.2. Derwent Concept of Superatoms

In the Derwent Markush Resource generic nodes are called superatoms and they can be clustered in 4 groups: acyclic and cyclic (organic) fragments (8), halogens and metals (7), special generic superatoms (7), and peptide superatoms (30). The first 3 groups of superatoms are shown in Table 1.

Table 1. DWPIM superatoms excluding peptides.

Group      Superatom   Definition
acyclic    CHK         alkyl, alkylene
acyclic    CHE         alkenyl, alkenylene
acyclic    CHY         alkynyl, alkynylene
cyclic     ARY         aryl
cyclic     CYC         cycloaliphatic
cyclic     HEA         monocyclic heteroaryl
cyclic     HET         monocyclic non-aromatic
cyclic     HEF         fused heterocyclic
elements   HAL         halogen
elements   MX          any metal
elements   A35         group III A to V A metal
elements   ACT         actinide
elements   AMX         alkali and alkaline earth metal
elements   LAN         lanthanide
elements   TRM         transition metal
others     ACY         acyl
others     DYE         chromophore
others     PEG         polymer end group
others     POL         macromolecule residue
others     PRT         protecting group
others     UNK         any atom or group incl. H
others     XX          any atom or group excl. H
There are essential differences among the 4 groups of superatoms. Acyclic and cyclic (organic) fragments are true superatoms representing organic fragments, halogens and metals are closed lists of real atoms, and other generic superatoms are indexed for general descriptions without any known
structural features. Other superatoms are generic descriptions which are only used when the patent does not include any specific examples, e.g. acyl (ACY) is only indexed if this term is explicitly stated in the patent. Although polymers are not routinely indexed in DWPIM, it is necessary to describe macromolecular residues including polymers with the superatom POL and the polymer end groups with the superatom PEG. So-called peptide superatoms do not represent a class of structures; they are abbreviations for specific amino acids. For example, SER is an abbreviation for the specific amino acid serine. Hence, peptide superatoms can be interpreted as shortcuts like Me for methyl and they can be treated differently from the other superatoms.

An analysis shows that the occurrences of the superatoms in Markush structures vary significantly. Figure 3 shows the number of records containing superatoms as a percentage of the total number of Markush structures (1,950,674). It can be seen that CHK (61%) is the most common superatom, followed by ARY (32%), XX (29%), HAL (24%), CHE (23%), and CYC (21%). HEA, HET, CHY, and HEF have a medium frequency while all the others are less frequently used superatoms. Consequently, specifying one of the less frequently used superatoms in the query is highly selective. It is interesting to note that not all Markush structures contain superatoms. This means that some structures are defined with specific elements only, e.g. using G-groups with variations of specific elements. For historical reasons there are also some additional specific compounds without any variations which are indexed in DWPIM and not in DCR. The total number of structures containing only specific elements is 478,423 (25%).

Figure 3. Number of records containing superatoms in percent.

All organic fragments together with halogen and metal elements form a hierarchy of nodes with the superatom XX at the top26 (see Figure 4). The superatom ARY comprises all carbocyclic ring systems which contain at least one benzene ring while CYC contains all the other carbocycles. Both can be either mono- or polycyclic. HEA describes monocyclic five- and six-membered aromatic rings while
HET covers all other monocyclic heterocycles. The superatom HEF comprises all fused heterocycles. It is important to note that the grouping of carbocyclic and heterocyclic generic nodes is different. Carbocyclic compounds are divided into two classes depending on whether they are aromatic (containing benzene) or not. For heterocyclic compounds the division is more complex: the first distinction is between mono- and polycyclic compounds, and monocyclic compounds are further subdivided into heteroaryls and all other monocyclic heterocycles. CHK, CHE, and CHY describe carbon chains with only single bonds (CHK), with at least one double bond but no triple bonds (CHE), and with at least one triple bond (CHY). The superatom MX contains all metals and is further divided into five subgroups: A35, ACT, AMX, LAN, TRM. Finally, HAL represents the halogen atoms.
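The classification rules just described can be written down as two small decision functions. This is a sketch, not the Derwent indexing procedure: the function names and parameters are hypothetical, a fragment is assumed to be already characterized by a few flags, and every polycyclic heterocycle is treated as HEF, which simplifies the statement that HEF comprises the fused heterocycles.

```python
def chain_superatom(double_bonds: int, triple_bonds: int) -> str:
    """Superatom for a carbon chain, based on its bond orders."""
    if triple_bonds > 0:
        return "CHY"  # at least one triple bond
    if double_bonds > 0:
        return "CHE"  # at least one double bond, no triple bond
    return "CHK"      # single bonds only


def ring_superatom(has_heteroatom: bool, ring_count: int,
                   contains_benzene: bool = False,
                   aromatic: bool = False, ring_size: int = 0) -> str:
    """Superatom for a ring system, following the grouping in the text."""
    if not has_heteroatom:
        # Carbocycles: split by the presence of a benzene ring,
        # regardless of whether the system is mono- or polycyclic.
        return "ARY" if contains_benzene else "CYC"
    if ring_count > 1:
        return "HEF"  # assumption: every polycyclic heterocycle maps to HEF
    if aromatic and ring_size in (5, 6):
        return "HEA"  # monocyclic five- or six-membered heteroaryl
    return "HET"      # all other monocyclic heterocycles
```

For example, pyridine (one aromatic six-membered ring with a heteroatom) falls under HEA, piperidine under HET, and quinoline under HEF, while naphthalene is ARY and cyclohexane is CYC.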
Figure 4. Derwent Markush Resource: original hierarchy of superatoms.

The specific elements also fit into this hierarchy. Metal and halogen elements are grouped together as generic element nodes. Non-metal elements on the other hand have no corresponding generic node. The other superatoms (ACY, DYE, etc.) do not fit in this hierarchical scheme and must be treated as isolated superatoms which have no correspondence with other nodes or groups.

4. STN Concept for the Derwent Markush Resource

4.1. STN Concept of Generic Nodes

Structure queries may be built by using specific nodes (real atoms), shortcuts, variable groups, and generic (homology) nodes. Shortcuts are abbreviations for chemical groups, e.g. Me for methyl, and they are replaced by the corresponding elements before searching. Variable groups are G-groups which contain a set of alternative elements or fragments, including generic nodes. The generic query nodes of STN are listed in Table 2.
Table 2. List of STN generic query nodes.

Group      STN Generic Node   Definition                 Attributes
acyclic    Ak                 carbon chain               saturation, type of chain, # C atoms
cyclic     Cy                 any ring system            saturation, type of ring, # of C atoms, # hetero atoms
cyclic     Cb                 carbocyclic ring system    saturation, type of ring, # of C atoms
cyclic     Hy                 heterocyclic ring system   saturation, type of ring, # of C atoms, # hetero atoms
elements   X                  halogen                    chain, ring, ring/chain
elements   M                  any metal                  chain, ring, ring/chain
elements   Q                  any atom excl. C and H     chain, ring, ring/chain
elements   A                  any atom excl. H           chain, ring, ring/chain
All generic nodes may have additional substituent(s). The acyclic and cyclic generic query nodes form a simple hierarchy compared to the DWPIM hierarchy in Figure 4. In addition to the acyclic and cyclic generic nodes there are two closed atom lists for halogens (X) and for metals (M) as well as Q (any atom except carbon and hydrogen) and A (any atom except hydrogen).

In addition to the STN generic nodes listed in Table 2, all Derwent generic nodes (superatoms) from Table 1 can also be used for searching. The only exceptions are MX and HAL, which have to be replaced in the query by the corresponding STN generic nodes M and X; the search nevertheless retrieves MX and HAL in the DWPIM structures. Hence, the user may choose from a total of 28 generic nodes for structure searching: 8 STN generic nodes (Ak, Cy, Cb, Hy, X, M, Q, A) plus 20 Derwent superatoms (CHK, CHE, CHY, ARY, CYC, HEA, HET, HEF, A35, ACT, AMX, LAN, TRM, ACY, DYE, PEG, POL, PRT, UNK, XX). An example for the use of both STN generic nodes and Derwent superatoms is shown in Figure 5. The isoquinoline ring carries 4 different substituents: a halogen atom (X), a carbon chain (Ak), a heteroaryl (HEA), and an aryl ring (ARY).
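The query vocabulary described above is small enough to be checked mechanically. The following sketch lists the 28 admissible generic query nodes and applies the replacement of MX and HAL; the function name normalize_query_node is hypothetical and only illustrates the rule, it is not an STN command.

```python
# The 8 STN generic nodes and the 20 Derwent superatoms allowed in queries.
STN_NODES = {"Ak", "Cy", "Cb", "Hy", "X", "M", "Q", "A"}
DERWENT_QUERY_SUPERATOMS = {
    "CHK", "CHE", "CHY", "ARY", "CYC", "HEA", "HET", "HEF",
    "A35", "ACT", "AMX", "LAN", "TRM",
    "ACY", "DYE", "PEG", "POL", "PRT", "UNK", "XX",
}
# MX and HAL are indexed in DWPIM but must be written as M and X in a query.
QUERY_REPLACEMENTS = {"MX": "M", "HAL": "X"}


def normalize_query_node(symbol: str) -> str:
    """Return the generic node symbol to use in the query."""
    if symbol in QUERY_REPLACEMENTS:
        return QUERY_REPLACEMENTS[symbol]
    if symbol in STN_NODES or symbol in DERWENT_QUERY_SUPERATOMS:
        return symbol
    raise ValueError(f"{symbol!r} is not a generic query node")


# The four substituents of the query in Figure 5 are all admissible as given.
figure5_substituents = [normalize_query_node(s) for s in ("X", "Ak", "HEA", "ARY")]
```

The two sets together contain 28 symbols, matching the count given in the text.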
Figure 5. Example of a structure query using both STN generic nodes and Derwent superatoms.

4.2. Integration of the STN Concept and the Derwent Superatoms

The structure search conventions of STN are based on the CAS indexing concept for the REGISTRY and MARPAT databases. DWPIM on the other hand has created a different set of generic structure conventions which have been implemented by Questel. In order to provide a single search
environment on STN it is necessary to develop a concept that integrates DWPIM into STN. In our new approach we have succeeded in integrating the Derwent hierarchy of superatoms into the STN query hierarchy of generic nodes. This is illustrated for the organic fragments in Figure 6. This picture clearly illustrates that both hierarchies fit well together creating new possibilities for the matching between specific fragments/nodes and generic nodes.
Figure 6. Integrated hierarchy of generic nodes.

The hierarchy of nodes shows that there are 4 different sets of nodes and structure fragments (from bottom to top):
• Specific fragments (violet boxes) represent a specific chemical structure, e.g. piperidine. These specific fragments are indexed both by CAS and Derwent.
• Derwent superatoms (green circles), e.g. HEA represents heteroaryls. These superatoms are characterized by a structural property and they are indexed only by Derwent.
• CAS hierarchical generic nodes (blue circles), e.g. Cb represents carbocyclic ring systems. These generic nodes are defined in a rather abstract way as chains or cyclic structures. They are indexed only by CAS in the MARPAT database and they are the key components for Markush queries in STN. It is important to note that the Derwent superatoms are unambiguously contained in the STN nodes. As an example, Cb comprises the superatoms ARY and CYC.
• A very generic node R (brown circle) describes generic chemical nodes and fragments. R is defined and handled differently by CAS and Derwent. This node is indexed only by Derwent. For this node the STN and Questel implementations are different (Questel: R = XX, STN: R includes both XX and UNK).
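The containment of Derwent superatoms in STN generic nodes can be expressed as a small tree. The sketch below is one reading of the hierarchy for the organic fragments: only the membership of ARY and CYC in Cb is stated explicitly in the list above, while the remaining edges (Ak, Hy, and Cy as the union of Cb and Hy) are assumptions derived from the node definitions in Tables 1 and 2. The function name is hypothetical.

```python
# Assumed parent -> children edges for the organic part of the hierarchy.
CHILDREN = {
    "Ak": ["CHK", "CHE", "CHY"],
    "Cy": ["Cb", "Hy"],
    "Cb": ["ARY", "CYC"],
    "Hy": ["HEA", "HET", "HEF"],
}
# HEF is fused by definition, so it cannot satisfy a "mono" attribute.
FUSED_BY_DEFINITION = {"HEF"}


def superatoms_under(node: str, monocyclic_only: bool = False) -> set:
    """Collect the Derwent superatoms contained in an STN generic node."""
    children = CHILDREN.get(node)
    if children is None:
        found = {node}  # the node is itself a superatom
    else:
        found = set()
        for child in children:
            found |= superatoms_under(child)
    return found - FUSED_BY_DEFINITION if monocyclic_only else found
```

With this table, superatoms_under("Cb") gives ARY and CYC, and superatoms_under("Hy", monocyclic_only=True) gives HEA and HET, which corresponds to a search for monocyclic heterocycles.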
From this scheme the correspondences between the different sets of nodes are evident. For example, a search for monocyclic heterocycles (Hy, attribute: mono) yields both superatoms HEA and HET plus the corresponding specific fragments. It is important to note that the unified hierarchy of nodes allows searching of both DWPIM and MARPAT with a single structure query based on the STN
generic query nodes (see Table 2). Using the Derwent superatoms allows a refinement of the query structure (see Table 1) exclusively for the Derwent databases DWPIM and DCR. Non-metal elements, except for halogens, are not represented by a common superatom. A search for a set of non-metal elements can only be done using a variable group which explicitly includes the requested elements, e.g. N, O, and S. The other superatoms in Table 1, e.g. ACY or DYE, do not integrate into this hierarchy and they can only be searched as isolated superatoms without any corresponding specific fragments or elements.

All searchable node attributes can be implemented easily and match existing STN node attributes. Correspondences between STN attributes and the corresponding Derwent attributes are listed in Table 3.

Table 3. Correspondence between STN attributes and corresponding Derwent attributes of superatoms.

STN Attribute             Interpretation             STN Generic Group   Superatom Attribute   Superatom
BRA or LIN                branched or linear         Ak                  BRA or STR            CHK, CHE, CHY
Less than 7 / 7 or more   low carbon / high carbon   Ak, Cy, Cb, Hy      HI, MID or LOW        CHK, CHE, CHY, ARY, CYC, HEA, HET, HEF
Exactly 1 / 2 or more     low / high heteroatom      Cy, Hy              n/a
MCY or PCY                monocyclic or polycyclic   Cy, Cb, Hy          MON or FU             ARY, CYC (a)
SAT or UNSAT              saturated or unsaturated   Ak, Cy, Cb, Hy      SAT or UNS            ARY, CYC, HEA, HET, HEF (b)

(a) The attribute is already included in the definition of HEA, HET, and HEF.
(b) The attribute is already included in the definition of CHK, CHE, and CHY.
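The attribute correspondences in Table 3 lend themselves to a lookup table. The sketch below covers the three rows with a one-to-one code mapping and assumes that the codes pair up in the order in which they are listed (e.g. LIN with STR, PCY with FU). The carbon and heteroatom count rows are left out because the thresholds for HI, MID, and LOW are not given. All names are hypothetical.

```python
CHAINS = {"CHK", "CHE", "CHY"}
CARBOCYCLES = {"ARY", "CYC"}
HETEROCYCLES = {"HEA", "HET", "HEF"}

# STN attribute -> (Derwent attribute, superatoms that carry the attribute,
#                   superatoms whose definition already implies it)
ATTRIBUTE_MAP = {
    "BRA": ("BRA", CHAINS, set()),
    "LIN": ("STR", CHAINS, set()),
    "MCY": ("MON", CARBOCYCLES, HETEROCYCLES),
    "PCY": ("FU", CARBOCYCLES, HETEROCYCLES),
    "SAT": ("SAT", CARBOCYCLES | HETEROCYCLES, CHAINS),
    "UNSAT": ("UNS", CARBOCYCLES | HETEROCYCLES, CHAINS),
}


def derwent_attribute(stn_attribute: str, superatom: str):
    """Translate an STN attribute for a given superatom.

    Returns the Derwent attribute code, or None when the property is
    already part of the superatom definition.
    """
    code, applies_to, implied_for = ATTRIBUTE_MAP[stn_attribute]
    if superatom in applies_to:
        return code
    if superatom in implied_for:
        return None
    raise ValueError(f"{stn_attribute} does not apply to {superatom}")
```

For instance, LIN on CHK translates to STR, whereas MCY on HEA returns None because HEA is monocyclic by definition.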
In many cases chemical structures can be represented in different ways based on different bonding conventions. In fact, Derwent uses indexing conventions for the representation of chemical structures which differ from the CAS conventions, including the system defaults. As a consequence it is sometimes necessary to understand exactly the commonalities and the differences between the two databases. The differences between Derwent and Chemical Abstracts Service are especially important for the normalization of bonds in rings, tautomerism in general, and keto-enol tautomerism. Tautomeric structures represented by the formal equilibrium M=Q-ZH ⇌ HM-Q=Z (where Q = C, N, S, P, Sb, As, Se, Te, Br, Cl or I, and M and Z can be any combination of trivalent N and/or bivalent O, S, Se or Te atoms) are normalized in the CAS Registry System. In DWPIM these tautomeric systems are handled slightly differently. For example, if M and Z are both nitrogen (e.g. amidines, N=C-NH) the respective bonds are normalized, while in cases where M and Z are not both nitrogen (e.g. amides, O=C-NH) the bonds are indexed as localized single and double bonds (whereby the double bond is placed
preferentially on the first atom in the sequence: O > S > Se > Te > N) in DWPIM. Since these indexing conventions have been described in the literature25,26 it is not necessary to elaborate on them in more detail here. It is sufficient to note that the indexing conventions for DWPIM have been implemented on STN without changes.

4.3. Application of the STN Match Level Concept to DWPIM

Searching in a Markush database requires a mechanism to control the level of searching for both specific and generic nodes. In STN the concept for handling the matching between the various levels of nodes and fragments in the hierarchy is based on match levels. In STN three match levels have been defined:
• ATOM: retrieves only specific nodes (default for ring nodes)
• CLASS: retrieves both specific and generic nodes (default for chain nodes)
• ANY: retrieves specific and generic nodes plus the R-node.
All match levels can be applied to any specific or generic node in a query. However, all atoms in a ring should have the same match level; exceptions are discussed in section 5. Based on the illustration in Figure 6 it is easy to understand how the match levels are applied on the hierarchy. A substructure search (SSS) for pyridine (violet box) with the various match levels will retrieve
• only pyridine derivatives (ATOM),
• pyridine derivatives and the HEA superatom (CLASS), or
• pyridine derivatives, the HEA superatom and the superatoms XX and UNK (ANY).
On the other hand, when the substructure search starts from the generic node Hy (blue circle) it will yield all specific heterocyclic compounds (ATOM), the result of ATOM plus the superatoms HEA + HET + HEF (CLASS), and the results of CLASS plus the superatoms XX and UNK (ANY). An example of the use of match levels is shown in Figure 7 with the application of defaults. The default match level for chain nodes is CLASS and the default for ring nodes is ATOM.
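The effect of the three match levels can be mimicked on a toy hierarchy. The following sketch reproduces the pyridine and Hy examples from the text; the fragment table is a small hypothetical sample, the superatom assignments follow the definitions in Table 1, and the function name matching_targets is an assumption rather than part of the STN search engine.

```python
# Hypothetical sample of specific fragments and the superatom covering each.
SUPERATOM_OF = {
    "pyridine": "HEA",
    "piperidine": "HET",
    "quinoline": "HEF",
    "benzene": "ARY",
}
# Superatoms contained in the STN generic nodes used in this example.
SUPERATOMS_UNDER = {
    "Hy": {"HEA", "HET", "HEF"},
    "Cb": {"ARY", "CYC"},
}
# On STN the R-node includes both XX and UNK.
R_NODE = {"XX", "UNK"}


def matching_targets(query: str, level: str) -> set:
    """Return the nodes in indexed structures that a query node can match."""
    if query in SUPERATOMS_UNDER:
        # Generic query node: all specific fragments below it, plus superatoms.
        generic = set(SUPERATOMS_UNDER[query])
        specific = {f for f, s in SUPERATOM_OF.items() if s in generic}
    else:
        # Specific query fragment: itself, plus the superatom that covers it.
        specific = {query}
        generic = {SUPERATOM_OF[query]}
    if level == "ATOM":
        return specific
    if level == "CLASS":
        return specific | generic
    if level == "ANY":
        return specific | generic | R_NODE
    raise ValueError(f"unknown match level: {level}")
```

Raising the match level therefore only ever widens the answer set: ATOM is a subset of CLASS, and CLASS is a subset of ANY.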
Figure 7. Example for the application of default match levels.
4.4. Example Markush Search

The Markush search engine incorporates the retrieval functionalities of STN with the data structure of DWPIM. It has been developed by InfoChem27 and is now in production in STN. In order to demonstrate the idea of Markush searching in the STN search environment we have chosen the query example from Figure 7. The query consists of a pyridine ring with three substituents in ortho and para position with respect to N and the default match levels are applied for the search. A generic query searches against all possible candidate structures (specific and generic) without prior enumeration. In general, the search finds all specific (enumerated) structures for all combinations of G-groups. In this case we have the following matches:
• Pyridine matches with pyridine (match level ATOM)
• Cl matches with Cl or X (match level CLASS)
• C matches with CH3 or any carbon chain (match level CLASS)
• ARY with attribute mono matches with benzene (match level ATOM).
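The match levels listed above follow from the defaults: ring nodes are searched at ATOM and chain nodes at CLASS unless the user overrides them. A minimal sketch of that rule, with a hypothetical QueryNode type and a strongly simplified version of the Figure 7 query (one entry per query part instead of one per atom), could look as follows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryNode:
    """Hypothetical query node: symbol, ring membership, optional match level."""
    symbol: str
    in_ring: bool
    match_level: Optional[str] = None  # None means "use the default"


def apply_default_match_levels(nodes):
    """Fill in ATOM for ring nodes and CLASS for chain nodes where unset."""
    for node in nodes:
        if node.match_level is None:
            node.match_level = "ATOM" if node.in_ring else "CLASS"
    return nodes


def ring_is_consistent(ring_nodes) -> bool:
    """All atoms of one ring should carry the same match level."""
    return len({node.match_level for node in ring_nodes}) <= 1


# Simplified query: pyridine ring, Cl and C as chain nodes, ARY as ring node.
query = apply_default_match_levels([
    QueryNode("pyridine", in_ring=True),
    QueryNode("Cl", in_ring=False),
    QueryNode("C", in_ring=False),
    QueryNode("ARY", in_ring=True),
])
```

An explicitly set match level is left untouched, so a user could, for example, raise the pyridine ring to CLASS in order to also retrieve structures indexed with the HEA superatom.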
Figure 8 shows an example of a hit structure (1260-89301) where all pieces of the query have matched against specific fragments or nodes (highlighted in red). Since the query did not contain any restrictions with respect to further substitution, the hit structure contains additional substituents such as the bicyclic ring system. In addition, there are several G-groups (G3, G4, etc.) which are further defined outside of the hit structure in a different section of the display.
Figure 8. Example of a Markush answer (1260-89301) for the query in the previous figure.

It should be noted that Markush answers do not necessarily result from a match with the indexed core structure. Instead, a match may be spread over several G-groups, and in order to find a valid hit it is necessary to assemble the core structure with all G-groups into a single Markush connection table.
The current search technology is based on a classical structure search which has been extended to manage generic nodes in queries and target structures. The Markush search engine is able to handle sophisticated Markush searches, supporting all generic nodes together with the STN match level concept as described above. Most queries are completed in less than 5 minutes, but complex queries may need longer, and depending on the query the system may not be able to complete the search within the time limit. Any potential hit structures not fully resolved are returned to the user, labelled as ‘iteration incomplete’, so that they can be checked manually.
Currently, a new core technology is in development which will improve the search performance, in particular for very generic query descriptions. It will also allow a deeper analysis of Markush structures, e.g. an overlap and gap analysis between two structures. It is based on a maximum common substructure (MCS) search algorithm which has been developed and successfully applied to reaction center recognition and reaction classification.28
4.5. Comparison between Match Levels (STN) and Translate Options (Questel)
Questel has implemented the Derwent Markush database on the basis of the Derwent superatom concept.5,26 The database available on Questel is called MMS (Merged Markush Service) and contains both Markush and specific structures. Specific atoms, shortcuts, and all superatoms (as listed in Table 1) are indexed and can be searched using the software Markush DARC. In addition, a wildcard can be used to search all non-hydrogen nodes and fragments. The level of searching is controlled by a concept called translation, and four translate options are available on Questel:
• EQ: no translation, retrieves equal real atoms or superatoms
• NT: narrow translation, retrieves all superatoms and/or real atoms more specific than the original plus the node itself (the result of EQ)
• BT: broad translation, retrieves all superatoms and/or real atoms more generic than the original plus the node itself (the result of EQ)
• ANY: any translation is equal to NT + BT.
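The four translate options can be read as set operations over a node hierarchy. The following Python sketch shows this reading; it is not Questel's implementation. The hierarchy (XX above ARY and HEA, with benzene and pyridine as real atoms or rings) is a hypothetical toy, and the function names `broader`, `narrower` and `translate` are invented for illustration.

```python
# Hypothetical toy hierarchy (child -> parent), for illustration only.
PARENT = {
    "ARY": "XX", "HEA": "XX",
    "benzene": "ARY", "pyridine": "HEA",
}


def broader(node):
    """All nodes more generic than `node`."""
    out = set()
    while node in PARENT:
        node = PARENT[node]
        out.add(node)
    return out


def narrower(node):
    """All nodes more specific than `node`."""
    return {n for n in PARENT if node in broader(n)}


def translate(node, option):
    """Nodes retrieved for `node` under a translate option."""
    if option == "EQ":
        return {node}
    if option == "NT":
        return {node} | narrower(node)
    if option == "BT":
        return {node} | broader(node)
    if option == "ANY":
        # ANY is defined as NT + BT.
        return translate(node, "NT") | translate(node, "BT")
    raise ValueError(f"unknown translate option: {option}")


print(sorted(translate("ARY", "NT")))  # ['ARY', 'benzene']
```

For a node at the bottom of the hierarchy, NT adds nothing to EQ and ANY adds nothing to BT, which is one way to see why the options behave differently for specific and generic nodes.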
The two concepts for search control from STN (match level) and Questel (translate option) show some differences. As discussed in section 4.3, the application of match levels is independent of the node type. According to Figure 7, ATOM always retrieves specific nodes, CLASS retrieves specific plus the corresponding generic nodes, and ANY retrieves anything within the hierarchy of nodes. This is different for translate options since they operate similarly to relationships in a thesaurus, e.g. BT (broad translation) is equivalent to broader terms in a thesaurus file. As a consequence, translate options behave differently for specific and generic nodes. Table 4 shows a comparison between match levels and translate options. It can be seen that translate options are sometimes redundant and do not provide all the options required by users.
Table 4. Comparison of match levels and translate options.

Match Level    Translate Option            Translate Option
               (specific query nodes)      (generic query nodes)
ATOM           EQ                          none
CLASS          none                        NT
ANY            BT, ANY                     ANY

Translate Option    Match Level                 Match Level
                    (specific query nodes)      (generic query nodes)
EQ                  ATOM                        none
NT                  ATOM                        CLASS
BT                  ANY                         none
ANY                 ANY                         ANY
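Table 4 can also be written as two lookup tables, which makes it easy to check where translate options are redundant and where a match level has no counterpart. The following Python sketch only encodes the table entries; the names `MATCH_TO_TRANSLATE`, `TRANSLATE_TO_MATCH`, `redundant_options` and `uncovered_levels` are hypothetical helpers, not part of STN or Questel.

```python
from collections import defaultdict

# Table 4, first part: match level -> equivalent translate options.
# An empty list stands for "none".
MATCH_TO_TRANSLATE = {
    ("ATOM", "specific"): ["EQ"],
    ("ATOM", "generic"): [],
    ("CLASS", "specific"): [],
    ("CLASS", "generic"): ["NT"],
    ("ANY", "specific"): ["BT", "ANY"],
    ("ANY", "generic"): ["ANY"],
}

# Table 4, second part: translate option -> equivalent match level.
# None stands for "none".
TRANSLATE_TO_MATCH = {
    ("EQ", "specific"): "ATOM", ("EQ", "generic"): None,
    ("NT", "specific"): "ATOM", ("NT", "generic"): "CLASS",
    ("BT", "specific"): "ANY", ("BT", "generic"): None,
    ("ANY", "specific"): "ANY", ("ANY", "generic"): "ANY",
}


def redundant_options(node_type):
    """Translate options that map to the same match level for a node type."""
    groups = defaultdict(list)
    for (option, ntype), level in TRANSLATE_TO_MATCH.items():
        if ntype == node_type and level is not None:
            groups[level].append(option)
    return {level: opts for level, opts in groups.items() if len(opts) > 1}


def uncovered_levels(node_type):
    """Match levels without an equivalent translate option for a node type."""
    return [level for (level, ntype), opts in MATCH_TO_TRANSLATE.items()
            if ntype == node_type and not opts]


print(redundant_options("specific"))  # {'ATOM': ['EQ', 'NT'], 'ANY': ['BT', 'ANY']}
print(uncovered_levels("specific"))   # ['CLASS']
print(uncovered_levels("generic"))    # ['ATOM']
```

Read this way, the table shows that for specific query nodes the four translate options collapse onto two match levels, while CLASS on specific nodes and ATOM on generic nodes have no translate equivalent.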
In general, the comparison shows:
• With match levels it is easy to increase answer sets stepwise from ATOM to CLASS and to ANY. This offers a simple way to exclude or include the generic node R (XX and UNK). Such a stepwise refinement of the structure search is not possible with translate options.
• Match levels work consistently with specific and generic nodes, while translate options work differently on the two types of nodes, depending on the relative position in the hierarchy.
• Match levels do not have the possibility to restrict a search to obtain only superatoms (no EQ option).
• With translate options it is not possible to restrict the search to specific nodes or groups when searching for a superatom. For example, with match levels it is possible to use a generic node like HEA and retrieve all specific heteroaryls (match level ATOM). This is not possible with translate options, since there is no equivalent to match level ATOM.
5. Discussion of Some Topological Issues
Markush expressions in patents are not standardized. The language is open to new expressions and very generic terms, which poses major problems for Markush indexers, search systems, and searchers. Hence, a closed set of generic Markush nodes (see the cyclic, acyclic and element nodes in Table 1) is not sufficient to describe all possible Markush structures. Derwent has chosen to add some miscellaneous superatoms which describe a feature rather than a structural component (see Table 1). In addition, there is a super node (XX) which comprises all unspecific and unknown chemical structures. With the superatom XX it is possible to include vague and unknown parts of Markush structures, both for chains and for ring systems. This allows the description of two components (A and B) which are linked together to form a Markush structure (A – XX – B) without any knowledge about the connecting structural unit. It can be envisioned that this concept could also be used for the retrieval of Markush structures. Consider the following request: the components A and B are known and they can be linked by anything (L: Linker), and anything could be
something specific (L has match level ATOM), including generic nodes (L has match level CLASS) or really anything (L has match level ANY).
In the previous graphical description of ring systems we have assumed that rings can be described either as specific (e.g. pyridine) or as generic (e.g. HEA). If we extend this concept to allow generic nodes to become part of a specific ring system, we can also describe very generic hybrid rings. Derwent has decided to allow the superatom XX to become a valid node in a specific ring system; other generic nodes are not allowed. The superatom XX can close or add rings which are only partly described in the patent. Some examples used in the Derwent description of partly known Markush structures are shown in Figure 9. Examples I and II describe ring closures with carbon and nitrogen atoms, i.e. “C (N) forming a ring”. In example III an aromatic ring is created from the unit “C – N – C” together with the superatom XX.
Figure 9. Examples for the description of partly known ring systems using an XX superatom.
A search for these ring systems may be performed as follows. To search for generalized aziridines (example II in Figure 9), N must be assigned as “ring”, the match level is set to ANY, and two free sites are allowed. However, this may retrieve other ring systems which were not requested. To overcome this obstacle, two different match levels could be allowed within one ring system. In our example, the query structure would be aziridine where nitrogen has the match level ATOM and the two carbons have the match level ANY.
6. Outlook: Towards a Common Markush Language
Our new Markush concept is based on the integration of the existing commercial systems from Chemical Abstracts Service and Thomson Reuters (Derwent), and to the best of our knowledge there exists no other approach which unites the two Markush languages. As a consequence, it would be possible to use this concept as the starting point for a common generalized description of Markush structures. Mike Lynch and his research group had already developed such a common Markush language called GENSAL.2-4 Based on this concept, CAS6,7 and Derwent5,6 developed their own Markush languages. In this paper we have shown that a unified solution is possible and that it results in a general Markush scheme. This has the advantage that it is based on practical solutions backed by long-term experience in Markush indexing at Chemical Abstracts Service and Thomson Reuters (Derwent). In this concept the generic nodes are grouped in a unique hierarchy into which all chemical elements and fragments can be integrated. Vague generic expressions such as dyes or polymer groups can be handled separately from this hierarchy using special generic nodes. The latter set could be extended if necessary. Some flexibility with respect to the bonding conventions is also possible. Cosgrove et
al.21 have started to work on a Markush XML, and it is possible to extend this approach with our generic Markush topology, i.e. to include the integrated hierarchy of generic nodes (section 4.2) and the control options (section 4.3) as required for Markush retrieval.
There are many applications for a standardized Markush language. One can imagine that large chemical and pharmaceutical companies or the major patent offices would like to handle Markush structures in a highly structured and searchable way, especially for managing their own portfolios. With a common set of rules it would also be possible to exchange Markush structures easily. Further research in this area should focus on an extension of the existing Markush language to build a common language including the unified hierarchy in order to support the Markush databases of the commercial providers.
7. References
1. Markush, E. A. U.S. Patent 1 506 316, 1924.
2. Barnard, J. M.; Lynch, M. F.; Welford, S. M. Computer Storage and Retrieval of Generic Chemical Structures in Patents. 2. GENSAL, a Formal Language for the Description of Generic Chemical Structures. J. Chem. Inf. Comput. Sci. 1981, 21, 151–161.
3. Lynch, M. F.; Holliday, J. D. The Sheffield Generic Structures Project: A Retrospective Review. J. Chem. Inf. Comput. Sci. 1996, 36, 930–936.
4. Downs, G. M.; Barnard, J. M. Chemical Patents and Structural Information – the Sheffield Research in Context. J. Doc. 1998, 54, 106–120.
5. Dubois, J. E.; Panaye, A.; Attias, R. DARC System: Notions of Defined and Generic Substructures. Filiation and Coding of FREL Substructure (SS) Classes. J. Chem. Inf. Comput. Sci. 1987, 27, 74–82.
6. Shenton, K. E.; Norton, P.; Ferns, E. A. Generic Searching of Patent Information. In Chemical Structures: The International Language of Chemistry. Ed. Warr, W. A. Springer-Verlag: Berlin; 1988, pp 169–178.
7. Fisanick, W. The CAS generic chemical (Markush) structure storage and retrieval capability. 1. Basic concepts. J. Chem. Inf. Comput. Sci. 1990, 30, 145–154.
8. Ebe, T.; Sanderson, K. A.; Wilson, P. S. The Chemical Abstracts Service generic chemical (Markush) structure storage and retrieval capability. J. Chem. Inf. Comput. Sci. 1991, 31, 31–36.
9. Downs, G. M.; Barnard, J. M. Chemical Patent Information Systems. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2011, 1, 727–741.
10. Chemical Abstracts Service. http://www.cas.org/ (accessed April 7, 2016).
11. Thomson Reuters. http://ipscience.thomsonreuters.com/product/derwent/ (accessed April 7, 2016).
12. Simmons, E. S. The Grammar of Markush Structure Searching: Vocabulary vs Syntax. J. Chem. Inf. Comput. Sci. 1991, 31, 45–53.
13. Simmons, E. S. Markush Structure Searching Over the Years. World Pat. Inf. 2003, 25, 195–202.
14. Schmuff, N. R. A Comparison of the MARPAT and Markush DARC Software. J. Chem. Inf. Comput. Sci. 1991, 31, 53–59.
15. Barnard, J. M. A Comparison of Different Approaches to Markush Structure Handling. J. Chem. Inf. Comput. Sci. 1991, 31, 64–68.
16. Berks, A. H.; Barnard, J. M.; O’Hara, M. P. Markush structure searching in patents. In Encyclopedia of Computational Chemistry: Volume 3. Ed. Schleyer, P. v. R. John Wiley & Sons: Chichester; 1998, pp 1552–1559.
17. Berks, A. H. Current State of the Art of Markush Topological Search Systems. World Pat. Inf. 2001, 23, 5–13.
18. Geyer, P. Markush Structure Searching by Information Professionals in the Industry – Our Views and Expectations. World Pat. Inf. 2013, 35, 178–182.
19. Barnard, J. M.; Wright, P. M. Towards in-house searching of Markush structures from patents. World Pat. Inf. 2009, 31, 97–103.
20. ChemAxon. http://www.chemaxon.com/ (accessed April 7, 2016).
21. Cosgrove, D. A.; Green, K. M.; Leach, A. G.; Poirrette, A.; Winter, J. A System for Encoding and Searching Markush Structures. J. Chem. Inf. Model. 2012, 52, 1936–1947.
22. Sibley, J. F. Too Broad Generic Disclosures: a Problem for All. J. Chem. Inf. Comput. Sci. 1991, 31, 5–9.
23. Warr, W. A. Representation of Chemical Structures. Wiley Interdiscip. Rev. Comput. Mol. Sci. 2011, 1, 557–579.
24. Dethlefsen, W.; Lynch, M. F.; Gillet, V. J.; Downs, G. M.; Holliday, J. D.; Barnard, J. M. Computer Storage and Retrieval of Generic Chemical Structures in Patents. 11. Theoretical Aspects of the Use of Structure Languages in a Retrieval System. J. Chem. Inf. Comput. Sci. 1991, 31, 233–253.
25. Mockus, J.; Stobaugh, R. E. The Chemical Abstracts Service Chemical Registry System. VII. Tautomerism and Alternating Bonds. J. Chem. Inf. Comput. Sci. 1980, 20, 18–22.
26. Derwent World Patents Index – Markush DARC User Manual. Thomson Scientific, ISBN 978 1 905935 14 7 (Edition 1, 1993). http://ipscience.thomsonreuters.com/m/pdfs/mgr/Markush_Darc_User_Manual.pdf/ (accessed April 7, 2016).
27. InfoChem. http://www.infochem.de/ (accessed April 7, 2016).
28. Kraut, H.; Eiblmaier, J.; Grethe, G.; Löw, P.; Matuszczyk, H.; Saller, H. Algorithm for Reaction Classification. J. Chem. Inf. Model. 2013, 53, 2884–2895.