Wiswesser line notation: an introduction

John J. Vollmer
Department of Physical Science, Mills College, Oakland, CA 94613

One of the most important skills of a chemist is the ability to locate specific information in the scientific literature. The problems associated with an efficient search have increased tremendously in the last two decades, because the volume of the literature and the number of known chemicals have grown enormously. The use of computers for information storage and retrieval has become increasingly important and has required mastery of new search techniques.

The search for a specific chemical in Chemical Abstracts is normally based on its name, chemical formula, or specific registry number. The use of computer systems in such searches has been reviewed recently (1). Another approach utilizes Current Abstracts of Chemistry and Index Chemicus (CAC & IC) and involves a different indexing system, the Wiswesser Line Notation (WLN). This notation provides a very simple yet definitive description of the structure of a particular compound. It is very readily processed by computer, and it is not difficult to write the structure corresponding to a given notation. The notations are also comparatively short and are easily filed and searched, manually or by computer. This system allows rapid identification of related compounds and makes substructure searches quite simple.

An example may illustrate the simplicity of this approach. In WLN, aspirin is represented by QVR BOV1; derivatives will have the same notation with additional symbols. Related compounds will have part of the notation in common; for example, benzoic acids are QVR, and those with an ester group in the ortho position will contain the fragment QVR BOV. Clearly, such related compounds could be located readily by looking up WLN symbols, which are arranged alphabetically for each of the fragments.

The basis of representing a molecule in WLN is a procedure to divide the structure of the molecule into fragments, to represent these by specific letters or numbers, and to list these in a particular order which clearly and unambiguously corresponds to the bonding of the atoms.
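The alphabetical lookup of a shared fragment can be mimicked in a few lines of Python. This is only a sketch: `INDEX` is a small hypothetical list built from notations that appear in this article, and `find_by_fragment` is an illustrative helper name, not part of any published WLN software.

```python
# Hypothetical mini-index of WLN notations taken from examples in this article.
INDEX = ["QVR BOV1", "QVR", "ZR BVQ", "Q1VG", "1V1", "ROR"]


def find_by_fragment(index, fragment):
    """Return the notations that begin with the given fragment, in alphabetical order."""
    return sorted(n for n in index if n.startswith(fragment))


# Benzoic acids share the leading fragment QVR.
print(find_by_fragment(INDEX, "QVR"))      # ['QVR', 'QVR BOV1']
# Those with an ester group in the ortho position share QVR BOV.
print(find_by_fragment(INDEX, "QVR BOV"))  # ['QVR BOV1']
```

A prefix match is the simplest possible case; a real substructure search would also have to find fragments that occur in the middle of a notation.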

The usefulness of WLN goes far beyond retrieval. Many researchers have found this approach to be an efficient registration system; each notation represents the concise structure of a chemical in an average of only twenty characters. These notations are filed easily and can be stored and retrieved efficiently. They are also well suited for the search of specific molecular fragments, by computer or by manual scanning. Such searches are especially valuable for correlating the structures and properties of different compounds.

In order to use this new indexing tool, the researcher must be able to translate a structure into WLN and to read the notation system. Fortunately, WLN is not difficult to learn, and after a few hours of study most chemists will be proficient enough to attempt a substructure search. This article will present a brief introduction to this important notation system.

Basic Notation

The Wiswesser Line Notation is a system for representing structural formulas by a short combination of numerals, capital letters, and punctuation marks. The system uses special symbols to denote common structural fragments. These symbols are formed into a specific code by writing them in the same order in which the fragments are connected in the structural formula. The resulting notations are unique and unambiguous. Only one notation is possible for a given structure and only one structure for each notation.

1V1   (H3C-: 1 and -CO-: V)
4H    (CH3CH2CH2CH2-: 4 and -H: H)
ROR   (phenyl: R and -O-: O)
3M3   (CH3CH2CH2-: 3 and -NH-: M)

In each case, particular molecular fragments are represented by a specific letter or number. Many elements can be designated by a single letter, as in the case of hydrogen (H).

Journal of Chemical Education

Carbon chains which are saturated and unbranched are represented by the number of carbons they contain. The simple benzene ring, unfused, is very common in organic compounds, and so, for simplicity, it is assigned its own symbol, the letter R. Likewise, the carbonyl group (-CO-) is very common and is designated by the letter V. The -NH- group and the oxygen in an ether are represented by the letters M and O.

If the molecule is not symmetrical, encoding requires a special rule because in this case two different notations are possible, depending on which end fragment is selected as starting point. For example, the hydroxyl group (-OH) is represented by the letter Q, and therefore ethanol could be represented as 2Q or Q2 and benzoic acid as RVQ or QVR. The relevant rule specifies that the starting symbol shall be the one which has the latest position in the following series:

1, 2, 3, ... (numbers), followed by A, B, C, ... Z (letters)

This alphanumeric list is easy to remember because it follows such a regular pattern. Only the symbols for the benzene ring (R) and hydrogen (H) are treated in a special way, and not as part of the letter section of the list. The R is assigned the earliest position, and the H is cited directly after the symbol of the fragment to which it is attached. Taking this ordering into account, ethanol must be represented as Q2, since letters are later than numbers in the list. Likewise, benzoic acid becomes QVR, because the Q occurs later than the letter R. Note that the relative order of the symbols cannot be changed; VQR represents a different structure because the molecular fragments no longer occur in the same order. The following examples will illustrate basic ordering further:

[Figure: structures illustrating basic ordering]
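The starting-point rule for an unbranched chain can be sketched in Python. The function names `rank` and `orient` are illustrative. The sketch assumes that multi-digit numerals are compared by value and that only the two end symbols need to be compared; neither point is spelled out in the text above.

```python
def rank(symbol):
    """Position of a symbol in the alphanumeric list (larger means later)."""
    if symbol == "R":       # benzene ring: assigned the earliest position
        return (0, 0)
    if symbol.isdigit():    # numerals come before letters
        return (1, int(symbol))
    return (2, symbol)      # letters in alphabetical order


def orient(symbols):
    """Write an unbranched chain of symbols starting from the later-ranked end."""
    symbols = list(symbols)
    if symbols[0] == "H":
        # H is cited after the symbol of the fragment it is attached to.
        symbols.reverse()
    elif symbols[-1] != "H" and rank(symbols[-1]) > rank(symbols[0]):
        symbols.reverse()
    return "".join(symbols)


print(orient(["2", "Q"]))       # Q2   (ethanol)
print(orient(["R", "V", "Q"]))  # QVR  (benzoic acid)
print(orient(["1", "V", "1"]))  # 1V1  (symmetrical, unchanged)
```

Only the direction of reading changes; the relative order of neighbouring symbols is preserved, which is why VQR can never result for benzoic acid.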

Representation of Molecular Fragments

The representation of different molecular fragments will be discussed next, one element at a time, with specific examples.

1) Carbon. The carbon skeleton is divided into parts; the saturated and unbranched segments are represented by the number of carbons they contain.

[Figure: three example structures, with the notations QV7, 4U2, and 4UU3]

As shown in the last two examples, these carbon segments may contain, at the end, a carbon which is part of a carbon-carbon multiple bond. This will terminate the carbon segment being numbered, and the unsaturation will be labelled separately. The letter U signals the presence of a double bond and the letters UU are indicative of a triple bond. The notations for the preceding three examples then become QV7, 4U2, and 4UU3: extremely brief and absolutely precise.

In the preceding examples, each carbon which is accounted for is bonded to one or two atoms other than hydrogen. If a carbon is bonded to three atoms other than hydrogen, it will be represented by the letter Y. Carbonyl carbons are not included in this designation; carbonyl groups have a separate symbol, the letter V. Carbon atoms which are bonded to four atoms other than hydrogen are indicated by the letter X.

[Figure: example structures containing X carbons]

The symbol C is not very common; it represents a carbon atom which is double-bonded to two other atoms or which is triple-bonded to an atom other than carbon. The only common examples of compounds containing such carbon atoms are the allenes and the nitriles.

2) Oxygen (O, Q, V, or W). When oxygen is single-bonded to two atoms other than hydrogen, it is listed as the letter O. The hydroxyl group is represented by the letter Q. The carbonyl group (-CO-) is very common and has its own designation, the letter V. When two oxygen atoms are bonded only to an atom other than carbon, they are collectively represented by the letter W. The main examples of this fragment are the -NO2 and -SO2- groups.

3) Nitrogen (Z, M, K, or N). The letter Z identifies the -NH2 group and the letter M the -NH- or =NH group. A nitrogen atom with three bonds and none to hydrogen is represented by the letter N. If nitrogen is bonded to four other atoms, the letter K is used.

[Figure: example structures labelled with the nitrogen symbols Z, M, N, and K and the carbon symbol C]

4) Halogens (F, I, G, or E). Whenever possible, elements are represented by the first letter of their international atomic symbols; therefore, the letters F and I are used for fluorine and iodine. For bromine and chlorine this is not possible, since these letters (B and C) are already in use for boron and special carbon atoms. Two letters could be used to represent each, but both elements are very common and this solution would prove cumbersome. Therefore, the letters G and E are used to indicate chlorine and bromine.

Volume 60 Number 3

March 1983


5) Sulfur (S). The sulfur atom is represented by its atomic symbol, the letter S.

6) Hydrogen (H). The presence of hydrogen is rarely cited separately. Hydrogen is normally included in other groups, such as -OH, -NH2, -NH-, or alkyl; it is simply included in the symbol's definition. When hydrogen needs to be cited, the letter H is used, and it is written directly after the symbol of the group to which it is attached. For unbranched alkanes the hydrogen needs to be indicated after the numeral representing the chain length.
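The element-by-element rules above amount to a small set of lookups, sketched here in Python. The names `HALOGENS`, `carbon_symbol`, and `nitrogen_symbol` are illustrative only, and the functions cover just the cases stated in the text (carbonyl carbons and the special symbol C are left out).

```python
# Halogen symbols: first letters where free, G and E for chlorine and bromine.
HALOGENS = {"F": "F", "Cl": "G", "Br": "E", "I": "I"}


def carbon_symbol(non_h_neighbors):
    """Symbol for a saturated carbon, given its number of non-hydrogen neighbours.

    Returns None when the carbon is simply counted in a chain numeral.
    """
    if non_h_neighbors == 3:
        return "Y"
    if non_h_neighbors == 4:
        return "X"
    return None  # one or two neighbours: part of a numbered chain segment


def nitrogen_symbol(hydrogens, other_atoms):
    """Symbol for a nitrogen atom from its hydrogen count and bonded atoms."""
    if other_atoms == 4:
        return "K"
    if hydrogens == 2:
        return "Z"
    if hydrogens == 1:
        return "M"
    if hydrogens == 0:
        return "N"
    raise ValueError("case not covered by this sketch")


print(HALOGENS["Br"], carbon_symbol(4), nitrogen_symbol(1, 2))  # E X M
```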

Functional Groups

Another way to look at the WLN symbols representing various fragments of an organic molecule is to view them in terms of the common functional groups. This comparison is summarized in the table below. It should be noted that in the context of a particular structure, multiple letter symbols may be reversed. For example, an ester group may appear as VO or OV in different molecules, depending on the relative priorities of the other fragments.

Symbols for Common Functional Groups

Family            Functional Group      WLN Symbol
Alkane            R-H                   #H (# = chain length)
Alkene            C=C                   U
Alkyne            C≡C                   UU
Aromatic          benzene ring          R
Alkyl halide      -X                    F, G, E, or I
Alcohol           -OH                   Q
Ether             -O-                   O
Ketone            -CO-                  V
Aldehyde          -CO-H                 VH
Carboxylic acid   -CO-OH                VQ
Ester             -CO-O-                VO
Amide             -CO-N-                ZV, VM, or VN
Acid halide       -CO-X                 VF, VG, VE, or VI
Anhydride         -CO-O-CO-             VOV
Amine             -NH2, -NH-, -N-       Z, M, or N
Nitro compound    -NO2                  NW
Thiol             -SH                   SH
Nitrile           -C≡N                  CN (a)

(a) In this case the unsaturation is not expressed; it is simply implied.

Encoding Procedure

The easiest way to encode a particular chemical is to follow a stepwise procedure. The first step should always be to write out the structure. Next, it should be rewritten using WLN symbols for each fragment, keeping each in the same position. The resulting structure is called a graphic formula. The proper starting point for the notation should be selected next; this is based on the relative position of the symbols in the alphanumeric list shown above. In the first example below, there are two possible starting points (Q and G), but Q is ranked later in the list and so it becomes the first symbol in the notation. Next, a path through the other symbols must be chosen. For straight-chain compounds the proper path begins at the starting point and continues through the graphic formula to the other end. In the example on hand, the path is shown by the arrow in the third representation. The symbols are then listed in the order in which they appear along the chosen path (Q1VG), and this is the correct WLN notation for the example.

[Figure: structural formula → graphic formula → graphic formula with path → WLN notation, Q1VG]

The same procedure is followed in the next example, but branching complicates the selection of the path. The proper path will be the longest continuous path in the graphic formula. (In complex molecules, the path chosen in the graphic formula must include the largest possible number of branch symbols.) The starting point chosen is Q because, of the four possibilities (Q, E, 1, 2), it ranks last in the alphanumeric list. The chosen path (QX2E) is shown by the arrow in the third representation.

[Figure: Br-CH2CH2-C(CH3)(OH)-CH2CH3 → graphic formula → graphic formula with path → WLN notation, QX2&1&2E]

The symbols for the additional substituents (2 and 1) are inserted after the symbol of the group to which they are attached. The order of listing is again determined by the alphanumeric list; therefore, 2 will precede 1. However, in this case a separation of symbols is needed and the ampersand (&) is used for this purpose. In general, this symbol is used to separate symbols of substituents attached to the same atom, if clarification is needed. It is used to separate alkyl substituents and other substituents that are not clearly terminal, i.e., groups other than those such as the halogens, amino, or hydroxyl. The basic procedure for encoding is illustrated further in the following examples:

[Figure: CH3-CH(OH)-CH(CH3)-CH3 → QY1&Y1&1; second encoding example, a structure containing Br, F, and OH]
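Once the path has been chosen, writing the notation is mechanical, as the following Python sketch suggests. The helper names `rank` and `encode` and the `TERMINAL` set are assumptions made for illustration; the sketch takes the path and the substituents as given and does not select the starting point or the longest path itself.

```python
# Assumed set of clearly terminal symbols (hydroxyl, amino, halogens);
# substituents ending in one of these are written without an ampersand.
TERMINAL = set("QZEFGI")


def rank(symbol):
    """Position in the alphanumeric list: R earliest, then numerals, then letters."""
    if symbol == "R":
        return (0, 0)
    if symbol.isdigit():
        return (1, int(symbol))
    return (2, symbol)


def encode(path):
    """Join (symbol, substituents) pairs along the chosen path into a notation."""
    parts = []
    for symbol, substituents in path:
        parts.append(symbol)
        # Later-ranked substituents are cited first, so 2 precedes 1.
        for sub in sorted(substituents, key=rank, reverse=True):
            parts.append(sub)
            if sub[-1] not in TERMINAL:
                parts.append("&")  # closes a substituent that is not clearly terminal
    return "".join(parts)


# Path Q-X-2-E with ethyl (2) and methyl (1) on the X carbon.
print(encode([("Q", []), ("X", ["2", "1"]), ("2", []), ("E", [])]))  # QX2&1&2E
# Path Q-Y-Y-1 with one methyl group on each Y carbon.
print(encode([("Q", []), ("Y", ["1"]), ("Y", ["1"]), ("1", [])]))    # QY1&Y1&1
```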

Decoding Procedure

Decoding WLN "names" is comparatively simpler than encoding and, with some practice, much WLN notation can be read directly. The basic procedure is simply the reverse of that described for encoding. The symbols should be rewritten, showing the path and possible branching. Then each term should be translated into molecular fragments, and these are then interconnected. The process is shown for two examples below:

[Figure: two decoding examples]
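The first decoding step, splitting a notation into its terms and translating each one, can be sketched in Python. The table `FRAGMENTS` lists only symbols introduced in this article, and the names `tokenize` and `decode` are illustrative. The sketch handles acyclic notations only and does not reconnect the fragments.

```python
import re

# Symbols introduced in this article and the fragments they stand for.
FRAGMENTS = {
    "Q": "-OH", "O": "-O-", "V": "-CO-", "W": "two oxygens (as in -NO2)",
    "Z": "-NH2", "M": "-NH-", "N": "nitrogen, no hydrogen", "K": "nitrogen, four bonds",
    "F": "-F", "G": "-Cl", "E": "-Br", "I": "-I", "S": "sulfur", "H": "-H",
    "R": "benzene ring", "Y": "carbon with three branches",
    "X": "carbon with four branches", "U": "double bond", "UU": "triple bond",
}


def tokenize(notation):
    """Split a notation into numerals, UU, single letters, and ampersands."""
    return re.findall(r"\d+|UU|[A-Z]|&", notation)


def decode(notation):
    """Translate each term of an acyclic notation into a fragment description."""
    described = []
    for token in tokenize(notation):
        if token.isdigit():
            described.append(f"{token}-carbon chain")
        elif token == "&":
            described.append("end of substituent")
        else:
            described.append(FRAGMENTS.get(token, "unknown symbol " + token))
    return described


print(decode("Q1VG"))  # ['-OH', '1-carbon chain', '-CO-', '-Cl']
```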

Additional Notation

Substituted benzenes. If benzene is substituted in more than one position, the relative orientations of the substituents must be indicated. This is accomplished by citing the substituent's "locant," a letter representing ring position. The first substituent has locant "a" and the second is selected according to its position on the benzene ring.

For disubstituted benzenes the locants will be "b" for ortho-, "c" for meta-, and "d" for para-substitution. Small letters are used for locants in the graphic formula to avoid confusion with substituents. In the final notation, the locant is given as the corresponding capital letter, but it must be preceded by a blank space to distinguish it from the symbols used for various substituents.

Examples: QR C1, ZR BVQ, WNR DQ CE

Monocyclic rings. Simple carbon rings are differentiated within the notation by special symbols; L indicates the start of the ring and J its end. These symbols are used here as brackets, within which the ring size and the degree of unsaturation will be indicated. The opening symbol, L, is followed by the number of carbons in the ring. For saturated rings, the symbol T precedes the closing symbol J. Absence of the letter T indicates full unsaturation of the ring system. Substitution is indicated in a manner similar to that used for benzene by citing locants and implying locant "a" for the first substituent in the ring. The substituents on the ring are listed after the ring system, i.e., after the closing symbol J, in the alphabetic order of their locants.

Examples: L3TJ, L4TJ, L6TJ, L4TJ AVH BQ, L6TJ A2 A2, L5VJ, L6VTJ BVQ BG, L6V DVJ

As seen in the last three examples, a carbonyl group is considered part of the ring, and its symbol is included within the ring notation, i.e., between the letters L and J. Partial unsaturation is treated in a similar manner. If the ring is completely unsaturated, except for one carbon, that carbon's locant is listed with the symbol H, indicating that it has an extra hydrogen as a substituent. The absence of the T before the closing letter J indicates that the ring is fully unsaturated otherwise. If an unsaturated ring contains more than one such carbon, the ring is considered saturated, and the unsaturation is shown in the normal manner, indicating its location with locants.

Examples: L5 AHJ, L6U CUTJ, L6U DUTJ CG CG

Heterocyclic ring systems are treated in a similar manner, but symbol T instead of L is used to start the bracketing notation. In such a ring system the hetero atom is in the "a" position and is identified in the normal manner.

Examples: T5SJ, T6NJ BZ, T6O BUTJ BG

Bicyclic fused rings. The basis for encoding fused rings is the same as that for monocyclic rings. After the opening bracket (L or T) two numbers indicate the numbers of atoms in the two rings. The assignment of locants starts at a ring junction, proceeds through the smaller ring, through the junction, and then through the larger ring. If the rings are of equal size, they will be labelled so that the lowest locants will be obtained for the substituents, but the starting point remains the ring junction.

Examples: L66J BG DVQ, L56TJ AQ EQ, T66 BO CHJ HQ
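The bracketing convention for rings lends itself to a simple parser, sketched below in Python. The function name `parse_ring` and the dictionary keys are hypothetical. The sketch assumes single-digit ring sizes and that the letter J occurs only as the closing symbol, which holds for the simple examples in this section but not necessarily for complex ring systems.

```python
def parse_ring(notation):
    """Split a simple ring notation into ring description and substituents."""
    if notation[0] not in "LT":
        raise ValueError("ring notations start with L or T")
    end = notation.index("J")  # closing symbol of the ring system
    ring, rest = notation[: end + 1], notation[end + 1:]

    sizes = []
    for ch in ring[1:]:        # digits right after L/T give the ring size(s)
        if not ch.isdigit():
            break
        sizes.append(int(ch))

    return {
        "heterocyclic": ring[0] == "T",
        "ring_sizes": sizes,
        # T before J: ring treated as saturated (U symbols inside mark double bonds).
        "saturated": ring[-2] == "T",
        # Each substituent is a locant letter followed by the group's symbols.
        "substituents": [(tok[0], tok[1:]) for tok in rest.split()],
    }


print(parse_ring("L6TJ A2 A2"))
print(parse_ring("T6NJ BZ"))
print(parse_ring("L66J BG DVQ"))
```

Symbols that sit between the brackets, such as V, U, or a hetero atom with its locant, are kept in the ring part and are not interpreted further here.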

Complex molecules. Ring systems can be much more complex than those described here. They can be polycyclic, and these rings can be fused, bridged, spiro-linked, or a combination of these. The overwhelming complexity of such compounds becomes very evident in the Ring Index (Parent Compound Handbook); yet every compound in that index has been encoded in WLN, every compound has received a unique and unambiguous notation, and almost all consist of fewer than 50 characters. The encoding of complex molecules will not be discussed here; it is beyond the scope of this introductory article. An excellent text is available for further work; it is the definitive work on WLN, written by Professor E. G. Smith (4). This book lists the rules on which WLN is based, and it explains each with many examples. The exercises at the end of each chapter are convenient for self-study.

It is not necessary to master WLN totally before using an index based on the system, because decoding is simpler than encoding. This can be seen readily by attempting a substructure search in an index using WLN (5). For example, locate various benzoic acids with oxygen in the meta-position. For more difficult searches the Chemical Substructure Dictionary (6) may be consulted.

Summary

WLN uses special symbols to denote the structural fragments of a molecule, and they are formed into a specific notation by writing them together in the same order in which the fragments are connected in the molecule. Alternative notations are resolved on the basis of a simple alphanumeric list which is easily remembered. The resulting notations are unique: only one notation is possible for a given structural formula. They are unambiguous: only one structural formula can be decoded from a given notation. The notations are comparatively short and are particularly suitable for indexing. They can be ordered readily, they are retrieved efficiently by humans or computers, and translation into structural formulas is not difficult.

Acknowledgment

I would like to thank Dr. Richard Luibrand of California State University at Hayward for much encouragement and many helpful discussions.

Literature Cited

(1) Dessy, R. E., and Starling, M. K., Anal. Chem., 51, 924A (1979); Rusch, P. F., J. CHEM ...
... Scientist, 87, 16 (July 3, 1980).
(4) Smith, E. G., "The Wiswesser Line-Formula Chemical Notation," McGraw-Hill, New York, 1968; Smith, E. G., and Baker, P. A., "The Wiswesser Line-Formula Chemical Notation," Chemical Information Management, Inc., Cherry Hill, NJ, 1976.
(5) "Chemical Substructure Index," Institute for Scientific Information, Philadelphia, PA; "Parent Compound Handbook," Chemical Abstracts Service, Columbus, OH.