Syntax to facilitate the word processing of chemical formulas - Journal

May 1, 1983 - Syntax to facilitate the word processing of chemical formulas. G. H. Kirby and Susan Milward. J. Chem. Inf. Comput. Sci. , 1983, 23 (2),...
0 downloads 11 Views 446KB Size
57

J. Chem. Inf. Comput. Sei. 1983, 23, 57-60

Syntax To Facilitate the Word Processing of Chemical Formulas G. H. KIRBY* and SUSAN MILWARD Department of Computer Studies, University of Hull, Hull, England

Received November 11, 1982 A syntax is presented for the unambiguous representation of chemical formulas as a string of

characters on a single line without super- or subscripts. Software is described which analyzes formulas with thii syntax and prints them with super-and subscripts as in normal printed chemical text. Rules are given for the recognition of singleline formulas from other text and demonstrate that chemical text could be handled by word-processing systems enhanced by softwareto recognize and analyze formulas. A technique is also described for indicating and printing Greek letters, Roman numerals, and other special characters all of which may occur in chemical formulas. Examples are given of input to and output from the software as implemented for a CBM microcomputer system. INTRODUCTION Chemical formulas embedded in text pose various problems for input and display on computer systems. The inclusion of chemical formulas in text for word processing requires careful consideration regarding the handling of super- and subscripts and the possible appearance of special characters such as Greek letters or Roman numerals. Various systems for the unambiguous representation of chemical formulas on a single line and for conversion to the standard notation, with super- and subscripts, have been described.' Parsers have been developed for chemical formulasa3 although these tend to be more concerned with checking the syntax mainly for interactive computer-based learning or data-base query systems. These approaches are, however, somewhat limited in application, since they cannot distinguish between numbers denoting mass numbers, atomic numbers, simple atom or group repetition factors, nor repetition factors preceding charges. Where the handling of chemical formulas has been incorporated into word-processing systems, the usual method adopted is to add special characters to the text to identify super- and subscripts. These characters explicitly indicate that the printer needs to move the paper up or down half a line or to backspace so that, for example, the charge may appear directly above the repetition factor. During printing, these characters are translated into the appropriate control codes.4 Facilities are not provided for viewing the desired output on the screen prior to printing. The identification of super- and subscripts is a problem present in the input of scientific text generally. The use of special characters introduces difficulties in the correct preparation and entering of material, a task that may well be performed by scientificallyunskilled personnel. Furthermore, the available character set is restricted; where this has been overcome by the use of escape sequences, the input is made even more cumbersome. Finally, the meaning of the formula is obscured by these additional characters. For example, in the Wordcraft word-processing system on Commodore comp u t e r ~[02+] ~ [PtF6-] would be input as [OESC-2ESC+ESC+ESC!+ESC-I [PtFESC-6ESC+ESC+ESC!-ESC-I where ESC+ gives positive half line-feed, ESC- gives negative half-line feed, and ESC! gives a backspace. We have developed a simple syntax for chemical formulas presented on a single line which aids the recognition of formulas from ordinary text and allows the determination, by 0095-2338/83/1623-0057$01.50/0

context, of those characters which are super- or subscripts. Only one reserved character is essential, acting as a separator between possible adjacent numbers such as the repetition factor of an atom and that of its charge, as in SO4%,or between mass and atomic numbers, as in i3C. This syntax applies to formulas as commonly used in inorganic chemistry and to empirical formulas of organic compounds. It is not intended to handle the many ways in common use for indicating both the organization of atoms and groups and the types of bonding within organic compounds. We also show how characters encountered in chemical formulas that are not available in normal character sets, such as Greek letters and Roman numerals, can be indicated on input and displayed when the printing peripheral can produce user-defined characters. THE SYNTAX

-

In general, chemical formulas consist of specific characters appearing in restricted positions as follows: + - superscript {) [I () A...Z a...Z natural (Le., middle or normal) line 0...9 any of the three levels, including subscript The three dots denotes a subrange of the characters, so A...Z means any of the upper case letters from A to Z. Inputting a chemical formula on a single line means that in our scheme, for example CuS04 would be input as CuSO4 S042- would be input as S04:2-

[O,'] [PtF6-] would be input as [02:+] [PtF6:-] The colon has been chosen as the reserved character on the basis of previous discussion of the alternatives.' It acts as a separator in two situations: (i) to separate numbers; e.g., in the above examples we want S042- and not SO4*- or SO4,-; (ii) to separate the repetition factor from a single charge; e.g., and not 02+. While it should be possible to find we want 02+ which of these alternatives is correct through semantic checking, it would require sizeable tables of information and more sophisticated software, particularly to allow complete generality. In microcomputer applications, the size and speed of code are crucial; it is far better to impose a standard input format which is, in this case, very simple. In order to process formulas in this way, the syntax given in Table I was derived. Organic representations such as 0 1983 American Chemical Society

KIRBYAND MILWARD

58 J. Chem. lnf. Comput. Sci., Vol. 23, No. 2, 1983 The f l u o r i d e s w i I l rear-% w i t h s t r o n g f l u o r o - L e w i s

a c i d s such as SbF5

has a molecular. l a t t i c e ,

o r HsF5 t o g i v e adducts. H l t h o u g h XeF2.IF5

in

o t h e r cases, f l u o r i d e i o n t r a r i s f e r occurs t o give s o l i d s t h a t c o n t a i n XeF+,. XeF5:+..

.

ard :(HsF6:->

o r XeF5:+PtF6:-.

n

There are s u b s t a n t i a l d e p o s i t s o f Iirfiestone,, CaCO3. C:aC03. PlgCO3 ,

.

w ~ c f corna.1 I ite, C:C I MgCl2. SH20.

dolomite..

Less ahundarit w e

fl1 I i s o t o p e s o f riadiurn are

s t r o n t i a r l i t e , , SrSCl4..

w~cf tiar:y*tes, B&Cl4.

radioacti1,Je.

which occurs i n t h e :238U decay series,.

:ZE.SRa,

was f i r s t isolated froro t h e urat-liurn o r e p i t c h b l e n d e .

.rl

T h e lsery a.i r-sens it i1 . m sta-inous

ion

,. SnZ+ .. may

be o b t a i r e d by

the r e a c t i a n

.n .

i::crr:Cl04::i2

+

Sn,*’Hg = Cu

+

SnZ+

+ ZCl04:-

rl

H y d r o l y s i s g i i . ~ e sCSn3COH::~432+.,w i t h SnOH+ and CSn(ClH>Z32+ i n minor

armctnts.

T h e f l u o r i d e s w i l l r e a c t w i t h s t r o n g f l u o r a - L e w i s a c i d s such as SbF5 H l t h o u g h XeF2.1F5 has a rnolecular. l a t t i c e ,

o r RsF5 t o g i v e adducts.

i n o t h e r cases.. f l u o r i d e i o n trarisfer o c c u r s t o g i v e s o l i d s that c o n t a i n XeF+., XeF&.. and Xe2F$ i o n s as i n (Xe2Fs> CHsFz> o r XeFSPtFz. T h e r e are s u b s t a r l t i a l d e p o s i t s o f I imestone, CaC03.MgC03,.

strontianite,

w~dcarnal1 i t e ,

K:Cl .MgC12.6HZ0.

SrS04.. at-icl b w y t e s . , BaS04.

r,acfioar-tive. 266Ra,

CaC03, dolomite..

Less abundant w e

H I 1 isotopes o f r a d i u m w e

which occurs i n t h e 238U

decay s e r i e s .

was f i r s t

i s o l a t e c l frorn t h e uraniim ore p i t c h b l e n d e .

T h e very a i r - s e n s i t i v e

stannous ion., Sn2+.-

r f w d be o b t a i n e d by t h e

r e a c t ion

i : ~ r ~ : : C : l O ~ :+ : ~ ~C;n..+Ig Hydro Iys is g ives I:Sn3 < OH >

= C:cr 3

+

Sn2+

’+ .. IU

+ 2Cl03

it h S n t ”

and C S n < OH :,2 12+

i n m inor

mounts.

Figure 1. Chemical text before and after processing by the formula recognizer and analyzer software. The input in the upper half uses the diktive .n to indicate new paragraphs.

CH3.CH2.CH2.CH3are covered, as well as the more usual inorganic formulas such as K2Cr207or ions like [Fe(H20),(0H)l2+. Each must be terminated by a space so that a formula is treated like a word of text. These syntax rules are expressed in the standard syntactic meta-language defined by the British Standards Institution? Note that symbols enclosed by an apostrophe are represented by themselves, optional symbols are enclosed in square brackets, Le., [I, and symbols that can be repeated any number of times, including zero, are enclosed in braces, i.e., I). The rules prevent any ambiguity when an atom X is possibly followed by super- and subscripts (i.e., charge and repetition factor) and the next atom Y is possibly preceded by su erand subscripts (i.e., mass and atomic numbers), as in XI,*. Thus, in the single-line format, the repetition factor i must precede any chargej, a reasonable rule since this is the order in which chemists refer to ions. Note that where the repetition factor and charge are both present, the colon separator must be used, as previously mentioned, to avoid the possible amSimilarly, mass and biguity in, for example, SO4*-and 02+. atomic numbers, which can precede an atomic symbol, must both be preceded by a colon to separate them from a preceding repetition factor or to identify them. The exception to this is the case of a mass number following a charge, which itself delimits the previous atom. While a mass number may appear with or without an atomic number, an atomic number cannot

P

Table I. Syntax for Chemical Formulas Presented on One Line

fullformula = formula, C.’, formula}, ’ ’; formula = [integer], body; body = atom, {body} I bracketsequence, {body}; bracketsequence = bra, body, ket, [repfactcharge] ; atom = [ massatno], atsymbol, [repfactcharge] ; massatno = [’:’I, massno, [’:’, atno]; repfactcharge = repfactor I [repfactor, ’:’I, [digit], sign I [repfactor, ’:’I, sign, {sign};

bra=’(’ I,[’

l’f;

ket = ’)’ 1’1’ I ’ y ; sign = ’+’ I ’-’; atsymbol = capital, [lowercase]; capital = ’A’ I ’B’ I . . . . I ‘Z’; lowercase = ’a’ I ’b’ I . . . . . I ’z’; massno = integer; at no = integer ; repfactor = integer; integer =digit, {digit}; digit = ’0’ 1’1’ I . . . . . 1’9’;

.

appear on its own, a situation reflecting general usage of mass and atomic numbers. Hence :13C means 13Cand is distinct from 13C; i3C is represented as :13:6C. Software was written to process formulas according to this syntax. This software first looks for any prefix and then takes atoms in sequence. When an open bracket is encountered, a bracket sequence is started, and the process of searching, either

WORDPROCESSING OF CHEMICAL FORMULAS

J. Chem. In$ Comput. Sci., Vol. 23, No. 2, 1983 59

There is an extensive chemistry o f mixed compounds such as CrCC0>3 o r C%20%4-C4H6>Fe3. Some organo compounds i n higher ox i dation states are known, howeuor , mai n I Y f o r the cyclopentadienyl ligand as in C%2Qt5-C5H5>2Ti%18%1SCl2, PFe%13, and C2C:o*13%111+. n. n Pt%l3, Pt$13%19, and to a lesser extent Pdtl3 have a strong tendency to form %21-bondsr while Pdtl3 very readily forms t22-allyl species. PtS13, PtS13%19.. and to A lesser extent Pdt13 have a strong tendency f o f o r m %21-bonds,. while Pd013 very readily forms $22-allyI species. n. n The dark red brown IrS133lSCl6:2- ion is rapidly and quantitatively reduced in strong OH- scllution to give yellow-green Ir313$11C16:3-: .n 2IrCl6:2- + 20H- S40S41 2IrCl6:3- + $42*43#0Z# + HPO

.

.

There is ar~extensive chemistry of mixed cornpounds such as C r < C 0 > 3 o r < y 4 - ~ : 4 ~ 6 > ~ e < Some ~ ~ > 3organo . compounds in higher oxidation states are known, however,, mainly f o r the cyclopentadienyl I igand as in (yS-C5H5>2Ti IJzCIZ, PFeP,.~ and C