The melting point of aspirin is 135°, and its molecular ion has the formula C9H8O4+.
The <p> tags precisely define a paragraph, a unit for structuring the document. A machine could now easily count the paragraphs in a document and the number of characters in each (but not individual words, which are separated by spaces rather than by tags). HTML provides a flexible (perhaps rather too flexible) document structure for text (paragraphs, headers, tables, lists), embedded images (and other multimedia objects), human interactivity (through forms), programs (through scripts, applets, and plug-ins), styling (some degree of formatting and screen layout), and metadata (essentially descriptions of data).
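As a rough illustration of the counting described above, the following Python sketch uses the standard-library html.parser module. The two-paragraph sample document and the class name ParagraphCounter are assumptions made for this example, not part of the original text.

```python
from html.parser import HTMLParser


class ParagraphCounter(HTMLParser):
    """Collect the text of each <p> element in an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # text of each completed paragraph
        self._buffer = None    # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer))
            self._buffer = None


# Hypothetical two-paragraph document, used only for illustration.
SAMPLE_HTML = "<p>Aspirin melts near 135 degrees.</p><p>Its formula is C9H8O4.</p>"

counter = ParagraphCounter()
counter.feed(SAMPLE_HTML)
counter.close()

paragraph_count = len(counter.paragraphs)
character_counts = [len(text) for text in counter.paragraphs]
print(paragraph_count, character_counts)
```

The tags delimit the paragraphs unambiguously, so the counts need no guesswork; splitting the same text into words would require an extra convention about spaces and punctuation.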
In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.
Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008
Whereas this is a substantial list, its success has generated many problems, which HTML in its original form cannot solve:
• HTML can only support a fixed tagset (for example, 59 in the latest specification for XHTML 2.0), and even this number is regarded as close to unmanageable (no software yet implements this full set consistently, accurately, and completely). Any other tags that might be present (e.g., ) are simply ignored (strictly speaking, they should be marked as invalid HTML, although they may be valid for other languages).
• Much of the behavior (semantics) is undefined. This lack has led to specific disciplines creating their proprietary methods of supporting functionality (e.g., through scripting languages, plug-ins, applets, and other software).
• HTML was designed to be error-tolerant, in recognition that it would be authored (and viewed) mostly by humans. Browsers may try to recover from nonconforming documents and may do so in different ways. Humans are good at recognizing and often correcting errors in HTML (missing links, broken formatting, incomplete text). Machines cannot normally manage broken HTML other than in a “fuzzy” manner.
• Author-provided metadata is often entirely absent. If present, it will likely adhere to a general form (the so-called Dublin Core schema, http://dublincore.org/), of limited utility in scientific, technical, and medical (STM) areas.
• The emphasis on presentation in many of the original tags (such as fonts, colors, and layout) muddled the separation of content from style. The World Wide Web Consortium soon developed technologies (CSS, or cascading style sheets, and XSL, or extensible stylesheet language; more information on this and other XML issues is available at http://www.w3c.org/) to help overcome this problem, but as with HTML, CSS is variably implemented in most browsers.
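The contrast between error-tolerant HTML handling and strict machine processing can be sketched in Python with two standard-library parsers. The fragment below is hypothetical: the <molecule> tag stands in for any tag outside the HTML tagset, and the unclosed <b> is a deliberate authoring error.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag seen, without checking validity."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical fragment: an unknown <molecule> tag and an unclosed <b>.
BROKEN = "<p>Aspirin <molecule>C9H8O4</molecule> is <b>common.</p>"

# The HTML parser reports what it finds and carries on.
logger = TagLogger()
logger.feed(BROKEN)
logger.close()
print(logger.tags)

# The XML parser refuses the same text because the tags do not nest.
try:
    ET.fromstring(BROKEN)
    xml_accepted = True
except ET.ParseError as err:
    xml_accepted = False
    print("XML parser rejected the fragment:", err)
```

The tolerant parser neither objects to the unknown tag nor notices the missing end tag, which is convenient for human readers but leaves a program with no firm structure to rely on.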
Most commercial tools for authoring HTML emphasize presentation or interactivity (to capture the reader’s attention), and in such HTML, the content is subservient to the style.
Examples of XML and of Chemical Markup Language
These conventional markup approaches (HTML, CSS, and XSL) are inadequate for datuments because there is usually no domain-specific support. XML, or extensible markup language, was introduced as a solution to this problem. XML was designed to be simple, easy to use, and small; it is a fully conforming subset of the older SGML (essentially “SGML lite”). It allows new markup languages to be defined through an XML schema formalism (see http://www.w3c.org/XML/Schema). A schema specifies a set of rules (syntax, structure, and vocabulary) to which a document must conform; those that do are said to be “valid”. Schemas allow more precise constraints, allow the definition of datatypes, and enhance the potential for machine processing.
Chapter 8: Markup Languages and the Datument
We start by explaining the terms used in the XML language, illustrated with a small and simple example (kept brief for simplicity and hence not relating to a real molecule) (Scheme 8-1).

[Scheme 8-1: element markup not recoverable; surviving character data: 150-78-2 and 1.0/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/...]
Scheme 8-1. The basic features of an XML document.
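To make the vocabulary of elements, attributes, and content concrete, the following Python sketch walks a small XML fragment with the standard-library ElementTree module. The fragment is an illustrative stand-in rather than the scheme itself: the element names (document, molecule, identifier, atomArray, atom) are assumptions, while the attribute values are those quoted in the discussion.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-in for a small XML document; element names are assumed.
SAMPLE_XML = """
<document>
  <molecule>
    <identifier version="1.0" convention="InChI">1.0/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/...</identifier>
    <atomArray>
      <atom elementType="C" x2="-5.4753"/>
    </atomArray>
  </molecule>
</document>
"""

root = ET.fromstring(SAMPLE_XML)


def walk(element, depth=0):
    """Yield (depth, tag, attributes, text) for an element and its descendants."""
    text = (element.text or "").strip()
    yield depth, element.tag, dict(element.attrib), text
    for child in element:
        yield from walk(child, depth + 1)


rows = list(walk(root))
for depth, tag, attributes, text in rows:
    # Indentation mirrors the parent/child hierarchy of the document.
    print("  " * depth + tag, attributes, text)
```

In this sketch the identifier carries its data as element content plus two attributes, whereas the atom is an empty container whose data live entirely in attributes.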
The core of the language consists of a set of data containers, or more formally elements (not to be confused with the chemical elements), the enumeration of which is ideally defined by a schema. In this example, the elements are , , , , , and . These have a clearly defined relationship to one another (illustrated above by indentation of the text). Thus the element is said to be the parent of a child element termed , and both are children of the top-level element , which can also be called the document root element. This hierarchy among elements is precisely defined and must carry no ambiguity.

Elements can specify data or information in two ways. First, data can be contained between the start and end of any particular element, such as and in the example. Such content can of course be other (child) elements, but it can also be character or numeric data, as in the example above, which uses both the CAS Registry Number, a unique identifier assigned to chemical structures by CAS (see Appendix 12-3), and a unique canonical molecule identifier known as InChI (International Chemical Identifier) (for more on InChI, see Appendix 8-1).

Second, data can also occur as the value of an attribute to the element. In the example above, the element has two attributes, version=“1.0” and convention=“InChI”. Both the name of the attribute and its value can be enumerated if needed by the schema; if the attribute is unknown, or its value is outside defined limits, the entire document or datument can be flagged as invalid by suitable software. Thus has attributes elementType=“C” and x2=“-5.4753”. For the former, a value of “C” is allowed (because it is recognized as the standard symbol for the chemical element carbon), but a value of, say, “CX” would not be allowed. The second attribute is defined (in the schema) as the x coordinate of a set of two-dimensional molecular coordinates. As such, its presence implies that it should be paired with a y2 coordinate. One can specify in the
schema what kind of behavior to impose if, say, y2 were to be missing. One might decide that its presence would be inferred and that its value should be y2=“0.0”, although in practice that would be a dangerous assumption, and it would be better to flag its absence as an error. Decisions also have to be made regarding the value of this attribute. With two-dimensional coordinates, no assumptions can really be made about the units in which the coordinates are specified, and it would be up to any software to process the values in a sensible manner. Whereas a human might think that, e.g., x2=“-54753.0” looks unreasonable, it may still be internally consistent with the other coordinates. Such software would probably also be expected to trap conditions such as two atoms with identical coordinates, or truly unreasonable values. However, one can be a little more specific about, e.g., x3=“-5.4753”. This would be interpreted as the x coordinate of a three-dimensional set, and as such a reasonable implicit behavior would be to treat this value as corresponding to angstrom units unless otherwise specified.

It is also worth noting that elements which specify data in the form of attributes need not enclose any further data; thus in this case represents both the start and the end of the element (in other words it is an empty container).

The preceding discussion has been fairly precise and meticulous, if only to illustrate how XML can be used to impose well-defined structures and relationships on data. We emphasize, however, that it would not normally be a human who has to cope with such levels of detail and precision; the design is such that in fact software will carry almost all of the burden of producing the XML in the first place and then validating and using it subsequently. The preceding argument served only to illustrate how such software can be made to operate safely without the need for human intervention in the process.
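The attribute checks discussed above (an enumerated elementType, x2 paired with y2, and trapping of identical coordinates) can be sketched in Python. This is a hand-written stand-in for what schema-aware software would do, not an XML Schema validator; the element names, the short list of allowed element types, and the sample fragment are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed subset of an elementType enumeration; a real schema would list more.
ALLOWED_ELEMENT_TYPES = {"H", "C", "N", "O"}


def check_atoms(atom_array):
    """Return a list of human-readable problems found in an atomArray element."""
    problems = []
    seen_coordinates = set()
    for index, atom in enumerate(atom_array.findall("atom"), start=1):
        element_type = atom.get("elementType")
        if element_type not in ALLOWED_ELEMENT_TYPES:
            problems.append(f"atom {index}: elementType {element_type!r} not allowed")
        x2, y2 = atom.get("x2"), atom.get("y2")
        # x2 and y2 must appear together; a missing partner is reported
        # as an error rather than silently defaulted to 0.0.
        if (x2 is None) != (y2 is None):
            problems.append(f"atom {index}: x2 and y2 must both be present")
            continue
        if x2 is not None:
            point = (float(x2), float(y2))
            if point in seen_coordinates:
                problems.append(f"atom {index}: duplicate coordinates {point}")
            seen_coordinates.add(point)
    return problems


# Hypothetical fragment with one deliberate error in each atom after the first.
SAMPLE = """
<atomArray>
  <atom elementType="C" x2="-5.4753" y2="1.0"/>
  <atom elementType="CX" x2="0.0" y2="0.0"/>
  <atom elementType="O" x2="2.5"/>
  <atom elementType="O" x2="-5.4753" y2="1.0"/>
</atomArray>
"""

problems = check_atoms(ET.fromstring(SAMPLE))
for problem in problems:
    print(problem)
```

The sketch makes no judgment about the magnitude of a coordinate, in keeping with the point that two-dimensional values carry no implied units.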
The second example (Scheme 8-2) is an elaboration of the first fragment, but formalized below as CML (chemical markup language) (3). We emphasize that this chapter is not meant to be an instructional manual for any given markup language, with CML here serving only to illustrate the general principles involved. Many other scientific applications of XML have been developed (3, 4), and syntactically, either of these examples could be replaced by other such modularized markup languages. This more extensive example illustrates how a wider range of properties can be defined and also contains a new feature called a namespace. The purpose of this namespace is to enable this entire XML fragment to be combined or aggregated with other XML languages so that no conflict between the names used for the elements can arise. This aggregation is achieved by prefacing each element with a unique (to the document) short string: . An attribute xmlns:cml is now used to define what is called a URI (uniform resource identifier), which stamps a globally unique identifier on the meaning of the cml: prefix. This uniqueness will allow this datument to coexist with other XML languages without conflict (an example of which is described later in this chapter).
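A short Python sketch may help show how a namespace keeps identically named elements apart. Both URIs below are placeholders invented for this example (they are not the URI used by CML), and the second vocabulary with its catalogue entry is hypothetical.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are placeholders chosen for this sketch.
CML_NS = "http://example.org/cml"
OTHER_NS = "http://example.org/other"

SAMPLE = f"""
<cml:molecule xmlns:cml="{CML_NS}" xmlns:other="{OTHER_NS}" id="m1">
  <cml:identifier>150-78-2</cml:identifier>
  <other:identifier>catalogue-entry-42</other:identifier>
</cml:molecule>
"""

root = ET.fromstring(SAMPLE)

# ElementTree expands each prefix to its URI, so the two <identifier>
# elements have different qualified names despite the shared local name.
qualified_names = [child.tag for child in root]
print(qualified_names)

# Look up only the CML identifier; the prefix used in the query is local
# to this mapping and need not match the prefix used in the document.
cml_identifier = root.find("c:identifier", {"c": CML_NS}).text
print(cml_identifier)
```

The prefix is only shorthand within one document; the URI it maps to is what carries the globally unique meaning.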
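[Scheme 8-2: element markup not recoverable; surviving character data: 1.0/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12) (InChI), 150-78-2 (CAS-RN), and 136 (melting point scalar).]
Scheme 8-2. A CML datument describing a property of aspirin.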
The molecule element in this example contains five child elements: cml:metadata, cml:identifier, cml:atomArray, cml:bondArray, and cml:propertyList. Of these, cml:metadata, cml:atomArray, and cml:bondArray have no children and are empty containers, defining only attribute/value pairs, whereas cml:propertyList has one child, cml:property. The latter itself has a child: cml:scalar. As well as the namespace, the cml:molecule element itself has two other attributes, id and title. The and elements reference namespaces other than CML. This is done to facilitate aggregation with other XML components, such as dictionaries, and by this means to reduce what has been called “tag soup”. For example, defines a namespace for a dictionary reference called chem:mpt. Any processing software that might need to process a melting point property would be directed to this dictionary for further information on the semantics of this term. Similarly, the
attribute units=“unit:c” would handle the conversion of scientific units, and dataType=“xsd:decimal” would handle the basic datatype (i.e., the definition of a decimal number) itself. This mechanism avoids overburdening CML itself with the need to specify such semantics. No other elements or attributes in this example have XML-defined semantics; all other semantics are imposed by CML itself. Thus, the CML schema defines an enumeration (list) of allowed elementTypes and defines their meaning, use, and boundaries. These aspects are discussed in more detail below.
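How processing software might act on these attribute values can be sketched as follows. The attribute name dictRef, the namespace URI, and the two small lookup tables (including the unit:k entry) are assumptions standing in for the external dictionaries; only the values chem:mpt, unit:c, xsd:decimal, and 136 come from the example.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Placeholder namespace URI; the attribute name dictRef is an assumption.
CML_NS = "http://example.org/cml"

SAMPLE = f"""
<cml:property xmlns:cml="{CML_NS}" dictRef="chem:mpt">
  <cml:scalar dataType="xsd:decimal" units="unit:c">136</cml:scalar>
</cml:property>
"""

# Assumed lookup tables standing in for external datatype and unit dictionaries.
PARSERS = {"xsd:decimal": Decimal, "xsd:string": str}
TO_KELVIN = {
    "unit:c": lambda value: value + Decimal("273.15"),
    "unit:k": lambda value: value,
}

prop = ET.fromstring(SAMPLE)
scalar = prop.find("c:scalar", {"c": CML_NS})

# The dataType attribute selects how the text is parsed; the units
# attribute selects how the parsed value is converted.
value = PARSERS[scalar.get("dataType")](scalar.text.strip())
kelvin = TO_KELVIN[scalar.get("units")](value)
print(prop.get("dictRef"), value, kelvin)
```

Because the meaning of each term is looked up rather than built in, the same code could handle a new unit or datatype once the corresponding dictionary entry exists.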
The Use of Identifiers
The examples of XML shown in Schemes 8-1 and 8-2 illustrate the use of two types of identifiers. The identifier attributes seen in, e.g., are used internally to enable specification of, e.g., and should be unique within the datument (but not necessarily globally) to ensure that this XML document is well-formed and valid. The second type is an element containing identifiers. Various identifiers could be used, such as SMILES, CAS-RN, and the InChI canonical identifier (as shown here), precisely derived from the molecule connection table and used to establish molecular global uniqueness. Any two datuments that contain the same InChI identifier (in this example, C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12)) should be presumed to refer to the same molecule (in the sense of a connection table, but not necessarily other properties, such as 3-D coordinates, for example). The whole aspect of identifiable data is pivotal to the concepts used here. The CAS-RN is widely used and offers comprehensive coverage from simple molecules to polymers to Markush structures. The InChI can be derived from the structure, but it is still a fairly new standard. It does not yet have wide support and does not yet offer complete coverage of all materials.
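The two roles of identifiers can be sketched in Python: a check that id values are unique within one datument, and a comparison of two datuments by their InChI strings. The element names, id values, and titles in the two sample documents are hypothetical, and one id is deliberately repeated to trigger the first check.

```python
import xml.etree.ElementTree as ET
from collections import Counter

INCHI = "C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12)"


def duplicate_ids(root):
    """Return id values that occur more than once anywhere in the document."""
    counts = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
    return sorted(value for value, n in counts.items() if n > 1)


def inchi_of(root):
    """Return the text of the first identifier whose convention is InChI."""
    for el in root.iter("identifier"):
        if el.get("convention") == "InChI":
            return (el.text or "").strip()
    return None


# Two hypothetical datuments: same InChI, different local ids and titles.
DOC_A = f"""
<molecule id="m1" title="aspirin">
  <identifier convention="InChI">{INCHI}</identifier>
  <atomArray><atom id="a1"/><atom id="a2"/><atom id="a2"/></atomArray>
</molecule>
"""
DOC_B = f"""
<molecule id="mol-7" title="acetylsalicylic acid">
  <identifier convention="InChI">{INCHI}</identifier>
</molecule>
"""

root_a, root_b = ET.fromstring(DOC_A), ET.fromstring(DOC_B)
repeated = duplicate_ids(root_a)            # local uniqueness check
same_molecule = inchi_of(root_a) == inchi_of(root_b)  # global identity check
print(repeated, same_molecule)
```

The local ids differ between the two documents without consequence, whereas the shared InChI is what allows the two records to be matched.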
Display of XML and Specific XML Languages
The default way of “displaying” or “browsing” any XML-compliant language, such as CML, is as a so-called tree view, outlining the structure of the document (Figure 8-1) but to which no style has been applied. Most modern Web browsers will support this feature (we recommend Firefox). Of more utility is to associate a specific style or transform with this datument. A technology known as XSLT is essentially a specification of how an XML-based datument might be transformed into a different representation (or subset) of the data. Four examples of how this might be used to transform this datument are listed below:
• Extraction of atom two- or three-dimensional coordinates and rewrapping with appropriate syntax for interactive display on screen using appropriate
[Figure 8-1, panels (A), (B), and (C): tree views of the datument with nodes collapsed (+) and expanded (-); element markup not recoverable. Surviving text content: 1.0/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,2-5H,(H,11,12) and 150-78-2.]