Markup Languages and the Datument - ACS Publications - American

terms where you can check a more precise definition should the need arise. ... queried if needed; after all, agencies such as Chemical Abstracts Servi...
1 downloads 21 Views 369KB Size
➤➤➤➤➤

CHAPTER 8

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

Markup Languages and the Datument Peter Murray-Rust and Henry S. Rzepa

A

style guide is presumed in its basic objective as guiding authors in the task of producing a document describing their work. In a conventional sense, style is traditionally presentational in objective and is aimed at producing a (visually) homogeneous form suitable for aggregation into a journal issue or a book. At the outset of this chapter, we should pin our flag firmly to a rather different mast. This chapter is less about the application of style to the presentation of a document and more about the general principles involved in its application to data to achieve a homogeneous form capable of being reused. It is not so much about how to create a document, but rather about how to create a more data-focused entity we call a datument (1), that is, a container for data and its associated descriptions. The datument can have conventional attributes of authorship, affiliation, and other familiar structural components, such as sections, tables, figures, and bibliography, but it extends this by also carrying data. This differs from, for instance, tabulated or quoted data typical of that found in most chemical documents. To illustrate this most important difference, consider the following assertion: The melting point of aspirin is 135°, and its molecular ion has the formula C9H8O4+·.

As a human trained in chemistry, you probably understand much of the semantics, but consider how much potential ambiguity and implied meaning is contained in this admittedly concise statement. • You understand what is meant by the term melting point, but you might have to seek a librarian’s help to locate a relevant dictionary of chemical terms where you can check a more precise definition should the need arise. Copyright 2006 American Chemical Society In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

87

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

88

➤ The ACS Style Guide

• After a little thinking, you conclude that the glyph ° indicates that the number preceding it is a temperature expressed in Celsius units. You probably also recognize that this number is probably only accurate to ± 1 °C, only because you have made such measurements yourself. At the back of your mind is probably the knowledge that a value of 135° is reasonable for an organic compound (and that 1135° would not be), that this implies that this substance is a solid at room temperature and that this value may be used as an approximate indicator of chemical purity. • You certainly recognize the term aspirin as a trivial, unsystematic but commonly used description of a nevertheless well-defined molecule (the structure, or more accurately the connection table, of which you may need to ascertain). • The term molecular ion is a term that tends to be used in a particular branch of spectroscopy known as mass spectrometry. • You have little difficulty in recognizing the molecular formula as by convention listing the number of carbon atoms as the subscript to the initial C, then followed by the number of hydrogen and then other atoms in alphabetical order. The final suffix indicates the charge on the system and reconciles with the use of the term ion, and you will notice that the suffix reveals the presence of an unpaired electron. • You probably are aware of rules that would allow you to check the validity of this formula (i.e., are the + and the • consistent with each other; is the formula physically possible within the constraints of valence theory, etc.). • You might want to derive a molecular mass from the formula, in which case you need to know about atomic weights, isotopes, and other concepts. • You might want to relate the formula to a two-dimensional structural representation, indicating perhaps where the charge might reside, or how the ion might fragment, or a three-dimensional representation for molecular modeling. • As a human, you will infer properties such as aromaticity or the probable presence of a hydrogen bond. A knowledgeable chemist could probably make quite a few more inferences from the above, but the purpose here is to illustrate how much implicit (i.e., undeclared) information there is in such a brief statement (all well-defined chemical terms are shown above in italics to emphasize this). The purpose in writing it all down is to reinforce the idea that although such a statement (as might be found in a document) contains much data, only a human can really make significant use of it. Well, that would be true only if the human were to give undivided attention to the contents of such a document; the human certainly could not cope well with more than a few such documents, and not at all with, say, millions of documents. A normal response to such issues of scale would be to say that the essential data and semantics of the above phrase need to be abstracted (with inevitable loss

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

Chapter 8: Markup Languages and the Datument



89

of some data and information) into a database representation and then suitably queried if needed; after all, agencies such as Chemical Abstracts Service exist to provide such a service. However, consider that the process of converting even a brief statement such as the above into a true chemical abstract requires an expert (and it must be said, error-prone) human. Even then, the knowledge required to construct the list above would have to be acquired from other sources. Could instead much, perhaps all, of this work be handled by a computer? The answer is a clear “yes”, but only if the ground is well prepared for such a task. The purpose of this chapter is to outline some of the basic principles (but not the technical details) of how this could be done and to set out the grander vision that creating an infrastructure that adopts such principles would be the first step toward what has been described as a semantic web of information and knowledge. Not only humans but machines could roam, on vast scales if need be, on such a web, and by doing so discover connections between data and concepts which in days past might have been described as the art of scientific serendipity (2).

Markup Languages and the World Wide Web The World Wide Web arose from the need for high-energy physicists at Conseil Européen pour la Recherche Nucléaire (CERN) to communicate and exchange data and information within a large dispersed community. The basic design, which has been well documented, involved the creation of structured documents containing identifiable information components linked by uniform resource identifiers using markup tags that could be recognized by machines and used as formatting or styling instructions rather than being part of the actual content. Using the HTML version of the previous example:

The melting point of aspirin is 135°, and its molecular ion has the formula C9H8 O4+..



The “tags” in the angle brackets are recognized by the processor as markup and are used as instructions rather than content to produce the previously rendered sentence. Although such examples will be familiar to many readers, we emphasize it here because it illustrates the critical importance of separating content from style (or form). The

tags precisely define a paragraph, a unit for structuring the document. A machine could now easily count the paragraphs in a document and the number of characters (but not individual words, which are separated not by tags, but by spaces) in each. HTML provides a flexible (perhaps rather too flexible) document structure for text (paragraphs, headers, tables, lists), embedded images (and other multimedia objects), human interactivity (through forms), programs (through scripts, applets, and plug-ins), styling (some degree of formatting and screen layout), and metadata (essentially descriptions of data).

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

90

➤ The ACS Style Guide

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

Whereas this is a substantial list, its success has generated many problems, which HTML in its original form cannot solve: • HTML can only support a fixed tagset (for example 59 in the latest specification for XHTML 2.0), and even this number is regarded as close to unmanageable (no software yet implements this full set consistently, accurately, and completely). Any other tags that might be present (e.g., ) are simply ignored (strictly speaking, they should be marked as invalid HTML, although they may be valid for other languages). • Much of the behavior (semantics) is undefined. This lack has led to specific disciplines creating their proprietary methods of supporting functionality (e.g., through scripting languages, plug-ins, applets, and other software). • HTML was designed to be error-tolerant, in recognition that it would be authored (and viewed) mostly by humans. Browsers may try to recover from nonconforming documents and may do so in different ways. Humans are good at recognizing and often correcting errors in HTML (missing links, broken formatting, incomplete text). Machines cannot normally manage broken HTML other than in a “fuzzy” manner. • Author-provided metadata is often entirely absent. If present, it will likely adhere to a general form (the so-called Dublin Core schema, http://dublincore. org/), of limited utility in scientific, technical, and medical (STM) areas. • The emphasis on presentation in many of the original tags (such as fonts, colors, and layout) muddled the separation of content from style. The World Wide Web Consortium soon developed technologies (CSS, or cascading style sheets, and XSL, or extensible stylesheet language; more information on this and other XML issues is available at http://www.w3c. org/) to help overcome this problem, but as with HTML, CSS is variably implemented in most browsers. Most commercial tools for authoring HTML emphasize presentation or interactivity (to capture the reader’s attention), and in such HTML, the content is subservient to the style.

Examples of XML and of Chemical Markup Language These conventional markup approaches (HTML, CSS, and XSL) are inadequate for datuments because there is usually no domain-specific support. XML, or extensible markup language, was introduced as a solution to this problem. XML was designed to be simple, easy to use, and small; it is a fully conforming subset of the older SGML (essentially “SGML lite”). It allows new markup languages to be defined through an XML schema formalism (see http://www.w3c.org/XML/ Schema). A schema specifies a set of rules (syntax, structure, and vocabulary) to which a document must conform; those that do are said to be “valid”. Schemas allow more precise constraints, allow the definition of datatypes, and enhance the potential for machine processing.

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

Chapter 8: Markup Languages and the Datument



91

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

We start by explaining the terms used in the XML language, illustrated with a small and simple example (kept brief for simplicity and hence not relating to a real molecule) (Scheme 8-1). 150-78-2 1.0/C9H8O4/c1-6(10) 13-8-5-3-2-4-7(8)9(11)12/...

Scheme 8-1. The basic features of an XML document.

The core of the language consists of a set of data containers, or more formally elements (not to be confused with the chemical elements), the enumeration of which is ideally defined by a schema. In this example, the elements are , , , , , and . These have a clearly defined relationship to one another (illustrated above by indentation of the text). Thus the element is said to be the parent of a child element termed , and both are children of the top-level element , which can also be called the document root element. This hierarchy among elements is precisely defined and must carry no ambiguity. Elements can specify data or information in two ways. First, data can be contained between the start and end of any particular element, such as and in the example. Such content can of course be other (child) elements, but it can also be character or numeric data, as in the example above, which uses both the CAS Registry Number, a unique identifier assigned to chemical structures by CAS (see Appendix 12-3), and a unique canonical molecule identifier known as InChI (International Chemical Identifier) (for more on InChI, see Appendix 8-1). Second, data can also occur as the value of an attribute to the element. In the example above, the element has two attributes, version=“1.0” and convention=“InChI”. Both the name of the attribute and its value can be enumerated if needed by the schema; if the attribute is unknown, or its value is outside defined limits, the entire document or datument can be flagged as invalid by suitable software. Thus has attributes elementType=“C” and x2= “-5.4753”. For the former, a value of “C” is allowed (because it is recognized as the standard symbol for the chemical element carbon), but a value of say “CX” would not be allowed. The second attribute is defined (in the schema) as the x coordinate of a set of two-dimensional molecular coordinates. As such, its presence implies that it should be paired with a y2 coordinate. One can specify in the

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

92

➤ The ACS Style Guide

schema what kind of behavior to impose if, say, y2 were to be missing. One might decide that its presence would be inferred and that its value should be y2=“0.0”, although in practice that would be a dangerous assumption, and it would be better to flag its absence as an error. Decisions also have to be made regarding the value of this attribute. With two-dimensional coordinates, no assumptions can really be made about the units in which the coordinates are specified, and it would be up to any software to process the values in a sensible manner. Whereas a human might think that, e.g., x2=“-54753.0” looks unreasonable, it may still be internally consistent with the other coordinates. Such software would probably also be expected to trap conditions such as two atoms with identical coordinates, or truly unreasonable values. However, one can be a little more specific about, e.g., x3=“-5.4753”. This would be interpreted as the x coordinate of a three-dimensional set, and as such a reasonable implicit behavior would be to treat this value as corresponding to Angstrom units unless otherwise specified. It is also worth noting that elements which specify data in the form of attributes need not enclose any further data; thus in this case represents both the start and the end of the element (in other words it is an empty container). The preceding discussion has been fairly precise and meticulous, if only to illustrate how XML can be used to impose well-defined structures and relationships on data. We emphasize, however, that it would not normally be a human who has to cope with such levels of detail and precision; the design is such that in fact software will carry almost all of the burden of producing the XML in the first place and then validating and using it subsequently. The preceding argument served only to illustrate how such software can be made to safely operate without the need for human intervention in the process. The second example (Scheme 8-2) is an elaboration of the first fragment, but formalized below as CML (chemical markup language) (3). We emphasize that this chapter is not meant to be an instructional manual for any given markup language, with CML here serving only to illustrate the general principles involved. Many other scientific applications of XML have been developed (3, 4), and syntactically, either of these examples could be replaced by other such modularized markup languages. This more extensive example illustrates how a wider range of properties can be defined and also contains a new feature called a namespace. The purpose of this namespace is to enable this entire XML fragment to be combined or aggregated with other XML languages so that no conflict between the names used for the elements can arise. This aggregation is achieved by prefacing each element with a unique (to the document) short string: . An attribute xmlns:cml is now used to define what is called a URI (uniform resource identifier), which stamps a globally unique identifier on the meaning of the cml: prefix. This uniqueness will allow this datument to coexist with other XML languages without conflict (an example of which is described later in this chapter).

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

Chapter 8: Markup Languages and the Datument



93

1.0/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3,25H,(H,11,12) 150-78-2 136

Scheme 8-2. A CML datument describing a property of aspirin.

The molecule element in this example contains five child elements: cml:metadata, cml:identifier, cml:atomArray, cml:bondArray, and cml:propertyList. Of these, cml:metadata, cml:atomArray, and cml:bondArray have no children and are empty containers, defining only attribute/value pairs, whereas cml:propertyList has one child, cml:property. The latter itself has a child: cml:scalar. As well as the namespace, the cml:molecule element itself has two other attributes, id and title. The and elements reference namespaces other than CML. This is done to facilitate aggregation with other XML components, such as dictionaries, and by this means to reduce what has been called “tag soup”. For example, defines a namespace for a dictionary reference called chem:mpt. Any processing software that might need to process a melting point property would be directed to this dictionary for further information on the semantics of this term. Similarly, the

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

94

➤ The ACS Style Guide

attribute units=“unit:c” would handle the conversion of scientific units, and dataType=“xsd:decimal” would handle the basic datatype (i.e., the definition of a decimal number) itself. This mechanism avoids overburdening CML itself with the need to specify such semantics. No other elements or attributes in this example have XML-defined semantics; all other semantics are imposed by CML itself. Thus, the CML schema defines an enumeration (list) of allowed elementTypes and defines their meaning, use, and boundaries. These aspects are discussed in more detail below.

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

The Use of Identifiers The examples of XML shown in Schemes 8-1 and 8-2 illustrate the use of two types of identifiers. The identifier attributes seen in, e.g., are used internally to enable specification of, e.g., and should be unique within the datument (but not necessarily globally) to ensure that this XML document is well-formed and valid. The second type is an element containing identifiers. Various identifiers could be used, such as SMILES, CAS-RN, and the InChI canonical identifier (as shown here), precisely derived from the molecule connection table and used to establish molecular global uniqueness. Any two datuments that contain the same InChI identifier (in this example, C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/ h1H3,2-5H,(H,11,12)) should be presumed to refer to the same molecule (in the sense of a connection table, but not necessarily other properties, such as 3-D coordinates, for example). The whole aspect of identifiable data is pivotal to the concepts used here. The CAS-RN is widely used and offers comprehensive coverage from simple molecules to polymers to Markush structures. The InChI can be derived from the structure, but it is still a fairly new standard. It does not yet have wide support and does not yet offer complete coverage of all materials.

Display of XML and Specific XML Languages The default way of “displaying” or “browsing” any XML-compliant language, such as CML, is as a so-called tree view, outlining the structure of the document (Figure 8-1) but to which no style has been applied. Most modern Web browsers will support this feature (we recommend Firefox). Of more utility is to associate a specific style or transform with this datument. A technology known as XSLT is essentially a specification of how an XMLbased datument might be transformed into a different representation (or subset) of the data. Four examples of how this might be used to transform this datument are listed below: • Extraction of atom two- or three-dimensional coordinates and rewrapping with appropriate syntax for interactive display on screen using appropriate

In The ACS Style Guide; Coghill, A., et al.; The ACS Style Guide; American Chemical Society: Washington, DC, 2006.

Chapter 8: Markup Languages and the Datument



95

(A) +

Downloaded by UNIV OF MINNESOTA on April 16, 2013 | http://pubs.acs.org Publication Date: June 1, 2006 | doi: 10.1021/bk-2006-STYG.ch008

(B) - + 150-78-2 +

(C) - + 1.0/c9h8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h1H3, 2-5H,(H,11,12) 150-78-2 -