AC Webworks: An XML Model for Analytical Instrument Data

Don Kuehl. Anal. Chem. 2003, 75 (5), pp 125 A–127 A.

An XML Model for Analytical Instrument Data

Could a new file format bring us closer to true compatibility and archiving of data?

Don Kuehl

Think about the oldest file on your computer. The likelihood is that it is not very old if you are working on any late-model PC or Mac. In fact, WordPerfect 4 files from the late 1980s cannot be opened by some of today's widely used word processors, even though the world of word processing is relatively compatibility-conscious. If this is the case with word processor files, imagine how much more complex the problem is in the sciences, where file formats and hardware are going the way of the dodo. If the computer platform or operating system used for an instrument is no longer supported or operational, organizations may be left with vaults of archived data and no means of access. For example, satellite data from the 1970s, which could help with deforestation and land-use analysis, cannot be read by current hardware platforms.

A clear need exists for a standard file format that not only satisfies the requirements of all the different analytical data types but also facilitates their exchange and long-term archiving. The apparent support for a public-domain, platform-neutral standard, such as one based on eXtensible Markup Language (XML), by the U.S. Food and Drug Administration (FDA) and other regulatory authorities has encouraged the pharmaceutical and chemical industries to evaluate data standardization's potential. This has spurred an examination of current file formats and the development of new formats, including an industry-specific application (known as a schema) of XML called Generalized Analytical Markup Language (GAML).

Legacy problems

There have been several attempts to address the lack of instrument file format standardization over the past 10–15 years. Efforts by standards bodies such as the International Union of Pure and Applied Chemistry, the Analytical Instrument Association, and the American Society for Testing and Materials have generally focused on individual measurement techniques and data interchange, but not on long-term storage. Ideally, standardized data must be readable in any hardware and software environment, and current standard data formats fall short of this. Furthermore, published data formats such as AnDI and JCAMP cannot represent more complex experiments, particularly multidetector systems in which the relationships between the data are important for interpretation.

The FDA, although not directly involved in file format development, has opened the door for such data formats by not insisting on exact copies of electronic records. The key disadvantage of preserving electronic records in their raw state is that they must be retrieved in their original computing environment. To keep all instrument data accessible for the long term, organizations are obliged to retain and maintain legacy hardware, operating systems, and instrument systems. Instead, the FDA changed the language in section 11.10(b) of 21 CFR Part 11 to indicate that "accurate and complete" copies of electronic data must be provided. This allows an instrument's data to be saved in a format other than the original instrument file format, opening the way for standardized file formats to preserve data.

FDA's new guidance

On Sept. 5, 2002, the FDA published the latest in its series of Guidance for Industry on 21 CFR Part 11 documents (Docket 00D-1539), which focuses on the maintenance of electronic records for the duration of the record's life span. It defines two acceptable approaches to achieving compliance in this regard.

The first is the "time capsule" approach, which involves the preservation of the exact computing environment in which the data were acquired and processed. The document concedes that this approach is only viable as a short-term solution because of cost and technology advancement.

The second is the "data migration" approach, which involves the translation and migration of data as computer technology evolves. This is proposed as a viable long-term strategy. Point 6.2.1.4 states: "In the migration approach, the new computer system should enable you to search, sort and process information in the migrated electronic record at least at the same level as what you could attain in the old system (even though the new system may employ different hardware and software)."

XML benefits

XML was originally created so that richly structured documents and representations of any type of data could be exchanged among computing devices, applications, and websites. And because XML is now recognized as an industry-standard format for data interchange, it should persist for many years to come, regardless of the evolution of operating systems and computer hardware.

XML also has other qualities that make it attractive for long-term storage of and access to instrument data. First, XML uses ASCII as the storage mechanism; ASCII is fundamentally "human readable" and, therefore, has longevity—a characteristic required by regulators for an archive format. ASCII files also migrate easily to any operating system, hardware platform, or storage medium. Second, XML is based on a public-domain standard controlled by an independent body, the World Wide Web Consortium (W3C). Third, XML can describe and share complex data structures via public-domain schemas, which can encapsulate binary information to maintain numerical accuracy and precision using standard ASCII characters. XML is also fundamentally designed to be "extensible" while maintaining complete backwards compatibility.

Using the XML format for analytical data presents some problems, many of which have already been addressed by other organizations. For most analytical instruments, after analog-to-digital conversion and possible preprocessing, the measured data are stored as binary values (single-precision [32-bit] or double-precision [64-bit] floating point values). Since XML is based on ASCII, it is tempting to represent each measured value by its ASCII representation (e.g., "3.1416"). The problems with this are twofold. First, to ensure an accurate representation of the data, approximately 7 significant digits must be used for single-precision values and 15 for double precision, which more than quadruples the size of the stored value compared with its size in the original binary format. Second, when using ASCII, the programmer is forced to write the ordinate value to its full stored precision. A solution to both problems is to encode the binary data into standard ASCII characters using Base64. The W3C has specifically added support for data types to solve exactly these kinds of problems, and Base64 encoding is the same standard used for sending binary e-mail attachments over the Internet.

Another difficulty is multidimensional data. A simple experiment may consist of a single detector measuring a set of values as a function of time, frequency, temperature, or other variables, but more complex experiments may involve multiple detectors, such as an LC/MS/photodiode array (PDA) system; in addition, imaging experiments, multidimensional NMR, and MS may involve higher orders of dimensionality. Not only must the schema support n dimensions of data, but it must also preserve the relationships between the dimensions. Fortunately, XML provides a mechanism for establishing links between data. For example, in an LC/MS/PDA experiment, a relationship must be established between the elution time and the spectra measured at that time. A link can be specified between the chromatogram elution time and the corresponding MS and PDA spectra for that time. This concept is similar to the bookmarks within an HTML webpage, such as "back to top". Additionally, linking can establish relationships between any number of simultaneous detector measurements.

Some laboratory experiments use higher orders of dimensionality that must also be handled by the same data format. For example, the number of dimensions in some NMR and MS experiments can be arbitrarily large, and this must be taken into account. This is accomplished by defining a "coordinate" element that holds an array of values identifying the location of the measured value(s) in an additional coordinate system. For each additional dimension, an additional coordinate element is used, allowing an arbitrarily large number of dimensions to be represented.
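The Base64 approach described above can be sketched in a few lines of Python (the article proposes an XML format, not code; the `values` element and its `format` and `byteorder` attributes here are illustrative stand-ins, not the published GAML schema). The floats are packed into their raw IEEE 754 bytes and Base64-encoded into plain ASCII, so the XML file stores them exactly, with no loss from a decimal rendering:

```python
import base64
import struct
import xml.etree.ElementTree as ET

# Measured ordinate values (double precision, 64-bit floats).
values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# Pack the floats into their raw IEEE 754 binary representation
# ("<" = little-endian, "d" = 64-bit double), then Base64-encode
# the bytes so they survive as plain ASCII inside an XML document.
raw = struct.pack("<%dd" % len(values), *values)
encoded = base64.b64encode(raw).decode("ascii")

# Embed the encoded data in an illustrative (hypothetical) element.
trace = ET.Element("values", {"format": "FLOAT64", "byteorder": "INTEL"})
trace.text = encoded
xml_text = ET.tostring(trace, encoding="unicode")

# Decoding reverses the steps and recovers the values bit-for-bit.
decoded_raw = base64.b64decode(ET.fromstring(xml_text).text)
recovered = list(struct.unpack("<%dd" % len(values), decoded_raw))
assert recovered == values  # exact, not approximate, equality
```

Note that Base64 inflates the binary data by only about a third (24 bytes of doubles become 32 ASCII characters), far less than the roughly fourfold inflation of full-precision decimal text, while guaranteeing an exact round trip.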


GAML

To extend XML for use in new fields, schemas need to be developed and standardized so that the programs and people in those fields can understand the information and can create and use the data in these files. GAML was developed to take advantage of the attributes of XML in handling analytical data. The schema and additional information on GAML can be downloaded from www.gaml.org.

One of the primary methods being used to extend XML to handle analytical data is to develop rules for describing stored parameters with metadata. Early attempts at developing standard file formats mapped instrument settings and parameters from different vendors' instruments into a common dictionary. This type of dictionary contains information, called metadata, about what the values in part of a file are and where they come from. However, fixed dictionaries have not worked well, and it is actually impossible to develop and maintain a complete dictionary of descriptive tags for all possible parameters from the wide variety of current and future instrumentation. Therefore, when developing an analytical archiving format, it makes more sense to keep the dictionary as small as possible and devise a way to store virtually any parameter in the file, regardless of its relevance to any other data system.

Rather than creating a complex dictionary to handle metadata from the various instrument vendors, types, and software implementations, GAML uses a single generic element for all metadata items. The element uses a series of attributes to describe the origin of the parameter and assign meaning to it for a human reading the file. In this manner, software systems reading a GAML file are not required to interpret the data but merely to display the contents and the identifying attributes. It is then up to the person using the software to view the file (or reading the file itself) to interpret that parameter's meaning in the context of the data and the system used to collect it.

The ability to access, review, and explore any instrument data format at any workstation, without relying on the original computing platform, would be a major breakthrough. It would simplify an organization's ability to comply with regulatory requirements, such as 21 CFR Part 11, and facilitate genuine knowledge management across global enterprises, such as those evolving in the pharmaceutical market. By adopting a public-domain XML schema such as GAML, the scientific community will move one step closer to reaping the full rewards of its achievements.

Don Kuehl is the president and co-founder of Thermo Galactic, a supplier of software for spectroscopy and chromatography data analysis.
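The generic-metadata idea from the GAML section above can be sketched as follows (Python; the `parameter` element and the `name`, `group`, and `label` attributes are hypothetical stand-ins, not the published schema). The point is that reading software never interprets the parameters: it only lists each item's identifying attributes and raw value for a human to judge in context:

```python
import xml.etree.ElementTree as ET

# Hypothetical GAML-style fragment: one generic element per metadata
# item, with attributes identifying the parameter for a human reader.
doc = """
<experiment>
  <parameter name="FlowRate" group="Pump" label="Flow rate (mL/min)">1.00</parameter>
  <parameter name="ColumnTemp" group="Oven" label="Column temperature (C)">40</parameter>
  <parameter name="Wavelength" group="Detector" label="Detection wavelength (nm)">254</parameter>
</experiment>
"""

def list_parameters(xml_text):
    """Display metadata without interpreting it: return each
    parameter's identifying attributes and its raw stored value."""
    root = ET.fromstring(xml_text)
    return [(p.get("group"), p.get("label"), p.text)
            for p in root.iter("parameter")]

for group, label, value in list_parameters(doc):
    print(f"{group}: {label} = {value}")
```

Because the reader needs no vendor dictionary, the same few lines handle parameters from any instrument, which is exactly the small-dictionary design argued for above.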
