A: The Color of Data - Analytical Chemistry (ACS Publications)

A: The Color of Data. Raymond E. Dessy. Anal. Chem., 1998, 70 (21), pp 738A–740A. DOI: 10.1021/ac982030e. Publication Date (Web): June 2, 2011...
A/C

WebWorks

The Color of Data

Raymond E. Dessy, Virginia Tech

The modern WebLab may ask, "What is the color of raw data?" (1) In regulated industries, which are legally obligated to store data in its original form, that color may be "first-collected". But modern Intranets, compound documents, multisource spreadsheets, and databases require other colors, other formats. This complex spectrum makes interesting demands on users, applications software, and the transaction systems involved at various levels. This column examines trends driving the future of data formats. Additional online pages available as supporting information (http://pubs.acs.org/journals/ ac) further develop these themes. The materials include interviews with chemists familiar with the problems inherent in sharing data. Their comments may help you decide what is best for your lab.

Intuitively and from traumatic experience, the scientist recognizes that mixing data from different instrument types and vendors requires some juggling and changes in the data format. To better understand and control their lab environment, analytical chemists need to address the following questions: Why do I need to change data formats? How do I do so with greatest safety? Do data format changes create problems with certification and validation?

Why and how: Analytical labs are rushing to place spectra on an Intranet for distribution. Structural and analytical information is being directed into institutional databases or spreadsheets so the data can be "mined" by everyone. At the core of this revolution are tools with names such as middleware, objects, and metafiles. Middleware sits in between the netware/operating system and the application software on your Web server. Middleware makes it possible for users to get at the data they need regardless of differences in platforms, operating systems, and protocols, while providing information-service personnel with the controls they need to protect the quality and security of that data. Middleware provides the consistent interface, the single dashboard that makes driving the information superhighway easy.

Several vendors provide middleware-based software that combines various data types and performs format interchanges useful to the analytical chemist. Advanced Chemistry Development's program Ilab is a 2-D/3-D molecule sketch pad and a unique physical property and spectrum predictor (1H and 13C NMR). The company's SLIMS is a spectrum-oriented LIMS, a data management system designed to analyze and track spectral information and to incorporate chemical structures into the databases. ChemFolder is a personal structure/spectra/text database. These programs provide connectivity to other popular molecular graphics packages. Mantra Software offers NuGenesis, which coalesces reports and graphics from a large variety of analytical tools for Internet distribution. Galactic Industries' Grams/32 provides similar connectivity as well as enhanced data manipulation programs.

Formats: A plethora of formats (such as mol, 2sk, gif, jpg, 3D, and raz) often make interchange of molecular graphics a difficult experience. Strip-chart-like data have their own set of formats (jcamp, andi, ascii, csv, and dif) that present similar challenges. Each type of file has strengths and weaknesses. Are you interested in conciseness, complete instrument setup parameters, resizing, printer compatibility, or commonality? Do you deal with multicolor shaded images, B/W line art, or 3-D molecular CAD?

How does a program tell the format of a standard graphics file? It just reads the first few bytes. GIF files start with the bytes "474946383761h" or "474946383961h" (the ASCII strings "GIF87a" and "GIF89a"). JPEG files start with "ffd8ffh". TIFF files start with "4d4d002ah" or "49492a00h". Why two forms for TIFF? One is for the so-called big-endian files (2), which

Analytical Chemistry News & Features, November 1, 1998

Example of objects embedded in text.

have the most significant byte stored first (at the lowest address). The other is for little-endian files, which have the least significant byte first. Because the order of the bytes is switched, one form would read "swap each byte", while the other reads "paws hcae etyb". The two conventions arose because of differences in hardware auto-increment addressing modes and software architecture, and programmer prejudices.

Objects and things: Today's chemists have an opportunity to merge and share text, data, and graphics in documents, spreadsheets, databases, and slide presentations. To a large extent, the chemist is shielded from the tools that make all this possible. But let's take a look at what happens when you paste an image into a document. First, your word processor has to allocate room on the page for the object. Then it either links the reserved space to areas that contain the graphic data or embeds information about the data and its graphics driver in the text. This is object linking and embedding (OLE). Page descriptor language scripts (also called page-description language, page-definition language, or PDL scripts) are often included to generate printed page layout. A control section in the object interfaces with the applications program via a messaging interface. An object request broker (ORB) "makes the deal" and facilitates transaction processing. When you respond to the screen query "Do you wish to register this program OLE component now?", your operating system builds the structures that can implement such interaction. You can even embed spreadsheet cells in a document and have them automatically updated when the spreadsheet is altered. To do this on Windows/Intel platforms, look for the link, embed, or paste special commands or look under help for OLE.

A few years ago, competition was fierce between Microsoft's proprietary Object Linking and Embedding (OLE), coupled with its Component Object Model (COM and Distributed COM, COM+), and the consortium-based OpenDoc and Common Object Request Broker Architecture (CORBA). The advent of Microsoft's Transaction Server (MTS), which helps unify the COM and OLE worlds and delivers compound document "plumbing", provided developers with an easy way to build simple server applications for lab [...] (the front end). Thus, COM has become the de facto standard. But it may not be the best choice for the widely distributed enterprise-level systems that span entire organizations (the back end). They often deal with much traffic and with large, diverse databases and platforms. Whether COM is solid and secure enough to be scaled for such systems is in question. This may be a case in which robustness has been sacrificed for quick, inexpensive software development, and many companies are considering other alternatives, such as multilingual ORBs, for their distributed systems.

In such environments, distributed transaction coordinators span multiple computers and resources. Data transformation services are designed for moving data between disparate sources. Microsoft has its sights set on becoming a major force in data warehousing. So have Sun, Oracle, Sybase, and Inprise. Sales, suits, and strategies will determine the outcome.

Metafiles: Metadata contain information about information. These can refer to identification, description, keywords, content, lab location, and so on. A metafile is a list of commands that can be played to draw a graphic. The Computer Graphics Metafile (CGM), which stores images, is an example. Windows Metafile (WMF, Windows 3.xx, 16 bits) and Enhanced Metafile (EMF, Windows '95/Office '97, 32 bits) are proprietary formats that include extensive manipulation of both raster and vector images. Because it is easy for Windows application programs to support WMF/EMF, most of them accommodate it. But historically, some features were lacking in WMF. Things such as scaling information, device independence, or PostScript function were often weak or absent. This led to a host of "new" third-party metafile architectures, such as the Aldus/Adobe/PostScript/PDF series, and to a multitude of shareware options. In fact, conversion packages abound, permitting users to migrate from one format to another. Some standardization seems to be coming. Will it be driven by front- or back-end needs and desires?

Data Problems: In many companies there are often two or more copies of the same data in different formats. One is the sacrosanct, FDA approved, "first-collected" raw data form. As this is exchanged with other users and employed for other purposes, transformations are made for compatibility with Internet protocols, application program needs, and better viewing. But that process can be complex. For example, if you had five format sources and a corresponding set of five destination types, you'd need 25 direct converter packages, one for each source-destination pair. On the other hand, if a single common intermediate format were involved, you'd need only 10: five into the intermediate format and five out of it.

Then there is the question of converting the data back again. Not all conversions can run in both directions, and systems houses have often used their own proprietary format as the intermediary, adding another hurdle to the process. But that might be changing, as market pressures point to an industry standard based on some popular middleware.
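The converter arithmetic in the five-by-five example generalizes. With n source formats and m destination formats, direct conversion needs one package per source-destination pair (n × m), while routing everything through a single common intermediate format needs n importers plus m exporters (n + m). The Python sketch below simply tabulates the two counts; the function names are illustrative and are not taken from any vendor's software.

```python
def direct_converters(n_sources: int, n_destinations: int) -> int:
    """One dedicated converter for every (source, destination) pair."""
    return n_sources * n_destinations


def hub_converters(n_sources: int, n_destinations: int) -> int:
    """One importer per source plus one exporter per destination,
    all passing through a single common intermediate format."""
    return n_sources + n_destinations


if __name__ == "__main__":
    # Compare the two strategies as the number of formats grows.
    for n in (2, 5, 10):
        print(n, direct_converters(n, n), hub_converters(n, n))
```

For five sources and five destinations this gives 25 versus 10, and the gap widens as formats are added. The sketch counts packages only; it says nothing about whether a given conversion can be reversed.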

On top of that, the regulated industries must contend with the added constraint of having the converted files "identical" in content to the original. The meaning of the word "identical" is, in legal and regulatory matters, a bit fuzzy. The look and feel of the transformed files are not the same as the original, because the purpose of changing the files is often to give different segments of the lab community different views of the data. And, unfortunately, running a transformed file through a conversion routine may not perfectly restore the original format. This is where some concerns have arisen. Is it acceptable to have multiple files representing the same data? What does the FDA want? What will it accept?

Many regulations, such as the FDA's rule in the Code of Federal Regulations on digital signatures and electronic records (21 CFR Part 11), are vague. 21 CFR Part 58, which deals with good laboratory practices, contains one definition of "raw data" used by the FDA and industry: "Raw data means any laboratory worksheets, records, memoranda, notes, or exact copies thereof that are the result of original observations and activities of a nonclinical laboratory study and are necessary for the reconstruction and evaluation of the report of that study. In the event that transcripts of raw data have been prepared, the exact copy or exact transcript may be substituted for the original source as raw data. Raw data may include photographs, microfilm or microfiche copies, computer printouts, magnetic media, including dictated observations, and recorded data from automated instruments." But a more frightening interpretation of the term exists: that the only acceptable "raw data" will be the first captured electronic format.

The Need: The Web has transformed our world, and it requires the transformation of our data. Chemists can see new data in new ways, squeezing out all the information it contains. Old data can be digested easily by both classical and more modern


chemometric techniques, eliminating the frustration of old, obsolete, and dysfunctional instrument workstations, yet inexpensively adding to our knowledge. More and more Web pages and graphs are generated dynamically, on the fly, at the server. But differences of opinion exist about what is acceptable. Two interviews, available online as supporting information, provide a vendor's and a user's view of the dichotomy. The first chemist, who is with a major systems house, supports the use of middleware. The second, who is with a pharmaceutical firm, has concerns about that solution. There are also URLs leading to a serpent's coil of regulatory legerdemain. While scientists in basic research are free to enter the new world of scientific data visualization, those in the regulated industries inhabit a world where the government, unable to control encryption directly, waffles over format decisions [...] and to succeed anyway. "'Write that down,' the King said to the jury," in Lewis Carroll's classic book, "and the jury eagerly wrote down all three dates

http://www.microsoft.com/com/
http://www.microsoft.com/msdn/news/feature/010598/mts/
http://www.microsoft.com/com/mtsfaq

on their slates, and then added them up, and reduced the answer to shillings and pence." Somehow, it made sense to Alice, from inside Wonderland.

Thanks to Mike Starling of Union Carbide and Ching-Wan Yip of Wake Forest for valuable input. Comments are invited at http://www.chem.vt.edu/chem-dept/dessy/internet.

Middleware-based programs:
http://www.acdlabs.com
http://www.mantrasolt.com
http://www.galactic.com

Middleware and file formats:
http://www.sybase.com/inc/sybmag/quarter4_96/entwpcompt/middleware/middleware.html
http://www.landfield.com/faqs/graphics/fileformats-faq/part1/preamble.html

Object embedding:
http://www.idg.net/idg_frames/english/content.cgi?vc=docid_9-4ol / 9.html
http://www.bell-labs.com/~emerald/dcom_corba/Paper.html

Metafiles:
http://www.cs.uu.nl/wais/html/na-dir/graphics/fileformats-faq/.html
http://www.agocg.ac.uk/Graphics/CGM/cgm.html
http://www.gentech.com/emf/win95emf.html
http://www.companionsoftware.com/
http://www.companionsoftware.com/PR/WMRC/WindowsMetafileFaq.html
http://www2.echo.lu/oii/en/raster.html#GIF

Regulations:
http://www.fda.gov/cder/esig/plllflnr.pdf
http://vm.cfsan.fda.gov/~lrd/cfr58.html

(1) Title inspired by The Color of Water, by James McBride. Riverhead Books: New York, 1996.
(2) From Jonathan Swift's Gulliver's Travels. The Big-Endian and Little-Endian parties debated over whether soft-boiled eggs should be opened at the big end or the little end.
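A sketch of the byte-signature check described under Formats, and of the byte-order distinction in note (2), may be useful. The Python below looks at the leading bytes of a buffer and reports GIF, JPEG, or TIFF, and for TIFF it decodes the two bytes that follow the byte-order mark. The names `sniff_format` and `tiff_version`, and the returned labels, are illustrative assumptions rather than part of any package mentioned in this column.

```python
import struct

# Leading-byte signatures for the three formats (hex shown in comments).
SIGNATURES = [
    (b"GIF87a", "gif"),                       # 47 49 46 38 37 61
    (b"GIF89a", "gif"),                       # 47 49 46 38 39 61
    (b"\xff\xd8\xff", "jpeg"),                # ff d8 ff
    (b"MM\x00\x2a", "tiff (big-endian)"),     # 4d 4d 00 2a
    (b"II\x2a\x00", "tiff (little-endian)"),  # 49 49 2a 00
]


def sniff_format(data: bytes) -> str:
    """Return a label for the first matching signature, else 'unknown'."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def tiff_version(header: bytes) -> int:
    """Decode the 16-bit number that follows the TIFF byte-order mark.

    'MM' means most significant byte first ('>' in struct notation);
    anything else is treated here as 'II', least significant byte first.
    """
    order = ">" if header[:2] == b"MM" else "<"
    return struct.unpack(order + "H", header[2:4])[0]
```

Both TIFF headers decode to the same number, 42, once the byte order is taken into account, which is why a reader has to check the first two bytes before interpreting anything else in the file.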
