Chapter 17

Electronic Data Archiving: Ensuring Accessibility, Durability, and Usability

Downloaded by EAST CAROLINA UNIV on January 4, 2018 | http://pubs.acs.org Publication Date: August 1, 2002 | doi: 10.1021/bk-2002-0824.ch017

Edward J. McDevitt
DuPont Crop Protection, DuPont, Wilmington, DE 19880-0038

The amount of scientific data generated today is growing at an ever-accelerating rate. There is a desire, a need, and a requirement to collect and maintain these data in a readily accessible and tamper-proof way that ensures a high degree of integrity over an indeterminate number of years. The technology used to collect, store, and retrieve data is changing at an equally accelerated rate. This chapter describes the challenges of long-term data storage and profiles the strategies necessary for maintaining an electronic data archive.

Introduction

History is full of examples of human beings trying to preserve data, information and knowledge for future generations. The ancient libraries at Alexandria, the Dead Sea scrolls, oral storytelling, paintings on cave walls, the stained glass windows of the Middle Ages, monastic scriptoriums, and the National Archives of governments around the world are evidence of this need. Each of these examples is different enough to demonstrate the problems inherent in the method of archiving used, be it languages no longer spoken, transcription errors, fragile media, media that are not portable, or sheer volume. These examples also demonstrate that archiving information cannot be a one-time event for a given set of data. It is a process that needs to be managed for as long as the data, information and knowledge are believed to have value. Failure to set up such a process will result in lost data, information and knowledge.


© 2002 American Chemical Society

Garner et al.; Capturing and Reporting Electronic Data ACS Symposium Series; American Chemical Society: Washington, DC, 2002.


Similar issues exist in the electronic age. It would be hard to argue that the Information Technology world has not had a positive impact on science and the business world. The ease of electronic data creation and collection has opened new ways to model and solve complex problems to a level of precision never before imagined. The breadth and depth of Information Technology capabilities continue to grow and expand. Information Technology is pervasive in most things we do. It is not, however, without challenges to the preservation and the accessibility of data, information and knowledge for future generations.

Software and hardware become obsolete. New products and new versions of existing products are released regularly. People and organizations rush to embrace the promise of new functionality, ease of use and performance, often giving little thought to the data stored in the current systems. The persistent reality is that technology changes will always be with us. A process to manage this change is necessary for the successful preservation of data, information and knowledge for future generations, for maintaining a high degree of integrity of such data, and, where necessary, for legal defensibility. John Carlin, Archivist of the United States, summarizes, "Electronic records pose the biggest challenge ever to record keeping in the Federal government and elsewhere. There is no option to finding answers... the alternative is irretrievable information, unverifiable documentation, diminished accountability, and lost history."

Definitions

Basic terms for records management are defined in 36 CFR 1220.14. The following are found in 36 CFR 1234:

• Database - a set of data, consisting of at least one data file, that is sufficient for a given purpose.
• Data base management system - a software system used to access and retrieve data stored in a database.
• Data file - related numeric, textual, or graphic information that is organized in a strictly prescribed form and format.
• Electronic information system - a system that contains and provides access to computerized records and other information.
• Electronic record - any information that is recorded in a form that only a computer can process and that satisfies the definition of a Federal record in 44 U.S.C. 3301.
• Electronic record keeping system - an electronic system in which records are collected, organized, and categorized to facilitate their preservation, retrieval, use, and disposition.
• Text documents - narrative or tabular documents, such as letters, memoranda, and reports, in loosely prescribed form and format.

Strategic Components

Three strategic components of a successful electronic data archiving process are:

• a strong Records Management program
• a mature Information Technology (IT) Life Cycle Management program
• an organization committed to the principles of these programs

All these components must work in concert with one another.

To be successful, an electronic data archiving process needs a well-developed Records Management program that defines the rules by which records and documents are governed from creation to disposition. Records are classified by type, such as research records, personnel records, tax and financial records, etc. Each record type has a defined retention time. The preamble for DuPont's Corporate Records and Information Management says:

"Proper records management is an important function of every successful corporation. An effective records management program ensures that all records that are required to conduct the business of the corporation, to fulfill its legal responsibilities, and to support its tax liabilities are maintained and available.

"An effective Records Management program also preserves the corporate memory and protects the corporation by ensuring compliance with local and federal laws.

"Significant costs are associated with the creation, maintenance, distribution, and storage of records. Therefore stewardship must be exercised..." - DuPont Corporate Records and Information Management preamble.

To be successful, an electronic data archiving process must also have a well-developed IT Life Cycle Management program that defines the rules governing the four life cycle phases: introducing new technology, mainstreaming technology, containing technology and retiring technology. Retiring technology requires decisions as to what data and functionality continue to migrate forward, based on the rules set out in the Records Management program. It is often a handoff to technology in the introductory phase. This process presents an opportunity to reassess the enduring nature of the data stored in the retiring technology.

Laboratory technology, for example analytical equipment, is not traditionally thought of as a component of Information Technology. In R&D and manufacturing, it needs to be a defined part of IT Life Cycle Management for those devices that generate, store, transmit or render data. Analytical instruments today are fully IT-enabled. They have PC controllers, processors, and disk drives. They have access to the network and on-board software for data collection, reduction and rendering.

Generally speaking, IT Life Cycle planning today is often project-, reaction-, or necessity-based rather than based on a well-maintained master plan. This is not necessarily bad. Projects are sponsored by the local organization, and hence project teams are closer to where the needs and recordkeeping rules are defined. However, the project teams need to understand the technology directions of the larger organization. This ensures that the proper infrastructure is in place to support the production system.

The following are examples of technology changes in industry. Macintosh computers were the tools of choice for many years in R&D environments. Industry convergence and corporate policies moved many organizations to Windows PCs. Similarly, email systems changed. Some moved from All-in-1 to Lotus Notes or Microsoft Outlook. Relational databases changed, with many choosing Oracle or Microsoft Access. File and print servers also changed, with many organizations moving from Novell to Windows NT. Each of these technology transitions offers a different set of capabilities and limitations. For example, fonts available in Microsoft Office for the Macintosh may not be available in Microsoft Office for Windows. In addition, software available on the Macintosh may not be available on the Windows platform.
Each of these technology changes required careful planning and project management to ensure no disruption to the organization and no loss of data.

There is also a tactical component of IT Life Cycle Management: physical media management. Tapes, disks, and other electronic media degrade over time. They need to be refreshed every ten years. Ideally, this is part of the standard operating procedures for the data centers and archive facilities.

Finally, an electronic data archiving process can only be successful in a committed organization with ethical individuals supporting the Records Management and IT Life Cycle Management programs. Non-IT members of the organization need to assume stewardship roles over the data, information and knowledge generated by their organizations. Senior management must support the enforcement of these programs and must understand the need to track evolving government regulations in the area of records management. After all, electronic data, information and knowledge are valuable organizational assets.


Managing Accessibility

Accessibility is defined as the ability to locate and use data, information and knowledge known to exist in the organization. In the ideal world there would be a master index containing pointers to all data and information stores in the organization. This is typically an unrealistic expectation. The "means and will" required to maintain such a repository are high. Most organizations manage the "master index" within small groups with varying degrees of formality. Documents and data required for regulatory or patent purposes are often given very formal attention. Different organizational requirements will dictate the rules on accessibility.

Availability is an aspect of accessibility. The organization may need anytime and anywhere access. Speed of access to historic data is sometimes a requirement. Recall times can vary from hours to days or weeks. The requirements will most likely depend on the type of data being requested, and they carry cost implications. Online storage is most convenient, but as repositories of data grow, system performance may be affected. Hardware and software must be scalable to accommodate such potential growth. Data centers typically charge a premium for such ready access. Off-line storage on tapes or CDs is often less costly but carries with it the latency of having to retrieve and load the data set. Off-line may also mean off site.

Fundamentally, however, those records that are deemed to have enduring value and need to be preserved and available must be identified and planned for in advance. Reacting later to technological change could jeopardize access. Disaster recovery plans and appropriate levels of system redundancy also help ensure access to data, information and knowledge.

Managing Durability

Storage media age. It is necessary to refresh the physical media about every ten years. Storage media options also change. Eight-inch and five-and-a-quarter-inch diskettes are hard to come by. Similarly, magnetic tapes and tape drives change, requiring the transfer of stored information to new media types. One way to address this problem is to maintain outdated equipment. This, however, simply delays the inevitable need to migrate. Maintaining old storage equipment is arguably just as expensive as migrating to new media and new equipment. Parts and service become scarce and expensive.

The procedures used to transfer from media to media, or from one media type to a new media type, must be validated and QA'd to ensure accuracy and reliability in the new copy. Backup and recovery procedures may also need to change. The Records Management principles require the destruction of the old copy once the new copy has been successfully created.

What is true of the storage media is also true of the software used to store and access the data. Software versions become out of date and are no longer vendor supported. People with the skill sets necessary to support the software are expensive and become difficult to find. Migration to new versions becomes a necessity. The new software version and the migration plan need to be validated and QA'd from a data preservation and functional need perspective. In May 1996, the US Task Force on Archiving Digital Information reported that (1):

"Neither 'refreshing' nor emulation sufficiently describes the full range of options needed and available for digital preservation. Instead, a better and more general concept to describe these options is migration.

"Migration is a set of organized tasks designed to achieve periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation. The purpose of migration is to retain the ability to display, retrieve, manipulate and use digital information in the face of constantly changing technology. Migration includes refreshing as a means of digital preservation but differs from it in the sense that it is not always possible to make an exact digital copy or replica of a database or other information object as software and hardware change and still maintain the compatibility of the object with a new generation of technology."

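The requirement in this section that a media-to-media transfer be validated before the old copy is destroyed can be illustrated with a checksum comparison. The following is a minimal sketch, not a procedure from this chapter. It assumes the old and new media are mounted as ordinary file systems, it uses SHA-256 as the fixity check, and the function names `file_digest` and `refresh_copy` are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_copy(old: Path, new: Path) -> bool:
    """Copy `old` to `new` and report whether the copy is bit-identical.

    Hypothetical helper: the old copy is never deleted here. Destruction
    is left to the records-management procedure once this returns True.
    """
    shutil.copyfile(old, new)
    return file_digest(old) == file_digest(new)
```

A matching digest only shows that a refresh preserved the bits. A migration that changes the file format cannot produce an identical digest, so it would need content-level checks defined with the data stewards instead.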
Managing Usability

It is not enough to migrate only "raw" data - the characters, the numbers, the bits and bytes - forward to ensure usability. The metadata and the context for the application or database must also be migrated forward. Metadata is the key to the machine-stored bits and bytes. Metadata is the data about the data. It describes the data in the database. For example, it indicates that a field or column called LASTNAME exists and is 40 characters wide. The metadata indicates that this field is a secondary key to a table called TEST and that it is a required field. The additional system and user documentation further indicates that this is the last name of the experimenter performing the procedure.

The metadata documentation describes the method of data capture, the application used to access the data, security rules for the tables and columns, and other descriptive and procedural information. For derived or calculated data, it is important to know what algorithm or protocol was used. The documentation then becomes something else that needs to be preserved.
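The LASTNAME example above can be written down as a small machine-readable record that travels with the data. This is an illustrative sketch only. The class `FieldMetadata` and its attribute names are hypothetical and are not taken from any particular database product.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class FieldMetadata:
    """Descriptive metadata for one column (hypothetical structure)."""
    table: str
    name: str
    width: int
    required: bool
    key_role: str       # e.g. "primary", "secondary", or "none"
    description: str    # context taken from the user documentation


lastname = FieldMetadata(
    table="TEST",
    name="LASTNAME",
    width=40,
    required=True,
    key_role="secondary",
    description="Last name of the experimenter performing the procedure",
)

# Serialize to plain text so the description can be archived and
# migrated alongside the raw data it explains.
archived = json.dumps(asdict(lastname), indent=2, sort_keys=True)
print(archived)
```

Plain-text JSON is used here only because it is easy to read without the originating software. Any format the organization expects to remain readable would serve the same purpose.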


It is important to note that without the metadata in the above example the reader will only see a series of alphabetic characters. Without the entire described context associated with the data, it has no meaning.

An example of an issue that can arise when the old metadata does not cleanly carry forward is the use of a DATE field in an older research database. These data were generated in the 1930s. They were first recorded electronically in the 1960s in a system that did not have a modern date field. In this example, the date was simply recorded as month and year. This met the organization's needs and technical capabilities at the time. In the new database, the date is the full date of day, month and four-digit year. Another example is a NAME field in an older system that in the migrated future version is recorded as FIRST_NAME, MIDDLE_NAME, and LAST_NAME. Conversion rules need to be defined and documented. This involves the IT organization and the organization's stewards for the data being converted.

Additionally, if there are data quality problems, it is important to address them prior to archiving and migrating. The act of archiving, by itself, will not improve data quality. When the data are retrieved at some future time, it will be difficult, if not impossible, to address and correct data quality issues. If the data, information and knowledge are said to have enduring value, then the quality must be kept high throughout their retention period.
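The two conversion examples in this section, month-and-year dates and a single NAME field, can be sketched as explicit rules that are documented in code. This is a sketch under stated assumptions, not the rules used for any actual database. The legacy date format "MM/YYYY", the default day, the name-splitting rule, and the names `ConvertedDate`, `convert_month_year` and `split_name` are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConvertedDate:
    value: date
    day_assumed: bool  # True when the source held only month and year


def convert_month_year(text: str, default_day: int = 1) -> ConvertedDate:
    """Convert a legacy 'MM/YYYY' string (assumed format) to a full date.

    The legacy record has no day, so the rule fills in `default_day`
    and flags the result instead of silently inventing precision.
    """
    month_text, year_text = text.strip().split("/")
    return ConvertedDate(date(int(year_text), int(month_text), default_day), True)


def split_name(full_name: str) -> dict:
    """Split a legacy NAME value into FIRST_NAME, MIDDLE_NAME, LAST_NAME.

    Assumed rule: first token, last token, everything in between.
    Single-token names raise so that a data steward can review them.
    """
    parts = full_name.split()
    if len(parts) < 2:
        raise ValueError(f"cannot split name without review: {full_name!r}")
    return {
        "FIRST_NAME": parts[0],
        "MIDDLE_NAME": " ".join(parts[1:-1]),
        "LAST_NAME": parts[-1],
    }
```

Keeping the `day_assumed` flag with the converted value is one way to record that the new field is more precise than the original data. Whether a default day, a null, or a separate precision column is appropriate is a decision for the data stewards.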

Conclusion

The need to develop an electronic recordkeeping strategy is critical and fundamental to the success of retrieving archived data, information and knowledge that have enduring value for future generations. This strategy must be developed within the current Records Management process in partnership with both the Information Technology Life Cycle Management program and the stewards of the data, information and knowledge. The latter are the non-IT managers in the organization. Value determination is not the job of Information Technology; it is the job of the data, information and knowledge stewards. These managers are responsible for determining the rules by which the data will be selected, archived and retrieved. It is the job of Information Technology to maintain and execute these rules using the appropriate technologies over time.

It is important to set expectations on the archiving process. The organization needs to know what can actually be brought back and in what form. It needs to know to what level of completeness the data can be retrieved. It also needs to know how much confidence it can place in the retrieved data. Wherever possible, in the Information Technology selection process, establish the need for common usage rules among systems, compliance with record keeping standards, and technology that supports records management and migration.


Acknowledgements

I would like to thank my colleagues at DuPont for their help and input in bringing this paper together, in particular Iris Fisher, Manager, Corporate Records Management team; Aster Wu, Data Architect; Cecelia Smith, Information Management; Sandra Hughes, Application Specialist; and Bruce Lockett, Analytical Sciences. I would also like to thank the US National Archives and the Australian National Archives for their availability by telephone and for the wealth of information made available to the public.

Bibliography

1. Australian Government, "A Guide to the Metadata Fields of the Marine and Coastal Data Directory for Australia, Blue Pages", September, 1997
2. Kingma, Bruce, "The Cost of Print, Fiche, and Digital Access", D-Lib Magazine, February, 2000
3. Moore, Baru, Rajasekar, Ludaescher, Marciano, Wan, Schroeder, and Gupta, "Collection-Based Persistent Digital Archives - Part 1", D-Lib Magazine, March, 2000
4. Moore, Baru, Rajasekar, Ludaescher, Marciano, Wan, Schroeder, and Gupta, "Collection-Based Persistent Digital Archives - Part 2", D-Lib Magazine, April, 2000
5. National Archives of Australia, "Keeping Electronic Records", http://www.naa.gov.au/recordkeeping/er/keeping_er, March 1995
6. United States National Archives and Records Administration, "Electronic Records Management", 36 CFR Part 1234, last amended July, 1998
7. United States National Archives and Records Administration, "Fast Track Guidance Development Project", January, 1999
8. United States National Archives and Records Administration, "Transfer of Electronic Records", 36 CFR 1228.270
9. Waters, Garrett, "Preserving Digital Information", Commission on Preservation and Access, May, 1996

Products Mentioned

1. Microsoft Access, Microsoft Office, Windows, Windows NT are products and registered trademarks of Microsoft Corporation
2. Oracle RDBMS is a product and registered trademark of Oracle Corporation
3. Macintosh is a product and registered trademark of Apple Computer Inc.
4. Lotus Notes is a product and registered trademark of Lotus Development Corporation
5. All-in-1 is a product and registered trademark of Compaq Computer Corporation
6. Novell is a product and registered trademark of Novell Inc.
