Genomic Messaging System Language Including Command Extensions for Clinical Data Categories

Barry Robson* and Richard Mushlin

IBM Research, T.J. Watson Research Laboratory, Route 132, Yorktown Heights, New York 10598

Received August 20, 2004

This paper, in the area of clinical bioinformatics, highlights relatively efficient means of storing, exchanging, protecting, and searching human and other genomic data, so as to make the data securely accessible to researchers while respecting patient privacy. One important idea is that the Genomic Messaging System Language (GMSL) can be considered an extension of the way DNA and protein sequences are written, so that they carry with them the wishes of the patient in regard to fine-grained consent (as well as retaining the medical experts’ cautions, instructions for use, and annotation), whatever environment (e.g., XML) the data comes from or is going to. At the deepest level, a stream of data expressed in GMSL resembles a highly compressed stream of self-checking machine code. For the reader less familiar with the computational aspects, some simple examples illustrate how the raw language looks and works as a stream of (interpreted) bytes. The bioinformatics applications are not confined to the clinical domain. This paper completes the initial specification of the language as previously presented and reports on some important extensions, including clinical data categories.

Keywords: clinical genomics • bioinformatics messaging • Genomic Messaging System • language extension

1. Introduction

The material presented here is self-contained, because the kind of language described is somewhat unusual, although in principle quite general. However, the specific context in which it has been reduced to practice and developed is the Genomic Messaging System (GMS) as previously reported (ref 1). See also an earlier bioinformatics study (ref 2). GMS is not only concerned with storage and transmission of clinical and genomic information, but also provides facilities to invoke a variety of clinical, bioinformatics, and computational biology tools, and these are routinely used to demonstrate the value of the approach and to test new developments. They include tools already reported in this journal, i.e., for the modeling of weakly homologous and patient polymorphic proteins (ref 1), for assessment of fold quality in the absence of specific experimental data (ref 3), and for clinical and genomic data mining (refs 4-6). Genomic Messaging can be seen as the glue that binds such diverse tools together and, importantly, binds them to the source data in archives of patient records (this raises a standards issue which has been addressed; see Section 1.4). In the present report, we complete the full instruction set of the Genomic Messaging System Language (GMSL) for efficient storage, transmission, annotation, and management of DNA sequences, an instruction set which can be contained within the DNA sequences themselves. Tables 3-6 are new recommendations.

* To whom correspondence should be addressed. E-mail: robsonb@us.ibm.com.
10.1021/pr0498483 CCC: $30.25

 2005 American Chemical Society

Journal of Proteome Research 2005, 4, 275-299
Published on Web 02/03/2005

The earlier paper (ref 1) could be regarded as a specification which would allow a reasonable programmer to implement the concepts without needing to use the same codes “bit by bit”, i.e., in precise detail as to binary representation. However, while this may work, and indeed there may be several opportunities for improvement, the binary representations were not chosen lightly in regard to efficiency and error checking, and not least in regard to incorporating actual and proposed extensions of the language as discussed here. The present paper is thus a more detailed, “under the hood” description, describing all the bytes “bit by bit”. GMSL instructions and many other kinds of data can also be outside (around, before, or after) as well as inside a DNA sequence, and there is no absolute requirement to have DNA data present at all. A sequence of DNA data and/or instructions which is transmitted or stored is called the “GMS stream”, or just “stream” for short. To give some sense of how this stream can be constructed, and how it looks on transmission and receipt, examples are given in the Discussion section below. In essence, an incoming stream of GMSL, as received for example by a researcher, can be considered as an incoming series of data and commands in machine code; GMSL is simple but machine-code-like. A further feature is that, with the exception of data in specific data statements, the DNA sequence data and all GMS commands can be considered as instructions of equal rank at the level of implementation. For those familiar with computational theory, it is also helpful to consider these data and instructions as existing on the tape of a Turing machine represented by GMS, save that the tape cannot be rewound and the machine contains more internal states (variables) to memorize relevant data read as a basis for future action. Incoming information could be destroyed as it is acted upon, byte by byte. Data statements can be generic, or of a variety of types indicating the kind of data they carry. Additional data in data statements may include applets and other executable code. This could be used to profoundly alter the behavior or mode of use of the GMS system, so here we refer to standard practice, as exemplified in the previous report (ref 1). The current version uses Perl, but the Perl may also wrap and execute other programming languages. In addition, a new version written in, and capable of transporting, Java directly is under development. Apart from comprising part of the transmission, extra instructions in GMSL or other languages can be plugged into the GMS system as so-called “cartridges”. These are not transmitted in the stream, but are used to assign specialized functions to the GMS system at any installation site or at any node in a bioinformatics network. In the previous description, GMS was set up to process one medical record at a time. This is a usual setup, though it is not a restriction on (or of) GMSL. For the present work, when connections to data mining and statistical analysis are discussed, it may be assumed that the GMS system is used multiple times to access or build, transmit, and manage many records, which are then assembled into an archive of multiple records for analysis. However, some patient records are becoming very rich and contain multiple data, such as time courses of calcium efflux in cardiac tissue, and these could, of course, themselves be subjected to data analysis. Also, though it is not common practice, there is nothing to stop a GMSL message for one patient record from containing a larger archive of record summaries which could be used on receipt to perform, say, a diagnosis of the specific patient.
Journal of Proteome Research • Vol. 4, No. 2, 2005

This work discusses a detailed code representation in terms of bit patterns (such as 01001000), each of which represents a byte, and where each byte represents a unit of data or a basic command. These may be considered either as a set of specific proposals tested in actual clinical research scenarios, or as examples of some general scientific principles to help other workers develop their own analogous clinical bioinformatics systems. In either case, it should be noted that the codes are not arbitrarily chosen but are believed to have certain desirable features in regard to security or brevity of typical messages. The paper discusses the modes by which protein sequence and other data can be efficiently transmitted, and introduces proposals for the primary extended instruction set, which is chiefly concerned with carrying supporting clinical information. GMSL builds on (and is compliant with) early work by Robson and Greaney (ref 2) on the representation of protein sequences, in which each byte consists of bits encoding physicochemical properties of the amino acids, so facilitating searches by physicochemical and evolutionary homology (ref 2). In comparison to that early work, however, it emphasizes the DNA aspect and introduces many features to assist both clinical medicine and biomedical research, including security.

1.1 General Description. The GMS system is in two parts. The user who prepares or “encodes” the stream with data, file requests, password checks, and settings which initiate certain actions on receipt is the “sender” or “transmitting user”, and might typically be a clinical bioinformaticist working at the physician’s office, hospital, or service site, preferably within the physician’s office or institutional firewall. The user who “decodes” or receives the data is the “receiver” or “receiving user”, and might for example be a physician, a paramedic at an accident site, a researcher, or a clinical genomicist who must proofread and sign off on the content. The process of sending may, however, be one of storage and recovery rather than of actual transmission over a long distance. Both sending and receiving components are available to the sender. The system on sending will always automatically attempt a local receive, in a manner analogous to compilation, and this allows the sender to verify that correct encoding and decoding will take place. The receiving component and possibly the sending mode will be accessible to the receiver, but (if present) the sending mode will be inactive when receiving occurs. The GMS is the key component, but there are a considerable number of possible add-on tools (such as protein modeling) which allow us to explore communication standards issues. It is also helpful to think of the GMS language (GMSL) as a means of representing DNA sequences with embedded annotation, including not just comments of biomolecular and medical interest but also instructions to direct display and use of the DNA data. Hence, a simple sequence of bases (“base pairs”, units) such as GATTACAGATTAC (i.e., lacking any such annotation) is also valid GMSL. An example with intrinsic GMSL annotation features which is also valid GMSL is GATTACA;deletion;GATA;snpC;GAT;insertion;(TACA)GC, where the site of a deletion is noted, a section TACA is highlighted as inserted, and C is highlighted as an SNP. An important feature is the 64 types of brackets which can be overlapped (as opposed to solely nested) and associated with annotation, to describe overlapping sequence features: GATTAC ‘B epitope in protein: ’(AGAT ‘T epitope in protein)’ (20TACACATTAGA) ATT)20ACA.
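To make the annotated-sequence example concrete, the following is a minimal sketch of how the semicolon-delimited text form might be split into bases and intrinsic markers. It is not the GMS parser: the function name `tokenize`, the token labels, and the decision to treat brackets as part of a base run are all assumptions made for illustration, and only the marker names that appear in the example above are recognized.

```python
import re

# Marker names are taken from the example in the text; everything else
# (function name, token labels) is hypothetical and for illustration only.
MARKERS = {"deletion", "insertion"}
SNP = re.compile(r"snp\s*([AGCT])$")
BASES = re.compile(r"[AGCT()]+$")  # brackets are kept inside the base run


def tokenize(stream):
    """Split a semicolon-delimited GMSL-like string into (kind, value) tokens."""
    tokens = []
    for part in stream.split(";"):
        part = part.strip()
        if not part:
            continue
        snp = SNP.match(part)
        if part in MARKERS:
            tokens.append(("marker", part))
        elif snp:
            tokens.append(("snp", snp.group(1)))
        elif BASES.match(part):
            tokens.append(("bases", part))
        else:
            tokens.append(("other", part))
    return tokens


if __name__ == "__main__":
    for token in tokenize("GATTACA;deletion;GATA;snpC;GAT;insertion;(TACA)GC"):
        print(token)
```

A plain sequence such as GATTACAGATTAC yields a single `bases` token, which matches the statement that an unannotated sequence is already valid GMSL.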
Incidentally, although the previous account (ref 1) required a semicolon or line break (carriage return, new-line) to terminate each command, and required annotation to be introduced by a command such as ‘data’ with associated square brackets, the omission of semicolons is a matter of relatively straightforward parsing, and the terminators and other features are not fundamental. Indeed, the conversion from other forms, including legacy patient records and/or XML, is a function of cartridges plugged into the GMS system (ref 1). The GMS language is sufficiently rich that, with appropriate software such as the GMS engine (ref 1), it can for example help convert legacy clinical patient record data and add DNA data; transmit, reconstitute, and manage XML (including HL7) documents; handle layered security; combine manual and fully automated annotation of DNA consensus sequence features and the protein consensus sequence features of the implied protein sequence; and initiate statistical analysis, data mining, and computational biology techniques such as modeling of patient polymorphic proteins (ref 1).

1.2 Fragment and SNP Usage. There has been a growing trend to hope or assume that a limited number of base-pair differences are sufficient markers of genetic individuality and medical effect. Recently, this trend has reversed, leaning toward the other extreme that almost every base pair is at least of potential interest, including for example unexpected “recent” mutations. GMS is set up to hold fragments (as sequence 1, 2, etc.) and, if only SNP information is required, an SNP is considered a fragment of length one unit. Within a larger sequence, an SNP can be marked, as in one of the examples above, as snp A, snp G, snp C, or snp T, which, like insertion and deletion, is an intrinsic “DNA marker” command.


1.3 Nongenomic Usages. This paper is in part concerned with an extended description of clinical applications of GMS for the biomedical researcher. As noted above, a simple DNA sequence is valid GMSL. Conversely, GMSL is so rich that it can be useful even if it does not contain DNA data at all, and the tools of GMSL can be used to process other molecular biological data. In such a case, it may be described as the language CLaMSL of the Clinical Laboratory Messaging System, CLaMS. A “mode switch” (to Mode 6), discussed below, can disable the primary DNA-related features while retaining the exact 6-bit counterpart of the 8-bit command set.

1.4 Standards. Since one role of the system is to transmit, or create on receipt, standard forms such as XML (especially HL7 CDA), FASTA biosequence format, Protein Data Bank formats, standard comma-separated values files, and so on, other software can readily replace and extend the examples given in the previous paper (ref 1). Some less widely used formats, such as GOR secondary structure files, GSML (an internal biosequence markup), and CMML (the CliniMiner Data Mining Markup Language, also called FANOML), can also be generated by GMS. These were constructed or selected because they are used in various collaborations, but also because they have a general form which is readily “tweaked” to more widely accepted, but still diverse, standards.

2. Theory

The Genomic Messaging Language is constructed according to certain specific philosophies or principles.

2.1 General GMSL Principles. GMSL is transmitted and stored as a stream of bytes called the “stream”. A key feature of GMS is that the stream of bytes has no memory except for specific working variables set in the program. That is to say, the stream envisaged as a tape is never rewound (except back to the beginning to duplicate previous processing). Hence, for enhanced privacy, bytes may be deleted as read (in which case, the stream cannot be rewound at all). As discussed below, the number of bits in a byte can be varied, at least for internal working. The bits can, however, be repacked into bytes. The bytes actually stored or transmitted tend to be of 8 bits, or multiples thereof on certain devices. The GMS language is very concise and machine-code-like. The primary mode and main instruction set (MIS), consisting of 8-bit bytes, is made explicit in Table 1, which describes what may be considered as the “main mode”. Each byte is a command or data. It is a theoretical consideration that the bit patterns for commands and data are arranged in such a way that expressing the binary number to less accuracy allows us to eliminate or retain different aspects of the language which have differing relevance in different modes. With the exception of certain error recovery and encryption matters which will be discussed elsewhere, this may be argued to be more useful for the conceptual framework of the human user than for the computer. However, it certainly benefits the science, or at least philosophy, of GMS development. The binary code for true commands in main mode is XXXXXX00, i.e., they end in 00. Commands which look like data can be considered as commands for writing data on receipt. The same argument could be applied to any data command; be that as it may, only bytes with the 00 suffix contain items which are commands by any criteria. Further, those with suffix 11 are always base pair triplets such as GCA, coded into the remaining six bits at two bits per base pair (G, C, A, T). Furthermore, those with suffixes 01 and 10 are reserved for opening and closing brackets of 64 types, as described in the Introduction. The details are given in ref 1. It thus follows that if we wish to compress information, we can dispose of the bracket capability, then the principal DNA coding capability, and finally certain subsets of the commands, by progressively removing rightmost bits.

Certain commands are data statements, which in general contain ASCII characters, and the list of ASCII characters is ended by certain termination conditions. The terminator is typically the backslash ‘\’ or an end-of-data statement, which is generally of the form {...data}. The ellipsis ... indicates that there are characters which define different actions on termination. An important data statement is the number data statement, which contains the numeric representation of a number. The user may specify a new terminator of the form {...data} which is extremely unlikely to be encountered even in, for example, an image file. GMS calculates the probability that the given terminator will occur by chance, and checks with the user. For highly compressed data streams, such as streams of binary data where every bit is meaningful (for example, DNA represented by 2 bits per base pair, i.e., 4 base pairs from A, G, C, T per byte of 8 bits), the only means of termination are an end-of-file condition in the operating system, a failure to receive further bits after a preset period of time has elapsed in transmission, or a specific instruction telling GMS how many bits or bytes to read. Only the last is considered safe practice, though the mode in which GMS reads DNA sequences temporarily from a file is tolerated. Some commands read the next byte to obtain an integer parameter in the range 1-255 inclusive, and the general rule is that if the byte is zero (00000000) then the command looks instead to the memory of the last data statement.
This allows access to numbers larger than 255, real numbers, negative numbers, and numbers in scientific notation. The above is the only number-acquiring method in the normal mode. An alternative approach for reading integers larger than 255 would be as follows. If the byte is zero, then the next two bytes are read to obtain a value. If the value of those two bytes is zero (00000000 00000000), then the three bytes after that are read, and so on. This occurs only in alternative modes (and in practice to date, only in mode 4), which are described as follows. Note that error checking must be suppressed by the number command byte in order to read zero bytes.

2.2 Alternative Modes and Command Sets. The above describes the “standard mode” or “main mode” instruction set. Although the primary GMSL commands are based on the 8-bit byte, a general mode M (M = 1, 2, 3, ...) can be created in which there are M bits per byte, in which case the bytes are interpreted differently, though usually with comparable sense where appropriate and possible. For a reading or transmission system with B bits per byte (typically B = 8 or multiples thereof), M bytes of B bits are read and joined into a pack of M × B bits, which is then split into B “operating bytes” of M bits each in memory, interpreted sequentially. It is also possible to redefine instructions dynamically within the GMS stream, with respect to name, action, and number of bits per byte, by information transmitted in the stream. However, bytes are repacked and split back into M bytes of B bits after computation and prior to storage and transmission. Generally speaking, if the GMS stream is set to a different mode, it has to be a different command set, though the structure usually borrows as closely as possible from that of
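Two mechanisms from the passage above lend themselves to a short sketch: the escalating read of integers in alternative modes, and the repacking of M stored bytes of B bits into B operating bytes of M bits. The code below is illustrative only. The function names are hypothetical, and both the big-endian byte order and the most-significant-bit-first packing are assumptions, since the text does not state them.

```python
def read_escalating_int(data, pos=0):
    """Read 1 byte; if it is zero read the next 2 bytes; if those are zero
    read the next 3, and so on. Returns (value, next_position).
    Big-endian interpretation of multi-byte values is an assumption."""
    width = 1
    while True:
        chunk = data[pos:pos + width]
        if len(chunk) < width:
            raise ValueError("stream ended while reading a number")
        pos += width
        value = int.from_bytes(chunk, "big")
        if value != 0:
            return value, pos
        width += 1


def to_operating_bytes(stored, m, b=8):
    """Join m stored bytes of b bits into one pack of m*b bits, then split
    the pack into b operating bytes of m bits each (MSB first, assumed)."""
    if len(stored) != m:
        raise ValueError("expected exactly m stored bytes")
    pack = 0
    for byte in stored:
        pack = (pack << b) | byte
    mask = (1 << m) - 1
    return [(pack >> (m * i)) & mask for i in reversed(range(b))]


def to_stored_bytes(operating, m, b=8):
    """Inverse of to_operating_bytes: repack b operating bytes of m bits
    into m stored bytes of b bits."""
    if len(operating) != b:
        raise ValueError("expected exactly b operating bytes")
    pack = 0
    for op in operating:
        pack = (pack << m) | op
    mask = (1 << b) - 1
    return [(pack >> (b * i)) & mask for i in reversed(range(m))]


if __name__ == "__main__":
    print(read_escalating_int(bytes([0, 1, 44])))
    print(to_operating_bytes([0b00011011, 0b11100100], m=2))
```

Because the pack always holds M × B bits, the split into M-bit operating bytes never leaves a remainder, which is presumably why the text reads M bytes at a time.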

Table 1. Genomic Messaging System Language Commands (true commands and instructions to store or transmit DNA sequence data), Representing the Main Instruction Set^a

no.  bits      command                           action
0    00000000  warning                           Issue warning (byte frame shift detection). Place two of these after command ‘skip 2’ (see skip below). If skip 2 cannot be read, a sequence of 8 zero bits will be encountered and flag an error.
1    00000001  (0 or (                           Bracket
2    00000010  )0 or )                           Bracket
3    00000011  AAA                               Triplet
4    00000100  A                                 Singlet
5    00000101  (1 or (_                          Bracket
6    00000110  )1 or _)                          Bracket
7    00000111  AAG                               Triplet
8    00001000  validate                          Command. Advance validation counter by 1
9    00001001  (2 or (__                         Bracket
10   00001010  )2 or __)                         Bracket
11   00001011  AAC                               Triplet
12   00001100  AA                                Doublet
13   00001101  (3 or (___                        Bracket
14   00001110  )3 or ___)                        Bracket
15   00001111  AAT                               Triplet
16   00010000  comment (alternative: xml cdata)  Command, write CDATA (xml comment)
17   00010001  (4 or |_                          Bracket
18   00010010  )4 or _|                          Bracket
19   00010011  AGA                               Triplet
20   00010100  toggle xml                        Command, toggle xml writing off/on
21   00010101  (5 or |__                         Bracket
22   00010110  )5 or __|                         Bracket
23   00010111  AGG                               Triplet
24   00011000  and protein                       Command. XML annotation is interlaced with the automated annotation of the DNA and of the resulting protein sequences explored in all six reading frames.
25   00011001  (6 or |___                        Bracket
26   00011010  )6 or ___|                        Bracket
27   00011011  AGC                               Triplet
28   00011100  AG                                Doublet
29   00011101  (7 or /_                          Bracket
30   00011110  )7 or _/                          Bracket
31   00011111  AGT                               Triplet
32   00100000  end of task                       Command
33   00100001  (8 or /__                         Bracket
34   00100010  )8 or __/                         Bracket
35   00100011  ACA                               Triplet
36   00100100  snp A                             SNP (DNA feature marker)
37   00100101  (9 or /___                        Bracket
38   00100110  )9 or ___/                        Bracket
39   00100111  ACG                               Triplet
40   00101000  new dna                           Command, open xml dna start tag
41   00101001  (10                               Bracket
42   00101010  )10                               Bracket
43   00101011  ACC                               Triplet
44   00101100  AC                                Doublet
45   00101101  (11                               Bracket
46   00101110  )11                               Bracket
47   00101111  ACT                               Triplet
48   00110000  index                             Write content of last data statement to index file (usually an xml output file).
49   00110001  (12                               Bracket
50   00110010  )12                               Bracket
51   00110011  ATA                               Triplet
52   00110100  ]                                 End of data type
53   00110101  (13                               Bracket
54   00110110  )13                               Bracket
55   00110111  ATG                               Triplet
56   00111000  \e                                Command, force file end, return to mainstream
57   00111001  (14                               Bracket
58   00111010  )14                               Bracket
59   00111011  ATC                               Triplet
60   00111100  AT                                Doublet
61   00111101  (15                               Bracket
62   00111110  )15                               Bracket
63   00111111  ATT                               Triplet

Table 1 (Continued)

no.  bits      command                           action
64   01000000  execute this (alternative: perl)  Data statement, execute elected code content on receipt (default is Perl). Compare ‘execute code’
65   01000001  (16                               Bracket
66   01000010  )16                               Bracket
67   01000011  GAA                               Triplet
68   01000100  G                                 Singlet
69   01000101  (17                               Bracket
70   01000110  )17                               Bracket
71   01000111  GAG                               Triplet
72   01001000  xml                               Data statement, store xml and protect xml symbols
73   01001001  (18                               Bracket
74   01001010  )18                               Bracket
75   01001011  GAC                               Triplet
76   01001100  GA                                Doublet
77   01001101  (19                               Bracket
78   01001110  )19                               Bracket
79   01001111  GAT                               Triplet
80   01010000  hl7                               Data statement, HL7 CDA xml. Protect xml
81   01010001  (20                               Bracket
82   01010010  )20                               Bracket
83   01010011  GGA                               Triplet
84   01010100  ]0                                End of data, fail if parity odd
85   01010101  (21                               Bracket
86   01010110  )21                               Bracket
87   01010111  GGG                               Triplet
88   01011000  dicom                             Data statement (medical image)
89   01011001  (22                               Bracket
90   01011010  )22                               Bracket
91   01011011  GGC                               Triplet
92   01011100  GG                                Doublet
93   01011101  (23                               Bracket
94   01011110  )23                               Bracket
95   01011111  GGT                               Triplet
96   01100000  base pairs                        Data type, comment text including DNA sequence data in ASCII
97   01100001  (24                               Bracket
98   01100010  )24                               Bracket
99   01100011  GCA                               Triplet
100  01100100  snp G                             SNP (DNA feature marker)
101  01100101  (25                               Bracket
102  01100110  )25                               Bracket
103  01100111  GCG                               Triplet
104  01101000  protein                           Data statement, protein sequence in ASCII
105  01101001  (26                               Bracket
106  01101010  )26                               Bracket
107  01101011  GCC                               Triplet
108  01101100  GC                                Doublet
109  01101101  (27                               Bracket
110  01101110  )27                               Bracket
111  01101111  GCT                               Triplet
112  01110000  filedata                          Data statement. Optionally names file. Ending with {by X unlock data} requests password X. 2004A Version Extension: Ending with {applet data} executes file immediately assuming code type as specified by ‘elect code’ (default gms)
113  01110001  (28                               Bracket
114  01110010  )28                               Bracket
115  01110011  GTA                               Triplet
116  01110100  ]1                                End of data, fail if parity even
117  01110101  (29                               Bracket
118  01110110  )29                               Bracket
119  01110111  GTG                               Triplet
120  01111000  password                          Data statement. Request password attempt from user. Attempted password is memorized. Test now for match with X if terminator is {by X protect data}. Otherwise wait until {by X protect data} or {by X unlock data} (used in file data statement) is encountered. Other content of data statement is ignored except in first data statement (usually a password data statement) received, which must match initial password entered by receiver.

Table 1 (Continued)

no.  bits      command                           action
121  01111001  (30                               Bracket
122  01111010  )30                               Bracket
123  01111011  GTC                               Triplet
124  01111100  GT                                Doublet
125  01111101  (31                               Bracket
126  01111110  )31                               Bracket
127  01111111  GTT                               Triplet
128  10000000  instruction                       Data statement, contents are extended instructions (executed on receipt)
129  10000001  (32                               Bracket
130  10000010  )32                               Bracket
131  10000011  CAA                               Triplet
132  10000100  C                                 Singlet
133  10000101  (33                               Bracket
134  10000110  )33                               Bracket
135  10000111  CAG                               Triplet
136  10001000  data                              Data type, general (catch-all)
137  10001001  (34                               Bracket
138  10001010  )34                               Bracket
139  10001011  CAC                               Triplet
140  10001100  CA                                Doublet
141  10001101  (35                               Bracket
142  10001110  )35                               Bracket
143  10001111  CAT                               Triplet
144  10010000  number                            Data statement, contents define active ‘number’
145  10010001  (36                               Bracket
146  10010010  )36                               Bracket
147  10010011  CGA                               Triplet
148  10010100  insertion(                        DNA feature marker
149  10010101  (37                               Bracket
150  10010110  )37                               Bracket
151  10010111  CGG                               Triplet
152  10011000  name                              Data statement, contents become active ‘name’
153  10011001  (38                               Bracket
154  10011010  )38                               Bracket
155  10011011  CGC                               Triplet
156  10011100  CG                                Doublet
157  10011101  (39                               Bracket
158  10011110  )39                               Bracket
159  10011111  CGT                               Triplet
160  10100000  code (alternative: gms)           Data statement, content is code. Default is gms if command “elect code” has not been issued, or it elected gms. Code in the data statement is also executed on receipt of this data statement if it ends with string {applet data}.
161  10100001  (40                               Bracket
162  10100010  )40                               Bracket
163  10100011  CCA                               Triplet
164  10100100  snp C                             SNP (DNA feature marker)
165  10100101  (41                               Bracket
166  10100110  )41                               Bracket
167  10100111  CCG                               Triplet
168  10101000  conditional                       Data statement, contents are simple conditional test (usually with variables {...} and constant text, and operators ==, >, =, ...
169  10101001  (42                               Bracket
170  10101010  )42                               Bracket
171  10101011  CCC                               Triplet
172  10101100  CC                                Doublet
173  10101101  (43                               Bracket
174  10101110  )43                               Bracket
175  10101111  CCT                               Triplet
176  10110000  skip (alternative: skip to)
177  10110001  (44                               Bracket
178  10110010  )44                               Bracket
179  10110011  CTA                               Triplet
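The suffix rule of Section 2.1, which organizes Table 1, can be sketched in a few lines. The sketch below is not the GMS implementation. The function names are hypothetical, and the base coding 00 = A, 01 = G, 10 = C, 11 = T is read off the triplet rows of Table 1 (for example, 00000011 is AAA and 01100011 is GCA) rather than stated as a rule in the text. Command bytes are returned only as their six-bit payload, since naming each command would simply repeat the table.

```python
BASES = "AGCT"  # assumed coding: 00=A, 01=G, 10=C, 11=T (inferred from Table 1)


def classify(byte):
    """Classify a main-mode byte by its two rightmost bits."""
    if not 0 <= byte <= 255:
        raise ValueError("main-mode bytes are 8 bits")
    payload, suffix = byte >> 2, byte & 0b11
    if suffix == 0b00:
        return ("command", payload)        # true commands end in 00
    if suffix == 0b01:
        return ("open_bracket", payload)   # one of 64 bracket types
    if suffix == 0b10:
        return ("close_bracket", payload)
    # suffix 11: three bases at two bits each in the remaining six bits
    triplet = "".join(BASES[(payload >> shift) & 0b11] for shift in (4, 2, 0))
    return ("triplet", triplet)


def encode_triplet(triplet):
    """Pack three bases into one main-mode byte with suffix 11."""
    if len(triplet) != 3:
        raise ValueError("a triplet has exactly three bases")
    payload = 0
    for base in triplet:
        payload = (payload << 2) | BASES.index(base)
    return (payload << 2) | 0b11


if __name__ == "__main__":
    for value in (1, 2, 8, 99):
        print(format(value, "08b"), classify(value))
```

Dropping the two rightmost bits of every byte leaves exactly the command payload, which illustrates the remark that brackets and triplets can be discarded by removing rightmost bits.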

Metadata:)42

Age:)8

Table 6 (a) GMS Modes

mode  action
8     Subsequent instructions if not defined/redefined are read from the first 8 bits only and interpreted as standard 8-bit commands. Otherwise read as redefined. Default: use as defined in Table 1. If command ‘byte mode 8’ is encountered in this mode, switch to secondary mode, the 8-bit Robson-Greaney code (2).
7     Drop use of GMSL brackets. Last digit is 0 for a command and 1 for a triplet. If command ‘end of task’ is encountered revert to 8 bit command set. If command ‘byte mode 7’ is encountered in this mode, assume 5-bit Robson-Greaney code (2) with following additions of two bits to right which represent further instructions:
        00 - do nothing (but two bytes of 0000000 will be used as frame shift check and issue a warning)
        01 - skip next two 7-bit bytes (frame shift check)
        10 - re-interpret preceding 5 bits as a mode to which to switch
        11 - return to calling mode
6     So-called “CLaMS mode”. Read six bits per byte, and assume bits 7 and 8 were ‘00’ in interpreting Table 1. This retains the Command set only, losing triplets and brackets. If command ‘byte mode 6’ is encountered in this mode, switch to secondary mode, the protein sequence mode. Drop use of brackets and commands. Read as protein sequence with amino acids corresponding to codon triplets, to end. End at stop codon, and return to previous (calling) mode.
5     Telegram mode. Read text as compressed 32 character set (errorcheck) (skip number of 5-bytes specified by next byte) (end mode 5) (blank) ABCDEFGHIJKLMNOPQRSTUVWXYZ01 where runs of 01 will be interpreted to base 10 on receipt. Reserved strings to insert characters on receipt are SWITCHCASEZZ, OPENBRACZZ, CLOSEBRACZZ, STOPZZ, COMMAZZ, COLONZZ, SEMICOLONZZ, PLUSZZ, MINUSZZ, ASTERISKZZ, SLASHZZ, XXASCIZZ, XXXXUNICODEZZ where XXX is a three digit (base 16 coded by characters A,B,..P) ASCII or UNICODE code for a required character. String ENDZZ is reserved to mean return to previous (calling) mode.
4     Extended minimal flexible mode. Read as
        XX01 commands: 0001 = A, 0101 = G, 1001 = C, 1101 = T
        XX00 commands: 0000 = warning (frame shift check); 0100 = skip number of bytes specified by next byte (0-16), if zero, read 2 bytes after that, if that is zero read 3 bytes after that, and so on; 1000 = switch to mode specified by bytes, if zero, read 2 bytes after that, if that is zero read 3 bytes, and so on; 1100 = switch to previous (calling) mode
        XX10 commands: 0010 = read next 2 bytes as standard 8 bit instruction; 0110 = read next 2 bytes as extended 8 bit instruction; 1010 = verify (increment count); 1110 = read successive pairs of bytes as ASCII characters until STOPZZ is encountered
        XX11 commands: 0011 = general data statement, 8 bits per byte; 0111 = password data statement, 8 bits per byte; 1011 = conditional data statement, 8 bits per byte; 1111 = end of task (terminate stream)

Table 6 (Continued)

(a) GMS Modes

M: Mode. Corresponds also to number of bits per byte specified, with the exception of a ‘mode zero’ which invokes the 8-bit Robson-Greaney protein sequence code.

mode  action
3     Minimal flexible mode. Read as 001 = A, 011 = G, 101 = C, 111 = T, 000 = warning (byte frame shift check), 010 = skip over the next two bytes, 100 = switch to mode specified by next two bytes, 110 = switch to previous (calling) mode
2     Read as 00 = A, 01 = G, 10 = C, 11 = T to end-of-data or end-of-file.
1     Transmit binary to end-of-data or end-of-file.
0     Read next byte in stream to switch on Robson-Greaney protein sequence code (2) for amino acids sequences based on physicochemical properties expressed by 1-5 bits and runs to end-of-data or end-of-file. 0, 6, 7, >8 fails. 8 switches on the full 8-bit Robson-Greaney code.
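As an illustration of the two most compact DNA modes in Table 6(a), the sketch below packs and unpacks mode 2 (two bits per base) and walks a list of mode 3 operating bytes (three bits each). It is a sketch under stated assumptions, not GMS code: the function names are hypothetical, bases are packed most significant bits first, a final partial byte is padded with zero bits, and an explicit base count is passed to the decoder, following the remark in Section 2.1 that only an explicit length is safe practice. The mode 3 decoder stops at a mode switch or return and reports it rather than acting on it.

```python
MODE2 = "AGCT"  # 00=A, 01=G, 10=C, 11=T
MODE3_BASES = {0b001: "A", 0b011: "G", 0b101: "C", 0b111: "T"}


def encode_mode2(seq):
    """Pack bases four per 8-bit byte; the last byte is zero-padded (assumed)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | MODE2.index(base)
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def decode_mode2(data, n_bases):
    """Unpack n_bases bases; the explicit count avoids reading the padding."""
    bases = [MODE2[(byte >> shift) & 0b11]
             for byte in data for shift in (6, 4, 2, 0)]
    if n_bases > len(bases):
        raise ValueError("not enough data for the requested number of bases")
    return "".join(bases[:n_bases])


def decode_mode3(ops):
    """Interpret 3-bit operating bytes; returns (sequence, events)."""
    seq, events, i = [], [], 0
    while i < len(ops):
        op = ops[i]
        if op in MODE3_BASES:
            seq.append(MODE3_BASES[op])
        elif op == 0b000:
            events.append("warning")       # byte frame shift check
        elif op == 0b010:
            i += 2                         # skip over the next two bytes
        elif op == 0b100:
            events.append(("switch", tuple(ops[i + 1:i + 3])))
            break
        else:                              # 0b110
            events.append("return")
            break
        i += 1
    return "".join(seq), events


if __name__ == "__main__":
    packed = encode_mode2("GATTACA")
    print(list(packed), decode_mode2(packed, 7))
    print(decode_mode3([3, 1, 7, 7, 2, 0, 0, 1, 6]))
```

In the mode 3 example the two zero bytes after the skip code are stepped over silently; if a frame shift caused the skip to be misread, they would instead surface as warnings.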

(b) Robson-Greaney Codes^a

no.  bits   symbol                                 properties
0    00000  warning                                reading frame check
1    00001  W                                      Non-polar, large aromatic
2    00010  Y                                      Non-polar, large aromatic
3    00011  F                                      Non-polar, large aromatic
4    00100  L                                      Non-polar, large aliphatic
5    00101  M                                      Non-polar, large aliphatic
6    00110  I                                      Non-polar, large aliphatic
7    00111  V                                      Non-polar, large aliphatic
8    01000  C(-S-S-)                               Non-polar, small, polar aspect
9    01001  C(-SH)                                 Non-polar, small, polar aspect
10   01010  X                                      Unknown
11   01011  A                                      Non-polar, small, polar aspect
12   01100  hydroxyproline                         Polar neutral, small, non-polar aspect
13   01101  P                                      Polar neutral, small, non-polar aspect
14   01110  blank                                  Deletion
15   01111  G                                      Polar neutral, small, non-polar
16   10000  T modified (e.g., phosphothreonine)    Polar neutral, hydroxyl derivative
17   10001  T                                      Polar neutral, hydroxyl
18   10010  S modified (e.g., phosphoserine)       Polar neutral, hydroxyl derivative
19   10011  S                                      Polar neutral, hydroxyl
20   10100  N-glycosylated                         Polar neutral, acid amine derivative
21   10101  N                                      Polar neutral, acid amine
22   10110  Z (Q or E)                             Polar, acid or acid amine
23   10111  Q                                      Polar neutral, acid amine
24   11000  B (N or D)                             Polar, acid or acid amine
25   11001  D                                      Polar charged, acid amine
26   11010  E modified (e.g., pyroglutamate)       Polar charged, acid
27   11011  E                                      Polar charged, acid
28   11100  H modified (e.g., methylhistidine)     Polar charged, amine derivative
29   11101  H                                      Polar charged, amine
30   11110  K                                      Polar charged, amine
31   11111  R                                      Polar charged, amine

a The more that the bits match between two amino acids from left to right, the more conservative the substitution of one by the other, and the more similar the physicochemical properties (see ref 2).
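The footnote's rule, that more matching bits from left to right means a more conservative substitution, can be illustrated with a minimal Python sketch. The 5-bit codes are copied from Table 6b for the unmodified residues; the use of the free-thiol code for C, the dictionary name `RG_CODE`, and the function `shared_prefix_bits` are illustrative assumptions, not part of GMS.

```python
# 5-bit Robson-Greaney codes for unmodified residues, copied from Table 6b.
# Assumption: "C" here means the free thiol form C(-SH), code 01001.
RG_CODE = {
    "W": "00001", "Y": "00010", "F": "00011",
    "L": "00100", "M": "00101", "I": "00110", "V": "00111",
    "C": "01001", "A": "01011", "P": "01101", "G": "01111",
    "T": "10001", "S": "10011", "N": "10101", "Q": "10111",
    "D": "11001", "E": "11011", "H": "11101", "K": "11110", "R": "11111",
}


def shared_prefix_bits(a: str, b: str) -> int:
    """Count how many leading bits the codes of residues a and b have in common."""
    count = 0
    for bit_a, bit_b in zip(RG_CODE[a], RG_CODE[b]):
        if bit_a != bit_b:
            break
        count += 1
    return count
```

Under this reading, K and R share four leading bits, L and I share three, and W and R share none, which matches the intuition that K/R is a conservative substitution and W/R is not. Treating the prefix length as a similarity score is one simple interpretation of the footnote; the scoring actually used for homology searching is described in ref 2.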

settings and handle new requirements by at least some “make do” method, i.e., fresh extensions are constructed using features intrinsic to GMS, without recoding the core of GMS itself. Note that when more concise or elegant methods are introduced later, new GMS systems will be backward-compatible, i.e., still able to use older data streams in which extensions may have been introduced in the “make do” manner. The only area where some extension or revision may arise soon, because of ongoing research and debate, is in regard to password scope. Handling such matters flexibly right

Journal of Proteome Research • Vol. 4, No. 2, 2005

now is again possible because of the power arising from the ability to transmit executable code within the data stream, as well as because of other capabilities already inherent in GMS. By way of example, consider the issue of fine-grained consent, when the patient may elect that certain areas of his data or DNA can only be used for certain specified purposes, and other areas only for other specified purposes. In some cases, uncommon but possible situations presenting ambiguity of choice can arise, at least in a system like GMS where we deliberately elect that brackets and hence information blocks can overlap, i.e.,

research articles

GMS Language Including Command Extensions

Table 7. Summary of Current Functionalities in Overall Automated System

| Functionality | Location (encode unit sends, decode unit receives) | Communicating formats |
|---|---|---|
| Legacy conversion | Encode unit plug-in processing cartridge | Various in and out |
| Recovery of record from Patient Record Archive | Encode unit plug-in processing cartridge | Mainly XML at present |
| Joining of lab data to patient record | Encode unit plug-in processing cartridge | Various, XML. Mainly inserting lab data into XML document at present |
| Adding manual annotation | Encode unit plug-in processing cartridge | Various, XML |
| Production of record from template | Encode unit, intrinsic | Any. Editing engine identifies string ???? and requests input to replace it by data |
| Encryption, imposition of passwords on stream or to prevent adding data from files at receiver end, optional text compression, choice of XML modes (XSL?), selection of annotation modes, addition of applets, etc. | Encode unit, intrinsic | GMSL from sender |
| Use of passwords, provision of further data | Decode unit, intrinsic | Keyboard, or in selected formats (esp. DNA sequences) |
| Manual annotation, editing and “signing off” | Decode unit, via web document | HTML-based web page used to verify and edit underlying XML and content |
| Emergency access to and searching inside a specific medical record | Decode unit | XML (XML tags display as categories, and content is displayed as a command-line editor for access via mobile phone text or computer) |
| Automatic annotation of genes and proteins | Decode unit plug-in processing cartridge | Input is regular expressions representing consensus sequence dictionary |
| Protein modeling | Decode unit add-on application, protein modeling engine | FASTA or GOR in, PDB out |
| Conversion of record for input to Data Mining | Decode unit add-on application, data mining engine | GMSL or simple XML in, XML and CSV file out |
| Expression Array Analysis | Decode unit add-on application | CSV |
| Data analysis and visualization | Decode unit add-on application | CSV. XML could be used |

in the manner (...[...)...]. This “choice” is not a question of dangerous ambiguity or instability of action at run time, but rather a choice which is fixed in the design of GMS itself, or at least in its definitions of default interpretations. For example, one specific aspect here for system design is whether a user requesting access to the range of data [...] at the first square bracket should be able to do so without first obtaining access to the range of data (...) at the first curved bracket. At present, legal access (at the opening curved bracket) is required in simple cases. However, more flexibility of choice about this at run time can be achieved at the applet or cartridge level, or by splitting patient records into several separate streams, or simply by using the existing capabilities to insert encrypted and separately transmitted files concerning the special data.

New features reported in the present paper include the means to switch to other specialist instruction sets, including protein sequences. They are coded so as to facilitate rapid detection of homologous sequences, which allows faster homology searching (ref 2). The relevant 5 leftmost bits of the Robson-Greaney code have been described earlier (ref 2) but are presented here in Table 6b for completeness. As noted previously, anything in the present GMS papers may be considered either as specific proposals for a standard, or merely as examples illustrating how a concise means of shipping clinical and genomic information might be set up. The referee noted that the extended instruction sets (Table 3) categorizing the genomic, proteomic, clinical, diagnostic and demographic data with adaptation of the National Library of Medicine’s Unified Medical Language System (UMLS) can be

used as a standard for biomedical informatics. Consistent with that, in discussing GMS, the word “standard” is meant at a deeper level than any specific XML manifestation such as HL7 CDA, and deeper even than XML in general. GMS is in fact potentially a vehicle, and inter-transforming medium, for various standards. It is capable of carrying data about standard formats used in input, and of reconstituting the original standard after certain operations (such as addition of patient DNA to the patient record) are completed. In view of the importance of XML, however, GMS supports, as a kind of bonus, several XML tools.

4.2 GMS as a “Mark Up” Tool. In that connection, it is useful to clarify a point of the previous paper (ref 1). Readers did on occasion query whether “DNA Markup” of the title referred primarily to a proposal of use of GMSL within DNA streams, or to an XML mark-up. “Mark-up” often refers to a particular XML embodiment, but to our knowledge there is no reason to so restrict its use. GMS is of itself, among other things, a markup language. In fact, we intended use of the term in both senses. In addition to GMSL, we proposed our own specific XML embodiment for clinical biosequences, as was comprehensively illustrated in Appendix 5 of that paper. GMS provides additional tools in support of that embodiment. An important feature of our XML proposal was that it had a flexible form, or multiple coexisting forms, which can readily be converted to existing and future standards. In particular, note that markup of biosequence features occurs in duplicate, both (i) as tags around the relevant region within the sequence, and (ii) as tags outside the sequence which point by sequence number and


Robson and Mushlin

Table 8. ‘Screen Shot’ Initial Fragment of Transmitted Stream of GMS Data

locus numbers to the appropriate part of the sequence referenced. This covers the two major classes of sequence mark-up XML in common use. Typically, the closest format is adapted to that of interest at any installation, and an add-in cartridge edits that form to the standard form required. Note that this and all other sequence annotation is created entirely automatically by GMS or GMS cartridges, except for any comment originally inserted in the DNA by the sequencing laboratory, and any additional comments added when the HTML web page is displayed to, and edited by, the approved clinical genomic specialist.

4.3 Code Construction at the “Sender” Side. The referee felt that a little further detail on how codes are constructed at the “sender” side and then invoked at the “receiver” side would help readers who are not familiar with such concepts. The reader is referred to the first paper (ref 1) for a more detailed account, but some additional overall statements may be helpful here. As a starting point, note that, in the most direct but least user-friendly mode, one might learn to use GMSL as a kind of reasonably friendly machine code language. The data and commands are simply written in the appropriate order but with some flexibility as to layout on each page. That is, one could type a file of fragments of DNA sequence such as AGCTGCTA along with commands input as GMSL mnemonics such as ‘xml;’ or ‘squeeze dna;’ delimited by semicolons or “carriage return”/“newline” markers, and separated by one or more white spaces or “carriage return”/“newline” markers. Table 2, showing the executable manual, provides an example of such an input. Data and commands on such an input file will appear in the encoded and transmitted or stored stream, byte by byte, as shown line by line in Table 8 (for a medical record in this case). This is essentially the same method by which experienced machine code programmers use mnemonics to construct their programs. This is all with the caveat that normally this stream of bytes will be encrypted before transmission or storage by shuffling the byte stream using the intrinsic GMS algorithm (ref 1), typically supplemented by further application of standard file or transmission encryption methods. In any event, shuffling will ensure that no interpretation such as that of Table 8 can readily be placed on the data. In addition, certain data like passwords could be masked even within the output stream.
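The input convention just described, DNA fragments interleaved with mnemonics that end in a semicolon or a line break, can be sketched as a small tokenizer. This is a hedged illustration only: the rule used here to tell data from commands (a word made solely of A, C, G, T before any command word is treated as sequence data) and the function name `tokenize` are assumptions, and the real GMS encoder may well resolve such input differently.

```python
import re

# Assumption: sequence data is written in upper-case A, C, G, T only.
DNA_WORD = re.compile(r"^[ACGT]+$")


def tokenize(text: str):
    """Yield ("dna", fragment) or ("command", mnemonic) pairs in input order."""
    # Statements end at a semicolon or at carriage return / newline markers.
    for statement in re.split(r"[;\r\n]+", text):
        command_words = []
        for word in statement.split():
            if DNA_WORD.match(word) and not command_words:
                yield ("dna", word)
            else:
                # Everything after the first non-DNA word belongs to the command.
                command_words.append(word)
        if command_words:
            yield ("command", " ".join(command_words))
```

For example, `list(tokenize("xml;\nAGCTGCTA squeeze dna;"))` gives a command `xml`, a DNA fragment `AGCTGCTA`, and a command `squeeze dna`. The mnemonics themselves are taken from the text; their mapping to bytes is defined in the paper's command tables and is not reproduced here.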

Table 9. “Screen Shot” of Execution of Initial Fragment of Received Stream of GMS Data

Additional tools (invoked by GMS commands, see ref 1) can “anonymize” patient data, for example. On receipt, and after any required decryption, the stream of data in Table 9 is generated. Table 8 exemplifies the start of the so-called ‘encode stream’, essentially representing the rawest form of input, and Table 9 the corresponding start of the ‘decode stream’, the rawest form of output. Actually both appear on the file ‘reports.dat’ of the transmitter, and only the ‘decode stream’ appears on the file ‘reports.dat’ of the receiver. This filename ‘reports.dat’ indicates that the contents are reports for checks and maintenance by a GMS developer, but in fact this file also represents the original representation of the most fundamental input and output. Note that the commands of the output stream generate so-called “GMS EVENTS”, which simply means that something is triggered to happen at that incoming byte or at the last of a particular sequence of bytes. These events may be, however, and often are, the writing of at least part of what is directly or indirectly a relatively user-friendly presentation of the incoming data to the screen or to a different file. By ‘directly’ is meant actual “what you see is what you get” text, and by ‘indirectly’ is meant, typically, an XML or even direct HTML file which will ultimately produce a very friendly “web page” user interface. All of these occurred in the example run in the first paper (ref 1). XML and HTML output, written by the above GMS EVENTS, are well described in the first paper. The above-mentioned text output represents a somewhat less raw, second form of GMS output. An example of the direct text output case was not shown in the first paper (ref 1), so a screen shot is shown here (Table 10). This display represents to the

Table 10. An Example Section of Text, in Screen Display or File, Deduced from the GMS Stream or Written by Events in the GMS Stream (see text)

description of basic output with which, if all else fails, the receiver can access and search in the manner of a read-only editor. For example, the command -5 scrolls back five “pages”, the command protein searches forward for the word protein, and so on. This mode can be accessed, potentially even in text mode on a wireless telephone, if the rest of the clinical structure is damaged in a disaster. As well as such output being specifically generated by GMS EVENTS, the system can be recalibrated to pick up XML tags and content transmitted in the stream of bytes without this being requested by commands in the stream (this kind of alternative or additional action is called OOSEI, or “out-of-stream event initiation”). Although there are some intrinsic capabilities to perform OOSEI (again, to minimize dependency on further information in the event of a disaster), plug-in cartridges at the receiver end, or alternatively applets or programs transmitted inside the stream, will typically perform such actions.

In routine practice, of course, such raw input and output as represented by Tables 8-10 are not seen by the medical personnel or researcher. They are “under the hood”, and prepared from more readable documents. Plug-in cartridges are used to take such more user-friendly representations, and to convert them to the encode stream at the transmit end, and from the decode stream at the receive end. What those original and final user-friendly forms are depends on the cartridges, which are set up on installation. However, these may be, for example, HL7 CDA documents, as described, or XML documents to join (both were exemplified in the first paper). Alternatively, they may be documents in an old legacy format, which one might convert to HL7 CDA by more elaborate cartridges at the encode (transmitter) end or, in principle, at the receive end. The first paper illustrated the former case. Where transmission of executable code such as applets is involved, this is simply done on input, manually or by cartridges, by inserting the code in the perl data statement, as illustrated in the following from Table 2:

The string {applet data} is a request to execute the applet the moment it is encountered (and that entire applet is read in) in the incoming stream. If this is not present, then the code will be recovered and executed later. In a test, GMS was shown to be able to transmit itself. This is because the GMS is written such that strings referring to matters such as stop signals do not appear explicitly in GMS code but are deduced at run time on receipt. Otherwise, the contents of the perl data statement would terminate prematurely. As noted in the Introduction, currently only Perl is directly supported as transmissible code, but this simply means that there must be a Perl harness within the data statement that can invoke action by another language of choice, code for which is either directly entered in the stream, or on a cotransmitted file which is run by the Perl. Extensions to handle other languages such as Java directly will be a feature of new GMS codes, and indeed there are plans to code GMS in Java while retaining its Perl-handling capability. Note that there would be nothing to stop a GMS coded in Java receiving a stream of clinical genomic data from one coded in Perl, or vice versa, and that the stream could also comprise or contain the code to implement an automatic change to the Java version. The word “upgrade” is avoided here in deference to the zeal of Perl enthusiasts, as well as the growing power of the Perl language.
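The point about GMS being able to transmit itself can be illustrated with a small Python sketch of the general technique: if the terminator of an embedded payload is assembled at run time rather than written out literally, source code that defines the terminator can itself travel as a payload without ending the payload early. Everything named here is hypothetical: the marker text, `wrap`, and `unwrap` are stand-ins chosen for illustration, and the actual GMS stop signals and Perl data statement syntax are not shown in this section.

```python
# Hypothetical terminator, built from pieces at run time. The real GMS stop
# signal is not given in the text; this marker is purely illustrative.
STOP = "".join(["{", "stop", " ", "payload", "}"])


def wrap(payload: str) -> str:
    """Append the terminator to a payload for transmission."""
    return payload + STOP


def unwrap(stream: str) -> str:
    """Return everything before the first terminator found in the stream."""
    end = stream.find(STOP)
    if end < 0:
        raise ValueError("no stop marker in stream")
    return stream[:end]


# Source text that spells the marker out literally is cut short on receipt:
naive_source = 'STOP = "{stop payload}"'
# Source text that assembles the marker piecewise survives intact:
safe_source = 'STOP = "".join(["{", "stop", " ", "payload", "}"])'
```

With these definitions, `unwrap(wrap(safe_source))` returns the full line, while `unwrap(wrap(naive_source))` returns only the text before the embedded marker, which is the premature termination the paragraph above warns about.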

5. Conclusions Clearly, the GMS system confers considerable flexibility on the kind of system that one might wish to set up. The flexibility is further enhanced by the ability to transmit applets or programs, which can either directly modify representation and analysis of the clinical data, or generate more enduring plug-in cartridges. This flexibility and the breadth of any account of the actions of a GMS system can be baffling. It may help to consider GMSL as a specialized computer language for clinical-genomic transmission, storage, and utilization applications, and GMS as its compiler. The authors would like to re-emphasize that these proposals need not be followed in fine detail, bit by bit, in order to build a workable GMS-like system. The specific binary values of the bytes were not lightly chosen and had various perceived advantages. However, even if the initial idea is popular, few significant standards in any field of information technology adhere precisely to the original proposals. Medical standards, such as those of XML, HL7 messaging, etc., represent a long process of evolution, debate, and democratic vote.

As an example of the difficulties in debating improvements, note first that one possible GMSL improvement is in regard to the zero bytes (containing zero bits only). Zero bytes most typically have two functions. These are (1) to signal an error, notably a transmission phase error, since such bytes or strings of them are normally skipped, and (2) following commands such as “byte mode”, to indicate that reference to a previously set variable should be made, or that an increased number of bytes following should be read to define a numeric quantity, and so on. In the latter kind of function of a zero byte, the command overrides the normal error checking. To some tastes, it might seem tidier and more elegant to redefine this aspect of GMSL so that a byte of zero bits in the most prevalent modes of use always flags an error. One natural alternative is to allow only non-zero bytes as arguments and to use the byte of highest value to have the zero role, or to subtract one to obtain the intended integer value equivalent to current use. However, such tempting “improvements”, where say 255 means 0 or 6 means 5, might themselves be considered “messy” by other workers, and likely to increase the probability of human error in programming in GMSL. Moreover, in the present definitions of Table 1, etc., an error could still correctly be flagged by the zero byte if the stream was out of phase, because the preceding command such as “byte mode” would not be read as such, and could therefore not suppress error detection.
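The two alternative argument conventions discussed above (“255 means 0” and “6 means 5”) can be made concrete with a minimal Python sketch. The function names are hypothetical and the sketch covers only the arithmetic of the two proposals; it does not model the current GMSL behaviour, in which a zero byte after a command such as “byte mode” has a special role.

```python
def flags_error(byte: int) -> bool:
    """Under both alternatives, a zero byte always signals an error."""
    return byte == 0


def decode_highest_as_zero(byte: int) -> int:
    """Alternative 1: only non-zero bytes are arguments; 255 takes the zero role."""
    if flags_error(byte):
        raise ValueError("zero byte: possible transmission phase error")
    return 0 if byte == 255 else byte


def decode_subtract_one(byte: int) -> int:
    """Alternative 2: the transmitted byte is the intended value plus one."""
    if flags_error(byte):
        raise ValueError("zero byte: possible transmission phase error")
    return byte - 1
```

Either scheme leaves 255 usable byte values for arguments, covering intended values 0 to 254. The sketch also shows where the human-error concern comes from: a programmer writing the byte 6 must remember whether it will be read as 6 or as 5.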

References
(1) Robson, B.; Mushlin, R. “Genomic Messaging System for Information-Based Personalized Medicine with Clinical and Proteome Research Applications” J. Proteome Res. 2004, 3(5), 930-948.
(2) Robson, B.; Greaney, P. J. “Natural Sequence Code Representations for Compression and Rapid Searching of Human-Genome Style Data Base” CABIOS 1992, 8, 283-289.
(3) Robson, B. “Studies in the Assessment of Folding Quality for Protein Modeling and Simulation when the Experimental Structure is Unknown” J. Proteome Res. 2002, 1(2), 115-133.
(4) Robson, B. “Clinical and Pharmacogenomic Data Mining. 1. The Generalized Theory of Expected Information and Application to the Development of Tools” J. Proteome Res. 2003, 2, 283-301.
(5) Robson, B.; Mushlin, R. “Clinical and Pharmacogenomic Data Mining. 2. A Simple Method for the Combination of Information from Associations and Multivariances to Facilitate Analysis, Decision and Design in Clinical Research and Practice” J. Proteome Res. 2004, 3(4), 697-711.
(6) Robson, B.; Mushlin, R. “The Dragon on the Gold: Myths and Realities for Data Mining in Biotechnology using Digital and Molecular Libraries” J. Proteome Res. 2004, 3(6), 1113-1119.
(7) Li, J.; Robson, B. Bioinformatics and Computational Chemistry in Molecular Design. Recent Advances and their Application. In Peptide and Protein Drug Analysis; 2000, Marcel Dekker: New York, pp 285-307.
(8) Robson, B. “Studies in the Assessment of Folding Quality for Protein Modeling and Simulation when the Experimental Structure is Unknown” J. Proteome Res. 2002, 1(2), 115-133.
