Clinical and Pharmacogenomic Data Mining: 3. Zeta Theory As a General Tactic for Clinical Bioinformatics
Barry Robson
T. J. Watson Research Center (IBM), 1101 Kitchawan Road, Yorktown Heights, New York 10598
Received November 5, 2004

A new approach, a Zeta Theory of observations, data, and data mining, is being forged from a theory of expected information into an even more cohesive and comprehensive form by the challenge of general genomic, pharmacogenomic, and proteomic data. In this paper, the focus is not on studies using the specific tool FANO (CliniMiner) but on extensions to a new, broader theoretical approach, aspects of which can easily be implemented into, or otherwise support, excellent existing methods, such as forms of multivariate analysis and IBM's product Intelligent Miner. The theory should perhaps be distinguished from an existing, purely number-theoretic area sometimes also known as Zeta Theory, which focuses on the Riemann Zeta Function and the ways in which it governs the distribution of prime numbers. However, Zeta Theory as used here overlaps heavily with it and actually makes use of these same matters. The distinction is that it enters from a Bayesian information theory and data representation perspective. It could thus be considered an application of the 'mathematician's version'. The application is by no means confined to areas of modern biomedicine, and indeed its generality, even merging into quantum mechanics, is a key feature. Other areas that face some of the same challenges as modern biology, and which have inspired data mining methods such as IBM's Intelligent Miner, include commerce. But for several reasons discussed, modern molecular biology and medicine seem particularly challenging, and this relates to the often irreducible high dimensionality of the data. They thus remain our main target.
Keywords: Zeta theory • number theory • data mining • clinical data • proteomics • genomics

1. Background
1.1. Is Data Analysis Particularly Difficult in Biomedicine? Of all disciplines, it almost seems that it is clinical genomics, proteomics, and their kin which are particularly hard on the data-analytic part of science. Is modern molecular medicine really so unlucky? Certainly, the recent explosion of biological and medical data of high dimensionality (many parameters) has challenged available data-analytic methods (for discussion, see refs 1-5). In principle, one might point out that a recurring theme in the investigation of bottlenecks to development of 21st century information technology relates to the same issues of complexity and very high dimensionality of the data to be transformed into knowledge, whether for scientific, business, governmental, or military decision support. After all, the mathematical difficulties are general, and absolutely any kind of record or statistical spreadsheet of many parameters (e.g., in medicine: age, height, weight, blood pressure, polymorphism at locus Y649B, etc.) could, a priori, imply many patterns, associations, correlations, or eigensolutions to multivariate analysis, expert system statements, or rules, such as |Height:=6 ft, Weight:=210 lbs> (but see Section 2.2) or |Gender:=male, Pregnant:=no>. The notation |observation> is the physicists' ket notation that forms part of a more elaborate "calculus" of observation (see Section 3 for a fuller account of the value of this). It is mainly
used here for all such rule-like entities, and they will generally be referred to as "rules". As discussed below, there are systems which are particularly complex, so that there are many complicated rules not reducible to, and not deducible from, simpler rules (at least, not until the future time when we can run a lavish simulation based on physical first principles). Medicine seems, on the whole, to be such a system. It is an applied area of biology, which is itself classically notorious as a nonreducible discipline. In other words, nonreducibility may be intrinsically a more common problem for complex interacting systems, of which human life is one of our more extreme examples. Certainly there is no guarantee that all aspects of complex diseases such as cardiovascular disease are reducible into independently acting components that we can simply "add up" or deduce from pairwise metrics of distance or similarity. At the end of the day, however, it may be that such arguments are an illusion and that there is no special scientific case for a mathematical difficulty in biomedicine. Data from many other fields may be similarly intrinsically difficult to data mine. It may simply be that healthcare is peppered with everyday personal impact, life and death situations, public outcries, fevered electoral debates, trillion dollar expenditures, and epidemiological concerns that push society to ask deeper and more challenging questions within the biomedical domain than routinely happen in other domains.


1.2. Large Number of Possible Rules Extractable A Priori from All Types of High-Dimensional Data. For discovery of relationships between N parameters, there are almost always x^N potential basic rules, where x is some positive constant greater than unity which is characteristic of the method of data representation and study. For a typical rectangular data input such as a spreadsheet of N columns, there are Σ_{n=0,...,N} N!/[(N - n)!n!] = 2^N potential basic rules.3-5 For completeness, note that we can do without N + 1 of them if we do not need the N self-terms such as |Height:=6ft> and the empty case |> discussed below, giving Σ_{n=2,...,N} N!/[(N - n)!n!] = 2^N - N - 1. For example, associations between three columns a, b, and c generate 2^N - N - 1 = 2^3 - 3 - 1 = 4 possible rules |a,b>, |b,c>, |a,c>, and |a,b,c>, neglecting the N self-rules |a>, |b>, |c>, and also |>. The number is potentially higher still if items with no metadata, or with the same metadata, can repeat (e.g., Injury:=broken_leg occurs at least twice in a record, such that we can have rules such as |Injury:=broken_leg, Injury:=broken_leg>). The number is k^N if there are k possible states of each parameter, or more generally Π_{j=1,...,N} k_j if the number of states k_j differs with each parameter j. These numbers represent the numbers of associations etc. made from combinations of items or parameters taken n at a time, but they also relate to the potential number of eigensolutions or local minima in the N-dimensional function surface represented by a fit to numeric data with acceptable error ε, in which case fitting must involve at least (1/ε)^N local fitting operations. These Nth-power-law conclusions are notably general to all types of data. For a mere N = 100 nonrepetitive rules, extremely small by imminent biomedical standards, 2^N is approximately 10^30 statistical tests or direct queries. 1.3. Why Such Large Numbers May Be Surprising. There are three reasons why such large numbers seem unfamiliar to classical data analysts, although they are facets of the same issue. (I) The term discovery was used above. Traditional statistics is typically concerned with testing hypotheses, playing hunches, or using directed queries, going immediately, for example, to specific columns of data to test hypotheses about correlations between them. This is different from discovery, when we take the position that we have no prior understanding or guesses and wish to find everything which is interesting. Then we have to examine all combinations of columns (or other data structures), which gives rise to the combinatorial explosion.5 (II) The discussion is about potential as opposed to actual rules. Nonetheless, the present author tends to think of potential rules as actual rules; it is simply that some will have an associated value expressing strength (information content, probability, likelihood, degree of belief, weight, confidence, or statistical support, etc.) within some arbitrary agreed proximity of zero (AAPZ). The traditional perspective is that rules with this property are by definition not rules. The argument against that perspective is that, a priori, we still ideally have to look at all 2^N of them to determine which do, and which do not, have such values in the AAPZ. Fortunately, this still allows the possibility that we may be able to think of algorithms to prune out those within the AAPZ early in the calculation. (III).
A common assumption is that the operational number of rules is fewer because generally we do not need to state all the rules: others are implied by them. This is evidently true for rules such as those forming the matter of syllogisms. In particular, there is the common assumption that complex rules
can inevitably be built up from the simpler rules. After all, if a is strongly correlated with b in some way, and c with d, must not a, b, and c all be strongly correlated together? On that assumption, it is typically felt that pairwise interactions should be estimated but others neglected. Such neglect is tempting, since then 2^N rules reduce to a mere N(N - 1)/2 pairwise ones (e.g., 4950 when N = 100, which is a lot less than 10^30). Sometimes additive (or other simple functional) behavior is assumed, to similar effect, so that e.g., |a,b,c> is somehow deducible from |a,b>, |b,c>, and |a,c> without the need for any empirical information, i.e., "cross term", about a, b, and c collectively. For these reasons, tables of pairwise metrics, cluster analysis, principal coordinate analysis, and multidimensional scaling also appear extensively in more advanced statistical courses. For many systems, this will be enough. Taking account of chirality, one could for example work out the three-dimensional structure of a protein molecule given all the distances between the atoms. Unfortunately, in other systems the assumption of mere pairs, and even triplets, etc., of interactions can mean that more complex, and possibly much stronger, rules are missed. Anything approaching true multivariate analysis is rarely applied to more than a handful or two of parameters. The researcher never sees rules based on more complex combinations of parameters and so cannot regret their absence, perhaps invoking a false satisfaction that all relevant correlations or rules have been discovered. The above simplifying assumptions and conclusions should not be bolstered by an erroneous belief that certain relationships between parameter values behave rather like Euclidean distances, or distance-dependent pairwise Newtonian forces, if this is inappropriate to the system studied and the manner of its study. A clinical counterexample is that the notable lack of (i) anemic elderly pregnant male patients cannot even be approximately deduced from (ii) the abundance of anemic elderly male patients and (iii) the abundance of anemic pregnant patients. Empirical observation of the zero weight for |pregnant, male> would give that away, but there are other more symmetrical examples for which that is not the case. The probability of (i) seeing Alice, Bob, and Carol together on a motorcycle cannot be deduced from (ii) a high probability of seeing Alice and Bob on a motorcycle and (iii) a high probability of seeing Bob and Carol on a motorcycle. Nor can the simpler rules be deduced from the more complex in such a case. 1.4. Equivalence of Numerical Approaches. In the above discussion, references to quantitative analysis have been parceled in with qualitative analysis. There are differences, but in fitting parameters to raw data, the minima located have analogies to rules. It may be that in some cases dimensions are independent, in the sense that the data-fitting function surface of multiple parameters is simply formed from those of each, or at the other extreme that there is only one minimum, or few minima to which all starting points easily lead, but this is not so in general. In fitting numerical data, it seems easier to forget that the eigensolutions represent completely independent rules and so coexist as correct solutions to the same data, albeit of varying strength.
The difficulties are analogous to those in applications of neural networks and statistical-mechanical problems such as protein folding; there may be billions of such eigensolutions of different strength (e.g., free energy), and it is easy to become trapped in a local solution. 1.5. Zeta Theory. A strategy for tackling irreducible, high-dimensional problems may lie in the fact that (a) information
measures can be written in terms of the Riemann Zeta Function and (b) statements in much of number theory, and certainly the part dealing with prime numbers, map directly to statements regarding the way that data is represented and mined. I here extend my argument that it helps to view a major group of data-analytic procedures as an applied branch of number theory.3-5 This viewpoint has grown slowly6-25 based on a number of information-theoretic studies from the author's and collaborators' laboratories,6-18 which generally employed a Bayesian Expectation11 of information measure with terms which may now be seen as a special case3-5 of the Riemann Zeta Function.26 It resulted in a number of tools18-25 widely used and tested by the bioinformatics community (especially the GOR method,18 a protein structure prediction tool, and subsequent variant forms20-26). Unfortunately, these forms were "hardwired" to protein structure problems and were not typically applicable to more general applications, such as clinical and financial data mining. Recent research, now using a number-theoretic perspective, allows a more general application. The Riemann Zeta Function remains core, and the approach represents "A Zeta Theory of Observations, Data, and Data Mining". It is not intended to convey that Zeta Theory is superior to classical and more recent tools, which represent an impressive armory27-49 for data analysis. Or at least it should be said that the theory does not seek to be competitive on the same terms. The ultimate reason for using Zeta Theory is perhaps that we chose to do data mining. Data mining is concerned with discovery, as opposed to simple hypothesis testing or specifically directed statistical analyses or queries. Once the relationships are found, in the manner of finding a needle in a haystack, then they can, and should be, verified and quantified by approved statistical methods. However, there are some arguably nice features of Zeta Theory which may interest the general statistician, notably the particular way in which it treats sparse, even zero-occurrence, data. Sparseness occurs in events of high dimensionality, and in regard to those events which occur much less than expected (negative associations). This includes those combinations of events which are never seen at all but which would have been expected to be seen, on a statistical basis. Though they are oft neglected, negative associations can play a major role in many areas such as bioinformatics, security monitoring, credit management, financial risk management, economic indicator monitoring, epidemiology, and preventative medicine. The smooth continuity between zero, one, and many observations might be considered as forging a link between qualitative, e.g., anthropological or interview-based, research, which can be based on one-off observations and anecdotal data, and the quantitative research which often follows it. Other distinctive features arise from the intended generality of the approach and are of more mathematical or possibly physical interest. However, these aspects also relate to (i) algorithms for estimating likelihood of hits in searches, (ii) the relationship with molecular libraries and expression arrays, and (iii) expressing knowledge in more elaborate forms, which are of direct or imminent interest in proteomics (see ref 5 for review).
All these aspects arise because Zeta Theory, by its relation to the Incomplete Riemann Zeta Function, highlights a new parameter (the Riemann Zeta parameter s) as a feature of data mining and data analysis in general.
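To make the counting argument of Section 1.2 concrete, the short Python sketch below (my own illustration, not part of the original tools; the column counts are arbitrary) tallies the 2^N potential rules of an N-column spreadsheet, the 2^N - N - 1 that remain after discarding self-rules and the empty rule, and the Π_j k_j generalization for multi-state parameters.

```python
from math import comb, prod

def potential_rules(n_columns: int) -> int:
    """All subsets of N columns: sum over n of C(N, n) = 2^N (Section 1.2)."""
    return sum(comb(n_columns, n) for n in range(n_columns + 1))   # == 2**n_columns

def interesting_rules(n_columns: int) -> int:
    """Drop the N single-item rules and the empty rule |>: 2^N - N - 1."""
    return potential_rules(n_columns) - n_columns - 1

def potential_rules_with_states(states_per_column: list[int]) -> int:
    """Pi_j k_j when column j can take k_j distinct values."""
    return prod(states_per_column)

print(potential_rules(3), interesting_rules(3))   # 8, 4: |a,b>, |b,c>, |a,c>, |a,b,c>
print(potential_rules(100))                       # about 1.27e30, the 10^30 of the text
print(potential_rules_with_states([2, 3, 5]))     # 30 for a three-column multi-state case
```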

2. Natural Numbers as Item and Record Tokens
2.1. Data Items in Data Mining are Like Prime Numbers in Number Theory. The link between data mining and number theory is relatively simple. Any item or entry on a record may be uniquely represented by a prime number, which is called the token Ξ for the item. For computational purposes, it is convenient to have the more common items associated with the lower prime numbers.3-4 However, in assigning prime numbers to items, it is important to understand which features of complicated data are the items. An item may be simple and unqualified, e.g., female, or made more complicated (have internal structure) by qualifying it with metadata, e.g., gender:=female. More generally, an item is part of a tree graph representing the path from a specific leaf node back toward the root as far as is required at least for uniqueness, e.g., animal:=vertebrate:=mammal:=primate:=human:=patient, all of which is one item. Note that if, for example, a value such as 63 is presented without metadata, it is itself also formally an item and simultaneously a data value. However, in the form Age:=63, it is Age:=63 which is the item, Age is the metadata, and 63 is the data value (parameter value). Throughout the above, the operator := reads as "is the metadata associated with". By this operator, it becomes immaterial whether the general data being examined is of collection ("bag"), set, or list type, including biological sequences. All pertinent data structures can ultimately be rendered into items represented by the above types. Where order is implied, appropriate metadata can be created from that which was implicit, as in Column_6:=female, or from a biological sequence, as in Amino_acid_residue_26:=alanine. Once metadata is properly associated with data values, items can appear in any order, i.e., in a collection, while retaining the original meaning, e.g., as in a spreadsheet or biosequence. 2.2. Some Practical Considerations. The above are the formal requirements, but some practical recommendations help clarify what is the item, what is the metadata, and what is the data value in practice. Though data structures such as XML can be rendered into items, values often occur in spreadsheet cells, and the head of the column is taken as the same metadata for every data value in that column. Metadata and data values which are synonyms should be processed into, or otherwise ultimately recognized as, identical character strings. This concept can be extended to, for example, standardization of times and dates into universal time. Unknowns are a special case and are removed from processes such as statistical normalization. In our earlier method FANO2,3 (also known as CliniMiner) using Zeta Theory, such complex multi-metadata structures are preserved through to output for further analysis, but only the rightmost component following ":=" (if present) was treated as the data value, meaning here the value of the parameter such as "216 pounds". The rest was treated as if simple metadata. Though chained forms a:=b:=c:=... relate to tree data structures, complex data such as subgraphs, where the subgraph is specifically to be analyzed, should somehow be represented as a string to the right of the rightmost ":=" if any.
A useful minimal parsing is that runs of leading and trailing blanks are deleted and runs of embedded blanks are replaced by underscores, yielding e.g., "normal_X-ray", and ":=" in the data value itself is disallowed or converted to ":". In FANO,2,3 operators such as :=< and :=> have a role consistent with the above, but a special significance. For association analysis, entries such as Weight:=215.5, while valid,
are not very useful, since there would be a huge number of sparsely represented states. Such data may be optionally pooled into two or more encompassing states, say Age:=>40 and Age:=<40. 2.3. Records and Sub-Records as Products of Item Tokens. A record |ΞZ> coded ΞZ can be split into further sub-records |ΞA>, |ΞB>, |ΞC>, ... if ΞA × ΞB × ΞC × ... = ΞZ. The intuitive inverse process is, however, not always symmetrical, in the sense that if |ΞA> and |ΞB> are, for example, sub-records of the same record, then their union |ΞA × ΞB> is indeed also a sub-record of that record or the whole record, but only provided the two sub-records have no common items. If they do, and it is known that data items can only occur once, then the duplicate items must be removed. Otherwise, |ΞA × ΞB × ΞC> is equivalent to |ΞA, ΞB, ΞC>, and so on. 2.4. Queries as Divisibility. A directed query against a record corresponds to dividing a natural number Ξ(R) into Ξ(β) and finding that there is no remainder ("integer division", in this text). This shows that the sub-record coded Ξ(R) is present in Ξ(β), i.e., that there is a "hit" or "partial match" of Ξ(β) with the "probe" or "query" Ξ(R). The use of such a single division constitutes a directed query. One need not query numbers larger than the square root of Ξ, noting that if Ξ(R) divides into Ξ(β), then so must Ξ(β)/Ξ(R). These ideas hold for undirected queries, save that it is useful to introduce the Tau function as follows. 2.5. Undirected Queries and the Tau Function. The Tau function τ(Ξ) is a number-theoretic function that returns the number of integer divisions possible into a number argument Ξ, such that for arguments 1,2,3,...,12 the function returns 1,2,2,3,2,4,2,4,3,4,2,6. Note that division by unity, and by the number itself if it is not again unity, is counted, and hence that when Ξ is a prime, τ(Ξ) = 2. For example, when
there are three items a, a, and b coded, say, as 2, 2, and 3, the record has token 12, complexity (compositeness) C(12) = 3, and τ(12) = 6 potential rules, namely the three "trivial" rules |> (division by 1), |a> (division by 2), and |b> (division by 3), plus the three "potentially interesting" rules |a,a> (division by 4), |a,b> (division by 6), and |a,a,b> (division by 12). In a record of token Ξ and complexity C(Ξ), i.e., of C(Ξ) items, there are thus k^C(Ξ) possible distinct directed queries for items of k possible states. If, as is typical of a spreadsheet, items cannot recur, then items are either present once or not present in the selection, and there are

τ[Ξ] = Σ_{n=0,...,C[Ξ]} C(Ξ)!/[(C(Ξ) - n)!n!] = 2^C(Ξ)    (1)
possible sub-records or potential rules. Recall that we are typically not interested in rules about single items, since we are looking for associations or other correlations between at least two things. Subtracting these N + 1 less interesting rules gives Σ_{n=2,...,C(Ξ)} C(Ξ)!/[(C(Ξ) - n)!n!] = 2^C(Ξ) - C(Ξ) - 1, corresponding to the 2^N - N - 1 discussed in the Introduction. Either way, eq 1 is independent of the way in which we code items as prime numbers. Number theory thus tells us that if Ξ = Π_i p_i^e(i), where the p_i are the primes coding for items and e(i) is the number of times (0,1,2,3,...) that the item appears in the record, then τ(Ξ) = Π_i (e(i) + 1). It is useful, for computer memory management, to note the number-theoretic statement that the order relation τ(n) << n^δ holds for every fixed δ > 0. τ[Ξ] also corresponds to the number of eigensolutions as local minima potentially obtainable by multivariate analysis of a spreadsheet of numerical data in C(Ξ) = N columns. An undirected query for discovery corresponds to repeated division by all natural numbers 1,2,3,4,5,6,7,...,Ξ, and discovery of those which divide without remainder is an undirected query for discovery of contained associations. This generates up to all τ_k[Ξ] = k^C(Ξ) possible sub-records or potential rules mentioned above, as appropriate, without omission and without redundancy. 2.6. Counting. The set of sub-records so identified in a record R of token ΞR is {R}, and the division process so identifying them is the Tau transform T in {R} = T(ΞR), with the inverse ΞR = T^(-1){R} = |{R}|, say. Counting involves concepts of undefined states and existentially qualified states. Every time a sub-record is identified as a potential rule for the nth time, we can say that it has so far occurred n times. We also need to count items or collections of items if they occur more than once in a single record, as revealed by one application of T. We will assume that this process is complete and that examination of the archive is exhausted, in which case we may say that the frequency (number of sightings, observations, occurrences, abundance) of a sub-record Ξ, i.e., rule incidence Ξ, as generated by the application of T over all records, is now and finally n[Ξ]. (Of course, archives can grow dynamically, so that conclusions drawn from all the n[Ξ] so far apply to that time of analysis.) The computer variable corresponding to n[Ξ], using the token for a sub-record, is the counting array count{Ξ}, a hash array which is incremented by one every time a particular association (sub-record, rule incidence) with the token Ξ is seen in the database, the final value representing the number of occurrences of the concurrence of the items encoded by Ξ. The first time the particular token Ξ is seen, count{Ξ} is created and count{Ξ} = 1. Then, and for all further occurrences, the event
is said to be existentially qualified. The above processes involved with existential qualification also imply the operator or functional aspect of Ξ( ) discussed above. Consider, however, a case regarding the lack of sub-records that would contribute to the potential rule |Gender:=male, Pregnant:=yes>. If this combination is never seen, a hash array count{Ξ} key for that rule is not created, and if considered, a default of |0> (i.e., undefined) is initially assumed. However, this triggers, at least transiently (for as many relevant complex rules as possible, including |Gender:=male, Pregnant:=yes>), the definition of the relevant rules with the specific occurrence. That is, count{Ξ} = 0 is created, i.e., n[Ξ] = 0. This is important, because events which ought to have occurred (statistically speaking), but which did not, are of potential interest to, for example, preventative medicine. Thus, a final rule generated does not have to be found in the set of rule incidences in order to be considered. 2.7. Forms of Products Involving Tokens. Data structures containing several items can be joined by various operations which involve multiplication of their tokens. There are two forms of true multiplication of particular interest. The numerical product may be written in the first form |{A}| × |{B}| or ΞA × ΞB for two records A and B. That is, sub-records may be "joined" simply by multiplying their tokens together. This corresponds to the idea of union and so is discussed along with other operations in Section 3 (see in particular Section 3.7). In the present section, we are more concerned with representing the data in a more manageable form for fast calculation. Since large records may have unwieldy large integers as tokens, and since repeated division is computationally expensive, it is useful that there is also the general canonical product {A}.{B}, which can be applied to arbitrarily selected fragments of records with smaller token values, and in such a way as to reproduce the effect of applying a single T operation on a whole record. For example, we can also represent the token of value 42 by the string '2 × 3 × 7' or the set {2,3,7}, whence the values are said to be unwound (they can also be partially unwound, viz. '6 × 7' or {6,7}). One generates a new set, say {X}, from the product of the tokens between every element of {A} and {B} taken pairwise. The maximum number of tokens in any set {X} is τ[ΞX]. Note that the initial number of tokens is Σ_{R=A,B,C,...} τ[ΞR] and the final number of tokens is Σ_{R=A,B,C,...} τ[ΞR] + Π_{R=A,B,C,...} τ[ΞR]. 2.8. Information Capacity of a Token. As noted above, when, as in a spreadsheet, items do not typically repeat, then τ[Ξ] = 2^C[Ξ], and so we may take the information content in terms of number of rules as C[Ξ] bits. Other scenarios, including cases of collections (bags) of data where items may recur k times, lead to C[Ξ] information units to base k. By either argument, the logarithm of Ξ to some base is also at least an approximate estimate of the information content of a record or sub-record. Consideration of the item-carrying capacity, i.e., the number of primes that the token can contain, yields natural logarithms. From the Prime Number Theorem as first expressed by Gauss, π(Ξ) is approximately Ξ/log_e(Ξ); I[Ξ] = π(Ξ) is the number of primes up to Ξ and hence the approximate number of items found by an undirected query using repeated integer division. An arguably appropriate estimate of information I[Ξ] for small values of Ξ is I[Ξ] = log(e^(ζ(s,n[Ξ])-γ) - e^(-γ)), which approaches log(Ξ) as Ξ increases.
Consideration of any record of token Ξ requires in the worst case a maximum of Ξ divisions into Ξ potential states, so I[Ξ] = log(Ξ). Average computational effort yields logarithms to base 2. According to a classical result of Dirichlet, the
average value of τ(i) on i = 1,2,3,...,Ξ is log(Ξ) + 2γ - 1. Above, γ is the Euler-Mascheroni constant 0.5772156....
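The token machinery of Section 2 can be sketched in a few lines of Python. This is an illustration only (it is not the FANO/CliniMiner code, and the item strings and records are invented): items are assigned primes in order of first appearance, a record token is the product of its item primes, the Tau transform enumerates sub-records as divisors, rule incidences accumulate in a count{Ξ} hash, and a directed query is a single integer division.

```python
from collections import defaultdict

def primes():
    """Yield 2, 3, 5, 7, ... by trial division (adequate for a sketch)."""
    found, n = [], 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1

def make_tokenizer():
    """Assign successive primes to items in order of first appearance (Section 2.1)."""
    gen, table = primes(), {}
    def token(item: str) -> int:
        if item not in table:
            table[item] = next(gen)
        return table[item]
    return token

def record_token(items, token) -> int:
    """The record's token Xi is the product of its item primes (Section 2.3)."""
    xi = 1
    for item in items:
        xi *= token(item)
    return xi

def tau_transform(xi: int):
    """Undirected query (Section 2.5): every divisor of Xi is a sub-record / potential rule."""
    divisors, d = [], 1
    while d * d <= xi:
        if xi % d == 0:
            divisors.append(d)
            if d != xi // d:
                divisors.append(xi // d)
        d += 1
    return sorted(divisors)

token = make_tokenizer()
count = defaultdict(int)                      # the count{Xi} hash array of Section 2.6
archive = [["Gender:=male", "Age:=>40", "Anemia:=yes"],
           ["Gender:=female", "Age:=>40", "Anemia:=yes"]]
for record in archive:
    for sub in tau_transform(record_token(record, token)):
        count[sub] += 1                       # rule incidence n[Xi] for each sub-record

probe = token("Age:=>40") * token("Anemia:=yes")       # a directed query token
hit = record_token(archive[0], token) % probe == 0      # one integer division: a "hit"
```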

3. Dirac Notation, Degeneracy, Inversion, Intersection, and Union
Here is enlarged a discussion of the useful "|observation>" or ket notation introduced above. The ket notation is a component of Dirac's51 bra-ket (from "bracket") notation which, although developed for descriptions of observations in quantum mechanics, is insightful. In all cases we can consider the above more generally in the above-discussed token form, e.g., |Ξ>, where the token describes, in terms of physics, the state; |Height:=6, Weight:=200> and |Ξ> are examples of a ket. The underlying idea of the notation is that the complete bracket (or bra-ket) expression denotes a number and any incomplete bracket expression is a more complicated mathematical entity of the bra or ket kind. The bras and kets are special forms of vectors of indefinite dimensionality. In general, all kinds of ket vectors |Ξ> have a real or complex quantity (a weight) which can be considered a function of the vector, and hence ultimately of its token argument Ξ. We might thus write the value v returned by the function of a ket as v = f(|Ξ>), though by convention v = |Ξ> (see, however, the discussion of "domain" below). We assume, as did Dirac, that the function of the vector is a linear one, such that |ΞA> + |ΞB> means "the sum of the weight corresponding to |ΞA> and to |ΞB>", that <ΞC|{|ΞA> + |ΞB>} = <ΞC|ΞA> + <ΞC|ΞB>, and that <ΞA|{c|ΞB>} = c<ΞA|ΞB>, c being any number. Like all vectors, these entities can also be associated with dual vectors indicated by *. Given any two ket vectors |ΞA> and |ΞB>, we can construct a number by forming the scalar product of |ΞB> with the conjugate imaginary of |ΞA> and taking the conjugate complex of this scalar product, so that <ΞA|ΞB> = <ΞB|ΞA>*. We mainly here assume the classical realm (referred to as the "macroscopic limit" below to avoid certain ambiguities), by which is precisely meant no imaginary component. Thus, <ΞA|ΞB> = <ΞB|ΞA>* = <ΞA|ΞB>* = <ΞB|ΞA>. However, the functions (see Section 4) that are used to attach a weight to bras and kets can have an imaginary component (see below), as in Dirac's approach. This suggests a continuity with quantum mechanics which can help us form concepts and develop algorithms of more general power. 3.1. Rules. In Zeta Theory, kets are (usually) rules, i.e., summary statements about associations or other correlations (such as variances between data) between two or more factors, such as values of height and values of weight. Thus the rule strength can be considered as the quantity associated with a ket, as a function of the vector, and also as the value returned ultimately from the token Ξ as parameter, as discussed above. That is, we might write that value v as v = fa(|Ξ>) = fb(Ξ). An example of such a rule is the ket |Height:=6, Weight:=200> associated with its rule strength expressing the quantitative information content of, or confidence in, or weight of, the rule. We may consider that the rule strength is a statistical summary of sorts over all bras which can be associated with the ket representing the rule, e.g., over all patients. Ranges such as |Weight:=>200> (greater than 200) are, strictly speaking, degenerate, though we may argue that we chose to assign the ranges as true states (say two states, one above and one below a critical value), which is valid as long as they are unambiguous. While examples such as |Systolic_pressure:=120, Systolic_pressure:=140> seem "irregular", they can emerge in certain operations such as those described below, in certain relational database operations, and in relation to the discussion of the previous Section 3.2.
At first consideration regarding macroscopic observation, such a rule should have a value reflecting a zero occurrence, if the blood pressure readings relate to the same moment, because the patient cannot have two different systolic pressures at the same time. However, we can consider them as results from separate observations (repeated measurements) which are susceptible to error. The program FANO2,3 takes this position; in FANO the same metadata can occur within a record (which is thus degenerate), and the nature of the algorithm means that the statistical interpretation falls out naturally. Some further miscellaneous comments may help clarify the uses of the notation and the concept of degeneracy and related matters, and their interplay. When we write the ket without specifying a bra, we are in effect equating the bracket taken over all records with the ket |Height:=6, Weight:=200>, i.e., no specification by a bra means a global account, in the same way that such a stand-alone ket really means the set of records associated with those properties. As stated above, we generally chose to use the stand-alone ket form such as |Height:=6, Weight:=200> to describe some kind of set identified with a state, and some kind of statistical summary over the set, notably a rule. 3.4. Empty Case and Undefined Case. When an archive has records which are empty, as opposed to no records, then '|Ξ>' ≡ '|>' when Ξ = 1, i.e., it is defined but has a null value. This means that there is at least divisibility by one, and hence we know that the variable is defined even if the exact value is not yet specified or known to us. In contrast, when there is no record to consider, '|Ξ>' when Ξ = 0 is considered undefined. If there is an attempt to determine an observation and division by one is not identified, or an item is referred to and not found in the working variable list, its operational token is zero. In other words, the universe of things we have not yet considered is populated by items associated with the token value zero. 3.5. Domain. At the beginning of this Section 3, it was noted, in regard to the value returned by the function of a ket, that "we might thus write that value v as v = f(|Ξ>) though by convention v = |Ξ>". We can elect that every time we speak of a rule, we speak of the rule strength value, and also, by convention, of the set of data over which data analysis of some kind is performed to obtain that rule strength, as defined by the state and represented by the token. That set of data is the domain. In general, a domain could be an entire archive, or a null (empty) domain, though typically it is a row or a column. For example, an archive can often be perceived as a spreadsheet matrix, starting with each row representing a patient. Then a bra can be associated with any column. We can also form statistical summaries over these domains, which give us some value of the function, being typically the rule weight. 3.6. Inversion. Inversion is of interest in regard to the Incomplete Riemann Zeta Function ζ(s,n) described below, where typically n relates to a number of records supporting an observation, but in other cases s plays that role. Inversion also illustrates what is meant by the equivalence of <ΞA|ΞB> and <ΞB|ΞA> in the macroscopic limit, while writing, e.g., the patient or other unique record specifier to the left "by convention". That the patients and their data are really interchangeable, and not profoundly different, can be seen because such a matrix can be inverted through the diagonal.
By inversion to obtain the description in the cells, we flip the bras and kets such that <ΞA|ΞB> becomes <ΞB|ΞA>, i.e., we alter our convention that the patient is written in the bra. The matrix would then retain exactly the same information overall but, applied to medical records, this would mean that different patients will subsequently represent the metadata, and the records are now records about the values
of properties. Each such record would then be a record about one property across patients. Intersection relates to inversion because it provides part of the means by which we could generate tokens for the new records generated from the old records before inversion. 3.7. Union. The case of union, in the sense of combining rules |ΞA> and |ΞB> to form a more complex rule, has been noted above and is done by writing |ΞA × ΞB>, where '×' is the standard scalar numeric multiplication operator. |ΞA × ΞB × ΞC> is equivalent to |ΞA, ΞB × ΞC>, and so on. See, however, Section 2.3 for comments on the inverse operation if we wish to build an actual or realistic sub-record from smaller sub-records. If the rules being put together involve common items when items are not supposed to repeat, then the extra entries should be identified by the intersection operation and removed.
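Under the token representation, the union and intersection operations of Sections 2.7 and 3.7 reduce to ordinary integer arithmetic: multiplication joins sub-records, the greatest common divisor extracts the shared items, and dividing the product by that divisor removes duplicates when items may not repeat. The sketch below is my own reading of those operations, with made-up item codes.

```python
from math import gcd

def union(xi_a: int, xi_b: int, items_repeat: bool = False) -> int:
    """|XiA x XiB>: multiply tokens; if items may not repeat, divide out the overlap."""
    product = xi_a * xi_b
    return product if items_repeat else product // gcd(xi_a, xi_b)

def intersection(xi_a: int, xi_b: int) -> int:
    """Token of the items common to both sub-records."""
    return gcd(xi_a, xi_b)

# With items coded 2, 3, 5: |2,3> and |3,5> share the item coded 3.
a, b = 2 * 3, 3 * 5
print(intersection(a, b))              # 3
print(union(a, b))                     # 30 = 2*3*5, the duplicate 3 removed
print(union(a, b, items_repeat=True))  # 90 = 2*3*3*5, if repeats are allowed
```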

4. A Natural Measure of Surprise Can be Expressed in Terms of the Riemann Zeta Function
In Section 2, it was stated that the set of sub-records so identified in a record R of token ΞR is {R}, and that the division process so identifying them is the Tau transform T in {R} = T(ΞR), which involves or is related to the τ function. 4.1. Expected Information as a Function of n[Ξ]. A definition of rule weight is required which appropriately treats the case n[Ξ] = 0, and also n[Ξ] = 1, n[Ξ] = 2, etc., up to large values of n[Ξ], where results should converge to more classical thinking. The author has held3-5,11 that the natural choice for deducing the above quantities indicating surprise involves a particular information measure which is the Bayesian estimate of linear functions of terms of form log(P[Ξ]), i.e., the Expected Information formed by integration of Pr[P[Ξ]]·I[Ξ]·d[P[Ξ]]. The Bayesian posterior density Pr[P[Ξ]] will be a likelihood of form such as P[Ξ]^n[Ξ], or multiple analogous terms, say the binomial P[Ξ(R)]^n[Ξ(R)] P[Ξ(β)]^n[Ξ(β)]. Such integrations, and estimated moments of information which are obtained by integrations over information raised to higher powers, yield estimators which in general can be represented additively as series of positive, zero, or negative constants or terms which are related to the Riemann Zeta Function ζ(s), as follows. 4.2. The Incomplete Riemann Zeta Function (IRZF). The (Complete) Riemann Zeta Function ζ(s) can be considered as a special case of the Incomplete Riemann Zeta Function ζ(s, n[Ξ]) where n[Ξ] = ∞. For finite values of n[Ξ], the summation or integration which defines ζ(s) is bounded by the value of n[Ξ].11 For example, it is known that ζ(s,n) = (1 - 2^(1-s))^(-1) Σ_{j=1,2,...,n} (-1)^(j-1) j^(-s) for σ > 1 in s = σ + it, where σ is the real part and t the imaginary part of s. The following define the Incomplete
Riemann Zeta Function (IRZF) in the realms of most interest here. I have elsewhere3 derived and applied the computationally explicit, more general form valid in the complex plane 0 < σ ≤ 1. Where, for example, O > o > e > E (O and E being, respectively, the observed and expected counts for one rule, and o and e those for the other), the relative value will stay the same independent of s = 1,2,3,... To prove that the rank order of rules is not completely invariant on s, it is sufficient to show for at least one case that the following condition does not hold: ζ(s,O) - ζ(s,E) - ζ(s,o) + ζ(s,e) > 0 for all s = 1,2,3,... By "case" here is meant that, in terms of the relative values O, E, o, and e, there are six relative relations which can always be "greater than or equal to" or always "less than", giving 2^6 = 64 possible cases. It is easy to show for 16 of these states that the rank order will indeed not change with s = 1,2,3,... However, consider the case e < E < o < O, when the two summated or integrated series of the IRZFs overlap in the region of terms E...o. The terms in that region of overlap will always cancel when one rule with range of terms e...o is subtracted from the other with range of terms E...O. For simple treatment of the range of terms o...E, use the case E = 0 and e = 1, because ζ(s, E=0) = 0 and ζ(s, e=1) = 1, respectively, whatever the value of s. The difference between the two rules is thus equal to or greater than -1, depending on the sum over the series represented by the terms spanning the interval o...O, which is the only section of sequence left to consider. Similarly, ζ(s,O) - ζ(s,E) - ζ(s,o) + ζ(s,e) can change from less than to greater than zero depending on how much we extend the series o...O. So for some types of rule case, order will depend on the magnitudes of some of the parameters, while others still provide a test for program legitimacy. 4.6. Complex Values of s. There are at least two areas in which complex values of s are of interest. The first is to seek to derive a two-dimensional quantification of the predicate calculus, much as probability theory is a quantification of Boolean logic.3-5 The second, application to quantum mechanics, is intriguing because a consistent Zeta Theory of
observations, data, and data mining should be able to achieve continuity with it. Recall that |Ξ> is a rule (see above) represented with some value expressing degree of information content, confidence, probability, or weight. Preliminary efforts propose that the Incomplete Riemann Zeta Function is itself the basis of a candidate calculus for quantum mechanical amplitudes and weights. What is required for that is that a complex fixed value of s can always be chosen for any one specified linear superposition of states such that

(6)

The significance of the summation range limit M can be justified empirically. Given additional due account of a parameter which can be interpreted as including the effect of prior information, such that ζ(s, L = M) is replaced by ζ(s, M + a) - ζ(s, a), this closely resembles, and is arguably equivalent to, the empirical equations for hits of cDNA libraries used in expression studies.5
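As a purely numerical illustration of Sections 4.2 and 4.5, the sketch below assumes the simplest reading of the IRZF for integer arguments, the partial sum ζ(s, n) = Σ_{j=1,...,n} j^(-s), and evaluates the rank-order comparison ζ(s,O) - ζ(s,E) - ζ(s,o) + ζ(s,e) between two rules at several values of s; the particular counts chosen are arbitrary.

```python
def zeta_incomplete(s: float, n: int) -> float:
    """Partial-sum form of the IRZF for real s and integer upper limit n."""
    return sum(j ** (-s) for j in range(1, n + 1))

def rule_order_statistic(s: float, O: int, E: int, o: int, e: int) -> float:
    """Positive when the (O, E) rule outranks the (o, e) rule at this s (Section 4.5)."""
    return (zeta_incomplete(s, O) - zeta_incomplete(s, E)
            - zeta_incomplete(s, o) + zeta_incomplete(s, e))

# Evaluate the comparison at a few values of s; the text notes the ordering
# of two rules can depend on s for some configurations of O, E, o, e.
for s in (1, 2, 3):
    print(s, rule_order_statistic(s, O=50, E=5, o=20, e=1))
```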

6. Example Algorithms for Reducing Token Size and Division, Arising from Number Theoretic Considerations

|ΞA × ΞB> = √(e^(ζ(s,n[ΞA])-γ) - e^(-γ)) |ΞA> + √(e^(ζ(s,n[ΞB])-γ) - e^(-γ)) |ΞB> :- |e^(ζ(s,n[ΞA])-γ) - e^(-γ)| : |e^(ζ(s,n[ΞB])-γ) - e^(-γ)| = n[ΞA] : n[ΞB], Lt n → ∞
where ":-" means "implies" and the colon ":" signifies the ratio of the left argument to the right argument. γ for s = 1 is the Euler-Mascheroni constant, satisfying ζ(s=1, n[ΞA]) - γ = log n[ΞA], Lt n → ∞. However, other values are allowed.

5. Probabilities of Partial Matches are Governed by Reciprocals of the Incomplete Riemann Zeta Function
Insight into the meaning of s is obtained by rewriting eq 3 as ζ(s,n) = Σ_{i=1,2,3,...,n} e^(-s·log(i)) and by known relationships such as ζ(s, n=∞) = √(Σ_{i=1,2,3,...,∞} τ(i)·i^(-s)). These suggest that the Zeta Function relates to a statistical-mechanically averaged value of information for an increasing number of observations of sub-records so far, assuming a given fixed number s of possible forms, and hence relating to the sub-record complexity (record length, number of items). There is evidence of an interchangeable interpretation between s and n as parameters of the IRZF. The above-mentioned operation of inversion transforms an archive of n records of length s into s records of length n. When two or more record codes have no number (other than 1) which will commonly divide them, the numbers standing for the record codes are by definition coprime, and the records contain no item which is present in all. An important probability P[n[Ξ], coprime] = 1/ζ(s = n[Ξ], L = ∞) is the probability that n[Ξ] arbitrarily selected numbers will be coprime, and hence that s arbitrarily (randomly) selected records or sub-records cannot all contain a common item (though as many as n[Ξ] - 1 may do so). Strictly speaking, the relation is known to be true for P[n[Ξ], coprime] at n[Ξ] = 0, 2, and ∞ and must hold at least approximately otherwise;5 however, a recent mathematical analysis50 seems to be extensible to all natural number values of n[Ξ] in P[n[Ξ], coprime], so that we may write

P[n[Ξ], coprime] = 1/ζ(s = n[Ξ], L = ∞)

Here ζ(s, L) is the Incomplete Riemann Zeta Function that for natural numbers L = 1,2,3,... has the simple form 1 + 2^(-s) + 3^(-s) + ... + L^(-s). The above is thereby the probability that at least one of the n[Ξ] will be unique in all items, suggesting again a relationship to quantity of information content. In this case, there is implicit complete summation up to L = ∞, which includes sampling from all possible records in principle, including large ones of "transastronomic" size. If one brings the (s+1)th record (the "probe") for comparison with s records, the probability of not sharing a common item with any of them is ζ(s=2, L=∞)^(-s) if the probe of code Ξ is randomly selected from all possible codes. For records with codes in the range 2...M, i.e., of maximal possible finite size M,

P[n[Ξ], hit] = 1 - 1/{ζ(2, L=∞)^s × ζ(s, M)}    (5)

The most general time-saving approach can be to use principles such as generating rule incidences by repeated division to automatically generate appropriate "hard wired" code, i.e., in which all cases of combinations are covered explicitly without further division. In current programs using the Zeta Theory, this is typically done for processing items of up to 10 at a time, for example. More specific case-by-case examples follow. 6.1. Using Record Fragments and Rebuilding Results by Forming the General Canonical Product, and String Winding Transitions. In Section 2.7, it was noted that Ξ = 42 is equally capable of representation by strings, say here {S} = '42', '2 × 21', '6 × 7', including the full factorization '2 × 3 × 7', i.e., to have processes which potentially display all τ[Ξ] possible fragments. These are examples of winding or "prime-composite" transitions into more wound (composite) forms or "melting" into more unwound (factorized) forms. The fully multiplied form (here, 42) will always be used in hash arrays, especially as count{Ξ} (where again the curly braces now imply a hash array, not a set). While a primary algorithm for taking the general canonical product of such string representations was given in Section 2.7, that merely shows how string (i.e., incompletely wound) representations can be brought into the general analysis and counting process. It does not exhaust all the algorithms of interest. Briefly, it may be stated that a tactic with a statistical-mechanical basis can be made out of how to make best use of winding at different points in the program, including using probabilities such as that in eq 6. Not all τ[Ξ] possible strings of the set {S} need be generated at any time; they may instead be selected with a bias, on a Monte Carlo basis. For example, there is an analogy of winding with, e.g., helix-coil transitions in statistical-mechanical polymer theory. This involves a statistical weight for wound-form formation, a statistical weight for nucleation related to the cooperativity of the process, and an effective temperature. Hence, different temperatures can be assigned to different parts of the program and their values varied so as to optimize performance. It is not always obvious how one should make the best temperature assignments from the outset for typical data mining tasks, especially in going from language to language with different ways of representing numbers and different quality procedures for large or indefinite integers, so it is useful that this optimization may be empirical. 6.2. Divisibility. For the dividing process up to any of the above conditions, there are also divisibility tests which can reduce the number of divisions still further. For example, simple tests not involving division are available to show divisibility by 2, 3, 5, 11, and with more effort, 7 and 13. Thus, while lower primes are most usefully assigned to the most common items, there is merit in using those primes for which division can be avoided for items of maximal interest.
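Section 6.2's point can be illustrated with the familiar decimal digit tests; the sketch below (an illustration, not the production code) answers a directed query with a cheap test for a few small primes and falls back to a single integer division otherwise.

```python
def divisible_by_3(token: int) -> bool:
    # A number is divisible by 3 iff its digit sum is.
    return sum(int(d) for d in str(token)) % 3 == 0

def divisible_by_11(token: int) -> bool:
    # A number is divisible by 11 iff its alternating digit sum is.
    digits = [int(d) for d in str(token)]
    return sum(d if i % 2 == 0 else -d for i, d in enumerate(digits)) % 11 == 0

def item_present(record_token: int, item_prime: int) -> bool:
    """Directed query: is the item's prime a factor of the record token?"""
    last_digit = int(str(record_token)[-1])
    if item_prime == 2:
        return last_digit in (0, 2, 4, 6, 8)
    if item_prime == 5:
        return last_digit in (0, 5)
    if item_prime == 3:
        return divisible_by_3(record_token)
    if item_prime == 11:
        return divisible_by_11(record_token)
    return record_token % item_prime == 0   # fall back to one integer division

# e.g., a record token 2*3*11*29 = 1914 contains the items coded 3 and 11 but not 5:
assert item_present(1914, 3) and item_present(1914, 11) and not item_present(1914, 5)
```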

6.3. Using Relationships between Number Theoretic Functions. τ is just one example of a number-theoretic function. There are other number-theoretic functions, and properties of such functions,46 which are of interest in regard to halting and related matters. On the basis of these functions, one may define a set of "computational number theoretic functions", i.e., number-theoretic functions which accordingly take, say, the value 1 when halting is to occur (without evaluation), and zero otherwise. We have introduced further number-theoretic-style functions because they represent convenient subroutines in the computer programming of number-theory-based data mining. Others of interest, notably φ(Ξ) and ξ(Ξ), are quite standard in number theory. τ*(Ξ) = φ(Ξ) + ξ(Ξ) is the number of items and sub-records which could have been included in numbers up to Ξ but which were not. As normally coded, this is the number of such items which are more abundant than the item coded by p[Ξ] (the next prime number above Ξ). They are deemed "potentially relevant" sub-records etc. for any analysis of the record of token Ξ. φ(Ξ) is the number of natural numbers up to Ξ which are coprime to Ξ, i.e., the number of items and sub-records which could have been coded for by numbers up to Ξ, but were not, and which contain no common items with record Ξ. This also relates to the number of rules which are not represented but which could have been coded for by numbers up to Ξ. They are deemed "irrelevant" sub-records etc. for any analysis of the record of token Ξ. Note φ(1) = 1. Function ξ(Ξ) is the number of natural numbers up to Ξ which are not divisors of Ξ and which have a greatest common divisor with Ξ other than one, i.e., the number of items and sub-records which could have been coded for by numbers up to Ξ, but were not, but which do contain common items with record Ξ.

Ξ = τ(Ξ) + τ*(Ξ) = τ(Ξ) + φ(Ξ) + ξ(Ξ) - 1    (7)
where the bracketed terms are number-theoretic functions as above. When, for example, none of N items repeat, as in the typical spreadsheet, and there are unknowns in the rows of interest, we obtain

τ*(Ξ) = Ξ - 2^N    (8)
Example algorithms for halting include the following. Halt on d > √Ξ, and deduce the corresponding items Ξ/d. Halt when n = [N/2] + 1. Halt when n = [τ/2] + 1, and deduce the corresponding items Ξ/d. Halt when n* = [τ*/2] + 1: deduce items d + 1...Ξ. Halt when τ - n ≥ [√Ξ] - d: deduce d...Ξ. Halt when τ* - n* ≥ [√Ξ] - d. Halt when d^i = Ξ, where i is any integer > 0: deduce i items with token d. 6.4. Using Tokens to Finite Precision. For several operations, tokens Ξ need only be used to a given precision. Use of the logarithm of Ξ expressed in the standard way is thus useful. One may determine that an integer division is possible but unproven, or conversely that it is disproved. Clearly, τ[Ξ] = τ'[Ξ] - τ''[Ξ], where τ' is the number of possible divisors generated by the above method and τ'' is the number of "false positives". If we have typical spreadsheet data, then when τ' ≥ 2^N or τ[Ξ] - τ'[Ξ] > Ξ - 2^N, we can halt. In several instances, it is convenient and useful to order records and sub-records by the value of Ξ, allowing all subsequent entries to be rejected by satisfying some test at just one of them. In performing a query of a record or sub-record Ξ(R) against a record or sub-record Ξ(β), we can guarantee that there is no exact match if Ξ(R) > Ξ(β)/2, if Ξ(β) is even. A match can be
discarded if the test on Ξ(R), or conveniently on log(Ξ(R)), is not satisfied to a given precision.
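A minimal sketch of the Section 6.4 idea, under the assumption that logarithms of tokens are compared at limited precision as a necessary condition before any exact division is attempted; the tolerance value is arbitrary.

```python
import math

def maybe_divides(probe: int, record: int, tol: float = 1e-9) -> bool:
    """Cheap necessary test: log(probe) must not exceed log(record)."""
    return math.log(probe) <= math.log(record) + tol

def is_hit(probe: int, record: int) -> bool:
    """Fall back to the exact integer division only for surviving candidates."""
    return maybe_divides(probe, record) and record % probe == 0
```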

7. Conclusions
Statements in much of number theory, and certainly the part dealing with prime numbers, map to the way data is represented and mined. Many interpretations of number theory in terms of data and data mining turn out to be, in some sense, trivial or intuitive, though to the nonmathematician they are typically easier to understand in data and data mining terms. Others are of significant worth, having algorithmic value in data mining.
Note Added in Proof (by Daniel Platt, IBM Research, with Barry Robson). Although it is not the central theme, these studies include an attempt to develop a notation of broader power by extending the quantum mechanical notation to this classical data-analytic application domain. Though, strictly speaking, a perfect continuity between the quantum mechanical domain and the classical data-analytic domain is not an absolute requirement for utility, there are, as stated, obvious advantages in finding a consistent and uniform approach, and studies are ongoing to establish that consistency. Several recently arising points could be emphasized from the perspective of the quantum mechanician. Notably, there are a few points that could be clarified in the use of the kets, i.e., the |>'s. (1) Some of the explanation of |>'s in the paper deals with linearity of the vector space, or deals with linear operators on/in a vector space (linear vector functions of a vector). If in the above paper it seems a little unclear which, from time to time, that is because the |>'s in this paper are deliberately set up to embrace several concepts simultaneously. While in principle such objects can stand for several such things, it might be helpful to point out that |>'s are most generally column vectors, and the corresponding <|'s are complex-conjugated row vectors. However, it needs to be said that caution is required for bioclinical interpretation: these "rows" and "columns" do not necessarily correspond to rows and columns of, e.g., the spreadsheet being addressed. (2) It is also valuable to emphasize that inner products (row vector × column vector) are scalars, whereas outer products (column vector × row vector) are matrices or represent linear operator functions. The Ξi's are indices. The structure of the indices, in a way that would admit projections, would be for each class (i.e., hemoglobin) to have its own column, with each value a component of the column: |hemoglobin:=low> = [1,0,0], |hemoglobin:=normal> = [0,1,0], |hemoglobin:=high> = [0,0,1]. Linear combinations of these could certainly be constructed to represent uncertainties, etc. (3) It is worth noting that there is another useful kind of quantum mechanical product, which often looks like |A>|B> (or |A> X |B>, where "X" is often drawn as a circle with an "X" in it). This is actually represented by stacked column vectors. This lets one put multiple variable types together. In this context, as written, the Ξi's are the formal quantum mechanical way to represent multiple projections, e.g., potentially clinical results such as |Ξi> = [1,0,0,...] for hemoglobin, with [0,0,1,...] for BUN (another clinical laboratory measurement). This also keeps track of which things are where. This may be a complication, in that usually one wants an index that identifies a particular condition (i.e., |Hemoglobin:=high> as a basis), with a patient's condition being some linear combination of these basis states. Then Ξi, or f(Ξi), would be a projection operator that would pick off those linear combinations of items of interest for any
particular patient. Incidentally, applications of the form |u1>|u2> = c1|u1> + c2|u2>, for some c1, c2, need thinking about. Since |u1>|u2> usually signifies stacked vectors, and |u1>, |u2>, and |u1>|u2> generally have different dimensionalities, it is not really natural to write the equation that way. For a discussion of the above in the context of quantum mechanics, see ref 52.
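The Note's conventions can be illustrated with a few lines of numpy (an illustration only; the basis assignments and weights are invented): kets as column vectors, the inner product as a scalar, the outer product as a projection operator, and the Kronecker product as one concrete reading of the "stacked" product |A>|B> for combining variable types.

```python
import numpy as np

ket_low    = np.array([1.0, 0.0, 0.0])   # |hemoglobin:=low>
ket_normal = np.array([0.0, 1.0, 0.0])   # |hemoglobin:=normal>
ket_high   = np.array([0.0, 0.0, 1.0])   # |hemoglobin:=high>

# A patient's state as a linear combination of basis states (weights illustrative):
patient = 0.2 * ket_low + 0.8 * ket_normal

inner = ket_normal @ patient              # <hemoglobin:=normal|patient> = 0.8, a scalar
outer = np.outer(ket_normal, ket_normal)  # |normal><normal|, a projection matrix
projected = outer @ patient               # picks off the "normal" component

# The Kronecker (tensor) product puts two variable types together, as in |A>|B>:
bun_high = np.array([0.0, 0.0, 1.0])      # |BUN:=high> (another lab measurement)
joint = np.kron(patient, bun_high)        # a 9-component joint state vector
```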

References
(1) Lander, E. S. Foreword to Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd ed.; Baxevanis, A. D., Ouellette, B. F. F., Eds.; John Wiley & Sons: New York, 2001.
(2) Shortliffe, E. H.; Perreault, L. E. Medical Informatics: Computer Applications in Healthcare and Biomedicine; Springer: Berlin, Heidelberg, 2000.
(3) Robson, B. Clinical and Pharmacogenomic Data Mining: Generalized Theory of Expected Information and Application to the Development of Tools. J. Proteome Res. 2003, 2, 283-302.
(4) Robson, B.; Mushlin, R. Clinical and Pharmacogenomic Data Mining: 2. A Simple Method for the Combination of Information from Associations and Multivariances to Facilitate Analysis, Decision, and Design in Clinical Research and Practice. J. Proteome Res., in press; web version available 2004.
(5) Robson, B. The Dragon on the Gold: Myths and Realities for Data Mining in Biotechnology Using Digital and Molecular Libraries. J. Proteome Res. 2004, 3 (6), 1113-1119.
(6) Pain, R. H.; Robson, B. Analysis of the Code Relating Sequence to Conformation in Globular Proteins. Nature 1970, 227, 62-63.
(7) Robson, B.; Pain, R. H. Analysis of the Code Relating Sequence to Conformation in Globular Proteins: Possible Implications for the Mechanism of Formation of Helical Regions. J. Mol. Biol. 1971, 58, 237-256.
(8) Robson, B.; Pain, R. H. The Role of Side Chains in Determining the Conformation of Globular Proteins. Proc. 1st Biophys. Congress 1971, 1, 33-37.
(9) Robson, B.; Pain, R. H. Directional Information Transfer in Protein Helices. Nat. New Biol. 1972, 238, 107-108.
(10) Robson, B.; Pain, R. H. Studies on the Role of the α-Helix in Nucleation. In Conformation of Biological Molecules and Polymers; Pullman, B., Ed.; Academic Press: New York, 1972; pp 161-172.
(11) Robson, B. Analysis of the Code Relating Sequence to Conformation in Globular Proteins: Theory and Application of Expected Information. Biochem. J. 1974, 141, 853-867.
(12) Robson, B.; Pain, R. H. Analysis of the Code Relating Sequence to Conformation in Globular Proteins: Development of a Stereochemical Alphabet on the Basis of Intraresidue Information. Biochem. J. 1974, 141, 869-882.
(13) Robson, B.; Pain, R. H. Analysis of the Code Relating Sequence to Conformation in Globular Proteins: The Role of the Residue in Determining the Conformation of Its Neighbours. Biochem. J. 1974, 141, 883-897.
(14) Robson, B.; Pain, R. H. Analysis of the Code Relating Sequence to Conformation in Globular Proteins: The Distribution of Residue Pairs in Turns and Kinks in the Protein Chain. Biochem. J. 1974, 141, 899-905.
(15) Schulz, G. E.; Barry, C. D.; Friedman, J.; Chou, P. Y.; Fasman, G. D.; Finkelstein, A. W.; Lim, V. I.; Ptitsyn, O. B.; Kabat, E. A.; Wu, T.; Levitt, M.; Robson, B.; Nagano, K. Comparison of Predicted and Experimentally Determined Secondary Structure of Adenylate Kinase. Nature 1974, 250, 140-142.
(16) Robson, B.; Suzuki, E. Conformational Properties of Amino Acid Residues in Globular Proteins. J. Mol. Biol. 1976, 107, 327-356.
(17) Suzuki, E.; Robson, B. Relationship between Helix-Coil Transition Parameters for Synthetic Polypeptides and Helix Conformation Parameters for Globular Proteins: A Simple Model. J. Mol. Biol. 1976, 107, 357-367.
(18) Garnier, J.; Osguthorpe, D. J.; Robson, B. Analysis of the Accuracy and Implications of Simple Methods for Predicting the Secondary Structure of Globular Proteins. J. Mol. Biol. 1978, 120, 97-120.
(19) Crampin, J.; Nicholson, B. H.; Robson, B. Protein Folding and Heterogeneity inside Globular Proteins. Nature 1978, 272, 558-560.
(20) Gibrat, J. F.; Garnier, J.; Robson, B. Further Development of Protein Secondary Structure Prediction Using Information Theory. J. Mol. Biol. 1988, 198, 425-443.
(21) Biou, V.; Garnier, J.; Robson, B. Secondary Structure Prediction: Combination of Different Methods. Protein Eng. 1988, 2, 185-192.
(22) Garnier, J.; Robson, B. The GOR Method for Predicting Secondary Structure in Proteins. In Prediction of Protein Structure and the Principles of Protein Conformation; Fasman, G. D., Ed.; Plenum Publishing Corp.: New York, 1989; pp 417-465.
(23) Gibrat, J. F.; Garnier, J.; Robson, B. Influence of the Local Amino Acid Sequence upon the Zones of the Torsional Angles φ and ψ Adopted by Residues in Proteins. Biochemistry 1991, 30, 1578-1586.
(24) Robson, B.; Garnier, J. Protein Structure Prediction. Nature 1993, 361, 506.
(25) Garnier, J.; Gibrat, J. F.; Robson, B. GOR Method for Predicting Protein Secondary Structure from Amino Acid Sequence. In Methods in Enzymology (Computer Methods for Macromolecular Sequence Analysis); Doolittle, R. F., Ed.; 1996; Vol. 266.
(26) Edwards, H. M. Riemann's Zeta Function; Dover Publications: New York, 1974.
(27) Fisher, R. A. Statistical Methods for Research Workers; Oliver and Boyd: Edinburgh, 1941.
(28) Aitken, A. C. Statistical Mathematics; Oliver and Boyd: Edinburgh, 1945.
(29) Wilks, S. S. Mathematical Statistics; Wiley: New York, 1962; Section 7.7.
(30) Whittle, P. Probability; Library of University Mathematics/Penguin Books: London, 1970.
(31) Kullback, S. Information Theory and Statistics; Wiley: New York, 1959.
(32) Savage, L. J. Recent Advances in Information and Decision Processes; Macmillan: New York, 1962; pp 164-194.
(33) Goode, H. Recent Developments in Information and Decision Processes; Machol & Grey, Eds.; Macmillan: New York, 1962; pp 74-76.
(34) Fano, R. Transmission of Information; Wiley: New York, 1961.
(35) Cooley, W. W.; Lohnes, P. R. Multivariate Data Analysis; John Wiley & Sons: New York, 1971.
(36) Dunteman, G. H. Introduction to Multivariate Analysis; Sage Publications: Thousand Oaks, CA, 1984; Chapter 5 covers classification procedures and discriminant analysis.
(37) Morrison, D. F. Multivariate Statistical Methods; McGraw-Hill: New York, 1967.
(38) Overall, J. E.; Klett, C. J. Applied Multivariate Analysis; McGraw-Hill: New York, 1972.
(39) Tabachnick, B. G.; Fidell, L. S. Using Multivariate Statistics; Harper Collins College Publishers: New York, 1996.
(40) Donoho, D. L. High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality; Lecture to the Am. Math. Soc. "Math Challenges of the 21st Century", August 8, 2000; www-stat.stanford.edu/~donoho/.
(41) Bellman, R. Adaptive Control Processes: A Guided Tour; Princeton University Press: Princeton, NJ, 1961.
(42) French, S.; Robson, B. What is a Conservative Substitution? J. Mol. Evol. 1983, 19, 171-175.
(43) Robson, B.; Garnier, J. Introduction to Proteins and Protein Engineering; Elsevier: Amsterdam, 1986; 699 pp.
(44) Steinbuch, K. Die Lernmatrix. Kybernetik 1961, 1, 36-45.
(45) Hertz, J.; Krogh, A.; Palmer, R. G. Introduction to the Theory of Neural Computation; Addison-Wesley: Redwood City, CA, 1991.
(46) Tattersall, J. J. Elementary Number Theory in Nine Chapters; Cambridge University Press: New York, 1999.
(47) Schroeder, M. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise; W. H. Freeman and Co.: New York, 1990.
(48) Kuznetsov, V. A. Distribution Associated with Stochastic Processes of Gene Expression in a Single Eukaryotic Cell. EURASIP J. Appl. Sign. Proc. 2001, 4, 285-296.
(49) Kirk, R. E. Experimental Design: Procedures for the Behavioral Sciences; Brooks/Cole Publishing Company/International Thomson Publishing: London, New York, Washington, 1962, 1985, 1995.
(50) Hafner, J.; McCurley, K.; Sarnak, P. Relatively Prime Values of Polynomials. In Contemporary Mathematics; 1993; Vol. 143, pp 437-443.
(51) Dirac, P. A. M. The Principles of Quantum Mechanics; Oxford University Press: New York, 1930.
(52) Feynman, R. P.; Morinigo, F. B.; Wagner, W. G.; Hatfield, B. Feynman Lectures on Gravitation, 1st ed.; Addison Wesley Longman: Reading, MA, 1995.
