Mathematics in data analysis: An introduction - Journal of Chemical

Mathematics in data analysis: An introduction. Taitzer Wang. J. Chem. Educ. , 1982, 59 (7), p 592. DOI: 10.1021/ed059p592. Publication Date: July 1982...
Mathematics in Data Analysis: An Introduction

Taitzer Wang
Department of Pharmacology and Cell Biophysics, University of Cincinnati College of Medicine, Cincinnati, OH 45267

This article is intended for students who are interested in mathematics and its applications in experimental sciences. The meanings of simple mathematical equations are described in a perspective that, for some reason, many beginning students do not seem to be able to obtain from reading textbooks on the subject.

Let us first take a careful look at the simple algebraic equation y = 3x + 2 and try to understand what it means. A beginning student in algebra knows that for each numerical value of x in this equation there is a corresponding value for y. For example, x = 1, y = 5; x = 2, y = 8; etc. The letter y in this equation is called a dependent variable and x an independent variable. The meanings of these two terms are self-explanatory, because the variation in the value of y depends upon the variation of x. The numeral 2, which stands by itself, is called a constant, and the multiplier of x, the numeral 3 in this case, is called a coefficient.

Expressed in a graphical form (Fig. 1), the equation gives alternate definitions for the constant and the coefficient. The constant is the intercept on the y axis when x = 0 (i.e., y = 2, x = 0), and the coefficient is the slope of the linear plot of y versus x (because of the linearity, y = 3x + 2 is also referred to as a linear algebraic equation). The slope measures the change in the value of y with respect to the change in x. Numerically, for a change in x from x1 to x2 and the corresponding change in y from y1 to y2, the slope is (y2 - y1)/(x2 - x1) = 3, which is the coefficient of x in the equation.

What is an algebraic equation like y = 3x + 2 good for? For each value of x there is a value of y defined by the equation. So what?
Indeed, the use of an equation of this form (including known numerical values of the coefficient and the constant) is limited, because it is basically for an introduction to beginning algebra, from which students learn the general concept of algebraic operations. In its application, the equation can at best be used to describe physical phenomena of known quantitative relationship. For example, if a water tank is filled 2 ft high with water and more water is added at a constant rate of 3 ft high per hour, one can use the empirical equation y = 3x + 2 to describe and calculate the height of the water level (y) at a certain time (x). At the beginning (when x = 0), the height of the water level is 2 ft (y = 2), the water will reach the 11-ft mark (y = 11) in three hours (x = 3), and if the water tank is 74 ft tall (y = 74), it will take one full day (x = 24) to fill it up. The calculation may go on and on. Equipped with the equation, one may enjoy making a precise prediction of the height of the water level y at any time x and estimating the precise amount of time needed to fill the tank to a certain level.
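The "working forward" calculation in the water tank example can be sketched in a few lines of Python. This is only an illustration: the function and constant names (`level_at`, `hours_to_reach`, `INITIAL_LEVEL_FT`, `FILL_RATE_FT_PER_HOUR`) are not from the article and are chosen here for readability.

```python
# Forward use of y = 3x + 2: both numbers are known in advance.
INITIAL_LEVEL_FT = 2.0       # constant: water level at x = 0
FILL_RATE_FT_PER_HOUR = 3.0  # coefficient: rise in level per hour


def level_at(hours):
    """Height of the water level (ft) after the given number of hours."""
    return FILL_RATE_FT_PER_HOUR * hours + INITIAL_LEVEL_FT


def hours_to_reach(level_ft):
    """Time (hours) needed for the water to reach the given level (ft)."""
    return (level_ft - INITIAL_LEVEL_FT) / FILL_RATE_FT_PER_HOUR


print(level_at(0))         # 2.0
print(level_at(3))         # 11.0
print(hours_to_reach(74))  # 24.0
```

The same equation is used in both directions: solving for y gives the level at a chosen time, and solving for x gives the time needed to reach a chosen level.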


Journal of Chemical Education

Figure 1. The plots of y versus x give alternate definitions for the constant and the coefficient of the mathematical equation: the constant is the intercept on the y-axis and the coefficient is the slope.

Figure 2. The two (assumed) quantities, the initial water level and the flow rate, are unknown and y must be measured (rather than calculated) at time x. There may be two different equations that describe the same set of experimental data equally well.

Figure 3. From the linear plots of 1/y versus 1/x, we know that the dependent variable 1/y is related to the independent variable 1/x (Fig. 2, broken curve) in the same way as y to x in y = ax + b (Fig. 2, solid line).

But no physical measurements can be precise. At this point, we should thoroughly understand the algebraic equation in the above example and later be able to distinguish it effortlessly from another water tank example that we shall discuss next: Imagine that we are asked to measure the water levels at various times and develop a mathematical equation to describe the rising water level with respect to time. Presented with this type of mathematical problem, some of us may immediately decide (due to the lack of enthusiasm, of course) that mathematics is not for us. We would know how to measure the water level at a certain time but would not have the slightest idea as to how a mathematical equation can be developed to describe the measurements. Why not? Mainly, if one is familiar only with the "working forward" solution path as exemplified by the first example, he/she usually does not know how to employ the strategy of "working backward" that the second example requires. Because of an inadequate understanding of simple mathematical applications, one may fail to recognize that an empirical equation such as y = 3x + 2 in the first example is the kind of mathematical equation that we are seeking. The difference is that, in the second example, one has to "work backward" by measuring the heights of the water level y at various times x and use the whole set of y versus x data to determine a mathematical equation that can best describe the data.

For easy description of the approach, let us use a set of data that would fit a mathematical equation analogous to y = 3x + 2. It is important to reiterate here that in the first example the initial water level and the flow rate of water are known precisely and y can be calculated if x is known. In the second example, however, no such information is available; the two (assumed) quantities, the initial water level and the flow rate, are unknown and y must be measured (rather than calculated) at time x. If one starts his/her measurement one hour after the filling of the water tank has started and thereafter measures the water level at 30-min intervals, the measurements (y) at various times (x) may turn out to be as follows:

x (hour)   1.0   1.5   2.0   2.5   3.0   3.5    4.0
y (ft)     3.9   5.6   7.0   8.6   10    11.4   12.9
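As a numerical counterpart to drawing a line through the points by eye, a least-squares straight line can be computed from the measurements tabulated above. This is a sketch under an assumption: least squares is swapped in here for the graphical estimate the article uses, and the names `fit_line`, `hours`, and `levels` are illustrative only.

```python
# Measurements from the table above.
hours = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]      # x, time in hours
levels = [3.9, 5.6, 7.0, 8.6, 10.0, 11.4, 12.9]  # y, water level in ft


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(hours, levels)
print(f"slope = {slope:.2f} ft/hour, intercept = {intercept:.2f} ft")
```

The computed slope and intercept come out near 3 ft/hour and 1 ft, although a fitted line need not coincide exactly with one judged by eye.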

Plots of these data on graph paper (Fig. 2) allow one to draw a straight line through the points, and the line, judged to be a good fit to all the points, may have an intercept (the value of y at x = 0) of 1.0 ft and a slope of 3.0 ft/hour. Accordingly, the mathematical equation that can be used to describe the rising water level may be y = 3x + 1, giving the initial water level 1 ft high, the flow rate 3 ft per hour, and the water level being about 16 ft high (though not actually measured; Fig. 2, open circle) in five hours, etc.

Now, we have learned a good "working backward" example in mathematical applications. That is, one can derive a mathematical equation from a series of physical measurements and from there estimate the numerical values of the coefficient and constant included in the equation. In this type of data analysis, the coefficient a and the constant b in the linear equation y = ax + b are usually referred to as parameters. Each parameter should have its numerical value and physical meaning. In the water pump example, the estimated numerical values for the parameters a and b are 3 and 1, respectively, and the physical meaning of a is the flow rate of water (3 ft per hour) and b is the initial water level (1 ft high at time 0 of measurement). Understandably, in an equation like y = ax + b, the physical meanings of a and b are by no means apparent, but we do have to know them. Mathematical equations with parameters of unknown physical meaning are of little importance.

However, one should understand that the physical meanings of parameters are defined strictly by the mathematical equation considered best fit to the data. There may be more than one equation that can describe equally well the same set of data. Use the water pump example again. The set of data which we have just described as a "linear" event (Fig. 2, solid line) with a constant flow rate may in fact constitute a portion of a smooth curve (Fig. 2, broken line) with a gradually decreasing flow rate. Clearly, the linear equation y = ax + b does not fit the curve; the equation predicts that the initial water level is b ft high, whereas the curve shows that there is no water in the tank at the beginning of measurement. We may use another equation, y = ax/(b + x), to describe the curve, in which the physical meanings of parameters a and b are different: a is "the maximal possible water level at infinite time (or at least at time x whose numerical value is much greater than b; i.e., x >> b)" and b is "the amount of time required to fill the water tank to the a/2-ft mark." The procedure for arriving at the equation y = ax/(b + x) is more difficult, but is similar to the one we have just used for equation y = ax + b. For y = ax/(b + x), we plot the reciprocal values of y (i.e., 1/y) versus the reciprocal values of x (1/x) tabulated below:

1/x (hour^-1)   1      0.67   0.50   0.40   0.33   0.29    0.25
1/y (ft^-1)     0.26   0.18   0.14   0.12   0.10   0.088   0.078
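The reciprocal values tabulated above follow directly from the original measurements. A minimal sketch of the transformation, with illustrative variable names (`inv_hours`, `inv_levels`) that are not from the article:

```python
# Original measurements: time in hours and water level in ft.
hours = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
levels = [3.9, 5.6, 7.0, 8.6, 10.0, 11.4, 12.9]

# Double-reciprocal transformation used for the 1/y versus 1/x plot.
inv_hours = [1.0 / x for x in hours]
inv_levels = [1.0 / y for y in levels]

for u, v in zip(inv_hours, inv_levels):
    print(f"1/x = {u:.2f} hour^-1   1/y = {v:.3f} ft^-1")
```

Because small values of y become large values of 1/y, the earliest measurements carry the most visual weight in such a plot.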

From the linear plots of 1/y versus 1/x (Fig. 3) we know that the dependent variable 1/y is related to the independent variable 1/x in the same way as y to x in y = ax + b. Therefore, an appropriate equation must be 1/y = c(1/x) + d with a slope (the coefficient of 1/x) c = 0.25 hour/ft and a constant d = 0.016 ft^-1. Rearranging the equation to show how y is related to x (in contrast to how 1/y is related to 1/x), we obtain y = (1/d)x/(c/d + x) or, replacing the two parameters 1/d and c/d with a and b, y = ax/(b + x). (Since 1/y = (c + dx)/x, we have y = x/(c + dx); dividing the numerator and the denominator by d gives the form shown.) Estimated a (or 1/d) and b (or c/d) are 62.5 ft and 15.6 hours, respectively.

To understand the physical meaning of a in the equation y = ax/(b + x), one can imagine the water level at a time long after the filling of the tank has started. When time x = ∞ or at least x >> b, y = ax/x = a. Therefore, the equation defines a as the maximal possible water level, which is estimated from a limited number of measurements to be 62.5 ft high. It takes a long time for the water to reach the 62.5-ft mark! Similarly, to understand the physical meaning of b, we set x = b, which leads to y = ab/2b = a/2. Thus, b is the amount of time needed for the water to reach half the maximal height, a/2. The estimated b value is 15.6 hours; it takes 15.6 hours for the water to reach the 31.3-ft (i.e., 62.5/2) mark.

The following table summarizes different information that we can obtain from the same set of measurements.

Information                                      y = ax + b   y = ax/(b + x)
Water level at time x                            y            y
Flow rate                                        a            ab/(b + x)^2
Predicted water level at time 0                  b            0
Predicted maximal water level                    -            a
Time needed to fill half of the maximal height   unknown      b
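The entries for y = ax/(b + x) in the summary above can be checked numerically. The sketch below starts from the slope c and constant d quoted in the text; the helper names `level` and `flow_rate` are illustrative, and the flow rate is taken to be the derivative of y with respect to x.

```python
# Slope and constant read from the 1/y versus 1/x plot in the text.
c = 0.25   # hour/ft
d = 0.016  # ft^-1

# Rearranging 1/y = c(1/x) + d gives y = a x / (b + x).
a = 1.0 / d  # maximal possible water level, ft
b = c / d    # time to reach half of that level, hours


def level(x):
    """Water level (ft) at time x (hours) from y = a x / (b + x)."""
    return a * x / (b + x)


def flow_rate(x):
    """Rate of rise (ft/hour) at time x: the derivative a b / (b + x)^2."""
    return a * b / (b + x) ** 2


print(f"a = {a:.1f} ft, b = {b:.1f} hours")
print(f"level at x = b: {level(b):.2f} ft (half of a)")
print(f"flow rate at x = 0: {flow_rate(0):.1f} ft/hour")
print(f"flow rate at x = 4: {flow_rate(4):.1f} ft/hour")
```

Unlike the constant flow rate a of the linear equation, the flow rate here decreases as x grows, which is what the broken curve in Figure 2 expresses.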

Strictly speaking, the calculated values of water level y are accurate only within the range of actual measurement. Beyond the range there is no evidence that shows the water level approaches b ft at the beginning and ∞ ft at infinite time as predicted by the equation y = ax + b, or 0 ft and a ft, respectively, as predicted by y = ax/(b + x). Let us now compare the water levels measured at various times with the water levels calculated from the two mathematical equations in the following table:

x (hour)                           1.0   1.5   2.0   2.5   3.0    3.5    4.0
y measured (ft)                    3.9   5.6   7.0   8.6   10     11.4   12.9
y from y = 3x + 1 (ft)             4.0   5.5   7.0   8.5   10.0   11.5   13.0
y from y = 62.5x/(15.6 + x) (ft)   3.8   5.5   7.1   8.6   10.1   11.5   12.8

As far as the same seven measurements of the water level are concerned, the data do not permit a choice of one equation over the other. How do the difficulties arise? It is obvious that we have not measured the water level for a wide enough range of time. In other words, the difficulty in distinguishing the mathematical equations y = 3x + 1 and y = 62.5x/(15.6 + x) results from not having an adequate amount of data. If we assume that the rising of water level in the tank can be described mathematically by one of the two equations, then it is obvious that in order to be able to determine the right equation one must measure the water level beyond the time range, earlier than one hour after the filling of water and/or later than 4 hours. The equation y = 3x + 1 dictates that the water level is 1.6 ft at 12 min (x = 0.2) and 16 ft at 5 hours (x = 5). The equation y = 62.5x/(15.6 + x), on the other hand, dictates that the water level should be 0.79 ft at 12 min (x = 0.2) and 15.2 ft at 5 hours (x = 5). A deviation of 0.1 ft in the measured values from the calculated as shown in the table permits a significant difference between the two sets of measurements: 1.6 ± 0.1 and 16 ± 0.1 ft versus 0.79 ± 0.1 and 15.2 ± 0.1 ft. The former are indicated by two open circles and the latter by two open triangles in Figure 2. The solid line and broken curve are calculated from y = 3x + 1 and y = 62.5x/(15.6 + x), respectively.

We have illustrated several important aspects of mathematical applications of empirical equations, which are derived from physical measurements or experimental observations. These aspects of mathematical applications are the same as those of theoretical equations derived from some theory or hypothesis proposed for a certain physical phenomenon. The kinetic equations that chemists and biochemists fit to their experimental data are often derived from well-defined chemical equations. Pertinent to the foregoing discussion is the Michaelis-Menten equation in enzymology, v = Vmax[S]/(Km + [S]), which describes the mathematical relationship between the reaction velocity, v (under steady-state conditions), and the initial substrate concentration, [S]. The equation is derived from the chemical equation E + S ⇌ ES → E + P, which describes how most enzymes react with their substrates to give products.

Mathematically, the theoretical Michaelis-Menten equation is identical with the empirical equation y = ax/(b + x) presented earlier. In the same way that a and b are related to y and x, Vmax is the maximal possible reaction velocity at very high substrate concentration, [S] = ∞ or [S] >> Km, and Km is numerically equal to the substrate concentration at which the velocity is half maximal, v = Vmax/2. Furthermore, Vmax and Km have additional physical meanings defined by the chemical equation. Vmax is the reaction velocity when all the enzyme is in the complex form (ES) (the equilibrium E + S ⇌ ES shifts to the right when S is in large excess). Km is a measure of enzyme affinity for the substrate, which, expressed as a dissociation constant, Km = [E][S]/[ES], has the dimensions of the concentration of S (note that when the velocity is half maximal, half of the enzyme is in the complex form, ES, and the other half remains free, E, giving Km = [S]. This is in accord with the mathematical relationship: when Km = [S], v = Vmax/2). Therefore, the affinity of an enzyme for its substrate can be studied mathematically by measuring the velocities of the enzymatic reaction at various initial substrate concentrations. Such study may provide additional information about the maximal rate of the enzymatic reaction.

Thus, when we deal with mathematical equations, we have to be able to recognize the dependent variable, which is the quantity we intend to measure in our experiment; the independent variable, which is the quantity we can vary; and the parameters. In this paper we discuss only simple algebraic equations and a method for estimation of parameters. For complicated forms of mathematical equations, such as exponential and differential equations, and when using statistical methods (often a computer is needed) in computing a large number of data, our purpose is the same: to estimate the numerical values of the parameters that have well-defined physical meanings.
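The point about measuring beyond the original time range can be illustrated by evaluating both fitted equations inside and outside the 1 to 4 hour window. This is a sketch only; the function names `linear_level` and `hyperbolic_level` and the `measured` dictionary are illustrative and use the parameter values quoted in the text.

```python
def linear_level(x):
    """Water level (ft) from y = 3x + 1."""
    return 3.0 * x + 1.0


def hyperbolic_level(x):
    """Water level (ft) from y = 62.5x / (15.6 + x)."""
    return 62.5 * x / (15.6 + x)


# Measured water levels (ft) keyed by time (hours).
measured = {1.0: 3.9, 1.5: 5.6, 2.0: 7.0, 2.5: 8.6, 3.0: 10.0, 3.5: 11.4, 4.0: 12.9}

# Inside the measured range both equations stay close to the data.
for x, y in measured.items():
    print(f"x = {x:.1f}  measured {y:5.1f}  linear {linear_level(x):5.1f}  "
          f"hyperbolic {hyperbolic_level(x):5.1f}")

# Outside the range the two predictions separate by much more than 0.1 ft.
for x in (0.2, 5.0):
    gap = linear_level(x) - hyperbolic_level(x)
    print(f"x = {x:.1f}  linear {linear_level(x):.2f}  "
          f"hyperbolic {hyperbolic_level(x):.2f}  gap {gap:.2f}")
```

A measurement at 12 min or at 5 hours with an uncertainty of about 0.1 ft would therefore be enough to favor one equation over the other.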

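The correspondence between the Michaelis-Menten equation and y = ax/(b + x) can be made concrete with a short sketch. The parameter values below are hypothetical and chosen only for illustration; they are not data from the article, and the function name `michaelis_menten` is likewise an assumption.

```python
def michaelis_menten(s, v_max, k_m):
    """Reaction velocity v = Vmax [S] / (Km + [S])."""
    return v_max * s / (k_m + s)


# Hypothetical parameter values, chosen only for illustration.
V_MAX = 10.0  # maximal velocity, arbitrary units
K_M = 2.0     # substrate concentration giving half-maximal velocity

print(michaelis_menten(K_M, V_MAX, K_M))         # 5.0, i.e. V_MAX / 2
print(michaelis_menten(1000 * K_M, V_MAX, K_M))  # close to V_MAX
```

Here Vmax plays the role of a and Km the role of b, so the two checks mirror the half-height and maximal-level arguments made for the water tank.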