Box-and-whisker plots - Journal of Chemical Education (ACS

Apr 1, 1985 - Construction of box-and-whisker plots and their application to bond ... Abstract: Box-and-whisker plots or simply boxplots are powerful ...
Russell D. Larsen
Texas Tech University, Lubbock, TX 79409

Box-and-whisker plots were introduced by the eminent statistician Tukey for the purpose of giving rapid visualization of batches of data (1). These displays have been found to be useful in numerous areas. The author has found that such displays can be effectively used for the presentation of the diverse collections of data which are the subject matter of the traditional first-year chemistry course, viz., melting points of elements, heats of vaporization, ionization energies, covalent radii, bond energies, heats of solution, etc.

Box plots fall into the realm of "exploratory data analysis," the objective of which is to obtain a feeling for how a data set as a whole behaves. Exploratory data analysis is numerical detective work which, it is hoped, uncovers the possibility of quantitative interrelationships among the data; analysis of such relationships is the subject matter of confirmatory data analysis (2). Box plot displays force the recognition of interconnections between members of a batch; often such relationships are unexpected and surprising. The author believes that presentation of data in this form provides considerable motivation for the "explanations" that we traditionally give for chemical and physical trends. Such explanations are, of course, partly based on high-information models or theories, whereas a box plot is a low-information representation of the data. However, the box plot accurately reflects the actual data, something which an "explanation" or model may not completely do. Because high and low values stand out in these plots they demand interpretation. Moreover, a better feeling for "typical" values of physical and chemical properties (and their orders of magnitude) is obtained with these displays.


Construction of a Box-and-Whisker Plot

In order to construct basic box-and-whisker plots and their variations from a batch or table of data it is necessary to rank the raw data. For an initial example we select a set of raw, unranked data: a collection of 78 bond energies for heteronuclear diatomic molecules and ions. These data are relatively uninformative and uninteresting in their alphabetically tabulated form, taking up a full two pages in a popular textbook (3). Moreover, no explanation or interpretation of these data is offered, except that they arise from a molecular orbital treatment of a general AB-type molecule. These data can be quickly ranked by means of a "stem-and-leaf display." Rather than rigidly ranking a batch, a process that is time consuming, a stem-and-leaf ordering sorts the data by the leading digits (stem), the trailing digits (leaves) being ignored except for the first. Table 1 is a stem-and-leaf display of the bond energies in ref. (3).

Table 1 needs a bit of explanation. The first entry in the table of ref. (3) is the value 481 kJ mol⁻¹. This number is treated as follows:

    4 | 8 | 1
    4: leading digit (stem), used to sort
    8: first trailing digit (leaf), displayed
    1: remaining trailing digit, ignored

Each line of Table 1 is a stem, each piece of information is a leaf. The display in Table 1 is a "stretched" stem-and-leaf which uses two lines (stems), one for leaves with trailing digits 0-4 (denoted by a * after the stem), and one for leaves with digits 5-9 (denoted by a . after the stem). A stretched stem-and-leaf obviates the necessity to squeeze together onto one line too many entries. Thus, the number 481 appears as the first entry of the 7 numbers, 4. | 8757886 (these numbers are 480, 470, 450, 470, 480, 480, 460, all having been rounded to 10's of kJ). The stem-and-leaf quickly ranks the data points by the leading digit. Only the leading digits are sorted. The first two entries in Table 1 are 0. | 66, both corresponding to rounded values of 60 kJ mol⁻¹. It is customary to identify the highest and lowest numbers that are so ranked. One of the 60 kJ mol⁻¹ values corresponds to NaK; the highest value, 1070 kJ mol⁻¹, corresponds to CO. A count for each stem helps in later identification of the median and hinges. Construction of the stem-and-leaf display in this manner usually takes only a few minutes for a medium-sized table of data.

The whole purpose of a stem-and-leaf display is to reorganize the data so that various measures of relative rankings may be obtained. The stem-and-leaf is an intermediate step in the construction of a box plot. From the stem-and-leaf one can very quickly, by counting, find key statistical measures: the median (M); the upper and lower hinges (H); the inner and outer fences (f and F); and, if necessary or useful, further higher-order quantile-like subdivisions of the data such as the eighths (E) and sixteenths (D).

Table 1. Stem-and-Leaf Display of Bond Energies (kJ mol⁻¹)
[Table body not recoverable from the scan. Legible entries: the lowest line, 0. | 66, with one value labeled NaK, and the highest leaf, 7, labeled CO. See text for explanation.]
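The stem-and-leaf procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the author's procedure, which is carried out by hand. The function name `stretched_stem_and_leaf` is hypothetical; values are assumed to be non-negative and in kJ mol⁻¹, the hundreds are taken as the stem, the tens digit as the leaf, and the units digit is simply dropped. The sample values are the entries quoted in the text, not the full table of ref. (3).

```python
from collections import defaultdict


def stretched_stem_and_leaf(values):
    """Return the lines of a stretched stem-and-leaf display.

    Assumption: non-negative values. The units digit is ignored, the
    hundreds form the stem, and the tens digit is the leaf. Leaves 0-4
    go on the '*' line of a stem and leaves 5-9 on the '.' line.
    Leaves stay in the order the values are read; only stems are sorted.
    """
    rows = defaultdict(list)
    for v in values:
        tens = int(v) // 10             # 481 -> 48 (units digit dropped)
        stem, leaf = divmod(tens, 10)   # 48 -> stem 4, leaf 8
        half = "*" if leaf <= 4 else "."
        rows[(stem, half)].append(str(leaf))

    lines = []
    # Sort by stem; within a stem the '*' line comes before the '.' line.
    for stem, half in sorted(rows, key=lambda k: (k[0], k[1] == ".")):
        leaves = "".join(rows[(stem, half)])
        # The trailing count per line helps when locating median and hinges.
        lines.append(f"{stem:>2}{half} | {leaves}  ({len(leaves)})")
    return lines


# Values quoted in the text: seven entries near 450-481, two at 60, and CO at 1070.
sample = [481, 470, 450, 470, 480, 480, 460, 60, 60, 1070]
for line in stretched_stem_and_leaf(sample):
    print(line)
```

For the sample above, the line for stem 4 reads `4. | 8757886`, which matches the entry discussed in the text. Keeping the leaves unsorted within a line mirrors the hand procedure, in which only the leading digits are sorted.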

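Table 2. Letter-Value Display for 78 Bond Energies of Table 1 (in 10's of kJ mol⁻¹)ᵃ
[Table body not recoverable from the scan. Legible entries: M 39h (depth of the median), the value 42, and a marker for the number of outside values (3).]
ᵃ See text for explanation.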

The depth, d, or location of the median, given N ordered data values, is d(M) = (N + 1)/2. Similarly, the depth of the hinges is d(H) = ([d(M)] + 1)/2, where [d(M)] is d(M) with any fractional part dropped. For the 78 bond energies, d(M) = 39.5 and d(H) = 20. These rank measures are next summarized in a "letter-value display." A letter-value display is a compact table that summarizes the values of the median (M), hinges (H), eighths (E), etc. Since these rank measures are denoted by a sequence of letters, the display is called a "letter-value display." Table 2 is a letter-value display of the bond energy data found in Table 1. In Table 2 the use of "h" replaces ".5." This avoids the distraction of the decimal, which recurs when rank measures are tabulated. For example, 12.0, 12.5, 13.0, 13.5, ... are written as 12, 12h, 13, 13h, ....

The values at the hinges indicate the location of the middle of each half of the ranked batch and are located about one-quarter of the way in from each end of the batch. Hinges are similar to quartiles but differ because they are found from the depth of the median. Usually the hinges are somewhat closer to the median than the quartiles (4, 5).

Finally, a box plot is constructed from the information in the letter-value display. A box envelops all of the data between the upper and lower hinges; a horizontal line through the box denotes the median. The box thus delimits the middle half of the data. In one representation of a box plot, whiskers (thin vertical lines) are drawn from the box to the extreme values. The extreme values are prominently labeled. Figure 1 is a schematic box plot for the 78 bond energies of the heteronuclear diatomic molecules and ions that are displayed in ref. (3) and Tables 1 and 2.

Box-and-whisker plots differ from ordinary plots in several ways. The scale of a box plot shows only those values that are necessary to convey relative values of the data entries. In fact, the final box plot is a superposition onto tracing paper of values initially plotted on graph paper.
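The depth rule and the letter-value display can be illustrated with a short Python sketch. The function names `depth_label` and `letter_values` are hypothetical, and the batch used in the demonstration is simply the integers 1 to 78, chosen only because it has the same number of entries as the bond energy batch; it is not the data of Table 1. The sketch assumes that a depth ending in one-half calls for the mean of the two neighboring ordered values, and that each later depth is found from the previous one in the same way the hinge depth is found from the median depth.

```python
import math


def depth_label(depth):
    """Write a depth the way the text does: 39.5 -> '39h', 20.0 -> '20'."""
    whole = math.floor(depth)
    return f"{whole}h" if depth != whole else str(whole)


def letter_values(values, letters=("M", "H", "E", "D")):
    """Return rows of (letter, depth label, lower value, upper value).

    Depth of the median: d(M) = (N + 1)/2.
    Each later depth: (floor(previous depth) + 1)/2.
    """
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data value")

    def value_at(depth, from_top):
        # For a whole depth lo == hi; for a half depth, average two neighbors.
        lo, hi = math.floor(depth), math.ceil(depth)
        if from_top:
            return (data[n - lo] + data[n - hi]) / 2
        return (data[lo - 1] + data[hi - 1]) / 2

    rows = []
    depth = (n + 1) / 2
    for letter in letters:
        rows.append((letter, depth_label(depth),
                     value_at(depth, False), value_at(depth, True)))
        if depth <= 1:          # reached the extremes; nothing further in
            break
        depth = (math.floor(depth) + 1) / 2
    return rows


# Stand-in batch of 78 values (NOT the bond energies of Table 1).
for row in letter_values(range(1, 79)):
    print(*row)
```

With 78 values the median depth comes out as 39h and the hinge depth as 20, in agreement with the depths quoted in the text.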
The intention is to keep the vertical scale of the final box plot as uncluttered as possible. No more than three or four numbers along a scale are usually necessary. The objective is effective plotting: ruled lines and many scale values make the plotting easy, but the resulting plot is not effective. We want to see what the data show without being distracted. The vertical scale markings need not increase by values of 10 or 100; rather, extreme values can be indicated and, perhaps, one or two intermediate values might be shown. Thus, Figure 1 shows that the median bond energy is somewhat below 500 kJ mol⁻¹; the maximum value (for CO) is somewhat above 1000 kJ mol⁻¹; the lowest data entry (for NaK) is somewhat below 100 kJ mol⁻¹ (the interested reader can speculate as to why the value for NaK is so low; it clearly stands out).

The box plots shown in Figures 1, 2, and 4 are called "schematic" box plots. They have the following features. The whiskers are dashed vertical lines ending with dashed crossbars at the so-called adjacent values (the value at each end closest to, but still inside, the inner fence¹). Moreover, outside (the fence) values should appear separately and be identified. Usually values inside the fences are not identified, but we find it convenient to label a few benchmark substances for orientation.

Interpretation of Bond Energy Data

While the construction of a box-plot display of raw data follows a fairly standard procedure it is, after all, interpretation of the displayed data that is the ultimate objective of the viewer (although it is not the objective of this article). Consider again, however, the above bond energy data. The data in the original table looked uninteresting and even a box-plot display, at first, does not seem to disclose anything startling. The CO molecule clearly stands out as having a very high bond energy, but it also stands out in the original table of ref. (3). Nevertheless, we can glean insights from these data. Typical bond energies are seen to range around 300-500 kJ mol⁻¹ (one-half of the data are enveloped by the box). A median bond energy is typified by values for KCl and OH. Note that the median value is about 420 kJ mol⁻¹. Electronegativity differences do not seem to play a large role in the values of the bond energies except among congeneric species such as the alkali halides. Notice that the isoelectronic molecules BO, CO⁺, and CN, which have the same bond order, also have bond energies below that of CO, which has a higher bond order. The familiar effect of an antibonding π electron can be seen on comparing the positions of CO and NO. The box-plot data do, in fact, demand interpretation: a viewer is virtually forced to think of and ask questions about the relative values displayed.

Heats of Formation

A box plot was similarly constructed for the heats of formation of 84 substances found in ref. (6) and is shown in Figure 2. Again, although the author's purpose is to show what box plots look like and not to interpret them, it helps to point out some conclusions which can be drawn from the relative rankings. Al₂O₃ is visibly outside the upper fence of a schematic box plot and has the largest (most negative) value of ΔHf° among the batch of 84 substances. We would suggest to students that this high exothermicity reflects the high stability of the compound. Note also that the high molecular weight sulfates such as BaSO₄ and CaSO₄ apparently show high stability also. Water has a standard heat of formation very near the median of the 84 substances; despite the numerous unusual properties of H₂O, the heat of formation appears unexceptional. Note also that more substances have negative heats of formation than positive heats of formation and that the magnitude of negative values appears to extend to higher values than the magnitude of positive values. Although such is merely descriptive information, it gives the observer a feeling for a given value of a heat of formation. Note that the median value is around -300 kJ mol⁻¹.

Figure 1. Schematic box plot for bond energies of 78 diatomic molecules; units are kJ mol⁻¹. CO, BO, and CO⁺ are above the upper fence shown by the dashed horizontal line. These species have unusually large bond energies compared to the middle half of the data enveloped by the box. NaK has a low bond energy, although it is not inordinately low. Average bond energies are exemplified by KCl and OH; a "typical" value is around 400 kJ mol⁻¹.

Figure 2. Schematic box plot for standard heats of formation of 84 substances; units are kJ mol⁻¹. Al₂O₃(s), BaSO₄(s), and CaSO₄(s), in being above the upper fence, have unusually large negative heats of formation compared to the middle half of the data enveloped by the box. H₂O(l) has a "typical" heat of formation, around -300 kJ mol⁻¹. C₂H₂(g) is an example of a substance with a positive ΔHf°, although its value is not unusually large.

Volume 62 Number 4 April 1985 303

¹ Upper and lower fences are located above and below the upper and lower hinges, respectively, at a distance of 1.5 times the difference between the hinges.

Schematic box plots, in contrast to simple box-and-whisker plots, have whiskers that extend to the adjacent values rather than to the extreme values.
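The fence rule used for the schematic box plots (fences placed 1.5 times the hinge spread beyond the hinges, whiskers ending at the adjacent values, and outside values identified separately) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names `hinges` and `schematic_summary` are hypothetical, only the inner fences are computed, and the small batch in the demonstration is invented for illustration and is not taken from any of the tables discussed here.

```python
import math


def hinges(sorted_data):
    """Lower and upper hinges from the depth rule d(H) = ([d(M)] + 1)/2."""
    n = len(sorted_data)
    depth = (math.floor((n + 1) / 2) + 1) / 2
    lo, hi = math.floor(depth), math.ceil(depth)   # equal for a whole depth
    lower = (sorted_data[lo - 1] + sorted_data[hi - 1]) / 2
    upper = (sorted_data[n - lo] + sorted_data[n - hi]) / 2
    return lower, upper


def schematic_summary(values, step_factor=1.5):
    """Return hinges, inner fences, adjacent values, and outside values."""
    data = sorted(values)
    if not data:
        raise ValueError("need at least one data value")
    lower_hinge, upper_hinge = hinges(data)
    step = step_factor * (upper_hinge - lower_hinge)
    lower_fence = lower_hinge - step
    upper_fence = upper_hinge + step
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    outside = [v for v in data if v < lower_fence or v > upper_fence]
    return {
        "hinges": (lower_hinge, upper_hinge),
        "fences": (lower_fence, upper_fence),
        # Adjacent values: closest to each fence but still inside it.
        "adjacent": (inside[0], inside[-1]),
        # Outside values are the ones to plot and label individually.
        "outside": outside,
    }


# Invented illustrative batch (hypothetical numbers, not data from the article).
batch = [1, 20, 22, 24, 26, 28, 30, 60]
print(schematic_summary(batch))
```

In this invented batch the hinges are 21 and 29, so the fences fall at 9 and 41; the whiskers would end at 20 and 30, and the values 1 and 60 would be drawn and labeled separately, just as CO or Al₂O₃ are in the figures.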

Heats of Solution

A box plot was constructed for 113 values of the heats of solution of uni-univalent electrolytes in water at 25 °C and is displayed in Figure 3. Before looking at such a display an obvious question concerning heats of solution is the relative distribution about positive and negative values. Are there as many exothermically as endothermically dissolving species? A box plot shows a more or less even distribution about exothermic and endothermic values; in fact, the median of the middle half of the data slightly favors endothermic values.

Note the exceptionally large heats of solution possessed by both RbClO₄ and HClO₄, the latter being extremely exothermic, the former extremely endothermic. Another question immediately posed by these rankings is: are the values for NaClO₄, KClO₄, and CsClO₄ between the acid and the rubidium compound? We would expect so, but the box plot almost forces us to check and see. Note the relative values for NH₄NO₃ and NaOH, both substances being classical examples of dissolving species: NaOH is, of course, very exothermic and NH₄NO₃ (used in cold packs) is quite endothermic on dissolving. While both are somewhat unusual among dissolving species in that they lie outside of the box which envelops the middle half of the data, they are nevertheless not really exceptional when compared to RbClO₄ and HClO₄.

Solubility Products

Although the author has constructed and uses numerous other examples of box plots in his classes, a final illustration will be given. From a table of 102 values of solubility product constants a schematic plot (a box plot with upper and lower fences delimiting the whiskers) of these data was constructed and is shown in Figure 4. Such a display is very useful to introductory students learning solubility rules for the first time. Students can immediately see the very insoluble sulfides and hydroxides as standing out in the data set. By first introducing such a display students are much more receptive to learning solubility rules because they now have a visual, concrete point of reference.

Conclusions

Box-and-whisker plots can be rapidly constructed and thus provide a means for quickly assessing relative data values in large (or small) data sets consisting of chemical and physical properties. Exceptionally large or small values stand out and demand explanation. Useful insight is obtained by a visual comparison of these relative values; they are potential fodder for trends, descriptions, and mathematical models. The author believes that tables of data that appear in introductory textbooks, especially alphabetically arranged data, are uninformative except as a quick source of a "number," and should be accompanied by a box plot which shows the content of the table of data. Simply seeking highs and lows and particular groupings in a table does not produce the same results or give the clues that the numerical detective work of exploratory data analysis achieves. It is not inappropriate to quote Tukey, the originator of the philosophy of exploratory data analysis and the architect of box-plot displays: "exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as for those we believe might be there. Except for emphasis on graphs, its tools are secondary to its purposes" (7).

Figure 3. Box-and-whisker plot for heats of solution of 113 uni-univalent electrolytes in water at 25 °C; units are kcal mol⁻¹. RbClO₄ and HClO₄ are the extremes of these data, the former having an unusually large endothermic heat of solution whereas the acid has an unusually large negative heat of solution. These large values can be compared to the familiar benchmark substances, NH₄NO₃ and NaOH, which are usually cited as being highly endo- and exothermically dissolving species. A typical heat of solution, on the other hand, has a value only a few kcal mol⁻¹ above or below thermoneutrality.

Figure 4. Schematic box plot for relative pKsp values of 102 slightly soluble electrolytes. Bi₂S₃ has an extremely large pKsp which reflects its high insolubility. The high oxidation state metal hydroxides also have a pKsp above the upper fence, immediately showing their high insolubility. At the other extreme, KClO₄ has a very low pKsp value. Typical pKsp values range from 10 to 20. It is interesting that AgBr has a very ordinary insolubility.

Literature Cited

(1) Tukey, J. W., "Exploratory Data Analysis," Addison-Wesley, Reading, MA, 1977.
(2) [Entry illegible in the source.]
(3) [Authors partly illegible] ..., G. P., "Chemical Principles," 3rd ed., Benjamin/Cummings, Menlo Park, ...
(4) Velleman, P. F., and Hoaglin, D. C., "Applications, Basics, and Computing of Exploratory Data Analysis," Duxbury Press, Boston, MA, 1981.
(5) Parzen, E., J. Amer. Stat. Assoc., 74, 105 (1979).
(6) Masterton, W. L., Slowinski, E. J., and Stanitski, C. L., "Chemical Principles," ...th ed., Saunders, Philadelphia, p. 113.
(7) Tukey, J. W., J. Amer. Stat. Assoc., 74, 121 (1979).
