VOLUME 24, NO. 3, MARCH 1952

Table III. Effect of Various Buffers (pH 7), Acid, and Alkali on Colorimetric Determination

                                                  Reading, Klett-Summerson
                                                  Photocolorimeter Divisions
Standard (30 γ maltose)                                      96
Standard + 0.5 mmol. citrate                                 55
Standard + 0.2 mmol. citrate                                 95
Standard + 0.1 mmol. ammonium sulfate                         0
Standard + 0.05 mmol. ammonium sulfate                       12
Standard + 0.025 mmol. ammonium sulfate                      96
Standard + 0.1 mmol. sodium hydroxide                        85
Standard + 0.3 mmol. hydrochloric acid                        0
Enough 1% permanganate solution must be added to the gum ghatti-phosphoric acid solution so that one additional drop of the oxidizing agent will impart to the gum ghatti solution a pink color that will remain stable for at least 24 hours. The blank should remain yellow in color. A green color indicates the presence of reducing impurities and makes accurate analysis difficult, even if the blank is used to determine an arbitrary zero point. If the gum ghatti solution has been standing for a long time in the acid medium, hydrolysis is apt to liberate further reducing groups, which must be reoxidized with permanganate. Improperly oxidized gum ghatti solutions cause large blanks.
ACKNOWLEDGMENT

The authors are grateful to the Corn Industries Research Foundation for support of this work.

LITERATURE CITED

Farley, E. F., and Hixon, R. M., Ind. Eng. Chem., Anal. Ed., 13, 616 (1941).
Folin, O., and Malmros, H., J. Biol. Chem., 131, 211 (1929).
French, D., Levine, M. L., and Pazur, J. H., J. Am. Chem. Soc., 71, 356 (1949).
Gore, H. C., and Steele, H. K., Ind. Eng. Chem., Anal. Ed., 7, 324 (1935).
Lansky, S., Kooi, M., and Schoch, T. J., J. Am. Chem. Soc., 71, 4066 (1949).
Meyer, K. H., Noelting, G., and Bernfeld, P., Helv. Chim. Acta, 31, 103 (1948).
Noelting, G., and Bernfeld, P., Ibid., 31, 286 (1948).
Nussenbaum, S., and Hassid, W. Z., J. Biol. Chem., 190, 673 (1951).
Potter, A. L., and Hassid, W. Z., J. Am. Chem. Soc., 70, 3774 (1948).
Richardson, W. A., Higginbotham, R. S., and Farrow, F. D., J. Textile Inst., 27, 131.
Umbreit, W. W., Burris, R. H., and Stauffer, J. F., "Manometric Techniques and Related Methods for the Study of Tissue Metabolism," p. 103, Minneapolis, Minn., Burgess Publishing Co., 1945.

RECEIVED for review October 26, 1951. Accepted December 26, 1951.
Organoleptic Panel Testing as a Research Tool

L. C. CARTWRIGHT, CORNELIA T. SNELL, AND PATRICIA H. KELLEY
Foster D. Snell, Inc., New York, N. Y.

Organoleptic panel test methods can be utilized by any laboratory group having 5 to 15 (preferably 15) people available for organoleptic training. The method is suitable for solving problems or answering questions concerning foods and beverages that are not susceptible of solution by other analytical procedures. The principal practical applications deal with detecting the source of a disturbing off-quality in flavor or taste, the effect of a specific ingredient added to improve quality, and probable consumer preference of competing products similar in general character but differing somewhat in flavor. Various aspects of such problems are discussed in terms of the experience of the group; examples illustrate different types of applications drawn from many industrial problems. Methods of scoring, the training of panel members, and statistical treatment of results are discussed. A summary of data is given, based on the evaluation of eight samples of a particular product. Values for standard error of the mean demonstrate the accuracy of evaluation and show the relative quality of the different samples.

THE human senses are capable of distinguishing a tremendous number of distinct impressions. The senses of most humans can be rendered much more acute and the memory of sensory perceptions can be highly developed through training (1). Herein lies a powerful tool for evaluation of those properties of matter that affect the senses, but serious errors may result from its careless use.

In the first place, there are great differences in the reactions of individuals to the same sensory stimuli (21). Recognition of this fact led to the use of carefully selected and trained professional tasters for beverages, and of odor experts in the perfume and essential oil industries (6, 7). A less widely recognized, but perhaps more serious, source of error is the almost continual, sometimes extreme, variation in the sensory acuity of even the most carefully selected expert (1). Errors from this source can be reduced by having the expert re-evaluate each sample several times, but this is not always feasible. A more effective means of detecting and minimizing such errors is through the use of an organoleptic panel of carefully selected and trained members. Individual errors then tend to be compensating.

Much attention has been given in recent years to the improvement and refinement of techniques of organoleptic panel testing, and many published papers attest to the effectiveness of this method as a research tool (9, 13, 14, 16-20). Although organoleptic panel testing has been used most extensively in the evaluation of odor and flavor in foods and beverages, there are numerous examples of its application to other problems. Most of the basic principles are directly applicable to the evaluation of any sensory stimulus.

SELECTION OF SYSTEM OF EVALUATION
First the property or properties of the product to be evaluated must be determined. These may be aroma, flavor, appearance, consistency, or any other property affecting sensory response. It must be decided whether these are to be evaluated separately, each as an entity, or each as a composite of several factors; and whether an over-all evaluation of the product or material is to be attempted, involving some weighted combination of the ratings of various factors. Then it must be decided whether evaluation is to be made on the basis of preference ratings of samples or on the basis of some effectively quantitative rating of each property, and whether results will be expressed as direct numerical scores or
ANALYTICAL CHEMISTRY
as descriptive terms that may be assigned numerical values. Each method has its field of special usefulness (1, 10).

In this laboratory, for example, two similar breads were evaluated, differing only in the presence or absence of a small amount of one ingredient. Evaluation was made by two methods: consumer acceptance, and trained panel testing. In the consumer survey, over 400 individuals were asked to express a preference for one bread or the other, and also to give an opinion as to aroma, flavor, freshness, and eating quality in terms of “good,” “fair,” or “poor.” The trained panel of five members gave numerical evaluations of aroma, color, porosity, texture, mastication, and flavor of the two breads. The consumer ratings were converted to numerical values to facilitate their interpretation. Mean values for the consumer results and for the trained panel indicated superiority of the same bread. Simple statistical analysis, including standard errors of means, average deviations, and graphical comparison, confirmed the significance of the findings and showed positive correlation between the consumer results and the trained panel results.

The nature of the problem generally dictates what properties are to be evaluated. The simplest method is merely an expression of preference rating. As illustrated by the bread study, experience favors preference ratings as most applicable for consumer surveys. This is also a good beginning in selecting new individuals for training in panel work. In these cases, rating should usually be of only two or three samples at a time, and it is often desirable to submit two identical samples and one that is different as a check on the reliability of the ratings (g). Preference rating may also be used to advantage (8) with a trained panel in certain instances, especially if the differences between the samples are only slight. In this case, however, a larger number of samples, even as many as six to ten, depending upon their nature and the danger of sensory fatigue, may be arranged in order of preference concurrently.

In packaging and shelf-life studies, each property or component is weighted initially. The maximum total score is 100, where aroma is rated 30, flavor 40, and aftertaste 30. Further subdivision of these properties into “presence of desirable” and “absence of undesirable,” giving equal weight to each subdivision, is made to place emphasis on possible contamination causing off-odor and off-flavor. For formulation work, this method has been used with suitable modifications to include particular components such as “sweet,” “sour,” “salt,” “spicy,” etc. A modification of the “flavor profile” method, which rates strength of flavor notes and over-all quality, has also been employed. This method allows for pictorial presentation of flavor (1). This pictorial method has been extended to distinguish threshold flavor or flavor detection from over-all quality (3).

SCORING FORMS AND DIRECTIONS
Although, in simple evaluations of a few samples, directions can be oral, and panel members can record their evaluations or give them orally to the panel conductor, it is usually desirable to have prepared scoring forms and written directions for distribution to the panel members (1, 15). These may be very brief. Some investigators prefer to use a series of descriptive terms such as “excellent,” “good,” “fair,” and “poor,” and to give these uniform numerical equivalents ranging from 10 to 0. An over-all score may be obtained as a weighted mean of these component scores. This introduces complications when independently variable properties are included, and considerable attention has been given to sound methods of combining such ratings. The authors generally prefer to weight numerical scores initially. This often results in a more ready understanding of the problem by all panel members. The nature and complexity of the forms and directions will vary with the problem, so that only illustrative suggestions can be given here.
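The weighted scoring described above can be sketched in a few lines of Python. This is only an illustration: the weights are those quoted for the packaging studies (aroma 30, flavor 40, aftertaste 30), while the intermediate numerical equivalents for “good” and “fair” and the names `WEIGHTS`, `TERM_VALUES`, and `weighted_total` are assumptions, since the text gives only the 10-to-0 range.

```python
# Weights from the packaging example in the text; maximum total is 100.
WEIGHTS = {"aroma": 30, "flavor": 40, "aftertaste": 30}

# Assumed numerical equivalents on a 10-to-0 scale. Only the end points
# come from the text; the values for "good" and "fair" are illustrative.
TERM_VALUES = {"excellent": 10, "good": 7, "fair": 4, "poor": 0}


def weighted_total(ratings):
    """Combine one member's descriptive ratings into a score out of 100."""
    total = 0.0
    for component, weight in WEIGHTS.items():
        # Scale the 10-to-0 equivalent to the weight of this component.
        total += weight * TERM_VALUES[ratings[component]] / 10
    return total


# Hypothetical scoring sheet for a single coded sample.
sheet = {"aroma": "excellent", "flavor": "good", "aftertaste": "fair"}
print(weighted_total(sheet))
```

Weighting the components at the outset, as the authors prefer, means each panel member sees directly how much each property contributes to the total.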
In the case of consumer surveys or untrained organoleptic panels, the directions must be complete and specific but as simple as possible. In these cases, evaluations are likely to be restricted to simple preference between two or three samples, or possibly a few descriptive ratings.

NUMBER OF PANEL MEMBERS
With competent members, the larger the panel the more reliable the mean values obtained, so that the upper limit of panel size is generally determined by weighing considerations of cost and of availability of acceptable members against the value of greater precision in the particular evaluation. The minimum permissible size of panel depends upon the ability and training of the members and the minimum acceptable precision. Various investigators have suggested minimum panels of from 3 to 50 members. The authors have obtained acceptable reliability with as few as three members in some work, but generally prefer to use at least five. The major portion of the direct numerical scoring is done with panels of from eight to ten carefully selected and trained members. However, when beginning to use a new scoring system or to score a new product, a larger panel is necessary to maintain the same degree of reliability. When such work is to continue over a long period of time, the results of the first several scoring sessions are analyzed to select the eight or ten most reliable members, and this smaller, but more select, panel can be expected to give about the same reliability as the larger initial panel.

One problem was to determine the cause of flavor deterioration in a cocoa product packaged in small glassine envelopes enclosed in paperboard cartons, which were packed in corrugated shipping cases. In determining which of the several packaging materials might give the most trouble, a panel of 15 members was used to test samples which had been subjected to accelerated aging in contact with each individual packaging material. Later, when the packaged cocoa was submitted to organoleptic evaluation for routine checking, the original panel of 15 was reduced to 8 experienced members, with practically no change in the accuracy of the results.
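The trade-off between panel size and reliability can be illustrated with a rough sketch. It assumes that the standard error of a mean score falls as the square root of the number of independent scores, and that repeat scores by the same member are as independent as scores from different members; neither assumption is stated in the text, and the spread value and the name `expected_sem` are hypothetical.

```python
import math


def expected_sem(spread, n_scores):
    """Expected standard error of a mean of n_scores independent scores.

    spread is the standard deviation of individual scores (assumed known).
    """
    return spread / math.sqrt(n_scores)


SPREAD = 6.0  # hypothetical standard deviation of individual scores

# (members, scorings per sample): larger panels versus small panels
# that rescore each sample several times.
for members, repeats in [(15, 1), (8, 1), (4, 3), (3, 3)]:
    n = members * repeats
    print(f"{members} members x {repeats} scoring(s): "
          f"expected S.E.M. = {expected_sem(SPREAD, n):.2f}")
```

Under these assumptions, the number of scores matters rather than the number of members. The reduced cocoa panel could keep its accuracy only because its eight members were the more consistent ones, which corresponds to a smaller spread in this sketch.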
If only three or four acceptable panel members are available, rescoring of each sample two or three times will give approximately the same reliability of final mean score as scoring once by a larger panel.

SELECTION OF PANEL MEMBERS
Although the ability of practically every individual to make accurate and reliable organoleptic evaluations can be tremendously increased by training (I d, 11), it is necessary to select the panel members for each specific organoleptic problem (8, 9, 10, 13, 17), and it is desirable to make further selection and elimination during the training period, and sometimes thereafter for as long as work continues.

In selecting a panel, the first consideration is that the individual shall have fair ability to distinguish between appreciably differing levels of the property or properties to be evaluated. The method which the authors use for selecting candidates is to offer them three samples of the product to be evaluated. These may represent three levels of quality, or two may be at the same level and the third at another level. The candidate is asked to evaluate these, either on straight preference or on a descriptive basis, with respect to the basic properties which are to be examined, such as aroma, flavor, or aftertaste. If the candidate successfully distinguishes between the samples that are actually different, indicates little or no difference between identical samples, and prefers the sample with the predetermined higher rating, he is selected for training. Other things being equal, panel members are preferred who have had previous panel experience, because there is considerable transfer of skill and interest from one organoleptic panel to another.

Further Selection and Training. The purpose of training is not only to increase the sensory acuity and memory of panel members, but also to make certain that all members of the panel have substantially uniform understanding of the particular properties to be evaluated, the criteria and system of evaluation, the relationship between quality or intensity of sensory stimuli and the descriptive or numerical terms used in the rating system, and the precautions necessary to minimize the effects of irrelevant factors on the rating of each property (1, 4). These matters are all explained and discussed with each candidate, even though he may have had much previous panel experience, because each problem involves special considerations.

NUMBER OF SAMPLES PER SESSION AND SESSIONS PER DAY
The number of samples that can be reliably evaluated concurrently in a single session, as well as the maximum number of sessions that should be attempted in a day, varies greatly, depending primarily on the nature of the samples and the properties to be evaluated, but also on the skill and experience of the panel members (1, 8, 15). The major limiting factor is sensory fatigue, but simple boredom, or inattention due to other interests, may sharply reduce the reliability of evaluation. A trained panel can accurately score from four to eight samples per session in the case of odor, flavor, and aftertaste evaluation of a wide variety of foods, ranging from baked goods and beverages to meats and vegetables. With spices, highly spiced foods, or strong alcoholic beverages, it is preferable not to exceed three or four samples per session. As little as 30 minutes has been found adequate for sensory recovery between sessions with many foods of relatively mild odor and flavor.

PREPARATION AND CODING OF SAMPLES
Samples should be as uniform as possible in all aspects and properties other than those to be evaluated. Above all, the actual identity of each sample must be concealed by suitable coded marking during evaluation (1). In so far as possible, amount, form, consistency, color, appearance, temperature, and container should be uniform. If some of these properties are to be evaluated, this should be done first and the samples then adjusted to uniformity for further evaluation. Effects of color and appearance on odor and flavor can be substantially eliminated by use of special lighting (1, 8), or the panel conductor may submit samples to the panel members without the member looking at the sample directly.

EVALUATION PROCEDURE
The usual procedure is to have a panel conductor present the coded samples and the scoring form for the session to each panel member in the absence of other members (1, 4, 15). Sometimes, especially with members new at panel scoring, it is helpful for the conductor to record the scores as the member gives them orally, so that the member can concentrate exclusively on evaluation, but experienced members find that recording their own scores is no serious handicap.

Whenever the nature of the problem permits, at least one control sample, also marked in code, of previously determined or otherwise established score is included in every session (1, 5, 15). In packaging or shelf-life studies, this may be a standard sample of the food product, either freshly prepared or one that has been stored under conditions to minimize any change in odor or flavor; this should receive the full score for all components. Other coded controls may include samples previously scored or samples aged under standard conditions having a known effect on the properties to be evaluated. Often such controls are also used as reference samples. The panel member is then free to check the sample he is testing against a known control.

Let us assume that six samples of a prepared food product are being scored in one session in connection with an investigation of packaging materials. One sample is a standard control, representing the product as manufactured before packaging; four are samples of the same production lot subjected to accelerated aging in contact with prospective packaging materials under established and carefully controlled conditions; and the sixth is an aged control sample, aged under the same conditions in a clean glass container. Equal amounts, say 50 grams, of each of these samples are placed in uniform, clean, screw-top glass jars, labeled only with code markings. A second sample of the standard control, marked “control,” is used for reference, with the understanding that it represents full score for each component. Having scored all samples in the session for odor, the member may taste the identified control, if he requires this to refresh his memory, although experienced scorers often find this unnecessary when scoring frequently. Then he tastes each of the coded samples and assigns numerical scores for flavor and aftertaste. The mouth is rinsed out with water before tasting each successive sample.

Incidentally, the term “flavor” is used instead of “taste” in this instance, as it should be in most cases of food tasting; it is generally impossible to dissociate taste from odor, and flavor implies the combined effects of both sensations, as well as of all other sensory response to food in the mouth, such as the “hotness” of pepper or the “cooling” of mint. Aftertaste, on the other hand, is generally less influenced by odor.

So far as possible, the product being evaluated should be tested in the manner nearest to actual use conditions. If other conditions, such as change in temperature or quantity or preparation, are used, the correlation between such conditions and actual use must be established (1, 15).

COMPILATION, ANALYSIS, AND INTERPRETATION OF PANEL RESULTS
Various methods of compilation, statistical analysis, and interpretation of organoleptic panel results have been discussed in the literature. Application of statistics will not change the basic character of incorrect data; it must be assumed that the data are sound. Statistical methods here as elsewhere are a tool, not a cure-all. And certainly if the answer to an experiment is obvious, an elaborate analysis is unnecessary.

In simple preference rating of pairs or ranking of multiple samples, the number and percentage of first choices, second choices, etc., for each sample may permit adequate interpretation of the results. However, with multiple ranking or with preference rating of all possible paired combinations of a number of samples, this simple method may not clearly indicate proper over-all ranking. Sound methods of analysis of such results are covered in detail in standard works on statistics (11, 12) and have also been presented in direct application to organoleptic problems.

In quantitative scoring, whether by descriptive terms with assigned numerical values or by direct numerical scoring, it is usual to calculate the means of the scores assigned by all panel members either to each property or component, or to the over-all quality of each sample. Deviations of scores by individual members from mean values show the relative agreement between members, and examination of the amount and direction of deviations by individuals for a number of samples, and especially in successive scoring sessions, indicates the consistency of each individual and whether he tends to score higher or lower on any or all components than the panel average. The major function of this type of analysis is in the continued training or selective elimination of panel members as work progresses.

On the other hand, calculation of the standard error of mean total score for each sample, and sometimes of mean score for each component, gives the most direct and effective measure of reliability of results. The usual formula for standard error of means is:

$$\mathrm{S.E.M.} = \sqrt{\frac{\sum x^{2}}{N(N-1)}}$$

where $\sum x^{2}$ is the sum of the squares of deviations of individual values from the mean and $N$ is the number of individual values. Use of standard error of means is invaluable in showing whether the difference between the scores of two samples is significant. In general, if two scores differ by as much as the sum of their standard errors, the significance of the difference may be taken as practically a certainty.

In order to illustrate the type of results obtained on scoring various samples using 100 as a perfect score, a series of scores on samples of varying quality is listed. This illustrates the accuracy of scores and shows that the accuracy of results decreases as the total scores decrease. This pattern of results often occurs when there is no conceivable way of illustrating a very poor sample with a score of 0.
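The standard-error formula and the rule of thumb for significance given above can be sketched as follows. The scores are hypothetical and the function names `mean_and_sem` and `differ_significantly` are illustrative; only the formula and the "sum of the standard errors" criterion come from the text.

```python
import math


def mean_and_sem(scores):
    """Return the mean and the standard error of the mean of the scores.

    S.E.M. = sqrt(sum of squared deviations from the mean / (N * (N - 1)))
    """
    n = len(scores)
    mean = sum(scores) / n
    sum_sq_dev = sum((x - mean) ** 2 for x in scores)
    return mean, math.sqrt(sum_sq_dev / (n * (n - 1)))


def differ_significantly(scores_a, scores_b):
    """Rule of thumb from the text: the two means differ by at least
    the sum of their standard errors."""
    mean_a, sem_a = mean_and_sem(scores_a)
    mean_b, sem_b = mean_and_sem(scores_b)
    return abs(mean_a - mean_b) >= sem_a + sem_b


# Hypothetical total scores (perfect score 100) from a five-member panel.
control = [96, 98, 94, 97, 95]
aged = [82, 88, 79, 85, 86]

for name, scores in [("control", control), ("aged", aged)]:
    mean, sem = mean_and_sem(scores)
    print(f"{name}: mean = {mean:.1f}, S.E.M. = {sem:.2f}")
print("significant difference:", differ_significantly(control, aged))
```

The criterion is deliberately simple. A formal significance test would be stricter, but the text presents this comparison as adequate for routine panel work.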
Sample    No. of Scores
A         11
B         13
C         12
D         14
E         15
F         15
G         12
H         18

Scores: Maximum 100.0