Charles B. Leonard, Jr.
University of Maryland, Baltimore, Maryland 21201
Computer-Graded Examinations: Evaluation of a Teacher
When a teacher examines the print-out from a computer-graded examination, his prime concerns will most likely be the profile of scores for the class and the individual student's score. However, as one becomes more familiar with the indices provided for each item in the print-out, one is not only able to evaluate the students, but also to evaluate himself as a teacher. It is to this that I wish to direct some consideration, as well as a brief explanation regarding the usefulness of the various indices.

The multiple choice examination can be graded extremely rapidly by computers and thus the feedback to the student is rapid, but the construction of an effective examination requires time and effort on the part of the teacher. The construction of effective test items can be achieved if one asks himself three questions, namely (1) what are my educational objectives for an examination, (2) how can I best achieve these objectives, and (3) how can I evaluate my attempts? Each teacher should formulate his list of educational objectives and this, in turn, would indicate the objectives for an examination and suggest a format by which these may be accomplished. Although many teachers may not follow the hierarchy of Bloom's taxonomy of educational objectives, one should be cognizant of it.¹ The educational objectives are classified into six major categories and are arranged from simple to complex as follows

1.00 Knowledge: recalling appropriate material previously encountered
2.00 Comprehension: lowest level of understanding of material being communicated without necessarily relating it to other material or seeing its fullest implication
3.00 Application: use of abstractions in particular and concrete situations in the form of general ideas, rules of procedure, or generalized methods
4.00 Analysis: breaking a communication into its constituent elements such that the organization of ideas is clear
5.00 Synthesis: putting together of elements and parts so as to form a whole structure
6.00 Evaluation: judging the value of materials and methods for a given purpose
The hierarchical order of the categories is predicated on the idea that the objectives in one category make use of and build on the objectives found in the preceding categories in the list. If one is aware of the educational objective for a given test item as well as its hierarchical level, the test item need not be ineffective when greater than 90% of the students answer the item correctly. For example, an actual test item used in a biochemistry examination was as follows

The atoms which glycine contributes to the purine nucleus
3. C-4; C-5; N-7
. . .
The results showed that 100% of the students answered the item correctly. Since the item is testing knowledge of facts, the lowest of the hierarchical levels, the results might merely reflect the ability of the student to write down the facts given during a lecture and, in turn, to recognize and regurgitate our desired responses. This does not necessarily represent a learning process and thus the test item would be considered ineffective on this basis for the evaluation of students. However, if the educational objective of the lecture were to impart certain fundamental information to the student and if the objective of the test item were to determine the acquisition of the information, then one could consider that the objectives have been accomplished since 100% of the students made the correct response. On this premise the test item could be considered valid, with the teacher and his methods also being evaluated in the process. If similar analyses were obtained using the same test item in successive attempts, then the test item should either be rewritten for use at a higher level of the hierarchy, or dropped, since the educational objective of imparting that particular knowledge to the student has been and will be successful under similar circumstances.

Computer Print-out Format
An example of a typical item analysis provided in a print-out is given below for the following test item

In an oxidation catalyzed by malic dehydrogenase, the number of high energy phosphate bonds formed per molecule of substrate oxidized is (are)
1. three
2. two
3. one
4. none
5. unable to be calculated
The print-out would show the following

Question 1                          Correct Answer is 1
            1      2      3      4      5      omit
Upper      25      0      0      0      2       0
Lower       8      8      2      7      2       0
P # 0.611    PHI # 0.646    RPBI # 0.671    RBI # 0.855    DIS # 0.315
On the basis of scores, the students are divided into an upper and a lower group. The number of students in the respective group who chose a given response, namely 1, 2, 3, 4, 5, or who omitted the question, is given in the print-out. It will be noted from the example given above that 25 out of 27 students in the upper group answered the item correctly while only 8 out of 27 students in the lower group answered it correctly. In addition, the lower group exhibited a spread in their choices throughout the five possible responses. These results indicate good discrimination between the two groups of students, and this is reflected in the indices as will be discussed below. Negative values for the indices, with the exception of the P #, indicate that more students in the lower group chose the correct response. The print-out would also show if one particular wrong response were not chosen by either group. Such a response is obviously ineffective, and it would be better either to omit it in future use of the test item or to rewrite it following the thought or format of a more effective distractor.
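As an illustration of how such a tabulation can be produced, the short Python sketch below splits a class into upper and lower scoring groups and counts the responses chosen for one item. It is only a sketch: the function name, the data layout, and the use of groups of 27 (the halves of the 54-student class in the example) are assumptions, not features of the actual grading program.

```python
from collections import Counter

def tabulate_item(students, item_index, n_choices=5, group_size=27):
    """students: list of (total_score, responses) pairs, where responses[item_index]
    is the chosen option (1..n_choices) or None if the question was omitted."""
    ranked = sorted(students, key=lambda s: s[0], reverse=True)
    upper, lower = ranked[:group_size], ranked[-group_size:]

    def counts(group):
        c = Counter(resp[item_index] for _, resp in group)
        row = [c.get(choice, 0) for choice in range(1, n_choices + 1)]
        row.append(c.get(None, 0))   # omits
        return row

    return counts(upper), counts(lower)
```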
P #, PHI #, and DIS #

The P # or difficulty index is calculated as follows

P # = (# of students passing item) / (total # of students)
This index simply reflects the proportion of the total group who answered the item correctly, with a value of 0.500 giving maximum differentiation. A good test
item would have a P # in the 0.25-0.75 range, although a value greater than 0.75 may not indicate a poor test item if the educational objective represented a low level in the educational objective hierarchy. Aside from the educational objective premise, a value greater than 0.900 reflects an extremely easy item. This might be due to one of several reasons: (1) very ineffective distractors which were not related to the test item; (2) a clue to the correct response from another test item; (3) errors in grammar which give away the answer, such as incomplete sentences resulting when the possible responses are inserted into the test item; a statement requiring a singular response when plural responses are also given, in which case the "is (are)" format in the statement is recommended; or responses whose first word begins with a consonant even though the statement has the word "an" before the blank, implying an answer beginning with a vowel, in which case the "a (an)" format in the statement or the insertion of "a" or "an" in the response is useful. A P # less than 0.250 might indicate the use of terminology or thoughts not expressed during the lecture, or ambiguity in the test item, while a P # of 0.200 for a five-response choice may indicate mere guessing, since this value would be the probability that a given response was chosen as a result of guessing. Test items with an extremely low P # might also indicate an error on the part of the teacher in the construction of the test item, such as two correct responses or the choice of a wrong response as the correct one. Test items involving teacher error should be considered invalid for evaluation purposes and should be omitted, with a corresponding adjustment of the students' scores.

The PHI # or phi coefficient can be calculated as follows

PHI # = (AD - BC) / [(A + B)(C + D)(A + C)(B + D)]^½
where A and C are the numbers of students in the upper group who passed and failed the item in question, respectively, and B and D are the numbers of students in the lower group who passed and failed, respectively. This index represents a situation where the criterion variable is a natural dichotomy, i.e., upper-lower and pass-fail, and must be used as such. No assumptions are made about the distribution within the group among the possible responses. A value of 0.300 or greater indicates a good discriminating item.

The DIS # or discrimination index is calculated as follows

DIS # = (# upper correct - # lower correct) / (total # of students)

where the index attempts to show discrimination of the item with respect to the two groups. A DIS # of 0.300 or greater represents an item showing good discrimination, with 0.500 being the maximum value possible. The item analysis given above shows values for P, PHI, and DIS of 0.611, 0.646, and 0.315, respectively, and thus a good discriminating test item. Evaluation of this item indicated that the results were most likely due to the level of hierarchy at which the item was asked. Instead of asking for knowledge of a fundamental fact concerning the cofactor present in the enzyme malic dehydrogenase, the item asked for an application of this fact, thus achieving a higher level of hierarchy and a better discriminating item.
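The three indices can be checked directly from the upper and lower counts of the example item. The Python sketch below assumes, as in the example, that the two groups together constitute the whole class; the function and variable names are illustrative and the code is not taken from the author's program.

```python
from math import sqrt

def item_indices(upper_correct, lower_correct, group_size):
    """Return (P #, PHI #, DIS #) for an item from upper/lower group counts."""
    total = 2 * group_size
    A, C = upper_correct, group_size - upper_correct   # upper group: pass, fail
    B, D = lower_correct, group_size - lower_correct   # lower group: pass, fail
    p_index = (A + B) / total                          # difficulty index
    phi = (A * D - B * C) / sqrt((A + B) * (C + D) * (A + C) * (B + D))
    dis = (A - B) / total                              # discrimination index
    return p_index, phi, dis

print(item_indices(25, 8, 27))   # about (0.611, 0.646, 0.315), as in the print-out
```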
RPBI #, RBI #, and Reliability Coefficient
The RPBI # or point biserial correlation coefficient is calculated as follows
RPBI # = [(X̄p - X̄t) / St] (p/q)^½

where X̄p is the mean of scores in the upper group, X̄t is the mean of all scores, St is the standard deviation of all scores defined as [Σ(X - X̄t)²/N]^½, and p and q are the proportions of individuals in the upper and lower groups, respectively. In the program utilized by the author and provided by the Health Sciences Computer Center, University of Maryland, Baltimore, Md.,² X̄p is defined as the mean of the passing scores and St as [(ΣX² - N X̄t²)/(N - 1)]^½. The criterion for this index of discriminating power is a continuous variable; the variation of the RPBI depends on data which are continuous on one variable but dichotomous on the other, i.e., grades versus upper and lower groups. A reasonable test of significance of RPBI is made by use of the t distribution as follows
t = RPBI [N / (1 - RPBI²)]^½

where N is the degrees of freedom. Since N would vary depending on the situation, no single significant value for RPBI with regard to discrimination can be given, as was the case for the P, PHI, and DIS values.

The RBI # or biserial correlation coefficient is an index which requires the assumption that one of the normally distributed underlying variables has been forced into a dichotomy. The index is calculated as follows

RBI # = [(X̄p - X̄t) / St] (p/y)
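A short Python sketch of the RPBI # and its t test, written from the formulas above, is given below. It uses the population form of the standard deviation given in the text (the actual program, as noted, uses an N - 1 form), and the data layout and function names are assumptions made only for illustration.

```python
from math import sqrt
from statistics import mean, pstdev

def rpbi(scores, passed):
    """scores: every student's total score; passed: parallel booleans for one item."""
    x_t = mean(scores)                    # mean of all scores
    s_t = pstdev(scores)                  # SD of all scores, [sum((X - mean)^2)/N]^0.5
    x_p = mean(s for s, ok in zip(scores, passed) if ok)   # mean of passing scores
    p = sum(passed) / len(scores)
    q = 1 - p
    return (x_p - x_t) / s_t * sqrt(p / q)

def t_value(r, df):
    """Significance test: t = RPBI * [df / (1 - RPBI^2)]^0.5."""
    return r * sqrt(df / (1 - r * r))
```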
where y is the height of the ordinate of the unit normal curve at the point of division between p and q and can be determined from the Taylor's series approximation as follows³

y = -0.65244 + (6.03363 + (-12.72403 + (12.11453 - 4.76666 p) p) p) p
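The polynomial can be checked against the exact ordinate of the unit normal curve; the short Python sketch below does so with the standard library and is offered only as an illustration.

```python
from statistics import NormalDist

def y_approx(p):
    """Polynomial approximation for the ordinate of the unit normal curve."""
    return -0.65244 + (6.03363 + (-12.72403
            + (12.11453 - 4.76666 * p) * p) * p) * p

def y_exact(p):
    """Ordinate at the point dividing the curve into areas p and q = 1 - p."""
    nd = NormalDist()
    return nd.pdf(nd.inv_cdf(p))

print(y_approx(0.611), y_exact(0.611))   # roughly 0.383 and 0.384
```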
Likewise, as in the case for RPBI, no absolute values for discriminatory ability can be given. However, RBI tends to be substantially larger than RPBI.

One final index, which can be used as an evaluation of the entire examination as a unit, is the reliability coefficient. One procedure to determine reliability is the internal-consistency method, and it is calculated by the Kuder-Richardson formula 20 as follows

r = [n / (n - 1)] [1 - (Σ pᵢqᵢ) / St²]
where n is the number of test items, St² is the variance of the total scores, and pᵢqᵢ is the product of the proportions of passes and fails for item i.⁴ The index measures the internal consistency of the test material and estimates how closely the same set of scores would result if the same items were used under similar circumstances. The value of the index would theoretically range from 0.000 to 1.000, with a value of 0.600 representing a test with good reliability and values of 0.800 or above being highly reliable. A high index value means that the students' scores were not influenced by chance selection. Values above 0.800 are difficult to achieve consistently with a homogeneous class grouping and with items which have not been previously used, analyzed, and revised. If the index is low, the value can be increased by improving the items used, i.e., how sharply the items discriminate, or by adding more items, or both.
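A minimal Python sketch of the Kuder-Richardson formula 20, assuming the item results are available as a 0/1 matrix with one row per student, is given below; the data layout and function name are illustrative, not those of the author's program.

```python
from statistics import pvariance

def kr20(item_scores):
    """item_scores: one row of 0/1 item results per student."""
    n_students = len(item_scores)
    n_items = len(item_scores[0])
    totals = [sum(row) for row in item_scores]        # each student's total score
    s_t2 = pvariance(totals)                          # variance of the total scores
    sum_pq = 0.0
    for i in range(n_items):
        p_i = sum(row[i] for row in item_scores) / n_students   # proportion passing item i
        sum_pq += p_i * (1 - p_i)
    return (n_items / (n_items - 1)) * (1 - sum_pq / s_t2)
```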
As the test length increases, the test reliability increases as follows

rₖ = k rₜ / [1 + (k - 1) rₜ]

where rₜ is the reliability of a given test, rₖ is the reliability of the lengthened test, and k is the factor by which the number of items is increased. However, while a given test reliability can be increased by increasing the length of a test, the practical factor of time becomes the limiting factor.
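For example, the formula predicts that doubling the length of a test whose reliability is 0.60 would raise the reliability to 0.75; a one-line Python sketch (illustrative only) is

```python
def spearman_brown(r_t, k):
    """Predicted reliability when a test of reliability r_t is lengthened k-fold."""
    return k * r_t / (1 + (k - 1) * r_t)

print(spearman_brown(0.60, 2))   # doubling a 0.60-reliable test gives 0.75
```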
Summary

In an age when an increasing amount of information, with its ultimate knowledge, understanding, and use, is required to be imparted to the student, more effective teaching methods are of prime interest and concern. An understanding of the data which a teacher can obtain from a computer-graded examination will assist the teacher not only in the evaluation of student performance but also in the evaluation of the effectiveness of the teaching methods employed. Awareness and understanding of the P, PHI, and DIS values, as well as of the hierarchical level of questions, provide the teacher with a useful tool in his armamentarium. If these are used wisely, both the teacher and his students will reap great benefits.
Presented at the University of Maryland School of Dentistry Faculty Seminar series, October 21, 1968.

¹ BLOOM, B. S., "Taxonomy of Educational Objectives. The Classification of Educational Goals. Handbook I: Cognitive Domain," Longmans, Green and Company, New York, 1956, pp. 17-19, 201-7.
² G…, JOAN F., University of Maryland, College of Education, College Park, Md., private communication.
³ G…, STEVE, University of Maryland, Health Sciences Computer Center, Baltimore, Md., private communication.
⁴ FERGUSON, G. A., "Statistical Analysis in Psychology and Education," McGraw-Hill, New York, 1966, pp. 236-44.