provocative opinion
Can Teaching Assistants Grade Lab Technique?
This paper is an analysis of the widespread system of basing laboratory grades in freshman chemistry on laboratory results. We are not here concerned with the level of proficiency shown by the students but, more importantly, with whether such grading provides a fair test, and how such grading correlates with other measures of student performance (for example, a spot evaluation, or a teaching assistant's opinion). The validity of grades in terms of objective criteria is of vital importance to all curriculum planners. One of the reasons for grade inflation is the feeling that the raw scores made by students, or the opinions of teaching assistants, may not mean anything. This leads to the philosophy that says, "since they tried hard, and we're not really sure, let's give them all A's." The curriculum planner needs to know how many experiments need to be graded in this way to provide a fair sample. What fraction of otherwise competent students are likely to fail because of a "bad day"? We wish to minimize the effect of such an event on a student's grade but, on the other hand, to avoid the pressure and administrative hassle of overgrading.

To gather data for this study, we conducted a practical examination in our freshman chemistry lab course. The enrollment was about 500, equally divided between premeds and engineers. The idea of the practical exam is not new,¹ but it eliminates several difficulties that would occur if each regular experiment were graded on results. First, there is a conflict between expecting a teaching assistant to help a student, and then grading the results from that experiment. In a limiting case, a student who was so incapable that the TA had to do the experiment for him would get the best results and, hence, the best grade. If the teaching assistant were to be scrupulously fair, he would have to help nobody (in which case nothing would get taught), or help everybody equally (an impossibility, if only because the best students need the least help). By separating the teaching days from the evaluation periods, we are eliminating the inherent conflict between teaching and grading.

There are several other minor advantages to the practical exam form of organization. It provides a goal for the teaching assistant and student to work towards. Also, the achievement (or nonachievement) of the class as a whole is immediately obvious.
Experimental
Teaching assistants were expected to go through the practical exam the first week in the semester. This permitted us to criticize their titration technique and other laboratory habits, a necessity since nowadays even good graduate students often have little knowledge of wet analytical chemistry. By going through it themselves, they were also able to pinpoint traps to warn their students about. The idea of preparing their students for the examination was seen as a challenge by most.

The examination consisted of titrating an unknown containing sodium oxalate with KMnO4. The KMnO4 solution and a standard oxalate solution were prepared by the students the previous period. To avoid the danger of accidental or deliberate cross-contamination of reagents, the solids were dispensed in individual vials. Students were encouraged to do a "dry run" of the standardization the previous period, to test their solutions, and to become familiar with the endpoint. This was the third experiment involving titration during the semester, so that there was ample time to become familiar with the equipment before being tested.

¹ MacNevin, W., J. CHEM. EDUC., 38, 1M (1961).

The unknowns ranged from 10% to 25% oxalate. The material was predried by the stockroom and dispensed in individual vials. We purchased the material in bulk from Thorn Smith, Inc., Troy, Mich. A set of samples was run by one head TA, and by one of us (MP), to reassure us that the unknowns were as described.

Great pains were taken during the examination to see that no unknowns were mixed up, and that the numbers on the vials were recorded in several places besides the student's notebook. The TA's were rearranged so that they would not proctor their own students, and told not to assist with any problems of technique, but simply to proctor as they would for a written exam. Students were expected to work alone and without consultation. A proctor was also assigned to the balance room.

Students were permitted to do laboratory work for two and a half hours. At the end of this time, the proctor asked them to do the computation on an "income tax form" report during the next half hour. The form asked only for arithmetical operations and could be filled out without chemical knowledge. Students were permitted to average their values, take their best value, or even guess.

The reports were graded assembly-line fashion. All sections were combined and sorted by unknown number, and the results graded. If the correct value was 20%, the student had to report a result between 19.80% and 20.20% to receive a grade of 100. For each additional 0.2% error, ten points were subtracted.
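The scoring rule just described can be written out as a short function. This is a minimal sketch, not the graders' actual procedure: the name `practical_exam_grade` is illustrative, and it assumes that errors are compared in hundredths of a percent, that any partial 0.2% band beyond the first costs the full ten points, and that the grade cannot fall below zero. The article does not state any of these details.

```python
def practical_exam_grade(reported_pct, true_pct, band_pct=0.20, penalty=10):
    """Score a reported percent-oxalate value against the true value.

    Assumptions (not stated in the article): errors are compared in
    hundredths of a percent, a partial band beyond the first costs the
    full penalty, and the grade is floored at zero.
    """
    # Work in integer hundredths of a percent so that boundary cases such
    # as 20.20 vs. 20.00 are not disturbed by floating-point round-off.
    error = round(abs(reported_pct - true_pct) * 100)
    band = round(band_pct * 100)
    bands_used = -(-error // band)        # ceiling division
    extra_bands = max(0, bands_used - 1)  # the first band earns full marks
    return max(0, 100 - penalty * extra_bands)


if __name__ == "__main__":
    for reported in (20.00, 20.20, 20.21, 20.40, 19.50):
        print(reported, practical_exam_grade(reported, 20.0))
```

Under these assumptions a report of 20.20% against a true 20% keeps the full 100, while 20.21% already drops to 90. A more lenient reading of "each additional 0.2%" would change only the ceiling division.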
Our preliminary experiments showed that this standard was obtainable with student equipment, if three determinations were averaged. In all cases where students were far off, the arithmetic was checked on the form. In a few cases this explained the anomalous results.

A surprising amount of technique showed up in the form itself. One could see at a glance whether a student understood significant figures, or read the buret to hundredths. It was also clear whether he weighed the standard oxalate out blindly to a specified value or simply measured it exactly. Messy forms seemed to correlate with low grades. Where it was possible to fix up a student's arithmetic, the grade depended on the corrected result, since we were testing laboratory technique, not arithmetic.

Before the practical exam results were known, TA's were required to fill in comprehensive recommendation forms for each student, which asked for a 0 to 4 rating of "technique," defined as "mechanical or manipulative skill". Also, one of us (GK) made a spot check of students' techniques during the examination to see if that would yield useful information.

We believe that the method of administration prevented cheating, arithmetic errors, or other extraneous influences on the results, and that because of the structure of the report form, the test is a purely manipulative one, with as little contamination as possible.

Volume 53, Number 5, May 1976 / 313

Results

Almost half of our students obtained a result within 2% relative error, and two thirds were within 3% relative error. About one person in six was more than 10% off. In every such case, careful questioning of the student uncovered a definite technique blunder. The frequency distribution of student results was normal.

We found no correlation between the "over-the-shoulder" technique rating during the exam itself and the scores. This is not really surprising, in that the time spent observing was only about a minute or so per student. If a student blundered during the observation period, he might realize it, and fix up his experiment. If a student did a perfect job while being observed, he might easily blunder later. The χ² value for this correlation indicated that the results were purely due to chance (95% confidence level). Hence it is clear that detailed and repeated observations of a student are required to make a valid rating of technique; a spot check will not suffice.

We found a strong correlation between the written final in the course and the practical examination score. The numbers are shown in Table 1. The χ² value of 10.1 is significant at the 99.5% level.

There was little variation of the mean score from one lab day to the next. Within a given day, there were often drastic differences between sections, but the number of students per section (15) was not large enough to permit useful statistical testing. Frequently a teaching assistant would have a section that did very well on one of his teaching days, and another section on another day that did abysmally poorly. Hence, average section scores cannot be used as a measure of teaching assistant prowess.

The numerical technique ratings made by the TA's were
Table 1. Correlation of Written and Practical Exams

                                     Written exam score
                                     Above median   Below median
Practical exam above median               126             93
Practical exam below median                94            128

Table 2. Correlation between TA Rating of Technique and Exam Score for All Sections

                                     TA rating
                                     Above average   Average   Below average
Practical exam score above median          95            98          30
Practical exam score below median          40           101          86

Table 3. TA Rating of Technique Correlated with Exam Score for Most Accurate TA's

                                     TA rating
                                     Above average   Average   Below average
[cell counts missing in source]

Table 4. TA Rating of Technique Correlated with Exam Score for Least Accurate TA's

                                     TA rating
                                     Above average   Average   Below average
Practical exam score above median          10            15          11
Practical exam score below median          15            20           4

314 / Journal of Chemical Education
averaged for each section, and each student characterized as average, above average, or below average, based on the average TA rating for his section. This TA rating was then compared with whether the student had done well or not (above or below the median) on the practical exam. The resulting contingency table for all sections combined is shown in Table 2. The χ² value is 10.2, indicating significant positive correlation at the 99.5% confidence level.

The contingency tables for each TA showed much more variability. In general, the number of cases falling into some of the cells was too small to allow meaningful use of statistics. We were especially interested in the fraction of mistakes made by a TA. It is reasonable to assume that students rated above average would do well on the practical, and those rated below average would do poorly. Three of our TA's were right in more than 80% of their predictions. It should be noted that these TA's were superior in other respects. Three others were right less than 50% of the time, and one was right only 16% of the time. These three would have done better by tossing a coin to make a rating. The average prediction was right about 61% of the time. This is only slightly better than chance.

The best three contingency tables were combined into Table 3; the χ² value of 11.1 indicates positive correlation at the 9(Mb confidence level. The worst three were combined to yield Table 4; the χ² value of 4.8 indicates negative correlation at the 90% confidence level.

Discussion
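The χ² statistics quoted in the Results can be recomputed from the table counts. The following is a minimal sketch in plain Python, under several assumptions: the counts are taken as read from Tables 1 and 4 (the row and column assignment is itself reconstructed from the printed layout), the statistic is the uncorrected Pearson χ² (the article does not say which form was used), and a "prediction" is counted only for students rated above or below average. The names `chi_square` and `definite_hit_rate` are illustrative.

```python
def chi_square(table):
    """Uncorrected Pearson chi-square statistic for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def definite_hit_rate(above_median, below_median):
    """Fraction of above/below-average ratings that matched the exam result.

    Each argument lists counts in the column order
    (above average, average, below average). Ignoring the "average"
    column is an assumption, not something the article specifies.
    """
    correct = above_median[0] + below_median[2]
    wrong = below_median[0] + above_median[2]
    return correct / (correct + wrong)


# Rows: practical exam above / below median.
TABLE_1 = [[126, 93], [94, 128]]        # columns: written above / below median
TABLE_4 = [[10, 15, 11], [15, 20, 4]]   # columns: TA rating above / avg / below

if __name__ == "__main__":
    print(round(chi_square(TABLE_1), 2))
    print(round(chi_square(TABLE_4), 2))
    print(definite_hit_rate(*TABLE_4))
```

With these counts the statistic comes out near 10.2 for Table 1 and near 4.9 for Table 4, in line with the quoted 10.1 and 4.8. The pooled hit rate for the least accurate TA's is 14 of 40, or 35%, which is consistent with "less than 50% of the time".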
Two rather startling conclusions can be drawn from this study. The first is that there is a correlation between results on a written exam and those on a practical exam. This is true even though the practical was a purely manipulative test, since the results were calculated via "income tax forms", and even the arithmetic corrected if necessary. Much of the practical required only the ability to follow directions, and a few manipulative skills. Apparently, however, a "good" (or perhaps "well motivated") student is a good student, no matter what the task at hand.

The second startling result is the wide variability in TA's ability to judge technique. If the TA's opinion did not correlate overall with the exam score, one would suspect the exam itself. And if the TA's ability to rate the students was flawless, then the practical exam would be unnecessary. We have found that a few TA's are very good judges of technique, the majority do only slightly better than chance, and a few do significantly worse than chance. It is frightening that in one section in six, TA ratings are negatively correlated with practical exam scores. This last result justifies an objective test of lab skill.

The problem with having TA's do an opinion rating of lab skill is not that there will be the inevitable section corrections. Methods for coping with these have been worked out.²,³ Rather it is that the underlying appraisal of a student's skill may be fundamentally invalid, in the sense that it is not correlated with objective criteria. If this systematic error is introduced, it is impossible to remove it by statistical adjustment.

Caution should be used in extrapolating from our experiment. It may be that our result is applicable only to our institution, due perhaps to some idiosyncratic factor of our TA training program. It is also possible that TA's may be better judges of other types of achievement than they are of laboratory skill.
But it is axiomatic that a grading system can only be as good as its poorest grader. Our bias toward objective tests of laboratory technique is founded not on a belief in Keller plan behavioral objectives, but on the necessity of creating a fair and reproducible system.

² Bahe, L. W., J. CHEM. EDUC., 40, 90 (1963).
³ Schwendeman, R. H., J. CHEM. EDUC., 45, 12s (1966).

It is also not clear how to improve our TA's skill at making these technique judgments. Even if we take the grading out of their hands, we have not solved the problem. The ability to recognize bad technique is the first step in finding out which students need help, and therefore is central to the teaching, as well as the grading, process. We are presently simply exhorting TA's to "know your students", since we have shown above that detailed acquaintance with a student's work, not a mere spot check, is necessary.

We have found that even the best judges of technique make mistakes, assuming for the moment that the practical exam is a good test. Do these errors represent the effects of a "bad day", as students like to claim, or a TA judgment error? The joint probability that a student has a practical exam grade below the median, and has been rated above average by his teaching assistant, is about 5% for the most accurate TA's. (It should be noted that this value is slightly doubtful, because the number of accurate TA's proved to be less than anticipated when the study was [...], and so the sample size is only 61 students. However, it suffices for this argument to conclude that the probability is small in magnitude.) It is impossible to disentangle the effects of a "bad day" from judgment errors. But even if all of the TA rating errors are caused by "bad days", it is clear that "bad days" are quite rare, and the test is a reasonable sample of student technique. Since there are certainly some judgment errors involved, it is safe to regard one in twenty as an upper limit for the probability of a "bad day" as defined above. (The day doesn't have to be a total disaster to qualify under our definition: all of the "victims" of "bad days" scored close to the median even if not in the top half.) It is also worth noting that the grade, even with the "bad day" factor, is far more accurate than it would be if TA opinion were the only input into the technique part of the course grade. In that case, about one student in eight would be misgraded, as shown in Table 2.
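The one-in-twenty upper limit above is a per-exam figure, and it can be extended to several exams. The following is a minimal sketch that assumes a student's "bad days" on separate exams are independent; that is an assumption of this illustration, and `repeated_bad_day_bound` is an illustrative name.

```python
from fractions import Fraction


def repeated_bad_day_bound(per_exam_bound, n_exams):
    """Upper bound on the chance of a "bad day" on every one of n exams.

    Assumes the exams are independent events, so the per-exam bounds
    simply multiply.
    """
    return per_exam_bound ** n_exams


ONE_IN_TWENTY = Fraction(1, 20)

if __name__ == "__main__":
    for n_exams in (1, 2, 3):
        print(n_exams, repeated_bad_day_bound(ONE_IN_TWENTY, n_exams))
```

Under the independence assumption, two practical exams bring the bound down from 1/20 to 1/400, so a second exam buys far more protection against a "bad day" than its grading cost might suggest.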
Since the number of cases where accurate TA's overrated students was so small, we examined the detailed TA comments. The pattern seemed to be great enthusiasm for a student's theoretical skill, but a more lukewarm view of practical skills. The overrating may be partially a case of the "halo effect", and those students overrated may be better with their heads than their hands. This anecdotal information reinforces our belief that the figure of one in twenty for the chance of a "bad day" is an upper limit.

Finally, it is clear that if one held two such practical exams, the probability of a "bad day" happening on both would be at most one in four hundred. In light of this we feel that the typical lab course is overgraded. The results of only a very few experiments are needed to give an adequate sample of a student's technique. One may or may not gain motivation by grading results on every experiment, but one does not gain much additional grading information.

To summarize, we have found that the practical exam format has a number of organizational advantages over the traditional practice of grading results on each student experiment. Either method, however, should produce significantly more valid grades than input based on TA opinion. The results on the practical exam are found to be correlated with written exam results, and seem to provide a reasonable sample of student performance.

Acknowledgment
We are grateful to Pearl Steinmetz who did much of the tedious statistical work. The manuscript also was greatly improved by the critical reading of Dave Monts and Steve Goldstein.
Miles Pickering
Gary Kolks
Columbia University
New York, New York 10025