
A Statistical Analysis of Infrequent Events on Multiple-Choice Tests That Indicate Probable Cheating

Michael J. Sundermann
Lone Star College-Montgomery, Conroe, TX 77384; [email protected]

A challenge for any teacher is writing tests that gauge student knowledge. When a student cheats, the test result is considerably less relevant for evaluation purposes. Pervasive cheating also creates a culture of dishonesty in which even more students cheat (1) and nobody can be trusted, and these attitudes carry over into future employment (2).

Identifying cheating on essay questions is relatively easy, even if it cannot be determined who cheated from whom. It is usually more difficult to identify on multiple-choice tests, because students do not have to write out their thought processes. By looking at the patterns of correct and incorrect answers on tests, however, the occurrence of cheating can often be strongly indicated. Cheating by elementary school teachers on standardized tests has been detected by the presence of long streaks of identical student answers (3). Cheating by students can be detected in a similar way: by a suspiciously large number of answers in common. Harpp and Hogan found that blatant cheating could be detected by comparing the ratio of exact errors in common (EEIC: two students give the same wrong answer on a question) to errors in common (EIC: two students get the same question wrong but do not necessarily give the same wrong answer) (4). Their technique required inputting a large number of test answers from different students into a computer program. They later found that comparing the ratio of EEIC to differences (D: all cases where the students give different answers) gave an even more discriminating relationship (5). Using data from over 10,000 students, they showed that an EEIC:D ratio of greater than one is highly indicative of cheating.

The goal of this article is to find an easy way to detect cheating and to assign a numerical probability to the likelihood of cheating by examining merely the two suspect tests. Some statistical studies of cheating have been performed (6), but not all of them use the results of real exams. In particular, not all questions on a test are equally difficult, and not all wrong answers on a test are equally popular. In addition, the large number of exams necessary to perform an analysis of infrequent events (less than one in a million) cannot feasibly be obtained. Therefore, the best way to study infrequent events is to perform simulations, using real data as a template for the test answers. The test used for this article is the 2002 ACS Organic Chemistry exam, for which the scores of 1852 students from 46 colleges were reported.

Method

Two parameters were examined for this article: the value of the EEIC:D ratio and the longest streak of identical answers. One billion pairs of simulated student answers were created and compared in a series of different situations.
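As a concrete illustration of the two parameters, the sketch below computes EEIC, EIC, D, the EEIC:D ratio, and the longest identical streak for a pair of answer sheets. It is a minimal example written for this discussion rather than the program from the online supplement, and the answer key and student responses are hypothetical.

```python
def compare_sheets(key, student1, student2):
    """Compare two students' answers against the key, returning
    EEIC, EIC, D, the EEIC:D ratio, and the longest identical streak."""
    eeic = eic = d = 0     # exact errors in common, errors in common, differences
    longest = current = 0  # longest run of identical answers, right or wrong
    for k, a1, a2 in zip(key, student1, student2):
        if a1 == a2:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
            d += 1         # D: the students answered this question differently
        if a1 != k and a2 != k:
            eic += 1       # EIC: both wrong, not necessarily the same answer
            if a1 == a2:
                eeic += 1  # EEIC: both wrong with the exact same answer
    ratio = eeic / d if d else float("inf")  # identical sheets give D = 0
    return eeic, eic, d, ratio, longest

# Hypothetical 10-question test with choices A-D
key = "ABCDABCDAB"
print(compare_sheets(key, "ABCDABCDCC", "ABCDABCDCD"))
# -> (1, 2, 1, 1.0, 9)
```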

For simulation 1, the student answers from the actual ACS test were used. This test has 70 multiple-choice questions with four answers each, with an overall average of 61.5%. Computer-simulated answers were generated for each question. For instance, on question 1, 66.1% of students answered the question correctly, 27.9% gave the most popular incorrect answer, 4.4% gave the second most popular incorrect answer, and 1.6% gave the least popular incorrect answer. Simulated students 1 and 2 were each assigned a random number between 0 and 1 by computer, and the number was translated into an answer using the percentages above: a random number between 0 and 0.661 was assigned the correct answer, a random number between 0.661 and 0.940 (a difference of 0.279) was assigned the most popular incorrect answer, and so on. The same process was performed for both students for the other 69 questions, using the percentages from the actual student answers on the ACS test. The EEIC:D ratio of the test was calculated, as well as the longest continuous streak of identical answers. This process was repeated 1,000,000,000 times.

On the ACS test overall, the correct answer was given 61.5% of the time. Some questions were easier and some were harder; the percentage of correct answers on any given question had a standard deviation of 12.3%. The most popular wrong answer was given 20.8% of the time, with a standard deviation of 8.3%. The third and fourth answers had averages of 11.1% and 6.6% and standard deviations of 4.3% and 3.8%, respectively. In one case, the most popular incorrect answer was given more often than the correct answer, a common occurrence on tests when students share a misconception.

For simulation 2, the average difficulty of the 70 questions was kept the same, but the difficulty of each individual question was assigned randomly. The test was calibrated so that the average question difficulty was 61.5%; therefore, the overall test scores of a large number of students would average 61.5%, even though some questions were made harder than others. The difficulty of the questions was given a Gaussian distribution with an average of 61.5% and a standard deviation of 12.3%. The wrong answers were also allotted and calibrated so that their overall averages and standard deviations were the same as on the ACS test. Simulated student answers were created and compared as described above: 1,000 student pairs were compared on each of 1,000,000 different simulated ACS tests, for a total of 1,000,000,000 comparisons. Simulation 2 gave the same results as simulation 1, verifying the integrity of the technique.

What effect does the number of test questions have on the results of the simulations? What effect does the class average have? These questions were examined next. For simulations 3–7, the class average of the simulated tests was changed and tested at 60%, 65%, 70%, 75%, and 80%, and the test length was changed to 50 questions. For simulations 8–12, the same class averages were used, and the test length was changed to 100 questions.
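The sampling scheme above amounts to inverting the cumulative answer distribution of each question. The sketch below shows one way to implement it; the per-question distributions are stand-ins (question 1 uses the percentages quoted above, and the remaining 69 questions reuse the test-wide averages), and the trial count is scaled far down from the article's one billion. The article's actual inputs were the empirical percentages from the ACS exam, and its own code is in the online supplement.

```python
import random

# Per-question probabilities of (correct, most popular wrong answer,
# second, third). Question 1 uses the percentages quoted above; the
# other 69 questions reuse the test-wide averages as placeholders.
distributions = [(0.661, 0.279, 0.044, 0.016)] \
              + [(0.615, 0.208, 0.111, 0.066)] * 69

def simulate_student(dists, rng):
    """Map one uniform random number per question onto the cumulative
    answer distribution; 0 codes the correct answer, 1-3 the wrong ones."""
    sheet = []
    for probs in dists:
        r, cum = rng.random(), 0.0
        for choice, p in enumerate(probs):
            cum += p
            if r < cum:
                sheet.append(choice)
                break
        else:
            sheet.append(len(probs) - 1)  # guard against rounding error
    return sheet

def ratio_and_streak(s1, s2):
    """EEIC:D ratio and longest identical streak for two answer sheets."""
    eeic = d = longest = current = 0
    for a1, a2 in zip(s1, s2):
        if a1 == a2:
            current += 1
            longest = max(longest, current)
            if a1 != 0:
                eeic += 1  # same wrong answer on the same question
        else:
            current = 0
            d += 1
    return (eeic / d if d else float("inf")), longest

rng = random.Random(2008)
trials = 100_000  # the article compared 1,000,000,000 pairs
big_ratio = big_streak = 0
for _ in range(trials):
    ratio, streak = ratio_and_streak(simulate_student(distributions, rng),
                                     simulate_student(distributions, rng))
    big_ratio += ratio >= 0.4
    big_streak += streak >= 11
print(f"P(EEIC:D >= 0.4) ~ {big_ratio / trials:.4f}")
print(f"P(streak >= 11)  ~ {big_streak / trials:.4f}")
```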


Table 1. Analysis of Simulation 1, EEIC:D Ratios

  Probability of Occurrence    EEIC:D Ratio
  10^-2                        0.4
  10^-3                        0.5
  10^-4                        0.6
  10^-5                        0.6
  10^-6                        0.7

Table 2. Analysis of Simulation 1, Longest Streaks

  Probability of Occurrence    Longest Streak
  10^-2                        11
  10^-3                        14
  10^-4                        16
  10^-5                        19
  10^-6                        22

Table 3. Probability of a Certain EEIC:D Ratio for a Streak of a Given Length

  Longest    EEIC:D Ratio
  Streak     0.4          0.5          0.6          0.7
  11         1.3 x 10^-2  1.3 x 10^-3  2.0 x 10^-4  2.2 x 10^-5
  14         2.6 x 10^-2  3.0 x 10^-3  7.6 x 10^-4  1.2 x 10^-4
  16         3.9 x 10^-2  4.9 x 10^-3  1.1 x 10^-3  1.6 x 10^-4
  19         6.5 x 10^-2  7.2 x 10^-3  2.1 x 10^-3  4.1 x 10^-4
  22         1.1 x 10^-1  3.1 x 10^-2  5.2 x 10^-3  0

Table 4. Probability of a Certain Longest Streak for an EEIC:D of a Given Ratio

  EEIC:D     Longest Streak
  Ratio      11           14           16           19           22
  0.3–0.4    2.9 x 10^-2  3.7 x 10^-3  9.2 x 10^-4  1.1 x 10^-4  1.0 x 10^-5
  0.4–0.5    5.8 x 10^-2  9.3 x 10^-3  2.7 x 10^-3  4.0 x 10^-4  4.8 x 10^-5
  0.5–0.6    1.0 x 10^-1  2.0 x 10^-2  6.5 x 10^-3  1.1 x 10^-3  1.9 x 10^-4
  0.6–0.7    1.8 x 10^-1  4.6 x 10^-2  1.5 x 10^-2  4.1 x 10^-3  7.3 x 10^-4

It should be noted that it is possible that some students cheated on the tests submitted to the ACS. It is hoped that professors caught such tests before submitting them, but any cheating would increase the correlation of the student test answers. Therefore, the data below may slightly underestimate the probability of a student's having cheated on a test, but will not overestimate it.

Simulated students were not programmed to be high or low performing: a student in the simulation who got question 1 correct was not more likely to get question 2 correct. Even though the distribution of simulated overall student scores therefore differs from that of the actual ACS test, this does not matter, because when students are suspected of cheating, the percentage scores of the suspicious students, rather than of the overall class, can be used.

Results

The results of simulation 1 are provided in Tables 1 and 2. The probability of an EEIC:D ratio of greater than 0.4 was less than 1 in 100, as was the probability of a continuous streak of at least 11 identical answers (right or wrong). The probability of an EEIC:D ratio of greater than 0.7 was less than 1 in a million. A ratio this high is highly indicative of cheating and corresponds to the results of Harpp and Hogan, who used actual exams. Out of the billion simulated trials, only 2 had a ratio greater than 1.0. A continuous streak of more than 22 is also highly indicative of cheating; this parameter is useful because when students cheat, the copied answers tend to occupy one section of a test. Out of the billion trials, no streak was longer than 29 questions.

Tables 3 and 4 show the correlation between the EEIC:D ratio and the longest streak. Table 3 shows that there is a positive correlation between the two: the longer the observed streak, the greater the probability of a large EEIC:D ratio. The zero at the lower right of Table 3 is the result of small sample size; a streak of exactly 22 occurred so rarely that there were no observed tests with an EEIC:D ratio of more than 0.7. This fact does show that the two variables have a large degree of independence. Even tests with very long streaks of similar answers usually have moderate EEIC:D ratios, which means that a test with a large streak and a large ratio is almost certainly a product of cheating, unless both students answered almost all questions correctly. If two students receive a 69 out of 70 and got the same question wrong, it could not be assumed that cheating occurred.

Table 4 reconfirms the correlation between the longest streak and the EEIC:D ratio. As expected, the greater the EEIC:D ratio, the greater the likelihood of a long streak. However, the presence of a large ratio certainly does not guarantee a long streak.


Table 5. Analysis of EEIC:D Ratio on Test with 50 Questions

  Probability of    Class Average on Test
  Occurrence        60%    65%    70%    75%    80%
  10^-2             0.4    0.4    0.4    0.3    0.3
  10^-3             0.6    0.5    0.5    0.5    0.5
  10^-4             0.7    0.7    0.6    0.6    0.6
  10^-5             0.9    0.8    0.8    0.8    0.9
  10^-6             1.0    1.0    0.9    1.0    1.1

Table 6. Analysis of EEIC:D Ratio on Test with 100 Questions

  Probability of    Class Average on Test
  Occurrence        60%    65%    70%    75%    80%
  10^-2             0.4    0.3    0.3    0.3    0.2
  10^-3             0.4    0.4    0.4    0.3    0.3
  10^-4             0.5    0.5    0.4    0.4    0.4
  10^-5             0.6    0.5    0.5    0.5    0.5
  10^-6             0.6    0.6    0.6    0.5    0.5

Table 7. Analysis of Longest Streaks on Test with 50 Questions

  Probability of    Class Average on Test
  Occurrence        60%    65%    70%    75%    80%
  10^-2             10     11     13     15     18
  10^-3             13     14     16     19     23
  10^-4             16     17     20     23     28
  10^-5             19     21     23     27     33
  10^-6             20     24     27     31     37

Table 8. Analysis of Longest Streaks on Test with 100 Questions

  Probability of    Class Average on Test
  Occurrence        60%    65%    70%    75%    80%
  10^-2             11     12     14     16     20
  10^-3             14     16     18     21     25
  10^-4             17     19     21     25     31
  10^-5             20     22     25     29     36
  10^-6             23     25     29     33     41

The number of questions and the percentage of correct answers have an impact on the probable EEIC:D ratios and longest streaks. Table 5 indicates how the probabilities of the EEIC:D ratios change with the test average for a test of 50 questions (simulations 3–7), and Table 6 shows the ratios for a test of 100 questions (simulations 8–12). Table 7 indicates how the longest streaks change with the test average for a test of 50 questions (simulations 3–7), and Table 8 shows the longest streaks for a test of 100 questions (simulations 8–12).

Discussion

The distribution of EEIC:D ratios is not affected very much by the difficulty of the test. The robustness of these numbers makes the EEIC:D ratio a good indicator of cheating for a variety of tests. The ratio distribution did depend on the number of test questions: a shorter test is more likely to give a larger ratio by chance, so the ratio is a stronger indicator of cheating on longer tests.

The distribution of longest streaks follows a different pattern. It is affected by the length of the test and strongly affected by the difficulty of the questions. As one would expect, there is less chance of a long streak if a larger number of questions is answered incorrectly. Using the longest streak as an indicator of cheating is therefore recommended on shorter tests with a low class average.

Suppose two students have an identical streak of 13 questions on a test of 50 questions on which they both received about a 70% score. Under the 70% column of Table 7, there is only a 1 in 100 chance that a streak of 13 questions would happen by chance. However, this does not mean that there is a 99 out of 100 chance that they cheated. To assess the probability of cheating, a Bayesian analysis can be applied.


Bayes' theorem states that

  P(C|E) = P(E|C) P(C) / [P(E|C) P(C) + P(E|N) P(N)]

where P(C|E) equals the posterior probability of cheating after the evidence (either the EEIC:D ratio or a streak) from the test has been considered, P(E|C) equals the probability that a cheating student will produce the observable evidence, P(C) equals the prior probability of cheating before the evidence has been considered, P(E|N) equals the probability that a test will produce evidence of cheating when none occurred, and P(N) equals the prior probability that no cheating occurred. Note that P(N) = 1 − P(C). The data calculated in this article can be used to determine the value of P(E|N). The value of P(E|C) is probably close to one, because it would be difficult to cheat without leaving the evidence of a large EEIC:D ratio or a long streak.
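A minimal numeric sketch of this calculation follows. The function is illustrative, and the prior P(C) = 0.05 is an assumed value chosen from within the range of class cheating rates cited below, not a figure from the article. It uses the streak-of-13 example from the Discussion, for which Table 7 gives P(E|N) of about 10^-2, with P(E|C) taken as 1.

```python
def posterior_cheating(p_e_given_c, p_e_given_n, p_c):
    """Bayes' theorem: posterior probability of cheating given that the
    evidence (an extreme EEIC:D ratio or a long streak) was observed."""
    p_n = 1.0 - p_c                    # prior probability of no cheating
    numerator = p_e_given_c * p_c
    return numerator / (numerator + p_e_given_n * p_n)

# Streak of 13 on a 50-question test with a 70% class average:
# Table 7 gives P(E|N) ~ 1e-2; P(E|C) is taken as 1.0 per the text;
# the prior P(C) = 0.05 is an illustrative assumption.
print(posterior_cheating(1.0, 1e-2, 0.05))  # -> ~0.84, well short of 0.99
```

This mirrors the caution above: a 1-in-100 coincidence does not translate into 99% confidence of cheating once a realistic prior is applied.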

The only factor that remains is P(C), the baseline percentage of students who cheat on a given test. Kerkvliet and Sigmund (7) found a large variance in cheating between classes, with probabilities ranging from 0.002 to 0.32. Harding (8) found that the average student cheats on about one test per year. Several studies have found that students with lower grade point averages are more likely to cheat (7, 9); however, Harpp and Hogan noted that their most suspicious pairs were students in the A–B range. Gender does not appear to have an effect on the probability of cheating (7). Clearly, more work needs to be done in this area to determine P(C).

Should a student be accused of cheating if the probability is 90%? 99%? What constitutes a reasonable doubt is something the test giver must contemplate: the damage to an innocent student falsely accused must be weighed against the cost of allowing cheaters to go undetected and unpunished. In most cases, the evidence will be clear. Even with a low value of P(C), the posterior probability of cheating P(C|E) will be very large if a test has an EEIC:D ratio of greater than 1.0 or a very long streak.

Students may respond that similar answers are the result of studying together rather than cheating. However, it has been found that the tests of students who study together do not correlate any more closely than the tests of students who do not, even if the students who studied together are identical twins (10). It should be noted again that the method above will not work for very high performing students: if two students receive a 69 out of 70 and got the same question wrong, it could not be assumed that cheating occurred.

Some research has been performed on ways to prevent cheating (7, 9). Kerkvliet and Sigmund found several factors that decreased the likelihood of cheating: using faculty instead of graduate students to proctor tests, using more proctors, using multiple versions of the test, and giving a verbal warning before the test all contributed to greater honesty. Some factors were found not to have an effect, including greater physical separation between students and the use of non-multiple-choice questions. Harpp and Hogan found that when students were seated randomly, away from their friends, and when multiple versions of the tests were given, cheating declined considerably (4, 5). Any students who are suspected of cheating can be separated for subsequent exams.

All of the results above deal with 4-response multiple-choice tests. Some multiple-choice tests have 5 possible responses; the extra distracter should make a large EEIC:D ratio or a long streak occurring by chance even less likely, and the analysis above even more discriminating. The results above should allow teachers to be more confident in their assessments of cheating and give them greater ability to back up their claims to students, deans, and principals. The computer code used for the simulations is provided in the online supplement; the author is willing to send the programs upon request and provide assistance.

Literature

1. Whitley, B. E. Research in Higher Education 1998, 39, 235–274.
2. Payne, S. L. College Teaching 1994, 42, 90.
3. Levitt, S. D.; Dubner, S. J. Freakonomics: A Rogue Economist Explores the Hidden Side of Everything; William Morrow: New York, 2005; pp 25–38.
4. Harpp, D. N.; Hogan, J. J. J. Chem. Educ. 1993, 70, 306.
5. Harpp, D. N.; Hogan, J. J.; Jennings, J. S. J. Chem. Educ. 1996, 73, 349.
6. (a) Wesolowsky, G. O. J. Applied Statistics 2000, 27, 909–921. (b) Rizzuto, G. T.; Walters, F. H. J. Chem. Educ. 1997, 74, 1185. (c) Roberts, D. M. J. Educ. Meas. 1987, 24, 77. (d) Aiken, L. R. Research in Higher Education 1991, 39, 725–736. (e) Cody, R. P. J. Med. Educ. 1985, 60, 136. (f) Frary, R. B.; Tideman, T. N.; Watts, T. M. J. Educ. Stat. 1977, 2, 235. (g) Angoff, W. H. J. Am. Stat. Assoc. 1974, 69, 44.
7. Kerkvliet, J.; Sigmund, C. L. J. Economic Educ. 1999, 30, 331.
8. Harding, T. S. On the Frequency and Causes of Academic Dishonesty Among Engineering Students. In Proceedings of the 2001 American Society for Engineering Education Annual Conference and Exposition; American Society for Engineering Education: Albuquerque, NM, 2001.
9. (a) Whitley, B. E.; Keith-Spiegel, P. Academic Dishonesty: An Educator's Guide; Erlbaum: Mahwah, NJ, 2002. (b) Cizek, G. J. Cheating on Tests: How to Do It, Detect It, and Prevent It; Erlbaum: Mahwah, NJ, 1999.
10. Harpp, D. N.; Hogan, J. J. J. Chem. Educ. 1998, 75, 482.

Supporting JCE Online Material

http://www.jce.divched.org/Journal/Issues/2008/Apr/abs568.html

Abstract and keywords
Full text (PDF) with links to cited URLs and JCE articles
Supplement: computer code used for the simulations
