Communication
pubs.acs.org/jchemeduc

Bringing Nuance to Automated Exam and Classroom Response System Grading: A Tool for Rapid, Flexible, and Scalable Partial-Credit Scoring

Tom P. Carberry, Philip S. Lukeman, and Dustin J. Covell*,†

Chemistry Department, St. John's University, 8000 Utopia Parkway, Jamaica, New York 11439, United States

J. Chem. Educ. 2019, 96, 1767−1772. DOI: 10.1021/acs.jchemed.8b01004
Received: December 5, 2018; Revised: June 11, 2019; Published: June 27, 2019
ABSTRACT: We present here an extension of Morrison's and Ruder's "Sequence-Response Questions" (SRQs) that allows for more nuance in the assessment of student responses to these questions. We have implemented grading software (which we call ANGST, "Automated Nuanced Grading & Statistics Tool") in a Microsoft Excel sheet that can take SRQ answer data from any source and flexibly and automatically grade these responses with partial credit. This allows instructors to assess a range of understanding of material from student-generated answers, as in a traditional written exam, while still reducing grading workload for large classes. It also allows instructors to perform automated statistical analysis on the most popular answers and subanswers, from sources such as exams or classroom response systems (CRSs), to determine common misunderstandings and facilitate adjustments to instruction.

KEYWORDS: First-Year Undergraduate/General, Organic Chemistry, Problem Solving/Decision Making, Reactions, Second-Year Undergraduate, Student-Centered Learning, Testing/Assessment
■ INTRODUCTION

The use and design of more effective multiple-choice questions in assessment is a perennial debate in chemical education.1−5 We subscribe to the view that, despite the use of "distractors", question-ordering strategies, and other approaches, it is too easy for a student skilled in spotting "likely" answers to guess multiple-choice answers from a short list. In addition, the fact that students choose from a list of premade complete answers, rather than generating the answers themselves, is pedagogically unsatisfying. We were thus thrilled when Morrison's6 and Ruder's7 "sequence response question" (SRQ) work was published in this Journal, as it overcame the problems described above while still being amenable to CRS (classroom response system) use. Complete answers to SRQs are generated by the student combining subanswers, often with an emphasis on the ordering of the parts being important to correctness. Depending on the complexity of the question, the choice of subanswers and answer order can lead to a combinatorially large potential "answer space" of tens to many thousands of answers.
Pedagogically, we hold that within this answer space there are responses that reflect a range of understanding and thus deserve credit based on their partial correctness. In addition, analysis of the populated part of answer space can be useful to instructors for assessing the range of student comprehension. For example, SRQs have been used to provide students a selection of multiple reasonable answers to organic retrosynthesis questions.8 However, to our knowledge, there are no freely available automated grading systems that allow exploration of this "answer space" and facilitate the resulting assignment of partial credit for SRQs. We thus developed an automated grading system in an Excel spreadsheet,9 which we call "ANGST" ("Automated Nuanced Grading & Statistics Tool"), to perform these tasks. Our spreadsheet is designed to parse these answers based on a series of rules defined by the question type and quickly generate grades for all responses using a dynamic grading rubric. Furthermore, ANGST also allows analysis of every answer and subanswer so that instructors can determine which part(s) of the answer students are struggling with, in order to tailor instruction as necessary to address gaps in understanding.
Simultaneously, ANGST is designed to assist in the statistical analysis of a problem set, including score breakdowns for the class as a whole, by section, or by question. Additionally, the percentage of students whose answer follows each criterion in the rubric is reported. Thus, the instructor can quickly and easily identify strengths and weaknesses in the class's knowledge and take appropriate action. Moreover, the spreadsheet will automatically compile total scores and display typical statistics (e.g., mean, median, standard deviation) for the entire class after grading is complete. Finally, grade data for standard multiple-choice questions can also easily be incorporated into our spreadsheet alongside the SRQs, as we recognize that, in some instances, these questions are sufficient.

■ THE ANGST SPREADSHEET

ANGST is currently designed for several types of question, including the following:
• Multiple-answer questions, which will grade every character in a response separately (Figure 1).
• Answer pair (e.g., arrow-pushing mechanism) questions, which grade every pair of characters separately, with an option for penalizing extra answers (Figure 2).
• Ranked ordering questions (e.g., rating molecular properties, such as acidity, from high to low), which will assign credit based on putting a series of characters in a certain order (Figure 3).
• Custom questions (e.g., describing a synthesis based on reagent choice and order), which grade the entire response as a single string (Figure 4).

Figure 1. Sample multiple-answer question we used on a final exam. Points were awarded for the correct characters "3", "4", and "6", while deductions were applied for including "1", "3", or "5". See Supporting Information for detailed partial-credit analysis.

Figure 2. Sample answer pair question, with our grading rubric. We used this style for mechanism questions in the style of Straumanis and Ruder.7
Figure 3. Sample ranked ordering question, with our grading rubric. This style of question can assign credit based on comparing the locations of any two items in the sequence.
Figure 4. Sample custom question, which we used to make an organic synthesis question. The correct answer could be any of three sequences based on the rubric shown, but many variations were given in the actual class. See Supporting Information for detailed analysis.
Figures 1−4 each show examples of these types of questions as well as reference the partial-credit schemata we applied to introductory organic chemistry coursework. Further information on our partial-credit breakdowns and our thought processes can be found in the Supporting Information.
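To illustrate the character-level lookup used for multiple-answer questions (Figure 1), the short Python sketch below scores each distinct character in a response against a small table of credits and deductions and clamps the result at a user-controlled minimum, as ANGST does. The point values here are invented for illustration and are not the rubric from our exam; the sketch shows only the logic, not the spreadsheet's VBA implementation.

```python
def grade_characters(response, rubric, min_score=0):
    """Score each distinct character of a multiple-answer response
    independently: credit for correct choices, deductions for others."""
    raw = sum(rubric.get(ch, 0) for ch in set(response))
    return max(min_score, raw)

# Hypothetical point values in the spirit of Figure 1 (not the exam's
# actual rubric): +2 per correct character, -1 per listed incorrect one.
example_rubric = {"3": 2, "4": 2, "6": 2, "1": -1, "5": -1}

print(grade_characters("346", example_rubric))   # 6: all three correct choices
print(grade_characters("3461", example_rubric))  # 5: one deduction applied
print(grade_characters("15", example_rubric))    # 0: clamped at the minimum score
```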
We note that, while these example schemata reflect the authors' biases as to what is important in partial credit, ANGST is very flexible in the strictness or leniency that its grading rubrics can encode.

How ANGST Handles Student Data
ANGST first deconstructs students' complete SRQ answers into subanswer components. Then, these subanswers are judged using lookup tables (e.g., Figure 1, answer contains "3"; Figure 2, answer contains "68") or rules-based algorithms (e.g., Figure 3, answer has "1" before "5"). Partial credit is then awarded on the basis of the subanswers included by the student or for following the instructor-generated rules (negative point values can also be used to deduct for incorrect answers, at the instructor's discretion). A total question score is then tabulated for each problem, with user-controlled maximum and minimum values. Individual student answer sets are then compiled by ANGST; a total score is tabulated, and basic statistical information (average, deviation, grade distribution, etc.) on group grades and answer choices is generated. Tables 1 and 2 show the scores earned by students' answers given in our class for the questions in Figures 2 and 3.
Table 1. Student Scores for the Mechanism Problem (Figure 2)

Input     Score (/14)   Credit Given
683423    14            "68" for 7, "34" for 7, "23" for 0
6834      14            "68" for 7, "34" for 7
6824      6             "68" for 7, "24" for −1
688123    6             "68" for 7, "81" for −1, "23" for 0
813467    6             "81" for −1, "34" for 7, "67" for 0
8567      0             "85" for −1, "67" for 0, min score 0
677834    14            "67" for 0, "78" for 7, "34" for 7
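To make the pair-based lookup concrete, the following Python sketch reproduces the scores in Table 1. The pair values are read off the "Credit Given" column, any pair not listed there is assumed to score 0, and the 0−14 clamp mirrors the user-controlled maximum and minimum described above. This is an illustration of the grading logic only; ANGST itself implements it with Excel lookup tables and VBA.

```python
def grade_pairs(response, rubric, max_score=14, min_score=0):
    """Split a response into consecutive two-character 'arrows' and
    look each pair up in the rubric; unknown pairs score 0."""
    pairs = [response[i:i + 2] for i in range(0, len(response), 2)]
    raw = sum(rubric.get(pair, 0) for pair in pairs)
    return max(min_score, min(max_score, raw))

# Pair values read from the "Credit Given" column of Table 1; pairs not
# listed there are assumed to score 0.
mechanism_rubric = {"68": 7, "34": 7, "78": 7, "81": -1, "24": -1, "85": -1}

for answer in ["683423", "6834", "6824", "688123", "813467", "8567", "677834"]:
    print(answer, grade_pairs(answer, mechanism_rubric))
# Reproduces the Table 1 scores: 14, 14, 6, 6, 6, 0, 14
```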
Practical Applications of ANGST

We have used versions of this grading method in midterm and final exams for 5 consecutive semesters in a sophomore organic lecture, averaging 300 students per semester. Our students gained familiarity with how to answer SRQs through training and use in CRS sessions in lectures and recitations, as well as on practice exams. Our exams initially included 4 of these questions alongside 19 standard multiple-choice questions, but we have since moved to 8 SRQs alongside 17 standard multiple-choice questions. We intend to continue this trend of using more SRQs and fewer standard multiple-choice questions.

Beyond exam grading, the analytic capability of this spreadsheet can also be applied to in-class clicker questions. When many SRQs are given in a period, the finer analysis available through this worksheet aids in the development of follow-up activities tailored to students' weaknesses, by analyzing responses after class. Advanced users familiar with the software can paste CRS "whole" answers into a prepared spreadsheet live in class, allowing for quick "real-time" analysis of student comprehension at the subanswer level; subanswers are automatically compiled into a ranked list in ANGST. Overall, this facilitates subtler midcourse corrections because subanswer data analysis is immediately available and clearly illustrates student misunderstandings that are not obvious to the naked eye or to CRS software. Feedback given to students on partial-credit answers has ranged from formal (sheets describing major errors and their scoring justification, much like the rubrics in Figures 2 and 3) to informal (explanations available in office hours or after class).

In large course sections we typically employ multiple versions of exams to stymie copying; ANGST accommodates this by deconvoluting scrambled question order during grading (a sketch of this idea follows below). In addition, ANGST is entirely flexible as to the data source for the answers and has a component for entering a score from another source (e.g., standard multiple-choice grades) for combined statistical analysis. Each ANGST instance permits facile grading of up to 25 SRQs; using multiple spreadsheet files extends this capability with essentially no limit.
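Because the exam versions randomize question order only (not answer identity, as noted in the Discussion), descrambling amounts to a position map per version. Below is a minimal Python sketch of that idea; the version maps, answer strings, and data layout are hypothetical and are not ANGST's internal format.

```python
def deconvolute(answers, version, version_maps):
    """Reorder one student's answers from their exam version's question
    order back into the master question order used by the rubric."""
    order = version_maps[version]  # order[i] = master index of the ith question on this version
    master = [None] * len(order)
    for position, master_index in enumerate(order):
        master[master_index] = answers[position]
    return master

# Hypothetical maps: version "B" presents master questions 3, 1, 2, 4
# (0-indexed 2, 0, 1, 3); version "A" is unscrambled.
version_maps = {"A": [0, 1, 2, 3], "B": [2, 0, 1, 3]}
print(deconvolute(["6834", "12534", "346", "ACB"], "B", version_maps))
# ['12534', '346', '6834', 'ACB'] -- answers lined up with the master rubric
```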
Table 2. Student Scores for the Ordering Problem (Figure 3)

Input    Score (/14)   Credit Given(a)
12534    14            Correct sequence
21543    8             Misses criteria i, iv, vii, viii
51243    6             Misses criteria ii, iii, iv, vii
21534    11            Misses criteria i, vii
43521    7             Specific half-credit sequence
5431     0             No credit for wrong sequence length

(a) Based on the criterion numbers shown in Figure 3.
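For ranked ordering questions, the rules are precedence checks of the form "X appears before Y". The Python sketch below illustrates that style of check, together with the no-credit rule for wrong-length responses seen in Table 2. The criteria and point values here are placeholders; the actual eight weighted criteria belong to the Figure 3 rubric and are not reproduced.

```python
def grade_ordering(response, criteria, points_per_criterion, expected_length):
    """Award credit for each satisfied 'X appears before Y' criterion.
    A response of the wrong length earns no credit (as in Table 2)."""
    if len(response) != expected_length:
        return 0
    position = {ch: i for i, ch in enumerate(response)}
    score = 0
    for before, after in criteria:
        if before in position and after in position and position[before] < position[after]:
            score += points_per_criterion
    return score

# Hypothetical precedence criteria and weights, for illustration only.
criteria = [("1", "2"), ("1", "5"), ("2", "5"), ("5", "3"), ("3", "4")]
print(grade_ordering("12534", criteria, 2, 5))  # correct sequence satisfies all five criteria
print(grade_ordering("5431", criteria, 2, 5))   # 0: wrong sequence length
```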
The templates for each question type are specific and dynamic. Upon user modification, the entire spreadsheet automatically updates grading totals and retabulates the course-wide statistical information. For example, if more partial-credit answers are discovered or developed by the grader (as has happened to the authors on a first pass through an exam's student-generated answers!), the entire sheet and grading system respond accordingly. This eliminates the need for regrading entire questions or exams. We even note that, in our office hours postexam, students have shown reasoning that we had not expected, causing us to rethink our grading scheme. A simple update of the rubric instantly regrades the entire class.
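Because scores are recomputed from the rubric rather than stored, adding a newly discovered partial-credit subanswer regrades every saved response automatically. A toy Python illustration of that behavior, with hypothetical student IDs and a trimmed-down rubric:

```python
def grade_pairs(response, rubric, max_score=14, min_score=0):
    # Same pair-lookup logic as the earlier sketch, repeated for self-containment.
    pairs = [response[i:i + 2] for i in range(0, len(response), 2)]
    return max(min_score, min(max_score, sum(rubric.get(p, 0) for p in pairs)))

responses = {"student_01": "6834", "student_02": "683423", "student_03": "7834"}
rubric = {"68": 7, "34": 7, "81": -1}

print({s: grade_pairs(r, rubric) for s, r in responses.items()})
# {'student_01': 14, 'student_02': 14, 'student_03': 7}

rubric["78"] = 7  # a partial-credit arrow discovered during office hours
print({s: grade_pairs(r, rubric) for s, r in responses.items()})
# {'student_01': 14, 'student_02': 14, 'student_03': 14} -- whole class regraded
```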
Subanswer Analysis for Pedagogical Insight

The evaluation of student subanswers and rule-based algorithms afforded by ANGST also enables finer-grained feedback after exams or when using CRSs. In particular, while it is easy in many grading programs for multiple-choice questions or CRSs to determine the top 5 "whole" answers, it is not easy to determine how many students gave answers that contained "68" or that placed "1" before "5". This capability allows instructors to tally common mistakes in a finer-grained manner and tease out misunderstandings to improve their teaching strategies. For example, using a sample data set (26 students, derived from real class data with identifying information redacted, included in the Supporting Information) of responses to the question shown in Figure 2, we found that 54% of the class correctly identified the arrow "34" and 50% identified "68". The spreadsheet showed, however, that 21% of the students incorporated the incorrect arrow "64". This implies that students were attempting to draw resonance structures of the double bond as if the oxygen atom lost a hydride ion and gained a positive charge. Using these data, we inferred that a significant portion of the class was still confused about the use and meaning of resonance; we thus added more examples of this concept to lecture to address their misunderstanding. A standard analysis that cannot look at subquestion components would only show that the answer "6834" was the most popular response in the class (35%). The rest of the students' answers would appear nearly random, indicating confusion but with no hint as to the exact area(s) of difficulty.
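A short Python sketch of the kind of subanswer tally described here, using made-up responses (the real 26-student data set is provided in the Supporting Information): it counts both the most popular whole answers and the share of students whose answer contains each individual arrow.

```python
from collections import Counter

# Made-up responses standing in for the 26-student sample data set.
responses = ["6834", "683423", "6434", "6834", "6824", "8134"]

def pairs(response):
    return [response[i:i + 2] for i in range(0, len(response), 2)]

whole_answers = Counter(responses)
subanswers = Counter(p for r in responses for p in set(pairs(r)))

n = len(responses)
print(whole_answers.most_common(3))        # most popular complete answers
for arrow in ("68", "34", "64"):
    print(arrow, f"{100 * subanswers[arrow] / n:.0f}% of students")
```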
■ DISCUSSION AND CONCLUSION
In terms of taxonomies10 of clicker questions, SRQs and our grading approach fall under the "I understand" and "I apply" categories, that is, ones designed to test deeper understanding of material. In-classroom clicker use has been shown11 to positively affect underrepresented groups' participation and satisfaction with introductory chemistry classes; in addition, significant mental effort from answering more challenging clicker questions has been shown to positively affect exam performance.12 Given their depth, requirement of mental effort, and clicker friendliness, we thus believe that utilization of ANGST-graded SRQs will further enhance these benefits.

Other approaches to bring nuance to the grading of multiple-choice questions include partial-credit scoring for traditional multiple-choice exams13 and scoring multiple-choice answers until correct.14 While these approaches have merit compared to traditional multiple-choice question scoring, none, in our opinion, has the benefits that the combination of SRQs and ANGST provides in allowing for student-generated answers. We also note that SRQs offer a secondary benefit: the number of potential questions that SRQs provide is large, which guards against guessing and gaming. Reworking exams,15 an approach shown to have pedagogical benefit with traditional written questions but difficult to implement with traditional multiple-choice exams, should be facilitated by ANGST's minimal regrading effort. We note that, while we have focused on organic chemistry as a test case, any subject that routinely uses multiple-choice questions convertible to SRQs could use this tool, or a version of it, with, we predict, similar benefits.

We acknowledge two limitations of the current version of ANGST. First, there is no direct interface to commercial software packages. We currently either manually enter written answers or copy and paste data output from commercial package16 (e.g., Scantron, PollEverywhere, iClicker) files into our spreadsheet. We estimate that, in the former case, it takes 1 min per student to manually enter 8 answers into the spreadsheet and, in the latter case, less than 5 min for an entire exam of hundreds of students. Second, as currently configured, sheets can only automatically deconvolute exam versions that randomize question order, not answer identity.2 This could be solved by using a sheet for each version, in essence making each variation on the question into a new question with its own manually entered rubric. More sophisticated randomization could be handled algorithmically if question generation were done in a program like ChemDraw: each subcomponent of an answer would be given a "master label", and a rubric-generating program would output different versions of this master label for use in the tests themselves. We expect to address these issues in future versions of ANGST and ancillary software.

In summary, we present here "ANGST", a tool which we believe will lower grading burdens on instructors while increasing the use of pedagogically valuable questions. After construction of rules-based rubrics, this tool can award credit for instructor-defined partial knowledge. In addition, it will provide nuanced statistical analysis of student outcomes at the subanswer level. We envision this will help to reveal specific problems in understanding, give more meaningful assessment of learning, and thus enable instructors to better serve their students through targeted revisiting of these topics.
■ ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available on the ACS Publications website at DOI: 10.1021/acs.jchemed.8b01004.
ANGST spreadsheet, detailed instruction manual, sample data set, sample completed ANGST, partial-credit analyses, and sample student inputs from Figures 1 and 4 (ZIP)
■ AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].

ORCID
Philip S. Lukeman: 0000-0003-0563-5032
Dustin J. Covell: 0000-0002-9265-3919

Present Address
†Chemistry Department, Franklin & Marshall College, Lancaster, PA 17604, United States.

Notes
The authors declare no competing financial interest.
■ ACKNOWLEDGMENTS

We thank Dr. Aireen Romu and Dr. Jared Delcamp for beta-testing the spreadsheet and providing useful advice.
■ REFERENCES
(1) Campbell, M. L. Multiple-Choice Exams and Guessing: Results from a One-Year Study of General Chemistry Tests Designed To Discourage Guessing. J. Chem. Educ. 2015, 92 (7), 1194−1200.
(2) Denyer, G.; Hancock, D. Graded Multiple Choice Questions: Rewarding Understanding and Preventing Plagiarism. J. Chem. Educ. 2002, 79 (8), 961.
(3) Towns, M. H. Guide to Developing High-Quality, Reliable, and Valid Multiple-Choice Assessments. J. Chem. Educ. 2014, 91 (9), 1426−1431.
(4) Binder, B. Improved Multiple-Choice Examinations. J. Chem. Educ. 1988, 65 (5), 436.
(5) Friel, S.; Johnstone, A. Scoring Systems Which Allow for Partial Knowledge. J. Chem. Educ. 1978, 55 (11), 717.
(6) Morrison, R. W.; Caughran, J. A.; Sauers, A. L. Classroom Response Systems for Implementing Interactive Inquiry in Large Organic Chemistry Classes. J. Chem. Educ. 2014, 91 (11), 1838−1844.
(7) Straumanis, A. R.; Ruder, S. M. A Method for Writing Open-Ended Curved Arrow Notation Questions for Multiple-Choice Exams and Electronic-Response Systems. J. Chem. Educ. 2009, 86 (12), 1392.
(8) Flynn, A. B. Developing Problem-Solving Skills through Retrosynthetic Analysis and Clickers in Organic Chemistry. J. Chem. Educ. 2011, 88 (11), 1496−1500.
(9) While there are other automated grading systems integrated into commercial packages (e.g., Mastering Chemistry) and online distributed grading systems (e.g., Gradescope), none, to our knowledge, perform the tasks we describe in this paper. The spreadsheet has been tested on Windows PCs with Excel 2007/2010/2013/2016/Office 365 (offline) and on OS X/Mac with Excel 2003/2011/2019. It will not run on Mac Excel 2008, as Visual Basic for Applications was removed from that version by Microsoft. At present, it will not run in online spreadsheet software such as Office 365 (online) or Google Sheets due to limitations in VBA macro support. Although this software is undemanding compared to modern computing applications, for large classes or answer sets we recommend a minimum of 4 GB of RAM and a 1.8 GHz or higher processor. Excel is a registered trademark of Microsoft, Inc.
(10) Woelk, K. Optimizing the Use of Personal Response Devices (Clickers) in Large-Enrollment Introductory Courses. J. Chem. Educ. 2008, 85 (10), 1400.
(11) Niemeyer, E. D.; Zewail-Foote, M. Investigating the Influence of Gender on Student Perceptions of the Clicker in a Small Undergraduate General Chemistry Course. J. Chem. Educ. 2018, 95 (2), 218−223.
(12) Murphy, K. Using a Personal Response System To Map Cognitive Efficiency and Gain Insight into a Proposed Learning Progression in Preparatory Chemistry. J. Chem. Educ. 2012, 89 (10), 1229−1235.
(13) Grunert, M. L.; Raker, J. R.; Murphy, K. L.; Holme, T. A. Polytomous versus Dichotomous Scoring on Multiple-Choice Examinations: Development of a Rubric for Rating Partial Credit. J. Chem. Educ. 2013, 90 (10), 1310−1315.
(14) Slepkov, A. D.; Vreugdenhil, A. J.; Shiell, R. C. Score Increase and Partial-Credit Validity When Administering Multiple-Choice Tests Using an Answer-Until-Correct Format. J. Chem. Educ. 2016, 93 (11), 1839−1846.
(15) Risley, J. M. Reworking Exams To Teach Chemistry Content and Reinforce Student Learning. J. Chem. Educ. 2007, 84 (9), 1445.
(16) The software packages' names are registered trademarks of their respective owners.