Evaluation of Subset Norm Stability in ACS General Chemistry Exams

Jessica J. Reed,† Jeffrey R. Raker,‡,§ and Kristen L. Murphy*,†

†Department of Chemistry and Biochemistry, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin 53211, United States
‡Department of Chemistry, University of South Florida, Tampa, Florida 33620, United States
§Center for Improvement of Teaching and Research in Undergraduate STEM Education, University of South Florida, Tampa, Florida 33620, United States

Received: February 12, 2019. Revised: July 30, 2019.




ABSTRACT: The ability to assess students' content knowledge and make meaningful comparisons of student performance is an important component of instruction. ACS exams have long served as tools for standardized assessment of students' chemistry knowledge. Because these exams are designed by committees of practitioners to cover a breadth of topics in the curriculum, they may contain items covering material that an individual instructor does not address during classroom instruction and therefore chooses not to assess. For the instructor to make meaningful comparisons between his or her students and the national norm sample, the instructor needs norms that are generated upon the basis of the subset of items he or she used. The goal of this project was to investigate the effects on norm stability when items were removed from ACS General Chemistry Exams. This was achieved by monitoring the average change in percentile for students as items were removed from the exam and noting when the average change crossed a specified threshold. An exploration of subset norm stability for three commonly used ACS General Chemistry Exams is presented along with implications for research and instruction.

KEYWORDS: First-Year Undergraduate/General, Testing/Assessment, Chemical Education Research

FEATURE: Chemical Education Research



INTRODUCTION

In chemistry, faculty and instructors routinely face increasing pressure to produce assessment data that reflect how their students compare to other students at institutional and national levels.1−3 However, findings from a national survey administered by the American Chemical Society Examinations Institute (ACS-EI) suggest that even though enhanced assessment endeavors are expected, faculty and instructors may need assistance to bridge the gap between instruction and evidence-based assessment methods.4 Therefore, the ACS-EI has made efforts to translate research endeavors into pragmatic approaches for classroom assessment. For example, development of the Anchoring Concepts Content Map (ACCM) for various chemistry subdisciplines stemmed from instructional interest in analyzing students' content knowledge and from the opportunity it provides for programmatic assessment.5−8 Additionally, a tool was developed by the ACS-EI to aid instructors in conducting analyses of student performance on standardized, norm-referenced ACS General Chemistry Exams.9,10 This tool, the Exams Data Analysis Spreadsheet (EDAS), allows instructors to perform customized analyses of performance data by selecting which exam items to include in the analyses. The EDAS then generates new norms and class performance comparisons upon the basis of the subset of items the user has selected for analysis. One caveat of the EDAS tool is that it does not explore the stability of the norms generated when using a subset of exam items. Although an instructor may be able to generate norms upon the basis of any number of items selected for analysis, this does not mean that those norms are necessarily appropriate measures for comparison of student performance.

This paper explores a pragmatic approach for investigating subset norm stability in frequently used ACS General Chemistry Exams. The goal is to provide users of ACS exams with insights as to how the removal of exam items affects the stability of the norms generated from a subset of test items. These insights are expected to help users make informed decisions and exercise caution when conducting performance analyses with subsets of ACS exam items, as well as consider more broadly how item removal can impact interpretation of student performance on an assessment.

ACS Exams

ACS exams are standardized, multiple-choice chemistry exams designed for summative assessment use across the high school to upper-level undergraduate curricula. Committees of chemistry faculty and practitioners from across the United States develop ACS exams to reflect the content coverage and expectations in their own classrooms.11 The ACS-EI does not impose exam specifications other than that the content should be appropriate for the course level and student population being tested. In this sense, exam development committees are free to generate exams that represent what is being taught across a variety of institutions and course demographics. Items are trial-tested and analyzed upon the basis of performance metrics before being selected for inclusion on the final version of the exam available for widespread use. This development process has ensured that ACS exams have served as quality measures of students' chemistry knowledge for over 85 years. Because ACS exams are secure and copyright protected, we cannot show specific items or information that would reveal the protected content of any test.

ACS exams are norm-referenced exams, meaning that a student's performance can be compared with that of a hypothetical average student upon the basis of the collective score results of a group of students who have already taken the test. The ACS-EI relies on instructors to submit individual student scores to generate norm data for an exam, including the exam average and percentiles. Users of ACS exams are then able to compare their students' performance to a national sample and receive percentile rank data for individual total scores.

Subset Norms

The idea of generating exam norms based upon a subset of items has been explored in relation to the customized testing movement of the 1980s and early 1990s in K−12 settings.12−15 These studies focused primarily on the development and analysis of norm-referenced exams based on customization of pre-existing standardized exams and teachers' own test items. In many standardized testing scenarios, individual users are not given the ability to customize the test items used, nor do they have the power to generate new norms on the basis of a subset of items. ACS exams, however, are developed by chemistry practitioners for chemistry practitioners and offer some level of autonomy in use. This means that individual instructors who elect to use an ACS exam in their course often have the opportunity to select when the exam will be administered, how scores will be used and reported, and whether any items will be excluded from a student's total score. This is not to say that the ACS-EI encourages use of partial exams or partial exam scoring; rather, it acknowledges that instructors do modify exam use to meet individual course needs. There are a variety of reasons that instructors may choose to omit items from the test. Most frequently, they did not teach a particular concept or content area in their course, so they wish to remove items related to that content from students' total scores. In other, less common scenarios, items are omitted because of time constraints for testing or an instructor's judgment about the quality of an individual question.

Another use for understanding the stability of subset norms stems from consideration of customized test development. As secure online testing environments become more accessible for individual course assessment, customized standardized tests may become feasible. In this scenario, the ability to evaluate the norm stability of subsets of exam items may allow for the creation of norm-referenced exams at the individual course level. Finally, the process described here can be used for classroom assessment purposes as well. When instructors collaborate on writing common final exams for courses, it is possible that the use of these tests can align with the use of ACS exams. Instructors may find a need to obtain an overall score as well as a subscore, with the additional need to determine scoring relative to the total sample (akin to norm scoring).

METHODS

To investigate the effect of removing assessment items on the stability of exam norms, three frequently used ACS General Chemistry Exams were studied. The exams used in this analysis had content coverage consistent with a full-year general chemistry course (GC13),16 a first-semester course (GC12F),17 and a second-semester course (GC10S).18 Each exam consisted of 70 multiple-choice questions. These exams were selected because they represent the most frequently used exam types in the General Chemistry exam series offered by the ACS-EI and because they had large numbers of complete student responses.

Unlike most standardized testing scenarios, submission of student performance data to the ACS-EI for norm generation is voluntary. Therefore, the ACS-EI may receive only total score data from instructors, without students' individual item responses. While this is not problematic for norm generation, it can make other research endeavors in which student response data are necessary more challenging. In this study, individual item responses were necessary for understanding how a student's total score and percentile changed as items were removed from the exam. Therefore, only observations that included individual item responses were used in the data sets for this study. Additionally, these ACS exams are offered in two forms in which the content is identical but the order of items differs. For this analysis, student response data were converted to binary values and aggregated across exam forms to create one master norm data set for each exam studied.
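To make the scoring and aggregation step concrete, a minimal sketch is shown below, assuming letter responses, an invented answer key, and an invented form-to-form item mapping (ANSWER_KEY, FORM2_TO_COMMON, and score_binary are all hypothetical; actual ACS keys and item orderings are secure and are not reproduced here). The authors' processing was carried out in Stata and Excel; this Python fragment only illustrates the bookkeeping, not the Institute's pipeline.

```python
import pandas as pd

# Hypothetical answer keys for the two forms of one exam (toy 5-item example);
# real ACS keys are secure and are not shown here.
ANSWER_KEY = {
    "form1": ["A", "C", "D", "B", "A"],   # items 1..5 of form 1
    "form2": ["C", "A", "B", "D", "A"],   # same items, reordered on form 2
}
# Map each form-2 position back to the common (form-1) item numbering (assumed).
FORM2_TO_COMMON = [2, 1, 4, 3, 5]

def score_binary(responses: list[str], form: str) -> dict[str, int]:
    """Convert one student's letter responses to 0/1 scores keyed by common item number."""
    key = ANSWER_KEY[form]
    scored = {}
    for pos, (resp, correct) in enumerate(zip(responses, key)):
        common_item = pos + 1 if form == "form1" else FORM2_TO_COMMON[pos]
        scored[f"item{common_item}"] = int(resp == correct)
    return scored

# Toy response records from both forms; in practice these would be the
# complete-response observations submitted to the ACS-EI.
records = [
    {"student": "s1", "form": "form1", "responses": ["A", "C", "D", "B", "B"]},
    {"student": "s2", "form": "form2", "responses": ["C", "A", "B", "A", "A"]},
]

rows = [{"student": r["student"], **score_binary(r["responses"], r["form"])} for r in records]
master = pd.DataFrame(rows).set_index("student").sort_index(axis=1)
master["total"] = master.sum(axis=1)   # total score on the common item numbering
print(master)
```

Stacking scored records from both forms onto the common item numbering in this way yields a single master norm data set of the kind described above.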

Sample Generation

Once the norm data were aggregated, a sample class was selected from the norm set. The purpose of using a sample of the norm set was to be able to draw conclusions upon the basis of real student response observations and to compare changes in individual students' percentile ranks to newly generated norms as items were removed from the exam. Each sample class represented approximately 6 to 7% of the aggregated norm set for each exam, and equal numbers of students were selected randomly from each decile. Although the number of student observations selected for sample class analysis was arbitrary, we attempted to keep the number of students in line with typical general chemistry course enrollments while understanding that some instructors may have many more or far fewer students in their courses. The student observations selected for the sample class were then removed from the original norm set for the remainder of the analysis process. The total number of observations with complete responses as well as the sample class size and remaining norm set observations for each exam are shown in Table 1.

Table 1. Number of Observations for Each Exam Analyzed

Exam                   Total Number of Complete Observations   Sample Class Size   Remaining Norm Set Observations
GC13 (full year)       1535                                    100                 1435
GC12F (first term)     6850                                    500                 6350
GC10S (second term)    2237                                    160                 2077

Within each sample class, student observations were categorized on the basis of performance. Original score observations were divided into three groups to represent low (bottom third), middle (middle third), and high performing (top third) students. The distribution of student observations across performance groups for the sample class of each exam studied is shown in Table 2.

Table 2. Sample Class Performance Observations Across Exam Types

Exam                   Low Performers   Middle Performers   High Performers   Total
GC13 (full year)       34               31                  35                100
GC12F (first term)     150              161                 189               500
GC10S (second term)    56               56                  48                160

Figure 1. Example of subset norm generation for an exam in which three specific items have been removed. Percentile rankings are generated upon the basis of the adjusted original norm set from which the sample class has been removed. For brevity, not all score and percentile values are shown.

Assigning student score observations to performance groups allowed for analysis of the effects of removing items on the basis of item difficulty. It was posited that removing hard or easy items from the exam would affect changes in percentiles differently across the various performance groups.
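A purely illustrative sketch of this sampling scheme is given below; it is not the authors' code, the function names (draw_sample_class, assign_performance_groups) are hypothetical, and the scores are simulated. It draws equal numbers of students from each total-score decile and then splits the resulting sample class into thirds by score.

```python
import numpy as np
import pandas as pd

def draw_sample_class(norms: pd.DataFrame, per_decile: int, seed: int = 0) -> pd.DataFrame:
    """Randomly select `per_decile` students from each decile of total score."""
    rng = np.random.default_rng(seed)
    deciles = pd.qcut(norms["total"], q=10, labels=False, duplicates="drop")
    chosen = []
    for d in sorted(deciles.unique()):
        idx = norms.index[(deciles == d).to_numpy()]
        chosen.extend(rng.choice(idx, size=min(per_decile, len(idx)), replace=False))
    return norms.loc[chosen]

def assign_performance_groups(sample: pd.DataFrame) -> pd.Series:
    """Label each student low/middle/high by tercile of original total score."""
    order = sample["total"].rank(method="first")   # unique ranks avoid tied bin edges
    return pd.qcut(order, q=3, labels=["low", "middle", "high"])

# Toy data: 1000 simulated total scores on a 70-item exam (not real ACS data).
rng = np.random.default_rng(1)
norms = pd.DataFrame({"total": rng.binomial(70, 0.6, size=1000)})

sample_class = draw_sample_class(norms, per_decile=10)        # roughly 100 students
groups = assign_performance_groups(sample_class)
remaining_norms = norms.drop(index=sample_class.index)        # adjusted norm set
print(groups.value_counts())
print(len(remaining_norms), "observations remain in the adjusted norm set")
```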

Subset Norm Generation and Analysis

Generation of subset norms and analysis of the complete norm set were performed using the software program STATA.19,20 The aggregated binary data were imported into STATA, and an automated script (see the Supporting Information) was used to generate norms and perform the analyses as specific items were removed. For the analysis of the overall norm set, as items were removed from the test, each student received a new total score and percentile rank. This new percentile was compared to the student's original percentile rank, and an absolute difference was tabulated. An overall average absolute change in percentile was then calculated and reported automatically within the STATA program.

Analysis of each individual sample class was conducted manually in Excel. The STATA script was used to generate subset norms for an adjusted norm sample, which consisted of the original norm data set minus the observations in the sample class data set. The norms generated in STATA were then imported into an Excel document, which used a lookup table function to identify the sample class students' new percentiles upon the basis of their total scores after items were removed. The average absolute changes in percentiles for the entire sample class and for the performance subgroups were reported.

An example of the type of output generated when subset norms were created is shown in Figure 1. This example shows the shift in percentiles for a 70-item exam in which 3 items have been removed. The new percentile rankings are now assigned upon the basis of 67 total items. For brevity, only the percentile rankings for scores at the top end of the scale are shown. As shown in this example, a total score of 60 points was initially associated with performance at the 95th percentile on the 70-item exam. When the exam is normed upon the basis of 67 items, the 60-point score now represents performance at the 98th percentile. Additionally, Figure 2 provides an example of what this process looks like at the individual student observation level.

Figure 2. Example comparison of changes in students’ percentile rankings as items are removed from the exam.

This example explores how three high-performing students' percentile ranks change as specific items are removed from the exam. For example, Student 1 had an original score of 62 points (97th percentile). After the removal of three items from the exam, Student 1 has a new total score of 61 points. From Figure 1, a score of 61 points now corresponds to the 99th percentile on an exam in which those same three items have been removed. Therefore, the absolute change in Student 1's percentile rank is 2. This process was iterated for each student after an item was removed from the exam, and an overall average absolute change in percentile for the sample class was tabulated after each subsequent item was removed.
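The bookkeeping behind these comparisons can be sketched compactly. The authors implemented it with a Stata script plus an Excel lookup table (see the Supporting Information); the Python below is only an approximation, using simulated data, a hypothetical function name (average_absolute_percentile_change), and one common percentile convention (percent of norm-set scores falling below a given score), which may differ in detail from the ACS-EI norming procedure.

```python
import numpy as np
import pandas as pd

def percentile_rank(norm_scores: np.ndarray, score: float) -> float:
    """Percent of norm-group scores falling below `score` (one common convention)."""
    return 100.0 * np.mean(norm_scores < score)

def average_absolute_percentile_change(items: pd.DataFrame,
                                        norm_items: pd.DataFrame,
                                        removed: list[str]) -> float:
    """
    items:      0/1 item matrix for the class being analyzed (rows = students).
    norm_items: 0/1 item matrix for the adjusted norm set (sample class excluded).
    removed:    item columns dropped before re-norming.
    Returns the mean |new percentile - original percentile| across students.
    """
    kept = [c for c in items.columns if c not in removed]

    orig_norm_totals = norm_items.sum(axis=1).to_numpy()
    new_norm_totals = norm_items[kept].sum(axis=1).to_numpy()

    orig = items.sum(axis=1).apply(lambda s: percentile_rank(orig_norm_totals, s))
    new = items[kept].sum(axis=1).apply(lambda s: percentile_rank(new_norm_totals, s))
    return float((new - orig).abs().mean())

# Toy data: 70 items, simulated binary responses (not real ACS exam data).
rng = np.random.default_rng(0)
cols = [f"item{i}" for i in range(1, 71)]
norm_items = pd.DataFrame(rng.integers(0, 2, size=(1000, 70)), columns=cols)
class_items = pd.DataFrame(rng.integers(0, 2, size=(100, 70)), columns=cols)

# Remove three items (as in the Figure 1/Figure 2 example) and report the shift.
change = average_absolute_percentile_change(class_items, norm_items, ["item3", "item17", "item42"])
print(f"average |percentile change| after removing 3 items: {change:.2f}")
```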

Item Removal

To investigate norm stability, items were removed from the exams at random, by content area, and by item difficulty. Generally, in each removal scenario, no more than 20 items were removed at a time because it was anticipated that removal beyond 20 items would significantly compromise the validity of the assessment. One exception occurred during content removal on the GC10S exam, where more than 20 items were associated with a single Big Idea (Equilibrium). Random removal of 20 items was repeated five times for each exam. Although it is not anticipated that instructors would choose to remove items from a test at random, studying the effects of random removal on norm stability helps establish whether the stability is linked to an additional factor such as item difficulty or content. The 20 easiest and 20 hardest items (based upon sample class performance) were removed one at a time in succession. Additionally, items were removed on the basis of the major content area assessed by the item. The items' content had previously been reviewed and aligned to the ACS Anchoring Concepts Content Map (ACCM) for General Chemistry.5,21−24 The ACCM has 10 Anchoring Concepts, or Big Ideas, corresponding to content such as atoms, bonding, and equilibrium. Subset norms were generated by removing items associated with one Big Idea at a time. Items were removed one at a time in numerical order within a Big Idea, and new norms were generated upon the basis of the remaining exam items. Generation of subset norms based upon removal of items by content area was done to investigate norm stability when removing all items associated with a specific content area.
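The three removal scenarios amount to different orderings over the item columns. A sketch under assumptions is shown below: the data are simulated, the item-to-Big-Idea alignment is invented (the real alignment is to the secure, ACCM-coded items), and the function names are hypothetical rather than taken from the authors' script.

```python
import numpy as np
import pandas as pd

def removal_order_random(items: pd.DataFrame, n: int, seed: int) -> list[str]:
    """One random ordering of `n` items to drop, one at a time."""
    rng = np.random.default_rng(seed)
    return list(rng.choice(items.columns, size=n, replace=False))

def removal_order_by_difficulty(items: pd.DataFrame, n: int, easiest: bool) -> list[str]:
    """Order items by difficulty index (fraction correct); easiest-first or hardest-first."""
    difficulty = items.mean(axis=0)              # fraction of students answering correctly
    ordered = difficulty.sort_values(ascending=not easiest)
    return list(ordered.index[:n])

def removal_order_by_big_idea(item_to_big_idea: dict[str, str], big_idea: str) -> list[str]:
    """All items aligned to one Big Idea, in numerical item order."""
    return sorted((it for it, bi in item_to_big_idea.items() if bi == big_idea),
                  key=lambda it: int(it.removeprefix("item")))

# Toy setup: 70 simulated items and an assumed (not actual) content alignment.
rng = np.random.default_rng(0)
cols = [f"item{i}" for i in range(1, 71)]
sample_items = pd.DataFrame(rng.integers(0, 2, size=(100, 70)), columns=cols)
alignment = {c: rng.choice(["Atoms", "Bonding", "Equilibrium"]) for c in cols}

orders = {
    "random_rep1": removal_order_random(sample_items, n=20, seed=1),
    "hardest20": removal_order_by_difficulty(sample_items, n=20, easiest=False),
    "easiest20": removal_order_by_difficulty(sample_items, n=20, easiest=True),
    "Equilibrium": removal_order_by_big_idea(alignment, "Equilibrium"),
}
for name, order in orders.items():
    print(name, order[:5], "...")
```

Each ordering would then be fed, one removed item at a time, into the percentile-change calculation sketched above.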



RESULTS AND DISCUSSION

Analysis began by evaluating the effects on average absolute percentile change for student performance associated with the original norm set in its entirety for the GC13, GC12F, and GC10S exams. The effects on average absolute percentile change for the individual sample classes were then investigated. For ease of comparison, the analyses are grouped upon the basis of the sample analyzed (original norm set or sample class) and the type of item removal investigated (random, content, or item difficulty) and are shown separately for all three tests. Upon review, the authors decided that an average absolute percentile change of 2.5 or less was likely consistent with general norm stability. This means that a user could likely remove up to the number of items that caused an average absolute percentile change of 2.5 without great concern that the generated subset norms had shifted so significantly from the original norms as to diminish meaningful comparison. Although the average absolute change value of 2.5 percentiles is arbitrary, it was selected because it provides a conservative estimate of norm stability and likely encompasses grade cutoffs of ±5 percentiles (where this could impact grade decisions).

Random Item Removal

The effects of the random removal of items from each of the three exams on percentile change for student observations in the original norm set are shown in Figure 3. A similar trend is observed across all three exam data sets. The 2.5 percentile cutoff is reached at approximately 10 items for the GC13 exam, 9 items for the GC12F exam, and 6 items for the GC10S exam. Again, these are semiconservative estimates of norm stability, but they suggest that an instructor could realistically remove up to that number of items from the exam before the average absolute change in percentile would become large enough to invalidate meaningful grade comparisons.

Figure 3. Effects of repeated removal of 20 items at random from the GC13, GC12F, and GC10S exams on the norm stability for the student observations in the original norm set.

Random Item Removal by Sample Class

The effects on students' average percentile changes in the respective sample classes when items are removed at random from each of the three exams are shown in Figure 4. A similar trend is observed across all three sample class data sets, with the GC13 sample class tending to have the smallest average absolute percentile change as the number of items removed increased. The 2.5 percentile cutoff is reached at approximately 10 items for the GC13 exam, 7 items for the GC12F exam, and 6 items for the GC10S exam. These values are consistent with the effects on percentile change in the original norm set when similar numbers of items were removed. This suggests that norm stability can be achieved even with a smaller sample of student observations and is likely due to the quality of the items themselves.

Figure 4. Effects of repeated removal of 20 items at random from the GC13, GC12F, and GC10S exams on the norm stability for the student observations in the respective sample classes.

Item Removal by Content

Exploration of removing exam items on the basis of content area (Big Idea) provided insights about norm stability for observations in the original norm set (Figure 5) and the sample class (Figure 6) for each exam. The number of items associated with a content area varies by exam type because the exams test content associated with different general chemistry courses. In several instances, removal of an entire Big Idea had little effect on the average absolute change in percentile. These scenarios involved content areas with few associated items, such as "Atoms" on the GC10S exam or "Bonding" on the GC13 exam. In general, removal of items by content area had a larger effect on the GC10S data sets than on the data sets associated with the other two exams. In particular, removal of the items associated with "Thermodynamics" and "Equilibrium" from GC10S resulted in relatively large average absolute changes in percentiles compared with the changes after the removal of the same content from the GC13 and GC12F exams. Although we did not specifically investigate why the removal of these content areas from the GC10S exam causes such a shift in percentile change, it is possible that, because these items constitute nearly half of the exam, item difficulty effects are being observed as well (i.e., the most difficult items on the test may be associated with these content areas). When considering average item difficulties for these areas, the GC10S averages for thermodynamics and equilibrium were 0.498 and 0.506, respectively, compared with 0.540 and 0.545 for the other tests.

Figure 5. Effects of item removal by content area (Big Idea) from the GC13, GC12F, and GC10S exams on the norm stability for the student observations in the original norm set.

Figure 6. Effects of item removal by content area (Big Idea) from the GC13, GC12F, and GC10S exams on the norm stability for the student observations in the respective sample class.

As shown in Figure 6, the smaller data sets associated with the sample classes followed trends similar to those in the original norm data sets as items were removed by content. In general, removal of an entire content area (if it contained only a few items) or of approximately five to seven items from a Big Idea did not cause an unreasonable shift in percentile values and often remained within the suggested 2.5 absolute percentile change cutoff. As instructors use the ACCM to learn more about student performance related to specific content, the observed norm stability supports removal of a Big Idea or two from the exam, but a test should not be normed on the basis of the inclusion of only one or two content areas.

Item Removal by Difficulty (Sample Class)

Removal by item difficulty (the fraction of students who answered the item correctly) was examined for each sample class associated with an exam. It was posited that removing items on the basis of item difficulty would have varied effects on performance subgroups, and thus instructors may wish to use caution when removing items solely on the basis of difficulty. For example, Figure 7 shows the average absolute change in percentile for the sample class as the most difficult and easiest items are removed from GC13. As the most difficult items are removed, there is greater variation in the average absolute change in percentile for students in the middle and high performing groups. This is probably because students in these groups are more likely to have answered the most difficult items correctly compared with students in the low performing group, so removing those items causes more score fluctuation for students in the middle to high performing groups. For removal of the easiest items (those with the highest fraction correct), there is greater score fluctuation for students in the low to middle performance groups because there is varying probability that these students answered the items correctly. Removal of the easiest items has little effect on high performing students, because those students most likely answered most or all of those items correctly. For the GC13 sample class, on average, the nine most difficult items could be removed before an average absolute change in percentile of 2.5 was observed. However, for students in the middle performance group, that threshold was reached before five of the most difficult items had been removed.

Figure 7. Comparison of the removal of the most difficult and easiest items from GC13 in terms of the subset norm stability for the sample class of 100 students.

Greater fluctuation was observed for the GC12F sample class as the most difficult and easiest items were removed. However, the same trends across performance groups were observed as for the GC13 exam, as shown in Figure 8. When the 20 hardest items were removed from GC12F, the average absolute change in percentile was approximately 12, compared with an average absolute change of approximately 6 percentiles when the 20 easiest items were removed.

Figure 8. Comparison of the removal of the most difficult and easiest items from GC12F in terms of subset norm stability for the sample class of 500 students.

Figure 9 shows the trends across performance subgroups as the most difficult and easiest items were removed from GC10S. The average absolute changes in percentile values were more in line with those observed for the GC13 sample class (Figure 7). In general, the nine most difficult items and the seven easiest items could be removed, respectively, before the 2.5 percentile change threshold was reached.

Figure 9. Comparison of the removal of the most difficult and easiest items from GC10S in terms of subset norm stability for the sample class of 160 students.
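For readers who want to reproduce the threshold bookkeeping behind statements such as "the 2.5 percentile cutoff is reached at approximately 10 items," a small helper along the following lines may be useful; the function name is hypothetical and the curve shown is made up for illustration, not taken from the exams studied here.

```python
def items_within_threshold(avg_abs_changes: list[float], threshold: float = 2.5) -> int:
    """
    avg_abs_changes[k-1] is the average |percentile change| after removing k items.
    Returns the largest k whose change stays at or below `threshold` before it is
    first exceeded (0 if even one removal exceeds it).
    """
    for k, change in enumerate(avg_abs_changes, start=1):
        if change > threshold:
            return k - 1
    return len(avg_abs_changes)

# Illustrative (made-up) curve of average changes as 1..20 items are removed.
example_curve = [0.4, 0.7, 1.0, 1.3, 1.6, 1.8, 2.0, 2.2, 2.4, 2.5,
                 2.7, 3.0, 3.3, 3.7, 4.1, 4.6, 5.0, 5.5, 6.0, 6.6]
print(items_within_threshold(example_curve))   # prints 10 for this made-up curve
```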



CONCLUSIONS AND IMPLICATIONS FOR INSTRUCTION

This paper highlights pragmatic approaches to understanding subset norm stability for several ACS General Chemistry Exams. Analyses were performed to investigate how observed student performance, in the form of assigned percentile, shifted as items were removed at random, by content, and by item difficulty from the GC13, GC12F, and GC10S exams. Subset norm stability was first explored within the realm of a larger data set (the original norm data set) and then by using smaller sample classes drawn from the norm set. The findings suggest that, in general, users of these ACS exams can anticipate that removal of approximately 10% of the items on the test will result in an average change in percentile of no more than 2.5 percentiles. Instructors wishing to remove more than approximately 10% of the items from an exam should use caution when making comparisons to the original norms, as students' percentile ranks have likely shifted into a range with possible grade implications. Additionally, in terms of content removal, instructors can anticipate reasonable changes in students' percentile ranks as items within a content area or Big Idea are removed. However, norming a test upon the basis of a single Big Idea is not recommended, as there are not enough items to generate valid and stable subset norms. When removing items on the basis of difficulty, users should consider how removal may affect performance subgroups compared with the entire class. Of course, it is important to remember that ACS exams are designed to be used in their entirety, as this is also how the norm set is generated; as such, the generation and use of subset norms should be done with careful attention.

These findings have practical implications for classroom instruction and assessment. Instructors who use ACS exams but are concerned that the exam content encompasses more than what is covered in classroom instruction can consider careful removal of items from the exam while maintaining the integrity of the student percentile rank data. Additionally, this could lessen the barrier to adoption of ACS exams by instructors whose course content may vary slightly from that assessed on the ACS exam. The ability to provide norm-referenced exam performance data based upon an instructor-selected set of standardized exam items allows for autonomy of content instruction while retaining the possibility of making meaningful student performance comparisons and characterizations.

These findings also have application toward future endeavors of the ACS-EI. The ability to generate valid and reliable exam norms upon the basis of subsets of items creates the potential for customized norm-referenced assessments. ACS exams are already available for use in a secure online testing platform, so the potential to create customized course assessments upon the basis of instructor-selected item subsets may be realized as more research is done to verify norm stability and validity as well as the degree of item independence. Additionally, the ACS-EI may be able to offer additional products or services to users on the basis of these findings. Future development may lead to the ability to generate customized norm reports for instructors and an interactive online system for instructors to generate subset norms and view score reports.



Limitations

This study is not without limitations. One limitation stems from the use of only three GC exams in this analysis. It was not feasible to analyze all of the exams in the General Chemistry series because of the time involved in manually generating and analyzing each set of subset norms. Therefore, the current recommendations should be applied only to the exams studied herein or, at most, to tests within the same General Chemistry exam series. We anticipate that future efforts to automate the analysis process will streamline the ability to generate subset norm stability reports for a broader array of ACS exams. A final limitation stems from the necessity of using individual item response data, which required removing student observations that contained only a total score. Although there were certainly adequate numbers of complete responses to conduct the analyses, it is unknown how the results may have varied had all of the reported data for each exam contained complete responses.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available on the ACS Publications website at DOI: 10.1021/acs.jchemed.9b00125.

Annotated text file of the programming code (PDF)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Jessica J. Reed: 0000-0003-4791-6094
Jeffrey R. Raker: 0000-0003-3715-6095
Kristen L. Murphy: 0000-0002-7211-300X

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The authors wish to thank users of ACS Exams who have submitted student performance data for research and analysis purposes.

REFERENCES

(1) Towns, M. H. Developing Learning Objectives and Assessment Plans at a Variety of Institutions: Examples and Case Studies. J. Chem. Educ. 2010, 87 (1), 91−96.
(2) Pienta, N. J. Striking a Balance with Assessment. J. Chem. Educ. 2011, 88 (9), 1199−1200.
(3) Bretz, S. L. Navigating the Landscape of Assessment. J. Chem. Educ. 2012, 89 (6), 689−691.
(4) Emenike, M. E.; Schroeder, J.; Murphy, K. L.; Holme, T. A. Results from a National Needs Assessment Survey: A Snapshot of Assessment Efforts within Chemistry Faculty Departments. J. Chem. Educ. 2013, 90 (5), 561−567.
(5) Holme, T.; Murphy, K. The ACS Exams Institute Undergraduate Chemistry Anchoring Concepts Content Map I: General Chemistry. J. Chem. Educ. 2012, 89 (6), 721−723.
(6) Raker, J.; Holme, T.; Murphy, K. The ACS Exams Institute Undergraduate Chemistry Anchoring Concepts Content Map II: Organic Chemistry. J. Chem. Educ. 2013, 90 (11), 1443−1445.
(7) Marek, K. A.; Raker, J. R.; Holme, T. A.; Murphy, K. L. The ACS Exams Institute Undergraduate Chemistry Anchoring Concepts Content Map III: Inorganic Chemistry. J. Chem. Educ. 2018, 95 (2), 233−237.
(8) Holme, T. A.; Reed, J. J.; Raker, J. R.; Murphy, K. L. The ACS Exams Institute Undergraduate Chemistry Anchoring Concepts Content Map IV: Physical Chemistry. J. Chem. Educ. 2018, 95 (2), 238−241.
(9) Brandriet, A.; Holme, T. Development of the Exams Data Analysis Spreadsheet as a Tool To Help Instructors Conduct Customizable Analyses of Student ACS Exam Data. J. Chem. Educ. 2015, 92 (12), 2054−2061.
(10) Brandriet, A.; Holme, T. Methods for Addressing Missing Data with Applications from ACS Exams. J. Chem. Educ. 2015, 92 (12), 2045−2053.
(11) Holme, T. A. Assessment and Quality Control in Chemistry Education. J. Chem. Educ. 2003, 80 (6), 594−596.
(12) Forsyth, R. A.; Ansley, T. N.; Twing, J. S. The Validity of Normative Data Provided for Customized Tests: Two Perspectives. Appl. Meas. Educ. 1992, 5 (1), 49−62.
(13) Forsyth, R. A.; Twing, J. S.; Ansley, T. N. Three Applications of Customized Testing in Local School Districts. Appl. Meas. Educ. 1992, 5 (2), 111−122.
(14) Way, W. D.; Forsyth, R. A.; Ansley, T. N. IRT Ability Estimates From Customized Achievement Tests Without Representative Content Sampling. Appl. Meas. Educ. 1989, 2 (1), 15−35.
(15) Allen, N. L.; Ansley, T. N.; Forsyth, R. A. The Effect of Deleting Content-Related Items on IRT Ability Estimates. Educ. Psychol. Meas. 1987, 47 (4), 1141−1152.
(16) ACS Division of Chemical Education Examinations Institute. ACS General Chemistry Examination; American Chemical Society: Ames, IA, 2013.
(17) ACS Division of Chemical Education Examinations Institute. ACS First Term General Chemistry Examination; American Chemical Society: Ames, IA, 2012.
(18) ACS Division of Chemical Education Examinations Institute. ACS Second Term General Chemistry Examination; American Chemical Society: Ames, IA, 2010.
(19) Stata Statistical Software, Release 14.2; StataCorp: College Station, TX, 2016.
(20) StataCorp, LLC. Stata Base Reference Manual, Release 15; Stata Press: College Station, TX, 2017.


(21) Luxford, C. J.; Linenberger, K. J.; Raker, J. R.; Baluyut, J. Y.; Reed, J. J.; De Silva, C.; Holme, T. A. Building a Database for the Historical Analysis of the General Chemistry Curriculum Using ACS General Chemistry Exams as Artifacts. J. Chem. Educ. 2015, 92 (2), 230−236.
(22) Luxford, C. J.; Holme, T. A. What do Conceptual Holes in Assessment Say about the Topics We Teach in General Chemistry? J. Chem. Educ. 2015, 92 (6), 993−1002.
(23) Reed, J. J.; Villafañe, S. M.; Raker, J. R.; Holme, T. A.; Murphy, K. L. What We Don't Test: What an Analysis of Unreleased ACS Exam Items Reveals about Content Coverage in General Chemistry Assessments. J. Chem. Educ. 2017, 94 (4), 418−428.
(24) Reed, J. J.; Luxford, C. J.; Holme, T. A.; Raker, J. R.; Murphy, K. L. Using the ACS Anchoring Concepts Content Map (ACCM) To Aid in the Evaluation and Development of ACS General Chemistry Exam Items. ACS Symp. Ser. 2016, 1235, 179−194.
