Article pubs.acs.org/jchemeduc
Examining Evidence for External and Consequential Validity of the First Term General Chemistry Exam from the ACS Examinations Institute

Scott E. Lewis*
Department of Chemistry, University of South Florida, Tampa, Florida 33620, United States

ABSTRACT: Validity of educational research instruments and student assessments has appropriately become a growing interest in the chemistry education research community. Of particular concern is an attention to the consequences to students that result from the interpretation of assessment scores and whether those consequences are swayed by invalidity within an assessment. This study examines external and consequential validity of a first-term general chemistry exam from the ACS Examinations Institute. The measure was used as a final exam in the intended course at the research setting. As a result, student performance on the exam contributed to the grading decision in the course. Owing to the prerequisite relationship, this grading decision then informed the decision to permit student enrollment in second-semester general chemistry. This study evaluates the appropriateness of the first-term general chemistry exam to inform this decision. Results indicate that the exam offers meaningful information into students’ performance in the follow-on course with a consistent decision process across student subgroups.

KEYWORDS: First-Year Undergraduate/General, Chemical Education Research, Testing/Assessment, Minorities in Chemistry, Women in Chemistry

FEATURE: Chemical Education Research
Validity of educational research instruments and student assessments has appropriately become a growing interest in the chemistry education research community.1,2 In terms of the validity of student assessments, research has focused on exams created by the American Chemical Society (ACS) Examinations Institute, likely owing to their national presence. Recent examples include a detailed description of the exam construction process and investigations into the internal structure of exams from the institute.2−6 Missing from the research literature is an investigation into validity measures that speak toward the meaning and interpretation of student scores in the context of the decisions and consequences that result from their use. Such an investigation would inform the appropriate uses of the exam in a manner that other studies could not. This study is designed to investigate the external and consequential validity of an exam from the ACS Examinations Institute.

■
LITERATURE BACKGROUND

Messick introduced the unified concept of validity, termed construct validity, as a process that provides an evaluative judgment on “the adequacy and appropriateness of interpretations and actions on the basis of test scores or other modes of assessment”.7 Messick details six aspects of construct validity that are summarized as follows:

1. Content: examining the content of the measurement for the extent it is relevant and representative of the intended domain
2. Substantive: examining the processes respondents are engaged in for the extent it matches the intended processes
3. Structural: examining the scoring model used and the interrelations among subtasks
4. Generalizable: examining the boundaries of generalizability of the scores to other tasks or across time or graders
5. External: examining relations with other measures, which may be either convergent or divergent
6. Consequential: examining whether sources of invalidity impact the consequences of the measurement

No one aspect can suffice for demonstrating validity, nor do all aspects need to be addressed. The unifying aspect of validity, Messick argues, is the “trustworthy interpretability of the test scores and their action implications”. Within the sphere of student chemistry assessments, the ACS Examinations Institute is the most visible. The Examinations Institute regularly produces nationally available, secure exams designed for various levels of secondary and postsecondary chemistry courses. A committee of faculty members who regularly teach the courses that the exam targets is selected

© XXXX American Chemical Society and Division of Chemical Education, Inc.
dx.doi.org/10.1021/ed400819g | J. Chem. Educ. XXXX, XXX, XXX−XXX
Journal of Chemical Education
Article
nationally and meets to construct the exam. The process of exam construction takes advantage of both committee deliberations and trial testing of the instrument prior to setting the final version. The deliberations of the committee support the content validity by ensuring that the content is relevant and representative of the intended courses.3 Trial testing of the instrument to screen out test items that function improperly helps remove the threat of content irrelevancy. Examples of such evidence could be item statistics that show low discriminatory ability or distractors that were rarely selected and evidently implausible to the intended audience. This process promotes structural validity, as it demonstrates how the content informs the development of a rational scoring system for the test. During trial testing there is also an investigation into item bias by gender when there is sufficient data.4 Committee members are informed of items that have evidence of bias, and items with this characteristic may be removed from the released version of the exam. While the processes described support the validity of the exam in terms of content appropriateness and soundness of scoring structure, these processes are focused primarily on the accuracy and interpretation of the scores. Returning to Messick’s argument that score interpretation and actions made based on scores are both essential considerations in validity, the actions that result from the use of an ACS Exam remain to be addressed. The Examinations Institute does not prescribe uses for the exams; however, it is presumed that the exams are commonly used as a final exam in the target class, particularly because the exams are comprehensive, covering material that spans an entire course or year. Thus, student performance on the exam often contributes in a substantial fashion to determining a student’s course grade.

Student grades lead to a wide variety of consequences, but the most direct consequence is in permitting enrollment into a follow-on course through the nearly ubiquitous practice of course prerequisites. The rationale of prerequisites is that student understanding of content in the first course is necessary in order to succeed in the follow-on course. The result is that, when the exam serves as a significant contribution to grade determination in one course, the ACS Exam should offer relevant information regarding students’ preparation for the follow-on course. If it does not, then either the exam will need to be refined or the need for the prerequisite relationship will need to be evaluated. The relationship between an ACS Exam and performance in a follow-on course would fit under the aspect of external validity in this framework. External validity is defined as the relationship between the target assessment and other measures. Messick emphasizes that external validity measures that examine relationships pertinent to the decision-making process resulting from interpretation of the scores should be emphasized to evaluate the appropriateness of the use of the instrument.7 External validity can be delineated into convergent and divergent validity. In convergent validity, the target assessment is expected to correspond with another measure, and as a result there is an expectation that the scores on the two measures would be correlated. The prerequisite relationship among courses, then, is taken as an expectation of a relationship between the content of the target course and the follow-on course. Finding evidence of external, convergent validity between the exam and performance in the follow-on course can then support the use of the exam as a final exam in the target course.

The relationship also fits under the description of consequential validity within this framework. While the external validity can address the appropriateness of the exam in determining entrance into follow-on courses, consequential validity would focus on whether the properties of the exam are consistent in terms of this decision-making process. One necessary distinction regarding consequential validity is to focus on adverse actions, such as denying future enrollment, that result from sources of invalidity within a measurement, instead of adverse actions that occur from valid results of the exam.7 The former is of particular concern in this investigation, as it describes inherent problems with the instrument that lead to adverse actions. For example, item bias in an instrument may lead toward one subgroup of students having less opportunity to progress in a curriculum independent of their demonstrated abilities. Investigating sources of invalidity that lead to differential consequences is then essential to ensuring the decision-making process is both consistent and based on appropriate information. Past work in the literature has detailed relationships between ACS Exams and other measures of student knowledge within the course. These efforts are supportive of generalizable and external validity. For example, Lewis et al. used scores from the 2002 First Term ACS Exam to validate open-ended chemistry assignments that focused on the same content, finding correlations of approximately 0.2 between the ACS Exam and take-home assignments and approximately 0.5 between the ACS Exam and in-class assignments.8 Similarly, Pyburn et al. found correlations between 0.48 and 0.66 between relevant questions from the General Chemistry Conceptual ACS Exams and in-class midterm tests.9 Finally, Lewis and Lewis reported correlations above 0.5 between midterm exams and the Special First Term ACS Exam (meant to measure conceptual and algorithmic understanding).10 Together, these results show a correspondence between various ACS Exams and instruments intended to measure the same content knowledge. This provides evidence of generalizable and external validity for each exam indicated, in showing that scores correspond to performance on assessments meant to measure the same content. However, these results do not speak toward the use of this exam in determining readiness for pursuing the follow-on course. Some studies have examined the relationship between first-semester general chemistry (GC1) and second-semester general chemistry (GC2) measures, an indication of external validity, though not with the intent to validate the GC1 measure. Mitchell et al. found a correlation between two ACS Exams, the First Term General Chemistry Exam and the Second Term General Chemistry Exam, of 0.581.11 While this is supportive that the First Term Exam offers insight into GC2 performance, it does not offer insight into justifying its use as an entrance tool into GC2, as the measure of GC2 performance was a single exam and not success in the course overall (e.g., course grades). Easter investigated the relationship between a modified version of the 1998 Brief General Chemistry ACS Exam and GC2 performance, by assigning course grades in GC2 on a scale of 0 to 1, and found a correlation of 0.410.12 The modifications to the Brief General Chemistry ACS Exam involved using a subset of the exam questions. As a result, this exam differed from the released instrument and cannot speak toward the validity of the released instrument. Further, none of the above studies
examined the consequential validity aspect related to the consistency of the actions taken based on the assessment. This study seeks to address the gap in the research literature by presenting an investigation of the external and consequential validity of an ACS Exam. The research questions that guide this study are as follows:

• To what extent does a First Term ACS Exam correspond with student success in the follow-on GC2 course as determined by their course grades?
• To what extent does this exam serve as a predictor for passing the immediate follow-on course of GC2?
• How consistent is this exam in making predictions for passing GC2 across student subgroups?

■
METHODS

The exam chosen was the 2009 First Term General Chemistry Exam from the ACS Examinations Institute, herein the First Term ACS Exam.13 The First Term ACS Exam has 70 questions and is intended for use in the first semester of a two-semester general chemistry sequence. The content for this exam includes elementary conversions; chemical formulas; nomenclature; chemical reactions and equations; oxidation numbers; descriptive chemistry (solubility, acids/bases, etc.); stoichiometry; solutions (molarity and stoichiometry); thermochemistry; electron configurations and quantum number rules; ionic and covalent bonding; periodic trends; Lewis structures including resonance and formal charges; VSEPR theory; and gas laws and gas stoichiometry.14 The Examinations Institute reported approximately 48,000 copies of the exam have been sold to approximately 325 institutions.15 These numbers likely make it among the most commonly used assessments of general chemistry for postsecondary students in the United States. Researchers in chemistry education have also relied on the exam’s 2002 and 2005 predecessors as a measure of student learning, making it likely that the 2009 exam will also be used in a similar fashion.11,16−20

The research setting was a large, primarily undergraduate institution in the southeastern United States. General chemistry class sizes typically ranged from 50 to 75 students, with instruction primarily lecture or a combination of lecture with peer-led team learning.19 Both GC1 and GC2 used a common syllabus across all classes that held the course content and general grading scheme constant across the classes. The grading scheme allocated 60% of the grade to in-class tests that were written by each instructor, 20% to a common final exam (the relevant ACS Exam for each course), and 20% at the instructor’s discretion for attendance, homework, or quizzes. The corresponding lab component was conducted as a separate class with separate grading.

The course content for GC1 was Chapters 1 through 11 of a common chemistry textbook,21 which mirrored the aforementioned content coverage of the First Term ACS Exam. GC2 content coverage included intermolecular forces, kinetics, equilibrium, nuclear chemistry, acids and bases, spontaneity, and electrochemistry and covered Chapters 12, 13, 16 through 21, and 24 of the same textbook. The common final exam used was the First Term ACS Exam. The exam was administered according to directions from the Institute, with no programmable calculators and no outside materials allowed. The First Term ACS Exam features two forms, gray and yellow, with the same questions and responses reordered to prevent cheating. After the administration of the exam, to aid data analysis, student performance on the yellow form was mapped onto the ordering of the gray form.

Students’ performance by topic was also determined based on their performance on subsets of questions within the exam. Four chemistry faculty members independently assigned questions to the following topics, as determined by the course learning objectives:

• Ionic compounds (naming and formation of ionic compounds, not solutions)
• Stoichiometry (includes mass to mole relations, limiting reactant, and percent yield)
• Solutions (includes molarity and solubility)
• Gas laws
• Thermodynamics
• Atomic structure (includes electron configuration)
• Periodic trends
• Molecular shapes (includes Lewis structures and polarity)

Any discrepancies in topic assignment were discussed, and ultimately a question was only included in a topic if there was consensus across all four members. One question was assigned to multiple categories, and 15 questions were not assigned, as they did not fit any of the topics listed. Student demographic characteristics, SAT scores, and GC1 grades from the Fall 2010 semester were collected from university records. Student performance in GC2 from Spring 2011 through Summer 2013, as measured by grade received, was collected from university records. In cases in which students repeated GC2, only the first attempt at the course, chronologically, was entered into the data. The university used a five-letter grade scheme (A, B, C, D, and F) with no split grades or plus and minus grading available. There were three majors at the research setting that required GC1 (chemistry, biology, and exercise health sciences), and each also required GC2. IRB approval was obtained to conduct this study.
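The common grading scheme in the Methods (60% instructor-written tests, 20% ACS final exam, 20% instructor discretion) can be sketched as a weighted average. This is a minimal illustration only; the function and variable names are ours, not the study's.

```python
def course_grade(test_avg, acs_exam_pct, discretionary_avg):
    """Combine GC1 grade components using the common-syllabus weights:
    60% in-class tests, 20% ACS final exam, 20% instructor discretion.
    All inputs are percentages on a 0-100 scale. (Illustrative names,
    not taken from the study itself.)"""
    return 0.60 * test_avg + 0.20 * acs_exam_pct + 0.20 * discretionary_avg

# The ACS Exam raw score is the number correct out of 70, so it is
# converted to a percentage before weighting. Example: a student near
# the sample mean of 37/70 on the exam.
acs_pct = 100 * 37 / 70
print(round(course_grade(80, acs_pct, 90), 1))
```

Because the exam carries a fixed 20% weight, a low ACS Exam score can be partially offset by the other components, which is why the study examines the exam's information value for the enrollment decision rather than the course grade alone.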
■
FINDINGS

In the Fall 2010 semester, 669 students were enrolled in GC1. Of those 669 students, 3 received an incomplete, owing to extenuating circumstances, and 1 was auditing the course. These 4 were removed from the analysis. Of the remaining 665 students, 110 enrolled in GC1 again during the time frame of the study. Because these 110 students had subsequent GC1 preparation after taking the ACS Exam, they were not considered in this study. Among the remaining 555 students, ACS Exam scores were available for 443 students (79.8%). The most common reasons for missing scores were students withdrawing from the class or not attending the final exam, likely owing to a belief that a failing grade was imminent. Scores on the ACS Exam represent the number of questions correct out of 70; descriptive statistics are presented in Table 1. The descriptive statistics indicate that the average is slightly below the national norm of 37.1 and the distribution is close to normal, with a slightly positive skew.22 The Cronbach’s α was 0.861 for the ACS Exam with this sample, indicating a desirable level of internal consistency among the items of this test.

Table 1. Descriptive Statistics for First Term ACS Exam

Mean (std deviation): 36.82 (10.39)
Skewness (std error): 0.245 (0.116)
Kurtosis (std error): −0.210 (0.231)
Median: 37
Mode: 30
Sample size: 443

External Validity

Of the 555 students who enrolled in GC1 in Fall 2010 and did not reenroll in the course, subsequent enrollment in follow-on chemistry courses was tracked through the Spring 2013 semester. Within this time frame 291 students (52.4%) enrolled in GC2, and First Term ACS Exam data were available for all of these students. The most likely reason for a student enrolling in GC1 but not GC2 was unsatisfactory performance in GC1, as 167 students (30.1%) failed or withdrew from GC1. The remaining 97 students (17.5%) transferred schools, changed majors, or elected to no longer pursue formal education. To investigate the relationship between the ACS Exam and student performance in GC2, a nonparametric test of association, Kendall’s Tau-b, was performed. The nonparametric test was chosen because the measure of student performance was course grade, an ordinal-level variable that cannot satisfy the parametric assumption of normality. The data set for the analysis places a large number of students into the five possible categories for letter grades, meaning that a large number of ties will be present. As a result, Kendall’s Tau-b was chosen over Spearman’s ρ, as it is specifically designed to handle cases where ties are prevalent. To further investigate the association between the First Term ACS Exam and the subsequent course, the performance by topic, as described earlier, was also associated with GC2 performance. The results from the Kendall’s Tau-b tests are presented in Table 2.

Table 2. Associations with GC2 Grade

Variable                 Kendall’s Tau-b^a
First Term ACS Exam      0.419
Atomic structure         0.225
Ionic compounds          0.237
Stoichiometry            0.269
Solutions                0.302
Gas laws                 0.145
Thermodynamics           0.341
Shapes                   0.357
Periodic trends          0.353

^a Significant at p < 0.01; N = 291.

All of the associations presented in Table 2 are significant at p < 0.01. The interpretation of Kendall’s Tau-b is similar to the interpretation of the conventional Pearson correlation coefficient. The coefficient between the overall exam and GC2 at 0.419 indicates a notable relationship. Cohen’s criteria on effect size would place the correlation between a medium (0.3) and large (0.5) effect.23 To place this correlation in context, the value is just below the Kendall’s Tau-b observed for this sample between GC1 grade and GC2 grade (0.505) and considerably more than the observed value between Mathematics SAT and GC2 (0.152) or Verbal SAT and GC2 (0.128). The correlations by topic to GC2 grade show that each topic features a positive relationship with GC2 performance. This can be interpreted as either: there are content or skills associated with each topic that are necessary for GC2; or there is a general academic trait that underlies performance in both the GC1 topic and GC2. It also provides some support that each of the major topics in GC1 should be retained, to the extent that preparing students for GC2 is seen as an appropriate goal for GC1. One potential exception to this claim is that one correlation, gas laws at 0.145, while significant, is closer to zero. In terms of content, gas laws feature limited use in GC2 content; namely, the ideal gas law serves as the foundation for the relationship between Kc and Kp. Overall, the correlations in Table 2 support the contention that students’ knowledge of GC1, as measured by the First Term ACS Exam, is indicative of performance in the follow-on courses investigated. Disaggregating the exam by topic lends further support, as each topic featured a positive relationship with course grades. In terms of convergent validity, the First Term ACS Exam has a positive correlation with the follow-on course, thereby supporting the use of this exam as a final exam in GC1 that contributes to the decision of allowing students to enroll in follow-on chemistry courses.

Consequential Validity

As external validity focuses on the relationship between the target measure and other measures, consequential validity focuses on whether sources of invalidity within the exam impact the consequences that derive from the interpretation of the assessment scores. To examine sources of invalidity, student subgroups were considered in terms of gender and ethnicity. The term gender is used as a means to represent students’ biological sex, as it is conventionally used.4 The sample was 59.3% female and 64.6% white, 14.2% Black, and 9.7% Asian, with Hispanic, Native American, multiracial, or undeclared having less than 5% each. To provide appreciable sample sizes in each subgroup, ethnicity was categorized as underrepresented minorities in the sciences as defined by the National Science Foundation (Black, Hispanic, and Native American) or Asian and white (which also includes multiracial and undeclared).24

The first step in seeking sources of invalidity was to examine item bias within the test. Item bias occurs when individual test questions favor one student subgroup over another, after controlling for ability. To investigate item bias, differential item functioning (DIF) analyses were conducted. The Mantel−Haenszel statistic was calculated for each item, comparing student subgroups based on gender, and then in a separate analysis for ethnicity. To avoid false positives, and to keep with prior literature, the threshold of significance was set at 0.01.4 For a measure of student ability, the total score on the ACS Exam was used. Because statistical power (false negatives) was also a concern given the sample size and the lower-than-typical threshold of significance, the analyses were also conducted using students’ course grades in GC1 as a measure of ability. As course grades have only five levels of ability, compared to 70 for the total score on the ACS Exam, the sample size per level of ability is greater, thereby increasing statistical power.

The DIF analysis for gender, using total ACS Exam score as the measure of ability, indicated two questions with DIF by gender, both favoring males. Using student grades as the measure of ability indicated three questions with DIF, the aforementioned two and one additional question, also favoring males. Examining the questions showed no apparent traits that may cause the observed results, such as reliance on a gender-directed context. The DIF analysis for ethnicity, using total ACS Exam score, indicated no questions with DIF. The analysis for ethnicity with student grades indicated a single question with DIF, where underrepresented students performed worse. Out of 70 items overall, DIF was found on relatively few items,
three by gender and one by ethnicity, indicating minimal evidence of item bias.

Returning to statistical power, the effect size of the Mantel−Haenszel statistic25 using letter grades was calculated for the items that were on either side of the cutoff for statistical significance. For gender, the effect size was observed as 1.28 for the item with statistical significance and 1.16 for the item without. Using the ETS classification, the DIF analysis was able to identify statistical significance for items that were midway through Category B, described as moderate DIF.26 For ethnicity, the effect size for the cutoffs was between 1.61 and 1.85, with neither value statistically different from 1.0. This places the significance test at the top end of Category B, which means that the test was likely only able to identify instances of moderate to large DIF by ethnicity.

To investigate consequential validity, it is prudent to examine whether this minimal item bias leads to differential consequences for student subgroups, in particular because the DIF analysis could only detect moderate to large DIF by ethnicity. Returning to the consequences that hinge on the First Term ACS Exam as a final exam in GC1, it is used most directly to determine enrollment in GC2. The rationale is that students who perform poorly in GC1 have a low likelihood of passing GC2. Allowing students to enroll in GC2 with little chance of passing the course is damaging to the student, in terms of GPA and course advancement. It also consumes instructional resources (e.g., grading, answering questions) and prevents another student from enrolling who may be more likely to succeed. Thus, the First Term ACS Exam should provide relevant and consistent information in determining the likelihood of students passing GC2. As the focus is on students’ likelihood of passing GC2, their grades in the course were dichotomized to pass (grade of C or better) or fail (grade of D or F or withdrawing from the course).
Within the sample of 291 students, 200 passed GC2 (68.7%). A logistic regression was conducted predicting the likelihood of passing GC2 from the First Term ACS Exam and resulted in the data in Table 3.
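The dichotomization and logistic model just described can be sketched as follows. The coefficients below are illustrative placeholders only, not the fitted values reported in Table 3, and the function names are ours.

```python
import math

def passed_gc2(grade):
    """Dichotomize the GC2 outcome as in the study: a grade of C or
    better counts as passing; D, F, or withdrawal ("W") counts as failing."""
    return grade in ("A", "B", "C")

def pass_probability(acs_score, b0, b1):
    """Logistic regression prediction of the probability of passing GC2
    from a First Term ACS Exam score (number correct out of 70):
    p = 1 / (1 + exp(-(b0 + b1 * score)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * acs_score)))

# Illustrative coefficients (an intercept and a per-question slope),
# chosen only to show the shape of the model:
b0, b1 = -2.4, 0.084
for score in (20, 37, 55):
    print(score, round(pass_probability(score, b0, b1), 2))
```

With a positive slope, each additional question correct raises the predicted odds of passing by a constant factor of exp(b1), which is what makes the exam usable as a graded screen for GC2 enrollment rather than a simple cutoff.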
Table 4. Logistic Regression for Student Subgroups^a

Variable    b        SE      Significance    Exp(b)
            −2.405   0.606
            0.084    0.016
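The Mantel−Haenszel screen used in the DIF analyses above can be sketched as follows. This is a minimal illustration with invented counts, not the study's computation; an item's responses are stratified by ability (total ACS Exam score or GC1 letter grade), and a pooled odds ratio near 1.0 indicates no evidence of DIF.

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across ability strata.
    Each stratum is a 2x2 table of counts (a, b, c, d) for one item:
      a = reference-group correct,  b = reference-group incorrect,
      c = focal-group correct,      d = focal-group incorrect.
    Values far from 1.0 flag items where one subgroup is favored
    even after matching students on overall ability."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Toy example with two ability strata; all counts are invented.
strata = [(30, 10, 25, 15), (40, 5, 38, 7)]
print(round(mantel_haenszel_or(strata), 2))
```

Grouping students into five grade-based strata rather than 70 score-based strata puts more students in each 2x2 table, which is the statistical-power rationale given above for repeating the DIF analyses with GC1 letter grades as the ability measure.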