
The Testing Effect: An Intervention on Behalf of Low-Skilled Comprehenders in General Chemistry

Daniel T. Pyburn,† Samuel Pazicni,*,† Victor A. Benassi,‡,§ and Elizabeth M. Tappin§

† Department of Chemistry, University of New Hampshire, Durham, New Hampshire 03801, United States
‡ Department of Psychology, University of New Hampshire, Durham, New Hampshire 03801, United States
§ Center for Excellence in Teaching and Learning, University of New Hampshire, Durham, New Hampshire 03801, United States


ABSTRACT: Past work has demonstrated that language comprehension ability correlates with general chemistry course performance with medium effect sizes. We demonstrate here that language comprehension’s strong cognitive grounding can be used to inform effective and equitable pedagogies, namely, instructional interventions that differentially aid low-skilled language comprehenders. We report the design, implementation, and assessment of such an intervention strategy. Guided by two models of comprehension, we predicted that a multiple pretesting strategy would differentially aid low-skilled comprehenders in a general chemistry class. We also explored the effect of two question types (multiple choice and elaborative interrogation) on this intervention strategy. A within-subjects, learning-goals driven design was used to build the intervention into two semesters of the course; data generated by this approach were analyzed with hierarchical linear models. We found that the achievement gap between low- and high-skilled comprehenders was partially abated by repeated testing prior to course examinations. We also found that the differential benefits of repeated testing could be accounted for entirely by multiple-choice questions, while elaborative interrogation questions had a statistically significant, but negative, impact. The implication of this work for all levels of chemistry teaching is clear: testing can be used to enhance (not just to assess) student learning, and this act affects different groups of students in different ways.

KEYWORDS: First-Year Undergraduate/General, Chemical Education Research, Testing/Assessment

FEATURE: Chemical Education Research

Many studies have assessed the effectiveness of strategies aimed at improving student performance in general chemistry. For example, two popular strategies are peer-led team learning (PLTL) and process oriented guided inquiry learning (POGIL). In PLTL, students participate in structured problem-solving activities, guided by a peer-leader, that supplement traditional classroom time (lectures or recitations).1,2 Numerous studies have investigated the effects of PLTL on various student outcomes, including attitudes, course performance, program retention/persistence, and course pass/fail rates.3−10 In POGIL, groups of students complete learning cycle-based activities in order to collaboratively develop conceptual understanding.11 Researchers have investigated the effects of POGIL on student attitudes, course performance, and course pass/fail rates.11−19 While the average effect of PLTL on student outcomes appears uniformly positive, studies of POGIL’s effect on similar outcomes range from those reporting mostly positive effects,11,12,14,17,18 to those reporting mixed or limited effects.13,15,16,19 Varying fidelity of implementation has been cited as a possible explanation for this range of results regarding the POGIL strategy.13,15,19

Studies measuring the average effect of interventions like PLTL and POGIL possess a limitation: the average effect is being measured. Such work assumes that the intervention in question will affect all subgroups of students equally. Lewis and Lewis have stated that “even when student scores improve on average, there is a distinct possibility that certain students may not be benefiting at all, or worse, be put at a disadvantage, but this phenomenon is simply masked by the improvements from other groups”.8 Indeed, these concerns led the authors to frame a study of PLTL around whether the intervention differentially affected underprepared students (i.e., those who possessed low SAT scores). Their results indicated that although the PLTL intervention benefited all students on average, it did not differentially hurt underprepared students, a crucial piece of information concerning the nature of PLTL (and any pedagogy for that matter). Moreover, in a subsequent investigation, S. Lewis observed that PLTL differentially improved the pass rates of underrepresented minorities in comparison to majority students.9 Studies like these, which account for the possibility that an instructional intervention will likely affect different populations of students in different ways, are imperative for more fully understanding the nature of the intervention.

The present study sought to evaluate an instructional intervention designed on behalf of a disadvantaged population of general chemistry students. The disadvantaged population


chosen was low-skilled language comprehenders, as theoretical models of language comprehension provide clear details concerning the types of activities that should aid such students. The fundamental criterion used to evaluate this intervention was equity,20 i.e., will this intervention allow the achievement gap between students of low and high language comprehension to be mitigated?



BACKGROUND

Language Comprehension Ability

Various psychological theories of comprehension have identified multiple levels of language and discourse, as well as the representations, structures, strategies, and processes associated with these levels.21 For example, Graesser and McNamara present six levels: words, syntax, the textbase (explicit ideas that aid meaning-making, but are not precise words or syntax), the situation model (dimensions of spatiality, temporality, and inferences), the genre and rhetorical structure (the type of discourse and its composition), and the pragmatic communication level (goals of the writer or speaker).22,23 Of particular relevance to this work, the situation model is constructed by combining prior knowledge with information in the textbase, i.e., by making inferences. McNamara and Magliano argue that the most critical difference between low-skilled and high-skilled comprehenders is the ability and tendency to generate these inferences.24

Structure Building assumes that many of the processes and mechanisms involved in language comprehension are general and thus describes the comprehension of both linguistic (written and spoken) and nonlinguistic (pictorial) media.25 An individual’s initial exposure to new information will activate memory cells, which produces a foundation. As additional information relevant to this foundation is encountered, a mental representation, or structure, is built. When information irrelevant to the structure is encountered, the activated foundation is suppressed as another foundation is activated; a new substructure is built from this newly activated foundation. Once the building of a new substructure begins, information coded within the previously built substructure becomes less accessible. Information that is comprehensible, regardless of media, is structured; low- and high-skilled comprehenders differ in how skillfully they employ the aforementioned cognitive processes and mechanisms that capture this structure. Students of low comprehension skill are believed to ineffectively filter (suppress) irrelevant information. As a consequence, less-skilled comprehenders shift too often and build too many substructures.26 A mental representation composed of many substructures is very fragmented in nature; much of the information encoded in this way quickly becomes inaccessible.

A feature common to both models is the explicit connection of new information with prior knowledge. Beyond these models, which consider elements associated with either the text (Multilevel Model) or the comprehender’s cognitive make-up (Structure Building), previous work has also considered the comprehender’s behaviors. Chief among behaviors of interest is the use of reading strategies. Deep comprehension of content is presumed to emerge from strategies that prompt the learner to generate inferences that connect what is being learned to prior knowledge. These strategies include inquiry, explanation, self-regulation, and metacognition.27−29 In other words, skilled comprehenders are better able to strategically use prior knowledge to fill in the conceptual gaps encountered when attempting to comprehend new information. Unfortunately, students who spontaneously and skillfully engage in these strategies are rare.30 Both the Multilevel Model and Structure Building have informed previous CER, which has sought to analyze the textual attributes of general chemistry texts,31 and characterize the relationship between language comprehension ability and performance in general chemistry.32 Similar learning models have been employed in CER (e.g., information processing33−38). However, the Multilevel Model and Structure Building are preferred here because they directly reference the cognitive and behavioral hallmarks of low-skilled comprehenders, which are useful for informing classroom strategies for intervening on behalf of this disadvantaged population.

Test-Enhanced Learning

Given that low-skilled comprehenders often fail to engage in inference-making strategies like inquiry and explanation (i.e., asking and answering questions), it follows that a potential instructional intervention for low-skilled students is testing. Testing is typically used to assess student mastery of content; however, testing can also be used to facilitate learning. For example, testing has been shown to be of greater benefit than repeated studying.39−41 Substantial laboratory-based research has demonstrated that retrieving information while taking a quiz or exam promotes initial learning and long-term retention of that information. This phenomenon has been termed the testing effect or test-enhanced learning,42 and research on testing has validated the benefits of this effect.40,43−50 Although the benefits of testing have been confirmed in laboratory settings, this effect has only recently received attention in authentic classroom scenarios.40,51,52 For example, Roediger and co-workers demonstrated the utility of testing in a sixth-grade social studies course.52 They demonstrated that student performance on pre-tested exam items was greater, on average, than performance on items that were not previously quizzed. In a separate experiment, a re-reading control was added to account for the possibility that this result was due to time-on-task and/or repeated exposure to content; the result of this second experiment suggested a benefit from testing beyond what resulted from re-exposure to material.

The benefits of testing have also been investigated using different types of questions. For example, elaborative interrogation (EI) questions are “why” questions (e.g., why do atomic radii decrease going from left to right on the periodic table?). Positive effects on subsequent recall and inference-making tasks have been observed when EI questions were included with reading assignments.39,53−58 EI questions are believed to activate prior knowledge and help students relate new information to existing knowledge.59 Hypothesizing that EI questions should differentially benefit low-skilled comprehenders when built into a classroom intervention strategy is thus consistent with our guiding models of comprehension. EI questions are thought to aid students in building inferences, a cornerstone of the situation model. In addition, EI questions should anchor students to relevant prior knowledge, crucial for low-skilled comprehenders, who experience difficulty in suppressing irrelevant information (resulting in a high amount of shifting, according to Structure Building).

Test-enhanced learning has also been investigated with multiple-choice (MC) questions. For example, McDaniel and co-workers have shown that repeated MC testing produces significant learning gains in a middle school science classroom.60


Recently, Little and co-workers demonstrated that MC questions not only fostered retention of information, but also facilitated recall of information pertaining to incorrect alternatives.61 Thus, using MC questions to aid students of low comprehension is also consistent with our guiding models of comprehension, as knowledge of both correct and incorrect responses to MC questions may help low-skilled comprehenders suppress irrelevant information and build more coherent structures of concepts. Moreover, MC questions are practically desirable; given the ease of grading such questions, they could constitute an instructor-friendly intervention strategy for large lecture courses like general chemistry.

Rationale and Research Questions

Because language comprehension ability plays a fundamental role in learning introductory chemistry,32 it is imperative to design instructional interventions to help students of low comprehension skill. It is well recognized that reading strategy instruction improves comprehension and is one of the most effective means of helping students to overcome deficits in comprehension ability.56,62−65 However, it is rather unreasonable to expect chemistry faculty to provide deliberate strategy instruction. As an alternative, we conjecture that purposefully scaffolding the use of strategies onto course activities may be beneficial. Low-skilled comprehenders tend not to employ the range and frequency of strategy use exhibited by high-skilled comprehenders; thus, learning activities that simulate successful reading strategies should help to close the achievement gap between low- and high-skilled comprehenders. Guided by models of comprehension ability, we aimed to exploit the fact that high-skilled comprehenders tend to engage in inquiry and explanation, strategies believed to aid inference making, connecting what is being learned to prior knowledge. Consequently, our intervention took the form of “pre-testing” with items that promoted these strategies. The present study examined the benefits of this pre-testing strategy using both MC and EI questions in a large-lecture general chemistry course. Specifically, the following research questions were addressed:
(1) To what extent does pre-testing differentially aid students of low language comprehension?
(2) How do the efficacies of two question types (EI and MC) compare in the context of this intervention?

METHODS

Setting

Participants in this study included students enrolled in two sequential semesters of a one-term general chemistry course for engineering majors at a 4-year public university with high research activity in the northeastern United States. Topics covered in the course included atomic theory, solid state structure, chemical bonding, molecular structure, intermolecular forces, chemical reactions and stoichiometry, introductory organic chemistry, thermochemistry and thermodynamics, chemical kinetics, chemical equilibrium, acids/bases, and electrochemistry. A typical week of the course involved three 50-min large-lecture periods, a 3 h hands-on laboratory, and a 50-min small-class recitation period, led by a teaching assistant. The large lecture periods included standard lecturing and small-group discussion facilitated by student response technology (clickers). Computer-based out-of-class work was also required. Examinations and laboratory activities composed the majority (∼66%) of the course grade, as detailed in the Supporting Information, Section S1. The same instructor taught the course for both semesters of the study. Textbooks, topic coverage, lecture notes, assignments, and presentation aids were consistent across both semesters.

Study Design

A within-subjects learning goals-driven approach66 was chosen for this study. Course content was unpacked into four sets of learning goals (LGs) that described what concepts, ideas, and tasks students should master; these LG sets corresponded to each of the course’s four midterm examinations. LGs were shared with students, who were informed that all coursework was keyed to these goals. The LGs used for this study are provided in the Supporting Information, Section S2. This design is of note because it permits the treatment of LGs to be different, while all students are treated equally. If a control-treatment group design were chosen, only a subset of students would have received the hypothesized benefits of pre-testing. It is our view that this type of design would have constituted an unethical treatment of students in the control group. Therefore, pre-testing conditions were manipulated at the level of LGs, while all students in the study experienced the same treatment. Figure 1 summarizes the study design, while specific details follow.




Figure 1. Summary of the learning goals-driven study design. This design was repeated for four exams per semester and over two semesters.
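To make the learning-goals-driven design concrete, the short sketch below shows one way the random assignment described in the next subsection, 18 LGs per exam split evenly among control, EI, and MC conditions, could be carried out. This is an illustration only, not the authors' procedure; the LG labels and function name are hypothetical.

```python
# Minimal sketch (assumptions: 18 LGs per exam, three conditions of six LGs each).
import random

def assign_conditions(lg_ids, seed=None):
    """Shuffle the 18 LG identifiers and split them into control, EI, and MC groups."""
    rng = random.Random(seed)
    lgs = list(lg_ids)
    rng.shuffle(lgs)
    return {"control": lgs[0:6], "EI": lgs[6:12], "MC": lgs[12:18]}

exam_lgs = [f"LG 3.{i:02d}" for i in range(1, 19)]   # hypothetical LG labels
print(assign_conditions(exam_lgs, seed=1))
```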

Multiple Pre-Testing Intervention

For each course examination, 18 LGs were randomly selected and grouped into three pre-testing conditions: control, EI, and MC. This fully random selection and assignment of LGs to treatment conditions ensured that any gross differences in LG difficulty were controlled (Supporting Information, Section S3). Prior to examination, six LGs were tested twice with MC questions, another six LGs were tested twice with EI questions, and the remaining six LGs were not tested prior to the exam to serve as a control. An exposure control was not employed because the effect of testing beyond re-reading or re-studying has been well documented, both in laboratory experiments and in classroom settings.39−41 The pre-tests were created as unsupervised online quizzes, deployed using the Blackboard Learn platform. The first pre-test for each tested LG was administered as a “reading check” that students completed after they (presumably) read an assigned section of text specific to a subsequent class meeting. The second pre-test was administered as a weekly summary quiz. Both the reading checks and the weekly summary quizzes contained multiple questions and a mix of EI and MC, as well as items keyed to LGs that were not selected for pre-testing treatment. Each pre-test item was



scored and feedback was provided; multiple attempts were not permitted. MC questions were scored automatically (as correct or incorrect); EI questions were scored manually (see below). Correct responses on all pre-test items contributed positively to students’ overall course grades; incorrect responses contributed negatively. Pre-testing activities together contributed 16.3% to students’ overall course grades (Supporting Information, Section S1).

EI questions were scored by course staff members using the following rubric adapted from Smith and co-workers.58 Students earned full credit if a response was scientifically accurate and was linked to the question asked. Half-credit was assigned if the response was scientifically accurate, but was not linked to the question. No credit was given if the response was not scientifically accurate or linked to the question. Along with scores for EI questions, feedback was also provided to students within 1 week. Course staff members were instructed that, in cases that half-credit or no credit was awarded, they should inform students (via written comments) what aspect(s) of the EI response warranted less than full credit.

Examples of pre-tests and exam items keyed to specific LGs are presented in Tables 1 and 2. For example, LG 2.23 (Supporting Information, Section S2) was randomly chosen for pre-testing treatment with EI (Table 1). The first EI question was presented to students within a reading check, while the second EI question was presented to students within a subsequent weekly summary quiz, which students completed 1−5 days following the reading check. Another example, LG 3.17, was treated similarly, but with MC questions (Table 2). Following the pre-testing treatment, the LG would not be tested again in a high-stakes manner until the corresponding midterm exam. Midterm exams were administered 15−34 days (depending on exam) following the first LGs treated with reading checks. Further examples of pre-test and midterm exam items keyed to the LGs used in this study are presented elsewhere.67

Table 1. Example of a Learning Goal That Received Pre-Testing Treatment with EI Questions

Table 2. Example of a Learning Goal That Received Pre-Testing Treatment with MC Questions

Learning Goal (3.17): Apply the First Law of Thermodynamics to describe how the internal energy of a system changes when energy is transferred as heat and/or work.

MC Question 1: A system receives 575 J of heat and delivers 425 J of work. What is the change in the internal energy (ΔE) of the system? (a) −150 J; (b) 150 J; (c) −1000 J; (d) 1000 J; or (e) 575 J.

MC Question 2: Which of the following is TRUE if ΔEsys = −95 J? (a) The system is losing 95 J, while the surroundings are gaining 95 J; (b) Both the system and surroundings are losing 95 J; (c) The system is gaining 95 J, while the surroundings are losing 95 J; (d) Both the system and surroundings are gaining 95 J; or (e) None of the above are true.

Exam Item from Semester 1: The following reaction for the synthesis of gaseous ammonia (NH3) from the elements releases 92.22 kJ of energy to the surroundings as heat: N2(g) + 3H2(g) → 2NH3(g). During this reaction, the surroundings do approximately 4.5 kJ of work on the system. Determine the change in internal energy (in kJ) for this reaction. Support your response with descriptions or illustrations of how energy is transferred to/from the reaction as work and/or heat.

Exam Item from Semester 2: The following reaction for the decomposition of sodium azide (NaN3) is used to rapidly inflate automobile airbags. This reaction releases 43.4 kJ of energy to the surroundings as heat: 2NaN3(s) → 2Na(s) + 3N2(g). If the total change in internal energy for this reaction is −50.21 kJ, what quantity of energy (in kJ) is transferred as work? Is work performed by the system or by the surroundings? Support your response with descriptions or illustrations of how energy is transferred to/from the reaction as work and heat.
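For readers checking the items in Table 2, the lines below work through the arithmetic under the usual sign convention ΔE = q + w, with q and w defined relative to the system. The published answer keys are not part of this excerpt, so these values are our own check rather than the authors' rubric.

```latex
% First Law, system-centered sign convention (assumed):
\Delta E = q + w
% MC Question 1: heat received (+575 J), work delivered by the system (-425 J)
\Delta E = (+575\,\mathrm{J}) + (-425\,\mathrm{J}) = +150\,\mathrm{J} \quad \text{(consistent with choice b)}
% Semester 1 exam item: heat released (-92.22 kJ), work done on the system (+4.5 kJ)
\Delta E = (-92.22\,\mathrm{kJ}) + (+4.5\,\mathrm{kJ}) \approx -87.7\,\mathrm{kJ}
```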

Outcome Measure

Scores on items from the four course exams served as the outcome measure for this study. Course exams required written problem solving and short answer essays; exams did not contain MC questions. From the sets of LGs, we randomly chose 18 LGs to be assessed on each course examination. Students were informed that 18 LGs would be tested, but they did not know which goals were chosen. Each LG chosen for examination was converted into a five-point free-response examination question. Thus, a total of 72 LGs were tested per semester. The same 72 LGs were tested (but with different exam items) during both semesters of data collection; this also helped to ensure that more “easy” LGs or exam items were not overassociated with one particular treatment condition. Course staff members following instructor-generated rubrics scored exam items. “Practice exams” were made available to students as a courtesy so that the structure of examinations and exam items was known a priori. To control for potentially confounding time-on-task and cueing issues, the practice exam items and midterm exam items were keyed to the same LGs.

Reliability was assessed for all course examinations (eight total) by calculating Cronbach’s α, a measure of internal consistency, for each exam. Each course exam had adequate internal consistency (α = 0.73−0.85). Content validity was assessed (Supporting Information, Section S4) by calculating inter-exam correlations (r = 0.54−0.73) and correlations with the ACS General Chemistry (Conceptual) Exam (form 2008), obtained from the American Chemical Society Division of


Chemical Education Examinations Institute, which was used as a portion of the course’s final examination (r = 0.49−0.64). In addition, content experts assessed the validity of course exams with respect to the study design. This was done by determining the frequency with which a panel of three experts agreed on the pairing of LGs to exam items. The expert panel confirmed that a randomly selected subset of 30 exam items (∼21% of the total items from eight exams) matched the intended LG (Supporting Information, Section S5). Depending on LG set, Fleiss’ κ ranged from 0.72 to 1.00, indicating substantial agreement between raters.68,69 Moreover, the panel was instructed to describe the cognitive level of these exam items by assigning each to a category of Bloom’s taxonomy of cognitive domains.70 The panel rated 80% of the exam items as corresponding to the comprehension category of Bloom’s taxonomy or higher, and 40% of the exam items as corresponding to the application category or higher.
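As a concrete illustration of the internal-consistency statistic quoted above, the sketch below computes Cronbach’s α from a students-by-items score matrix; applied to dichotomously scored (0/1) items, the same expression reduces to the KR-20 statistic reported later for the Toledo Exam. This is our own illustrative code, not the authors’ analysis, and the variable names and demo numbers are hypothetical.

```python
# Minimal sketch of Cronbach's alpha: alpha = k/(k-1) * (1 - sum(item variances)/variance(totals))
import numpy as np

def cronbach_alpha(scores):
    """scores: 2-D array, rows = students, columns = exam items."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                         # number of items
    item_vars = scores.var(axis=0, ddof=1)      # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example: 4 students x 3 five-point items (illustrative numbers only)
demo = [[5, 4, 4], [3, 3, 2], [4, 4, 5], [2, 1, 2]]
print(round(cronbach_alpha(demo), 2))
```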

Assessments of Language Comprehension Ability and Prior Chemistry Knowledge

SAT Critical Reading (SAT-CR) section scores served as the measure of students’ language comprehension abilities. The critical reading section of the test contains passages of text and sentence completions,71 and internal consistency measures for this section of the SAT have been found to be high (α > 0.90).72 SAT-CR scores were obtained from institutional records following data collection and are presented herein as raw scores. Even though students completed the SAT-CR section at a point in time prior to the beginning of the course, we have shown that SAT-CR scores correlated strongly with other language comprehension measures administered closer to the time of study.32

We also administered an assessment of preparation for general chemistry (the Toledo Chemistry Placement Examination, Form 2009, obtained from the ACS Examinations Institute) at the beginning of each semester of the study. This exam was originally designed for placing undergraduate students into chemistry courses73 and consists of three sections, each containing 20 MC questions. The first section contains items that assess math ability, while the second and third sections assess general and specific chemistry knowledge, respectively. For this study, we used the Toledo Exam in two ways. First, a sum of all section scores provided insight into the preparation of our sample for a general chemistry course as compared to national norms. In our research scenario, the full Toledo Exam had a Kuder-Richardson Formula 20 (KR-20, a measure of internal consistency for measures with dichotomous choices, analogous to Cronbach’s α) of 0.73. Second, a sum of scores from the latter two (chemistry-related) exam sections served as a measure of prior chemistry knowledge. In our research scenario, a combination of Parts II and III of the Toledo Exam had a KR-20 of 0.70. Scores for this measure of prior chemistry knowledge are presented herein as percentages.

Participants

Participants’ demographic data were collected from institutional records and are presented in Table 3. The sample was predominately white (92.4%) and male (83.2%); 63% of students were in their first year of university study. The most represented major among the sample was mechanical engineering. According to norms published by the ACS Examinations Institute,74 the mean Toledo Exam score for the sample (M = 35.35, SD = 6.28, N = 258) equated to approximately the 64th percentile, indicating that participants may have been slightly more prepared for a general chemistry course than the national average. However, Toledo Exam scores did range from the 2nd to the 99th percentiles, indicating a wide range of preparation within the sample.

Table 3. Demographics of Participants

Parameter | Portion of Students, %a | Variables
Sex | 100.0 | 16.8% female; 83.2% male
Ethnicity | 96.4 | 92.4% White; 3.4% Asian; 1.9% Hispanic or Latino; 1.1% non-Hispanic/2 or more races; 0.4% Black or African American
Class standing | 100.0 | 62.9% first-years; 14.3% sophomores; 15.0% juniors; 7.1% seniors
Academic major | 100.0 | 41.8% Mechanical Engineering; 19.3% Civil Engineering; 11.8% Chemical Engineering; 11.5% Environmental Engineering; 6.8% Electrical Engineering
a Percentage of students in the sample for whom these data were available.



DATA ANALYSIS

Concerning the Data

As described above, outcome data consisted of 18 exam item scores (ranging from 0 to 5) collected from each student’s four exams. Exam item scores were grouped into subsets according to pre-testing condition (EI, MC, or control) and exam item subset means were calculated for each exam. Thus, the outcome data were reduced to 12 exam item subset means, expressed herein as percentages, for each student. This reduction was done to ensure a normal distribution of outcome data. These exam item subset means served as dependent variables in the analysis presented below. Independent variables included SAT-CR section scores, Toledo Exam scores, time, and indicators representing pre-testing condition. A summary of items comprising the data set for each participant is provided in Table 4. Data from both semesters were combined (Supporting Information, Section S6) and descriptive statistics for the continuous variables used in analyses are presented in Table 5. SAT-CR section and Toledo Exam scores were normally distributed and the ranges of these scores encompassed ±2 standard deviations, sufficient to warrant inclusion of these measures as independent variables. However, while mean exam item subset data possessed an acceptable range, significance testing revealed less than ideal skewness. Upon examination, we determined that the skewness values for these data were within the range of ±1; the inferential statistics used here were robust to these modest violations of the normality assumption.75
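A minimal sketch of this data reduction, assuming a long-format table of item scores with hypothetical column names (this is illustrative, not the authors’ code):

```python
# Group each student's exam item scores by pre-testing condition and average them,
# expressing the result as a percentage of the 5-point item maximum.
import pandas as pd

item_scores = pd.DataFrame({
    "student":   [101, 101, 101, 101, 101, 101],
    "exam":      [1, 1, 1, 1, 1, 1],
    "condition": ["control", "control", "EI", "EI", "MC", "MC"],
    "score":     [4, 5, 3, 2, 5, 4],           # 0-5 points per exam item
})

subset_means = (item_scores
                .groupby(["student", "exam", "condition"])["score"]
                .mean()
                .mul(100 / 5)                   # convert to a percentage
                .rename("subset_mean_pct")
                .reset_index())
print(subset_means)
```

Over four exams, this grouping yields the 12 subset means per student described above (one control, one EI, and one MC mean per exam).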



Table 4. Items Comprising the Data Set for Each Participant

Item | Variable Type | Description
SAT-CR Section Scores | Continuous, Independent | Measure of language comprehension ability.
ACS Toledo Exam Scores | Continuous, Independent | Measure of prior chemistry knowledge. Parts II and III of the ACS Toledo Exam were combined to give a percentage score for each student.
Control Subset Means | Continuous, Dependent | For each of the four midterm exams, those items corresponding to a control LG were averaged. Thus, the data set for each student contained four control subset means.
EI Subset Means | Continuous, Dependent | For each of the four midterm exams, those items corresponding to a LG pre-tested with EI questions were averaged. Thus, the data set for each student contained four EI subset means.
MC Subset Means | Continuous, Dependent | For each of the four midterm exams, those items corresponding to a LG pre-tested with MC questions were averaged. Thus, the data set for each student contained four MC subset means.
EI Indicator | Dichotomous, Independent | This variable was coded “1” if a subset mean entry corresponded to LGs pre-tested with EI and “0” if not.
MC Indicator | Dichotomous, Independent | This variable was coded “1” if a subset mean entry corresponded to LGs pre-tested with MC and “0” if not.
Time | Ordinal, Independent | This variable took on values 0−3: “0” corresponded to the first exam, “1” corresponded to the second exam, and so on. As exams were administered at regular time intervals (on average, every 22.9 days) across the semesters of the study, exam number served as a proxy for time.

Table 5. Descriptive Statistics for Continuous Variables Used in This Study

Statistic | Exam Item Subset Means | SAT-CR Section Scoresa | ACS Toledo Exam (Parts II and III) Scoresa
N | 2536 | 197 | 200
Mean | 68.52 | 557.51 | 49.04
Minimum | 0.00 | 380.00 | 12.50
Maximum | 100.00 | 750.00 | 83.00
Standard Deviation | 22.08 | 73.90 | 13.23
Skewness | −0.74 (std. error = 0.05) | 0.17 (std. error = 0.17) | −0.07 (std. error = 0.17)
Excess Kurtosis | 0.10 (std. error = 0.10) | 0.21 (std. error = 0.35) | 0.02 (std. error = 0.34)
a For HLM analyses, these scores were standardized to sample grand means to facilitate interpretation of results.

Students lacking SAT-CR section and ACS Toledo exam scores were removed from the analysis; data for 197 participants remained. We also removed data so as to control for treatment fidelity. With regard to exam item subset means, we could not assume that every student completed every pre-test item prior to each exam, or that every pre-test response was captured properly by the Blackboard system. For example, LG 4.02 (Supporting Information, Section S2) was pre-tested twice with MC prior to exam four. While some students may have completed both pre-tests corresponding to this LG prior to its examination, others might have completed only one of the two pre-tests; others may have completed neither pre-test. Therefore, we determined the frequency with which students completed pre-test items for each LG tested on a subsequent exam (Table 6). We removed from a student’s data set exam item scores that corresponded to LGs whose pre-test questions were completed only once or not at all. So, if a student did not complete both MC pre-test questions corresponding to LG 4.02, for example, the exam item score corresponding to LG 4.02 was not used in calculating the exam four MC exam item subset mean. We performed Pearson correlations to assess any relationship between the frequency with which students completed pre-tests and language comprehension ability; no significant correlations were found. This result suggested that students of different levels of language comprehension did not have more or less of a tendency to complete the intervention.

Table 6. Summary of Missing Pre-Testing Data

Description | Frequency | Percent
EI Questions
Instances a LG was not tested | 222 | 4.2
Instances a LG was tested once | 863 | 16.4
Instances a LG was tested twice | 4172 | 79.4
MC Questions
Instances a LG was not tested | 462 | 8.6
Instances a LG was tested once | 926 | 17.2
Instances a LG was tested twice | 3994 | 74.2

Analysis Using Hierarchical Linear Models

To test the hypothesis that pre-testing would differentially aid low-skilled comprehenders and determine the efficacy of different question types, hierarchical linear models (HLMs) were used. The data described in Table 4 were “nested” in nature, i.e., multiple observations (exam item subscore means) were nested within each student, and the students were nested within classrooms. HLMs are those in which nested data may be studied without violating assumptions of independence.76 HLMs have become a powerful tool for investigating longitudinal effects in CER, e.g., for analyzing multiple measures of chemistry performance in one semester8,32 and how students’ performance self-perceptions77 or self-efficacy beliefs78 change over time. Hypothesized HLM. The structure of the data in Table 4 suggested a three-level HLM describing change in student performance over time as a function of pre-testing condition. Level-1 units were examination events, Level-2 units were students, and Level-3 units were the two general chemistry classes over which the students were partitioned. This hypothesized HLM included indicator variables to denote pre-testing condition and time (so as to account for the repeated measures nature of the data) as fixed effects at level-1 (within-student), i.e., as predictors of exam item subset means. Language comprehension ability and prior chemistry knowledge were declared fixed effects at level-2 (between-student). Students were declared a random effect at level-2 to assess




variability among students; classrooms were declared a random effect at level-3 to assess variability among classrooms.

Concerning Assumptions. We evaluated the extent to which assumptions of independence were violated in our nested data set. To do this, we calculated intraclass correlations76 between the three data levels. Large intraclass correlations imply that the assumption of independence has been violated, i.e., that analysis using HLMs is more appropriate than using simple regressions. The intraclass correlation for the second level (between-students, ρ = 0.33) was large, validating our choice to include students as a random second-level unit. However, the intraclass correlation for the third level (between-classrooms, ρ = 0.09) was small, indicating that there was no significant variability between the two general chemistry classes used in this study. Thus, we revised our hypothesized model and employed a two-level HLM to analyze these data.

HLM Construction and Relevant Parameters. The following regression equation described level-1, using the notation of Raudenbush and Bryk:76

subscoreij = π0j + π1j(EI) + π2j(MC) + π3j(time) + eij

Exam item subset means (subscore) served as the dependent variables that “EI”, “MC”, and “time” predicted. “EI” and “MC” are the indicator variables described in Table 4. Deviations in true student score and predicted score are represented by eij. Level-2 equations were constructed to predict the intercept and

slopes in the level-1 equation from language comprehension ability (comp) and prior chemistry knowledge (priork):

π0j = β00 + β01(comp) + β02(priork) + r0j
π1j = β10 + β11(comp) + β12(priork) + r1j
π2j = β20 + β21(comp) + β22(priork) + r2j
π3j = β30 + β31(comp) + β32(priork) + r3j

The various intercepts and slopes of these level-2 equations are described in Table 7. Deviations in true student score and predicted score are represented in each equation by the rij terms. Estimating β10 and β20 (as well as the statistical significance of these parameters) was essential, as doing so tested the effects of EI and MC questions, respectively, on student performance. Estimating β11 and β21 was also important, as doing so tested the overall hypothesis that pre-testing would differentially aid low-skilled language comprehenders.

Table 7. Descriptions of Parameters Included in the Hypothesized HLM

β00: Describes the grand mean of unstandardized exam item subset means across students of average comprehension ability and prior knowledge for LGs that were not pre-tested prior to the first examination (i.e., at “time zero”).
β01: Describes the effect of language comprehension ability (standardized SAT-CR section scores) on unstandardized exam item subset means for LGs that were not pre-tested prior to the first examination (i.e., at “time zero”).
β02: Describes the effect of prior chemistry knowledge (standardized sum of Sections II and III from the ACS Toledo exam) on unstandardized exam item subset means for LGs that were not pre-tested prior to the first examination (i.e., at “time zero”).
β10: Describes the effect of EI pre-testing on unstandardized exam item subset means from the first examination (i.e., at “time zero”) for students of average comprehension ability and average prior knowledge.
β11: Describes the effect of language comprehension ability (standardized SAT-CR section scores) on the relationship between EI pre-testing and unstandardized exam item subset means from the first examination (i.e., at “time zero”).
β12: Describes the effect of prior chemistry knowledge (standardized scores from Sections II and III of the ACS Toledo exam) on the relationship between EI pre-testing and unstandardized exam item subset means from the first examination (i.e., at “time zero”).
β20: Describes the effect of MC pre-testing on unstandardized exam item subset means from the first examination (i.e., at “time zero”) for students of average comprehension ability and average prior knowledge.
β21: Describes the effect of language comprehension ability (standardized SAT-CR section scores) on the relationship between MC pre-testing and unstandardized exam item subset means from the first examination (i.e., at “time zero”).
β22: Describes the effect of prior chemistry knowledge (standardized sum of Sections II and III from the ACS Toledo exam) on the relationship between MC pre-testing and unstandardized exam item subset means from the first examination (i.e., at “time zero”).
β30: Describes the change in unstandardized exam item subset means over time for students of average comprehension ability and prior knowledge.
β31: Describes the effect of language comprehension ability (standardized SAT-CR section scores) on the change in unstandardized exam item subset means over time.
β32: Describes the effect of prior chemistry knowledge (standardized sum of Sections II and III from the ACS Toledo exam) on the change in unstandardized exam item subset means over time.

RESULTS

All HLMs presented here were computed with IBM SPSS MIXED MODELS, Version 21, and converged using the maximum likelihood (ML) estimation algorithm. Residuals for all models followed a normal distribution, with means of approximately zero and standard deviations of ∼0.5. Similar results79 were obtained using the restricted maximum likelihood (REML) method, likely because the two methods produce very similar results when the number of level-2 units is large.80 Although REML tends to produce less biased random effect estimates, ML was preferred because it permitted us to use the χ2 likelihood-ratio test (described below) to compare fits of multiple models.80 All HLM results are presented in Table 8.

We evaluated the overall fit of the hypothesized HLM using general linear hypothesis testing and the −2 Log Likelihood (−2LL) statistic.80 The difference in the −2LL statistics between two HLMs follows a χ2 distribution, with degrees of freedom equal to the difference in the number of parameters between the two models. First, we fit an unconditional growth model; the intercept (β00) and time (β30) were the only parameters estimated in this model. Next we fit the hypothesized HLM described above, which gave a statistically better fit than the unconditional growth model, χ2(df = 10) = 22067.91 − 17424.64 = 4643.27, p < 0.001. We then fit a “final” model, which contained only the statistically significant parameters from the hypothesized HLM: the intercept (β00), comprehension ability (β01), prior knowledge (β02), both pre-testing conditions (β10 and β20), the cross-level interaction term for MC pre-testing and comprehension ability (β21), and time (β30). For the final model (compared to the unconditional model), χ2(df = 5) = 22067.91 − 17428.66 = 4639.24, p < 0.001. As expected, the final model and the full hypothesized model gave indistinguishable fits, χ2(df = 5) = 17428.66 − 17424.66 = 4.03, p > 0.05.

Similar to an R2 statistic in regression, pseudo-R2 statistics were calculated for the final model. The proportion of variance in exam item subset means that was explained by the final model was obtained by comparing the level-1 and level-2 variance components between the unconditional and final models. The final model explained 9% of the within-student variance and 30% of the between-student variance.

Table 8. Taxonomy of Two-Level HLMs, Comparing an Unconditional Growth Model, the Hypothesized Growth Model, and the Final Model

Fixed Effect (Parameter) | Unconditional Growth Modelb,c | Hypothesized Modelb,c | Final Modelb,c
Intercept (β00)a | 66.31*** (1.11) | 66.35*** (1.20) | 66.41*** (1.20)
Standardized SAT-CR Section Scores (β01) | – | 4.43** (1.27) | 3.95*** (1.06)
Standardized ACS Toledo Exam Scores (β02) | – | 8.33*** (1.30) | 7.42*** (1.06)
EI Pre-Testing (β10) | – | −2.76** (0.89) | −2.80** (0.89)
SAT-CR Section Scores by EI Pre-Testing (β11) | – | −1.14 (0.95) | –
ACS Toledo Scores by EI Pre-Testing (β12) | – | −0.31 (0.96) | –
MC Pre-Testing (β20) | – | 5.65*** (0.88) | 5.59*** (0.88)
SAT-CR Section Scores by MC Pre-Testing (β21) | – | −2.20* (0.95) | −1.93* (0.79)
ACS Toledo Scores by MC Pre-Testing (β22) | – | −1.02 (0.96) | –
Time (β30) | 1.10*** (0.30) | 0.91** (0.32) | 0.90** (0.33)
Time by SAT-CR Section Scores (β31) | – | −0.03 (0.35) | –
Time by ACS Toledo Scores (β32) | – | −0.32 (0.36) | –
Level-1 Variance (σ2) | 290.11*** (8.54) | 264.56*** (8.70) | 265.11*** (8.72)
Level-2 Variance (τ00) | 205.92*** (22.21) | 143.40*** (17.93) | 143.50*** (17.94)
Pseudo-R2 (within-student) | N/A | 0.09 | 0.09
Pseudo-R2 (between-student) | N/A | 0.30 | 0.30
−2LL | 22,067.90 | 17,424.64 | 17,428.66
a Parameters in these models were estimated using the maximum likelihood (ML) algorithm. b The statistical significance of each parameter is indicated by the appropriate superscript: ***p < 0.001; **p < 0.01; *p < 0.05. c Standard errors for all parameters are provided in parentheses.

For the final model, the mean initial status (β00 = 66.41) represented the mean score for untested LGs (i.e., the control condition) on the first exam for students of average prior knowledge and comprehension ability. The mean change in performance over time was positive, but relatively modest (β30 = 0.90, p < 0.01), indicating that student performance did not vary substantially over a semester. Consistent with our previous work,32 the effects of comprehension ability (β01 = 3.95, p < 0.01) and prior chemistry knowledge (β02 = 7.42, p < 0.001) on this mean initial status were positive and statistically significant. As stated above, two research questions guided this study. Fundamentally, each focused on the efficacy of a classroom intervention specifically designed to aid low-skilled comprehenders in general chemistry. Results concerning the effect of pre-testing differed considerably when question type was considered. The effect of EI pre-testing was negative and statistically significant (β10 = −2.80, p < 0.01). In other words, on average, students tended to score lower on LGs that were pre-tested with EI compared to control. In fact, the effect of EI was equivalent to negating a performance enhancement due to a 0.75 standard-deviation increase in language comprehension ability. The effects of comprehension ability and prior knowledge on this relationship between EI pre-testing and exam score were not statistically significant and were not included in the final HLM. This result suggests that EI questions provided no benefit to any student, let alone students of low comprehension ability. The result for pre-testing with MC was markedly different (β20 = 5.59, p < 0.001): on average, scores were 5.59% greater when LGs were pre-tested with MC compared to control. Moreover, the effect of comprehension ability on the relationship between MC pre-testing and exam performance was negative and statistically significant (β21 = −1.93, p < 0.05). In other words, although all students benefited to some degree from MC pre-testing, students with low comprehension skills benefited more. Interestingly, the effect of prior chemistry knowledge on this relationship between MC pre-testing and exam performance was not statistically significant and not included in the final HLM. We find it informative to parallel the results of this HLM analysis with direct comparisons of performance means. First, we refer the reader to Section S3 of the Supporting Information. The descriptive statistics presented in Table S2 were meant to support the reliability of our full-randomized LG design. However, this information also provides a more detailed context regarding the overall impact of the multiple pre-testing intervention strategy. Second, we graphically compared performance means based on student comprehension ability and treatment condition (Figure 2). To do this, we partitioned student comprehension ability data as follows. Students possessing SAT-CR section scores within the upper third of the sample were considered of high comprehension ability, while students who scored in the lower third were considered of low comprehension ability. We then calculated the grand mean of exam item subset means (based on whether LGs were pre-tested with EI, MC, or not pre-tested) for students of low and high comprehension ability. Consistent with the HLM analysis, both low- and high-skilled comprehenders benefited from MC pre-testing; low-skilled comprehenders, though, benefited more from pre-testing with MC. 
Thus, this graphical comparison afforded a straightforward illustration of the HLM results.
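For readers who want to reproduce this kind of analysis outside of SPSS, the sketch below shows how a comparable two-level model (random intercept per student, fixed effects mirroring the final model in Table 8) could be specified with Python’s statsmodels. It is an illustration under assumed, hypothetical column names, not the authors’ code, and the helper predicted_mean is our own convenience function.

```python
# Illustrative sketch of a two-level HLM with a per-student random intercept.
# Hypothetical columns: subscore, ei, mc, time, sat_cr_z, toledo_z, student.
import statsmodels.formula.api as smf
from scipy import stats

def fit_models(df):
    unconditional = smf.mixedlm("subscore ~ time", df,
                                groups=df["student"]).fit(reml=False)
    final = smf.mixedlm("subscore ~ sat_cr_z + toledo_z + ei + mc + mc:sat_cr_z + time",
                        df, groups=df["student"]).fit(reml=False)
    # -2LL comparison analogous to the chi-square likelihood-ratio tests reported above;
    # the final model adds five fixed-effect parameters to the unconditional growth model.
    lr = 2 * (final.llf - unconditional.llf)
    p_value = stats.chi2.sf(lr, df=5)
    return unconditional, final, lr, p_value

def predicted_mean(b00, b01, b20, b21, z, mc=False):
    """Predicted subset mean at time zero for a student z SDs from the mean in
    comprehension ability and of average prior knowledge: b00 + b01*z,
    plus b20 + b21*z if the LG was pre-tested with MC."""
    return b00 + b01 * z + ((b20 + b21 * z) if mc else 0.0)
```

Plugging the Table 8 final-model estimates into predicted_mean reproduces the subgroup means quoted in the Discussion, e.g., predicted_mean(66.41, 3.95, 5.59, -1.93, -1.0) ≈ 62.5 for low-skilled comprehenders on control LGs, and ≈ 70.0 with mc=True.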




Figure 2. Comparing the effect of EI and MC pre-testing on course performance versus control for students of low (blue) and high (green) comprehension ability. Error bars represent 95% confidence intervals.



SUMMARY AND DISCUSSION

The Multilevel Model and Structure Building together form a useful theoretical basis for understanding how instructors can intervene on behalf of students of low language comprehension. Here we have demonstrated that a strategy typically employed by high-skilled comprehenders (asking/answering questions) can be grafted upon the structure of a general chemistry course to help students of low comprehension ability. We investigated two question types (EI and MC) in the context of this intervention strategy and found that pre-testing with EI actually produced a negative effect. Conversely, pre-testing with MC produced a considerable positive effect. Moreover, we have shown that pre-testing with MC differentially aids low-skilled comprehenders. In this work, we measured comprehension ability using SAT-CR section scores; analysis using another measure (the Gates MacGinitie Reading Test) can be found elsewhere.81 We believe this work is the first to demonstrate that the testing effect can differentially affect a particular subgroup of students.

The first goal of this study was to determine whether pre-testing would differentially aid low-skilled comprehenders in a large-lecture general chemistry course. Previous laboratory studies have demonstrated that testing does provide a learning benefit to students, and that this benefit is due to recall and not additional time-on-task or content exposure. In addition, others have demonstrated that testing can provide learning benefits in more authentic learning situations. For example, the testing effect has been studied in the context of a brain and behavior course40 as well as in a middle school social science course.52 This work confirmed the benefits of testing in a general chemistry course using MC questions. Because standardized SAT-CR section scores were used in the HLM analysis, the achievement gap between low- and high-skilled comprehenders (defined for this discussion as students scoring one standard deviation below and above the mean SAT-CR section score, respectively) for untested (control) LGs could be calculated from β01. Assuming average prior chemistry knowledge, low-skilled comprehenders scored 62.5%, while high-skilled comprehenders scored 70.4%, an achievement gap of 7.9%. Our HLM results indicate that low-skilled comprehenders (on average) scored 70.0% on LGs pre-tested with MC, while high-skilled comprehenders scored 74.0%. Thus, we can conclude that the intervention helped all students. In particular, the performance of low-skilled comprehenders increased by 7.5%, which could very well amount to a different letter grade, depending on how this construct is defined. Moreover, MC pre-testing closed the achievement gap between students of low and high comprehension ability by ∼4.0%. Therefore, we have demonstrated the testing effect to positively affect student performance in general chemistry, regardless of the level of comprehension ability. Moreover, the students that needed the most help (the low-skilled comprehenders) benefited more from this intervention strategy.

We believe this testing effect to be beyond the intervention simply cueing students to which LGs would eventually be tested (i.e., “learning to the test”), as two different question types produced considerably different results. If the multiple pre-testing intervention was indeed cueing students to future exam content, we might have expected to observe similar effects for MC and EI questions. At the very least, we would have expected the two question types to yield positive effects on student performance; we however did not observe this for EI questions (discussed below).

A second goal of this study was to determine the effect of question type on the pre-testing intervention. Both MC and EI questions were investigated. EI questions are believed to serve as “anchors” for students with poor comprehension ability. For example, Callender and McDaniel have demonstrated that EI questions are beneficial to poor structure builders when they are included in reading passages.39 Smith et al. have shown that EI questions aid students in an introductory biology course.58 Therefore, including EI questions in this study was a rational choice. We included MC questions because they are common in large-lecture scenarios, relatively straightforward to assess, and have been shown to promote productive retrieval processes.61 We mused that if MC questions yielded at least the same effect as EI questions, MC would be a logical instructional option, especially in large lecture classes, because these questions can be deployed online and automatically graded. The results presented in Table 8 and Figure 2 are clear: not only did EI questions fail to differentially aid students of poor comprehension ability, but they failed to be of benefit to any student. Given the body of literature supporting EI questions, this result was very surprising.

Why was a testing effect observed with MC, and not with EI? Callender and McDaniel provide possible explanations.39 First, they suggest that although relevant prior knowledge is activated by EI questions, it is also possible that irrelevant information remains activated, thus affecting the quality of the mental representation of the related information. The amount of information per EI question may have also been a factor. If too much information were associated with an EI question, there would be too much content through which to sift in order to answer the question. The amount of text for each reading assignment and therefore each EI question was not controlled in this study. Another possible explanation concerns feedback, or lack thereof. MC questions, by virtue of being scored automatically, provided immediate performance feedback to students. For EI questions, feedback was delayed by up to 1 week because of the sheer volume of EI questions that were passed to the course staff for scoring. Given that feedback is known to modify testing effects,82 EI questions may have elicited performance gains if students were provided immediate feedback. Such a scenario, however, was not possible in the context of this study.

Potential reasons exist for EI questions to have no effect on student performance; but, why would MC questions have a positive effect? Historically, it has been suggested that MC questions are not as useful as cued recall questions for learning because they promote recognition over the recall of information.39,83−87 However, McDaniel et al. recently documented that giving MC quizzes twice prior to final testing produced a testing effect as large as that obtained when recall quizzes were given.87 In addition, Little et al. demonstrated that MC questions trigger retrieval processes if the questions are created carefully.61 If so, then MC questions can allow for students to not only retrieve the correct answer and why it is correct, but also retrieve information pertaining to the distractors, as well as information related to why those distractors are incorrect. This is an advantage that other cued recall tests do not provide. We conclude our discussion by conjecturing that the patterns reported here are consistent with MC pre-testing (coupled with immediate feedback) serving as a formative assessment mechanism for students. Such a mechanism could have differentially aided low-skilled comprehenders, who cannot filter irrelevant information or make inferences as effectively as high-skilled comprehenders. MC pre-testing likely allowed students to identify content that warranted further review before each midterm exam was given. EI questions were not effective in this capacity; the lack of immediate feedback associated with these questions in our research scenario did not permit students to either gauge EI response accuracy or justify additional review of information related to EI questions. This conjecture is further suggested by the performance improvements produced by MC pre-testing relative to the no-test control. All treatment conditions were linked to LGs, which were made available to students who (presumably) used them for study and review purposes. Thus, students were made aware of possible exam content well before exam midterm exam. MC question feedback included connections to additional related information that simply studying a LG may not have afforded. It is possible that MC questions coupled with immediate feedback permitted low-skilled comprehenders to build inferences, an act not possible with either EI questions or simply studying information related to nontested LGs. While we cannot rule out the contribution of pure testing effects (i.e., retrieval of information promoting retention of the information), it is unlikely that they exclusively dominate the patterns observed here. If so, EI and MC questions would have yielded similar effects, given the unsupervised nature of the pre-test questions and the partially open-book mode in which students presumably completed the interventions.87,88 Regardless of the mechanism by which MC questions were effective, the present study coupled with prior work demonstrates that administering MC questions is a viable option for instructors who wish for students to benefit from the effects of testing.



LIMITATIONS Although this study possessed clear strengths, certain limitations of must be acknowledged so that the findings presented here can be properly interpreted. First, the general chemistry course that served as the setting of this study was taught by one of the authors; thus, the course and student sample used here were ones of convenience. Given the course’s nature as a one-semester accelerated general chemistry experience for engineers, the sample demographics and preparation for general chemistry were different from what is



CONCLUSIONS AND IMPLICATIONS

In this work, we successfully used models of language comprehension to build an effective and equitable intervention strategy into a general chemistry course. We designed a multiple pre-testing strategy in which MC questions (on average) provided the most benefit. Repeated testing with MC questions not only aided all students, but also differentially helped low-skilled comprehenders. We believe this intervention to be widely applicable beyond general chemistry, as low-skilled comprehenders will likely be a disadvantaged population in any classroom.

There are two main implications of this work for chemistry educators. First, as a handful of studies40,51,52,87 have done previously, we have demonstrated in an authentic classroom environment that testing can be used to enhance learning, and not just to assess learning. We believe that this work is the first such confirmation of testing-effect benefits in STEM. While frequent quizzing with a variety of question types is likely a common strategy in general chemistry courses, we encourage chemistry educators to adopt a “testing-for-learning” outlook, in addition to simply “testing-for-assessment”. Given this work, we advocate that the former mindset include testing students with question types for which immediate feedback is possible (e.g., MC instead of EI), as well as intentionally linking tests to summative assessment items via constructs such as learning goals (see the illustrative sketch at the end of this section). One might also consider encouraging multiple attempts in testing-for-learning activities; although we did not employ this strategy here, McDaniel and co-workers87 found similar learning gains in an experiment that permitted two attempts at online pre-tests. Further, it is likely that our results are not limited to online applications, as Glass and co-workers have reported that a mix of online and in-class quizzing with feedback improves subsequent test performance.96 Whether these results are applicable to in-class quizzing using autoresponse technology warrants further work, given mixed prior results97−99 and the potentially confounding effects of peer-to-peer interactions during quizzing events.99,100

The second implication of this work is that theory can indeed inform pedagogy. As Lewis and Lewis101 have stated, “... knowledge about the factors contributing to low (at-risk) performance can inform the design of interventions aimed toward reducing the challenges faced by these students... Ideally, the measure used for identification [of at-risk] students contains within itself implications for a potential remedy.” We have demonstrated here that language comprehension ability aligns well with these assertions: this construct can be measured with an instrument known to generate valid and reliable data,72 its relationship to general chemistry performance is both predictive and generalizable,32 and psychological models of this construct can be used to guide the implementation of effective and equitable classroom interventions that differentially aid low-skilled language comprehenders.
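To make the testing-for-learning recommendation concrete, the sketch below shows one way an online MC pre-test item might be tagged to a learning goal and scored with immediate, distractor-specific feedback. It is a minimal illustration under our own assumptions: the item content, the learning-goal label, and the MCItem and administer names are hypothetical and are not drawn from the course materials or quizzing software described above.

from dataclasses import dataclass

@dataclass
class MCItem:
    """One multiple-choice pre-test item tied to a course learning goal (hypothetical structure)."""
    learning_goal: str   # links the pre-test item to a summative assessment target
    stem: str            # the question text
    choices: dict        # option letter -> option text
    key: str             # correct option letter
    feedback: dict       # option letter -> feedback shown immediately after answering

def administer(item: MCItem, response: str) -> bool:
    """Score one response and print immediate feedback keyed to the chosen option."""
    print(item.stem)
    for letter, text in item.choices.items():
        print(f"  ({letter}) {text}")
    correct = response == item.key
    print(f"Your answer {response} is {'correct' if correct else 'incorrect'}.")
    print(item.feedback.get(response, "No feedback is available for that choice."))
    if not correct:
        print(f"Suggested review: {item.learning_goal}")
    return correct

# Hypothetical example item; the chemistry content is illustrative only.
item = MCItem(
    learning_goal="LG 2.3: Relate photon energy to wavelength",
    stem="Which photon carries more energy?",
    choices={"A": "A 400 nm photon", "B": "A 700 nm photon"},
    key="A",
    feedback={
        "A": "Correct: E = hc/wavelength, so the shorter-wavelength photon carries more energy.",
        "B": "Incorrect: photon energy is inversely proportional to wavelength, so a 400 nm photon carries more energy than a 700 nm photon.",
    },
)

administer(item, "B")

Writing distractor-specific feedback in this way is also where an instructor can target the common alternative conception each distractor represents, echoing the concept-inventory approach discussed in the Limitations section.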





ASSOCIATED CONTENT

Supporting Information

A description of how general chemistry course grades were calculated; a complete listing and discussion of LGs used in the design of this study; a commentary on controlling for LG difficulty in the randomized study design; a justification for combining data from the two semesters of the study; discussions of midterm exam content validity; and data analyses using three-level HLMs and the REML estimation method. This material is available via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author
*E-mail: [email protected].

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work is supported in part by a grant from the Davis Educational Foundation. The Foundation was established by Stanton and Elisabeth Davis after Mr. Davis’s retirement as chairman of Shaw’s Supermarkets, Inc. We also thank John Fornoff for his assistance with graphics for this manuscript.

REFERENCES

(1) Gosser, D. K.; Roth, V. The workshop chemistry project: Peer-led team-learning. J. Chem. Educ. 1998, 75 (2), 185−187. (2) Gosser, D. K.; Strozak, V. S.; Cracolice, M. S. Peer-Led Team Learning: Workshops for General Chemistry; Prentice Hall: Upper Saddle River, NJ, 2001. (3) Tien, L. T.; Roth, V.; Kampmeier, J. A. Implementation of a peer-led team learning instructional approach in an undergraduate organic chemistry course. J. Res. Sci. Teach. 2002, 39 (7), 606−632. (4) Báez-Galib, R.; Colón-Cruz, H.; Resto, W.; Rubi, M. R. Chem-2Chem: A one-to-one supportive learning environment for chemistry. J. Chem. Educ. 2005, 82 (12), 1859−1863. (5) Lewis, S. E.; Lewis, J. E. Departing from lectures: An evaluation of a peer-led guided inquiry alternative. J. Chem. Educ. 2005, 82 (1), 135−139. (6) Wamser, C. C. Peer-led team learning in organic chemistry: Effects on student performance, success, and persistence in college. J. Chem. Educ. 2006, 83 (10), 1562−1566. (7) Hockings, S. C.; DeAngeles, K. J.; Frey, R. F. Peer-led team learning in general chemistry: Implementation and evaluation. J. Chem. Educ. 2008, 85 (7), 990−996. (8) Lewis, S. E.; Lewis, J. E. Seeking effectiveness and equity in a large college chemistry course: An HLM investigation of peer-led guided inquiry. J. Res. Sci. Teach. 2008, 45 (7), 794−811. (9) Lewis, S. E. Retention and reform: An evaluation of peer-led team learning. J. Chem. Educ. 2011, 88 (6), 703−707. (10) Mitchell, Y. D.; Ippolito, J.; Lewis, S. E. Evaluating peer-led team learning across the two semester general chemistry sequence. Chem. Educ. Res. Pract. 2012, 13 (3), 378−383. (11) Farrell, J. J.; Moog, R. S.; Spencer, J. N. A guided-inquiry general chemistry course. J. Chem. Educ. 1999, 76 (4), 570−574. (12) Minderhout, V.; Loertscher, J. Lecture-free biochemistry. Biochem. Mol. Biol. Educ. 2007, 35 (3), 172−180. (13) Daubenmire, P. L.; Bunce, D. M. In Process Oriented Guided Inquiry Learning (POGIL); Moog, R. S., Spencer, J. N., Eds.; American Chemical Society: Washington, DC, 2008; pp 87−99. (14) Ruder, S. M.; Hunnicutt, S. S. In Process Oriented Guided Inquiry Learning (POGIL); Moog, R. S., Spencer, J. N., Eds.; American Chemical Society: Washington, DC, 2008; pp 133−147. (15) Straumanis, A.; Simons, E. A. In Process Oriented Guided Inquiry Learning (POGIL); Moog, R. S., Spencer, J. N., Eds.; American Chemical Society: Washington, DC, 2008; pp 226−239. (16) Rajan, N.; Marcus, L. Student attitudes and learning outcomes from process oriented guided-inquiry learning (POGIL) strategy in an introductory chemistry course for non-science majors: An action research study. Chem. Educ. 2009, 14 (2), 85−93. (17) Brown, P. J. P. Process-oriented guided-inquiry learning in an introductory anatomy and physiology course with a diverse student population. Adv. Physiol. Educ. 2010, 34 (3), 150−155. (18) Brown, S. D. A process-oriented guided inquiry approach to teaching medicinal chemistry. Am. J. Pharm. Educ. 2010, 74 (7), 1−6. (19) Chase, A.; Pakhira, D.; Stains, M. Implementing process-oriented, guided-inquiry learning for the first time: Adaptations and short-term impacts on students’ attitude and performance. J. Chem. Educ. 2013, 90 (4), 409−416. (20) Lynch, S. Equity and Science Education Reform; Lawrence Erlbaum Associates: Mahwah, NJ, 2000. (21) Graesser, A. C.; Millis, K. K.; Zwaan, R. A. Discourse comprehension. Annu. Rev. Psychol. 1997, 48 (1), 163−189.


(22) Graesser, A. C.; McNamara, D. S. Computational analyses of multilevel discourse comprehension. Top. Cognit. Sci. 2011, 3 (2), 371−398. (23) McNamara, D. S.; Graesser, A. C.; McCarthy, P. A.; Cai, Z. Automated Evaluation of Text and Discourse with Coh-Metrix; Cambridge University Press: New York, NY, 2014; pp 14−17. (24) McNamara, D. S.; Magliano, J. In The Psychology of Learning and Motivation; Ross, B., Ed.; Academic Press: New York, NY, 2009; pp 298−384. (25) Gernsbacher, M. A. Language Comprehension As Structure Building; Lawrence Erlbaum Associates: Hillsdale, NJ, 1990. (26) Gernsbacher, M. A.; Varner, K. R.; Faust, M. E. Investigating differences in general comprehension skill. J. Exp. Psychol. Learn. Mem. Cogn. 1990, 16 (3), 430−445. (27) Graesser, A. C.; McNamara, D. S.; Van Lehn, K. Scaffolding deep comprehension strategies through Point&Query, AutoTutor, and iSTART. Educ. Psychol. 2005, 40 (4), 225−234. (28) McNamara, D. S. The importance of teaching reading strategies. Perspect. Language Literacy 2009, 35 (2), 34−40. (29) McNamara, D. S. Strategies to read and learn: Overcoming learning by consumption. Med. Educ. 2010, 44 (44), 340−346. (30) Azevedo, R. In Recent Innovations in Educational Technology That Facilitate Student Learning; Robinson, D., Schraw, G., Eds.; Information Age Publishing: Charlotte, NC, 2008; pp 127−156. (31) Pyburn, D. T.; Pazicni, S. Applying the multilevel framework of discourse comprehension to evaluate the text characteristics of general chemistry texts. J. Chem. Educ. 2014, 91 (6), 778−783. (32) Pyburn, D. T.; Pazicni, S.; Benassi, V. A.; Tappin, E. E. Assessing the relation between language comprehension and performance in general chemistry. Chem. Educ. Res. Pract. 2013, 14 (4), 524−541. (33) Mayer, R. E. Multimedia Learning; Cambridge University Press: New York, NY, 2001; pp 41−62. (34) Mayer, R. E.; Wittrock, M. C. Problem Solving. In Handbook of Educational Psychology, 2nd ed.; Alexander, P. A., Winne, P. H., Eds.; Routledge: New York, NY, 2009; pp 287−303. (35) Johnstone, A. H. Chemistry teachingScience or alchemy? J. Chem. Educ. 1997, 74 (3), 262−268. (36) Tsaparlis, G. Atomic and molecular structure in chemical education: A critical analysis from various perspectives of science education. J. Chem. Educ. 1997, 74 (8), 922−925. (37) Gabel, D. Improving teaching and learning through chemistry education research: A look to the future. J. Chem. Educ. 1999, 76 (4), 548−554. (38) Johnstone, A. H. You can’t get there from here. J. Chem. Educ. 2010, 87 (1), 22−29. (39) Callender, A. A.; McDaniel, M. A. The benefits of embedded questions for low and high structure builders. J. Educ. Psychol. 2007, 99 (2), 339−348. (40) McDaniel, M. A.; Anderson, J. L.; Derbish, M. H.; Morrisette, N. Testing the testing effect in the classroom. Eur. J. Cogn. Psychol. 2007, 19 (4/5), 494−513. (41) Callender, A. A.; McDaniel, M. A. The limited benefits of rereading educational texts. Contemp. Educ. Psychol. 2009, 34 (1), 30− 41. (42) Roediger, H. L.; Karpicke, J. D. The power of testing memory: Basic research and implications for educational practice. Perspect. Psychol. Sci. 2006, 1 (3), 181−210. (43) Hanawalt, N. G.; Tarr, A. G. The effect of recall on recognition. J. Exp. Psychol. 1961, 62 (4), 361−367. (44) Darley, C. F.; Murdock, B. B. Effects of prior free recall testing on final recall and recognition. J. Exp. Psychol. 1971, 91 (1), 66−73. (45) Hogan, R. M.; Kintsch, W. Differential effects of study and test trials on long-term recognition and recall. J. 
Verbal Learn. Verbal Behav. 1971, 10 (5), 562−567. (46) Bartlett, J. C. Effects of immediate testing on delayed retrieval: Search and recovery operations with four types of cue. J. Exp. Psychol.: Hum. Learn. Mem. 1977, 3 (6), 719−732. (47) Whitten, W. B.; Bjork, R. A. Learning from tests: Effects of spacing. J. Verbal Learn. Verb. Behav. 1977, 16 (4), 465−478.

(48) Masson, M. E. J.; McDaniel, M. A. The role of organization processes in long-term retention. J. Exp. Psychol.: Hum. Learn. Mem. 1981, 7 (2), 100−110. (49) McDaniel, M. A.; Masson, M. E. J. Altering memory representations through retrieval. J. Exp. Psychol. Learn. Mem. Cogn. 1985, 11 (2), 371−385. (50) McDaniel, M. A.; Kowitz, M. D.; Dunay, P. K. Altering memory through recall: The effects of cue-guided retrieval processing. Mem. Cognit. 1989, 17 (4), 423−434. (51) Rohrer, D.; Taylor, K.; Sholar, B. Tests enhance the transfer of learning. J. Exp. Psychol.: Learn. Mem. Cogn. 2010, 36 (1), 233−239. (52) Roediger, H. L.; Agarwal, P. K.; McDaniel, M. A.; McDermott, K. B. Test-enhanced learning in the classroom: Long-term improvements from quizzing. J. Exp. Psychol.: Appl. 2011, 17 (4), 382−395. (53) Pressley, M.; McDaniel, M. A.; Turnure, J. E.; Wood, E.; Ahmed, M. Generation and precision of elaboration: Effects on intentional and incidental learning. J. Exp. Psychol.: Learn. Mem. Cogn. 1987, 13 (2), 291−300. (54) McDaniel, M. A.; Donnelly, C. M. Learning with analogy and elaborative interrogation. J. Educ. Psychol. 1996, 88 (3), 508−519. (55) Boudreau, R. L.; Wood, E.; Willoughby, T.; Specht, J. Evaluating the efficacy of elaborative strategies for remembering expository text. Alberta J. Educ. Res. 1999, 45 (2), 170−183. (56) Ozgungor, S.; Guthrie, J. T. Interactions among elaborative interrogation, knowledge, and interest in the process of constructing knowledge from text. J. Educ. Psychol. 2004, 96 (3), 437−443. (57) Ramsay, C. M.; Sperling, R. A.; Dornisch, M. M. A comparison of the effects of students’ expository text comprehension strategies. Instr. Sci. 2010, 38 (6), 551−570. (58) Smith, B. L.; Holliday, W. G.; Austin, H. W. Students’ comprehension of science textbooks using a question-based reading strategy. J. Res. Sci. Teach. 2010, 47 (4), 363−379. (59) Wood, E.; Willoughby, T.; McDermott, C.; Motz, M.; Kaspar, V.; Ducharme, M. Developmental differences in study behavior. J. Educ. Psychol. 1999, 91 (3), 527−536. (60) McDaniel, M. A.; Agarwal, P. K.; Huelser, B. J.; McDermott, K. B.; Roediger, H. L. Test-enhanced learning in a middle school science classroom: The effects of quiz frequency and placement. J. Educ. Psychol. 2011, 103 (2), 399−414. (61) Little, J. L.; Bjork, E. L.; Bjork, R. A.; Angello, G. Multiplechoice tests exonerated, at least of some charges: Fostering testinduced learning and avoiding test-induced forgetting. Psychol. Sci. 2012, 23 (11), 1337−1344. (62) Bereiter, C.; Bird, M. Use of thinking aloud in identification and teaching of reading comprehension strategies. Cogn. Instr. 1985, 2 (2), 131−156. (63) King, A.; Rosenshine, B. Effects of guided cooperativequestioning on children’s knowledge construction. J. Exp. Educ. 1993, 61 (2), 127−148. (64) Reading Comprehension Strategies: Theory, Interventions, And Technologies; McNamara, D. S., Ed.; Lawrence Erlbaum Associates: Mahwah, NJ, 2007. (65) Palinscar, A. S.; Brown, A. L. Reciprocal teaching of comprehension-fostering and comprehension monitoring activities. Cogn. Inst. 1984, 1 (2), 117−175. (66) Krajcik, J.; McNeill, K. L.; Reiser, B. J. Learning-goals-driven design model: Developing curriculum materials that align with national standards and incorporate project-based pedagogy. Sci. Educ. 2008, 92 (1), 1−32. (67) Pyburn, D. T. Investigations of language comprehension and chemistry education. Ph.D. Thesis, University of New Hampshire, Durham, NH, 2014. (68) Fleiss, J. L. 
Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76 (5), 378−382. (69) Landis, J. R.; Koch, G. G. The measurement of observer agreement for categorical data. Biometrics 1977, 33 (1), 159−174. (70) Taxonomy of Educational Objectives: The Classification of Educational Goals. Handbook 1: Cognitive Domain; Bloom, B. S., Ed.; David McKay Company, Inc.: New York, 1956.


(71) Educational Testing Service. Critical Reading Section. http:// professionals.collegeboard.com/testing/sat-reasoning/about/sections/ critical-reading (accessed Aug 2014). (72) Ewing, M.; Huff, K.; Andrews, M.; King, K. Assessing the Reliability of Skills Measured by the SAT; The College Board Office of Research and Analysis: New York, NY, 2005. (73) Hovey, N. W.; Krohn, A. An evaluation of the Toledo chemistry placement examination. J. Chem. Educ. 1963, 40 (7), 370. (74) American Chemical Society Division of Chemical Education Exams Institute. Composite Norms for the Toledo Examination 2009. http://chemexams.chem.iastate.edu/national-norms/tp09.html (accessed Aug 2014). (75) Cohen, J.; Cohen, P.; West, S. G.; Aiken, L. S. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed: Lawrence Erlbaum Associates: Mahwah, NJ, 2003; pp 41−42. (76) Raudenbush, S. W.; Bryk, A. S. Hierarchical Linear Models: Applications and Data Analysis Methods, 2nd ed.; Sage Publications: Thousand Oaks, CA, 2002. (77) Pazicni, S.; Bauer, C. F. Characterizing illusions of competence in introductory general chemistry students. Chem. Educ. Res. Pract. 2014, 15 (1), 24−34. (78) Villafañe, S. M.; Garcia, C. A.; Lewis, J. E. Exploring diverse students’ trends in chemistry self-efficacy throughout a semester of college-level preparatory chemistry. Chem. Educ. Res. Pract. 2014, 15 (2), 114−127. (79) We note here one important difference in the results obtained using ML vs REML. Using REML to calculate intraclass correlations resulted in a substantially larger intraclass correlation (ρ = 0.18) for the third (between-classrooms) level of our nested data set. Unlike the results obtained using ML presented above, this intraclass correlation suggested that a three-level HLM was indeed appropriate. However, when the data were analyzed with a three-level HLM using the REML algorithm (Supporting Information, Section S7), results were similar to those obtained with a two-level HLM using the ML algorithm. Thus, for the sake of parsimony, we present here the results obtained with a two-level HLM using the ML algorithm. (80) Tabachnick, B. G.; Fidell, L. S. Using Multivariate Statistics, 6th ed.; Pearson: Boston, 2013. (81) Pazicni, S.; Pyburn, D. T. Intervening on behalf of low-skilled comprehenders in a university general chemistry course. In Applying Science of Learning in Education: Infusing Psychological Science into the Curriculum [Online]; Benassi, V. A.; Overson, C. E.; Hakala, C. M., Eds.; Society for the Teaching of Psychology, 2014; pp 279−292. http://teachpsych.org/ebooks/asle2014/index.php (accessed Aug 2014). (82) Kang, S. H. K.; McDermott, K. B.; Roediger, H. L. Test format and corrective feedback modify the effect of testing on long-term retention. Eur. J. Cogn. Psychol. 2007, 19 (4/5), 528−558. (83) Anderson, R. C.; Biddle, W. B. In The Psychology of Learning and Motivation; Bower, G. H., Ed.; Academic Press: New York, NY, 1975; pp 89−132. (84) Duchastel, P. C. Retention of prose following testing with different types of test. Contemp. Educ. Psychol. 1981, 6 (3), 217−226. (85) Hamaker, C. The effects of adjunct question on prose learning. Rev. Educ. Res. 1986, 56 (2), 212−242. (86) Foos, P. W.; Fisher, R. P. Using tests as learning opportunities. J. Educ. Psychol. 1988, 80 (2), 179−183. (87) McDaniel, M. A.; Wildman, K. M.; Anderson, J. L. Using quizzes to enhance summative-assessment performance in a web-based class: An experimental study. J. Appl. Res. Mem. Cogn. 2012, 1 (1), 18−26. 
(88) Agarwal, P. K.; Karpicke, J. D.; Kang, S. H. K.; Roediger, H. L.; McDermott, K. B. Examining the testing effect with open- and closedbook tests. Appl. Cogn. Psychol. 2008, 22 (7), 861−876. (89) Pintrich, P. R.; DeGroot, E. V. Motivational and self-regulated learning components of classroom academic performance. J. Educ. Psychol. 1990, 82 (1), 33−40. (90) Bauer, C. F. Attitude towards chemistry: A semantic differential instrument for assessing curriculum impacts. J. Chem. Educ. 2008, 85 (10), 1440−1445.

(91) Treagust, D. F. Development and use of diagnostic tests to evaluate students’ misconceptions in science. Int. J. Sci. Educ 1988, 10 (2), 159−169. (92) Peterson, R. F.; Treagust, D. F. Grade-12 students’ misconceptions of covalent bonding and structure. J. Chem. Educ. 1989, 66 (6), 459−460. (93) National Research Council. Knowing What Students Know: The Science and Design of Educational Assessment; Pellegrino, J. W., Chudowsky, N., Glasser, R., Eds.; The National Academies Press: Washington, DC, 2001; pp 1−14. (94) Bretz, S. L.; Linenberger, K. J. Development of the enzymesubstrate interactions concept inventory. Biochem. Mol. Biol. Educ. 2012, 40 (4), 229−233. (95) Luxford, C. J.; Bretz, S. L. Development of bonding representations inventory to identify student misconceptions about covalent and ionic bonding representations. J. Chem. Educ. 2014, 91 (3), 312−320. (96) Glass, A. L.; Brill, G.; Ingate, M. Combined online and in-class pretesting improves exam performance in general psychology. Educ. Psychol. 2008, 28 (5), 483−503. (97) Bunce, D. M.; VandenPlas, J. R.; Havanki, K. L. Comparing the effectiveness on student achievement of a student response system versus online WebCT quizzes. J. Chem. Educ. 2006, 83 (3), 488−493. (98) Mayer, R. E.; Stull, A.; DeLeeuw, K.; Almeroth, K.; Bimber, B.; Chun, D.; Bulger, M.; Campbell, J.; Knight, A.; Zhang, H. Clickers in college classrooms: Fostering learning with questioning methods in large lecture classes. Contemp. Educ. Psychol. 2009, 34 (1), 51−57. (99) MacArthur, J. R.; Jones, L. L. A review of literature reports of clickers applicable to college chemistry classrooms. Chem. Educ. Res. Pract. 2008, 9 (3), 187−195. (100) Smith, M. K.; Wood, W. B.; Adams, W. K.; Wieman, C.; Knight, J. K.; Guild, N.; Su, T. T. Why peer discussion improves student performance on in-class concept questions. Science 2009, 323 (5910), 122−124. (101) Lewis, S. E.; Lewis, J. E. Predicting at-risk students in general chemistry: Comparing formal thought to a general achievement measure. Chem. Educ. Res. Pract. 2007, 8 (1), 32−51.
