
Using Rasch Measurement To Develop a Computer Modeling-Based Instrument To Assess Students’ Conceptual Understanding of Matter

Silin Wei,† Xiufeng Liu,*,‡ Zuhao Wang,§ and Xingqiao Wang§

†College of Material, Chemistry, and Chemical Engineering, Hangzhou Normal University, Hangzhou, Zhejiang Province 310036, China
‡Department of Learning and Instruction, State University of New York at Buffalo, Buffalo, New York 14260-1000, United States
§Department of Chemistry, East China Normal University, Shanghai 200062, China

ABSTRACT: Research suggests that difficulty in making connections among three levels of chemical representation (macroscopic, submicroscopic, and symbolic) is a primary reason for students’ alternative conceptions of chemistry concepts, and computer modeling is promising for helping students make these connections. However, no computer modeling-based assessment tools are currently available. This article introduces an approach to using Rasch measurement to develop a computer modeling-based instrument to assess students’ conceptual understanding of matter. The instrument contained 15 multiple-choice and 3 constructed-response questions. Two pilot tests were carried out in high schools in both the United States and China. In pilot test I, the instrument was given to 112 students. The partial-credit Rasch model was applied to the data to examine item and test properties. The instrument was then revised and given to 403 students in pilot test II. The partial-credit Rasch model was applied again to the data, and noticeable improvements were achieved in the technical qualities of both the items and the instrument as a whole. This study demonstrates that using Rasch measurement to develop computer modeling-based measurement instruments in chemistry is promising.

KEYWORDS: High School/Introductory Chemistry, Chemical Education Research, Computer-Based Learning, Testing/Assessment, Learning Theories

FEATURE: Chemical Education Research



INTRODUCTION

One consensus from science education research is that students come to the science classroom with many initial ideas that are different from those of scientists.1 In chemistry education, considerable research in the last four decades has been devoted to identifying and classifying students’ misconceptions and alternative conceptions.2 A wide range of misconceptions and alternative conceptions about basic chemistry concepts held by students from elementary school to university have been identified. Researchers have also been studying the sources of misconceptions and alternative conceptions.3 One commonly agreed primary source of student misconceptions and alternative conceptions is the inability of students to make connections among different levels of representations.4−7 This agreement originated in work by Johnstone,8,9 who suggested that there are three domains of chemistry:

1. Macroscopic: comprising tangible and visible phenomena, which may or may not be part of students’ everyday experiences
2. Submicroscopic: comprising particulate-level ideas, which can be used to describe the movement of electrons, molecules, and atoms
3. Symbolic: comprising a large variety of pictorial representations, chemical symbols, equations, and algebraic and computational forms

Connections among the three domains are one of the essential characteristics of chemistry and chemistry learning.

Chemists seamlessly move among various representations of the domains; however, students generally have considerable difficulty doing so. Research indicates that a major difference between novices (e.g., high school students) and experts (i.e., chemistry teachers and chemists) is in the ability to engage in multiple representations simultaneously.6 The inability to connect different chemical representations has been shown to limit students’ development of a strong conceptual understanding of chemical phenomena and concepts.10,11 To help students overcome this difficulty, chemical educators have been using computer models to help students make connections among the three representations.10,12,13 Research has found that, compared to chalkboards, computer modeling (e.g., animations, simulations) can make the submicroscopic representation dynamic, visual, and interactive, which helps students understand submicroscopic behaviors and construct more scientific conceptions.11 While chemistry teaching and learning based on computer models and modeling has shown promise for improving student understanding of chemistry concepts, assessment of student conceptual understanding involving computer models and modeling has lagged behind,
because no computer modeling-based instruments are currently available to assess students’ conceptual understanding of chemistry concepts. The need for developing computer modeling-based measurement instruments is both theoretical and pedagogical. Theoretically, a fundamental requirement for valid assessment is agreement between the measurement task and the cognitive model that informs teaching and learning (ref 14, pp 44−51). That is, if students develop their understanding of chemistry concepts through computer models and modeling, measurement tasks should also be presented in the same context of computer models and modeling. Pedagogically, when assessment and instruction are aligned with each other, they are more likely to provide a synergy that increases student learning. The purpose of this paper is to introduce the use of Rasch measurement to develop a computer modeling-based measurement instrument, using matter as an example. The specific questions to be answered by this paper are as follows:

1. What is the typical process of developing a computer modeling-based measurement instrument using Rasch measurement?
2. What validity and reliability evidence can Rasch measurement provide to support the use of a computer modeling-based measurement instrument?

RASCH MEASUREMENT

Rasch measurement refers to a measurement theory based on an equation originally developed by a Danish mathematician, Georg Rasch.15 Rasch believed that when a person responds to a test item, a mathematical relationship governs the probability of the person correctly answering that particular test item. Specifically, for any item i with a difficulty Di that can be scored as right (X = 1) or wrong (X = 0), the probability (Pni) of a person n with an ability Bn answering the item correctly can be expressed as

\[ P(X = 1 \mid B_n, D_i) = \frac{e^{(B_n - D_i)}}{1 + e^{(B_n - D_i)}} \qquad (1) \]

The above equation can be rewritten as follows:

\[ \ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i \qquad (2) \]

Equation 2 is the well-known Rasch model for dichotomously scored items. The Rasch model establishes how the probability for a person to answer an item correctly is determined by the difference between that person’s latent ability and the item’s difficulty: the bigger the difference, the more likely the person will answer the item correctly. Variations of this Rasch model have been applied to other test item formats (e.g., rating scales), resulting in different Rasch models (e.g., the partial-credit Rasch model).

Bn and Di in the Rasch model have two important properties. First and foremost, Bn and Di are on a true interval scale; they range from −∞ to +∞. Second, Bn and Di are mutually independent. These highly desirable properties of Rasch measures provide benefits over classical test theory. In classical test theory, both item difficulty and student ability are based on the percentage of correct responses. Percentage-correct statistics are not strictly on an interval scale because of ceiling and floor effects (i.e., the highest is 100% and the lowest is 0%). They are also dependent on the test and the group of students taking it. For example, a test is easier for students who are more able, and a student’s score is lower on a harder test. Therefore, in classical test theory, item difficulty and person ability are mutually dependent. One consequence of this dependency is that a measurement instrument developed using classical test theory needs to be revalidated whenever the target sample is different from the original validation sample, a very common scenario. Rasch measurement overcomes this dependence problem, because item difficulties, and thus the test difficulty, remain invariant no matter what sample is involved in the initial validation.

These desirable properties of Rasch parameter estimates (Bn and Di) come with strong requirements on data, and thus on the items of the measurement instrument producing the data. There are two essential and closely related requirements of Rasch measurement: unidimensionality and local independence. The unidimensionality requirement states that only one latent trait underlies item responses. The local independence requirement states that the correlation between item responses is due solely to the examinees’ latent abilities; thus, when the latent ability is controlled, no statistically significant correlation exists between item responses. Local independence is a necessary but not sufficient condition for unidimensionality. In order to meet the above requirements, data must fit the model, that is, the expectation of model-data-fit. A satisfactory model-data-fit must exist in order for the ability and item difficulty parameter estimates to be trustworthy. The process of using Rasch measurement to develop a measurement instrument is to go through cycles of development and revision so that student responses to questions (the data) will fit the Rasch model. It is a systematic process in which items are purposefully constructed according to a hypothesized theory and empirically tested by applying a Rasch model, resulting in a set of items that meet the requirements of the Rasch model. Using Rasch measurement to develop measurement instruments entails the following 10 steps:16

1. Define the construct that can be characterized by a linear trait.
2. Identify the behaviors corresponding to different levels of the defined construct.
3. Define the outcome space of student behaviors.
4. Pilot test with a representative sample of the target population.
5. Apply the Rasch model.
6. Review item fit statistics and revise items if necessary.
7. Review the Wright map and add or delete items if necessary.
8. Repeat steps 4−7 until a set of items fit the Rasch model and define a scale.
9. Establish validity and reliability claims for the measurement instrument.
10. Develop documentation for the measurement instrument.
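To make the relationship in eqs 1 and 2 concrete, the short Python sketch below computes the model probability of a correct response for a few ability and difficulty values. The specific Bn and Di values are invented for illustration; they are not estimates from this study.

```python
import math

def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model (eq 1)."""
    logit = ability - difficulty          # Bn - Di, in logits
    return math.exp(logit) / (1.0 + math.exp(logit))

# Illustrative values only: at Bn = Di the probability is 0.50, and a person
# 1 logit above an item's difficulty answers correctly about 73% of the time.
for ability, difficulty in [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.5)]:
    p = rasch_probability(ability, difficulty)
    print(f"Bn = {ability:+.1f}, Di = {difficulty:+.1f} -> P(correct) = {p:.2f}")
```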

The following sections describe the process of developing a computer modeling-based measurement instrument on matter in order to demonstrate the application of these steps.


DEVELOPING A COMPUTER MODELING-BASED MEASUREMENT INSTRUMENT ON MATTER

Define the Construct

In defining the construct to be measured, one important consideration is that the construct has a theoretically unidimensional trait from a lower level to a higher level.17−19 The construct to be measured in this study is understanding of matter, and the linear trait underlying this construct is the learning progression on understanding of matter.20,21 Learning progression assumes that there are various levels of understanding associated with a science concept, and students progress through these levels to increase their understanding. Research in the past has identified a few progression patterns of students’ understanding of matter.22−25 Based on this literature, we hypothesized a learning progression of matter understanding specifically for high school students that includes six levels (Figure 1). Because the computer modeling-based measurement instrument to be developed is intended for students in grades 9−10 who are at the beginning of chemistry study, only levels 1−3 were relevant to this study. The learning progression shown in Figure 1 combines four aspects of matter conceptions (i.e., composition and structure of matter; physical properties and change; chemical properties and change; and conservation) and involves students’ understanding of three chemical representations (i.e., macroscopic, submicroscopic, and symbolic). To further elaborate the learning progression to guide our development of the measurement instrument, we developed a description for each level that contains more detailed student understandings (Table 1). Table 1 presents the three levels of understanding for the four aspects of matter on which students develop understanding in chemistry. Figure 1 and Table 1 present a framework, or theory, of how students may progress in understanding matter. This framework acts as a guide for the subsequent steps of our study, including developing the computer model, the assessment items, and the scoring rubrics of the constructed-response questions.

Figure 1. High school students’ learning progressions on matter.

Table 1. Learning Progressions of Matter

Level 1: Describing, illustrating, and expressing macroscopic ideas on the particulate structure of matter
- Composition: Macroscopic, perceptual, intuitive: subject, material, matter; ingredients, origin, or sources; substance, element, compound, pure substance, mixture; macroscopic, stable, or motion continuous
- Physical property or change: Macroscopic, observable: temperature, pressure, volume, mass, color, shape, smell, concentration; perceptual property (e.g., hard, dense, thick, rough, soluble); phase and phase change; separating, mixing, dissolving, motion, volatility
- Chemical property or change: Macroscopic: perceptual change (e.g., disappearance, color change, mass change, smell); transformation of matter; new substance produced
- Conservation: Macroscopic: matter can be transformed, but not created or destroyed; conservation of mass in physical change (e.g., state change, shape change, temperature change, pressure change, separating a mixture or mixing)
- Representation: Mostly macroscopic representation and everyday language, simple chemical symbols: describing things and change by macroscopic observation, experiences, and intuition; preference for using everyday language

Level 2: Recognizing the particulate nature of matter in terms of properties and changes of matter
- Composition: Macroscopic, preliminary submicroscopic: small particles (e.g., dot, drop, chunk, grain, molecule, atom); element, compound, mixture, elements in pure substance; stable vs motion, continuous vs discontinuous
- Physical property or change: Macroscopic, preliminary submicroscopic: physical change (e.g., temperature, pressure, dissolving, evaporation) and particle change (e.g., motion, distance, volume, mass, color, particles’ shape, particles’ structure)
- Chemical property or change: Preliminary submicroscopic: new substance produced; new molecule produced; molecule changes its structure
- Conservation: Preliminary submicroscopic: conservation of mass in chemical change; conservation of total mass of particles in a chemical reaction system
- Representation: Macroscopic and preliminary submicroscopic representation, simple scientific terms: macroscopic and crude submicroscopic descriptions (particulate nature of matter); inaccurate usage of chemical language (simple, incorrect, or without essential understanding)

Level 3: Recognizing, describing, presenting, and differentiating particles, including atoms, molecules, and ions
- Composition: Submicroscopic, abstract, invisible: molecule, atom, ion, element; proton, neutron, electron; structure of molecule, atom; motion, discontinuous; atomic number, mass number of molecule/atom
- Physical property or change: Submicroscopic: physical change and molecule, atom, ion, proton, neutron, electron change
- Chemical property or change: Submicroscopic: chemical change and molecule, atom, and ion change; electron loss or gain; electron configuration change
- Conservation: Submicroscopic: conservation of atoms; conservation of elements; conservation of ions (e.g., in solution); conservation of electrons, protons, neutrons
- Representation: Macroscopic, submicroscopic, and symbolic representation: connecting macroscopic observation with submicroscopic explanations (particle models and theories) and chemical language


Figure 2. (A) Interface window; (B) information window of chemical reaction model.

Table 2. Correspondence between Items and Levels of Understanding

Level 3
  Pilot test I, multiple choice: T1Q13, T1Q14, T1Q15, T1Q16, T1Q17
  Pilot test II, multiple choice: T2Q11, T2Q12, T2Q13, T2Q14, T2Q15
Level 2
  Pilot test I, multiple choice: T1Q7, T1Q8, T1Q9, T1Q10, T1Q11
  Pilot test II, multiple choice: T2Q6, T2Q7, T2Q8, T2Q9, T2Q10
Level 1
  Pilot test I, multiple choice: T1Q1, T1Q2, T1Q3, T1Q4, T1Q5
  Pilot test II, multiple choice: T2Q1, T2Q2, T2Q3, T2Q4, T2Q5
Constructed response
  Pilot test I: T1Q6, T1Q12, T1Q18
  Pilot test II: T2Q16, T2Q17, T2Q18

Identify the Behaviors of the Defined Construct

Once the construct is defined in terms of a progression, the next step is to decide on the type of items used to solicit examinees’ responses. Because our focus in the present study is to develop a computer modeling-based measurement instrument, assessment questions must be based on computer modeling; namely, students must interact with computer models in order to answer questions. We decided to use NetLogo26 to develop a computer model. NetLogo is powerful for representing submicroscopic behaviors (e.g., molecular-level interactions) and building connections among the three chemical representations (i.e., macroscopic, submicroscopic, and symbolic). It is a programmable multiagent modeling environment, particularly well suited for modeling complex systems over time. We developed a NetLogo model of chemical reactions as the context for assessing students’ understanding of matter. Figure 2A shows the computer model interface. The model simulates a chemical reaction between oxygen and hydrogen. In the interface, students can change the number of reagents and the temperature by manipulating the sliders and then observe what happens in different mini windows of the interface. The agent window, which shows the random movement and interaction of reactants to produce products, represents a submicroscopic perspective of the chemical reaction system. The monitors and the plotting window, which show the real-time change in the numbers of reactants and product in the system, relate to a student’s hands-on lab (for example, combustion of hydrogen in air or oxygen) and represent a macroscopic perspective of the chemical reaction system. The symbolic representation is provided in the information window (Figure 2B). This window describes the rationale behind the chemical reaction, for example, the chemical reaction equation. The NetLogo model can be run as a Java applet inside a Web browser, which makes it easier for students to access and run.

A set of instructions for manipulating the model was also provided to facilitate students’ exploration of the model. Before students answered the assessment questions, they needed to explore the computer model either individually or in pairs. On average, it took students approximately 15−20 min to become familiar with the model. It is important to note that the computer model developed in the present study is intended to provide a context for the assessment questions, not to help students make connections among the three chemical representations. In other words, the computer model is simply a component of the measurement instrument; it is not a teaching or learning tool. After students interacted with the computer model, they were asked to respond to assessment questions on paper. These questions included 15 multiple-choice and 3 constructed-response questions. The 15 multiple-choice questions were created to target the first three levels of understanding matter in Figure 1, with 5 questions per level. Each question related to one aspect of matter (e.g., composition and structure, physical property and change, chemical property and change, and conservation). Table 2 shows the correspondence between items and levels of understanding. The three constructed-response items were included to provide students with a different mode in which to demonstrate their understanding. These 18 items were modified and further improved after the first cycle of pilot testing, discussed subsequently. Box 1 shows two sample questions.

Define Outcome Space of Items

Once a test specification was defined, an initial item pool was created, and item scoring keys and rubrics were developed. These items and their scoring keys and rubrics defined the outcome space of the items. The initial items formed a draft measurement instrument for pilot testing. After the first cycle of
pilot testing, the scoring keys and rubrics were revised (see subsequent discussion).

Conduct Pilot Testing

Pilot testing is administering the draft measurement instrument to a representative sample of the target population so that data can be collected for Rasch analysis. Although a random sample from the population is ideal, what is important for Rasch measurement is the spread of examinees along the measured construct. That is, an important consideration is to ensure that the range of examinees’ abilities on the measured construct matches the range of item difficulties. The measurement instrument was initially developed in Chinese. Because the intention was to develop this instrument for use both in China and in the United States so that comparison studies could be conducted, the Chinese instrument was translated into English by the first author, who is fluent in both Chinese and English (first language is Chinese; studied in the United States for close to three years). The second author, who is also fluent in both Chinese and English (first language is Chinese; has been teaching in Canada and the United States for more than 20 years), reviewed the translation to ensure accuracy. During pilot test I, the first draft of the measurement instrument was given to 112 high school students in 5 classes (grades 9 to 11) in June 2009 in both the United States and China (the Chinese version was given to Chinese students and the English version to the U.S. students). Three classes were from the United States (n = 56) and two classes from China (n = 56). After applying the Rasch model to these data, some items were revised; new items were also added and some items were removed. During pilot test II, the revised version of the measurement instrument was given to eight classes of high school students (three classes with 114 students in grade 9 and five classes with 216 students in grade 10) in China in October 2009 (n = 330) and to four classes of high school students in grades 10−12 in the United States in spring 2010 (n = 73). Students in each class in China and the United States were in grade-level classes and had varied ability levels.

Apply the Rasch Model

Because the assessment questions included both multiple-choice and constructed-response questions, the partial-credit Rasch model27 was used. The model takes the following form:

\[ \ln\!\left(\frac{P_{nik}}{1 - P_{nik}}\right) = B_n - D_{ik} \qquad (3) \]

where Pnik is the probability of student n with ability Bn responding successfully at level k of item i, and Dik is the difficulty of level k of item i, that is, the step or threshold for an examinee receiving score k instead of k − 1.

After the pilot testing, examinees’ responses to items were entered into the computer, and Rasch analysis began. Several computer programs developed specifically for Rasch analysis are available, such as Winsteps, Quest/ConQuest, and RUMM. Bond and Fox17 state that (ref 17, p 300) “[a]lthough each of the Rasch software programs has its own disciples, no one program incorporates the advantages of all the packages, or avoids the tolerable shortcomings of its own particular estimation procedure.... The usual Rasch analytical procedures ... produce estimates that are generally equivalent, for all practical analytical purposes.” The Winsteps (version 3.42) program,28 a popular Rasch software program with many diagnostics, was used to conduct the Rasch analysis in the present study.
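As a small illustration of how the step difficulties Dik in eq 3 translate into category probabilities, the sketch below uses the cumulative (Masters27) formulation of the partial-credit model. The step values are hypothetical, not estimates from this study.

```python
import math

def pcm_category_probabilities(ability: float, thresholds: list[float]) -> list[float]:
    """Category probabilities for one polytomous item under the partial-credit model.

    thresholds[k-1] is the step difficulty Dik for moving from score k-1 to score k.
    Returns probabilities for scores 0..len(thresholds).
    """
    # Cumulative sums of (Bn - Dik); the empty sum for score 0 is 0.
    cumulative = [0.0]
    for d in thresholds:
        cumulative.append(cumulative[-1] + (ability - d))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]

# Hypothetical step difficulties for a 0-3 constructed-response item.
steps = [-1.0, 0.2, 1.5]
for score, p in enumerate(pcm_category_probabilities(ability=0.5, thresholds=steps)):
    print(f"P(score = {score}) = {p:.2f}")
```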

Iterations of Revision, Pilot Testing, and Applying the Rasch Model

Revisions to items and to the instrument are based on Rasch analysis results, including item fit statistics, differential item functioning, item category structure, the person−item map, and dimensionality. Common item fit statistics include the mean square residual (MNSQ) and the standardized mean square residual (ZSTD). Both MNSQ and ZSTD are based on the difference between what is observed and what is expected by the Rasch model. MNSQ is a simple squared residual, while ZSTD is a normalized t score of the residual. There are two ways to sum MNSQs and ZSTDs over all persons for each item, which produce four fit statistics. INFIT statistics (INFIT MNSQ and INFIT ZSTD) are weighted means that assign more weight to the responses of persons whose probability of success on the item is close to 50/50, while OUTFIT statistics (OUTFIT MNSQ and OUTFIT ZSTD) are unweighted means of MNSQs and ZSTDs over all persons. Thus, OUTFIT statistics are more sensitive to extreme responses (outliers). The rule of thumb is that items with good model-data-fit have INFIT and OUTFIT MNSQs within the range of 0.7−1.3, and INFIT and OUTFIT ZSTDs within the range of −2 to +2 (ref 17, pp 285−286). Fit statistics based on pilot test I data suggested that some items did not fit the model well, and revisions to those items were made accordingly. For example, T1Q2 during pilot test I was revised to become T2Q2 during pilot test II. The two versions of the item are shown in Box 2.
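The rule of thumb above (MNSQ between 0.7 and 1.3, ZSTD between −2 and +2) is straightforward to apply programmatically once the fit statistics are exported from the Rasch software. A minimal sketch is shown below, using for illustration the values later reported for T2Q2 and T2Q10 in Table 7; the dictionary layout is an assumption, not a Winsteps output format.

```python
MNSQ_RANGE = (0.7, 1.3)   # acceptable INFIT/OUTFIT MNSQ (ref 17)
ZSTD_RANGE = (-2.0, 2.0)  # acceptable INFIT/OUTFIT ZSTD

def flag_misfit(item: dict) -> list[str]:
    """Return the fit statistics of an item that fall outside the rule-of-thumb ranges."""
    problems = []
    for key in ("infit_mnsq", "outfit_mnsq"):
        if not MNSQ_RANGE[0] <= item[key] <= MNSQ_RANGE[1]:
            problems.append(key)
    for key in ("infit_zstd", "outfit_zstd"):
        if not ZSTD_RANGE[0] <= item[key] <= ZSTD_RANGE[1]:
            problems.append(key)
    return problems

# Two item records taken from Table 7 (pilot test II).
items = [
    {"name": "T2Q2",  "infit_mnsq": 0.97, "outfit_mnsq": 0.91, "infit_zstd": -0.3, "outfit_zstd": -0.4},
    {"name": "T2Q10", "infit_mnsq": 1.12, "outfit_mnsq": 1.36, "infit_zstd": 1.8,  "outfit_zstd": 3.0},
]
for item in items:
    print(item["name"], flag_misfit(item) or "fits the model")
```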


Table 3. Item Categories for T1Q2 and T2Q2

Item (N)         Choice  Score  Frequency  Chosen, %  Av Measure  SE    OUTFIT MNSQ  PTMEA CORR
T1Q2 (N = 110)   A       0      6          5          −0.03       0.54  1.0          −0.20
                 C       1      104        95         0.75        0.08  0.9          0.20
T2Q2 (N = 323)   B       0      18         6          −0.81       0.14  0.6          −0.27
                 A       0      25         8          −0.18       0.13  1.1          −0.12
                 C       1      280        87         0.29        0.05  1.0          0.28

Table 3 presents the fit statistics of T1Q2 and T2Q2. Choices A and B of T1Q2 were not plausible to students, because the majority of students (95%) selected the correct choice, C. After we revised the stem of T1Q2, choices A and B were able to attract some students. The standard error (SE) values indicate how precise the item and person estimates are; the closer an SE value is to 0, the better. Table 3 shows that the SE values of the choices for T2Q2 also became smaller. The statistic PTMEA CORR (point−measure correlation) refers to the correlation between students’ scores (points) on the item and their Rasch ability measures; the higher and the more positive the correlation, the better. Table 3 shows that PTMEA CORR values for the correct choice (C) were positive, while PTMEA CORR values for the incorrect choices were negative, which is desirable. Also, PTMEA CORR increased from 0.20 to 0.28 after revision, suggesting that the revision improved item discrimination.

Table 4 presents a summary of revisions made to the items based on Rasch item fit statistics, response patterns, and other Rasch analysis results based on pilot test I data (see the Supporting Information). It can be seen that some items were changed substantially while others only slightly.

Table 4. Summary of Revisions to Items after Pilot Test I

Original Item (Pilot Test I) → New Item (Pilot Test II): Revision
T1Q1 → T2Q1: Stem and choices were revised to be clearer and easier for students.
T1Q2 → T2Q2: Stem was revised so the choices were more plausible to students.
T1Q3 → T2Q3: Choice D was revised to increase its clarity.
T1Q4 → T2Q4: Choices C and D were revised to be more plausible to students.
T1Q5 → T2Q9: Changed from level 1 to level 2 of understanding.
T1Q7 → T2Q7: Revised to better match level 2 of understanding.
T1Q8 → T2Q10: Stem was revised so the direction of the stem was clearer.
T1Q9 → T2Q6: Removed due to being too easy; a new item was added.
T1Q10 → T2Q8: Removed due to not matching level 2; a new item was added.
T1Q11 → T2Q5: Changed from targeting level 2 to level 1 of understanding.
T1Q13 → T2Q11: Stem and choices were revised to increase the item’s difficulty.
T1Q14 → T2Q15: Stem and choices were revised to increase the item’s difficulty.
T1Q15 → T2Q13: Stem was revised to increase the item’s difficulty.
T1Q16 → T2Q14: Stem and choices were revised to increase the item’s difficulty.
T1Q17 → T2Q12: Stem and choices were revised to increase the item’s difficulty.

We also analyzed differential item functioning (DIF) to identify potential item bias against particular groups of students. Specifically, we would like to ensure that items are equally difficult for both Chinese students and U.S. students. Country-related DIF was examined based on pilot test I data. Table 5 presents the results of the DIF analysis. There are two statistical tests for DIF: the t-test and the Mantel−Haenszel test.29 From Table 5, we can see that 10 items (T1Q3, T1Q5, T1Q7, T1Q10, T1Q12, T1Q13, T1Q15, T1Q16, T1Q17, T1Q18) had DIF based on the t-test (p < 0.05), and seven items (T1Q1, T1Q5, T1Q7, T1Q10, T1Q13, T1Q14, T1Q15) had DIF based on the Mantel−Haenszel test (p < 0.05). Among those items, some were significantly more difficult for Chinese students than for U.S. students (e.g., T1Q1, T1Q5, and T1Q7), and some were significantly more difficult for U.S. students than for Chinese students (e.g., T1Q10 and T1Q13). Because several items had DIF, Rasch analysis of future pilot test data needs to be conducted separately for Chinese students and U.S. students. Because the pilot test II data contained results from 330 Chinese students and only 73 U.S. students, we excluded the U.S. data and used only the Chinese student data for analysis in pilot test II.

Constructed-response questions and their scoring rubrics were also revised based on the Rasch analysis output. Inter-rater reliability studies were undertaken prior to the scoring of the three constructed-response questions in pilot test II. Two knowledgeable raters scored the responses to items T2Q16, T2Q17, and T2Q18 by a random subsample of students (n = 52) from the pilot test II sample. The proportion of exact agreement on the four categories (i.e., 0, 1, 2, 3) between the two raters was 0.88 (T2Q16), 0.85 (T2Q17), and 0.90 (T2Q18), respectively, and the Cohen’s κ coefficients for these three items were 0.83, 0.78, and 0.87, respectively. Thus, the overall inter-rater reliability was high. The first author scored the rest of the students’ responses to the questions.


Table 5. Differential Item Functioning Results for Pilot Test I

        Measure (SE)                          DIF                                          Mantel−Haenszel
Item    Chinese Students (a)  U.S. Students (b)  Contrast  Joint SE  t      DF   P Value   P Value   Size (c)
T1Q1    1.42 (0.32)           0.65 (0.28)        0.77      0.43      1.82   109  0.072     0.045     0.21
T1Q2    −2.81 (0.62)          −2.07 (0.60)       0.74      0.86      0.86   108  0.390     0.895     − (d)
T1Q3    0.50 (0.30)           1.44 (0.29)        −0.94     0.42      −2.25  106  0.027     0.217     −0.30
T1Q4    −0.73 (0.34)          −1.76 (0.53)       1.03      0.63      1.64   102  0.104     0.842     0.69
T1Q5    0.50 (0.31)           −1.26 (0.44)       1.77      0.54      3.30   104  0.001     0.004     0.32
T1Q6    0.80 (0.25)           0.63 (0.18)        0.17      0.31      0.57   80   0.573     0.992     −1.10
T1Q7    0.50 (0.30)           −0.96 (0.39)       1.46      0.49      2.98   108  0.004     0.001     1.71
T1Q8    1.77 (0.34)           1.77 (0.30)        0.00      0.45      −0.01  108  0.994     0.520     + (d)
T1Q9    −0.56 (0.32)          −1.76 (0.53)       1.19      0.62      1.93   107  0.058     0.891     + (d)
T1Q10   0.08 (0.30)           1.08 (0.28)        −1.00     0.41      −2.44  107  0.016     0.020     −0.23
T1Q11   −0.72 (0.34)          −0.33 (0.33)       −0.39     0.47      −0.84  106  0.405     0.929     0.29
T1Q12   0.11 (0.30)           1.02 (0.24)        −0.91     0.39      −2.35  86   0.021     0.062     − (d)
T1Q13   −0.65 (0.33)          0.37 (0.31)        −1.02     0.45      −2.25  98   0.027     0.004     −1.50
T1Q14   −0.80 (0.35)          −0.23 (0.32)       −0.57     0.47      −1.21  106  0.228     0.006     −1.72
T1Q15   0.94 (0.30)           −0.96 (0.39)       1.90      0.49      3.85   109  0.000     0.001     1.12
T1Q16   −1.04 (0.35)          0.44 (0.30)        −1.48     0.46      −3.21  105  0.002     0.424     −0.90
T1Q17   −0.01 (0.30)          −1.07 (0.41)       1.06      0.51      2.08   105  0.040     0.666     0.46
T1Q18   0.28 (0.31)           1.16 (0.20)        −0.88     0.37      −2.42  70   0.018     0.232     − (d)

(a) N = 56. (b) N = 56. (c) Size is the standardized DIF. (d) Negative and positive signs indicate the direction of item bias; the standardized DIF size was not estimable in these instances.
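The contrast, joint SE, and t columns in Table 5 follow a standard two-group comparison of item calibrations: the contrast is the difference between the two group difficulties, and the joint standard error is taken here as the root sum of squares of the two group SEs (an assumption that reproduces the published values within rounding). A minimal sketch using the T1Q5 row as a check:

```python
import math

def dif_contrast(measure_a: float, se_a: float, measure_b: float, se_b: float):
    """DIF contrast between two group calibrations of the same item, its joint SE, and t."""
    contrast = measure_a - measure_b            # difference in item difficulty (logits)
    joint_se = math.sqrt(se_a**2 + se_b**2)     # root sum of squares of the group SEs
    return contrast, joint_se, contrast / joint_se

# T1Q5 from Table 5: Chinese students 0.50 (0.31), U.S. students -1.26 (0.44).
contrast, joint_se, t = dif_contrast(0.50, 0.31, -1.26, 0.44)
print(f"contrast = {contrast:.2f}, joint SE = {joint_se:.2f}, t = {t:.2f}")
# Close to the reported 1.77, 0.54, and 3.30; small differences reflect rounding
# of the published group measures.
```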

Table 6 shows the revised scoring rubric for T2Q16, and Figure 3 shows the category structure of the scoring rubric based on the Rasch analysis of pilot test II. Each score in the scoring rubric had its own distinct probability curve (i.e., a peak), suggesting that the four scoring categories were reasonable: when each score has its own distinct probability curve, only students whose abilities fall in the range under that curve are likely to obtain that score. However, the category structure of the initial scoring rubric, based on the Rasch analysis of pilot test I data, showed that the probability curve of score 2 was subsumed by that of score 3, suggesting that score 2 in the original rubric was not appropriate. Revisions to the question and scoring rubric improved the quality of the item, as shown in Figure 3.

Table 6. Scoring Rubric for T2Q16 in Pilot Test II

Score  Level     Description
0      —         a. Not relevant to item; b. Incorrect; c. Only simply describing air in macroscopic view (e.g., “air is invisible”, “air is necessary for breath”)
1      Level 1   Describing air at both macroscopic properties and naive submicroscopic levels
2      Level 2   Describing air by preliminary view of particulate nature of matter
3      Level 3   Describing air in different structure of particles

Figure 3. Category probability graph of the scoring rubric for T2Q16.
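The inter-rater statistics reported above for the constructed-response items (proportion of exact agreement and Cohen’s κ on the 0−3 rubric categories) can be computed from two raters’ score vectors as in the sketch below; the score lists are fabricated for illustration and are not the study’s data.

```python
from collections import Counter

def exact_agreement(r1: list[int], r2: list[int]) -> float:
    """Proportion of responses given identical scores by both raters."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1: list[int], r2: list[int]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(r1)
    p_observed = exact_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    p_chance = sum((c1[k] / n) * (c2[k] / n) for k in set(c1) | set(c2))
    return (p_observed - p_chance) / (1 - p_chance)

# Fabricated 0-3 rubric scores for ten responses, for illustration only.
rater1 = [0, 1, 2, 2, 3, 1, 0, 3, 2, 1]
rater2 = [0, 1, 2, 3, 3, 1, 0, 3, 2, 2]
print(exact_agreement(rater1, rater2), round(cohens_kappa(rater1, rater2), 2))
```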

Table 7. Fit Statistics of Items of Pilot Test II (a)

Item    Measure  SEM   INFIT MNSQ  INFIT ZSTD  OUTFIT MNSQ  OUTFIT ZSTD  PTMEA CORR
T2Q1    0.20     0.12  1.05        1.2         1.02         0.5          0.33
T2Q2    −1.94    0.17  0.97        −0.3        0.91         −0.4         0.28
T2Q3    −0.41    0.12  1.12        2.7         1.18         2.6          0.22
T2Q4    −0.93    0.13  0.86        −2.5        0.76         −2.7         0.49
T2Q5    −0.54    0.12  0.96        −0.9        0.97         −0.4         0.39
T2Q6    −0.16    0.12  1.06        −1.6        1.15         2.5          0.28
T2Q7    0.65     0.12  1.19        3.9         1.24         3.8          0.17
T2Q8    −0.01    0.12  0.99        −0.2        0.97         −0.6         0.39
T2Q9    −0.39    0.12  0.86        −3.4        0.86         −2.2         0.51
T2Q10   −1.08    0.13  1.12        1.8         1.36         3.0          0.14
T2Q11   0.90     0.13  1.10        1.9         1.16         2.2          0.26
T2Q12   0.13     0.12  1.02        0.4         0.99         −0.2         0.37
T2Q13   0.31     0.12  0.90        −2.6        0.88         −2.3         0.49
T2Q14   1.07     0.13  0.97        −0.4        1.04         0.6          0.38
T2Q15   2.48     0.18  1.15        1.2         1.31         1.5          0.11
T2Q16   0.38     0.08  0.92        −1.2        0.91         −1.2         0.61
T2Q17   −0.72    0.07  0.84        −2.3        0.82         −2.3         0.61
T2Q18   −0.01    0.08  0.88        −1.6        0.87         −1.8         0.61

(a) These results report data from students responding to items in pilot test II, N = 330.

Table 7 presents fit statistics for the 18 items in the revised measurement instrument based on pilot test II data. The standard error of measurement (SEM) is the standard error of each item difficulty measure; the closer SEM values are to 0, the better. SEM values for all items were at or below 0.18. MNSQ values for most items were in the acceptable range, except for items T2Q10 and T2Q15 on OUTFIT. ZSTD values for T2Q3, T2Q4, T2Q6, T2Q7, T2Q9, T2Q10, T2Q13, and T2Q17 were a little beyond the acceptable range, suggesting that further improvement to these items may be necessary. All items’ PTMEA CORR values were positive, with a range of 0.11−0.61. Based on the fit statistics, overall the items in the revised measurement instrument fit the model reasonably well.

Local independence relates to the correlations among items after students’ abilities are controlled for; a possible violation is indicated by an MNSQ value below 0.7. Student responses to items with extremely low MNSQ values (e.g., below 0.7) fit the Rasch model too well (that is, they overfit); such results are too good to be true, suggesting that those items may be highly correlated with other items, an instance of local dependence. If local independence does not hold, those items may be redundant and could be removed from the instrument. The fit statistics in Table 7 show that no item has an MNSQ value below 0.7, suggesting that the local independence requirement of Rasch measurement was met.

Figure 4A shows the combined person−item map, also called the Wright map,18,30,31 based on the initial draft instrument in pilot test I. The Wright map shows the locations of the item and person parameter estimates along a common logit interval scale. On the left side we see how students’ ability estimates are distributed, and on the right side we see how the 18 items are distributed from the easiest (bottom) to the most difficult (top). From Figure 4A, we can see that one item (T1Q2) was too easy for students; it should be removed or made more difficult. There were too many items in some areas, but too few in other areas. That is, some questions cluster (e.g., T1Q1, T1Q3, T1Q18, T1Q12, T1Q6, and T1Q10); they should be spread apart appropriately to match the different ability levels of students. Two gaps30,31 were found among the items, one between T1Q2 and T1Q4 and another between T1Q8 and T1Q1, which indicates that a few items were needed at these levels to provide better differentiation among students. Finally, most questions were located from −1 to 1 along the scale; a few questions should be made more difficult, or some more difficult items within the difficulty range of 1−3 logits should be added.

Figure 4. Wright map based on (A) pilot test I and (B) pilot test II.

Figure 4B shows the Wright map for the revised measurement instrument based on pilot test II data (Chinese students). The 18 items overall spread farther apart along the variable from low to high levels. Except for two gaps, between T2Q14 and T2Q15 and between T2Q2 and T2Q10, no large gap exists among the items. Items intended for the same level of understanding of matter were more or less close to each other. For example, T2Q2, T2Q3, T2Q4, and T2Q5 for level 1 were close to the bottom; T2Q11, T2Q13, T2Q14, and T2Q15 for level 3 were close to the top; and T2Q6, T2Q8, and T2Q9 for level 2 were in the middle. A few questions (e.g., T2Q1, T2Q7, T2Q10, and T2Q12) of different levels were mixed. Most questions targeted students’ abilities well, except that a few more questions were needed to fill the gaps between T2Q15 and T2Q14 and between T2Q2 and T2Q10 in order to precisely differentiate students within these ability ranges. Overall, the match between the student ability range and the item difficulty range of the revised instrument was improved, even though further improvement would still be needed.

Figure 5. Factor analysis of residuals based on pilot test II data.

Figure 5 shows the dimensionality analysis of the revised measurement instrument in pilot test II. It can be seen that most items had a loading (i.e., correlation) within the −0.4 to +0.4 range; one item (b-T2Q12) was close to the edge of the range, and three items (A-T2Q17, B-T2Q18, a-T2Q1) were out of the range. Overall, no significant constructs seem to underlie the residuals, suggesting a moderately strong construct underlying the measurement instrument (Rasch measures explained 41.3% of the total variance). The items mentioned above require further investigation for measurement disturbance in order to improve the unidimensionality of this measurement instrument.
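The dimensionality check in Figure 5 is a principal component analysis of the standardized residuals that remain after the Rasch measures are removed. Given a persons-by-items matrix of standardized residuals exported from the Rasch software, item loadings on the first residual component can be sketched as below; the random matrix here only stands in for real residuals, and the function name is illustrative.

```python
import numpy as np

def first_contrast_loadings(std_residuals: np.ndarray) -> np.ndarray:
    """Item loadings on the first principal component of standardized Rasch residuals.

    std_residuals: persons x items array of standardized residuals.
    """
    corr = np.corrcoef(std_residuals, rowvar=False)        # item-by-item residual correlations
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    first = eigenvectors[:, np.argmax(eigenvalues)]         # eigenvector of the largest eigenvalue
    scores = std_residuals @ first                          # person scores on the first component
    return np.array([np.corrcoef(std_residuals[:, i], scores)[0, 1]
                     for i in range(std_residuals.shape[1])])

rng = np.random.default_rng(0)
fake_residuals = rng.standard_normal((330, 18))   # stands in for 330 persons x 18 items
loadings = first_contrast_loadings(fake_residuals)
print(np.round(loadings, 2))   # items loading beyond roughly +/-0.4 warrant a closer look
```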

Next Iteration of Instrument Development

The analysis of pilot test II showed that, although improvement from pilot test I in terms of item and instrument qualities was apparent, further revisions to the items and the instrument and a new round of pilot testing are still necessary. Specifically, the fit analysis suggested that a few questions (i.e., T2Q3, T2Q4, T2Q6, T2Q7, T2Q9, T2Q10, T2Q13, T2Q15, and T2Q17) could improve in their fit to the Rasch model; further revision of these items may be necessary. Three items (T2Q1, T2Q17, T2Q18) should also be closely examined in terms of their underlying constructs, because they did not conform to the same unidimensional construct as the other items. The revised instrument may be given to both Chinese students and U.S. students to conduct DIF analysis again. A few new items with appropriate difficulties need to be developed to fill the gaps between T2Q2 and T2Q10 and between T2Q14 and T2Q15, as shown on the Wright map (Figure 4B). T2Q2 should be removed or made more difficult. A few items more difficult than T2Q15 can be added to assess the most able students. Because of some mismatch between the item ordering and the hypothesized learning progression levels (e.g., T2Q1, T2Q7, T2Q10, and T2Q12), these items should be further reviewed and revised to reflect the intended levels of understanding. Alternatively, the learning progression theory could be reviewed to decide whether the defined levels of understanding are reasonable.

In addition to these changes for the next round of pilot testing, interviews should also be conducted with selected students on their reasoning for item choices. These interviews can help ensure that students respond to the items according to their understanding of the intended construct. In fact, one anonymous reviewer of this manuscript pointed out some possible, different-from-expected ways of reasoning that students might use to respond to questions such as T2Q1, T2Q2, T2Q5, T2Q6, T2Q8, T2Q12, and T2Q15. In order to ensure that students respond to the questions based only on their understanding of matter, students of representative ability levels (low, medium, and high) may be asked to think aloud while answering the questions. Revisions to the items should be made if students do not respond as anticipated.

Establish Validity and Reliability Claims

Reliability is a property of person and item measures in Rasch measurement. The person separation index indicates the overall precision of person measures as compared to errors. The person separation index is the ratio of the true standard deviation to the error standard deviation in person measures. Thus, a ratio greater than 1 indicates more true variance than error variance in person measures, and the larger this ratio is, the more precise the person measures. The person separation index can also be converted to a Cronbach’s α equivalent value with a range of 0−1 (ref 17, p 284). Table 8 shows the summary statistics for the revised measurement instrument based on pilot test II data. It can be seen that the person separation index was 1.39, with an equivalent Cronbach’s reliability coefficient (α value) of 0.66. This person reliability was not very high, but it is acceptable for low-stakes classroom assessment. The item separation index was very high (i.e., 7.27), and its corresponding Cronbach’s α value was 0.98, indicating very reliable item difficulty estimation. Both person and item separation indices and reliability coefficients increased significantly from pilot test I to pilot test II. Specifically, the person Cronbach’s α value was 0.15 higher than that in pilot test I, and the item separation index increased from 3.87 to 7.27. The variation of students’ ability in pilot test I (5.98) and pilot test II (4.93) indicates that the increase in reliabilities and separations in pilot test II is most likely owing to the improvement in the measurement instrument rather than to the larger sample.

As an additional measure of reliability, Rasch measurement produces an SEM result for every individual person and item. Different persons and items have different SEMs; persons and items with measures closer to their means have smaller SEMs than those further from the means. Based on the individual SEMs for persons and items, it is also possible to calculate the overall SEMs for an entire instrument, as shown in Table 8. Overall, SEM values for persons and items were small.

Table 8. Summary Statistics of Persons and Items in Pilot Test II

Parameter (N)   SEM   INFIT MNSQ  INFIT ZSTD  OUTFIT MNSQ  OUTFIT ZSTD  Separation  Reliability
Persons (330)   0.49  0.97        −0.1        1.03         0.1          1.39        0.66
Items (18)      0.12  1.00        0.0         1.02         0.1          7.27        0.98
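The link between the separation index and the reliability coefficient reported in Table 8 is algebraic: with G as the separation index (the ratio of true to error standard deviation), the usual conversion is reliability = G²/(1 + G²). A quick check against Table 8, as a minimal sketch:

```python
def separation_to_reliability(separation: float) -> float:
    """Convert a Rasch separation index G to the corresponding reliability, G^2 / (1 + G^2)."""
    return separation**2 / (1 + separation**2)

print(round(separation_to_reliability(1.39), 2))  # persons: ~0.66, as in Table 8
print(round(separation_to_reliability(7.27), 2))  # items:   ~0.98, as in Table 8
```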


In terms of validity, Rasch measurement provides pertinent evidence for the construct validity of the measures. First, because using Rasch measurement to develop instruments is theory-based, the items have been purposely developed according to the progression of the defined construct; they should therefore represent the content domain of the construct. Second, Rasch measurement is based on individual response patterns that reflect the reasoning processes of individuals involved in answering questions. Good model-data-fit indicates that student reasoning about the items of the measurement instrument was coherent. Third, when data fit the Rasch model, the evidence indicates that the items measure the intended unidimensional construct. Therefore, it is reasonable to claim construct validity for the measures of an instrument developed using Rasch measurement. As for other types of validity evidence, for example, criterion-related and consequence-related evidence, additional evidence needs to be collected to support the claims. In any case, measurement instruments developed using Rasch measurement are more likely to produce stronger criterion-related and consequence-related validity evidence. This is because Rasch measurement produces interval measures of individual persons; using Rasch ability measures of individuals is likely to result in stronger correlations with measures of other variables as a result of reduced error of measurement and increased statistical power for rejecting null hypotheses. Consequence-related validation can be facilitated by providing clear interpretations of Rasch scale scores. For example, different competences related to a construct may be tied to specific ranges of Rasch scale scores.

Develop Documentation for Measurement Instrument Use

The final stage in developing measurement instruments is developing documentation. Documentation provides information to aid users in the appropriate application of the measurement instrument. Important information included in the documentation should cover aspects such as the intended uses of the measurement instrument, the definition of the construct, the process of developing the instrument (including pilot testing), and guidelines for administering the measurement instrument and reporting individual scores. Because it is unrealistic to expect users to conduct Rasch analysis when using the measurement instrument, providing a conversion table of raw scores to Rasch scale scores is helpful. By referencing the conversion table, users can find the equivalent Rasch scale score for each raw score without conducting Rasch analysis, and they may use the Rasch scale scores in subsequent statistical analyses. It should be noted that Rasch scale scores of individuals do not have to be negative. Because Rasch scale scores are interval, in order to make their interpretation more intuitive, Rasch computer analysis programs allow users to specify any range in which to report Rasch scale scores. For example, the score range can be from 0 to 100, the same range as the raw scores, or a scale with a mean of 500 and a standard deviation of 100. Table 9 shows the conversion of raw scores to Rasch scale scores within the range of 0−100 as produced by the Winsteps software. Note that the conversion from raw scores to Rasch scale scores is nonlinear.

Table 9. Conversion between Raw Scores (0−24) and Rasch Scale Scores

Raw Score  Rasch Scale Score    Raw Score  Rasch Scale Score
0          0.00                 13         51.39
1          13.02                14         53.49
2          21.04                15         55.66
3          26.10                16         57.92
4          29.95                17         60.33
5          33.15                18         62.93
6          35.94                19         65.82
7          38.47                20         69.15
8          40.81                21         73.16
9          43.03                22         78.43
10         45.17                23         86.73
11         47.25                24         100.00
12         49.32
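Because the conversion is nonlinear, the simplest way for an instrument user to apply Table 9 is a lookup table. A minimal sketch using the published values:

```python
# Raw score (0-24) to Rasch scale score (0-100), copied from Table 9.
RAW_TO_RASCH = {
    0: 0.00, 1: 13.02, 2: 21.04, 3: 26.10, 4: 29.95, 5: 33.15,
    6: 35.94, 7: 38.47, 8: 40.81, 9: 43.03, 10: 45.17, 11: 47.25,
    12: 49.32, 13: 51.39, 14: 53.49, 15: 55.66, 16: 57.92, 17: 60.33,
    18: 62.93, 19: 65.82, 20: 69.15, 21: 73.16, 22: 78.43, 23: 86.73, 24: 100.00,
}

def rasch_scale_score(raw_score: int) -> float:
    """Look up the Rasch scale score (0-100) for a raw score on the 18-item instrument."""
    return RAW_TO_RASCH[raw_score]

print(rasch_scale_score(12))  # 49.32; note the unequal spacing between adjacent raw scores
```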

To determine the overall levels of students’ understanding of matter as defined in Figure 1, Table 10 shows the items and the item difficulty ranges grouped by level of understanding of matter. From the table, we can see how the ranges of the three levels of understanding overlap. If a student’s Rasch scale score is within the range of two levels, then this student may be considered in transition between these two levels in terms of the development of understanding of matter. If a student’s Rasch scale score is within the range of only one level, then this student may be considered as having reached that level of understanding.

Table 10. Items and Rasch Scale Score Range of Each Level (a)

Level  Items                                          Minimum  Maximum
1      T2Q1, T2Q2, T2Q3, T2Q4, T2Q5, T2Q17            29.83    51.84
2      T2Q6, T2Q7, T2Q8, T2Q9, T2Q10, T2Q16, T2Q18    38.68    56.53
3      T2Q11, T2Q12, T2Q13, T2Q14, T2Q15              51.15    75.39

(a) Each constructed-response question (T2Q16, T2Q17, and T2Q18) is assigned to the level whose items have a mean difficulty closest to that of the constructed-response question.
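Using the level ranges in Table 10, the interpretation described above (reached a level vs. in transition between two levels) amounts to a simple range check, sketched below; the function and its name are illustrative, not part of the published instrument documentation.

```python
# Rasch scale score ranges for each level, from Table 10.
LEVEL_RANGES = {1: (29.83, 51.84), 2: (38.68, 56.53), 3: (51.15, 75.39)}

def interpret_score(scale_score: float) -> str:
    """Map a student's Rasch scale score to the level(s) of understanding it falls within."""
    levels = [lvl for lvl, (low, high) in LEVEL_RANGES.items() if low <= scale_score <= high]
    if not levels:
        return "outside the calibrated level ranges"
    if len(levels) == 1:
        return f"has reached level {levels[0]}"
    return f"in transition between levels {min(levels)} and {max(levels)}"

print(interpret_score(35.0))   # within level 1 only
print(interpret_score(45.0))   # overlaps levels 1 and 2 -> in transition
print(interpret_score(60.0))   # within level 3 only
```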




CONCLUSION

In this article, we have demonstrated how to use Rasch measurement to develop a computer modeling-based instrument to measure students’ understanding of matter. We have answered the following two questions: (i) What is the typical process of developing a computer modeling-based measurement instrument using Rasch measurement? and (ii) What validity and reliability evidence can Rasch measurement provide to support the use of a computer modeling-based measurement instrument? Because raw scores are nonlinear, it is more meaningful to construct Rasch linear measures and use them for subsequent statistical analyses. Specifically for the computer modeling-based measurement instrument, the results suggest that the items fit the model reasonably well. The 18 items spread across a wide range of students’ abilities, from a lower level to a higher level of understanding. A strong unidimensional construct underlies the instrument. The revisions of the scoring rubrics for constructed-response questions and the modifications to both multiple-choice and constructed-response questions have improved category structures, decreased standard errors of measurement, and increased overall fit to the model. Overall, the instrument measures possess adequate validity and reliability.

Developing a measurement instrument using Rasch measurement is an iterative process. Although preliminary evidence is available to support the validity and reliability of the computer modeling-based measurement instrument developed in this study, the instrument can be further improved by continued revision, pilot testing, and application of the Rasch model. This process can also be applied to develop other measurement instruments (computer
modeling-based, or not) on other science concepts or in other chemistry contexts, and Rasch measurement can be used to help establish validity and reliability evidence of a measurement instrument. The information obtained from Rasch measurement is also informative for revising items and scoring rubrics. The iterative process of revision and pilot testing guided by Rasch measurement is promising for developing measurement instruments in chemistry, including computer modeling-based ones. All the techniques described in this article apply to rating scales, too.



ASSOCIATED CONTENT

Supporting Information

NetLogo computer model; computer model-based assessment of student understanding of matter. This material is available via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].



ACKNOWLEDGMENTS

The authors thank Yuane Jia, Yebing Li, Dan Wei, Meiling Luo, Shiyun Sun, Li Xiao, and Huaying Ni of China, and Theodor Fuqua and Gail Zichitella of the United States for their assistance with data collection, and Guangchun Mi for her assistance with scoring constructed-response questions for the inter-rater reliability study.



REFERENCES

(1) Wandersee, J. H.; Mintzes, J.; Novak, J. Research on Alternative Conceptions in Science. In Handbook of Research on Science Teaching and Learning; Gabel, D., Ed.; Macmillan: New York, 1994; pp 177−210.
(2) Kind, V. Beyond Appearances: Students’ Misconceptions about Basic Chemical Ideas, 2nd ed.; Durham University: Durham, U.K., 2004; http://www.rsc.org/images/Misconceptions_update_tcm18-188603.pdf (accessed Dec 2011).
(3) Sirhan, G. J. Turk. Sci. Educ. 2007, 4, 2−20.
(4) Gabel, D. L.; Bunce, D. M. Research on Problem Solving: Chemistry. In Handbook of Research on Science Teaching and Learning; Gabel, D., Ed.; Macmillan: New York, 1994; pp 301−326.
(5) Gabel, D. J. Chem. Educ. 1999, 76, 548−554.
(6) Kozma, R. B.; Russell, J. J. Res. Sci. Teach. 1997, 34, 949−968.
(7) Ünal, S.; Çalık, M.; Ayas, A.; Coll, R. K. Res. Sci. Technol. Educ. 2006, 24, 141−172.
(8) Johnstone, A. H. Sch. Sci. Rev. 1982, 64, 377−379.
(9) Johnstone, A. H. J. Chem. Educ. 1993, 70, 701−705.
(10) Stieff, M. J. Chem. Educ. 2005, 82, 489−493.
(11) Yezierski, E. J.; Birk, J. P. J. Chem. Educ. 2006, 83, 954−960.
(12) Ardac, D.; Akaygun, S. J. Res. Sci. Teach. 2004, 41, 317−337.
(13) Snir, J.; Smith, C. L.; Raz, G. Sci. Educ. 2003, 87, 794−830.
(14) National Research Council. Knowing What Students Know: The Science and Design of Educational Assessment; National Academies Press: Washington, DC, 2001; pp 44−51.
(15) Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; University of Chicago Press: Chicago, IL, 1980; Danmarks Paedogogiske Institut: Copenhagen, 1960.
(16) Liu, X. Using and Developing Measurement Instruments in Science Education: A Rasch Modeling Approach; IAP Press: Charlotte, NC, 2010.
(17) Bond, T.; Fox, C. Applying the Rasch Model: Fundamental Measurement in the Human Sciences; Lawrence Erlbaum Associates: Mahwah, NJ, 2001/2007.
(18) Wilson, M. Constructing Measures: An Item Response Modeling Approach; Lawrence Erlbaum Associates: Hillsdale, NJ, 2005.
(19) Wright, B. D. Fundamental Measurement for Psychology. In The New Rules of Measurement: What Every Educator and Psychologist Should Know; Embretson, S. E., Hershberger, S. L., Eds.; Lawrence Erlbaum Associates: Hillsdale, NJ, 1999.
(20) National Research Council. Taking Science to School: Learning and Teaching Science in Grades K−8; National Academies Press: Washington, DC, 2007.
(21) Smith, C. L.; Wiser, M.; Anderson, C. W.; Krajcik, J. Meas.: Interdiscip. Res. Perspect. 2006, 4, 1−98.
(22) Liu, X.; Lesniak, K. M. Sci. Educ. 2005, 89, 433−450.
(23) Liu, X.; Lesniak, K. J. Res. Sci. Teach. 2006, 43, 320−347.
(24) Liu, X. J. Chem. Educ. 2007, 84, 1853−1856.
(25) Claesgens, J.; Scalise, K.; Wilson, M.; Stacy, A. Sci. Educ. 2009, 93, 56−85.
(26) Wilensky, U. NetLogo; Center for Connected Learning and Computer-Based Modeling, Northwestern University: Evanston, IL, 1999; http://ccl.northwestern.edu/netlogo/ (accessed Dec 2011).
(27) Masters, G. N. A Rasch Model for Partial Credit Scoring. Psychometrika 1982, 47, 149−174.
(28) Linacre, J. Winsteps; http://www.winsteps.com/ (accessed Dec 2011).
(29) Linacre, J.; Wright, B. D. Item Bias: Mantel−Haenszel and the Rasch Model; Memorandum 39; University of Chicago: Chicago, IL, 1987; http://www.rasch.org/memo39.pdf (accessed Dec 2011).
(30) Wright, B. D.; Stone, M. H. Best Test Design; MESA Press: Chicago, IL, 1979.
(31) Wright, B. D.; Masters, G. N. Rating Scale Analysis; MESA Press: Chicago, IL, 1982.
