Using Item Response Theory To Assess Changes in Student Performance Based on Changes in Question Wording

Kimberly D. Schurmeier, Department of Chemistry, Georgia Southern University, Statesboro, Georgia 30458
Charles H. Atwood,* Department of Chemistry, University of Georgia, Athens, Georgia 30602; *[email protected]
Carrie G. Shepler, School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia 30332-0400
Gary J. Lautenschlager, Department of Psychology, University of Georgia, Athens, Georgia 30602

J. Chem. Educ. 2010, 87 (11), 1268-1272. DOI: 10.1021/ed100422c

Five years of longitudinal data from General Chemistry student assessments at the University of Georgia have been analyzed using the psychometric analysis tool item response theory (IRT). From our analysis, we have determined that certain changes in question wording on exams can make significant differences in student performance on those questions. Because our analysis encompasses data from over 6100 students, the statistical uncertainty in the analysis is extremely small. IRT provides new insight into student performance on our assessments that should also be of value to the chemical education community as a whole. In this paper, IRT is used in conjunction with computerized testing to understand how nuances in question wording affect student performance.

Item Response Theory

Over the last two decades, IRT has become a commonly used psychometric tool for analyzing test results (1-7). Embretson and Reise (8) and Baker (9) provide excellent reference resources for complete descriptions of IRT. Because most chemical educators are familiar with classical test theory (CTT), the major similarities and differences between IRT and CTT that bear on our test data analysis are described below. The two methods provide insight into different, nonoverlapping aspects of assessment. CTT is suited to samples of any size and is simpler to perform. With CTT, the mean, median, and Gaussian probability distribution for the test can be calculated. For each test item, CTT also yields an item discrimination factor, a comparison of the performance of the top quartile (or quintile) of students with that of the bottom quartile (or quintile). IRT is applicable to larger sample sizes (preferably 200 or more students) and involves choosing an IRT model, specifying the model parameters, and iterating by computer until the model converges on the data. However, CTT depends more strongly than IRT on the particular group of students whose exams are being analyzed, as well as on the nature of the examination itself. In other words, if the same group of individuals were given two different assessments on the same subject using different questions, CTT analysis would yield different results for the two assessments. When the model assumptions prove reasonable, IRT analysis is independent of both the individuals assessed and the assessment items used. Rather than assigning a mean and median, IRT assigns each student an "ability level" based upon that student's responses to the assessment items. To a first approximation, a student's IRT ability level is a measure of the student's proficiency on the material being assessed. Furthermore, each item on the examination is assigned a corresponding difficulty level based upon the students of a given ability level (and higher) who correctly answered the question. After the IRT analysis is performed, each test item's difficulty indicates how that item discriminates among students across the entire ability range.

At the University of Georgia (UGA), the computerized testing system JExam is used to administer the examinations (three per semester) during the semester (10, 11). The sample pool discussed in this paper consists of data from 96 exams given over 5 years to approximately 6100 students. In the 2005-2006 academic year, 585 different questions were used on the examinations given in sequential semesters. The computerized hour exams comprised 444 of the 585 questions; the remaining 141 questions were given on paper multiple-choice final exams. Each student responded to 25 questions per exam in the first semester and 20 questions per exam in the second semester, for a total of 279 questions per student. In 2005-2006, the first exam was administered to 1369 students, and the second-semester final exam was administered to 713 students. Owing to the large sample size, IRT is more appropriate than CTT for this analysis. None of the test items discussed in this paper were administered to fewer than 200 students, and several have been administered to more than 5000 students over the time span that JExam has been used at UGA.

Basics of Item Response Theory

A full description of IRT is well beyond the scope of this paper; for details, we refer readers to Embretson and Reise (8) and Baker (9). However, some rudiments must be described here.


Table 1. Statistics for Individual Items

Item Number   Slope (Standard Error)   Difficulty (Standard Error)   Asymptote (Standard Error)   χ² Value   DF    Number of Students
1*            1.718 (0.523)             1.305 (0.165)                 0.141 (0.053)                 2.4       7.0   295
2             1.254 (0.188)            -1.337 (0.256)                 0.093 (0.062)                 8.1       8.0   713
3             1.794 (0.191)            -0.581 (0.121)                 0.141 (0.064)                10.2       8.0   1053
4             1.300 (0.320)             1.541 (0.155)                 0.199 (0.042)                 5.9       9.0   1053
5*            0.833 (0.197)             1.186 (0.252)                 0.073 (0.049)                 7.2       7.0   295
6             0.633 (0.220)             2.877 (0.671)                 0.212 (0.073)                 9.8       8.0   713
7*            1.099 (0.231)            -0.695 (0.256)                 0.093 (0.062)                 2.2       7.0   271
8*            1.265 (0.274)             1.225 (0.182)                 0.054 (0.036)                 3.9       6.0   271

IRT involves the estimation of the structural parameters for each item, and concomitantly, those item parameter estimates are used to determine person ability estimates. Each question was fit by the IRT program Bilog-MG 3 (12, 13) to an item characteristic curve (ICC) using the basic IRT equation, eq 1:

P(\theta) = c + \frac{1 - c}{1 + e^{-a(\theta - b)}}    (1)

where b is the item difficulty parameter, a is the item discrimination parameter, c is the pseudoguessing parameter, and θ is the person ability level. Item difficulty and person ability are measured on a common metric referred to as the ability scale. The guessing parameter, c, accounts for the possibility of a student guessing the correct answer. Theoretically, a multiple-choice question with four possible answers has a c value of approximately 0.25. For question formats other than multiple choice, Bilog-MG 3 calculates this guessing parameter close to, but not exactly, zero. The calculated a, b, and c values for each test item discussed in this paper are given in Table 1.

The maximum likelihood estimation (MLE) procedure was used to fit the data based on the students' answers (13-16). A question's empirical fit to the model is judged by its χ² statistic and the associated degrees of freedom (DF); the probability of each question fitting the chosen model was also calculated. If these diagnostics reveal that a question does not fit the model well, then either the question does not discriminate equally among students of the same ability or the wrong model was chosen. Bilog-MG 3, like other IRT programs, generates an item characteristic curve by independently fitting the data for each test item (Figure 1) (13-16).

In IRT, an ideal question has a large slope (around 1 or greater) and lies within a useful region of the target population's ability scale (usually between -3 and +3 for Bilog-MG 3). An example of an ideal ICC is shown in Figure 1. The slope, a, of this curve is 2.566, indicating that this particular question is highly discriminating. The ICC also has an item difficulty, b, of 1.074 (within the range of -3 to +3), indicating that students with an ability of 1.074 have a probability of correctly answering the question given by eq 2 (9):

P(\theta = b) = \frac{1 + c}{2}    (2)
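To make eqs 1 and 2 concrete, here is a minimal Python sketch (ours, not part of the original analysis) that implements the three-parameter ICC of eq 1 and confirms that at θ = b the predicted probability reduces to (1 + c)/2, as in eq 2. The parameter values are those quoted for the ideal ICC in Figure 1 (a = 2.566, b = 1.074, c = 0.066); the function names are ours.

```python
import math

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC (eq 1): probability of a correct
    response for a student of ability theta on an item with
    discrimination a, difficulty b, and pseudoguessing parameter c."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Parameters of the ideal ICC shown in Figure 1
a, b, c = 2.566, 1.074, 0.066

# At theta = b, the 3PL model predicts P = (1 + c)/2 (eq 2)
print(icc_3pl(b, a, b, c))   # 0.533
print((1.0 + c) / 2.0)       # 0.533, matching eq 2

# Probabilities fall off below b and rise above it
for theta in (-3, 0, 1.074, 2, 3):
    print(theta, round(icc_3pl(theta, a, b, c), 3))
```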

Students with ability less than 1.074 have a decreasing probability of correctly answering the question, whereas students with ability greater than 1.074 have an increasing probability of correctly answering it. Also notice that the guessing parameter, c, for this question (indicated by the lower asymptote) is 0.066. This indicates that all students, no matter their ability level, have a 0.066 (6.6%) probability of correctly answering the question by guessing.

Figure 1. Ideal item characteristic curve; the slope (a) of this curve is 2.566, the item difficulty (b) is 1.074, and the lower asymptote (c) is 0.066.

In general, tests constructed for large audiences must contain questions with a variety of difficulty levels to permit assessment of students across the entire ability range. Ideally, the a value for each question will be around 1 or greater, giving high discrimination for each question. Questions that do not fulfill these criteria may be considered poorly constructed: even students with the highest abilities may be unable to answer such a question correctly, while many lower-ability students may guess the answer correctly despite having only minimal subject knowledge.

Prior to the start of the 2005-2006 academic year, exams from the previous four years were analyzed using IRT. This analysis enabled us to find the questions with the best discrimination values and to determine accurately which specific topic database questions are equivalent (5, 17, 18). Nonequivalent questions on a specific topic were removed from use in the exams given during the 2005-2006 academic year. For this and all subsequent analyses, the dichotomous IRT model was used; this model scores each student response as either correct or incorrect, with no partial credit given. For the purposes of this study, the model was chosen based on the fit to the data, rather than selecting only the questions that fit a given model (1). To accurately fit our data, a three-parameter model, employing the a, b, and c item parameters, was required.
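As a further illustration of how item parameter estimates feed into person ability estimates, the sketch below computes a simple grid-based maximum-likelihood ability estimate from a dichotomously scored response pattern under the three-parameter model. This is our minimal illustration of the idea, not the estimation routine implemented in Bilog-MG 3; the (a, b, c) values are taken from items 2-4 of Table 1, and the response pattern is hypothetical.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC (eq 1)."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# (a, b, c) for items 2, 3, and 4 from Table 1
items = np.array([
    [1.254, -1.337, 0.093],
    [1.794, -0.581, 0.141],
    [1.300,  1.541, 0.199],
])

# Hypothetical dichotomously scored responses: 1 = correct, 0 = incorrect
responses = np.array([1, 1, 0])

# Evaluate the log-likelihood of the response pattern on a grid of
# ability values between -3 and +3 (the usable range of the ability scale)
thetas = np.linspace(-3.0, 3.0, 601)
log_like = np.zeros_like(thetas)
for (a, b, c), u in zip(items, responses):
    p = icc_3pl(thetas, a, b, c)
    log_like += u * np.log(p) + (1 - u) * np.log(1.0 - p)

# The ability estimate is the grid point that maximizes the likelihood
theta_hat = thetas[np.argmax(log_like)]
print(f"Maximum-likelihood ability estimate: {theta_hat:.2f}")
```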


Table 2. Ability Needed for Grade during the Fall 2005-Spring 2006 Academic Year at UGA

Grade             A         B          C           D           F
Ability Needed    1.6414    1.03988    0.438372    -0.16314
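For illustration only, the following sketch maps an IRT ability estimate onto a letter grade using the cutoffs in Table 2. It assumes each entry is the minimum ability needed for that grade and that abilities below the D cutoff earn an F; the function name and example ability values are ours, not part of the original study.

```python
# Minimum ability needed for each grade (Table 2, Fall 2005-Spring 2006)
GRADE_CUTOFFS = [
    ("A", 1.6414),
    ("B", 1.03988),
    ("C", 0.438372),
    ("D", -0.16314),
]

def grade_from_ability(theta):
    """Map an IRT ability estimate to a letter grade using the Table 2
    cutoffs; abilities below the D cutoff are assumed to earn an F."""
    for grade, cutoff in GRADE_CUTOFFS:
        if theta >= cutoff:
            return grade
    return "F"

# Example ability values (hypothetical)
for theta in (1.8, 1.1, 0.5, 0.0, -0.5):
    print(theta, grade_from_ability(theta))
```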