
A Psychometric Analysis of the Chemical Concepts Inventory

Jack Barbera*
Department of Chemistry and Biochemistry, University of Northern Colorado, Greeley, Colorado 80639, United States

ABSTRACT: The Chemical Concepts Inventory (CCI) is a multiple-choice instrument designed to assess the alternate conceptions of students in high school or first-semester college chemistry. The instrument was published in 2002 along with an analysis of its data from a test population. This study supports the initial analysis and expands on the psychometric data available. The CCI was given to over 2500 students from four universities. Data were analyzed using classical test theory and Rasch model methods. The classical test theory analysis found the CCI to have acceptable internal consistency (Cronbach’s α = 0.73, pretest, and 0.76, posttest) and good test−retest reproducibility (Pearson correlation = 0.79, posttest). Most items were found to have good difficulty and discrimination values; however, a few had values that may warrant further evaluation. The Rasch analysis showed that the instrument and the items nicely fit the Rasch model. The data showed good separation reliability and that the instrument overall is at an appropriate level for the population. However, several gaps were found along the range of item difficulties, leading to less accurate estimations of student abilities. Overall, the CCI was found to be suitable for the large-scale assessment of students’ alternate conceptions.

KEYWORDS: High School/Introductory Chemistry, First-Year Undergraduate/General, Chemical Education Research, Misconceptions/Discrepant Events, Testing/Assessment, Undergraduate Research, Quantitative Analysis

FEATURE: Chemical Education Research



INTRODUCTION

Why Do We Need an Evaluation of an Instrument That Has Been Published for over 10 Years?

The Chemical Concepts Inventory (CCI)1 is designed to assess students’ alternate conceptions of chemistry topics typically encountered in high school or first-semester college chemistry courses. To create this inventory, Mulford and Robinson1 used the methodology established by Treagust2 and drew upon chemistry textbooks as well as an American Chemical Society (ACS) General Chemistry Examination for the breadth of content coverage for the instrument. Students’ alternate conceptions on the identified content were gathered from the literature. These conceptions were used as the basis for the items’ distractors. The final version of the instrument contains 22 multiple-choice items. More than half of the items are in linked pairs, where the first item probes content knowledge of a specific topic and the second item probes the reasoning for the response.

Since its publication in 2002, the CCI has been used in several studies and is frequently cited. A search of the Web of Science database uncovers more than 65 citations (accessed May 2012) to the CCI. This instrument has been used in studies probing the alternate conceptions of students3−5 as well as instructors.6 Many authors refer to the CCI as a measure of common student misconceptions7−16 or to establish examples of student understanding.17−25 Some authors refer to the CCI as a source of items4,5,26−29 or as an example of item types30−32 when designing their own instrument.

Despite the widespread use and citation of the CCI, there is little information regarding the psychometric properties of the inventory and its items. In their original publication,1 Mulford and Robinson report percentage correct and distractor frequencies for each item and Cronbach’s α values for both the pretest and posttest administrations. These values provide limited information for potential users regarding the validity and reliability of results generated with the instrument and its items. This leaves the potential users with very few points of reference with which to gauge the instrument’s performance. An extension of the analysis of the CCI’s psychometric properties beyond those currently published will provide more robust information about the instrument and its functionality. This information can then be used as a set of reference points for users when evaluating the validity and reliability of their data sets.

Establishing the validity and reliability of data generated by assessment instruments and items is paramount in evaluating the efficacy of teaching practice. Several recent reports champion the need for evidence-based curriculum reform within science education.19,33,34 These reports suggest the use of “widely available reliable and valid instruments”34 and that progress toward reform goals relies on “research methods appropriate for investigating human thinking, motivation, and learning”.33


Holme et al. suggest that a driving force for “the cycle of assessment-enhanced reform” is the process by which quality assessment is disseminated.19 These reports go on to suggest various types of evidence that practitioners can use to make informed curricular choices, many of which require the administration of an assessment instrument. Data gathered with these assessment instruments can then be used to make inferences about the population, teaching practices, and the curriculum under investigation as part of the reform effort. Therefore, the results of an assessment should be scrutinized for claims of validity and reliability such that accurate inferences can be made.

This study used a combination of classical test theory (CTT)35 and Rasch modeling35 methods in evaluating the psychometric properties of data gathered using the CCI. The combination of techniques is used to not only provide readers with the familiar analyses of CTT, but to also assess dimensions beyond what CTT alone is capable of. Data presented from the CTT analysis include percentage correct responses, item discrimination, item difficulty, internal consistency, and test−retest reliability. Rasch analysis will be used to evaluate the dimensionality of the instrument, local independence of the items, separation reliability, the match between the difficulty of the items and the ability of the students, and the fit of the items to the dichotomous Rasch model.36,37

POPULATION

The CCI was administered during the Fall 2011 semester to students enrolled in a first-semester general chemistry course at four different universities in the United States (Table 1). Each school gave the exam as both a 30-min pretest and posttest. At each school, the pretest was given within the first two weeks of the course and the posttest was given within three weeks of the final exam. Students in all sections at each school participated; data presented represents only those students who provided consent to do so. Schools were selected based on the course size, classification, location, and on their willingness to participate in the data collection. A subset of the students at school A were given a second posttest two weeks after they were given the first posttest. These two administrations were used to evaluate the test−retest reliability35 of the CCI posttest data.

Table 1. Number of Students at Each University for Each Administration of the CCI

School   Region in the U.S.a    Carnegie Classificationb   Pretest (n)   Posttest (n)   Test−Retest (n)
A        Mountain West          DRU                        310           286            118
B        Mountain West          RU/VH                      699           623
C        East North Central     Master’s L                 617           458
D        Pacific West           RU/VH                      1399          1317

aU.S. Census division. bDRU, doctoral/research university; RU/VH, research university (very high activity); Master’s L, master’s college/university (larger institution).

METHODOLOGY

Data Processing

Student responses were gathered using bubble-in response sheets; these sheets were scanned and processed using a spreadsheet program. Data from students who did not provide consent were removed from each data set. Students who did not respond to all 22 items were also removed from the data set. The values listed in Table 1 represent the useable number of student responses from each school. As the CCI is designed to assess the alternate conceptions of first-semester college chemistry students in general, all data were combined into large master pretest and posttest data sets. This was done to provide better fit-statistics and increase the range of student abilities for the Rasch model analysis. A dichotomous response data set was produced from the raw polytomous data, in which each student’s response was scored as either correct or incorrect.



Data Analysis

Several analyses were conducted on the data sets. A spreadsheet program was used to determine the percentage correct, item difficulty, and item discrimination for each of the 22 items. The Winsteps program38 was used to evaluate the pretest and posttest data using the Rasch model. The Rasch model is a probabilistic model; its creator best describes its principle as follows:39 “[A] person having a greater ability than another person should have the greater probability of solving any item of the type in question, and similarly, one item being more difficult than another means that for any person the probability of solving the second item is the greater one.” The Rasch model assumes that there is a linear measure of both items and persons.36 For items the model measures their difficulty, D; for persons the measure is ability, B. The relationship between these measures is expressed below in the general Rasch model equation.36

    P_{ni}(x = 1) = \frac{\exp(B_n - D_i)}{1 + \exp(B_n - D_i)}        (1)

Equation 1 shows that the probability of correctly answering an item, Pni(x = 1), is a function of the difference between a person’s ability, Bn, and the difficulty of the item, Di. If the difference is zero, the probability is 50%. As the difference increases in the positive direction, the person has a higher probability of answering correctly. The additive nature of this measurement model allows for direct comparisons between the ability of the population and the difficulty of the items on the instrument. This allows for investigation of the match between instruments, items, and target populations.
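As a concrete illustration of eq 1 (and not part of the Winsteps analysis itself), the short sketch below evaluates the Rasch probability for a person–item pair.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the dichotomous Rasch model (eq 1)."""
    return math.exp(ability - difficulty) / (1 + math.exp(ability - difficulty))

# A person whose ability equals the item difficulty has a 50% chance of success;
# a person 1 logit above the item difficulty has roughly a 73% chance.
print(rasch_probability(0.0, 0.0))   # 0.5
print(rasch_probability(1.0, 0.0))   # ~0.73
```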

RESULTS AND DISCUSSION

The CCI was developed to evaluate student misconceptions both at the start of and after a typical college-level course in general chemistry. Therefore, the psychometric properties need to be established for each administration separately. The results of the pretest and posttest administrations will be presented in parallel below. These are not matched data sets; our intention is only to establish the properties of the instrument and its items for the two different administrations. Data regarding pretest to posttest gains on a matched data set are the focus of a separate study and manuscript.40

Classical Test Theory: Percentage Correct

Each data set was used to calculate the percentage of students who correctly answered each item (see the Supporting Information for a complete list of items); these data are presented in Table 2 along with the original results presented by Mulford and Robinson.1 The data sets from the two studies show similarities in terms of the range of scores and the relative success on individual items. Spearman’s ρ correlation coefficients between the two data sets (percentage correct for each item) are 0.97 and 0.95 for the pretest and posttest data, respectively. The percentage correct data presented in Table 2 helps to validate the original results presented in Mulford and Robinson’s development paper. To date, only one other study has published data on how each of the CCI items functions in comparison to the original literature data.1 Kruse and Roehrig used the CCI to assess teachers’ conceptions.6 While other studies have used individual CCI items to assess the same target population as Mulford and Robinson (first-semester college chemistry students), to date, ours is the only one that provides complete results to support their findings. In subsequent sections, the manuscript will provide more in-depth details of how the CCI items function with this target population.

Table 2. Pretest and Posttest Percentage Correct Values for Each Item on the CCI

Correct Responses, %

         Current Data           Literature Dataa
Item     Pretest    Posttest    Pretest    Posttest
1        40         42          37         34
2        36         45          40         47
3        54         64          67         72
4        70         72          73         74
5        12         25          11         21
6        48         52          39         45
7        87         91          89         92
8        88         91          88         91
9        30         36          28         30
10       34         36          36         44
11       33         36          35         42
12       68         73          69         74
13       70         75          71         75
14       23         22          25         32
15       66         73          77         79
16       33         44          33         38
17       32         37          29         34
18       50         55          50         54
19       67         72          61         66
20       25         25          32         34
21       14         15          25         26
22       22         23          19         25

aLiterature data from ref 1.
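Item-level percentage correct values and the Spearman comparison between studies are routine calculations. The sketch below is an illustration only; it assumes a scored 0/1 matrix like the one produced earlier, and the example lists are the Table 2 pretest values rather than a claim about reproducing the reported coefficients exactly.

```python
import numpy as np
from scipy.stats import spearmanr

def percent_correct(scores):
    """Column-wise percentage correct for a students x items 0/1 matrix."""
    return 100 * np.asarray(scores, dtype=float).mean(axis=0)

# Item-level comparison of two sets of percentages
# (current pretest vs literature pretest values from Table 2).
current = [40, 36, 54, 70, 12, 48, 87, 88, 30, 34, 33,
           68, 70, 23, 66, 33, 32, 50, 67, 25, 14, 22]
literature = [37, 40, 67, 73, 11, 39, 89, 88, 28, 36, 35,
              69, 71, 25, 77, 33, 29, 50, 61, 32, 25, 19]
rho, p_value = spearmanr(current, literature)
print(round(rho, 2))  # rank correlation between the two studies' item values
```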

Classical Test Theory: Item Difficulty and Item Discrimination

Item difficulty (p) measures the portion of examinees who answered the item correctly. Difficulty values range from 0.0 to 1.0, with a higher value indicating a greater proportion of examinees answering correctly, indicating an easier item.35 Item discrimination (D) is a measure of how well the item distinguishes between high- and low-performing examinees. As suggested by Kelley41 for data sets larger than 200, discrimination values were calculated using the upper and lower 27% criteria. In this approach, participants are ordered by their total score, and the extremes are used to more accurately determine how well an item differentiates between performers. Discrimination values can range from −1.0 to 1.0. Negative values indicate a problem with the item or that it was miskeyed. As discrimination values increase above zero, the item can better distinguish between the high and low performers. An item’s discrimination value is somewhat dependent on its difficulty value. For example, if an item has a very high difficulty value (indicating an easy item), then it is not likely to be very discriminant between different performance levels.

The difficulty and discrimination values for each administration are plotted in Figure 1. The suggested range of item difficulty is 0.25−0.75 for concept inventories.42 Discrimination values greater than or equal to 0.30 will be deemed acceptable; those below 0.30 may require further investigation to determine their lack of discriminant ability.43,44 Boundary ranges are indicated in Figure 1; gray areas indicate items with values outside of the suggested criteria. Items 7 and 8 (a two-tiered item pair probing conservation of mass; see the Supporting Information for a complete list of items) have difficulty values greater than 0.75 on both the pretest and posttest administrations. While these items have acceptable discrimination values on the pretest administration, they drop below 0.30 in the posttest data. These characteristics indicate that as both pretest and posttest items they do not produce much useful data pertaining to student misconceptions about the conservation of matter. The item pair shows that around 90% of students answer the items correctly on both pretest and posttest. Mulford and Robinson suggested1 that these items may be prompting simple recall and not actually addressing student conceptions of the conservation of matter. Items 9 (bond energy) and 22 (macroscopic and microscopic properties) are the least discriminating items in both data sets. While neither is the most difficult item, they do not distinguish between high and low performers on the instrument. Item 18 (conservation of mass) is the most discriminating item in both data sets, indicating a strong split between high and low performers. Based on difficulty and discrimination values, items 5, 7, 8, 9, 14, 20, 21, and 22 require further investigation of their item functioning. Without further information from a qualitative study of these items, attaching meaning to their results would be purely speculative.

Figure 1. Difficulty (p) and discrimination (D) values for all 22 CCI items: (A) pretest; and (B) posttest data. The gray areas indicate items with values outside of the suggested criteria for item difficulty and discrimination.
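The difficulty and discrimination indices described above can be computed directly from a scored response matrix. The sketch below is one plausible implementation of the upper/lower 27% approach, offered as an illustration rather than the spreadsheet calculation actually used in the study.

```python
import numpy as np

def difficulty_and_discrimination(scores, tail=0.27):
    """Classical item difficulty (p) and discrimination (D) for a 0/1 matrix.

    scores: array of shape (n_students, n_items).
    D is the difference in proportion correct between the top and bottom
    `tail` fraction of students, ranked by total score (Kelley's 27% groups).
    """
    scores = np.asarray(scores, dtype=float)
    n_students = scores.shape[0]
    p = scores.mean(axis=0)                          # item difficulty

    order = np.argsort(scores.sum(axis=1))           # rank students by total score
    k = max(1, int(round(tail * n_students)))
    low_group = scores[order[:k]]
    high_group = scores[order[-k:]]
    d = high_group.mean(axis=0) - low_group.mean(axis=0)   # item discrimination
    return p, d
```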

Classical Test Theory: Reliability

Two forms of reliability for the data generated from the administrations of the CCI were estimated. Internal consistency of both data sets was evaluated through a Cronbach’s α calculation. This value is a measure of the consistency within a person’s responses to the items on an assessment instrument.35 Data are said to have “acceptable” internal consistency if the α value is between 0.70 and 0.80, “good” consistency between 0.80 and 0.90, and “excellent” above 0.90.35 Both the pretest and posttest data sets have α values in the acceptable range (Table 3). These pretest and posttest values are in the same category as those reported by Mulford and Robinson in their study. A subset of the participants from school A (n = 118) were administered the CCI twice at the end of the semester. The two posttests occurred two weeks apart and are used to compare the reproducibility of the data through a test−retest calculation. This method compares respondents’ responses on each administration; the match between responses is evaluated using a Spearman’s ρ correlation coefficient.35 The test−retest value for this data set is 0.79 (p < 0.01), indicating reasonably good reproducibility of the data.

Table 3. Reliability Values of Pretest and Posttest CCI Data

                          Current Data                   Literature Dataa
Measures                  Pretestb       Posttestc       Pretest       Posttest
Cronbach’s α Values       0.73           0.76            0.704         0.716
Test−Retest                              0.79d

aLiterature data from ref 1. bN = 3025. cN = 2684. dN = 118; p < 0.01.
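Cronbach’s α and the test−retest correlation reported in Table 3 are standard calculations. The sketch below shows one common way to compute them from scored data and is an illustration of the procedure, not the exact code used for this study.

```python
import numpy as np
from scipy.stats import spearmanr

def cronbach_alpha(scores):
    """Cronbach's alpha for a students x items matrix of 0/1 scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

def test_retest(totals_first, totals_second):
    """Test-retest reliability as a rank correlation between total scores
    from two administrations of the same instrument to the same students."""
    rho, p_value = spearmanr(totals_first, totals_second)
    return rho, p_value
```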

Rasch Model: Unidimensionality and Local Independence

The Rasch model is based on two assumptions: that the scale is unidimensional (measures a single latent trait); and that the items are locally independent (success or failure of one item is not related to the success or failure of other items).35 These structures are highly related to one another and are evaluated using Rasch measurement programs.36 As the CCI contains item pairs (two-tier), which by design are related, these pairs are scored together in the Rasch analysis. Therefore, an item pair is only scored as correct if both tiers are answered correctly. Treagust has used this type of data analysis in the evaluation of other two-tier chemistry instruments.45 It is done here to not violate the Rasch model assumption of local independence. A latent trait is “a hypothetical and unobserved characteristic or trait (e.g., verbal ability, knowledge of history, or extroversion)”.46 For the CCI, the latent trait can be defined as “conceptual chemistry knowledge”. Therefore, an evaluation of the unidimensionality of the CCI looks for trends in the data that do not correspond to this trait. Unidimensionality within the Rasch measurement model is evaluated by principal component analysis of residuals. The residuals are calculated from the difference in the observed response and the probability of a correct response as predicted by the Rasch model. This analysis looks for relationships between item variance once the main component (describing the single latent trait) is removed. In the pretest data, three items have correlated variance leading to a unique component. Items 9, 22, and the item 10.11 pair load together. In the posttest data, two items (9 and the 10.11 pair) load together in a unique component. While these unique components might suggest that the CCI is not unidimensional, this characteristic is not all-or-nothing. The “dimensions” formed by these items are very weak (eigenvalues less than 2 for both pretest and posttest).47 On the basis of these values, the dimensions are not seen as a threat to the unidimensionality of the CCI data. Ultimately, evaluating unidimensionality requires knowledge of the instrument construct in order to evaluate whether it is “unidimensional enough” for its intended purpose.37

Local independence is evaluated within the Rasch model by the use of residual correlations among the items.47 An analysis of local independence for the pretest and posttest data did not reveal any significant correlations between items. The largest correlation observed was −0.18; items with correlations >0.7 are considered highly locally dependent.47 These results taken together indicate that the CCI data do not violate the assumptions of the Rasch model and that the analysis is appropriate for use with the data set.

Rasch Model: Item Difficulty and Person Ability

In the Rasch model, the difficulty of items and the ability of respondents are placed on the same log-odds (logit) scale.37 Placing both variables on the same scale allows for direct comparisons between them. This feature is paramount to the utility and power of a Rasch analysis. No other analysis method allows a direct comparison between a person’s ability and the difficulty of the items on an instrument. On the logit scale, zero (0) represents a person of average ability or an item of average difficulty. Positive values on the scale indicate persons of higher than average ability or items that are more difficult than the average difficulty. Correspondingly, negative values indicate persons of lower ability or easier (less difficult) items. It is common to compare these variables visually by evaluating a Wright map36 of the data. A Wright map is a vertical plot of the distribution of logit values for both person ability and item difficulty; Figure 2 shows the Wright maps for the pretest and posttest data of this study. The values (4 to −4) to the left of each plot are the logit scale; the dashed vertical line separates the person ability data (left side) from the item difficulty data (right side). The abilities of the participants are indicated by symbols (# or •), items are labeled by their item number (or item pair) on the CCI. The capital letters running along the dashed vertical axis indicate the placement of the mean (M), one standard deviation (S), and two standard deviations (T) for each variable. A person’s placement along the vertical plot indicates his or her ability. A person would have a 50% probability of correctly answering an item directly across the dashed centerline from his or her ability (i.e., same logit value). A person has a lower probability of correctly answering items higher up on the plot than his or her ability position and a higher probability of correctly answering items below his or her position.

Figure 2. Wright maps of pretest and posttest CCI data, showing vertical plots of the distribution of logit values for both person ability and item difficulty.

The Wright maps from the CCI data provide several details about each data set. First, the maps show the symmetric distribution of person abilities for each administration of the CCI. Abilities range from lows around −4 logits to highs around +4 logits, with most participants falling within ± one standard deviation from the mean. This distribution shows that the data set is not skewed or bimodal with respect to the ability of the participants. Second, the maps show the range of item difficulties for each administration. In both sets, item difficulties range from very easy items (around −2.8 logits) to more difficult items (around +2.1 logits). The maps also show several items of similar difficulty (e.g., 12.13 and 15 pretest, or 12.13, 15, and 4 posttest) as well as gaps along the difficulty range. Gaps in the range of difficulties are problematic. In the Rasch model, the accuracy of evaluating a person’s ability relies on the location of items with difficulties in the same range of the scale. When gaps exist, there is more error in determining a person’s ability. For example, in the pretest data set (Figure 2), there are several students whose abilities fall within the gap between item 7.8 (−2.75 logits) and items 12.13 and 15 (−1.31 and −1.35 logits, respectively). There will be more error in determining the ability of these students when compared to the students higher up on the scale who are more closely bracketed by items. The last significant feature of a Wright map is the ease of establishing the match between the ability of participants and the difficulty of an assessment. This is accomplished by evaluating the mean values (M) of each variable as well as the overlap of the distributions. The maps in Figure 2 show that the means differ by 0.52 logits for pretest data and 0.22 logits for posttest data. These small differences indicate that the difficulty of the CCI is appropriately matched to the ability of the population in each administration. An evaluation such as this is critical when evaluating the appropriateness of an assessment for a population.
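A rough, text-only check of the person–item match described above can be made from person and item measures exported from any Rasch program. The sketch below simply compares the two logit distributions and flags gaps between adjacent item difficulties; it is an illustrative summary under those assumptions, not a substitute for a full Wright map.

```python
import numpy as np

def targeting_summary(person_abilities, item_difficulties, gap_threshold=1.0):
    """Compare person and item logit distributions and report large gaps."""
    persons = np.asarray(person_abilities, dtype=float)
    items = np.sort(np.asarray(item_difficulties, dtype=float))

    print(f"mean ability    : {persons.mean():+.2f} logits")
    print(f"mean difficulty : {items.mean():+.2f} logits")
    print(f"targeting offset: {persons.mean() - items.mean():+.2f} logits")

    # Adjacent item difficulties separated by more than the threshold mark
    # regions of the scale where ability estimates carry more error.
    for low, high in zip(items[:-1], items[1:]):
        if high - low > gap_threshold:
            print(f"gap of {high - low:.2f} logits between {low:+.2f} and {high:+.2f}")
```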

Rasch Model: Item Fit

The Rasch model assumes that students of higher ability will have a higher probability of getting an item correct than students of lower ability. This mathematical model is used to establish both an item’s difficulty and a person’s ability. Within a Rasch analysis, the fit of an item to the model is evaluated. Item fit can be used to identify problematic items. Poor item fit may be due to several reasons and needs to be evaluated on a case-by-case basis. Results from the pretest and posttest administrations of the CCI were evaluated for fit to the Rasch model. Table 4 shows the placement of each item along the logit scale (difficulty value) as well as its model fit parameters. Model fit is evaluated by two residual analyses, the unweighted (outfit) and weighted (infit) indices. Outfit is sensitive to the unexpected behavior (response patterns) of persons whose ability is far from the item’s difficulty. Infit is weighted to be less sensitive to these unexpected response patterns. The magnitude of the fit for both indices is estimated by the mean-squares (MNSQ) value, which is a χ2 statistic divided by its degrees of freedom, such that the expected value is close to 1.0. Values of MNSQ greater than 1.0 indicate more variation between the response patterns observed and those expected by the model. The acceptable fit range is indicated by MNSQ values between 0.70 and 1.30; infit or outfit values outside this range indicate poor fit of the data to the Rasch model.36 On the basis of the fit indices, only item 9 shows poor model fit of both the pretest and posttest data. As both the infit and outfit values are outside the recommended range, this item is problematic for most students. Item 9 deals with bond energy and addresses the misconception of bond breaking releasing energy. These results only indicate that the data from item 9 do not fit the model of an item being easier for higher-ability students. These data do not indicate whether the problem is with the item itself or with the actual knowledge of the students; further investigation would be needed to make accurate inferences. Three other items (10.11, 20.21, and 22) have outfit values outside the recommended ranges. As these three items have corresponding infit values within the acceptable ranges, their data are not seen as being particularly problematic. This conclusion is due to the fact that outliers in the data set heavily influence outfit values, most often due to guessing. Based on these fit parameters, most of the pretest and posttest data fit the Rasch model appropriately. This implies that each item (regardless of its difficulty value) functions reasonably well for most students. In addition to evaluating the MNSQ values as an indication of fit, data can also be evaluated by the z-statistic (ZSTD). The ZSTD values are often used to test the hypothesis that data fit the model “perfectly”. Values of ZSTD are not reported in Table 4 as these values are overinflated owing to the very large (N > 2500) number of participants. With this size data set, even the smallest deviation from a “perfect” fit shows up as being significant.36 It is recommended that MNSQ values be evaluated prior to consideration of ZSTD values and that “ZSTD is only useful to salvage non-significant MNSQ [values] when sample size is small or test length is short”.38

Table 4. Item Fit Statistics for Pretest and Posttest CCI Data

           Pretest Data, N = 3025                          Posttest Data, N = 2684
Item       Difficulty Valuea   Infit, MNSQb   Outfit, MNSQb   Difficulty Valuea   Infit, MNSQb   Outfit, MNSQb
1          −0.02               1.03           1.08            0.18                1.04           1.08
2          0.17                0.90           0.91            0.05                0.86           0.85
3          −0.72               0.90           0.86            −0.94               0.88           0.80
4          −1.58               0.88           0.79            −1.42               0.91           0.82
5          1.91                0.98           1.19            1.18                0.95           0.92
6          −0.42               0.89           0.87            −0.34               0.86           0.80
7.8        −2.77               0.90           0.85            −2.90               0.94           0.94
9          0.50                1.31c          1.49c           0.53                1.33c          1.55c
10.11      0.48                1.25           1.38d           0.66                1.27           1.44d
12.13      −1.31               0.88           0.85            −1.40               0.91           0.88
14         0.99                1.02           1.16            1.42                1.00           1.14
15         −1.35               1.04           1.08            −1.50               1.03           1.04
16.17      1.46                0.96           0.94            1.28                0.97           1.01
18.19      −0.45               0.89           0.87            −0.40               0.88           0.84
20.21      2.12                0.89           0.75            2.25                0.83           0.65d
22         1.01                1.17           1.34d           1.35                1.21           1.60d

aDifficulty values are reported using the logit scale, ranging from −4.0 to +4.0, with 0 indicating an item of average difficulty. bThe mean-squares (MNSQ) value is a χ2 statistic divided by its degrees of freedom, so that the expected value is close to 1.0. MNSQ values of 0.70−1.30 indicate the acceptable fit range; Infit (weighted) or Outfit (unweighted) values outside this range indicate poor fit of the data to the Rasch model. cItem 9 shows poor model fit of both the pretest and posttest data; these results only indicate that the data from item 9 do not fit the model of an item being easier for higher ability students. The item 9 data do not indicate whether the problem is with the item itself or with the actual knowledge of the students. dThese items have Outfit values outside the recommended ranges. As these items have corresponding Infit values within the acceptable ranges, their data are not seen as being particularly problematic.
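Infit and outfit MNSQ statistics can be computed from the same person and item estimates using the standard residual-based definitions (outfit as the unweighted mean of squared standardized residuals, infit as the information-weighted version). The sketch below is a simplified stand-in for the Winsteps output reported in Table 4, assuming the estimates are already available.

```python
import numpy as np

def item_fit(responses, abilities, difficulties):
    """Approximate infit/outfit MNSQ per item for a 0/1 response matrix.

    responses: (n_persons, n_items) observed scores.
    abilities, difficulties: Rasch estimates in logits.
    """
    responses = np.asarray(responses, dtype=float)
    b = np.asarray(abilities, dtype=float)[:, None]
    d = np.asarray(difficulties, dtype=float)[None, :]

    expected = np.exp(b - d) / (1 + np.exp(b - d))        # model probabilities (eq 1)
    variance = expected * (1 - expected)                   # binomial information
    z_squared = (responses - expected) ** 2 / variance     # squared standardized residuals

    outfit = z_squared.mean(axis=0)                                     # unweighted
    infit = (variance * z_squared).sum(axis=0) / variance.sum(axis=0)   # weighted
    return infit, outfit
```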

Rasch Model: Reliability

In the Rasch model, reliability can be evaluated through the “person separation reliability”.38 Person separation reliability is similar to Cronbach’s α and is computed using an individual’s standard deviation from the mean and the root-mean-square measurement error.48 Separation reliability is an estimate of the “reproducibility of relative measure location”.38 That is, it estimates the consistency of the items on the instrument in measuring a person’s ability. The separation reliability of the pretest and posttest data is 0.70 and 0.75, respectively. These values indicate “acceptable” consistency in measuring a person’s ability.36
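Person separation reliability has a simple form once person measures and their standard errors are available. The sketch below uses one common formulation (model-based “true” variance divided by observed variance) and is intended only to illustrate the idea behind the values quoted above, not to reproduce the Winsteps calculation exactly.

```python
import numpy as np

def person_separation_reliability(person_measures, standard_errors):
    """Estimate person separation reliability from Rasch person measures (logits)."""
    measures = np.asarray(person_measures, dtype=float)
    errors = np.asarray(standard_errors, dtype=float)

    observed_variance = measures.var(ddof=1)
    mean_square_error = np.mean(errors ** 2)     # squared root-mean-square error
    true_variance = observed_variance - mean_square_error
    return true_variance / observed_variance
```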



SUMMARY OF RESULTS

Data gathered from 3025 students (pretest) and 2684 students (posttest) were analyzed using both classical test theory and Rasch model methods. This was collected from four different schools, reflecting a similar target population to that used by Mulford and Robinson in their study.1 Table 2 shows the percentage correct data for each item on the CCI for both studies. While a few items show performance differences between data sets, remarkable similarities emerge in terms of the range of scores and the relative success on individual items. This result supports the original percentage correct data, showing that those data are not an artifact of the population initially used to study the instrument.

As the CCI is intended to gauge student conceptions of a variety of chemistry topics, item-level data beyond percentage correct becomes vital in making more accurate inferences. The difficulty and discrimination for each of the 22 items on the CCI are represented in Figure 1. These data provide quantitative information about the item functioning, allowing future users to gauge the quality of each item prior to use or as a comparison for other data collected. For example, item 9 (bond energy) has difficulty values less than 0.40 (few students correctly respond to the item), and is the least discriminating item in each data set. Without these data, it might be inferred that only students of the highest ability are getting the item correct; however, based on the discrimination values, it is clear that this would be an incorrect inference. Several CCI items (5, 7, 8, 9, 14, 20, 21, and 22) have difficulty and discrimination values outside of the recommended range for a concept inventory. These items may not produce useful data and require further investigation prior to being used to support or refute student learning.

Other additions to the psychometric data for the CCI are the Rasch analysis results presented from this study. The Rasch results show that the pretest and the posttest data do not violate the assumptions of unidimensionality or local independence and that most items nicely fit the Rasch model. Item 9 was the only item that did not fit based on both fit indices. Three other items have high fit values when considering outlier students (outfit) but are within the acceptable ranges for the weighted fit values (infit). Wright maps of the Rasch results (Figure 2) show good correlation between the range of student abilities in the first-semester chemistry population and the difficulty range of CCI items. While a few gaps exist in the difficulty range of the items, the similarity in mean values for person ability and item difficulty indicates that the CCI is at an appropriate level of difficulty for the population under investigation.

Further support of the original results comes in the area of internal consistency. A comparison in Cronbach’s α values shows the pretest and posttest values from this study (0.73 and 0.76, respectively) are in the same category as the values of 0.704 and 0.716 reported in the original study. Additional reliability data for the CCI is presented in the test−retest analysis of posttest data and in the Rasch analysis of person separation reliability. The test−retest data indicates reasonably good reproducibility (correlation value of 0.79) of the results. The Rasch person separation reliability values of 0.70 (pretest) and 0.75 (posttest) indicated good consistency in the CCI items’ placement of students along the logit scale.



CONCLUSIONS

Results from data gathered in this study overlap with those published by Mulford and Robinson1 in several areas. These similarities help to validate the original study, showing that the percentage correct and internal consistency values are similar across various general chemistry populations. While a few items require further investigation, the additional classical test theory and Rasch model analyses presented indicate that the Chemical Concepts Inventory functions reasonably well as an instrument. In summary, the CCI is a suitable instrument for large-scale assessment of students’ alternate conceptions, providing the classroom instructor or chemical education researcher with initial information regarding student understanding.

While the original results, and those presented here, substantiate the functioning of the items, users are cautioned about only using data from the Chemical Concepts Inventory in making inferences about student understanding. Results from any multiple-choice instrument cannot be fully understood without triangulation. For example, if a student gets an item about the conservation of mass incorrect, does this mean that he or she does not understand this concept? A multiple-choice response can be the result of several variables other than student understanding. Students could be misinterpreting the item wording or related images; they could be distracted during the instrument administration; or they could simply not be taking the assessment seriously. While several features of instrument and item analysis are sensitive to these aspects and can be used to evaluate if a problem exists, they do not provide information regarding the nature of the problem. Many of these issues can be investigated and understood through further quantitative analysis and parallel qualitative studies. For example, if several items addressing conservation of mass exist on an instrument, a student’s consistency in response among those items can be used as a more robust way to evaluate his or her understanding of the concept. However, this evaluation is not possible when topics or concepts are only addressed by a single item.

The Chemical Concepts Inventory is one of the most widely known instruments within chemistry education; however, it is also one of the least studied instruments. While the psychometric results presented here support and augment those originally reported, several areas are still open to investigation. As the CCI is purported as a measure of student conceptions, parallel studies are required to investigate how well the instrument measures any single conception. This can be done from a qualitative standpoint by investigating the response reasoning of students in interview settings. In addition, the consistency of student responses on related items can be investigated. As several concepts are only covered by a single CCI item, an external source of items would be required in some areas. Items from the CCI can also be used as the start of smaller “single-concept” inventories. While the CCI provides breadth, “single-concept” inventories would provide depth of understanding on individual concepts of importance within general chemistry.





ASSOCIATED CONTENT

Supporting Information

Items and distractors from the current version of the Chemical Concepts Inventory, showing the correct answers and the percentage of students choosing each response. This material is available via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



REFERENCES

(1) Mulford, D. R.; Robinson, W. R. An Inventory for Alternate Conceptions among First-Semester General Chemistry Students. J. Chem. Educ. 2002, 79 (6), 739−744. (2) Treagust, D. F. Development and Use of Diagnostic Tests To Evaluate Students’ Misconceptions in Science. Int. J. Sci. Educ. 1988, 10, 159−169. (3) Cacciatore, K. L.; Sevian, H. Incrementally Approaching an Inquiry Lab Curriculum: Can Changing a Single Laboratory Experiment Improve Student Performance in General Chemistry? J. Chem. Educ. 2009, 86 (4), 498−505. (4) Mayer, K. Addressing Students’ Misconceptions about Gases, Mass, and Composition. J. Chem. Educ. 2011, 88 (1), 111−115. (5) Regan, A.; Childs, P.; Hayes, S. The Use of an Intervention Programme To Improve Undergraduate Students’ Chemical Knowledge and Address Their Misconceptions. Chem. Educ. Res. Pract. 2011, 12 (2), 219−227. (6) Kruse, R. A.; Roehrig, G. H. A Comparison Study: Assessing Teachers’ Conceptions with the Chemical Concepts Inventory. J. Chem. Educ. 2005, 82 (8), 1246−1250. (7) Bernholt, S.; Parchmann, I. Assessing the Complexity of Students’ Knowledge in Chemistry. Chem. Educ. Res. Pract. 2011, 12 (2), 167−173. (8) Branan, D.; Morgan, M. Mini-Lab Activities: Inquiry-Based Lab Activities for Formative Assessment. J. Chem. Educ. 2010, 87 (1), 69−72.

(9) Canpolat, N. Turkish Undergraduates’ Misconceptions of Evaporation, Evaporation Rate, and Vapour Pressure. Int. J. Sci. Educ. 2006, 28 (15), 1757−1770. (10) Cliff, W. H. Chemistry Misconceptions Associated with Understanding Calcium and Phosphate Homeostasis. Adv. Physiol. Educ. 2009, 33 (4), 323−328. (11) Costu, B.; Ayas, A.; Niaz, M. Promoting Conceptual Change in First-Year Students’ Understanding of Evaporation. Chem. Educ. Res. Pract. 2010, 11 (1), 5−16. (12) Kerr, S. C.; Walz, K. A. “Holes” in Student Understanding: Addressing Prevalent Misconceptions Regarding Atmospheric Environmental Chemistry. J. Chem. Educ. 2007, 84 (10), 1693−1696. (13) Nyachwaya, J. M.; Mohamed, A.-R.; Roehrig, G. H.; Wood, N. B.; Kern, A. L.; Schneider, J. L. The Development of an Open-Ended Drawing Tool: An Alternative Diagnostic Tool for Assessing Students’ Understanding of the Particulate Nature Of Matter. Chem. Educ. Res. Pract. 2011, 12 (2), 121−132. (14) Othman, J.; Treagust, D. F.; Chandrasegaran, A. L. An Investigation into the Relationship between Students’ Conceptions of the Particulate Nature Of Matter and Their Understanding of Chemical Bonding. Int. J. Sci. Educ. 2008, 30 (11), 1531−1550. (15) Villafane, S. M.; Bailey, C. P.; Loertscher, J.; Minderhout, V.; Lewis, J. E. Development and Analysis of an Instrument To Assess Student Understanding of Foundational Concepts Before Biochemistry Coursework. Biochem. Mol. Biol. Educ. 2011, 39 (2), 102−109. (16) Waner, M. J. Particulate Pictures and Kinetic-Molecular Theory Concepts: Seizing an Opportunity. J. Chem. Educ. 2010, 87 (9), 924−927. (17) Canpolat, N.; Pınarbaşı, T.; Sözbilir, M. Prospective Teachers’ Misconceptions of Vaporization and Vapor Pressure. J. Chem. Educ. 2006, 83 (8), 1237−1242. (18) Chandrasegaran, A. L.; Treagust, D. F.; Waldrip, B. G.; Chandrasegaran, A. Students’ Dilemmas in Reaction Stoichiometry Problem Solving: Deducing the Limiting Reagent in Chemical Reactions. Chem. Educ. Res. Pract. 2009, 10 (1), 14−23. (19) Holme, T.; Bretz, S. L.; Cooper, M.; Lewis, J.; Paek, P.; Pienta, N.; Stacy, A.; Stevens, R.; Towns, M. Enhancing the Role of Assessment in Curriculum Reform in Chemistry. Chem. Educ. Res. Pract. 2010, 11 (2), 92−97. (20) Marbach-Ad, G.; McAdams, K. C.; Benson, S.; Briken, V.; Cathcart, L.; Chase, M.; El-Sayed, N. M.; Frauwirth, K.; Fredericksen, B.; Joseph, S. W.; Lee, V.; McIver, K. S.; Mosser, D.; Quimby, B. B.; Shields, P.; Song, W.; Stein, D. C.; Stewart, R.; Thompson, K. V.; Smith, A. C. A Model for Using a Concept Inventory As a Tool for Students’ Assessment and Faculty Professional Development. CBE Life Sci. Educ. 2010, 9 (4), 408−416. (21) Potgieter, M.; Ackermann, M.; Fletcher, L. Inaccuracy of Self-Evaluation As Additional Variable for Prediction of Students at Risk of Failing First-Year Chemistry. Chem. Educ. Res. Pract. 2010, 11 (1), 17−24. (22) Potgieter, M.; Davidowitz, B. Preparedness for Tertiary Chemistry: Multiple Applications of the Chemistry Competence Test for Diagnostic and Prediction Purposes. Chem. Educ. Res. Pract. 2011, 12 (2), 193−204. (23) Rushton, G. T.; Hardy, R. C.; Gwaltney, K. P.; Lewis, S. E. Alternative Conceptions of Organic Chemistry Topics among Fourth-Year Chemistry Students. Chem. Educ. Res. Pract. 2008, 9 (2), 122−130. (24) Tan, K. C. D.; Treagust, D. F.; Chandrasegaran, A. L.; Mocerino, M. Kinetics of Acid Reactions: Making Sense of Associated Concepts. Chem. Educ. Res. Pract. 2010, 11 (4), 267−280. (25) Wood, C.; Breyfogle, B. Interactive Demonstrations for Mole Ratios and Limiting Reagents. J. Chem. Educ. 2006, 83 (5), 741−748. (26) Halakova, Z.; Proksa, M. Two Kinds of Conceptual Problems in Chemistry Teaching. J. Chem. Educ. 2007, 84 (1), 172−174. (27) Jurisevic, M.; Glazar, S. A.; Pucko, C. R.; Devetak, I. Intrinsic Motivation of Pre-Service Primary School Teachers for Learning Chemistry in Relation to Their Academic Achievement. Int. J. Sci. Educ. 2008, 30 (1), 87−107.

(28) Stains, M.; Talanquer, V. A2: Element or Compound? J. Chem. Educ. 2007, 84 (5), 880−883. (29) Streveler, R. A.; Miller, R. L.; Santiago-Roman, A. I.; Nelson, M. A.; Geist, M. R.; Olds, B. M. Rigorous Methodology for Concept Inventory Development: Using the “Assessment Triangle” To Develop and Test the Thermal and Transport Science Concept Inventory (TTCI). Int. J. Eng. Educ. 2011, 27 (5), 968−984. (30) Chandrasegaran, A. L.; Treagust, D. F.; Mocerino, M. Emphasizing Multiple Levels of Representation To Enhance Students’ Understandings of the Changes Occurring during Chemical Reactions. J. Chem. Educ. 2009, 86 (12), 1433−1436. (31) Smith, K. C.; Nakhleh, M. B.; Bretz, S. L. An Expanded Framework for Analyzing General Chemistry Exams. Chem. Educ. Res. Pract. 2010, 11 (3), 147−153. (32) Villafane, S. M.; Loertscher, J.; Minderhout, V.; Lewis, J. E. Uncovering Students’ Incorrect Ideas about Foundational Concepts for Biochemistry. Chem. Educ. Res. Pract. 2011, 12 (2), 210−218. (33) Singer, S. R.; Nielsen, N. R.; Schweingruber, H. A. DisciplineBased Education Research: Understanding and Improving Learning in Undergraduate Science and Engineering; National Academies Press: Washington, DC, 2012. (34) Towns, M. H.; Kraft, A. Review and Synthesis of Research in Chemical Education from 2000−2010 (Paper Commissioned by the Board on Science Education); The National Academies: Washington, DC, 2011. http://www7.national-academies.org/bose/DBER_ Towns_October_Paper.pdf (accessed Jan 2013). (35) Crocker, L.; Algina, J. Introduction to Classical and Modern Test Theory; Cengage Learning: Mason, OH, 2008. (36) Bond, T. G.; Fox, C. M. Applying the Rasch Model, 2nd ed.; Routledge: New York, 2007. (37) Embretson, S. E.; Reise, S. P. Item Response Theory for Psychologists; Psychology Press: Lawrence Erlbaum Associates Inc.: Mahwah, NJ, 2000. (38) Linacre, J. M. WINSTEPS Rasch-Model Computer Program, 3.72.3; 2010. (39) Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; University of Chicago Press: Chicago, IL, 1980. (40) Pentecost, T.; Barbera, J. Measuring Learning Gains in Chemical Education: A Comparison of Two Methods. J. Chem. Educ., submitted for publication, 2012. (41) Kelley, T. L. The Selection of Upper and Lower Groups for the Validation of Items. J. Educ. Psychol. 1939, 30, 17−24. (42) Kline, T. J. B. Psychological Testing: A Practical Approach to Design and Evaluation; Sage: Thousand Oaks, CA, 2005. (43) Ebel, R. L.; Frisbie, D. A. Essentials of Educational Measurement; Prentice Hall: Englewood Cliffs, NJ, 2006. (44) Thorndike, R. M. IRT and Intelligence Testing: Past, Present and Future. In The New Rules of Measurement: What Every Psychologist and Educator Should Know; Embretson, S. E., Hershberger, S. L., Eds.; Erlbaum: Mahwah, NJ, 1999. (45) Chandrasegaran, A. L.; Treagust, D. F.; Mocerino, M. The Development of a Two-Tier Multiple-Choice Diagnostic Instrument for Evaluating Secondary Students’ Ability To Describe and Explain Chemical Reactions Using Multiple Levels Of Representation. Chem. Educ. Res. Pract. 2007, 8 (3), 293−307. (46) Allen, M. J.; Yen, W. M. Introduction to Measurement Theory; Waveland Press: Long Grove, IL, 1979. (47) Linacre, J. M. A User’s Guide to Winsteps, 3.70.1; Winsteps.com: Beaverton, OR, 2010. http://www.winsteps.com/ index.htm (accessed Jan 2013). (48) Wright, B. D.; Masters, G. Rating Scale Analysis; Mesa: Chicago, IL, 1982.




NOTE ADDED AFTER ASAP PUBLICATION

This paper was published ASAP on February 12, 2013, with the data in columns 3 and 4 of Table 2 interchanged. The corrected version was reposted on May 1, 2013.

