
Understanding the State of the Art for Measurement in Chemistry Education Research: Examining the Psychometric Evidence

Janelle A. Arjoon, Xiaoying Xu, and Jennifer E. Lewis*
Department of Chemistry, University of South Florida, Tampa, Florida 33620, United States
J. Chem. Educ. 2013, 90, 536−545. dx.doi.org/10.1021/ed3002013. Published: March 29, 2013.

ABSTRACT: Many of the instruments developed for research use by the chemistry education community are relatively new. Because psychometric evidence dictates the validity of interpretations made from test scores, gathering and reporting validity and reliability evidence is of utmost importance. Therefore, the purpose of this study was to investigate what “counts” as psychometric evidence within this community. Using a methodology based on concepts described and operationalized in the Standards for Educational and Psychological Testing, instruments first published in the Journal between 2002 and 2011, and follow-up publications reporting the use of these instruments, were examined. Specifically, we investigated the availability of evidence based on test content, response processes, internal structure, relations to other variables, temporal stability, and internal consistency. Findings suggest that our peer review and reporting practices value some types of evidence while neglecting others. Results of this study serve as an indication of the need for the chemistry education research community to view gathering psychometric evidence as a collective activity and to report such evidence at the level of detail appropriate to inform future research.

KEYWORDS: Chemical Education Research, Testing/Assessment
FEATURE: Chemical Education Research



INTRODUCTION

Research is more powerful if it is theory-based, and chemistry education research is a theory-based discipline.2−4 As an emerging applied discipline, chemistry education research uses theories from more mature disciplines such as psychology, sociology, and philosophy.1 It has been suggested that the use of theory creates blinders, causing researchers to ignore anything that does not seem to fit.5 However, the use of an established theoretical base for research has several benefits, most importantly, the ability to tie ideas to existing knowledge, making research more comprehensible and methodological.6 To become a more mature discipline, chemistry education researchers must continue to conduct theory-based research, and eventually develop and adapt theories unique to chemistry education.

An important aspect of any research-based discipline is the ability to pose and to answer research questions. Even when guided by theoretical frameworks, the ability to answer a research question is only as good as the instrument(s) used to gather the research data.7 High-quality instruments improve the ability to answer research questions, while low-quality instruments impede research. In the qualitative tradition, the researcher may be considered the instrument; in the quantitative tradition, the instruments exist external to the researcher and can be examined as separate contributors to the research effort in light of guidelines for sound measurement. It is therefore important to determine what is known about the instruments that are available for use by the chemistry education research community, essentially examining the state of the art for measurement in chemistry education research.

Quantitative measurement in social research assigns a number to a particular attribute for a specific person via an instrument, enabling comparisons about that person’s performance.8 In education, interpretations follow, and often have real consequences, for example, informing student placement decisions or curriculum changes. Validity and reliability are the two most important aspects of measurement.7 While having existing instruments is convenient, these instruments should not be used without sufficient evidence of their ability to produce valid and reliable scores in the desired context. If the instruments’ scores are not valid and reliable, the resulting interpretations will also be invalid, and can lead to potentially detrimental decisions for students and for the research or educational enterprise.

Access to instruments producing scores that can lead to valid interpretations is crucial for quantitative researchers in chemistry education. Potential instrument users must examine psychometric evidence to determine whether an instrument yields scores that can lead to valid interpretations. Collecting psychometric evidence begins during an instrument’s development and can continue long after the instrument becomes available.9 This study presents an examination of select psychometric evidence informed by the sources of evidence described in the Standards for Educational and Psychological Testing (hereafter referred to as the Standards).10 Evidence based on test content, response processes, internal structure, relations to other variables, replicate administrations, and internal consistency will be examined, leading to specific recommendations for researchers desiring to use instruments published in the Journal of Chemical Education (hereafter referred to as the Journal) between 2002 and 2011. Implications regarding the de facto standards for measurement in our community are also discussed.


Figure 1. Schematic representation of concepts in the theoretical framework.



METHOD

Sample Selection

This study examines the psychometric evidence available for instruments published in the Journal between 2002 and 2011. This peer-reviewed academic monthly is the official journal of the Division of Chemical Education of the American Chemical Society (ACS). Articles published in the Journal span multiple interests in chemistry education including “chemical content, activities, laboratory experiments, instructional methods, and pedagogies”.11 The international readership includes “instructors of chemistry from middle school through graduate school” and other chemistry professionals concerned with the teaching and learning of their discipline.11 Thus, instruments published in the Journal are likely to be specifically relevant to chemistry education. Because sound measurement in chemistry education is crucial, and because a steady increase in Journal publications reporting psychometric evidence has been observed, we examined the frequency and quality of evidence reported in recent years.

Only new instruments published between 2002 and 2011 were included in this study. The title, abstract, and keywords of each peer-reviewed Journal publication during this time period were searched, specifically including articles concerning the development of new instruments as well as articles reporting the use of these instruments. Because articles reporting the use of instruments originally published in the Journal may have appeared in other venues, a follow-up search using the Web of Science and Google Scholar databases was conducted. Web of Science indexes over 8300 science journals across 150 disciplines and over 4500 social science journals.12 For Google Scholar, journal coverage appears to be strongest in science and technology.13 Although this database does not offer information about the journals indexed, time-span, or distribution of records by discipline, it generates search results that typically fall outside of commercial tools.14 Thus, using Google Scholar for a supplemental search provides a more comprehensive search than relying on only a commercial tool such as Web of Science.

Psychometric evidence gathered during the initial and follow-up searches was examined via a conceptual framework comprising best practice in measurement. The three authors established consensus for each instrument with respect to each conceptual category by examining the published research reports.



CONCEPTUAL FRAMEWORK

Our conceptual framework uses definitions and operationalizations of validity and reliability as described in the Standards.10 The American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) have jointly developed the Standards, revising them four times since 1966, with the most recent edition published in 1999. These Standards were, therefore, in existence when the instruments in the sample were published. The Standards provide criteria for evaluating tests, testing practices, and the effects of test use, where “test” is broadly construed to include affective scales, surveys, and traditional knowledge tests.10 In addition to representing the consensus of the three sponsoring organizations, the Standards have been cited in major court decisions, including a Supreme Court case in 1988,15 lending the Standards additional authority. Concepts from the Standards that provide our framework for decision making are described below and shown in Figure 1.

Sources of Validity Evidence

Validity refers to “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests”.10,16−18 While the concept of validity may seem straightforward, multiple sources of validity evidence are relevant. These sources of evidence illuminate different aspects of validity, yet they all represent the unitary concept of validity.10,18−20


Checking validity evidence each time an instrument is used is important, because, as Nunnally21 indicates, problems with measurement quality arise not only from the instrument itself but also from sampling error associated with people. In other words, the same instrument can produce different validity results for different groups of people. Our conceptual framework for validity includes four sources of validity evidence as described in the Standards: evidence based on test content, response processes, internal structure, and relations to other variables.

Evidence Based on Test Content

According to the Standards, “test content refers to the themes, wording, and format of the items, tasks, or questions on a test, as well as the guidelines for procedures regarding administration and scoring”,10 which is consistent with the definition of content validity used in other studies.22,23 Content validity assesses the extent to which a domain, which can be either as concrete as a body of knowledge or as abstract as a psychological state, is represented by the intended measure.24 Establishing evidence of content validity is crucial, as interpretations of test scores concerning a particular domain are only valid if the test adequately captures that domain. Typically, this type of evidence is established by a panel of domain experts (distinct from the item writers) judging whether the items appropriately sample the domain of interest,9 with a focus on sufficient and accurate representation. Because the observed test scores are inseparably linked to how students interpret and respond to the test items, in order to have further evidence of validity, the test must also demonstrate the ability to sample the cognitive processes required to answer items related to that domain.25 Therefore, it is important to examine evidence based on students’ response processes.

Evidence Based on Response Processes

Response processes are the underlying cognitive activities that respondents use to answer a question.26 Examining this type of evidence ensures that the processes being used by respondents to answer test questions are those intended by the test developer, or, in the case of multiple graders for a nonobjective test, that the way in which graders are rating respondents is consistent with the intent of the test developer. This type of evidence is obtained by analyzing individual responses.10 Cognitive interviews are a useful tool for examining response processes. These are in-depth interviews used to get feedback from study participants about items on an instrument,27 and provide insight into respondents’ thought processes when responding to test items. For example, if an item asks about interest in chemistry, some respondents may perceive interest in chemistry as being about preparing for a chemistry-related career, while other respondents may perceive the item as being about liking or disliking chemistry. Cognitive interviews allow the test developer to explore the respondents’ understanding of the nuances of the item. While interviews may provide the most detailed information about student response processes, other methods to gather information from respondents may also be helpful. For example, providing a text box below a Likert or multiple-choice item for students to explain their response, asking students to show calculations, or using eye-tracking are all ways to verify students’ correct interpretation of an item.

Evidence Based on Internal Structure

The internal structure of an instrument concerns the relationships among instrument items.10 A conceptual framework for an instrument may prescribe the intended construct as unidimensional or multidimensional; regardless, the item relationships should be consistent with the underlying structure. Evidence based on internal structure establishes the degree to which the items on the instrument conform to the hypothetical construct.28 Two methods that can be used to collect this evidence are factor analysis and differential item functioning. Factor analysis examines interrelationships by exploring the simple patterns among responses to individual test items.29 The assumption is that items measuring the same construct will load on the same “factor” and item scores will be correlated with one another. There are two types of factor analysis. Exploratory factor analysis (EFA) is used to explore the underlying factor structure of a set of measured variables without imposing a preconceived structure on the outcome, while confirmatory factor analysis (CFA) provides an estimate of the degree of fit of the hypothesized model to the set of measured variables.30 Without evidence from factor analysis, it is invalid to sum item scores to create a factor score.

The second approach used to gather evidence based on the internal structure of an instrument is known as differential item functioning (DIF). DIF indicates that an item functions significantly differently for distinct subgroups of examinees (e.g., grouped by sex, race or ethnicity, socioeconomic status, etc.), even though there is no real difference between the groups on the construct being measured.31,32 Finding DIF essentially reveals an unintended dimension associated with the instrument scores. Checking for DIF is important because these otherwise hidden construct-irrelevant differences in scores produce erroneous interpretations of simple group-mean differences. Within classical test theory, DIF can be detected using Mantel−Haenszel (MH) statistics or logistic regression; for item response theory, comparisons of fit parameters for the distinct groups are used. While it may not always be feasible to check for DIF before an instrument’s initial publication, others can check for DIF following subsequent usage of the instrument.10
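To make the second approach concrete, the sketch below screens simulated item responses for uniform DIF with logistic regression, one of the classical-test-theory methods named above. It is an illustrative example rather than a procedure taken from this study; the data, variable names, and the 0.05 flagging threshold are assumptions.

```python
# Illustrative sketch (not from the article): screening items for uniform DIF
# with logistic regression on simulated data. All names and thresholds are
# assumptions for demonstration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_students, n_items = 400, 10

# Hypothetical scored responses (1 = correct, 0 = incorrect) and a grouping
# variable (e.g., 0/1 for two demographic groups); no true DIF is built in.
ability = rng.normal(size=n_students)
group = rng.integers(0, 2, size=n_students)
difficulty = rng.normal(size=n_items)
prob_correct = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty)))
responses = (rng.random((n_students, n_items)) < prob_correct).astype(int)

total = responses.sum(axis=1)  # matching criterion: total test score

for item in range(n_items):
    y = responses[:, item]
    # Predictors: total score (proxy for the construct) plus group membership.
    X = sm.add_constant(np.column_stack([total, group]))
    fit = sm.Logit(y, X).fit(disp=0)
    group_p = fit.pvalues[-1]  # p-value for the group term
    # A small p-value flags the item for closer review (possible uniform DIF).
    if group_p < 0.05:
        print(f"Item {item}: possible uniform DIF (p = {group_p:.3f})")
```

In practice a flagged item would be examined further, including a group-by-score interaction term to probe nonuniform DIF and an adjustment for multiple comparisons, rather than being removed automatically.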

Evidence Based on Relations to Other Variables

Another way to examine the validity of a measure’s scores is by investigating the relationships between the construct of interest and other variables. The nomological network is an organizational scheme for evidence of relationships among variables. It is the interlocking system of laws that constitute a theory.33 Here, empirical evidence is used to build “lawful” sets of relationships between the construct being measured and other theoretically relevant constructs.33,34 Thus, the nomological network is essentially a web of interconnected constructs. For a relatively new construct, the nomological network has not been established; the network becomes apparent as researchers learn more about the construct itself.33

In addition to the four sources of validity evidence included in our conceptual framework, a fifth source, consequences of testing, is also discussed in the Standards. Consequences of testing refers to both the intended and unintended consequences of test use. Potential unintended consequences of test use, for example, can include bias or loss of opportunity for specific groups of people.35 The concept of consequential validity is considered to be contentious36 and the role of consequences in validity has not been clearly established.37 The Standards include a helpful discussion of evidence based on test consequences, highlighting that the distinction between social policy and validity is important. With respect to validity, only those unintended consequences that can be traced to invalidity (such as construct underrepresentation or construct-irrelevant variance) are relevant. Because other sources of validity evidence also involve probing for construct underrepresentation (evidence based on test content) and construct-irrelevant variance (evidence based on internal structure), we have not included evidence based on consequences of testing as a separate category. Thoroughly examining the role of consequences in measurement would be beyond the scope of this study.

Validity evidence is only one aspect of an instrument’s scores. To obtain meaningful test scores, there should be some degree of stability in test-takers’ responses if the measure is given multiple times.10 If the same test yields markedly different scores for each administration to the same individuals, we cannot reliably interpret those scores. An instrument must measure the construct of interest with some degree of consistency if we hope to make valid interpretations. Therefore, examining reliability evidence is important.
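As a minimal illustration of evidence based on relations to other variables, the following sketch correlates hypothetical attitude-scale scores with a simulated external variable (course grades). The data, the direction of the relationship, and the variable names are assumptions chosen only to show the mechanics of one link in a nomological network.

```python
# Illustrative sketch (not from the article): quantifying a relationship
# between scores on a hypothetical attitude scale and an external variable.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 250

attitude = rng.normal(3.5, 0.6, size=n)               # hypothetical scale scores (1-5)
grade = 55 + 6 * attitude + rng.normal(0, 8, size=n)   # weakly related outcome

r, p = stats.pearsonr(attitude, grade)
print(f"Pearson r = {r:.2f} (p = {p:.3g})")
# A theory predicting a weak positive association would be supported by a
# small-to-moderate r; the sign and size of such correlations, accumulated
# across studies, are what gradually build the network.
```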


Table 1. The Sample: Instruments Originally Published in the Journal from 2002−2011

Year | Author(s) | Instrument | Construct Assessed | Items | Response Scale
2002 | Mulford and Robinson | Chemistry Concept Inventory | Chemistry content knowledge | 22 | Multiple-choice
2005 | Bauer | Chemistry Self-Concept Inventory | Self-concept in chemistry | 40 | Semantic differential
2005 | Lee et al. | Student Conceptual Understanding Test | Stoichiometry | 14 | Two-tier multiple-choice
2006 | Ozkaya et al. | Conceptual and Problem Solving Tests on Galvanic Cells | Content knowledge and calculations about galvanic cells | 18 and 9 | Multiple-choice
2007 | Grove and Bretz | Chemistry Expectations Survey | Cognitive expectations for chemistry | 47 | Five-point Likert
2008 | Barbera | Colorado Learning Attitudes about Science Survey | Attitude about chemistry | 50 | Five-point Likert
2008 | Bauer | Attitude toward the Subject of Chemistry Inventory | Attitude toward chemistry | 20 | Semantic differential
2008 | Hockings | Attitudinal survey | Attitude toward group learning in chemistry | 41 | Five-point Likert
2009 | Chatterjee et al. | Inquiry Laboratory Attitude Survey | Attitude toward inquiry laboratories | 19 | Five-point Likert
2009 | Cooper and Sandi-Urena | Metacognitive Activities Inventory | Metacognitive skillfulness | 27 | Five-point Likert
2009 | Lacosta-Gabari et al. | Groundwater Pollution Test | Attitude toward groundwater pollution | 19 | Five-point Likert
2010 | Reardon et al. | Chemistry Course Perceptions Questionnaire | Perception toward chemistry | 7 | Likert
2011 | Bell and Volckmann | Knowledge Surveys for General Chemistry I and II | Confidence in problem-solving related to content | 126 and 77 | Three-point scale
2011 | Cheung | Guided Inquiry Scale | Teacher beliefs about using guided inquiry | 30 | Seven-point Likert
2011 | Knaus et al. | Cognitive Complexity Rating | Complexity and cognitive load | N/A | Four-step process
2011 | Oloruntegbe and Ikpe | Science concepts learned at school experienced at home | Thirteen chemistry concepts | N/A | Questionnaire, checklist
2011 | Stains et al. | Structure and Motion of Matter Survey | Particulate nature of matter ideas | 3 | Open-ended questions
2011 | Vachliotis et al. | Systemic Assessment Questions | Understanding of organic reactions | 9 | Multiple choice, fill in the blank, open

Reliability

Reliability concerns the consistency of a measure and procedures used to score that measure when it is administered to the same individuals multiple times.10,24 Having consistent scores over multiple occasions ensures that the scores can be reliably obtained and are not due to chance. For many reasons, including changes in effort and test anxiety, obtaining perfect reliability of scores over multiple occasions is unlikely.10 However, because users cannot confidently make generalizations from unreliable data, reliability information should still be reported. According to the Standards, reliability information may be reported as variances or standard deviations of measurement error, coefficients, or test information functions based on item response theory. The instruments in the sample report only reliability information based on coefficients, hence only this method will be discussed here. The most commonly reported evidence is based on replicate administrations and internal consistency of the instrument, each of which is described below.

Evidence Based on Temporal Stability

To ensure that a test consistently measures test-takers’ performance on the construct of interest, multiple test administrations are needed. Gathering this evidence involves administering the test to the same group of individuals twice, and calculating the correlation between the scores from the first and second administration.10,38,39 This method of estimating reliability assumes that there will be some degree of stability in the scores obtained over repeated test administrations. This assumption is warranted only when there has been no treatment or maturation between administrations that would be expected to change the respondents’ status with respect to the construct of interest. To the extent that this is true, the correlation between respondents’ scores on both test administrations, the test−retest coefficient, indicates the consistency of scores over replicate administrations.
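A test−retest coefficient of the kind described here can be computed directly from paired administrations. The sketch below uses simulated scores for the same respondents at two time points and assumes no treatment or maturation in between; it is illustrative only.

```python
# Illustrative sketch (not from the article): the test-retest coefficient is the
# Pearson correlation between scores from two administrations of the same
# instrument to the same respondents. Data here are simulated.
import numpy as np

rng = np.random.default_rng(2)
n = 120

true_level = rng.normal(size=n)                   # stable standing on the construct
time1 = true_level + rng.normal(0, 0.5, size=n)   # administration 1 (with error)
time2 = true_level + rng.normal(0, 0.5, size=n)   # administration 2 (with error)

test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"test-retest coefficient: {test_retest:.2f}")
# Values near 1 indicate stable scores across administrations; instruments with
# multiple scales would report one coefficient per scale.
```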

Evidence Based on Internal Consistency

Another type of reliability estimate is an internal consistency coefficient. Essentially, people who respond to one test item in a particular way will be more inclined to respond to other, similar items in the same way. That is, items that are related or measuring the same construct will be correlated,40 providing a measure of internal consistency. Cronbach’s α41 is the most commonly used measure of internal consistency reliability,42 although it has been critiqued for underestimating test reliability7 and for having little value when used by itself.43 Regardless, the usefulness of α to estimate an instrument’s internal consistency has been documented42 and those using classical test theory continue to report this coefficient. Reporting Cronbach’s α values that are aligned with the internal structure of an instrument is important; for example, an instrument with subscales should report the α value for each subscale. Typically, the cited desired value is 0.7 or above;28 however, there is no particular value of α that is considered to be acceptable. Actually, the purpose for which the test is being used determines the value of α that is considered desirable.44
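For readers who want to see the calculation, the sketch below computes Cronbach’s α from a respondent-by-item score matrix, both overall and for a hypothetical subscale. The data are simulated and the subscale split is an assumption made only for illustration.

```python
# Illustrative sketch (not from the article): Cronbach's alpha from an
# item-response matrix, overall and for a hypothetical subscale.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2D array, rows = respondents, columns = items of one scale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

rng = np.random.default_rng(3)
latent = rng.normal(size=(200, 1))
scores = latent + rng.normal(0, 1.0, size=(200, 8))   # 8 items sharing one factor

print(f"overall alpha: {cronbach_alpha(scores):.2f}")
print(f"subscale alpha (items 1-4): {cronbach_alpha(scores[:, :4]):.2f}")
# As noted above, alpha should be reported for the scores actually interpreted
# (e.g., each subscale), not only for the full instrument.
```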





RESULTS AND DISCUSSION

Our initial search in the Journal yielded 18 publications reporting the development of 20 new instruments during this time period (Table 1). The follow-up search yielded 14 publications using all or part of the same instrument, the Chemistry Concept Inventory (Table 2), and 5 publications using other instruments (Table 3). All 37 publications were used to examine the psychometric evidence associated with the 20 instruments listed in Table 1. Findings are discussed based on the instrument for which evidence is reported.

Table 2. Cited Uses of the Chemistry Concept Inventory

Year | Author(s) | Details of Use
2005 | Kruse and Roehrig | Intact instrument
2006 | Wood and Breyfogle | Fewer than six items, incorporated into larger test
2007 | Halakova and Proksa | Fourteen items, modified into two forms with seven pictorial and seven verbal items
2008 | Jurisevic et al. | Seven items as part of a larger test, six items modified, one item unchanged
2009 | Cacciatore and Sevian | Intact instrument
2009 | Marais and Combrinck | Intact instrument and a single item
2010 | Cokadar | A single item incorporated into larger test
2010 | Costu | A single item incorporated into larger test
2010 | Potgieter et al. | At least six items, incorporated into larger test
2011 | Mayer | Seven items as a stand-alone test
2011 | Potgieter and Davidowitz | At least six items, incorporated into larger test
2011 | Regan, Childs, and Hayes | Twelve items, incorporated into larger test
2011 | Van Duzor | “Modified” version given
2011 | Villafane et al. | A single item incorporated into larger test

Table 3. Cited Uses of the Other Instruments in the Sample

Year | Authors | Instrument | Details of Use
2008 | Cooper et al. | Metacognitive Activities Inventory | As intact instrument
2009 | Lewis et al. | Chemistry Self-Concept Inventory | As intact instrument, scores subjected to cluster analysis
2010 | Sanabria-Rios and Bretz | Chemistry Expectations Survey | Translated into Spanish (QUIMX); tested in Puerto Rico
2011 | Brandriet et al. | Attitude toward the Subject of Chemistry Inventory (ASCIv2) | Intact modified instrument as published by Xu and Lewis (2011)
2011 | Xu and Lewis | Attitude toward the Subject of Chemistry Inventory | Both intact and modified into a two-factor instrument (ASCIv2)

Validity Evidence

Evidence Based on Test Content. Evidence based on test content was reported by the test developers for 11 of the instruments.45−55 For instruments lacking this evidence, an important step for future users would be to gather evidence before using the instrument in a planned study, and then to publish the evidence with the study results. Interestingly, although Bauer51,52 explicitly labels an examination of group differences as content validity evidence, the Standards, while agreeing that this type of evidence is relevant for validity, place it in a different category. Within the framework established by the Standards, this evidence is commonly gathered by seeking input from a panel of experts in the relevant content domain. Test developers of nine of the instruments reporting evidence based on test content appear to have employed some form of expert panel review to gather the evidence.45−50,53−55 For two instruments, the panel of experts comprised faculty members and graduate students.50,53 Other test developers also report teachers55 and faculty members46−48 as content experts, while others use the language “chemistry experts”,54 “chemical education experts”,54 “chemical education researchers”,45 or simply “expert”.49 In one case, the developer reports deciding that expert panel review would be unnecessary because the instrument was only slightly modified from an established instrument in another field.52 As previously mentioned, if an expert panel is assembled for review purposes, it should be different from the item writers and others on the development team. Although it can be inferred in many cases for the sample, either by subtraction or from described expertise, that the expert panels must at least have included people not otherwise engaged in the instrument development process, it would be advisable to be unambiguous. Explicitly stating that the expert panel excludes the instrument developers and item writers strengthens claims of validity based on test content. Evidence based on test content can potentially be assembled in more subtle ways. For example, multiple follow-up publications have reported using the Chemistry Concept Inventory or items from it (Table 2).40,56−68 To the extent that chemistry instructors and researchers chose this instrument or items from it to assess student learning in multiple studies constitutes an after-the-fact expert panel review, the Chemistry Concept Inventory may be considered to have the most evidence based on test content for any instrument in the sample. One of these follow-up publications58 even explicitly gathered validity evidence related to item format, another aspect of evidence based on test content as described in the Standards. It is overwhelmingly the case for our sample that content domain experts have been drawn from the ranks of chemistry faculty, chemistry education researchers, and graduate students in those fields. In only one case was it explicitly stated that an expert from another field, educational psychology, was consulted.53 While this is not surprising, given the primary audience for the instruments, it may be appropriate to consider the possible benefits of including experts in cognitive science, psychology, or other domains when an instrument is designed to measure a construct that has a long history in those other domains. Ensuring that the constructs measured by instruments used in chemistry education research are recognizable to experts in other fields will continue to advance the field’s theory base in productive ways. In that light, Bauer’s decision52 to adapt an instrument from another field and forego additional expert panel




review is an acknowledgement of the desirability of working within established theoretical frameworks. Evidence Based on Response Processes. The test developers discuss gathering evidence based on response processes for six of the instruments in the sample.45,47−50,53 Without this information, there is uncertainty about proper interpretations of items. This evidence can be gathered any time an instrument is used; ideally, it is gathered every time and reported so that, over time, interpretations are supported by a well-developed understanding of the range of responses in a particular population. Researchers wishing to use instruments currently lacking information about response processes can incorporate the gathering of this information into research protocols and therefore help to build the body of psychometric evidence. The potential benefits of using cognitive interviews to gather this evidence have been reported.69 Unlike clinical interviews, which are conducted during the instrument’s development, cognitive interviews are conducted following initial item development, to determine whether respondents are interpreting the items as desired. Cognitive interviews were used by the developers of four instruments in the sample.45,48−50 Mulford and Robinson reported only that students interpreted the test items as they intended.45 Similarly, Oloruntegbe et al. indicated only that the results for students who were interviewed did not differ from the results for students who were not interviewed.49 According to the Standards, the results of cognitive interviews can enrich the definition of a construct. Therefore, only knowing how students interpreted test items is insufficient. Test users should be aware of more detailed information such as the kinds of questions asked during the interviews. Two test developers provided such valuable information, including sample questions and descriptions of the interview process.48,50 Additionally, Adams et al.48 provided specific details about wording changes and items for removal, which is closely aligned with the intent described in the Standards. Lacosta-Gabari et al. reported asking students to write a paragraph discussing the reason for their response to each of approximately 30 Likert-scale items on a trial version of an instrument.47 While limited, this information enabled the researchers to confirm that students needed the full five-point scale to express the range of possible responses, and to examine the level of similarity among the reasons for each point on the scale. Although the instrument was subsequently changed, details about the nature of the changes were not given. Similarly, Cooper and Sandi-Urena report asking respondents to write comments in blank space about items during pilot testing, and they, too, did not provide information about the content of those comments or how those comments influenced the instrument items.53 Because different contexts may trigger different responses, potential users of even these six instruments still have opportunity to collect and report this type of evidence during their own planned studies. For the instruments currently lacking evidence based on response processes, the field is wide open. Asking students to comment on the meaning of items, either via interviews or with open-response options, deepens understanding of what is being measured. 
Explicitly discussing the data gathered, to illuminate response processes and the emergent implications, is imperative for developing a community understanding of the constructs measured by a given instrument. Evidence Based on Internal Structure. Nine of the instruments in our sample reported evidence based on the internal structure of the instrument.47,48,51−53,70−73 Therefore, evidence needs to be collected for the eleven instruments lacking this evidence, even for those that are designed to be unidimensional. For the nine instruments reporting evidence based on internal structure, different approaches were used, each with both merits and potential for improvement. The CHEMX instrument used faculty responses to obtain this evidence.70 Because this instrument is intended for students, further study of item relationships with student response data is desirable. Adams et al. conducted a “reduced-basis” factor analysis,48 a method that is not typically used for psychometric analyses. Repeating the analysis using one of the more common methods for comparison with the results from this more unusual approach revealed some discrepancies.74 Among the seven remaining instruments, this evidence was most commonly gathered via EFA. Two studies produced unintended dimensions or only partially meaningful interpretations based on the item groups emerging from EFA. Cooper and Sandi-Urena reported unexpected EFA results in the Supporting Information. While they did not find these results useful,53 their publication empowers others to compare EFA results from other samples and to explore possible refinements to the instrument. Vachliotis et al. provided even more detailed information, reporting the hypothesis tested, emerging factors, amount of the observed variance accounted for by these factors, and factor loadings for each item following varimax rotation.73 Again, these factor analysis results provide useful information for others who may be interested in measuring similar constructs. In two studies, EFA found distinct and meaningful factors, but they were inconsistent with the initially intended scales. For the Groundwater Pollution Test, measuring attitude toward groundwater pollution, the authors started with four factors but their EFA identified three prominent factors.47 For Hockings et al.’s 41-item survey measuring student attitude toward group learning in chemistry, the four emerging factors from EFA are also different from the initially intended factors.71 The only instrument for which the emerging factors and initially intended factors were identical is the Chemistry Self-Concept Inventory, with its previously noted use of well-developed constructs from an instrument in another field.52 Although in these three cases the EFA results were interpretable, when an instrument is designed based on a theoretical or conceptual model, CFA is a more appropriate method. While EFA illuminates the internal structure of an instrument at the development stage, CFA allows the verification of an instrument’s intended factors.75−77 Unlike EFA, CFA provides fit indices to examine the overall fit of a model, offers information about sources of measurement error, and indicates how model modifications may improve the fit. CFA also permits the testing of alternative models, allowing developers to find the most parsimonious way to interpret the data, which is pivotal for the interaction between theory and measurement. For example, a possible next step for the Chemistry Self-Concept Inventory would be to compare different possible models via CFA to determine whether the data supports a generalized self-concept factor. The results, positive or negative, would speak directly to theory regarding the proposed contextual nature of self-concept. Considering the instruments in this sample, CFA is under-used by chemistry education researchers. Only one instrument, the Guided Inquiry Scale, included CFA results in the original publication.72 Evidence based on internal structure can be reported from multiple sites to establish that the intended structure works for



different samples. Follow-up publications are commonly used to report this information from additional uses of the instrument. The Attitude toward the Subject of Chemistry Inventory51 had the most evidence based on internal structure from multiple sites. Originally, Bauer used EFA and reported items loading strongly on three factors and weakly on two factors.51 A follow-up publication by others reported factor analysis results for both the original and revised version of the instrument, including a CFA to estimate how well their two-factor model fit the data. The CFA results showed that the model had a reasonable fit for the data, supporting the modified instrument, ASCIv2.78 Brandriet et al. used the two-factor instrument at another institution, also confirming the model via CFA with their data.79 The three reports collectively provide a good example of how multiple researchers can contribute to the exploration of psychometric evidence with different samples. Interestingly, none of the instruments in the sample reported the results of DIF as evidence based on internal structure. Nevertheless, the use of DIF should be encouraged as it allows researchers to check for the existence of both intended and unintended group-mean differences. A few recent studies in the Journal do use DIF to detect possible test bias.80−82 Importantly, claims of difference cannot be upheld without an examination of potential DIF for the relevant instrument scores. Evidence Based on Relations to Other Variables. The developers of 19 instruments in the sample performed analysis of relationships to other variables to answer research questions. Only one instrument, the Groundwater Pollution Test, was published without a report of results on the relationship to external variables, but the authors mentioned that the instrument can be used to measure attitude of different groups or to track attitudinal change.47 For most instruments, the developers did not call their analysis of relationships an aspect of validity evidence. Because such information helps potential users understand how the variable(s) being measured relate to external variables, or how students from different groups or abilities may vary on the intended construct, analyzing relationships provides evidence based on relations to other variables as in the Standards. In an interdisciplinary field such as chemistry education research, the ability to determine relationships among variables and build a theoretical network is particularly important. Various analytic methods have been conducted to establish relations with other variables, including χ2 analysis,50,54,71 analysis of variance or its simpler analog, the ttest,45,48,51−53,55,70−72,83,84 multiple regression,55,71 and correlation analysis,51,52,84,85 depending on whether the measured variable(s) and external variables are categorical or continuous. Taking the network idea further, Reardon et al. performed path analysis to examine the relationships of chemistry course perceptions with many other variables.46 Accumulating evidence based on relations with multiple variables is critical to support network building. This evidence has been gathered for three instruments on multiple occasions. 
For the Metacognitive Activities Inventory (MCAI), the developers reported a low (r = 0.16) yet significant correlation between the MCAI and grade-point average, which was in agreement with the degree of relationship in the literature.53 In a follow-up publication,86 the developers and another author examined the convergence from two methods, the self-reported MCAI and the Interactive MultiMedia Exercises (IMMEX), to measure metacognition ability in chemistry problem solving. They found the correlation between the two methods significant but not high (r = 0.2). Bauer developed the Chemistry Self-Concept Inventory and made comparisons among three student populations, hypothesizing an inverse relationship between exposure to chemistry and chemistry self-concept.52 In a follow-up publication, Lewis et al.87 reported findings to support Bauer’s results. Lewis et al. ordered student groups according to their performance on the First-Term General Chemistry ACS Exam88 and reported a clear relationship between self-concept and ACS exam performance, with higher self-concept being related to higher ACS exam scores, suggesting a difference in self-concept among groups of varying chemistry ability. The results from scores obtained with the instrument agree with the underlying theory. For the Attitude toward the Subject of Chemistry Inventory, Bauer gathered evidence by calculating the correlation with course grade and results demonstrated a weak association, as predicted.51 Follow-up publications by other authors78,79 found the relationships among variables to be consistent with Bauer’s and other studies.89−91 As demonstrated, collective evidence can support the building of a nomological network. Researchers who plan to use quantitative measures can help to develop a comprehensive theoretical framework specific to the context of chemistry education research by continuing to report relationships with other theoretically meaningful variables when instruments are used.

Reliability Evidence

Evidence Based on Temporal Stability. Evidence based on replicate administrations was established for seven of the instruments in the sample.47−53,78 Of these instruments, Adams et al. report calculating the test−retest coefficient using scores obtained from two different samples.48 According to multiple sources, the scores must be obtained by administering the test twice to the same group of individuals.10,38,39 Therefore, evidence based on replicate administrations needs to be gathered for CLASS-Chemistry48 as well as for the other instruments in the sample. Five of the other six instruments in the sample reporting this evidence also have test−retest coefficients only as reported in the original publication. Lacosta-Gabari et al. reported a test−retest coefficient of 0.6447 and Oloruntegbe and Ikpe reported a coefficient of 0.72.49 Cooper and Sandi-Urena reported test−retest coefficients of 0.53 and 0.51 for the main study and the replication study, respectively.53 Bauer provided a test−retest coefficient for each subscale, ranging from 0.64 to 0.90.52 Stains et al. also provided a range of coefficients, but, because of the way in which the Structure and Motion of Matter survey was scored, they used Goodman and Kruskal’s γ coefficient.50 These coefficients ranged from 0.69 to 0.95. Justification for the cases in which it is appropriate to use the γ coefficient has been proposed by Ritchey.92 The instrument with the most available evidence based on replicate administrations is the Attitude toward the Subject of Chemistry Inventory.51 Initially, Bauer reported test−retest coefficients ranging from 0.47 to 0.74 for each of the three factors, the Emotional Satisfaction subset, and the Fear item.51 Xu and Lewis reported an overall test−retest coefficient of 0.9 for their two-factor version of the instrument.78 The sample size is small and the authors fail to follow Bauer’s correct practice of reporting the coefficient for each scale. Generally, although the developers of six instruments report some information about temporal stability from replicate administrations, all would benefit from additional attention to this issue. Determining test−retest coefficients cannot be thought of as the responsibility of the instrument developer only, as the coefficients may be different for different samples.
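Because Goodman and Kruskal’s γ may be less familiar than the Pearson test−retest coefficient, the following sketch computes γ for two sets of ordinal codes from the same respondents. The data are simulated and are not drawn from the Stains et al. study; the category labels are assumptions.

```python
# Illustrative sketch (not from the article): Goodman and Kruskal's gamma for
# two ordinal ratings of the same respondents. Data are hypothetical.
import numpy as np
from itertools import combinations

def goodman_kruskal_gamma(x, y) -> float:
    """gamma = (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1   # ties contribute to neither count
    return (concordant - discordant) / (concordant + discordant)

rng = np.random.default_rng(4)
first = rng.integers(0, 4, size=60)                           # ordinal codes, time 1
second = np.clip(first + rng.integers(-1, 2, size=60), 0, 3)  # mostly similar, time 2

print(f"gamma = {goodman_kruskal_gamma(first, second):.2f}")
```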



Evidence Based on Internal Consistency. Evidence based on internal consistency was reported for 15 of the instruments in the sample.45−48,51−53,55,70−73,84,85 Thirteen of the instruments reporting this evidence used Cronbach’s α. Lacosta-Gabari et al. reported an overall Cronbach’s α value of 0.74 and 0.68 for the main study and replication study, respectively.47 Similarly, Bell and Volckmann report overall Cronbach’s α values for multiple administrations of their knowledge survey instruments, with values consistently above 0.95.84 Adams et al.48 and Lin et al.55 reported only an overall coefficient of 0.89 and 0.77, respectively. Others reported more detailed information. Cheung indicated that the α coefficients for each of three subscales fell within a narrow range, 0.74−0.79.72 Bauer and Hockings et al. reported an α coefficient for each subscale in the respective instruments, and Bauer also reported an α coefficient for the Emotional Satisfaction set even though it was not considered to be a subscale.51,52,71 Although Grove and Bretz analyzed faculty responses to collect psychometric evidence rather than responses from the target population, they reported an overall α coefficient of 0.97 and coefficients for each cluster ranging from 0.73 to 0.89.70 The second way in which this evidence was reported employed the Intra-Class Correlation (ICC) coefficient. Knaus et al. reported this evidence using a “two-way mixed ICC with the consistency definition”.85 The use of the ICC coefficient by Knaus et al. as a measure of internal consistency reliability analogous to Cronbach’s α is consistent with what is described by Shrout and Fleiss93 as acceptable use of the coefficient. Knaus et al. reported ICC values above 0.82 for the multiple raters with either 45 items (9 raters) or 72 items (4 raters).85 In addition to evidence reported in the original publications, further evidence was published for two of the instruments in the sample. Lewis et al. used the Chemistry Self-Concept Inventory52 to measure students’ self-concept in a general chemistry course, reporting Cronbach’s α values ranging from 0.65 to 0.93,87 suggesting that Bauer’s instrument functions consistently in multiple settings. Further evidence for the Attitude toward the Subject of Chemistry Inventory51 was reported in follow-up publications by others.78,79 Additionally, Xu and Lewis78 and Brandriet et al.79 reported α coefficients for the modified version of Bauer’s instrument. Xu and Lewis obtained α coefficients above 0.79 for each subscale and Brandriet et al. reported α coefficients ranging from 0.74 to 0.84. Although Bauer’s two instruments have reports of Cronbach’s α from other researchers, even for these instruments it would be desirable for researchers to report this information in order to further develop the pool of evidence related to internal consistency of scores obtained from instruments in different contexts.
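As an illustration of the ICC variant mentioned above, the sketch below computes a two-way, consistency-type ICC for averaged ratings, ICC(C,k) in Shrout and Fleiss’s notation, from a simulated items-by-raters matrix. The 45-item, 9-rater dimensions echo the Knaus et al. case only as an example; the ratings themselves are invented.

```python
# Illustrative sketch (not from the article): a two-way, consistency-definition
# ICC for the average of k raters, computed from a targets-by-raters matrix.
import numpy as np

def icc_consistency_avg(x: np.ndarray) -> float:
    """x: 2D array, rows = rated objects (e.g., exam items), cols = raters."""
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((x - grand) ** 2).sum()
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows   # algebraically, alpha across raters

rng = np.random.default_rng(5)
item_effect = rng.normal(size=45)                                    # 45 hypothetical items
ratings = item_effect[:, None] + rng.normal(0, 0.4, size=(45, 9))    # 9 hypothetical raters

print(f"ICC(C,k) = {icc_consistency_avg(ratings):.2f}")
```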



CONCLUSIONS AND IMPLICATIONS

The findings of this study serve as an indicator of the state of measurement in chemistry education research. While some publications reported psychometric evidence at the level of detail described in the Standards, others used nonstandard procedures. This suggests that there is a gap in what we, as a community, know about measurement and what is considered standard practice in measurement, and that there is a need for measurement education within the community.


Additionally, the findings of this study show that, within the chemistry education community, some types of evidence were seldom reported. Focusing on the two sources of reliability evidence present in the sample, evidence based on replicate administrations was reported quite infrequently. While this information can be difficult to collect as it requires access to the same respondents within a short time interval, this type of evidence is critical to developing an understanding of what degree of change in a measured construct is simply due to random factors. If researchers do not have a sense of construct stability over time, studies that attempt to measure change in the construct over time, such as those investigating the effects of curricular changes, lack valuable baseline data. The difference between test−retest coefficients and internal consistency coefficients is important to acknowledge. Cronbach’s α is much easier to determine, as demonstrated by the large number of instruments in the sample reporting Cronbach’s α estimates, but it is not a substitute for a test−retest coefficient. Regardless, it is salient to recognize that both types of reliability estimates are necessary for each measured construct; in other words, when subscale scores for an instrument are the focus of interpretation, reliability estimates associated with those subscale scores should be reported. As the field continues to develop with respect to measurement, generalizability theory may also be a useful approach to understanding reliability in different contexts.

In terms of sources of validity evidence, evidence based on response processes is the least common. Further, of the six instruments in the sample that have published information on response processes, only two include details that enable potential users to understand how the information was gathered and how it may have influenced the instrument. Although the focus of this study has been quantitative measures, qualitative research skills are necessary to fully understand an instrument’s scores. In light of the Standards’ acknowledgement that information about response processes enriches the understanding of a construct, it seems clear that in-depth interviews with respondents are key to ensuring that instruments for chemistry education research are measuring constructs resonant with the real world of student experiences. In addition, cognitive interviews with chemistry students regarding their responses to instruments developed for other fields are likely to be a rich source of nascent theory relevant for research specific to chemistry education. Still, in the absence of in-depth interviews, it is possible to contribute to the collective understanding of response processes for a particular instrument by gathering and reporting on open-ended responses or investigating eye-tracking behavior.

More common in the sample was evidence based on internal structure, with half of the instruments reporting some investigation of a factor structure. To be able to report and interpret composite scores for a particular subscale, this evidence must be available to the test user. Factor analysis, especially CFA, is strongly recommended for development and verification of theory in chemistry education research based on empirical results obtained from instrument scores. Regardless of which approach to factor analysis is employed, when scores do not produce expected interpretable factors, both the underlying concepts that were designed into the instrument and the instrument itself are simultaneously exposed to critique and revision. The willingness of instrument developers to provide raw factor analysis results creates excellent opportunities for continued exploration of the intended constructs by the larger research community. This iterative process is key for developing solid conceptualizations of constructs to support robust theory building. In terms of other

potential unintended factor structures, it was notable that, although many instruments in the sample were associated with explorations of group differences, none were accompanied by DIF studies. While it may be possible to conduct DIF investigations of existing data sets, it is also reasonable for others in the research community to investigate DIF with new data. Those investigations are critically important if claims about group differences with respect to instrument scores are to be upheld. Evidence based on test content was also reasonably common in the sample. This suggests that, as a community, chemistry education researchers value checking whether the instrument represents the construct of interest or domain adequately. Nevertheless, it would be more desirable if this evidence existed for all of the instruments in the sample, and if specification of outside expertise was explicit in all cases. As chemistry education research continues to connect with theories from other disciplines, the importance of having researchers from those disciplines on expert panels to provide content review grows. The most-often reported validity evidence for this sample was evidence based on relations to other variables. All instruments save one47 established this type of evidence according to the procedures described in the Standards. This suggests that, as a community, chemistry education researchers consider exploring relationships among variables common practice. However, there is still room for exploration. While most instrument developers reported the group differences on the measured constructs, few reported the degree to which the measured constructs related to other variables. Only one instrument has had its scores explored in a network of constructs via path analysis.46 Exploring relationships among variables is necessary for building nomological networks to support theory development. As an interdisciplinary field, chemistry education research still relies on the use of theories developed in other fields.4 If the chemistry education research community expects to develop theories of its own, researchers must explore the relationships among variables and report the findings to the community. By accepting this challenge, researchers will begin to build a nomological network that comprises relationships among constructs that are relevant to chemistry education specifically. On a basic level, in order to make valid interpretations from instruments’ scores, chemistry educators need to understand the quality of psychometric evidence associated with the instruments they want to use. The simple existence of an instrument, as has been shown here, in no way implies that it is “validated” and “reliable,” even if such concepts were possible to attribute to an instrument rather than to its scores in a particular context. Instead, instruments generally have a few sources of reliability and validity evidence explored in one or two contexts, but are lacking in other evidence and other contexts. This circumstance does not mean that chemistry educators should refrain from using available instruments, but rather that educators should interpret scores with caution, and stay informed about potential changes emerging from further studies with the instrument. When possible, chemistry educators can also consider collaborations that contribute to ongoing research into the meaning of an instrument’s scores. 

On a basic level, in order to make valid interpretations from instruments’ scores, chemistry educators need to understand the quality of psychometric evidence associated with the instruments they want to use. The simple existence of an instrument, as has been shown here, in no way implies that it is “validated” and “reliable,” even if such concepts were possible to attribute to an instrument rather than to its scores in a particular context. Instead, instruments generally have a few sources of reliability and validity evidence explored in one or two contexts, but are lacking in other evidence and other contexts. This circumstance does not mean that chemistry educators should refrain from using available instruments, but rather that educators should interpret scores with caution, and stay informed about potential changes emerging from further studies with the instrument. When possible, chemistry educators can also consider collaborations that contribute to ongoing research into the meaning of an instrument’s scores.

In addition to evidence being reported by the instrument developer, researchers have a key role to play in ensuring that the evidence accumulates over time and in diverse contexts. No single publication could provide even the limited number of sources of evidence referenced in this study, let alone in the diverse contexts that are desirable for a full understanding of how an instrument functions. If scores from quantitative measures are to be interpreted as part of a research study, psychometric evidence should be gathered and reported. In that light, all developers and users of the instruments in our sample should be commended for reporting some psychometric evidence. It is our hope that this study has illuminated multiple ways for researchers to contribute to the existing pool of evidence. As members of the chemistry education research community, if we all continue to gather even small amounts of evidence whenever possible and report that evidence at a level of detail consistent with the Standards, we will continue to move the community forward in accordance with our values as a theory-guided, data-driven field.



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



REFERENCES

(1) Abraham, M. In Nuts and Bolts of Chemical Education Research; Bunce, D. M., Cole, R. S., Eds.; American Chemical Society: Washington, DC, 2008; p 47.
(2) Herron, J. D.; Nurrenbern, S. C. J. Chem. Educ. 1999, 76, 1353.
(3) Bunce, D. M.; Robinson, W. R. J. Chem. Educ. 1997, 74, 1076.
(4) Bunce, D. M.; Gabel, D.; Herron, J. D.; Jones, L. J. Chem. Educ. 1994, 71, 850.
(5) Kuhn, T. S. The Structure of Scientific Revolutions, 2nd ed.; University of Chicago Press: Chicago, IL, 1970.
(6) Novak, J. D. A Theory of Education; Cornell University Press: Ithaca, NY, 1977.
(7) Miller, M. B. Struct. Equation Model.: Multidiscip. J. 1995, 2, 255.
(8) National Council on Measurement in Education. http://ncme.org/ (accessed Mar 2013).
(9) Crocker, L.; Algina, J. Introduction to Classical and Modern Test Theory, 2nd ed.; Wadsworth Publishing Company: Belmont, CA, 2006.
(10) American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing; American Educational Research Association: Washington, DC, 1999.
(11) Journal of Chemical Education. http://pubs.acs.org/journal/jceda8 (accessed Mar 2013).
(12) Thomson Reuters. Web of Science Fact Sheet. http://thomsonreuters.com/content/science/pdf/Web_of_Science_factsheet.pdf (accessed Mar 2013).
(13) Thompson Rivers University Library. Google Scholar. http://libguides.tru.ca/content.php?pid=83255 (accessed Mar 2013).
(14) Meier, J. J.; Conkling, T. W. J. Acad. Libr. 2008, 34, 196.
(15) Watson v. Fort Worth Bank and Trust, 487 U.S. 977 (1988). http://laws.findlaw.com/us/487/977.html (accessed Mar 2013).
(16) Cronbach, L. In Educational Measurement, 2nd ed.; Thorndike, R., Ed.; American Council on Education: Washington, DC, 1971; p 443.
(17) Kane, M. T. Psychol. Bull. 1992, 112, 527.
(18) Messick, S. Am. Psychol. 1995, 50, 741.
(19) Messick, S. Educ. Res. 1989, 18, 5.
(20) Cronbach, L. J. In Test Validity; Wainer, H., Braun, H., Eds.; Erlbaum: Hillsdale, NJ, 1988; p 3.
(21) Nunnally, J. C. Psychometric Theory, 2nd ed.; McGraw-Hill: New York, 1978.
(22) Messick, S., Ed. Test Validation, 3rd ed.; American Council on Education, National Council on Measurement in Education: Washington, DC, 1993.
(23) Anastasi, A. Psychological Testing; MacMillan: New York, 1988.
(24) Carmines, E. G.; Zeller, R. A. Reliability and Validity Assessment; Sage: Beverly Hills, CA, 1979.
(25) Gorin, J. S. Educ. Res. 2007, 36, 456.
(26) Embretson, S. E. Psychometrika 1984, 49, 175.
(27) Willis, G. B. Cognitive Interviewing: A Tool for Improving Questionnaire Design; Sage Publications: Thousand Oaks, CA, 2005.
(28) Murphy, K. R.; Davidshofer, C. O. Psychological Testing: Principles and Applications, 6th ed.; Prentice Hall: Upper Saddle River, NJ, 2005.
(29) Thompson, B. Score Reliability: Contemporary Thinking on Reliability Issues; Sage Publications: Thousand Oaks, CA, 2003.
(30) Hong, T.; Purzer, S. E.; Cardella, M. E. J. Eng. Educ. 2011, 100, 800.
(31) Angoff, W. In Differential Item Functioning; Holland, P. W., Wainer, H., Eds.; Routledge: Hillsdale, NJ, 1993; p 3.
(32) Woods, C. M. Appl. Psychol. Meas. 2011, 35, 536.
(33) Cronbach, L. J.; Meehl, P. E. Psychol. Bull. 1955, 52, 281.
(34) Moss, P. A. Educ. Res. 1994, 23, 5.
(35) Practitioner’s Guide to Assessing Intelligence and Achievement; Naglieri, J. A., Goldstein, S., Eds.; Wiley: Hoboken, NJ, 2009.
(36) Brennan, R. L. Educational Measurement, 4th ed.; Praeger: Westport, CT, 2006.
(37) Kane, M. T. J. Educ. Meas. 2001, 38, 319.
(38) McIntire, S. A.; Miller, L. A. Foundations of Psychological Testing: A Practical Approach, 2nd ed.; Sage Publications: Thousand Oaks, CA, 2006.
(39) Bless, C.; Higson-Smith, C. Fundamentals of Social Research Methods, an African Perspective, 3rd ed.; Juta: Lansdowne, South Africa, 2000.
(40) Villafane, S. M.; Bailey, C. P.; Loertscher, J.; Minderhout, V.; Lewis, J. E. Biochem. Mol. Biol. Educ. 2011, 39, 102.
(41) Cronbach, L. J. Psychometrika 1951, 16, 297.
(42) Peterson, R. A. J. Consum. Res. 1994, 21, 381.
(43) Sijtsma, K. Psychometrika 2009, 74, 107.
(44) Velayutham, S.; Aldridge, J.; Fraser, B. Int. J. Sci. Educ. 2011, 33, 2159.
(45) Mulford, D. R.; Robinson, W. R. J. Chem. Educ. 2002, 79, 739.
(46) Reardon, R. F.; Traverse, M. A.; Feakes, D. A.; Gibbs, K. A.; Rohde, R. E. J. Chem. Educ. 2010, 87, 643.
(47) Lacosta-Gabari, I.; Fernandez-Manzanal, R.; Sanchez-Gonzalez, D. J. Chem. Educ. 2009, 86, 1099.
(48) Adams, W. K.; Wieman, C. E.; Perkins, K. K.; Barbera, J. J. Chem. Educ. 2008, 85, 1435.
(49) Oloruntegbe, K. O.; Ikpe, A. J. Chem. Educ. 2011, 88, 266.
(50) Stains, M.; Escriu-Sune, M.; Molina Alvarez de Santizo, M. L.; Sevian, H. J. Chem. Educ. 2011, 88, 1359.
(51) Bauer, C. F. J. Chem. Educ. 2008, 85, 1440.
(52) Bauer, C. F. J. Chem. Educ. 2005, 82, 1864.
(53) Cooper, M. M.; Sandi-Urena, S. J. Chem. Educ. 2009, 86, 240.
(54) Chatterjee, S.; Williamson, V. M.; McCann, K.; Peck, M. L. J. Chem. Educ. 2009, 86, 1427.
(55) Lee, S. T.; Treagust, D.; Lin, H. J. Chem. Educ. 2005, 82, 1565.
(56) Kruse, R. A.; Roehrig, G. H. J. Chem. Educ. 2005, 82, 1246.
(57) Wood, C.; Breyfogle, B. J. Chem. Educ. 2006, 83, 741.
(58) Halakova, Z.; Proksa, M. J. Chem. Educ. 2007, 84, 172.
(59) Jurisevic, M.; Glazar, S. A.; Pucko, C. R.; Devetak, I. Int. J. Sci. Educ. 2008, 30, 87.
(60) Cacciatore, K. L.; Sevian, H. J. Chem. Educ. 2009, 86, 498.
(61) Marais, F.; Combrinck, S. S. Afr. J. Chem. 2009, 62, 88.
(62) Cokadar, H. Asian J. Chem. 2010, 22, 137.
(63) Costu, B.; Ayas, A.; Niaz, M. Chem. Educ. Res. Pract. 2010, 11, 5.
(64) Potgieter, M.; Ackermann, M.; Fletcher, L. Chem. Educ. Res. Pract. 2010, 11, 17.
(65) Mayer, K. J. Chem. Educ. 2011, 88, 111.
(66) Potgieter, M.; Davidowitz, B. Chem. Educ. Res. Pract. 2011, 12, 193.
(67) Regan, A.; Childs, P.; Hayes, S. Chem. Educ. Res. Pract. 2011, 12, 219.
(68) Van Duzor, A. G. J. Sci. Educ. Technol. 2011, 20, 363.
(69) Desimone, L. M.; Le Floch, K. C. Educ. Eval. Policy Anal. 2004, 26, 1.
(70) Grove, N.; Bretz, S. L. J. Chem. Educ. 2007, 84, 1524.
(71) Hockings, S. C.; DeAngelis, K. J.; Frey, R. F. J. Chem. Educ. 2008, 85, 990.
(72) Cheung, D. J. Chem. Educ. 2011, 88, 1462.
(73) Vachliotis, T.; Salta, K.; Vasiliou, P.; Tzougraki, C. J. Chem. Educ. 2011, 88, 337.
(74) Heredia, K.; Lewis, J. E. J. Chem. Educ. 2012, 89, 436.
(75) Greenbaum, P. E.; Dedrick, R. F. Psychol. Assess. 1998, 10, 149.
(76) Kember, D.; Leung, D. Y. P. Br. J. Educ. Psychol. 1998, 68, 395.
(77) Netemeyer, R. G.; Bearden, W. O.; Sharma, S. Scaling Procedures: Issues and Applications; Sage Publications: Thousand Oaks, CA, 2003.
(78) Xu, X.; Lewis, J. E. J. Chem. Educ. 2011, 88, 561.
(79) Brandriet, A. R.; Xu, X. Y.; Bretz, S. L.; Lewis, J. E. Chem. Educ. Res. Pract. 2011, 12, 271.
(80) Jiang, B.; Xu, X.; Garcia, A.; Lewis, J. E. J. Chem. Educ. 2010, 87, 1430.
(81) Schroeder, J.; Murphy, K. L.; Holme, T. A. J. Chem. Educ. 2012, 89, 346.
(82) Wei, S.; Liu, X.; Wang, Z.; Wang, X. J. Chem. Educ. 2012, 89, 335.
(83) Ozkaya, A. R.; Uce, M.; Saricayir, H.; Sahin, M. J. Chem. Educ. 2006, 83, 1719.
(84) Bell, P.; Volckmann, D. J. Chem. Educ. 2011, 88, 1469.
(85) Knaus, K.; Murphy, K.; Blecking, A.; Holme, T. J. Chem. Educ. 2011, 88, 554.
(86) Cooper, M. M.; Sandi-Urena, S.; Stevens, R. Chem. Educ. Res. Pract. 2008, 9, 18.
(87) Lewis, S. E.; Shaw, J. L.; Heitz, J. O.; Webster, G. H. J. Chem. Educ. 2009, 86, 744.
(88) Examinations Institute of the American Chemical Society Division of Chemical Education. First-Term General Chemistry Exam; University of Wisconsin–Milwaukee: Milwaukee, WI, 2002.
(89) Germann, P. J. J. Res. Sci. Teach. 1988, 25, 689.
(90) Reynolds, A. J.; Walberg, H. J. J. Educ. Psychol. 1992, 84, 371.
(91) Salta, K.; Tzougraki, C. Sci. Educ. 2004, 88, 535.
(92) Ritchey, F. J. The Statistical Imagination: Elementary Statistics for the Social Sciences, 2nd ed.; McGraw-Hill: Boston, MA, 2008.
(93) Shrout, P. E.; Fleiss, J. L. Psychol. Bull. 1979, 86, 420.
