Test–Retest Reliability of the Adaptive Chemistry ... - ACS Publications

Nov 11, 2015
Article pubs.acs.org/jchemeduc

Test−Retest Reliability of the Adaptive Chemistry Assessment Survey for Teachers: Measurement Error and Alternatives to Correlation
Jordan Harshman and Ellen Yezierski*
Department of Chemistry & Biochemistry, Miami University, 651 East High Street, Oxford, Ohio 45056, United States
Supporting Information

ABSTRACT: Determining the error of measurement is a necessity for researchers engaged in bench chemistry, chemistry education research (CER), and a multitude of other fields. Discussions regarding what constructs measurement error entails and how to best measure them have occurred, but the critiques of traditional measures have yielded few alternatives. We present a brief review of psychometric reliability from its beginning in the early 1900s as well as a summary of critiques of the test−retest reliability coefficient. Then, we posit our own novel measurement, the zeta-range estimator, to assist in quantifying and accounting for measurement error, which those interested in educational research will find beneficial. We provide a proof-of-concept using simulated data and then analyze the reliability of items on the Adaptive Chemistry Assessment Survey for Teachers, a survey designed to characterize data-driven inquiry. While the focus is on CER, the zeta-range estimator also holds significant value for those outside of educational research, as future work can expand our proof-of-concept to account for more than two measurements. While this estimator is a great starting place, we discuss its limitations and hope future research can use the ideas presented here to explore new frontiers in measurement error determination.
KEYWORDS: High School/Introductory Chemistry, Chemical Education Research, Analytical Chemistry, Testing/Assessment, Gases, Stoichiometry
FEATURE: Chemical Education Research



© XXXX American Chemical Society and Division of Chemical Education, Inc.
DOI: 10.1021/acs.jchemed.5b00620 J. Chem. Educ. XXXX, XXX, XXX−XXX

INTRODUCTION

In the attempt to make tools that produce measurements about the targeted construct(s), chemistry education researchers have taken a keen interest in the psychometric properties of validity and reliability. More generally, chemistry education research (CER) is hardly alone in this interest; arguments for valid and reliable measures span educational,1,2 medical,3−5 and a multitude of other fields of research,6−10 although those fields may refer to these properties as the accuracy and precision of instruments and measurements. At the benchtop, chemists should be aware that any measurements made are of little value if the accuracy and precision of the instrument used to collect data have not been established. The same holds true for researchers of chemistry education, and is a universal consideration across the sciences. The focus of this paper is the examination of arguments for establishing reliability of data produced by a test, inventory, instrument, survey, questionnaire, or generically, a tool for measurement of one or more constructs of interest to chemists in educational or bench chemistry.

The impetus for examining reliability comes from a recent study, which examines the data-driven inquiry practices of high school chemistry teachers (outlined in a later section). In reviewing existing research to ensure that appropriate claims could be made about the reliability of data generated from our survey, it was clear that several authors in CER seek to guide and challenge future development of tools specifically in terms of appropriate evidence of reliability.11−15 Of particular interest is how Brandriet and Bretz14 and Bretz and McClary15 challenge traditional measurements in light of the theoretical lens of knowledge fragmentation,16,17 which is argued to alter the expectation of internal consistency (subsequently measured by the coefficient α18). Our review brought to light the necessity to not only examine and present reliability principles that have unfolded in the field of psychometrics, but also address the critiques of such methods recently proposed in CER in order to appropriately conduct our own analysis of survey data. We hope that some of the considerations addressed here can positively impact chemistry education researchers interested in evaluating the psychometric reliability of an instrument, of which there appear to be many.19−22

In this paper, we will present a review and critique of traditional measures of test−retest reliability as well as provide a novel means to measure test−retest reliability. Our primary focus is on researchers seeking to develop tools to measure meaningful constructs in CER who might benefit from our arguments and solution as we give a brief history of reliability, examine the limitations of existing measurements, propose alternative methods, and argue the reliability of data produced by our survey. However, we also realize that producing reliable measurements extends beyond CER and into any bench chemistry discipline concerned with making reliable (precise) measurements. Thus, we hope a wide variety of audiences will find the information presented here to be useful.

A Brief History of Reliability

The arguments for current measures in reliability have a rich, informative history. For psychometric reliability, this history begins with the works of Spearman and Brown in 1910.23,24 Most likely prompted by a newfound interest in the Binet intelligence test25 and certainly influenced by the at-the-time recent advances in correlational measures,26,27 Spearman first defined the reliability coefficient by what we know as the “split-halves” method, where the researcher correlates two halves of the scores on a test.28 It was not until 1922 that the College Entrance Examination Board (now CollegeBoard) defined the construct of reliability as being “internally consistent,” providing one of the earliest accepted definitions of reliability.29,30 However, internal consistency was not the only property ascribed to the term reliability in the 1920s. Rather, what reliability entailed was expanded by work published in 1927 by Thurstone31 when he introduced the Law of Comparative Judgments, which today would be referred to as inter-rater reliability in qualitative research. In the same year, Kelly further conflated the term by proposing that the reliability coefficient be defined as the correlation not between halves of the same test, but rather between two similar (not the exact same) tests given to the same participant.32 This complicated the previously straightforward definition of reliability by adding the notion of “consistency of response” to the existing “internal consistency.” In 1930, Paterson et al. sought to reduce quotidian (day-to-day) variation in participants’ responses by incorporating the test−retest model from Kelly, but using exactly the same test.33 Three years later, in a review of psychometric reliability, Dunlap34 differentiated among the various constructs of reliability, introduced the term “stability”, and equated reliability and internal consistency:

The use of the retest coefficient [correlation from first to second administration of the same test] assumes that we are interested primarily in the stability of the subject’s responses, that is, his freedom from the quotidian variability. This is an entirely different problem from the determination of the internal consistency, or reliability, of the test [Oxford commas added for clarity].

(Dunlap, p 446)

For some time after, reliability implied internal consistency while stability (defined then as freedom from day-to-day variation) was separate from reliability. The term “reliability” eventually adopted both constructs of internal consistency and stability after Dunlap’s differentiation, although we were unable to identify some of the works that led to this. However, it is very likely that Cronbach’s seminal introduction of the coefficient α had a large role.13 In this article, which has been cited over 24 000 times (according to Google Scholar), Cronbach generalized reliability in terms of an “appropriate estimate of the magnitude of the error of measurement” which indicates “how accurate one’s measures are” (p 297). This definition of reliability can imply both stability and internal consistency, although Cronbach’s α only addressed internal consistency. In fact, he explicitly referenced the need to design analogous coefficients intended to “deal with such other concepts as like-mindedness, stability of scores, etc.” (p 300), indicating that he recognized that his alpha was only one facet of reliability.

While the construct of reliability has been malleable, how it was measured was not. It is evident from this very brief review that stability, internal consistency, and more broadly, reliability, have been evaluated solely by some means of correlation, which is the currently suggested procedure in the Standards for Educational and Psychological Testing.35,36 In noneducational fields, reliability is not discussed in terms of internal consistency and stability, but rather follows Cronbach’s definition as an estimation of measurement error, often referred to as precision in chemistry-related fields. This is not to say that educational measurements exclude the notion of measurement error, but traditionally, the terminology for things directly measured has differed from that for things that cannot be measured directly. As an example in clinical research, two methods are known to measure gestational age (besides time since last menstrual cycle) by examining physical and neurological characteristics of newborn infants (titled after their respective authors, Robinson37 and Dubowitz38). To compare measurement error of the Robinson and Dubowitz measures, Serfontein and Jaroszewicz performed a correlation between the two scores and determined the two measures were in agreement.39 Countless other methods in a variety of fields have been compared in this way, which speaks to the importance of comparing measurement error in fields outside of CER and demonstrates the applicability of our presentation to any analytic field that relies upon evaluating measurement error.

Critique of Test−Retest Correlation

The test−retest correlation coefficient has the advantages of being simple to use and understand and of providing a good indication of correlation. Under the assumption that correlated responses from test to retest administrations equate to a good indication of reliability, this coefficient is ideal. However, as alluded to in the Introduction, a number of studies have discussed the limitations of traditionally used statistics based on correlations within the context of reliability. For example, the gestational age method comparison study was criticized heavily for the authors’ use of correlation:40−42

The correlation coefficient will therefore partly depend on the choice of subjects. For if the variation between individuals is high compared to the measurement error, the correlation will be high, whereas if the variation between individuals is low the correlation will be low.

(Altman and Bland, p 308)

Here, the authors refer to variation within an individual observation (measurement error) and the variation between observations (not related to reliability, but to validity because it reflects the sampling strategy). Since correlations measure both of these constructs, Altman and Bland discourage their use as a valid determination of measurement error.42 Additionally, in addressing test stability, Brandriet and Bretz pointed out that traditional use of test−retest correlations shows additional limitations:14


Students who earn the same scores on both test and retest implementations may not necessarily be answering the same questions correctly on both implementations. Additionally, even if a student consistently responds in an incorrect manner to an item, this does not mean that the student is choosing the same distractor. Because the specific distractors are what will be used to inform instruction, knowing the precision of students’ total scores may not be as useful as identifying consistent response choices.

(Brandriet and Bretz, p 1134)

Related to this point, surveys can be designed in a manner where a total score is not meant to be calculated, indicating the need for item-level reliability measures. These surveys sacrifice the simplicity of interpreting a single score as an indicator of some construct(s) in exchange for making inferences on a per-item basis.

Last, we provide our own challenge to the correlation-centered arguments based on the ability to discern among various hypothetical ratio data sets generated from test−retest conditions. From the hypothetical data sets in Figure 1 (simulated by data plotted in Figure 3), we assert that all but set C and possibly B will yield an excellent correlation coefficient, but not necessarily indicate that measurement error is minimal. Moreover, data sets D and E show evidence of systematic error, a more dire measurement error problem that is not detected by correlation. Data resembling that in Figure 1D may arise if an unanticipated event affects responses in the same way (e.g., if students complete an inventory about a topic, then are exposed to additional, correct information about that topic). Data resembling that in Figure 1E may arise if an unanticipated event affects responses in a systematically varied way (e.g., if students complete an inventory about a topic, then are exposed to correct and incorrect information about that topic, the poorer-performing students may increase in score while the higher-performing students may decrease in score).

Figure 1. Five hypothetical data sets reveal that constructs of correlation and stability (low within observation variance) are not the same.

Construct Stability

For the purposes of the study presented here, we outline methods to determine the stability of a construct. Some constructs, such as those that we have measured, are presumed to be relatively stable (meaning respondents are expected to respond with relative consistency from test to retest). As such, we presume that if respondents drastically change responses from test to retest, it is due to a poorly functioning item, and we discuss our proposed method under that assumption. Other constructs, however, may not be expected to be as stable (e.g., students’ conceptual understanding of bonding). Because of this, the global assumption that inconsistent responses are indicative of a poorly constructed item is potentially erroneous. We do assert that a balanced interpretation of qualitative and quantitative evidence is warranted across any and all measures.

Nominal Level Alternatives for Determining Measurement Error

For constructs measured on a nominal scale, we recommend the method described by Brandriet and Bretz.14 This entails calculating the proportions of observations that are and are not the same from one measurement to the next and comparing them to those proportions under a random model via a chi-square goodness of fit test. The authors present this analysis within the context of test−retest reliability and compute this statistic for every item of their inventory.14 By doing this, in conjunction with an effect size measure, the authors can provide evidence that the proportion of students responding consistently is greater than would be expected by random chance.

Ratio Level Alternatives for Determining Measurement Error

In our proposed estimator for ratio data (and ordinal data on a scale with more than 5 points43), which we call the zeta-range estimator, the researcher must first define a tolerance range that signifies how much variability from one measure to another on the same observation is to be considered “within acceptable error of measurement,” which is challenging to define. This range is defined as perfect reproducibility (y = x) plus or minus zeta (ζ), as is illustrated in Figure 2.

Figure 2. Area of the zeta-range around y = x and the points inside/outside of the zeta-range, where ζ = 1.

Considerations for how the ζ range should be defined are addressed later. To illustrate how a probability estimator can be generated, let us assume for now that an appropriate zeta range has been defined. The proportion of the data that lies outside of the ζ range, Pζ, can be determined (points outside of the dotted lines in Figure 2), the calculation of which is given in eq 1.

$$P_{\zeta} = \frac{n_{x,y\ \mathrm{outside}\ \zeta}}{N_{\mathrm{Total}}} \qquad (1)$$

With the nonparametric approach of bootstrapping, originally credited to Efron44 and widely adopted as a standard statistical technique, it is easy to build a confidence interval of this proportion by resampling from the observed test and retest data. First, a program draws a random sample (with replacement) of the test and retest from the original data. Second, Pζ is calculated for this new data set and stored. This process is repeated a large number (10 000) of
times. Finally, all of the stored values of Pζ make up the distribution of the Pζ estimator, where, for a 95% confidence interval, the 2.5th and 97.5th percentiles represent the lower and upper bounds. In theory, this interval contains the “true” proportion of participants that fall outside the ζ range (the proportion that fails to respond within acceptable measurement error) of the population because we assume that the sample we randomly draw from will be representative of the population.

This Pζ estimator was not designed to be the final solution, as it has a number of challenges. The first challenge is choosing an appropriate threshold for what constitutes “acceptable measurement error.” Instead of arbitrarily setting a value for ζ, we recommend asking the participants to estimate their own measurement error. For example, consider what teachers reported for the percent of assessment items that assess one single concept versus multiple concepts (I6 of our survey). The teachers we studied gave estimates of their uncertainty in their response by saying things such as “I would say my response is 20% plus or minus 10%.” This example participant would need to respond between 10% and 30% on the retest in order to be within his/her own estimation of measurement error. As opposed to setting individual ζ ranges, we took an average and applied it to all participants. However, one could build a second tier of an instrument so that each individual defines his/her own ζ range, set a ζ range based on some standard (e.g., 5% error is acceptable), or define a variable ζ range for different groups of participants based on some factor. Defining the ζ range based on participants has the advantage of being empirically driven and sample-specific, but has the disadvantage of relying on the estimates of participants who may not always be able to accurately judge their own consistency.
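
The calculation of Pζ (eq 1) and the percentile bootstrap described above can be sketched in a few lines. This is a minimal illustration in Python, not the authors' R function from the Supporting Information. Several choices here are assumptions made for the sketch: the function names, resampling test and retest values together as pairs, counting a point that lies exactly ζ from y = x as inside the range, the simple percentile indexing, and the example data, which are hypothetical.

```python
import random


def p_zeta(test, retest, zeta):
    """Proportion of paired observations lying outside y = x +/- zeta (eq 1).

    Assumption: a point exactly zeta away from y = x counts as inside.
    """
    outside = sum(1 for x, y in zip(test, retest) if abs(y - x) > zeta)
    return outside / len(test)


def bootstrap_p_zeta_ci(test, retest, zeta, n_boot=10_000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for P_zeta.

    Assumption: each test score is resampled together with its retest score.
    """
    rng = random.Random(seed)
    pairs = list(zip(test, retest))
    n = len(pairs)
    stored = []
    for _ in range(n_boot):
        # Draw n pairs with replacement, then compute and store P_zeta.
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        xs, ys = zip(*resample)
        stored.append(p_zeta(xs, ys, zeta))
    stored.sort()
    tail = (1 - level) / 2
    lower = stored[int(tail * (n_boot - 1))]
    upper = stored[int((1 - tail) * (n_boot - 1))]
    return lower, upper


# Hypothetical data: percent responses on test and retest, and the
# plus-or-minus tolerances that interviewed participants reported.
test_scores = [20, 35, 50, 10, 60, 45, 80, 25]
retest_scores = [25, 30, 75, 10, 40, 50, 85, 60]
reported_tolerances = [10, 15, 5, 10]

# Average the reported tolerances and apply the result to all participants.
zeta = sum(reported_tolerances) / len(reported_tolerances)
print(p_zeta(test_scores, retest_scores, zeta))
print(bootstrap_p_zeta_ci(test_scores, retest_scores, zeta, n_boot=2000, seed=1))
```

With individual rather than averaged tolerances, the same idea would apply by comparing each pair against that participant's own ζ instead of a single shared value.
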
Defining a standard ζ range could be appropriate, but deciding what percent error is acceptable is likely arbitrary. The second challenge, yet also one of the most interesting features of the zeta-range estimator, is the interpretation of the resulting confidence intervals. To a researcher developing a test, we could imagine that if a very small proportion (0%−15%) of participants fail to respond within a reasonable measurement error for an item, the item under study is likely producing very reliable data. Conversely, if a large proportion of participants (85%−100%) fail to respond within a reasonable measurement error, that item is likely producing very unreliable data. The interpretation of the ranges that lie between, however, is dependent on a large number of questions: What does the researcher plan to do with results for that item? How would greater/lesser measurement error affect those plans given the scale of the item? What is the magnitude of the ζ range? What does it mean that so many or so few people do not respond within a measurement error threshold? Is it possible that the construct in question is less stable than originally thought (i.e., the construct has changed for the participants over time)?

For example, suppose we are interested in teachers reporting the number of formative assessments they administer in a given semester (we have a similar item on our survey). Assume they report that they can estimate this within plus or minus 3 assessments, but the zeta-range estimator suggests that on a retest administration, 45−55% of teachers will not respond within 3 assessments of what they reported on their original test. When reporting results for this item, it is therefore critical to disclose and take this information into consideration. Descriptive statistics such as five-number summaries (first, second, and third quartiles; minimum and maximum of the sample) can still be reported for this item, but explicit reference to the imprecision of these results should be made, which tempers the certainty of the conclusions. Inferential statistics, however, are different. With only between 45 and 55% of the participants responding within acceptable measurement error, it would be highly inadvisable to include this item in inferential models or tests, such as regression. Without consideration of (and reporting on) measurement error, the inferential models or tests would generate an invalid output disguised as something meaningful.

We explicitly caution against establishing any cutoff values whatsoever (e.g., the lower bound of the confidence interval needs to be below 40%). This estimator is dependent on far too many things to suggest that a single cutoff can be established without being completely arbitrary. Instead, this estimator is better used to highlight pertinent results in full transparency of the evidence a researcher has to support the reliability of the data.

Applying Reliability Measures to Data from the Adaptive Chemistry Assessment Survey for Teachers

To demonstrate the interpretation of traditional and alternative methods for determining reliability, we will compute these statistics using responses from the Adaptive Chemistry Assessment Survey for Teachers (ACAST). This instrument examines the data-driven inquiry (DDI) practices of high school chemistry teachers and was designed based on data from a qualitative study conducted previously.45 DDI is the process by which teachers set teaching and learning goals, make conclusions about teaching and learning using specified evidence, and take pedagogical action as a result.46−50 For a more detailed review of DDI, see our literature review.51 The ACAST has items that produce nominal, ordinal, and ratio level data. The ACAST also includes two scenarios that are adaptive to teachers’ responses. These scenarios model designing, implementing, and evaluating assessments in gases and stoichiometry in the context of hypothetical classrooms. Although we present sample items which generated the data of interest, the adaptive nature of the survey makes its complete presentation difficult to do in paper form. Thus, a Qualtrics link for the ACAST as it was administered may be found online.52 Our research questions were the following: (1) What is the measurement error of data produced by testing and retesting items on the ACAST? (2) How do estimations of test−retest reliability differ using traditional and novel statistical methods?
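
Because the ACAST produces nominal as well as ratio data, the nominal-level check recommended earlier (comparing the proportions of consistent and changed responses against a random model with a chi-square goodness of fit test) might be sketched as follows. This is a rough Python illustration under stated assumptions, not the procedure of Brandriet and Bretz as published: the random model is assumed to be uniform guessing among `n_options` choices, so that a response matches by chance with probability 1/`n_options`; the effect size shown is Cohen's w, chosen here only as one common option; and the function name and data are hypothetical.

```python
from math import erfc, sqrt


def consistency_chi_square(test, retest, n_options):
    """Chi-square goodness of fit of same/changed counts against chance.

    Assumption: under the random model, a retest response matches the test
    response with probability 1 / n_options.
    Returns (proportion consistent, chi-square, p value, Cohen's w).
    """
    n = len(test)
    same = sum(1 for a, b in zip(test, retest) if a == b)
    changed = n - same

    expected_same = n / n_options
    expected_changed = n - expected_same

    chi2 = ((same - expected_same) ** 2 / expected_same
            + (changed - expected_changed) ** 2 / expected_changed)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    # Cohen's w for a goodness of fit test (an assumed choice of effect size).
    effect_size = sqrt(chi2 / n)
    return same / n, chi2, p_value, effect_size


# Hypothetical item with four response options answered by 40 respondents.
test_answers = ["a"] * 40
retest_answers = ["a"] * 30 + ["b"] * 10
print(consistency_chi_square(test_answers, retest_answers, n_options=4))
```

A small p value here would only indicate that respondents were more (or less) consistent than the assumed chance model predicts; it says nothing about whether that chance model is the right baseline for a given item.
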



METHODS

High school chemistry teachers from around the country were invited to complete the ACAST. Invitations were sent out on (a) the National Science Teachers Association (NSTA) listserv for chemistry teachers, (b) each NSTA state chapter listserv (for those leaders who agreed to send the invitation), (c) the American Association of Chemistry Teachers Web page, and (d) invitation cards at the 2014 Biennial Conference for Chemical Education. An incentive drawing for a $50 Amazon.com gift card was completed for every 50 participants that completed the survey. All data collection was compliant with local IRB policies. A total of 340 teachers completed the survey either completely or with minimal missing data (subjected to imputation by mean or median53). Of these, 62 teachers had complete data for both test and retest administrations of the ACAST (after any imputations; time between administrations was 10−14 days). Estimates of measurement error (zeta-range) were elicited from validation interviews with 14 teachers (not from the test−retest sample). Information about data imputation methods and demographics can be found in the Supporting Information. All analyses and functions were built using R version 3.1.2 (the function for computing the zeta-range estimator can be found in the Supporting Information).54

Figure 3. Five simulated data sets (from left to right, top to bottom) show: strong correlation, moderate correlation, weak/no correlation, strong correlation with positive systematic error, and strong correlation with differential systematic error. Solid red line is y = x, and dotted red lines indicate boundary of zeta-range (ζ = 15).
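
To make concrete why a strong correlation can coexist with large measurement error, the following Python sketch simulates two test−retest data sets in the spirit of the "strong correlation" and "strong correlation with positive systematic error" panels and compares Pearson's r with the proportion of points outside the zeta-range. It is only an illustration: the sample size, score range, noise level, and size of the shift are assumptions chosen for the sketch and are not the parameters behind the paper's simulated data; only ζ = 15 is taken from the figure caption.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def p_zeta(test, retest, zeta):
    """Proportion of pairs lying outside y = x +/- zeta."""
    return sum(1 for x, y in zip(test, retest) if abs(y - x) > zeta) / len(test)


def simulate(n, noise_sd, shift, seed):
    """Hypothetical generator: retest = test + constant shift + random noise."""
    rng = random.Random(seed)
    test = [rng.uniform(0, 100) for _ in range(n)]
    retest = [x + shift + rng.gauss(0, noise_sd) for x in test]
    return test, retest


ZETA = 15  # boundary of the zeta-range, as in the Figure 3 caption

# No systematic error versus a constant positive shift (assumed values).
a_test, a_retest = simulate(n=200, noise_sd=5, shift=0, seed=1)
d_test, d_retest = simulate(n=200, noise_sd=5, shift=25, seed=1)

for label, (t, r) in {"no shift": (a_test, a_retest),
                      "shifted": (d_test, d_retest)}.items():
    print(label, round(pearson_r(t, r), 3), round(p_zeta(t, r, ZETA), 3))
```

Because adding a constant to every retest score leaves the correlation unchanged, r is essentially identical for the two sets, while the proportion outside the zeta-range differs sharply.
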

RESULTS AND DISCUSSION

Since we propose the use of our novel zeta-range estimator, we have compared our results with traditional correlation measures for five hypothetical data sets as a proof-of-concept (Zeta-Range Estimator Results section). After these results, we present evidence for the test−retest reliability of the items on the ACAST using both traditional and novel methods (Test−Retest Reliability of the ACAST section).

Table 1. Pearson Correlation and Zeta-Range Confidence Intervals for Each Data Set

Traditional Correlation

Pζ Confidence Interval

Data Set

r

p(r)

Lower Bound (%)

Upper Bound (%)

A B C D E

0.95 0.77 0.09 0.91 0.92