Chapter 10
Designing Tests and Surveys for Chemistry Education Research
Kathryn Scantlebury¹ and William J. Boone²
¹Department of Chemistry and Biochemistry, University of Delaware, Newark, DE 19716
²Ohio's Evaluation and Assessment Center, Miami University, Oxford, OH 45056
This chapter will provide readers with an overview of how to design and evaluate surveys and tests using 1) pencil and paper techniques and 2) the Rasch psychometric model. The goal is to help chemistry education researchers develop robust tests and surveys that optimize data collection. For paper and pencil test development, we discuss issues such as the importance of item wording, the use of figures, how best to select a rating scale, and the impact of text layout. Then, we present an overview of how researchers can use Rasch analysis to 1) guide the initial development of instruments, 2) evaluate the quality of a data set, 3) communicate research findings, and 4) carry out longitudinal studies. The careful development of measurement instruments, in addition to the use of Rasch techniques, can help optimize what is learned in many chemistry education studies.
Introduction

There are many issues of common concern when developing tests and surveys. A researcher must decide whether to pool items into an overall "measure" or to examine items individually, and must choose the type of rating scale for an attitudinal survey. In this chapter, we present some critical issues that researchers should consider when developing surveys and/or tests for data collection, and we describe how researchers can use a psychometric model (the Rasch model) to 1) improve the development of tests and surveys, 2) evaluate survey and test data, and 3) communicate research findings. The first part of the chapter is "paper and pencil" in nature, while the second part should be viewed as a critical step in the development and analysis of test and survey data. Researchers, however, must respect the steps of test development for the Rasch model to be useful; the Rasch model cannot compensate for poorly designed tests and surveys. We have organized both the "paper and pencil" and Rasch portions to provide a user-friendly overview of critical issues in designing tests and surveys. Some design issues affect only tests, while others affect only surveys. We have attempted to present a range of issues in this chapter that will aid researchers designing both tests and surveys; however, to minimize chapter length we do not present an exhaustive discussion of all possible issues.
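For readers unfamiliar with it, the dichotomous Rasch model expresses the probability of a correct response as a function of a person's ability (theta) and an item's difficulty (b), both on the same logit scale. The following minimal Python sketch illustrates this standard formula; the function name and example values are ours, not the chapter's.

import math

def rasch_probability(theta, b):
    # Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# A person whose ability matches an item's difficulty has a 50% chance:
print(rasch_probability(1.0, 1.0))   # 0.5
# The same person is more likely to answer an easier (lower b) item correctly:
print(rasch_probability(1.0, 0.0))   # ~0.73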
Writing Surveys and Tests

Pooling Items

When researchers design a survey or test, it is important to consider whether there is a goal of "pooling" a set of items to produce an overall measure. To understand this issue, consider what is gained when a test of numerous items is presented to a student. If a student completes a 50-item test, then there are 50 items with which to determine her/his performance. Teachers commonly recognize this as the calculation of the student's total test score, which is a more precise assessment of a student's performance than the student's performance on single test items. It is important to mention that "pooling" items in an effort to increase measurement precision should not be carried out without considering important reliability and validity issues. For example, a chemistry test written to assess knowledge of biochemical principles would be inappropriate if all the items were focused on physical chemistry topics. Similarly, giving a chemistry test designed for graduate students to first-year undergraduates would be unfair. Thus, test developers need to carefully consider which items on a 50-item test would "pool" together to measure different aspects of a student's chemical knowledge.
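To make the idea of pooling concrete, the following hedged Python sketch, with invented data, sums item scores into a total "measure" for each student and computes Cronbach's alpha as one common internal-consistency check on a pooled item set. The alpha statistic is our illustrative choice; the chapter itself does not prescribe a particular reliability index.

# Each row holds one student's scored responses (1 = correct, 0 = incorrect).
scores = [
    [1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
]

totals = [sum(row) for row in scores]   # the pooled "measure" per student
print(totals)                           # [3, 4, 1, 5]

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

k = len(scores[0])                      # number of items
item_vars = [variance([row[i] for row in scores]) for i in range(k)]
alpha = (k / (k - 1)) * (1 - sum(item_vars) / variance(totals))
print(round(alpha, 2))                  # 0.71 for this toy data set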
If the goal is to "pool" items, there are specific techniques for designing a test that measures, for example, the respondents' breadth and range of knowledge. To achieve this goal, a researcher would develop test items with a mix of difficulties (e.g., easy, medium, and hard); this helps differentiate the performance of the respondents. We have found that when authoring items for a test, it is useful to predict the difficulty of the test items. Figure 1 presents a fictitious five-item test and the location of each item's difficulty as predicted by the item author.
[Figure 1 (schematic): five items, Q1–Q5, placed along a continuum running from easy items to hard items at the difficulty locations predicted by the item author.]

Figure 1. Prediction of item difficulty by a test item author.
First, authoring items and predicting item difficulty forces test item authors to think in more detail about test development. Second, by predicting item difficulty, item authors can learn whether they have authored too many, or too few, items at one difficulty level. Third, once data are collected, researchers can compare the predicted item difficulty placement to the actual placement. This serves as a technique to improve test item authors' understanding of the issues they are investigating. For instance, in Figure 1, if item Q1 turns out to be more difficult than predicted, this may suggest a problem in the item's structure. Test item authors can use the mismatch between predicted and actual item difficulty as an indicator to review and revise items; a simple way to quantify this comparison is sketched below. Test item authors can also use these steps when authoring attitudinal survey items. For instance, if respondents can indicate that they "agree" or "disagree" with a statement, one should include:
• items that are easy for respondents to agree with, and
• items that are difficult to agree with.
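The comparison of predicted and actual difficulty can be quantified with a simple rank correlation. The sketch below uses invented data and our own variable names; it takes the classical proportion-correct as the observed difficulty, and scipy is simply one convenient choice for the correlation. The same logic applies to attitudinal items if the proportion of respondents agreeing replaces the proportion answering correctly.

from scipy.stats import spearmanr

# Author's predicted ordering (1 = predicted easiest) and the observed
# proportion of students answering each item correctly (invented numbers).
predicted_rank = {"Q1": 1, "Q2": 4, "Q3": 3, "Q4": 2, "Q5": 5}
proportion_correct = {"Q1": 0.55, "Q2": 0.40, "Q3": 0.62, "Q4": 0.81, "Q5": 0.22}

items = sorted(predicted_rank)
# A higher proportion correct means an easier item, so negate it to obtain
# an "easiest first" ordering comparable to the predicted ranks.
rho, p = spearmanr([predicted_rank[q] for q in items],
                   [-proportion_correct[q] for q in items])
print(f"Spearman rho = {rho:.2f}")   # 0.70 for this toy data

A large positive rho indicates the author's predictions held up; items that disagree sharply with their predicted placement (here Q1, predicted easiest but only third easiest observed) are candidates for review and revision.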
When researchers develop a wide range of attitudinal items, they maximize the differentiation between the attitudes of respondents, just as tests containing a wide range of item difficulties differentiate among examinees. Figure 2 presents a schematic displaying a possible distribution of the predicted ease with which respondents agree when the researcher uses five survey items.

A researcher may have a large and varied number of project goals, and as such, a survey or test need not always contain pooled items. For example, in some circumstances researchers might only administer a brief survey or test to respondents, and in such a case it would be too onerous to present a number of
test or survey items for each goal. The types of issues presented in Figures 1 and 2 are discussed in Best Test Design (1).
[Figure 2 (schematic): five survey items, S1–S5, placed along a continuum from items easy to agree with to items difficult to agree with, as predicted by the survey author.]
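To examine such a distribution empirically, one can tabulate how readily respondents endorse each item. The sketch below is a minimal illustration with invented responses; real surveys would typically use a multi-category rating scale rather than a simple agree/disagree choice.

# Invented agree/disagree responses for five survey items.
responses = {
    "S1": ["agree", "agree", "disagree", "agree"],
    "S2": ["agree", "agree", "agree", "agree"],
    "S3": ["disagree", "agree", "agree", "disagree"],
    "S4": ["disagree", "disagree", "disagree", "agree"],
    "S5": ["agree", "agree", "agree", "disagree"],
}

for item, answers in sorted(responses.items()):
    rate = answers.count("agree") / len(answers)
    print(f"{item}: {rate:.0%} agree")

Items with very high agreement rates are "easy to agree with," while items with low rates are "hard to agree with"; a survey spanning both ends differentiates respondents' attitudes, as Figure 2 suggests.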