
Valid and Reliable Assessments To Measure Scale Literacy of Students in Introductory College Chemistry Courses

Karrie Gerlach,† Jaclyn Trate,‡ Anja Blecking,‡ Peter Geissinger,‡ and Kristen Murphy*,‡

†Brookfield Academy, Brookfield, Wisconsin 53045, United States
‡Department of Chemistry and Biochemistry, University of Wisconsin−Milwaukee, Milwaukee, Wisconsin 53201, United States




ABSTRACT: An important component of a student's science literacy is scale and concepts relating to scale, including proportion and quantity. A student's scale concept has been measured using laboratory studies with one-on-one activities and interviews. However, classwide assessments to measure a student's scale concept have been limited and largely focused on a single component of the scale concept. Classwide assessments that are validated and found to be reliable for a particular group of students can be useful in measuring the effects of interventions or course experiences. Two assessments were developed to measure a student's scale concept, targeting specific components of scale as described by Gail Jones and co-workers through the Scale Concept Trajectory. One of these assessments examines students' skills related to scale, while the other examines students' scale concepts. The validation and reliability analyses, as well as correlations of these measures to course performance in chemistry for both general chemistry I and preparatory chemistry, are presented.

KEYWORDS: First-Year Undergraduate/General, Chemical Education Research, Testing/Assessment
FEATURE: Chemical Education Research
Published: June 17, 2014





INTRODUCTION

The incorporation of themes or frameworks into instruction, specifically science instruction, has been an important component of developing scientific literacy, as noted by several organizations, from the American Association for the Advancement of Science in Project 2061 and other publications1,2 to the National Research Council in its framework for K−12 science education.3 Moving forward, the Next Generation Science Standards for K−12 science education will also include themes for science instruction.4 Included in these themes is the concept of scale, through the most recent cross-cutting concept of "Scale, Proportion, and Quantity".3 To include these themes or frameworks in instruction in a meaningful manner, both the growth of a student's knowledge in the discipline and the development of the student's knowledge or conception within the theme should be measured. One-on-one interviews would be the most ideal situation, where the interviewer and interviewee can create a dynamic conversation so that any needed clarification can be provided. In addition, the personal part of the learning experience can be drawn out by having the interviewee explain things further with follow-up questions. Interviews are time-consuming, however; thus, only a small handful of students can be examined. Nevertheless, using interviews to obtain a representative sample of student knowledge can aid in developing a better assessment that can be used with much larger samples. This paper describes the development and testing of two assessment instruments, based on previous research, for measuring the scale concepts of students in introductory college chemistry courses.

LITERATURE REVIEW

What is scale? According to Lock and Molyneaux, "Scale is a slippery concept, one that is sometimes easy to define but often difficult to grasp...there is much equivocation about scale, as it is at the same time a concept, a lived experience and an analytical framework" (ref 5, p 1). More specifically, scale has been defined by Jones and Taylor as "any quantification of a property that is measured" (ref 6, p 460). They further defined what encompasses an understanding of scale: it "involves a number of concepts and processes such as quantity, distance, measurement, estimation, proportion, and perspective. Although most applications of scale involve linear distances, other variables such as temperature, time, volume, or mass are also important" (ref 6, p 460). For this work, the definition of scale mirrors that of Jones, focusing on linear distances. As part of a larger scale study, Jones and Taylor6 presented a trajectory of scale concept development that outlines the skills students possess as they progress through the novice, developing, and experienced levels (Table 1). The trajectory can be used as a guideline for developing interventions to move students forward in their scale proficiency.

Table 1. Scale Concept Trajectory6

Properly designing assessments so that they are timely and effective, yet still elicit accurate information about students' knowledge, is a challenge. Ideally, these assessments would provide a meaningful picture of students' scale understanding and identify misconceptions. In addition, because scale has been identified as an important component of science literacy, the degree to which scale literacy correlates with performance and success in general chemistry courses will also be determined. A narrower measure of students' scale literacy, focusing on their proportional reasoning, has been developed and tested with engineering students based on student interviews.7 The researchers developed and tested a pilot version of the assessment with three items examining size and scale that utilized both a multiple-choice component and a written portion. The results were used to construct a typology of undergraduate students' conceptions of size and scale, from fragmented and linear to proportional and logarithmic.

One way of determining what and how a student thinks is through individual interviews with the student, but the time required to conduct the interview and to process the collected information limits this approach; consequently, the total number of interviews is often very small compared to the population or even the sample. Efforts have been made to develop effective multiple-choice tests and to provide alternative approaches for constructing multiple-choice items, such as the method introduced by Tamir8 in 1971. This method used open responses from students to create distracters, or incorrect responses, on multiple-choice tests. With this method, the distracters are more realistic possible responses because the students' own words are used, which results in a more accurate way to test student misconceptions. Knowing students' misconceptions or common errors is valuable in writing test items. However, particularly with items that specifically test definitions or misconceptions, the variance of these responses can make it difficult to "guess" students' thought processes. In 1988, Treagust9 introduced a framework for the development of diagnostic tests that identify students' misconceptions. This framework is made up of 10 stages split into three broader areas: Steps 1−4, defining the content; Steps 5−7, obtaining information about students' misconceptions; and Steps 8−10, developing the diagnostic test. Several chemistry assessments have been developed using Treagust's method, such as the Chemistry Concepts Inventory (CCI),10 the Particulate Nature of Matter Assessment (ParNoMA),11 and CHEMX,12 to name a few.

Well-written multiple-choice tests can be graded quickly and effectively with regard to measuring what students know. However, the test items are often written to test one concept only and include distracters to elicit this particular information as well.13 While incorrect responses often reflect the most common errors made by students, this format does not provide options for determining degrees of agreement or disagreement, which would allow insight into the depth of the students' interpretation of the content. An example of providing students choices for degrees of agreement or disagreement (similar to a questionnaire) is the CHEMX12 instrument, which utilized a Likert scale. A study by Cooper14 using this same approach showed that this response method effectively captured student responses by comparing the results of the Likert-scaled assessment to other measures such as grades.

From a simple perspective, a "valid" assessment is one that measures what it is supposed to measure15 for the test population. Validity is determined by multiple methods, including experts constructing and analyzing the test items, the use of student responses and item statistics to edit and select test items, and comparison of the assessment measurement to other valid measures.16 The reliability of an assessment instrument can be defined as the consistency of its measurement each time it is used under the same set of conditions with the same group of subjects. For example, a reliable chemistry assessment instrument would (in theory) produce the same results if given to a group of students with the same abilities. For all practical purposes, because student populations are so diverse in their range of abilities, reliability testing for assessment tools is accomplished using estimation. Considering these basic requirements, combining both types of assessments, i.e., a multiple-choice test and a concept inventory with Likert-type responses, allows for a more complete view of students' understanding of scale and the misconceptions that they may hold. The following research questions were investigated in this study.

1. How can scale literacy be measured for classwide assessment?
2. What is the scale literacy of introductory college chemistry students?
3. How does scale literacy predict performance in general chemistry?

INSTRUMENT DEVELOPMENT

Two assessments were developed to analyze different areas of students' knowledge and understanding of scale: the Scale Literacy Skills Test (SLST) and the Scale Concept Inventory (SCI). The tests were designed to complement each other and to be used together in what is called the student's Scale Literacy Score (SLS). Both assessments were statistically examined for validity and reliability, both internally (pre/post) and externally (against ACS final exam scores for general chemistry I, or the final exam score and final course percent for preparatory chemistry).
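Later in the paper the authors describe weighting the two assessments equally to form the SLS. The following is a minimal sketch of such an equally weighted combination using hypothetical percent scores; the exact scaling the authors used is an assumption here, not taken from the article.

```python
def scale_literacy_score(slst_percent: float, sci_percent: float) -> float:
    """Combine the two instruments with equal weight, as described for the SLS.
    Inputs are assumed to be percent scores on the SLST and SCI."""
    return 0.5 * slst_percent + 0.5 * sci_percent

# Hypothetical student: 60% on the SLST and 70% on the SCI
print(scale_literacy_score(60.0, 70.0))  # 65.0
```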


The assessment development and subsequent testing were performed at a large urban public doctoral institution in the Midwest. All participants were enrolled in preparatory chemistry or the first semester of a two-semester general chemistry sequence. (For specifics related to study participants and testing schedule, see Tables 2 and 3.)

Table 2. SLST Study Participants, Testing Method, and Testing Schedule

Semester Included     | Pre/Post | Preparatory Chemistry, n | General Chemistry I, n
Fall 2009−Spring 2013 | Pre      | 931(a), 312(b)           | 1227(a), 523(b)
Fall 2010−Spring 2013 | Post     | 578(b,c)                 | 956(b)

(a) Instrument was administered via paper and pencil with Scantron or similar forms; count includes all students who attempted the SLST. (b) Instrument was administered via course management system; students who did not answer all items via the course management system were excluded from this count. (c) Instrument was administered starting in Spring 2011.

Table 3. SCI Study Participants, Testing Method, and Testing Schedule

Semester Included     | Pre/Post | Preparatory Chemistry, n | General Chemistry I, n
Fall 2009−Spring 2013 | Pre      | 682(a), 130(b)           | 1227(a), 166(b)
Fall 2010−Spring 2013 | Post     | 337(a,c), 114(b,c)       | 693(a), 287(b)

(a) Instrument was administered via paper and pencil with Scantron or similar forms; count includes all students who attempted the SCI and answered the verification item correctly. (b) Instrument was administered electronically via Qualtrics. (c) Instrument was administered starting in Spring 2011.

Preparatory chemistry is a four-credit course with college algebra (or the mathematics placement test) as the prerequisite; general chemistry I is a five-credit course with college algebra (or the mathematics placement test) and preparatory chemistry (or the chemistry placement test) as the prerequisites. In addition, a group of experienced graduate students in chemistry made up the expert group (N = 14−21). The research protocol was approved (IRB #09.047), and all data included are from students who consented via this protocol.

Development and Validation of the SLST

Two trial tests of the SLST were constructed based on the Jones trajectory,6 the results of interviews,17 and published tests.10 The interviews that aided in the development of this assessment are described extensively elsewhere.17 These items were written by the authors and vetted for both scientific content and clarity with experts in chemistry, with the exception of one item that was taken from the CCI (for comparison).10 In total, 120 unique items were written, including a number of item pairs where one item of each pairing (two versions of the same item) was placed on each test (see Box 1). The two 60-item tests were then trial tested with two samples (N = 60; N = 56) from a single lecture section of general chemistry I during the summer. Students were given 90 min to complete the test. Item statistics16 included both difficulty (the fraction of students who answered correctly) and discrimination values (the difference between the fractions of the high-performing and low-performing quartiles who answered correctly). Questions were selected and refined to maintain content coverage, and matched items were refined and selected based on their difficulty and discrimination. In addition, the incorrect responses were refined further to omit distracters that were not selected at a high enough rate (selected at a rate of 5% or less). On the basis of the trial testing results, the items were further refined to a final test of 45 items.

Box 1. Example of Paired Items from SLST Trial Tests

Initially, the SLST was given to the students during the first week of class on paper, using Scantron forms to report their answers. The SLST was administered during discussion periods with a 50 min time limit; students were awarded their weekly discussion points based on completion. The use of the SLST in a course management system for pretesting began in Fall 2012. Starting in Fall 2010, the SLST was also administered at the end of the semester online using the course management system. Descriptive statistics, as well as difficulty and discrimination statistics, for each of the 45 items of the final version of the test were examined and evaluated (shown in Table 4). As shown, the groups performed similarly.

Table 4. Combined Results of SLST Item Analysis and Overall Test Performance

Statistic                                        | Preparatory Chemistry | General Chemistry I
n                                                | 1243                  | 1750
Difficulty: high                                 | 0.862                 | 0.902
Difficulty: low                                  | 0.080                 | 0.083
Difficulty: mean                                 | 0.435                 | 0.536
Discrimination: high                             | 0.607                 | 0.563
Discrimination: low                              | −0.029                | 0.030
Discrimination: mean                             | 0.304                 | 0.348
Overall (out of 45 possible): high               | 39                    | 42
Overall (out of 45 possible): low                | 8                     | 6
Overall (out of 45 possible): mean               | 19.6                  | 24.1
Overall (out of 45 possible): median             | 19                    | 24
Overall (out of 45 possible): mode               | 19                    | 23
Overall (out of 45 possible): standard deviation | 5.4                   | 6.1
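To make the item statistics above concrete, the following is a minimal sketch, under an assumed 0/1 scoring layout (not code from the study), of how per-item difficulty and quartile-based discrimination could be computed from a scored response matrix.

```python
import numpy as np

def item_statistics(scores: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """scores: 2D array (students x items) of 0/1 item scores.
    Returns per-item difficulty (fraction correct) and discrimination
    (fraction correct in the top quartile minus fraction correct in the
    bottom quartile, with quartiles defined by total test score)."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)          # students sorted by total score
    quartile = len(totals) // 4
    low, high = scores[order[:quartile]], scores[order[-quartile:]]
    difficulty = scores.mean(axis=0)
    discrimination = high.mean(axis=0) - low.mean(axis=0)
    return difficulty, discrimination

# Example with simulated responses: 200 students x 45 items
rng = np.random.default_rng(0)
sim = (rng.random((200, 45)) < 0.5).astype(int)
diff, disc = item_statistics(sim)
```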

The typical difficulty range for valid items on a high-stakes assessment is 0.3−0.7, with discrimination above 0.25.16 The difficulty and discrimination values for some items were well outside what is normally accepted for valid items (see the high and low ranges of the difficulty and discrimination values). However, these items were kept specifically because they tested misconceptions; difficulty and discrimination values on tests of misconceptions can have a broader range because such tests are used as diagnostic tools.10 A Kuder−Richardson (KR−21)16 analysis, which estimates the internal consistency or reliability of the test, was performed. The equation for KR−21 is given below, where X̄ is the test mean, n is the number of items, and s is the standard deviation. A result of 0.6 or higher is considered reliable.


KR−21 = 1 − X̄(n − X̄)/(ns²)
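As a check on the arithmetic, this short Python sketch (not from the original article; names are illustrative) applies the KR−21 expression above to the summary statistics reported in Table 4.

```python
def kr21(mean: float, n_items: int, sd: float) -> float:
    """Kuder-Richardson 21 reliability estimate from summary statistics:
    KR-21 = 1 - mean*(n_items - mean) / (n_items * sd**2)."""
    return 1 - mean * (n_items - mean) / (n_items * sd ** 2)

# Summary statistics for the 45-item SLST (Table 4)
print(round(kr21(mean=19.6, n_items=45, sd=5.4), 2))  # preparatory chemistry: 0.62
print(round(kr21(mean=24.1, n_items=45, sd=6.1), 2))  # general chemistry I: 0.70
```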


The KR−21 for the preparatory chemistry course was 0.62, and for the general chemistry I course it was 0.70. Item statistics for the final version of the test are included in the Supporting Information. The test is available from the author upon request. Because the trajectory of scale development6 (Table 1) was a critical part of the development of the SLST, each question was categorized with respect to the trajectory and examined for performance. For example, the question shown in Box 2 examines how well a student understands numbers, scientific notation, and negative numbers; it is categorized as number sense in the trajectory.

Box 2. Example of One SLST Test Item on Number Sense
Which value is greater than zero but less than one?
(A) −5 × 10⁵  (B) −5 × 10⁻⁵  (C) 5 × 10⁻⁵  (D) 5 × 10⁵

Most of the items in the SLST were written to fit into the majority of the components of the trajectory of scale. Additionally, when the final 45 items were selected for the assessment, the coverage of these components, by number of items and the difficulty of those items, was also considered. The remaining nine items were written based on where students struggled with the connection between macroscopic and particulate representations of matter, as well as the definitions of macroscopic and particle properties. These items were categorized into two additional groups (macroscopic and particle) that are more specific to chemistry and are not included in the Scale Concept Trajectory of Jones.6 The performance of each novice group and our expert group was then examined with respect to the categories in the trajectory of scale development (Figure 1). There tends to be a progressive improvement in performance from preparatory chemistry to the expert level in each area; however, it is also apparent that there is not a progressive decrease in performance from what Jones called novice skills to expert skills. Interestingly, all groups tend to have similar areas of strength and weakness. However, when we compare the novice groups and the expert group with respect to the trajectory, the experts performed substantially better in each of the areas (Figure 2).

Figure 1. Performance by type of student on subgroups of the SLST based on the content groupings of items when written to the Scale Concept Trajectory (the number of items corresponding to each content area is noted).6

Figure 2. Performance by type of student on subgroups of the SLST based on the novice to expert assignment of items based on the Scale Concept Trajectory (the components within each designation are noted below the bar graph).6
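To illustrate how such category-level comparisons could be tabulated for each student group, the sketch below computes mean percent correct per trajectory category; the category assignments and data are illustrative, not the published item map.

```python
import numpy as np

def category_means(scores: np.ndarray, item_category: list[str]) -> dict[str, float]:
    """scores: students x items array of 0/1 item scores for one group.
    item_category: trajectory category label for each item (e.g., 'number sense').
    Returns mean percent correct per category."""
    cats = np.array(item_category)
    return {c: 100 * scores[:, cats == c].mean() for c in np.unique(cats)}

# Illustrative use: 4 items in two categories, one small group of students
demo_scores = np.array([[1, 0, 1, 1],
                        [0, 0, 1, 0],
                        [1, 1, 1, 0]])
demo_cats = ["number sense", "number sense", "proportion", "proportion"]
print(category_means(demo_scores, demo_cats))  # {'number sense': 50.0, 'proportion': 66.7...}
```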

Development and Validation of the SCI

The initial results of the SLST and the interviews (described extensively in a separate manuscript)17 revealed that another assessment was needed that directly addressed students' misconceptions related to scale. Thus, the SCI was designed to incorporate previously published misconceptions18 (and references therein) that relate to scale and are relevant to a first-year chemistry student, as well as new statements based on interviews.17 Although these misconceptions were extracted from the literature18 and from interviews17 conducted by the authors, experts were used to validate the content of the statements: chemistry faculty vetted each statement to edit or eliminate any item on the basis of clarity or correctness. The final version of the SCI contains 40 statements that are scored using a 5-point Likert scale, a five-option continuum from strongly agree (5) to strongly disagree (1). Twenty-three of the statements were written to elicit a positive (agree) response, while 13 were written to elicit a negative (disagree) response. This technique was used to ensure that the students read each statement. In addition, a verification item was used to identify those students who did not correctly utilize the SCI (for example, not reading the statements, not understanding the rating scale, or simply entering random responses) and who should therefore not be included in the final analysis.
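One plausible way to score such a mixed-direction Likert instrument is to reverse-score the negatively worded statements so that higher totals consistently reflect the expert-like response. The sketch below illustrates this with hypothetical item keys; it is not the authors' actual scoring scheme.

```python
def score_sci(responses: list[int], positively_worded: list[bool]) -> float:
    """responses: one student's 1-5 ratings, one per scored statement.
    positively_worded: True if agreement (5) is the expert-like response,
    False for statements written to elicit disagreement (reverse-scored)."""
    scored = [r if pos else 6 - r for r, pos in zip(responses, positively_worded)]
    return 100 * sum(scored) / (5 * len(scored))  # percent of maximum possible

# Toy usage: three positively worded statements and one reverse-scored statement
print(score_sci([5, 4, 4, 2], [True, True, True, False]))  # 85.0
```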



Regardless of a student's response, the verification item was not included in the student's overall score. Because a student's sense of scale is uniquely developed, three subjective items were also included on the assessment for qualitative analysis; these were likewise not graded, along with the verification item. The SCI was initially implemented during the first week of class on paper, using Scantron forms to report student answers. The SCI was handed out in lecture and was collected during the next lecture. Extra credit points were given for returning a completed survey. Starting in 2010, the SCI was additionally given on paper during the last week of class. It was transferred to a fully online form in Fall 2012 (starting with post-testing in Fall 2012 and used for pre- and post-testing in Spring 2013). Students received extra credit for completing both. The sample sizes were lower for the SCI than for the SLST because students were removed if they did not answer the verification item correctly or if they did not submit the inventory. Additionally, given the nature of how each assessment was given (weekly discussion points versus extra credit), the average return rate on the SCI was lower than that for the SLST (65% versus 77%). Item statistics of the final version of the test are included in the Supporting Information. The inventory is available from the author upon request. Unlike the SLST, analysis of the assessment using difficulty and discrimination was not appropriate; therefore, other methods were used to determine the validity of the instrument. The descriptive statistics are provided in Table 5. An ANOVA (Analysis of Variance) test was performed, for validity, using the final grades of the Spring 2010 general chemistry I students together with the post SCI score as well as the change in SCI score from pre to post (Table 6). The results show a significant difference overall between at least two groups of students. The study was then expanded to include all semesters of testing.
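A minimal sketch of the kind of one-way ANOVA described here, using SciPy and hypothetical per-grade lists of post-SCI scores (the actual data are summarized in Table 6), might look like this:

```python
from scipy import stats

# Hypothetical post-SCI percent scores grouped by final course grade (A-F)
groups = {
    "A": [82, 78, 75, 80], "B": [74, 70, 72, 69], "C": [68, 71, 66, 70],
    "D": [65, 69, 64, 67], "F": [63, 66, 62, 65],
}
f_stat, p_value = stats.f_oneway(*groups.values())
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
```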

Pearson product moment correlation coefficients were used to examine the relationship between the post SCI score and the final percent in the course. The correlations were positive, small to medium in magnitude, and significant: r(788) = 0.361, p < 0.001 for general chemistry I and r(447) = 0.172, p < 0.001 for preparatory chemistry. The trend of higher performing students having higher SCI scores is consistent with expectations for this concept inventory, thus adding to the validity of the SCI.
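For reference, a Pearson correlation of this kind can be computed as below; the arrays are placeholders for the paired post-SCI scores and final course percentages, not study data.

```python
from scipy import stats

# Placeholder paired data: post-SCI score and final course percent for each student
post_sci = [64, 70, 58, 75, 66, 72]
final_percent = [71, 80, 63, 85, 74, 78]
r, p = stats.pearsonr(post_sci, final_percent)
print(f"r = {r:.3f}, p = {p:.3g}")
```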



Scale Literacy Score (SLS)

Common predictive measures of success in general chemistry include placement exams in math and/or chemistry,19−22 measures such as logical thinking or reasoning,23 conceptual math knowledge,24 or high school content knowledge.25 Students taking general chemistry at the institution where the studies were carried out are required to take placement exams in chemistry and/or math. Therefore, students' performance on the placement exams should correlate with their performance in the course. Each scale assessment, as well as the combined score, was also investigated as a potential predictor of success in chemistry, using final exams26,27 as a measure of success and comparing to other common predictors (ACT scores and math and chemistry placement scores). Both the SLST and the SCI measure aspects of a student's conception of scale and were intentionally created to complement each other to provide a complete picture of the student's scale comprehension. To capture the results of both instruments, a combined score called the Scale Literacy Score (SLS) was created by weighting each of the assessments equally; it was calculated for each administration (pre/post) and compared with standardized measures for the students, namely ACT scores and subscores and the mathematics and chemistry placement exams.19 Because the prerequisite for general chemistry I is both mathematics and chemistry, all general chemistry students in the eight semesters from Fall 2009 through Spring 2013 were evaluated using these predictors for success in the course. A Pearson product moment correlation coefficient was used to compare these assessments to the scores on two ACS final exams (the 2005 First Term General Chemistry Paired Questions exam and the 2008 General Chemistry Conceptual First Term Subset exam).26,27 A result of 0.5 and above is considered a strong correlation, while 0.3−0.5 is considered a moderate correlation. The results of the analysis are found in Table 7, and all values are significant at the p = 0.01 level. The scale literacy measure was the best predictor of performance on the conceptual final exam27 and was equivalent to the combined placement test for the other final exam.26 Because the prerequisite for preparatory chemistry is only mathematics, the analysis of data collected in this course could not include a chemistry placement test or a combined placement test score, as only the mathematics placement test was given to these students. All preparatory chemistry students in the five semesters from Spring 2011 through Spring 2013 were evaluated using these predictors for success in the course. The results are shown in Table 8. Unlike the results found for general chemistry I, the results for preparatory chemistry show a better correlation between the final exam and the ACT composite or mathematics score, and between the final percent in the course and the mathematics placement test score. This

Table 5. Descriptive Statistics for the Combined Pre-SCI

Statistic          | Preparatory Chemistry, % (n = 812) | General Chemistry I, % (n = 1393)
High               | 89                                 | 91
Low                | 39                                 | 39
Mean               | 64                                 | 67
Median             | 64                                 | 67
Mode               | 62                                 | 63
Standard deviation | 5                                  | 6

Table 6. Analysis of SCI(a)

Final Grade | Post SCI, % | Change (Post−Pre), % | n
A           | 76          |                      |
B           | 71          |                      |
C           | 69          |                      |
D           | 67          |                      |
F(b)        | 65          |                      |
Experts     | 78          |                      |
F           | 8.392       |                      |
p           |             |                      |