Comparing Two Tests of Formal Reasoning in a College Chemistry

Research: Science and Education edited by

Thomas A. Holme Iowa State University of Science and Technology Ames, IA 50011-3111

Comparing Two Tests of Formal Reasoning in a College Chemistry Context

Bo Jiang, Xiaoying Xu, Alicia Garcia, and Jennifer E. Lewis*
Department of Chemistry, University of South Florida, Tampa, Florida 33620
*[email protected]

Formal reasoning ability, namely, the ability to reason at the abstract level beyond the bounds of specific contexts, has been shown to be essential for student achievement in science and chemistry (1-7). Formal reasoners have greater comprehension and generalization skills (8, 9). According to Piaget's cognitive development theory, formal operations include theoretical reasoning, combinatorial reasoning, functionality and proportional reasoning, control of variables, and probabilistic reasoning (10-13). Piagetian theory expects most students in high school to exhibit these reasoning patterns. However, research studies have shown that as many as 50% of students entering college do not fully possess these reasoning abilities (14, 15). Students who cannot reason formally will have difficulty understanding functional relationships and topics such as concentration/dilution, enthalpy, and entropy. While others have posited that “mathematics ability” must be part of any instrument intending to assess potential general chemistry performance (16), formal reasoning ability is a more fundamental measure because it emerges directly from Piagetian theory. Mathematics ability is a slippery construct, and mathematics measures typically probe mathematics achievement rather than mathematics ability per se (16-19). Therefore, student performance on these measures is likely related to prior instruction in mathematics, without attachment to any particular theory of learning. In fact, the degree to which the questions on mathematics measures provoke algorithmic responses counts against them as indicators of conceptual understanding. For instruments intending to measure formal reasoning ability, which is not usually explicitly taught in school, questions are designed to be independent of formal instruction, that is, not to provoke an algorithmic response.
Why focus on what seems to be a very subtle distinction between theory-based measures of formal reasoning ability and mathematics achievement measures? From both research and teaching perspectives, this distinction makes a difference. For research purposes, the desire to advance the current state of knowledge leads investigators to prefer instruments with theoretical underpinnings rather than simply an empirical basis (20). For example, researching interactions between instructional methods and cognitive development within the context of chemistry is more promising for the creation of new knowledge than is an empirical investigation based on chemistry achievement alone. In terms of teaching, an instrument with a solid basis in a particular theory of learning has the advantage of implying a well-defined pedagogical approach for remedial efforts. No matter how good the empirical fit is, if an instrument does not simultaneously suggest a pedagogical approach, instructors do not have the information they need to help students. No one would contest the claim that empirically based instruments can improve the

Journal of Chemical Education


passing rate for a particular course by keeping students out of the course, but at what cost to those students? Studies hint that many of our remedial efforts for general chemistry have tended to be unsuccessful, either demonstrating no advantage or featuring high attrition rates themselves (21, 22). Much early work in chemical education suggested that Piagetian theory, from which the formal reasoning construct is drawn, was relevant for college chemistry learning, and this linkage continues today (23-28). Chemistry is abstract by nature and requires the most advanced and sophisticated formal thinking. A lack of formal thinking ability poses a fundamental barrier to learning abstract chemical concepts. Previous research revealed that students' achievement can be hindered by their lack of formal thinking abilities. Informed by Piagetian theory, chemical educators are encouraged to take account of students' individual thinking levels and foster meaningful learning.

Two Tests of Logical Thinking

For those who want to use this theoretical perspective, the question of which formal reasoning instrument to use becomes important. Two instruments, TOLT and GALT, are similar and have been used in current science education research to measure students' formal reasoning ability. The Test of Logical Thinking (TOLT) is a 10-item, 18-question, paper-and-pencil exam originally developed by Tobin and Capie (29). TOLT contains two items from each of the following: proportional reasoning, probabilistic reasoning, controlling variables, correlational reasoning, and combinatorial reasoning. For all but the combinatorial items, TOLT uses a two-tiered question structure that requires students to choose a justification for their selected answers. The Group Assessment of Logical Thinking (GALT) is a 12-item, 22-question, paper-and-pencil exam developed by Roadrangka et al. (30).
GALT is very similar to TOLT, as both of them measure the five types of formal reasoning and have the same two-tiered question structure. The major difference is that GALT has two additional items (see the online supporting information) that measure students' ability in concrete thinking, rooted in principles of conservation of mass and volume, which are not tested in TOLT. GALT and TOLT have each been in common use by educators and researchers over the past two decades. For example, Noh and Scharmann found that GALT score correlated significantly with students' problem-solving ability (5), Bunce and Hutchinson found that GALT could be used to identify students at risk of failure in college chemistry (31), and Poole demonstrated significant relationships between students' GALT scores and microbiology grades (32). Boujaoude and Salloum illustrated


Vol. 87 No. 12 December 2010 pubs.acs.org/jchemeduc © 2010 American Chemical Society and Division of Chemical Education, Inc. 10.1021/ed100222v Published on Web 09/30/2010


that TOLT was a significant predictor of performance on conceptual chemistry problems (8). Knight et al. reported no relationship between TOLT and “connected” or “separate” dimensions of their Knowing Styles Inventory (33). Verzoni and Swan found that TOLT scores were strongly related to conditional reasoning performance, namely, “the ability to reason deductively using inferential rules at progressively higher levels of abstraction” (34). Oliva and Cadiz showed that TOLT scores correlated significantly with mechanics conceptions (35), and that TOLT scores interacted with preconceptions to affect science conceptual change (9). The validity and reliability of TOLT and GALT scores have each been investigated separately. TOLT has even been translated into Spanish (9) and Greek (36). However, no work has been done to explicitly investigate the advantage or disadvantage of the two additional test items on principles of conservation that GALT contains over and above TOLT. In a recent study by Williamson et al., TOLT was chosen because, for the population of interest, “the shorter TOLT is a better choice” than GALT, as “few if any students would be predicted to lack conservation of matter or conservation of volume” (37), but no evidence was given to support that claim. Other researchers prefer GALT simply because they believe including the two concrete items in GALT gives the test more discriminatory power than TOLT (38). However, no evidence supports that belief. To be able to make a recommendation for college chemistry faculty interested in whether they should use TOLT or GALT, it is useful to compare TOLT with GALT in two ways. First, if we focus on the functioning of the two additional concrete items that GALT contains over and above TOLT, we can add these two concrete items into TOLT and create a new test called “TOLT + 2”. We can then compare TOLT with TOLT + 2 in terms of their reliability, discriminatory power, and potential bias.
Second, we can perform a direct comparison between TOLT and GALT as intact instruments in these same aspects. This study addresses these two research questions:

• Is there any advantage to adding the two extra concrete items on principles of conservation into TOLT in terms of reliability, discriminatory power of the test, and potential bias of test items?
• When TOLT and GALT are compared directly as intact instruments, which one is better for use in preparatory chemistry and general chemistry in terms of reliability, discriminatory power of the test, and potential bias of test items?

This study has two major parts. First, we focused on the functioning of the two concrete items and compared TOLT and TOLT + 2 in a General Chemistry I course during Fall 2005 and Spring 2006. Second, as a follow-up, we did a direct comparison between TOLT and GALT as intact instruments during Fall 2006 and Spring 2007 in a General Chemistry I course, as well as during Fall 2006, Spring 2007, and Fall 2007 in a Preparatory Chemistry course. Results from these two comparisons are presented below.

Comparison 1: TOLT versus TOLT + 2 Instruments

Data Source for Comparison 1

TOLT and TOLT + 2 tests were given at the beginning of Fall 2005 and Spring 2006 semesters at a large public research


university in the southeastern United States. Students received attendance credit for completing the test as a small portion of the course grade. Because the population of interest for this research is college students taking introductory, college-level chemistry courses, it is reasonable to assume that all students enrolled in the General Chemistry I course sections at the university of investigation form a representative sample. All sections participated, and each had between 150 and 200 students enrolled. The test was administered in such a way that in each section, each student was randomly given either TOLT or TOLT + 2, but not both, and that the number of students taking TOLT was approximately equal to the number of students taking TOLT + 2. As mentioned above, TOLT contains 10 items comprising 18 questions. Each of the first 8 items consists of two multiple-choice questions. For each item, students have to answer both questions correctly to get 1 point. The remaining 2 items are 1 question each, and they are open-ended combination or permutation problems requiring construction of lists. Students receive 1 point for a complete and correct list, with no partial credit; thus, scores on TOLT have a range of 0-10. This grading scheme is the scoring process advised by the original test developer (29) and used in the literature (8, 9, 33). In addition to items from TOLT, TOLT + 2 has 2 concrete items on principles of conservation taken from GALT. These 2 items were placed at the beginning of the instrument to mimic their placement in GALT. The format of these 2 extra items is the same as the first 8 TOLT items, each item being a pair of multiple-choice questions; hence, scores on TOLT + 2 have a range of 0-12. Students' demographic information and their scores on the verbal and quantitative sections of the Scholastic Assessment Test (SAT) were obtained via the registrar.
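The scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' actual scoring procedure; the function names and the example responses are hypothetical. It assumes each two-tier item is recorded as a pair of booleans (answer correct, justification correct) and each open-ended item as a single boolean for a complete and correct list.

```python
from typing import Sequence, Tuple


def score_two_tier_item(answer_ok: bool, reason_ok: bool) -> int:
    """Award 1 point only when both the answer and its justification are correct."""
    return int(answer_ok and reason_ok)


def score_test(two_tier: Sequence[Tuple[bool, bool]],
               open_ended: Sequence[bool]) -> int:
    """Total score: paired items plus all-or-nothing open-ended list items."""
    paired_points = sum(score_two_tier_item(a, r) for a, r in two_tier)
    list_points = sum(int(ok) for ok in open_ended)  # no partial credit
    return paired_points + list_points


# Hypothetical TOLT response record: 8 paired items, then 2 open-ended items.
paired_responses = [(True, True), (True, False), (False, True), (True, True),
                    (True, True), (False, False), (True, True), (True, True)]
list_responses = [True, False]
total = score_test(paired_responses, list_responses)
print(total)
```

Under the same assumptions, a TOLT + 2 record would simply carry 10 paired items instead of 8, which gives the 0-12 range mentioned in the text.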
Using a locally developed questionnaire, information on the number of semesters of high school chemistry and the highest level of mathematics each student completed was collected on the first class day. At the end of each semester, students' scores on a common final exam were collected for comparison purposes. The final exam was the First Term General Chemistry Special Examination from the Examinations Institute of the American Chemical Society (ACS) Division of Chemical Education (39). Students' TOLT and TOLT + 2 scores as well as demographics and SAT scores were analyzed using SPSS software.

Participants

At the university of investigation, General Chemistry I is the first semester of a two-semester course that offers an introduction to the science of chemistry. The course assumes background knowledge from high school chemistry; students who have not taken high school chemistry or its equivalent are advised not to enroll in the course. At the completion of the course, students are expected to have a basic knowledge of general chemistry and be able to understand and apply the particulate nature of matter and the first law of thermodynamics. Students were from more than 20 majors, including biomedical science, education, nonscience, and other majors. Table 1 presents demographic information about the students. The categories are from the registrar's official record. This diverse sample of students is typical of the student population taking general chemistry at this university. Table 2 compares students who took TOLT to those who took TOLT + 2 with respect to academic background, SAT scores, and ACS exam scores. Between the groups of TOLT and


Table 1. Distribution of Demographic Variables^a for Test Group Students Each Semester

Number of Students (%)

Demographic Variables               Fall 2005, N = 1274   Spring 2006, N = 717   Overall, N = 1991^b
Male                                593 (46.5)            274 (38.2)             867 (43.5)
Female                              677 (53.1)            439 (61.2)             1116 (56.1)
Sex Unspecified                     4 (0.3)               4 (0.5)                8 (0.4)
Asian or Pacific Islander           137 (10.8)            61 (8.5)               198 (9.9)
Black (Not of Hispanic Origin)      143 (11.2)            93 (13.0)              236 (11.9)
Hispanic                            144 (11.3)            105 (14.6)             249 (12.5)
White (Not of Hispanic Origin)      787 (61.8)            415 (57.9)             1202 (60.4)
American Indian or Native Alaskan   5 (0.4)               4 (0.6)                9 (0.5)
Ethnicity Specified as “Other”      23 (1.8)              8 (1.1)                31 (1.6)
Ethnicity Unspecified               35 (2.7)              31 (4.3)               66 (3.4)
Test Group A^c                      632 (49.6)            362 (50.5)             994 (49.9)
Test Group B                        642 (50.4)            355 (49.5)             991 (50.1)

a The sex and ethnicity categories are from the university registrar's official record.
b Overall, 58.6% of students were in their first year in college, 57.3% described their major or intended major as premed or allied health, and 74.0% reported having at least one full year of high school chemistry.
c Group A, students who took TOLT; Group B, students who took TOLT + 2.

Table 2. Comparison of Academic Background and ACS Final Exam Scores for the Two Groups

Background Variables          Group   Number of Students   Mean     St. Dev.   Effect Size^a
Semesters of HS Chemistry     A       877                  1.86     0.816      0.012
                              B       870                  1.85     0.831
Highest Level of Math^b       A       874                  2.87     0.973      0.050
                              B       871                  2.82     1.008
Year in College               A       878                  1.69     1.020      0.029
                              B       872                  1.72     1.056
Anchor Sum^c                  A       991                  6.45     2.655      0.079
                              B       997                  6.24     2.641
% Score on TOLT/TOLT + 2^d    A       991                  0.645    0.265      0.051
                              B       997                  0.658    0.240
SAT Quantitative              A       860                  558.40   77.8120    0.009
                              B       861                  557.67   81.3260
SAT Verbal                    A       860                  544.35   74.1150    0.011
                              B       861                  543.51   78.8950
ACS Exam Score                A       798                  21.76    6.875      0.009
                              B       802                  21.70    6.911

a Group A, students who took TOLT; Group B, students who took TOLT + 2. Effect sizes calculated as Cohen's d.
b 1 = “haven't taken any math courses as advanced as algebra”; 2 = “algebra and/or trigonometry”; 3 = “precalculus”; 4 = “calculus I”; 5 = “calculus II”.
c Anchor sum: sum score of the 10 items common to both tests.
d % Score on TOLT or TOLT + 2: proportion of correctly answered items.

TOLT + 2 test takers, no evidence emerges of a significant difference between populations on any of these measures. When the differences are quantified using effect size, the effect sizes are all very small according to Cohen's d standards (40). Using two one-sided t-tests for equivalence as proposed by Lewis and Lewis (41), we found the TOLT group and TOLT + 2 group were equivalent on all these measures.

Reliability and Discriminatory Power of TOLT and TOLT + 2

Reliability of any educational test can be indicated by its internal consistency, which is an indication of the degree to which the items in the test are functioning in a homogeneous fashion (42).


One indication of internal consistency and discriminatory power is item-to-total correlation, namely, the correlation between the scores on each item and the total score on the test. In general, “good” items have positive item-to-total correlations (43). Another method of estimating internal consistency, one of the most generalizable, is Cronbach's α, a coefficient developed by Cronbach (43) for a test to indicate the extent to which the test items measure the same latent construct. It is defined as

\[
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_i^{2}}{\sigma_X^{2}}\right) \qquad (1)
\]

where $k$ is the number of items on the test, $\sigma_i^{2}$ is the variance of item $i$, and $\sigma_X^{2}$ is the total test variance (43). Typically, an α value of 0.70 or higher is considered satisfactory for research purposes (44). Both TOLT and TOLT + 2 were found to have relatively high reliability in terms of internal consistency. Figure 1 presents the item-to-total correlations and Table 3 the coefficient α values for TOLT and TOLT + 2. The two tests showed similar ranges of item-to-total correlations and levels of reliability. However, when item-to-total correlations were considered as a measure of each individual item's contribution to overall test reliability and discriminatory power, the two extra items, Items 1 and 2, were not impressive. Item 1 had the lowest item-to-total correlation among all items. Item 2 had an item-to-total correlation well below the average (Figure 1). In addition to item-to-total correlations, another way to look at the discriminatory power of tests is the difficulty of the items: Items that are too easy do not contribute to the discrimination between high-ability and low-ability students. According to educational measurement convention, item difficulty for a test item is defined as the proportion of test takers who answered that item correctly (43). It ranges from 0 to 1.00, with higher values indicating easier items. The difficulty values for each item in TOLT and TOLT + 2 are shown in Table 4. As far as Items 3-12 are concerned, when each item in TOLT + 2 is compared to itself in TOLT, the item difficulties were reassuringly similar. When items were compared to each other, the majority of items had a difficulty level of 0.6-0.7 (Table 4),

Figure 1. Item-to-total correlations for each test item. Note: The dotted line is the average item-to-total correlation of all items. TOLT does not have Items 1 and 2, which are the two extra concrete items in TOLT + 2.

Table 3. Comparative Reliability by Coefficient Alpha Values for Each Test

Instrument   Cronbach's α Value   95% Confidence Interval
TOLT         0.756                0.732-0.778
TOLT + 2     0.754                0.731-0.776

indicating that 60-70% of students answered each item correctly. However, Item 1, one of the two concrete items that only appeared in TOLT + 2, had a difficulty of 0.94, meaning that 94% of test takers answered this item correctly. This item was much easier than all other items in the two tests. This result, as well as the low item-to-total correlation for both Items 1 and 2 (Figure 1), suggests that there was no advantage to adding the two extra concrete items into TOLT.

Differential Item Functioning Analysis on TOLT and TOLT + 2

The validity of making comparisons of test scores is based on the assumption that the measurement properties of the test are functioning similarly across different demographic groups, for example, male students and female students. When a test item unfairly favors members of one particular group over another, it is biased. For instance, for male students and female students with the same level of algebra ability, an algebra test item in the context of golfing and football may unfairly favor males over females, as males tend to be more familiar with golfing and football. A necessary condition for item bias is differential item functioning (DIF) (45). DIF occurs when performance on an item for members of two groups differs even when the groups are matched on the ability measured by the overall instrument. DIF is an important issue for test score validity; the most widely used method to detect DIF is Mantel-Haenszel (MH) statistics. MH statistics are based on the concept of odds ratio. For our example with males and females, the odds ratio can be defined as the ratio of male students' odds of answering an item correctly to female students' odds of answering it correctly, for cases when their abilities are matched. To examine whether any items in TOLT or TOLT + 2 have DIF with respect to student sex, the MH statistics were computed using SPSS software.
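To make the odds-ratio idea concrete, the sketch below computes a Mantel-Haenszel common odds ratio from 2 × 2 tables stratified by total score, so that the two groups are compared only within matched ability levels. It is a minimal illustration of the standard estimator, not the SPSS procedure used in this study; the function name, the tuple layout, and all counts are hypothetical.

```python
from typing import Iterable, Tuple

# One stratum = one total-score level:
# (group1_correct, group1_wrong, group2_correct, group2_wrong)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: Iterable[Stratum]) -> float:
    """Common odds ratio (group 1 odds over group 2 odds) across score strata."""
    numerator = 0.0
    denominator = 0.0
    for g1_correct, g1_wrong, g2_correct, g2_wrong in strata:
        n = g1_correct + g1_wrong + g2_correct + g2_wrong
        if n == 0:
            continue  # an empty score level carries no information
        numerator += g1_correct * g2_wrong / n
        denominator += g1_wrong * g2_correct / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


# Hypothetical counts for one item at three total-score levels.
example_strata = [(10, 10, 5, 15), (20, 10, 10, 10), (30, 5, 20, 10)]
print(mantel_haenszel_odds_ratio(example_strata))
```

A value near 1 means the two groups have similar odds of answering the item correctly once matched on total score; values far from 1 in either direction are what the DIF analysis flags.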
The measure of the effect size of DIF, ΔMH, namely, the extent to which male students had better odds of answering an item correctly than female students with the same level of formal reasoning ability, was determined as level A (little or no DIF), B (moderate DIF), and C (large DIF) based on the widely used Educational Testing Service (ETS) item classification system (45), which takes into consideration both statistical significance and the practical effect size. Almost all items were classified as level A (little or no DIF) according to the ETS classification system. Item 2 was the only one classified as level B using the ETS classification (see Table 5 in the online supporting information) and should be deemed to exhibit moderate DIF. Item 2's MH odds ratio was 0.548 (see Table 5 in the online supporting information), meaning that on average, female students only had 54.8% of the odds that male students with the same formal reasoning ability had of answering Item 2 correctly. These results indicated that TOLT was better than TOLT + 2 from an item bias point of view, as TOLT does not contain Item 2, a potentially biased item. While the Mantel-Haenszel statistics can only detect the items with uniform DIF, logistic regression (LR) has the advantage

Table 4. Comparative Item Difficulty Values for Each Item in Each Test^a

Item       1      2      3      4      5      6      7      8      9      10     11     12
TOLT       -      -      0.72   0.62   0.55   0.58   0.73   0.71   0.69   0.67   0.59   0.59
TOLT + 2   0.94   0.71   0.74   0.66   0.51   0.59   0.72   0.69   0.63   0.63   0.53   0.55

a Note: TOLT does not have Items 1 and 2, which are the two extra concrete items in TOLT + 2.


of identifying both uniform and nonuniform DIF (46). According to Mellenbergh (47):

Uniform DIF exists when there is no interaction between ability level and group membership. That is, the probability of answering the item correctly is greater for one group than the other uniformly over all levels of ability. Nonuniform DIF exists when there is interaction between ability level and group membership: that is, the difference in the probabilities of a correct answer for the two groups is not the same at all ability levels.
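The distinction in this definition can be illustrated with a logistic model in which the log odds of a correct response depend on total score, group membership, and their interaction. The sketch below is only an illustration under assumed values: the function names, the 0/1 group coding, and every coefficient are hypothetical and are not estimates from this study. With no interaction term, the gap between groups in log odds is the same at every score (uniform DIF); with an interaction, the gap changes with score and can even reverse direction (nonuniform DIF).

```python
import math


def p_correct(score, group, b0, b1, b2, b3):
    """P(correct) when logit = b0 + b1*score + b2*group + b3*score*group."""
    logit = b0 + b1 * score + b2 * group + b3 * score * group
    return 1.0 / (1.0 + math.exp(-logit))


def log_odds_gap(score, b2, b3):
    """Log odds of group 1 minus log odds of group 0 at a given total score."""
    return b2 + b3 * score


score_levels = range(0, 11)

# Uniform DIF: a group effect (b2) but no interaction (b3 = 0).
uniform_gaps = [log_odds_gap(s, b2=0.6, b3=0.0) for s in score_levels]

# Nonuniform DIF: the interaction (b3) makes the gap depend on score.
nonuniform_gaps = [log_odds_gap(s, b2=-1.2, b3=0.3) for s in score_levels]

for s, u, n in zip(score_levels, uniform_gaps, nonuniform_gaps):
    print(f"score {s:2d}: uniform gap {u:+.2f}, nonuniform gap {n:+.2f}")
```

In this hypothetical nonuniform case the item favors group 0 at low scores and group 1 at high scores, which is the kind of pattern a test of the interaction term is designed to pick up.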

In this study, the first logistic regression model used the log odds of a correct response to each item as the dependent variable, and the total score, sex, and their interaction term as the three independent variables. Results (see Table 5 in the online supporting information) indicate that no evidence emerges of nonuniform DIF for items other than Item 12, because of the lack of a statistically significant interaction. Item 12 does show a statistically significant interaction effect (p = 0.01), which means that the item favors one sex over the other differently at different overall ability levels. Therefore, using logistic regression revealed a potential problem with Item 12 that MH analysis was unable to detect. When we take the interaction away from our model to check whether the main effect for sex is statistically significant, our results should be comparable to the MH results, and indeed we see significance for Items 2, 3, 4, 7, and 10. This same pattern of uniform DIF was observed via MH analysis, but the MH effect size calculation associated with the ETS guidelines provided additional insight. Only Item 2, which is from GALT, was classified as level B. Because statistical significance testing carries with it an element of chance, it will be valuable to know whether Item 2 (from GALT) and Item 12 (from TOLT) continue to show up as potentially problematic items with other data sets.

Comparison 2: TOLT versus GALT as Intact Instruments

Comparison 1 above focused on the functioning of the two concrete items from GALT in a general chemistry context. As a follow-up, we also made a direct comparison between TOLT and GALT as intact instruments in both general chemistry and preparatory chemistry courses. The preparatory chemistry course at the university of investigation is offered to students with no prior chemistry coursework or with relatively low academic achievement, a different population from the general chemistry students.
The same research design and methods described in Comparison 1 were applied here. We collected data from all General Chemistry I sections during Fall 2006 and Spring 2007 semesters as well as from the preparatory chemistry course during Fall 2006, Spring 2007, and Fall 2007 semesters. In each course, TOLT and GALT were assigned randomly (i.e., each student was randomly given either TOLT or GALT). Thus, students were evenly distributed into the TOLT group and GALT group. Analysis of students' academic backgrounds showed no significant difference between the TOLT takers and GALT takers in either course, similar to the equivalence results in Table 2.

Reliability and Discriminatory Power of TOLT and GALT

Table 6 in the online supporting information lists the reliability results of TOLT and GALT for these two courses


measured by Cronbach's α. TOLT was found to have slightly higher reliability than GALT. Also, when item-to-total correlations were considered as a measure of each individual item's contribution to the discriminatory power and reliability of the test, the GALT items were in general not impressive and had lower item-to-total correlations than the TOLT items (see Table 6 in the online supporting information). One could argue that because TOLT and GALT have different numbers of test items, their reliabilities are not directly comparable. Two facts contradict this argument. First, as discussed in the literature, when the Spearman-Brown prophecy formula is used to standardize the coefficient α, the standardized α values can be directly compared, as the test lengths are already taken into consideration and the α values are adjusted accordingly (48). Therefore, the higher standardized coefficient α for the TOLT suggests that it has a higher level of internal consistency than the GALT. Second, the major difference between TOLT and GALT is the two extra concrete items that the GALT contains over and above TOLT. If we exclude the two concrete items, then the remaining 10 items in the GALT are very similar to the 10 items in the TOLT. For a direct comparison, one would expect that the standardized coefficient α based on the remaining 10 items in the GALT would be similar to the α based on the 10 items in the TOLT. In our preparatory chemistry data set, GALT had a standardized α value of 0.647 for all 12 items, which fell to 0.622 for the remaining 10 items, still lower than the standardized α value of 0.669 based on the 10 items in the TOLT. Results for General Chemistry I were similar, with the 10-item GALT α value dropping to 0.627. These observations suggest TOLT was better than the GALT in terms of test reliability.
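The two reliability quantities discussed above can be sketched as follows. This is a minimal illustration, assuming item scores are stored as one row of 0/1 values per student; it is not the SPSS routine used in this study, and it is not claimed to reproduce the standardized α values reported here. The function names and the small data set are hypothetical. The first function follows eq 1 directly, and the second is the general Spearman-Brown prophecy formula for the predicted reliability of a test whose length is multiplied by a given factor.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Coefficient alpha (eq 1) from rows of item scores, one row per student."""
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical responses: 6 students by 4 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))

# Predicted reliability if a test with this alpha were doubled in length.
print(round(spearman_brown(alpha, 2), 3))
```

Because the Spearman-Brown adjustment places tests of different lengths on a common footing, it is one way to reason about comparing a 10-item and a 12-item instrument.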
One possible argument for using GALT is that, while it offers no advantage for general chemistry students, its two concrete items might be useful for identifying low-reasoning-ability students in preparatory chemistry. This plausible potential advantage was not, however, borne out in our data. Table 7 in the online supporting information shows students in preparatory chemistry tended to score lower on every item in both instruments than did students in general chemistry, consistent with the expectation that preparatory chemistry students have lower formal reasoning abilities. Item 1, one of the two concrete items in GALT, was again much easier than all other items, as 92% of general chemistry students and 83% of preparatory chemistry students answered it correctly (see Table 7 in the online supporting information). Not only was this GALT item too easy for general chemistry students, but it was also too easy and had the lowest item-to-total correlation for preparatory chemistry students (see Tables 6 and 7 in the online supporting information); hence, it lacks discriminatory power, similar to results in comparison 1 (Figure 1 and Table 4).

DIF Analysis on TOLT and GALT

In terms of potential item bias, Mantel-Haenszel statistics showed that GALT more frequently had potentially biased items with an ETS classification of “C” (i.e., large DIF) for both general chemistry and preparatory chemistry students (see Table 8 in the online supporting information). In this regard, TOLT is better than GALT in that it displays less potential for bias. First, logistic regression was run to examine items for nonuniform DIF. For TOLT, no item was found to exhibit nonuniform


DIF for either the general chemistry (GC) or the preparatory chemistry (PrepC) population. For GALT, Items 9 and 10 for general chemistry and Item 2 for preparatory chemistry show a statistically significant interaction effect, which indicates nonuniform DIF. When, as before, we take the interaction away from our model to check whether the main effect for sex is statistically significant and comparable to MH results, the uniform DIF findings from the two methods agree well. For TOLT, the only item with uniform DIF is Item 4 for preparatory chemistry, which also has the highest χ² (3.76) within the MH analysis but was classified as Level B. For GALT, the items revealed as having uniform DIF from logistic regression are exactly the same as those identified via MH analysis: Items 2, 4, and 11 for general chemistry; and Items 2 and 10 for preparatory chemistry. Item 2, identified as a potential problem in comparison 1, continues to be problematic in these data sets, remaining as Level B in general chemistry but escalating to Level C in preparatory chemistry.

To visualize what the DIF results mean, we made a series of graphs to represent students' performance on the specific items for each sex by overall ability level (Figure 2). Overall ability in this case is a student's total score on the test. For an item without DIF, the proportions of correct responses for males and for females should be similar when the total ability is controlled. Figure 2A presents the raw data for TOLT Item 9 from comparison 1. No uniform or nonuniform DIF was found for this item, which means that either sex is equally likely to give a correct response for this item when total ability is controlled. Accordingly, at each data point on the graph, the proportion of correct responses for males is never very different from the proportion of correct responses for females. Figure 2B is for GALT Item 9 in general chemistry from comparison 2. GALT Item 9 does not exhibit uniform DIF.
However, when we checked for nonuniform DIF using LR statistics, a statistically significant result was obtained, which indicates significant differences in performance for students based on sex depending on overall ability levels. From the raw data presented in Figure 2B, if we look at the lower ability levels in particular, GALT Item 9 appears to alternate between favoring females (at the score 2, 3, and 6 ability levels) and favoring males (at the score 4 level). Whether this pattern warrants concern depends upon the use to which the GALT scores will be put. Because one application of formal reasoning tests is as a screening mechanism (49) for which low scores are grounds for remediation, the tendency of an item to alternately favor one sex or the other depending on overall ability could present a problem in terms of choosing an appropriate cutoff score. Checking for nonuniform DIF in a data set to be used for this purpose would be advisable.

Figure 2C is for GALT Item 2 in preparatory chemistry from comparison 2. Item 2 exhibits both uniform and nonuniform DIF. The raw data plot demonstrates that this item favors males at most of the ability levels but favors females at the score 3 level. When considered overall, however, the degree to which Item 2 favors males at the other ability levels is greater than the degree to which it favors females at score 3, resulting in the finding of uniform DIF as well as nonuniform DIF. This item also has uniform DIF favoring males for general chemistry in both comparisons 1 and 2 (see Table 5 in the online supporting information). These multiple indicators hint that the item could be biased. It is not clear that males would necessarily be more

© 2010 American Chemical Society and Division of Chemical Education, Inc.


Figure 2. Proportion of correct responses for males and females versus total score for sample items.
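As a rough illustration of how the data behind plots such as Figure 2 might be tabulated, the following Python sketch computes, for a single item, the proportion of correct responses for each sex at each total score. The record layout, the function name `proportion_correct_by_score`, and the sample data are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def proportion_correct_by_score(records):
    """Tabulate the proportion correct on one item by total score and sex.

    records: iterable of (sex, total_score, item_correct) tuples, where
    item_correct is 1 for a correct response and 0 otherwise.
    Returns {total_score: {sex: proportion_correct}}.
    """
    # counts[total][sex] holds [number correct, number of students]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for sex, total, correct in records:
        cell = counts[total][sex]
        cell[0] += int(correct)
        cell[1] += 1
    return {
        total: {sex: n_correct / n for sex, (n_correct, n) in by_sex.items()}
        for total, by_sex in sorted(counts.items())
    }


# Made-up data for illustration only (not from the study).
records = [
    ("F", 3, 1), ("F", 3, 0), ("M", 3, 1), ("M", 3, 1),
    ("F", 4, 1), ("F", 4, 1), ("M", 4, 1), ("M", 4, 0),
]
table = proportion_correct_by_score(records)
for total, by_sex in table.items():
    print(total, by_sex)
```

For an item without DIF, the two proportions at each total score would be expected to be close to each other; large or sign-changing gaps of the kind described for GALT Items 9 and 2 would show up as diverging or crossing lines when these values are plotted.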

familiar with this sort of context of a metal ball sinking into water (see the online supporting information), so the source of the potential bias is unknown. Regardless, the fact that males perform better on this item consistently for all three data sets is cause for concern.

Other Concerns with the GALT

Besides Item 2, Item 11 in the GALT was also found to have a large level of DIF in general chemistry (see Table 8 in the online supporting information), again favoring males. Because this item

pubs.acs.org/jchemeduc


Vol. 87 No. 12 December 2010


Journal of Chemical Education


Figure 3. Text of Item 11 from the GALT (30).

does not exhibit DIF with the preparatory chemistry population, an allegation of bias would be premature. However, Item 11, which asks students to enumerate all possible pairs of dance partners (Figure 3), is problematic for other reasons. The item explicitly requests students to “restrict the possible combinations to boys and girls dancing with each other”, which lacks cultural sensitivity (50): in certain cultures, such as some Muslim and Orthodox Jewish groups, girls are not allowed to dance with boys. Students who identify with these and other cultural groups with distinct traditions regarding dancing must answer the question in a way counter to their cultural norm, rendering Item 11 unnecessarily offensive. Additionally, the stricture that boys may dance only with girls implies heteronormativity (51), which, again unnecessarily, can serve to marginalize and stigmatize students with sexual identities that are not heterosexual. These issues alone are enough to recommend recasting the question in a different context or simply choosing the TOLT rather than the GALT.

Conclusions and Implications

From comparison 1 of TOLT with TOLT + 2 using a representative sample from a first-semester general chemistry course, we could find no advantage to adding the two concrete items from GALT to TOLT. Both TOLT and TOLT + 2 exhibited reasonable reliability, and the common items of the two tests had similar discriminatory power as measured by item-to-total correlations. The concrete item on mass conservation was much easier and showed a much lower item-to-total correlation than all other items on the two instruments, exhibiting a significant lack of discriminatory power for that GALT item on TOLT + 2. The fact that 94% of students correctly answered that item supports Williamson et al.'s claim that “the shorter TOLT is a better choice [than the GALT] because few if any students would be predicted to lack conservation of matter” reasoning ability (37).
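To make the two item statistics mentioned here concrete, the Python sketch below computes item difficulty (the proportion of students answering correctly) and an item-to-total correlation for each item of a scored test. It assumes dichotomous 0/1 scoring and uses the corrected form of the correlation, in which each item is correlated with the total of the remaining items; whether the study used the corrected or uncorrected form is not stated in this section, so that choice, the function names, and the sample data are all assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable has no variance
    return sxy / sqrt(sxx * syy)


def item_statistics(responses):
    """Return a list of (difficulty, corrected item-total r), one per item.

    responses: list of per-student lists of 0/1 item scores.
    """
    n_items = len(responses[0])
    stats = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score on the remaining items, excluding item j itself.
        rest = [sum(row) - row[j] for row in responses]
        difficulty = sum(item) / len(item)
        stats.append((difficulty, pearson(item, rest)))
    return stats


# Made-up scores for five students on three items (illustration only).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
]
stats = item_statistics(responses)
for number, (difficulty, r) in enumerate(stats, start=1):
    print(f"Item {number}: difficulty={difficulty:.2f}, item-total r={r:.2f}")
```

An item that nearly every student answers correctly, such as the mass conservation item with 94% correct, has very little variance, which limits how strongly it can correlate with the rest of the test and therefore how well it can discriminate among students.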
The other concrete item on TOLT + 2, concerning volume conservation, was the only item to display statistically significant DIF with a moderate effect size, while all other items were classified as Level A (little or no DIF).

From comparison 2, a direct comparison between TOLT and GALT as intact instruments, GALT showed no advantage over TOLT for either General Chemistry I or preparatory chemistry in terms of reliability, discriminatory power, and potential item bias. GALT Item 1 remained very easy for both populations, exhibiting a lack of discriminatory power. Across the board, more items exhibited DIF in GALT than in TOLT. GALT Item 2 consistently demonstrated potential bias against


females across the general and preparatory chemistry student population. Beyond statistical issues, GALT Item 11 reinforces heteronormativity and lacks cultural sensitivity. Although we would recommend TOLT on the basis of our investigations, those who still do want to use GALT in the context of college chemistry courses would do well to consider whether Item 1 is necessary, and should investigate modifications of Items 2 and 11.

As discussed earlier, it is advantageous to use a theory-based instrument rather than one that intends to measure prior knowledge of mathematics and chemistry. Entering chemistry students have typically had prior instruction in mathematics and chemistry, but a low score on an instrument containing mathematics and chemistry questions suggests only that this prior instruction was ineffective. What will make the second opportunity to learn basic mathematics and chemistry effective? On the question of an instructional approach for effective remediation, the instrument is silent. On the other hand, a theory-based instrument, such as TOLT, ties to learning theory. A low TOLT score immediately suggests two potential remedies: (i) Ensure that chemistry concepts are presented in a concrete way when they are initially introduced in the general chemistry course (11); and (ii) Apply specific interventions that have been shown to support the development of formal reasoning ability (52-55). Also, given the need for effective remedial courses, future investigations that evaluate the equity implications of courses aligned with a cognitive development perspective may be beneficial to this important group of at-risk students. As one example, learning cycles have shown benefits for low-formal reasoning students (56), so a remedial course using a learning-cycle approach would lend itself well to this type of investigation.

Literature Cited

1. Cavallo, A. M. L. J. Res. Sci. Teach. 1996, 33, 625–656.
2. Niaz, M. Int. J. Sci. Educ. 1996, 18, 525–541.
3. Niaz, M.; Robinson, W. R. J. Res. Sci. Teach. 1992, 29, 211–226.
4. Nicoll, G.; Francisco, J. S. J. Chem. Educ. 2001, 78, 99–102.
5. Noh, T.; Scharmann, L. C. J. Res. Sci. Teach. 1997, 34, 199–217.
6. Rubin, R. L.; Norman, J. T. J. Res. Sci. Teach. 1992, 29, 715–727.
7. Uzuntiryaki, E.; Geban, O. Instruct. Sci. 2005, 33, 311–339.
8. Boujaoude, S.; Salloum, S.; Abd-El-Khalick, F. Int. J. Sci. Educ. 2004, 26, 63–84.
9. Oliva, J. M.; Cadiz, C. P. Int. J. Sci. Educ. 2003, 25, 539–561.
10. Good, R.; Mellon, E. K.; Kromhout, R. A. J. Chem. Educ. 1978, 55, 688–693.
11. Herron, J. D. J. Chem. Educ. 1975, 52, 146–150.
12. Inhelder, B.; Piaget, J. The Growth of Logical Thinking from Childhood to Adolescence; Basic: New York, 1958.
13. Zeidler, D. L. J. Res. Sci. Teach. 1985, 22, 461–471.
14. Cracolice, M. S. In Chemists' Guide to Effective Teaching; Pienta, N. J., Cooper, M. M., Greenbowe, T. J., Eds.; Prentice Hall: Upper Saddle River, NJ, 2005; pp 12–27.
15. McKinnon, J. W.; Renner, J. W. Am. J. Phys. 1971, 39, 1047–1052.
16. Wagner, E. P.; Sasser, H.; DiBiase, W. J. J. Chem. Educ. 2002, 79, 749–755.
17. McFate, C.; Olmsted, J. A. I. J. Chem. Educ. 1999, 76, 562–565.
18. Niedzielski, R. J.; Walmsley, F. J. Chem. Educ. 1982, 59, 149–151.
19. Russell, A. A. J. Chem. Educ. 1994, 71, 314–317.
20. Bunce, D.; Gabel, D.; Herron, J. D.; Jones, L. J. Chem. Educ. 1994, 71, 850.
21. Bentley, A. B.; Gellene, G. I. J. Chem. Educ. 2005, 82, 125–130.


22. Freeman, W. A. J. Chem. Educ. 1984, 61, 617–619.
23. Nurrenbern, S. C. J. Chem. Educ. 2001, 78, 1107–1110.
24. Shibley, I. A.; Milakofsky, L.; Bender, D. S.; Patterson, H. O. J. Chem. Educ. 2003, 80, 569–573.
25. Tsaparlis, G. Res. Sci. Technol. Educ. 2005, 23, 125–148.
26. Cattle, J.; Howie, D. Int. J. Sci. Educ. 2005, 30, 185–202.
27. Cracolice, M. S.; Deming, J. C.; Ehlert, B. J. Chem. Educ. 2008, 85, 873–878.
28. Endler, L. C.; Bond, T. G. Res. Sci. Educ. 2008, 38, 149–166.
29. Tobin, K. G.; Capie, W. Educ. Psychol. Meas. 1981, 41, 413–423.
30. Roadrangka, V.; Yeany, R. H.; Padilla, M. J. Paper presented at the annual meeting of the National Association for Research in Science Teaching, Dallas, TX, 1983.
31. Bunce, D. M.; Hutchinson, K. D. J. Chem. Educ. 1993, 70, 183–187.
32. Poole, B. A. M. Ph.D. Dissertation, The University of Southern Mississippi, 1997.
33. Knight, K. H.; Elfenbein, M. H.; Martin, M. B. Sex Roles 1997, 37, 401–414.
34. Verzoni, K.; Swan, K. Appl. Cognit. Psychol. 1995, 9, 213–234.
35. Oliva, J. M.; Cadiz, C. P. Int. J. Sci. Educ. 1999, 21, 903–920.
36. Valanides, N. C. Sch. Sci. Math. 1996, 96, 99–106.
37. Williamson, V.; Huffman, J.; Peck, L. J. Chem. Educ. 2004, 81, 891–896.
38. Baird, W. E.; Shaw, E. L.; McLarty, P. Sch. Sci. Math. 1996, 96, 85–93.
39. Examinations Institute of the American Chemical Society Division of Chemical Education; First Term General Chemistry (Special Examination), Clemson University, 1997.
40. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; Lawrence Erlbaum Associates, Inc.: Hillsdale, NJ, 1988.
41. Lewis, S. E.; Lewis, J. E. J. Chem. Educ. 2005, 82, 1408–1412.


42. Popham, W. J. Modern Educational Measurement: Practical Guidelines for Educational Leaders, 3rd ed.; Allyn and Bacon: Needham, MA, 2000.
43. Crocker, L. M.; Algina, J. Introduction to Classical and Modern Test Theory; CBS College Publishing, Holt, Rinehart and Winston: New York, NY, 1986.
44. Nunnally, J. C. Psychometric Theory, 2nd ed.; McGraw-Hill: New York, 1978.
45. Clauser, B. E.; Mazor, K. Educ. Meas. Issues Pract. 1998, 17, 31–44.
46. Swaminathan, H.; Rogers, H. J. J. Educ. Meas. 1990, 27, 361–370.
47. Mellenberg, G. J. J. Educ. Stat. 1982, 7, 105–108.
48. Bodner, G. M. J. Chem. Educ. 1980, 57, 188–190.
49. Lewis, S. E.; Lewis, J. E. Chem. Educ. Res. Pract. 2007, 8, 32–51.
50. Liamputtong, P. In Doing Cross-Cultural Research: Ethical and Methodological Perspectives; Liamputtong, P., Ed.; Springer: Dordrecht, The Netherlands, 2008; pp 3–20.
51. Warner, M. Social Text 1991, 9, 3–17.
52. Adey, P. S.; Shayer, M. Really Raising Standards: Cognitive Intervention and Academic Achievement, 1st ed.; Routledge: New York, 1994.
53. Vass, E.; Schiller, D.; Nappi, A. J. J. Res. Sci. Teach. 2000, 37, 981–995.
54. Cattle, J.; Howie, D. Int. J. Sci. Educ. 2005, 30, 185–202.
55. Endler, L. C.; Bond, T. G. Res. Sci. Educ. 2008, 38, 149–166.
56. Abraham, M. R.; Renner, J. W. J. Res. Sci. Teach. 1986, 23, 121–143.

Supporting Information Available

Two concrete items that GALT contains over and above TOLT; Tables 5-8. This material is available via the Internet at http://pubs.acs.org.
