Chapter 7

An Introduction to Nonparametric Statistics in Chemistry Education Research

Scott E. Lewis*
University of South Florida, 4202 E. Fowler Ave., CHE205, Tampa, Florida 33620, United States
*E-mail: [email protected]

The intent of this chapter is to present an overview of nonparametric statistics, particularly as they are employed in chemistry education research. The nonparametric statistical tests presented are: chi-square, Spearman's rho, Kendall's tau, logistic regression, the Wilcoxon signed rank test, the Mann-Whitney test, and the Kruskal-Wallis test. Each of these is presented with a hypothetical chemistry education research example and followed with a review of how the test has been used in the research literature. This overview is intended for researchers who are performing or considering projects in chemistry education research and are familiar with the general processes of statistical testing and interpretation of results, but are unfamiliar with nonparametric statistical tests.

Introduction

Chemistry education research (CER) sits at the intersection of multiple established disciplines, and as a result, faculty who undertake research projects in this field have a wide variety of training. It is unlikely that any one researcher is proficient in the full range of techniques present in the field. This makes efforts to share the tools of research in chemistry education, through both this book and the Nuts and Bolts of Chemical Education Research (1), uniquely important. The intent of this chapter is to present an overview of nonparametric statistics, particularly as they are employed in chemistry education research. This overview is intended for researchers who are performing or considering projects

in chemistry education research and are familiar with the general processes of statistical testing and interpretation of the results, but are unfamiliar with nonparametric statistical tests.


Data Scales

Prior to introducing the statistical tests, an overview of the different scales of data is needed to ground the discussion of the appropriate uses for each test. Stevens (2) proposed a hierarchy of data scales to describe the type of information that could be conveyed with each scale, a classification which remains in widespread use in statistics texts.

The first data scale is termed nominal, and describes data which represent categories and offer no potential for ranking. In CER, examples of nominal data include student demographics such as sex or race, or categorizations of student qualities or experiences, such as whether students have taken part in a teaching reform.

The next data scale is ordinal, which also represents categories but with a potential for ranking. One of the important features of ordinal data is that no assumption is made regarding the distance between the rankings. A common CER example of ordinal data is a set of responses to a Likert-style survey, for example, when respondents are asked to rate an item on frequency of occurrence using the scale Never, Occasionally, Sometimes, Usually, or Always. There is a clear ranking of the responses, but no assumption can be made regarding the difference between the rankings. For example, the difference between Never and Occasionally is not necessarily the same as the difference between Occasionally and Sometimes. Another example of ordinal data is the ubiquitous letter grade scale used to rate student performance. Depending on how the grades are assigned, it is possible that the difference between a grade of A and a grade of B may not be the same as the difference between a grade of B and a grade of C. Instead, classifying this data as ordinal conveys only that students receiving an A were rated higher than those receiving a B.

The interval data scale has both a ranking and an assumption that the difference between the rankings is consistent across the scale. Another characteristic of interval data is that the value for zero is arbitrarily defined. A CER example of interval data could be SAT scores on each subject test, which have a range of 200 to 800. The scale could just as easily have been defined as 0 to 600. As a result of the arbitrary zero, ratio relationships are not consistent among interval data. That is, an SAT subject score of 400 has not demonstrated twice that of a subject score of 200 on any associated metric. The difference between the rankings remains consistent, as the 100-point difference between a 400 and a 300 subject score shares similarity with the difference between a 500 and a 400 subject score; each difference represents a distance of approximately one standard deviation.

The last data scale is ratio, which builds upon interval but the zero point has a meaningful definition. A common CER example of ratio data is most test scores, where a zero indicates answering no questions correctly. Ratio level data allow for comparisons of both differences and ratios. In classical test theory, a test


score of 60 represents a student who answered twice as many questions correctly as a test score of 30. Differences also remain consistent, as the number of questions needed to go from a score of 30 to a score of 45 is the same as the number needed to go from 60 to 75.

Ratio data completes the hierarchy of data scales that progresses from nominal, to ordinal, then interval, and finally ratio. The hierarchy is arranged in terms of the amount of information that is conveyed, where interval data convey more information than ordinal data, for example. A statistical test proposed for one data scale, say ordinal, can be employed for any higher-level data scale, in this case interval or ratio data. However, a statistical test described for a particular data scale typically should not be used with any lower-level data. That is, a test prescribed for ordinal data should not be used with nominal level data.

The data scales presented are not without controversy. Velleman and Wilkinson (3) provide a thorough introduction to these concerns. Among the critiques, there is concern that some data are not well described by the available scales. Percentages, for example, behave as ratio level data but also carry additional information. Also, determining the appropriate data scale may depend on the context of the data, while the general description of the data scales gives the impression that data can be assigned a scale independent of context. For example, consider binary data that classify students as passing or failing a class. One context for these data is to determine how many students may repeat the class, in which case the data can be treated as nominal. Alternatively, in a study investigating student success, a researcher may treat a pass as a higher-ranked outcome, thus making the data ordinal. Finally, there is a tendency to prescribe each statistical test to a particular data scale, which may prevent researchers from using other statistical tests that could provide evidence of meaningful relationships. The intent of this chapter, then, as an introduction, is to indicate the data scales most commonly associated with each statistical test, with the caveat that employing other tests should also be considered. To assist in determining the appropriateness of a test, an effort is made to present the underlying math for each statistical test in this chapter.

Nonparametric versus Parametric

Statistical tests are classified based on their reliance on the normality assumption: parametric statistics assume normality and nonparametric statistics do not. The normality assumption is that the data used in the test will follow a normal distribution, which has the appearance of a bell curve. Nominal data cannot follow a normal distribution, as the data do not indicate a ranking. Ordinal data also do not follow a normal distribution, as there is no indication of the consistency of differences among the rankings. A histogram of ordinal data may at first appear normal, but since no assumption about the distance between ordinal categories can be made, the distances between categories cannot be assumed to follow any particular pattern. As a result, the normality assumption would not be satisfied for nominal or ordinal level data, and nonparametric statistics are recommended.


Interval and ratio data scales can follow a normal distribution, but the normality assumption should be examined. Initial tests involve examining the skewness and kurtosis values of the distribution in the context of the standard error of skewness and kurtosis. Follow-on tests can include a visual inspection of a frequency plot, particularly to ensure the data do not have multiple modes. While many parametric tests are robust to violations of normality (4), not all tests are, and in these cases nonparametric statistics can offer a suitable alternative. It is worth noting, though, that many parametric tests, such as independent samples t-tests, ANOVA, and multiple regression, rely on a normality assumption only for the dependent variable. In an independent samples t-test, for example, which compares two groups, the independent variable of group identification is nominal and therefore cannot follow a normal distribution. Table I indicates the relationship between common parametric tests and their nonparametric alternatives.
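Before turning to Table I, the initial skewness and kurtosis screening described above can be illustrated with a short sketch. This is a minimal illustration assuming SciPy and NumPy are available; the large-sample approximations sqrt(6/n) and sqrt(24/n) for the standard errors are common rules of thumb adopted here for illustration, not a prescription from this chapter.

```python
# A minimal sketch (assuming SciPy/NumPy) of an initial normality screening:
# compare skewness and excess kurtosis to approximate standard errors.
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical exam scores (interval/ratio level data)
scores = np.array([72, 65, 88, 91, 54, 77, 83, 69, 75, 80,
                   62, 95, 71, 58, 84, 79, 66, 73, 90, 68])
n = len(scores)

skewness = skew(scores)             # 0 for a perfectly symmetric distribution
excess_kurtosis = kurtosis(scores)  # 0 for a normal distribution (Fisher definition)
se_skew = np.sqrt(6 / n)            # large-sample approximation (assumption)
se_kurt = np.sqrt(24 / n)           # large-sample approximation (assumption)

# A common rule of thumb flags values more than about twice their standard error
print(f"skewness = {skewness:.2f} (SE ~ {se_skew:.2f})")
print(f"excess kurtosis = {excess_kurtosis:.2f} (SE ~ {se_kurt:.2f})")
```

A frequency plot of the same data would then serve as the follow-on visual check for multiple modes.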

Table I. Mapping Nonparametric Tests to Parametric Counterparts

Type of Relationship                           | Parametric Test              | Counterpart Nonparametric Test
Measures of Association between Variables     | (none)                       | Chi-square (χ2) test
                                               | Pearson Correlation          | Spearman's rho (ρ), Kendall's tau (τ)
                                               | Multiple Regression          | Logistic regression
Pre/Post Comparison of a Repeated Measure      | Paired t-test                | Wilcoxon Signed Rank test
Comparison of Two Independent Groups           | Independent Samples t-test   | Mann-Whitney test
Comparison of More than Two Independent Groups | Analysis of Variance (ANOVA) | Kruskal-Wallis test

Nonparametric tests can result in determinations of statistical significance in much the same way parametric tests do. The determination of statistical significance involves establishing a limit for the Type I error rate, termed the α-value, prior to analysis. The analysis results in an observed p-value, the probability of obtaining data at least as extreme as that observed if the null hypothesis is true. The p-value is compared to the α-value, and if it is below the α-value, the null hypothesis is rejected. In many nonparametric tests, the value for p can be calculated directly by hand or by using statistical software. The recommendation is to calculate statistical significance by hand for small sample sizes, as many software programs make automatic corrections for continuity or in the case of ties (5). When performing calculations by hand, statistical significance can be determined using tables of p-values. All tests presented in this chapter have an applicable table presented in Leach (6), with logistic regression being the only exception. For larger samples, where the corrections employed are more applicable, the use of statistical software is recommended and possibly necessary.

Some statistical software packages do not include all of the tests presented here, and researchers are advised to check the availability of the desired tests before deciding on a purchase.

Nonparametric Statistical Tests


Measures of Association

Correlations are one of the most useful techniques in education research, as they allow a quick examination of association, or the absence thereof, between two variables. The term correlation often refers to the Pearson Product Moment Correlation, which is a parametric statistic relying on a normality assumption for each of the variables examined. Multiple nonparametric measures of association exist, depending on the data scales used, and are reviewed below.

Chi-Square (χ2) Test

The χ2 test examines associations between data that are at the nominal data scale. Nominal is the lowest data scale, and thus this test can be employed with any of the other data scales, though interpretation can become problematic when the number of possible data values becomes large (e.g., a nominal scale may have three categories, but a ratio scale may have hundreds of possible values). To perform the chi-square test, a contingency table is first created. In a contingency table, the variables are listed as the headings for the columns and rows, and the frequency of each cross-tabulation of the variables is reported. Table II uses fictional data to demonstrate a contingency table between gender and passing a class.

Table II. Example Contingency Table

       | Fail | Pass | Total
Male   |  24  |  16  |  40
Female |  20  |  40  |  60
Total  |  44  |  56  | 100
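In practice, a contingency table like Table II is usually tabulated from raw student records. The sketch below shows one way to do so; the use of pandas and the column names gender and outcome are assumptions made for illustration, not part of the chapter.

```python
# A minimal sketch (assuming pandas) of building a contingency table like
# Table II from one row per student.
import pandas as pd

# Hypothetical student records matching the frequencies in Table II
records = pd.DataFrame({
    "gender": ["Male"] * 40 + ["Female"] * 60,
    "outcome": ["Fail"] * 24 + ["Pass"] * 16 + ["Fail"] * 20 + ["Pass"] * 40,
})

# Cross-tabulate observed frequencies, including row and column totals
table = pd.crosstab(records["gender"], records["outcome"],
                    margins=True, margins_name="Total")
print(table)
```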

In this example, a χ2 test can determine whether there is a relationship between the gender of the student and the chance of passing the class. To do this, the test determines an expected value for each cell under the assumption that there is no relationship between the variables. The expected value for male students who fail would be: the proportion of students who are male (0.40) multiplied by the proportion of students who failed (0.44) multiplied by the total number of students (100). This calculation gives a value of 17.6. Table III includes the data from Table II, coupled with the expected value for each cell.

Table III. Example Contingency Table with Expected Values

                  | Fail | Pass | Total
Male   | Observed |  24  |  16  |  40
       | Expected | 17.6 | 22.4 |
Female | Observed |  20  |  40  |  60
       | Expected | 26.4 | 33.6 |
Total  |          |  44  |  56  | 100

The χ2 test then examines the difference between the observed value and the expected value for each cell to determine if a relationship is present. Because the expected value assumes no relationship between the variables, any differences between observed and expected values are evidence of a relationship. χ2 is calculated using the formula (6):

χ2 = Σ (O - E)^2 / E

where O is the observed frequency, E is the expected frequency for each cell, and the summation runs over all cells of the contingency table.

In this example, the χ2 value equals 6.926. The degrees of freedom (df) for a contingency table, not counting the totals, are equal to:

df = (number of rows - 1) × (number of columns - 1)
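As a check on the hand calculation, the sketch below reproduces the expected counts, the χ2 value, the degrees of freedom, and the p-value reported below for the Table II data. The use of SciPy's chi2_contingency function is an assumption for illustration; the chapter itself does not prescribe particular software.

```python
# A minimal sketch (assuming NumPy/SciPy) of the chi-square test on Table II,
# computed both by hand and with scipy.stats.chi2_contingency.
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from Table II: rows = Male, Female; columns = Fail, Pass
observed = np.array([[24, 16],
                     [20, 40]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()  # [[17.6, 22.4], [26.4, 33.6]]

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

# chi2_contingency performs the same test; correction=False matches the hand
# calculation (the default Yates continuity correction would give a smaller value)
chi2, p, df, expected_scipy = chi2_contingency(observed, correction=False)

print(f"chi2 = {chi2_by_hand:.3f}, df = {df}, p = {p:.3f}")  # chi2 = 6.926, df = 1, p = 0.008
```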

In this example, df = 1. Using both the χ2 value and the df, a p-value can be obtained to aid in the decision regarding the null hypothesis. The p-value for this example was found to be 0.008, which may lead to rejection of the null hypothesis, depending on the threshold for Type I error (the α-value) the researcher decides is appropriate. The incorporation of expected values in Table III offers insight into the relationship that is observed, which the χ2 value alone cannot provide. In Table III, we see that female students were more likely to pass the course than expected, and male students were more likely to fail the course than expected.

The χ2 test is common in CER projects, as illustrated by three recent examples in the research literature. Gron et al. (7) converted the ratio data of students' recorded percent error into ordinal data by categorizing results as