
Editorial
pubs.acs.org/jchemeduc

Cite This: J. Chem. Educ. 2018, 95, 1−2

The Replication Crisis and Chemistry Education Research

Melanie M. Cooper*
Department of Chemistry, Michigan State University, East Lansing, Michigan 48824, United States

ABSTRACT: There is a growing acknowledgment that some research findings in fields as diverse as medicine and psychology may not be replicable. There are many reasons for this, including a reliance on small sample sizes and the retroactive choice of data to analyze. It has become clear that the traditional measure of statistical significance (p < 0.05) is not enough to support research claims. In this editorial I discuss the background and some possible solutions.

KEYWORDS: General Public

In 2005 John Ioannidis published a paper, "Why Most Published Research Findings Are False", that has become a rallying point for researchers across a wide range of disciplines.1 While the title was intentionally provocative, the message was serious and has had ramifications for a number of research communities. In the paper (which has been cited over 5000 times), Ioannidis wrote (ref 1, p 696):

[A] research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance.

The original paper was focused on the medical research literature (thus the reference to financial interest), but the idea that research findings involving complex human behaviors may be subject to systematic problems soon spread to the social and psychological sciences. Subsequently, many landmark studies in these fields have failed to replicate. This "replication crisis" was almost certainly accelerated by the publication of a peer-reviewed paper in a leading journal, by a prestigious researcher, that claimed statistically significant "proof" that extrasensory perception (ESP), or precognition, is a real (that is, reproducible) phenomenon.2 Such a claim caused some soul-searching in the psychology community, and sure enough, attempts to replicate the findings failed.3 However, the fact that this paper made it into the peer-reviewed literature reinforced other researchers' suspicions that there are systemic problems in the design and analysis of studies based on significance testing.

A large number of other studies have also failed to replicate, perhaps the most famous being the "power posing" study, in which researchers reported that standing tall with hands on hips (like superwoman) for as little as two minutes resulted in a surge of testosterone, a decrease in cortisol (the stress hormone), and a resulting increase in personal confidence. The TED talk on power posing4 has been viewed over 44 million times and clearly struck a chord with many viewers. This leaves those of us who practiced our power poses wondering if the placebo effect still works when you know the physiological response is not there!



Why, you may ask, am I discussing these social psychology studies in a chemistry education editorial? The answer lies in the fact that many of the methods used in these studies are also found in papers published in this Journal. Over the years the research described in chemistry education papers has become more complex, and so have the statistical analyses of the data. While sophistication can be a good thing, the problems that Ioannidis cites in his original article are also endemic to education research. It is not my intention to point fingers at any particular researchers, and frankly I am no expert in statistics, but it now seems clear that identifying significance in terms of p values is no guarantee that the results reported are meaningful.

The major sources of error in these studies are both random and systematic. For example, it is possible to analyze data from small studies and find meaningless (random) "statistically significant" effects. While it is becoming increasingly common to report some measure of the power of the study (for example, by reporting the effect size), this is not always the case. Another source of error involves "judicious" (actually injudicious) choice of data to analyze, using repeated and multiple testing of variations of the data set until "significance" is achieved. This practice of running multiple tests under multiple conditions, to determine the one set of data that provides the required result, is so prevalent that it even has a name: "p hacking". An excellent, interactive example of how p hacking can provide what appears to be very strong evidence (p < 0.01) for two diametrically opposed findings from a single data set can be found online,5 and a small simulation illustrating the practice is sketched at the end of this passage.

The ability to find "significance" by selectively analyzing data has led some journals to abandon p values as the criterion for significance. For example, the editors of Basic and Applied Social Psychology wrote, "We believe that the p < 0.05 bar is too easy to pass and sometimes serves as an excuse for lower quality research."6 Instead, they call for "strong descriptive statistics, including effect sizes".6 It should be noted that while there are still some who believe that quantitative studies are "more rigorous" than qualitative ones, in fact both types of studies are required to provide strong causal and mechanistic evidence for a particular finding.

While findings from chemistry education research are not likely to have deadly consequences (as may be the case for some highly funded medical studies), many of us are attempting to do research on systems that we care deeply about.
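To make the p-hacking mechanics concrete, the short sketch below simulates repeated comparisons on data that contains, by construction, no real effect. It is my own illustration rather than part of the original editorial; the use of Python with numpy and scipy, and every variable name and number in it, is an assumption made only for this example.

# Hypothetical illustration (my addition, not from the editorial): simulate
# "p hacking" by testing many outcome measures on two groups that are
# identical by construction, and then keeping only the smallest p value.
# Assumes numpy and scipy; all names and numbers are invented for the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

n_students_per_group = 30   # small sample, typical of many education studies
n_outcomes_examined = 20    # e.g., exam items, subscales, demographic slices

p_values = []
for _ in range(n_outcomes_examined):
    # Both "sections" are drawn from the same distribution, so any
    # apparent difference between them is pure chance.
    section_a = rng.normal(loc=70, scale=10, size=n_students_per_group)
    section_b = rng.normal(loc=70, scale=10, size=n_students_per_group)
    _, p = stats.ttest_ind(section_a, section_b)
    p_values.append(p)

print(f"smallest of {n_outcomes_examined} p values: {min(p_values):.3f}")
print(f"chance of at least one p < 0.05 by luck alone: {1 - 0.95**n_outcomes_examined:.0%}")
# Reporting only the one "significant" comparison, without disclosing the
# other nineteen, is exactly the selective analysis described above.

Under these assumptions there is roughly a 64% chance that at least one comparison crosses the conventional 0.05 threshold even though no real effect exists, which is why preregistering the analysis, or at least correcting for multiple comparisons, matters.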





Designing and carrying out an experiment (which may take upward of a year), only to find no significance, is discouraging to say the least, and the temptation to search for significance can be strong. The inclusion of seemingly sophisticated statistical techniques may provide the illusion of rigor, but it cannot substitute for a well-thought-out experimental design, a sufficient sample, and a plan for how the data will be analyzed. Ideally, studies should be replicated, but at the very least the methods should be described thoroughly enough that other researchers can replicate the study. While there are a number of reasons why research findings might not replicate (different student populations, different cultural expectations, different approaches to teaching and learning), until we try to address some of these issues, we may be dealing with much weaker evidence about some aspects of evidence-based teaching and learning than we think. Indeed, well-designed studies that provide evidence about the conditions under which a finding does not replicate can be an important contribution to the literature, because such studies allow us to identify the conditions that actually contribute to a finding.

It should also be noted that the replication problem is not limited to medicine and psychology. A report in Nature7 indicates that while most chemists trust published reports, over 80% say they have had problems replicating literature findings, and over 60% say they have had trouble replicating their own findings.

A number of approaches to the replication problem have been proposed. For example, in medical trials a study can be preregistered so that the methodology is locked in before the experiment begins: the data to be collected and the methods of analysis are predetermined. Some disciplines have set up programs in which volunteers use the published experimental design to try to replicate the findings.8 One approach for our community may be to follow the lead of some journals and require that raw data (deidentified, of course) be included in the Supporting Information when the research is published. In this way, other researchers can access the data and use it for their own studies, whether to extend the findings or to replicate them. As mentioned earlier, we should also aim to conduct analyses using strong descriptive statistics accompanied by effect sizes and, when feasible, combine quantitative methodological approaches with qualitative ones; a minimal sketch of reporting an effect size alongside a p value appears after the closing paragraph below.

One thing is clear: our community is not immune to the problems with experimental design and data analysis that have beset biomedical and psychology studies, and we must hold ourselves to the highest of standards. In chemical education research (CER), the random variations and systematic errors described above can creep into our studies just as easily as anywhere else. Perhaps it is time to revisit the guidelines for both authors and reviewers of CER papers to be more explicit about expectations for experimental design, data analysis, and availability of all data. It is better to be aware of the problems and try to guard against them than to add to an already vast literature that may or may not be meaningful.
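As a concrete companion to the call for "strong descriptive statistics, including effect sizes", here is a minimal sketch (again my addition, not the author's method) of reporting Cohen's d alongside a t test. It assumes numpy and scipy, and the scores and names are invented for illustration.

# Hypothetical sketch (not from the editorial): report an effect size next to
# the p value. Assumes numpy and scipy; the data below are made up.
import numpy as np
from scipy import stats

def cohens_d(group_a: np.ndarray, group_b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation (unbiased sample variances)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
    return (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var)

# Example: exam scores from two course sections (invented numbers).
reformed = np.array([78, 85, 90, 72, 88, 95, 81, 79, 84, 91], dtype=float)
traditional = np.array([70, 75, 82, 68, 77, 85, 73, 71, 79, 80], dtype=float)

t_stat, p_value = stats.ttest_ind(reformed, traditional)
d = cohens_d(reformed, traditional)

print(f"t = {t_stat:.2f}, p = {p_value:.3f}, Cohen's d = {d:.2f}")
# A p value alone says little about practical importance; the effect size
# (ideally with a confidence interval) communicates the magnitude of the
# difference that a replication attempt would need to detect.

The particular statistic matters less than the habit: an effect size reported alongside the p value lets readers, and would-be replicators, judge whether a "significant" finding is educationally meaningful.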



Notes

Views expressed in this editorial are those of the author and not necessarily the views of the ACS.

Melanie Cooper is the Lappan-Phillips Professor of Science Education and Professor of Chemistry at Michigan State University. Her current research focus is the development and assessment of evidence-based curricula in order to improve the teaching and learning of chemistry within large-enrollment undergraduate chemistry courses. She was on the leadership team for the development of the Next Generation Science Standards, and was a member of the committee that developed the NRC report on Discipline-Based Education Research. Dr. Cooper is a frequent contributor to this Journal.



REFERENCES

(1) Ioannidis, J. P. A. Why Most Published Research Findings Are False. PLoS Med. 2005, 2 (8), e124.
(2) Bem, D. J. Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect. J. Pers. Soc. Psychol. 2011, 100 (3), 407−425.
(3) Ritchie, S. J.; Wiseman, R.; French, C. C. Failing the Future: Three Unsuccessful Attempts to Replicate Bem's 'Retroactive Facilitation of Recall' Effect. PLoS One 2012, 7 (3), e33423.
(4) Cuddy, A. Your Body Language May Shape Who You Are. https://www.ted.com/talks/amy_cuddy_your_body_language_shapes_who_you_are (accessed Dec 2017).
(5) Aschwanden, C. Science Isn't Broken. https://fivethirtyeight.com/features/science-isnt-broken/ (accessed Dec 2017).
(6) Trafimow, D.; Marks, M. Editorial. Basic Appl. Soc. Psychol. 2015, 37 (1), 1−2.
(7) Baker, M. 1,500 Scientists Lift the Lid on Reproducibility. Nature 2016, 533 (7604), 452−454.
(8) Open Science Collaboration. Estimating the Reproducibility of Psychological Science. Science 2015, 349 (6251), aac4716. DOI: 10.1126/science.aac4716.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].
ORCID

Melanie M. Cooper: 0000-0002-7050-8649

DOI: 10.1021/acs.jchemed.7b00907 J. Chem. Educ. 2018, 95, 1−2