Biological Importance and Statistical Significance David P. Lovell J. Agric. Food Chem., Just Accepted Manuscript • DOI: 10.1021/jf401124y • Publication Date (Web): 02 Aug 2013 Downloaded from http://pubs.acs.org on August 13, 2013
Journal of Agricultural and Food Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Journal of Agricultural and Food Chemistry
Biological Importance and Statistical Significance David P. Lovell
St George’s, University of London, Cranmer Terrace, London SW17 0RE
Tel: +44(0)20–8725–5363 Fax: +44(0)20–8725–2993 Email: [email protected]
Funding: No funding source applicable.
Potential conflict of interests: The author was a member of the EFSA Scientific Committee from 2008 to 2012 and chair of an EFSA Working Group on Statistical Significance and Biological Relevance. All comments in this paper are his own opinions and do not represent the views of EFSA.
The author received travel and accommodation support from the International Life Sciences Institute (ILSI) to present topics discussed in this paper at the 2012 ILSI International Food Biotechnology Committee Plant Composition Workshop.
ACS Paragon Plus Environment
02/08/2013
ABSTRACT
Statistical ideas behind the analysis of experiments related to crop composition and the genetic factors underlying composition are discussed. The emphasis is on concepts rather than statistical formulations. Statistical analysis and biological considerations are shown to be complementary rather than contradictory, in that the statistical analysis of a dataset depends on the experimental design, that no amount of statistical sophistication can rescue a badly designed study, and that good experimental design is crucial. The traditional null hypothesis significance testing approach has severe limitations, but p values and statistical significance still often seem to be the primary objective of an analysis. Emphasis instead should be on identifying the size of effects that are biologically important and, with the involvement of the “domain” scientist, using these to help design experiments with appropriate sample sizes and statistical power. The issues discussed here are also directly applicable to other areas of research.
KEYWORDS: statistical significance, experimental design, power
INTRODUCTION
The International Life Sciences Institute’s (ILSI) International Food Biotechnology Committee (IFBiC) held its Plant Composition Workshop in September 2012. This date was close to the 50th anniversary of the death of R.A. Fisher in Adelaide, Australia, on July 29, 1962. Fisher was both an outstanding geneticist and statistician. His wide-ranging contributions to the development of statistics, experimental design, and genetics underpin many of the topics discussed at the 2012 ILSI IFBiC workshop. Through his work at Rothamsted and later at University College London and Cambridge University, Fisher laid the foundations for many developments, including the modern theory and basis for the design of experiments and the analysis of variance (ANOVA) methodology. Fisher was one of the founders of population genetics and initiated the concept of significance testing.1,2 His contributions, such as Fisher’s exact test, appear throughout any consideration of the statistical methods used in the interpretation of composition data.
The objective of this paper is to discuss statistical concepts, some originally developed by Fisher, used in the analysis of experiments related to the composition of crops and the genetic factors that underlie their composition. This paper concentrates on concepts rather than detailed statistical formulations and equations, and is aimed at the reader who wants to know why something is done rather than how it is done. The concepts are also directly applicable to other areas of research. Two main arguments will be put forward: first, the over-riding importance of experimental design that precedes statistical analysis; and second, the limitations of the concept of statistical significance and the mistaken equating of it with biological relevance or importance.
Research on the composition of crops and the genetic factors that underlie their composition encompasses a wide range of objectives, from the assessment of agricultural field trials to the safety evaluation of food derived from modern technologies. One of the objectives of this paper is to lay out some of the statistical issues related to these studies so that, even if a statistical analysis is unable to provide a “perfect solution,” an appreciation of the statistical issues can help as part of what could be termed a weight of evidence approach. Statistical considerations are a crucial aspect of an evidence-based approach.3
Some of the discussion here is based on work carried out by the Statistical Working Group of the European Food Safety Authority (EFSA) Scientific Committee to help EFSA scientific panels and committees in their assessment of biologically relevant effects.4 An opinion produced by this group was primarily intended for EFSA experts and staff, with the objective of clarifying the main concepts and definitions associated with statistical significance and biological relevance. It was considered that this may also be useful for risk managers and risk communicators in general. EFSA has also produced a number of other documents in which experimental design and statistical analysis issues are discussed in areas ranging from field design to 90-day toxicology studies.5–7
EXPERIMENTAL DESIGN
The concepts of statistical significance and hypothesis testing dominate many scientists’ thinking about the statistical analysis of experimental data almost to the exclusion and detriment of other aspects of statistical analysis. A primary question is “What is the purpose or objective of a statistical analysis?” In many discussions, a statistical analysis and a statistician’s role are equated with the finding of statistical significance using, for example, one of the various statistical methods to test a null hypothesis of no difference between a treated group and a control group. However, the statistician has a far more important role in the design of studies.
In practice, the design of the experiment can be considered the strategy and the statistical analyses used the tactics. Statistical thinking is therefore important at the strategic experimental design stage, and the analysis is secondary to, and dependent on, good experimental design. Statistical tests rely on the implicit assumption that the design is correct. No amount of statistical testing will rescue results obtained in a poorly designed or non-designed study.
Researchers are frequently exhorted to consult a statistician before starting an experiment. The following quote from Fisher remains apposite: “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.”8
The following quotation from Price and Underhill provides an example of a study design for composition research: “Studies are usually with the GM plant variety and its parent. Several field locations are selected, representing the area where the GM plant is expected to be grown. Within a location, growing plots of land are allocated to the experiment. Usually blocks of two or more plots are formed to ensure that the treated and control plants are grown under the same conditions.”9
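The blocked layout described in the quotation can be sketched in a few lines of Python. This is a minimal illustration only: the random allocation of varieties to plots within each block is an assumption based on the standard randomized complete block design rather than something stated in the quotation, and the function name, block count, and seed are hypothetical.

```python
import random

def randomized_complete_block(treatments, n_blocks, seed=None):
    """Return, for each block, a random ordering of all treatments.

    Each treatment appears exactly once per block, so treatments are
    always compared under the same within-block conditions.
    """
    rng = random.Random(seed)        # seeded generator for a reproducible layout
    layout = []
    for _ in range(n_blocks):
        plots = list(treatments)     # one plot per treatment in this block
        rng.shuffle(plots)           # assumed: random plot order within the block
        layout.append(plots)
    return layout

# Hypothetical example: GM variety and its parent, four blocks at one location.
layout = randomized_complete_block(["GM", "parent"], n_blocks=4, seed=1)
for block_number, plots in enumerate(layout, start=1):
    print(f"block {block_number}: {plots}")
```

Because every block contains both the GM variety and its parent, differences between blocks (soil, drainage, exposure) affect both varieties equally and drop out of the comparison.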
Experimental design in the agricultural sciences has been heavily influenced by Fisher, who first developed the factorial design: experiments where two or more factors are investigated in all possible combinations (based on Mendelian genetic concepts).10,11 Fisher and co-workers developed a range of experimental designs, such as the randomized complete block, the split plot, the Latin Square, and the factorial designs in a series of influential books and papers. Some of these designs are, to some extent, contrary to the conventional teaching of varying just one factor while keeping all the others constant, called the one factor at a time (OFAT) approach.12
The factorial approach has developed into the field of design of experiments (DOE) methodology. Fisher demonstrated that DOE approaches such as the factorial design (where there is systematic and simultaneous variation of experimental conditions, as in the classical field trials involving nitrogen, phosphorus, and potassium fertilizers) are both economically and scientifically more efficient than the traditional OFAT approach. They can identify the multiple factors (and the interactions between them) that affect results appreciably. The DOE methodology further identifies the levels of these factors that optimize results, and minimizes the number of independent studies needed and the experimental units required. This approach is ideal for topics such as protocol development, in which there may be a number of factors that can affect the experimental results. The DOE approach is now widely used in industrial settings, such as the manufacturing and chemical industries, to identify the optimal conditions under which processes should operate, and to reduce the number of runs needed to identify these. DOE methodology could, for instance, be an efficient approach to identifying optimum allocation or use of resources in studies in which there are a number of different factors where conditions could be varied. The advantages of the factorial approach are, unfortunately, still not widely appreciated in many research areas. The agricultural basis for many experimental designs can be seen in the use of terms such as blocks and plots.
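The contrast between the factorial and OFAT approaches can be made concrete with a small sketch. The example below assumes three fertilizer factors at two levels each (the level names "none" and "applied" are hypothetical) and simply enumerates the treatment combinations each approach would run; it is an illustration of the idea, not a reconstruction of any particular trial.

```python
from itertools import product

# Assumed factors and levels, echoing the nitrogen/phosphorus/potassium trials.
factors = {
    "nitrogen": ["none", "applied"],
    "phosphorus": ["none", "applied"],
    "potassium": ["none", "applied"],
}

# Full factorial: every level of every factor in all possible combinations.
factorial_runs = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]

# OFAT: a baseline run, then change one factor at a time from the baseline.
baseline = {name: levels[0] for name, levels in factors.items()}
ofat_runs = [baseline]
for name, levels in factors.items():
    for level in levels[1:]:
        ofat_runs.append({**baseline, name: level})

def runs_with(runs, **settings):
    """Count the runs in which every named factor has the given level."""
    return sum(all(run[k] == v for k, v in settings.items()) for run in runs)

print("factorial runs:", len(factorial_runs))
print("OFAT runs:", len(ofat_runs))
print("factorial runs with nitrogen applied:",
      runs_with(factorial_runs, nitrogen="applied"))
print("OFAT runs with nitrogen applied:",
      runs_with(ofat_runs, nitrogen="applied"))
print("OFAT runs with nitrogen and phosphorus both applied:",
      runs_with(ofat_runs, nitrogen="applied", phosphorus="applied"))
```

In the factorial layout every run contributes to the estimate of every main effect (half the runs have nitrogen applied and half do not), whereas the OFAT layout compares a single run against the baseline for each factor and never observes two factors applied together, so it cannot estimate interactions.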
HYPOTHESIS TESTING AND STATISTICAL SIGNIFICANCE
The concepts of statistical significance and hypothesis testing unfortunately dominate many scientists’ thinking about the statistical analysis of their data, almost to the exclusion and detriment of other aspects of statistical analysis. The two concepts are different, and Cox13 distinguished between statistical significance testing (the Fisherian approach) and hypothesis testing (the Neyman-Pearson approach).
One of Fisher’s best-known legacies is the significance test, which, together with reports of the p value and asterisks, adorns many tables of results. Fisher envisaged a test of significance that then provided him with evidence on which to base further work.11 The test assessed whether a single model fitted the data. There was no alternative hypothesis. The use of the statistical test for hypothesis testing derives from the work of Jerzy Neyman and Egon Pearson in the late 1920s and early 1930s.14
Statistical significance later became a concept associated with the use of a specific statistical test to test a null hypothesis of no difference between two groups, such as a treated group and a control group. This has been referred to as the null hypothesis significance testing (NHST) approach.15 The concepts are (or should be) familiar to many: Type I and Type II errors (with probabilities alpha and beta), power, and significance levels. The familiar 2 × 2 table can be used to illustrate the NHST approach. However, understanding the NHST approach is difficult in part because it is a mixture of two different concepts. It was (and remains) deeply controversial.
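The 2 × 2 table is not reproduced in the text. A conventional layout, offered here as an assumption about what is meant rather than as the author's own table, crosses the true state of the null hypothesis $H_0$ with the decision taken:

```latex
\[
\begin{array}{l|cc}
 & H_0 \text{ true} & H_0 \text{ false} \\ \hline
\text{Reject } H_0 & \text{Type I error } (\alpha) & \text{correct decision } (1-\beta) \\
\text{Do not reject } H_0 & \text{correct decision } (1-\alpha) & \text{Type II error } (\beta)
\end{array}
\]
\[
\alpha = P(\text{reject } H_0 \mid H_0 \text{ true}), \qquad
\beta = P(\text{do not reject } H_0 \mid H_0 \text{ false}), \qquad
\text{power} = 1 - \beta .
\]
```

The significance level is the value chosen for $\alpha$ in advance; power depends on $\alpha$, the sample size, the variability, and the size of the true difference.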
The standard NHST approach is a hybrid of the Fisher and Neyman-Pearson processes.16 Fisher was initially interested in testing a single null hypothesis and not in the alternative hypothesis. This test of the null hypothesis was not particularly important and was to be used as a guide rather than a decision-making process. There was no alternative hypothesis. Neyman and Pearson wanted to compare two hypotheses that relate to the concept of the null and alternative hypotheses. This is particularly relevant for quality control–type investigations.
Gigerenzer provides a detailed discussion of some of the issues around the null hypothesis and statistical significance testing.17 He discusses the long-running (and often acrimonious) debate between Fisher and Neyman and Pearson over the context of significance testing, pointing to how Fisher had initiated the significance level but moved in later life to take a more nuanced view. On the other hand, Neyman and Pearson maintained their interest in the use of the decision rule approach.
In the Neyman-Pearson method, the only criterion is whether the null hypothesis is rejected. It is a binary decision with the critical value for a test statistic equivalent to, say, p=0.05 as the criterion. The result is then declared statistically significant at p=0.05 and the alternative hypothesis is accepted in contrast to the null hypothesis. The Fisherian approach is to report the exact p value and use this value only to gauge whether the null hypothesis has been rejected and not whether another hypothesis has been accepted.
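The difference between the two styles of reporting can be illustrated with a short sketch. It assumes a simple two-sided z test with a known standard error, which is a simplification chosen for illustration; the function names and the example estimates are hypothetical.

```python
from statistics import NormalDist

def two_sided_p(estimate, standard_error):
    """Exact two-sided p value for a z test of 'true difference = 0'."""
    z = abs(estimate / standard_error)
    return 2 * (1 - NormalDist().cdf(z))

def neyman_pearson_decision(p, alpha=0.05):
    """Binary decision rule: only 'reject' or 'do not reject' is reported."""
    return "reject H0" if p < alpha else "do not reject H0"

# Hypothetical estimated differences, each with a standard error of 1.0.
for estimate in (1.9, 2.0, 4.0):
    p = two_sided_p(estimate, standard_error=1.0)
    print(f"estimate={estimate:.1f}  exact p={p:.5f}  "
          f"decision: {neyman_pearson_decision(p)}")
```

Under the decision rule, the estimates of 2.0 and 4.0 lead to the same outcome even though their exact p values differ by orders of magnitude, while 1.9 and 2.0 fall on opposite sides of the criterion despite being almost identical. Reporting the exact p value preserves that information.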
Note that in the NHST approach, the test statistic and its associated p value depend on a number of factors such as sample size, statistical test used, and amount of variability. This means that the actual size of difference that is just significant (i.e., reaches the critical value of the test statistic) will vary from study to study. In addition, each experiment is, in effect, one from a population of possible experiments and is thus only an estimate (with a distribution) of the true difference. By “bad luck,” the actual experiment performed can give an estimate in one of the tails and the study would be reported as “not significant” even when there is a real difference (a Type II error).
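The dependence of the just-significant difference on sample size and variability can be sketched numerically. The example below assumes two groups of equal size, a known common standard deviation, and a two-sided z test (a normal approximation used here for simplicity); the standard deviation of 10 and true difference of 5 are hypothetical values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()

def just_significant_difference(sd, n_per_group, alpha=0.05):
    """Smallest observed difference between two group means that reaches
    the critical value (assumes known sd and a two-sided z test)."""
    standard_error = sd * sqrt(2 / n_per_group)
    return STD_NORMAL.inv_cdf(1 - alpha / 2) * standard_error

def power(true_difference, sd, n_per_group, alpha=0.05):
    """Probability of a significant result given the true difference."""
    standard_error = sd * sqrt(2 / n_per_group)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = true_difference / standard_error
    # Significant if the estimate lands in either rejection tail.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

for n in (5, 20, 80):
    threshold = just_significant_difference(sd=10.0, n_per_group=n)
    type_ii = 1 - power(true_difference=5.0, sd=10.0, n_per_group=n)
    print(f"n per group={n:3d}  just-significant difference={threshold:5.2f}  "
          f"Type II error rate for a true difference of 5: {type_ii:.2f}")
```

With the same variability, a larger study declares smaller differences significant and is less likely to miss a real difference, which is why the size of a "significant" difference is not comparable across studies without reference to their designs.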
CRITICISM OF THE USE OF SIGNIFICANCE
Some think that producing statements about the presence or absence of statistical significance is the role of the professional statistician. Protocols often include language stating that a level of probability < […] > 0.05 are often reported as nonsignificant or not significant, although Altman in particular does not recommend this.19 The choice of the three levels relates to the tables for the three levels of 0.05, 0.01, and 0.001 in the statistical tables produced by Fisher and Yates.20
Many authors now argue against this convention. Yates accidentally introduced this “star” nomenclature but abhorred, and vehemently opposed, its subsequent unthinking use. In 1937, Yates produced a monograph on factorial experiments that became the standard work of reference on the ANOVA.21 Asterisks were used to indicate successive footnotes, but were interpreted by adherents as a new standard notation. Yates regretted that this gave too much emphasis to significance tests, and criticized some scientific research workers for paying undue attention to them.22
The term significant has a number of meanings, some of which are technical and others less so. In 2008, Nature published a list of disputed definitions that included the word “significance.”23 In regard to this list, Reese noted that “Too many scientists — and editors — take the line you reproach and use statistical significance as a criterion of importance.”24
Salsburg provides an interesting discussion on how the word significant changed its meaning:
“The word was used in its late-nineteenth-century meaning, which is simply that the computation signified or showed something. As the English language entered the twentieth century, the word significant began to take on other meanings, until it took on its current meaning, implying something very important. … Unfortunately, those who use statistical analysis often treat a significant test statistic as implying something closer to the modern meaning of the word.”25(p98)
Many statisticians abhor the use of the NHST and the “cult of the p value,” yet it seems to have taken root as part of the basic requirements that the regulator and the referee/journal editors require. Attaining a p value