Biological Importance and Statistical Significance - ACS Publications


Article

Biological Importance and Statistical Significance David P. Lovell J. Agric. Food Chem., Just Accepted Manuscript • DOI: 10.1021/jf401124y • Publication Date (Web): 02 Aug 2013 Downloaded from http://pubs.acs.org on August 13, 2013


Journal of Agricultural and Food Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Biological Importance and Statistical Significance David P. Lovell

St George’s, University of London, Cranmer Terrace, London SW17 0RE Tel: +44(0)20–8725–5363 Fax: +44(0)20–8725–2993 Email: [email protected]

Funding: No funding source applicable. Potential conflict of interests: The author was a member of the EFSA Scientific Committee from 2008 to 2012 and chair of an EFSA Working Group on Statistical Significance and Biological Relevance. All comments in this paper are his own opinions and do not represent the views of EFSA.

The author received travel and accommodation support from the International Life Sciences Institute (ILSI) to present topics discussed in this paper at the 2012 ILSI International Food Biotechnology Committee Plant Composition Workshop.


ABSTRACT

Statistical ideas behind the analysis of experiments related to crop composition and the genetic factors underlying composition are discussed. The emphasis is on concepts rather than statistical formulations. Statistical analysis and biological considerations are shown to be complementary rather than contradictory, in that the statistical analysis of a dataset depends on the experimental design, that no amount of statistical sophistication can rescue a badly designed study, and good experimental design is crucial. The traditional null hypothesis significance testing approach has severe limitations but p values and statistical significance still often seem to be the primary objective of an analysis. Emphasis instead should be on identifying the size of effects that are biologically important and, with the involvement of the “domain” scientist, using these to help design experiments with appropriate sample sizes and statistical power. The issues discussed here are also directly applicable to other areas of research.

KEYWORDS: statistical significance, experimental design, power


INTRODUCTION

The International Life Sciences Institute’s (ILSI) International Food Biotechnology Committee (IFBiC) held its Plant Composition Workshop in September 2012. This date was close to the 50th anniversary of the death of R.A. Fisher in Adelaide, Australia, on July 29, 1962. Fisher was both an outstanding geneticist and statistician. His wide-ranging contributions to the development of statistics, experimental design, and genetics underpin many of the topics discussed at the 2012 ILSI IFBiC workshop. Through his work at Rothamsted and later at University College London and Cambridge University, Fisher laid the foundations for many developments, including the modern theory and basis for the design of experiments and the analysis of variance (ANOVA) methodology. Fisher was one of the founders of population genetics and initiated the concept of significance testing.1,2 His contributions, such as Fisher’s exact test, appear throughout any consideration of the statistical methods used in the interpretation of composition data.

The objective of this paper is to discuss statistical concepts, some originally developed by Fisher, used in the analysis of experiments related to the composition of crops and the genetic factors that underlie their composition. This paper concentrates on concepts rather than detailed statistical formulations and equations, and is aimed at the reader who wants to know why something is done rather than how it is done. The concepts are also directly applicable to other areas of research. Two main arguments will be put forward: first, the over-riding importance of experimental design that precedes statistical analysis; and second, the limitations of the concept of statistical significance and the mistaken equating of it with biological relevance or importance.

Investigations in the area of research on the composition of crops and the genetic factors that underlie their composition encompass a wide range of objectives, from the assessment of agricultural field trials to the safety evaluation of food derived from modern technologies. One of the objectives of this paper is to lay out some of the statistical issues related to these studies so that even if a statistical analysis is unable to provide a “perfect solution,” an appreciation of the statistical issues can help as part of what could be termed a weight of evidence approach. Statistical considerations are a crucial aspect of an evidence-based approach.3

Some of the discussion here is based on work carried out by the Statistical Working Group of the European Food Safety Authority (EFSA) Scientific Committee to help EFSA scientific panels and committees in their assessment of biologically relevant effects.4 An opinion produced by this group was primarily intended for EFSA experts and staff, with the objective of clarifying the main concepts and definitions associated with statistical significance and biological relevance. It was considered that this may also be useful for risk managers and risk communicators in general. EFSA has also produced a number of other documents in which experimental design and statistical analysis issues are discussed in areas ranging from field design to 90-day toxicology studies.5–7

EXPERIMENTAL DESIGN

The concepts of statistical significance and hypothesis testing dominate many scientists’ thinking about the statistical analysis of experimental data almost to the exclusion and detriment of other aspects of statistical analysis. A primary question is “What is the purpose or objective of a statistical analysis?” In many discussions, a statistical analysis and a statistician's role are equated with the finding of statistical significance using, for example, one of the various statistical methods to test a null hypothesis of no difference between a treated group and a control group. However, the statistician has a far more important role in the design of studies.

In practice, the design of the experiment can be considered the strategy and the statistical analyses used the tactics. Statistical thinking is consequently important at the strategic experimental design stage, and the analysis is secondary to, and dependent on, good experimental design. Statistical tests rely on the implicit assumption that the design is correct. No amount of statistical testing will rescue results obtained in a poorly designed or non-designed study.

Researchers are frequently exhorted to consult a statistician before starting an experiment. The following quote from Fisher remains apposite: “To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.”8

The following quotation from Price and Underhill provides an example of a study design for composition research: “Studies are usually with the GM plant variety and its parent. Several field locations are selected, representing the area where the GM plant is expected to be grown. Within a location, growing plots of land are allocated to the experiment. Usually blocks of two or more plots are formed to ensure that the treated and control plants are grown under the same conditions.”9
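The blocked layout described in the quotation can be sketched in a few lines of Python. This is only an illustration of a randomized complete block allocation under assumed settings: the function name, the number of blocks, and the treatment labels are hypothetical and are not taken from Price and Underhill.

```python
import random

def randomized_complete_blocks(treatments, n_blocks, seed=0):
    """Return one random plot order per block; every block holds each treatment once."""
    rng = random.Random(seed)  # fixed seed so the illustrative layout is reproducible
    layout = []
    for _ in range(n_blocks):
        plots = list(treatments)
        rng.shuffle(plots)  # independent randomization within each block
        layout.append(plots)
    return layout

# Hypothetical example: GM variety and its parent, four blocks at one location.
layout = randomized_complete_blocks(["GM", "parent"], n_blocks=4)
for block_number, plots in enumerate(layout, start=1):
    print(f"block {block_number}: {plots}")
```

Because each block contains both varieties, comparisons between them are made within a block, where growing conditions are expected to be similar.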

Experimental design in the agricultural sciences has been heavily influenced by Fisher, who first developed the factorial design: experiments where two or more factors are investigated in all possible combinations (based on Mendelian genetic concepts).10,11 Fisher and co-workers developed a range of experimental designs, such as the randomized complete block, the split plot, the Latin Square, and the factorial designs in a series of influential books and papers. Some of these designs are, to some extent, contrary to much of conventional teaching of varying just one factor and keeping all the others constant, called the one factor at a time (OFAT) approach.12

The factorial approach has developed into the field of design of experiments (DOE) methodology. Fisher demonstrated that these DOE approaches such as the factorial design (where there is systematic and simultaneous variation of experimental conditions such as the classical field trials involving nitrogen, phosphorus and potassium fertilizers) are both economically and scientifically more efficient than the traditional OFAT approach. They can identify the multiple factors (and the interactions between them) that affect results appreciably. The DOE methodology further identifies the levels of these factors that optimize results as well as minimizes the number of independent studies needed and the experimental units required. This approach is ideal for topics such as protocol development, in which there may be a number of factors that can affect the experimental results. The DOE approach is now widely used in industrial settings, such as the manufacturing and chemical industries, to identify optimal conditions under which to operate processes, and to reduce the number of runs needed to identify these. DOE methodology could, for instance, be an efficient approach to identifying optimum allocation or use of resources in studies in which there are a number of different factors where conditions could be varied. The advantages of the factorial approach are, unfortunately, still not widely appreciated in many research areas. The agricultural basis for many experimental designs can be seen in the use of terms such as blocks and plots.
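The contrast between the factorial and OFAT approaches can be illustrated with a small Python sketch of a two-level factorial design in nitrogen (N), phosphorus (P), and potassium (K). The response function and all of its coefficients are invented purely for illustration; they are not data from any field trial.

```python
from itertools import product

# All 2^3 = 8 combinations of low (0) and high (1) levels of N, P, and K.
design = list(product([0, 1], repeat=3))

def hypothetical_yield(n, p, k):
    """Made-up response surface in which N and P interact (illustration only)."""
    return 10 + 2 * n + 1 * p + 0.5 * k + 3 * n * p

def mean_yield(runs):
    values = [hypothetical_yield(*run) for run in runs]
    return sum(values) / len(values)

def main_effect(index):
    """Average response at the high level minus average at the low level of one factor."""
    high = [run for run in design if run[index] == 1]
    low = [run for run in design if run[index] == 0]
    return mean_yield(high) - mean_yield(low)

def n_effect_at_p(p_level):
    """Effect of N estimated only from runs with P fixed at the given level."""
    high = [run for run in design if run[0] == 1 and run[1] == p_level]
    low = [run for run in design if run[0] == 0 and run[1] == p_level]
    return mean_yield(high) - mean_yield(low)

# Interaction: half the change in the N effect between high and low P.
np_interaction = (n_effect_at_p(1) - n_effect_at_p(0)) / 2

# An OFAT experiment that varies N with P and K held low sees only this effect.
ofat_n_effect = hypothetical_yield(1, 0, 0) - hypothetical_yield(0, 0, 0)

print("factorial main effect of N:", main_effect(0))
print("N x P interaction:", np_interaction)
print("OFAT estimate of N effect:", ofat_n_effect)
```

In this invented example every one of the eight runs contributes to each main effect estimate, and the N by P interaction is estimable, whereas the OFAT comparison describes the effect of N only at the low level of P.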


HYPOTHESIS TESTING AND STATISTICAL SIGNIFICANCE

The concept of statistical significance and hypothesis testing unfortunately dominates many scientists’ thinking about the statistical analysis of their data, almost to the exclusion and detriment of other aspects of statistical analysis. The two concepts are different, and Cox13 distinguished between statistical significance testing (the Fisherian approach) and hypothesis testing (the Neyman-Pearson approach).

One of Fisher’s most well-known legacies is the significance test, which, together with reports of the p value and asterisks, adorns many tables of results. Fisher envisaged a test of significance that then provided him with evidence on which to base further work.11 The test was based on a test of whether a single model fitted the data. There was no alternative hypothesis. The use of the statistical test for hypothesis testing derives from the work of Jerzy Neyman and Egon Pearson in the late 1920s and early 1930s.14

Statistical significance later became a concept associated with the use of a specific statistical test to test a null hypothesis of no difference between two groups such as a treated group and a control group. This has been referred to as the null hypothesis significance testing (NHST) approach.15 The concepts are (or should be) familiar to many: Type I and Type II (alpha and beta) errors, power, and significance levels. The familiar 2 × 2 table can be used to illustrate the NHST approach. However, understanding the NHST approach is difficult in part because it is a mixture of two different concepts. It was (and remains) deeply controversial.
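The relationship between the significance level (alpha), the Type II error rate (beta), power, and sample size can be sketched numerically. The Python example below rests on a simplifying assumption that is not part of the text: a two-sided comparison of two group means using a z test with a known common standard deviation. The function names and all numeric inputs are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def power_two_sample(diff, sd, n_per_group, alpha=0.05):
    """Power (1 - beta) of a two-sided two-sample z test, assuming known common sd."""
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference in means
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = diff / se  # true difference in standard-error units
    upper_tail = 1 - STANDARD_NORMAL.cdf(z_crit - shift)
    lower_tail = STANDARD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail

def n_per_group_for_power(diff, sd, power=0.8, alpha=0.05):
    """Usual normal-approximation sample size per group for a target power."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STANDARD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / diff) ** 2)

# Hypothetical inputs: a difference of 1 unit judged important, sd of 2 units.
n_needed = n_per_group_for_power(diff=1.0, sd=2.0)
print("n per group for 80% power:", n_needed)
print("power with that n:", round(power_two_sample(1.0, 2.0, n_needed), 3))
print("power with n = 10:", round(power_two_sample(1.0, 2.0, 10), 3))
```

When the true difference is zero the function returns alpha, the Type I error rate; for a nonzero difference, one minus the returned value is beta, the Type II error rate.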

The standard NHST approach is a hybrid of the Fisher and Neyman-Pearson processes.16 Fisher was initially interested in testing a single null hypothesis and not in the alternative hypothesis. This test of the null hypothesis was not particularly important and was to be used as a guide rather than a decision-making process. There was no alternative hypothesis. Neyman and Pearson wanted to compare two hypotheses that relate to the concept of the null and alternative hypotheses. This is particularly relevant for quality control–type investigations.

Gigerenzer provides a detailed discussion of some of the issues around the null hypothesis and statistical significance testing.17 He discusses the long-running (and often acrimonious) debate between Fisher and Neyman and Pearson over the context of significance testing, pointing to how Fisher had initiated the significance level but moved in later life to take a more nuanced view. On the other hand, Neyman and Pearson maintained their interest in the use of the decision rule approach.

In the Neyman-Pearson method, the only criterion is whether the hypothesis is rejected. It is a binary decision with the critical value for a test statistic equivalent to, say, p=0.05 as the criterion. The result is then declared statistically significant at p=0.05 and the alternative hypothesis is accepted in contrast to the null hypothesis. The Fisherian approach is to report the exact p value and use this value only to gauge whether the null hypothesis has been rejected and not whether another hypothesis has been accepted.
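The difference between the two ways of reporting can be shown with a short Python sketch. It assumes, purely for illustration, a test statistic that follows a standard normal distribution under the null hypothesis; the function names and the example statistics are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def exact_p(z):
    """Fisherian style: report the exact two-sided p value for the observed statistic."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def neyman_pearson_decision(z, alpha=0.05):
    """Neyman-Pearson style: a binary decision against a pre-set critical value."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return "reject H0" if abs(z) >= z_crit else "do not reject H0"

# Hypothetical test statistics: two fall close to the critical value, one far beyond it.
for z in (1.90, 2.00, 3.50):
    print(f"z = {z:.2f}: p = {exact_p(z):.4f}, decision: {neyman_pearson_decision(z)}")
```

The binary rule separates the first two statistics even though their p values are close, and it gives the last two the same label even though the strength of evidence differs considerably.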

Note that in the NHST approach, the test statistic and its associated p value depend on a number of factors such as sample size, statistical test used, and amount of variability. This means that the actual size of difference that is just significant (i.e., reaches the critical value of the test statistic) will vary from study to study. In addition, each experiment is, in effect, one from a population of possible experiments and is thus only an estimate (with a distribution) of the true difference. By “bad luck,” the actual experiment performed can give an estimate in one of the tails and the study would be reported as “not significant” even when there is a real difference (a Type II error).
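The dependence of the p value on sample size and variability can be made concrete with a small Python sketch. As a simplification that the text does not make, it uses a two-sample z test with a known common standard deviation; the function names and the numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def two_sided_p(diff, sd, n_per_group):
    """p value for an observed difference in means, assuming known common sd."""
    se = sd * sqrt(2 / n_per_group)
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(diff / se)))

def just_significant_difference(sd, n_per_group, alpha=0.05):
    """Smallest observed difference that reaches the critical value at level alpha."""
    z_crit = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return z_crit * sd * sqrt(2 / n_per_group)

# The same observed difference (1 unit, sd 2 units) in studies of different size.
for n in (5, 20, 50):
    p = two_sided_p(diff=1.0, sd=2.0, n_per_group=n)
    threshold = just_significant_difference(sd=2.0, n_per_group=n)
    print(f"n = {n:2d}: p = {p:.3f}, just-significant difference = {threshold:.2f}")
```

An identical observed difference is "not significant" in the smallest hypothetical study and "significant" in the largest, which is why the p value alone says little about whether an effect is biologically important.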


CRITICISM OF THE USE OF SIGNIFICANCE


Some think that producing statements about the presence or absence of statistical significance is the role of the professional statistician. Protocols often include language stating that a level of probability […] 0.05 are often reported as nonsignificant or not significant, although Altman in particular does not recommend this.19 The choice of the three levels relates to the tables for the three levels of 0.05, 0.01, and 0.001 in the statistical tables produced by Fisher and Yates.20

Many authors now argue against this convention. Yates accidentally introduced this “star” nomenclature but abhorred its subsequent use and vehemently opposed its unthinking use. In 1937, Yates produced a monograph on factorial experiments that became the standard work of reference on the ANOVA.21 Asterisks were used to indicate successive footnotes, but were interpreted by adherents as a new standard notation. Yates regretted that this gave too much emphasis to the significance test, and criticized some scientific research workers for paying undue attention to them.22

The term significant has a number of meanings, some of which are technical and others less so. In 2008, Nature published a list of disputed definitions that included the word “significance.”23 In regard to this list, Reese noted that “Too many scientists — and editors — take the line you reproach and use statistical significance as a criterion of importance.”24

Salsburg provides an interesting discussion on how the word significant changed its meaning:

“The word was used in its late-nineteenth-century meaning, which is simply that the computation signified or showed something. As the English language entered the twentieth century, the word significant began to take on other meanings, until it took on its current meaning, implying something very important. … Unfortunately, those who use statistical analysis often treat a significant test statistic as implying something closer to the modern meaning of the word.”25(p98)

Many statisticians abhor the use of the NHST and the “cult of the p value,” yet it seems to have taken root as part of the basic requirements that the regulator and the referee/journal editors require. Attaining a p value