
Correspondence/Rebuttal

Safety assessments and multiplicity adjustment: comments on a recent paper
Hilko van der Voet
J. Agric. Food Chem., Just Accepted Manuscript • DOI: 10.1021/acs.jafc.7b03686 • Publication Date (Web): 18 Feb 2018
Downloaded from http://pubs.acs.org on February 20, 2018


Journal of Agricultural and Food Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Hilko van der Voet, Wageningen University & Research

Many endpoints representing possible non-target effects have to be evaluated in safety assessments that compare new and accepted products. This is known as the multiple-comparison or multiplicity problem. One popular method to adjust statistical testing procedures for multiplicity is the False Discovery Rate (FDR) method,¹ often implemented via adjustment of p values, as for example provided in SAS procedure MULTTEST. FDR-adjusted p values are obtained by multiplication of the raw p values with factors between 1 and $m$, where $m$ is the number of tested hypotheses. Let $p_{(1)} \le \dots \le p_{(m)}$ be the ordered p values for $m$ endpoints. Then the FDR-adjusted p values according to a linear step-up algorithm are sequentially calculated as $\tilde{p}_{(m)} = p_{(m)}$; $\tilde{p}_{(i)} = \min\left(\tilde{p}_{(i+1)}, \frac{m}{i}\, p_{(i)}\right)$ for $i = m-1, \dots, 1$.
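The linear step-up recursion described above can be sketched in a few lines of Python. This is an illustrative implementation only, not the SAS MULTTEST code; the function name `fdr_adjust` and the example p values are hypothetical and chosen purely for demonstration.

```python
def fdr_adjust(p_values):
    """Linear step-up (Benjamini-Hochberg) adjusted p values, in input order."""
    m = len(p_values)
    # Positions of the raw p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value (rank m) down to the smallest (rank 1),
    # multiplying by m/rank and never letting the adjusted value increase.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values for five endpoints (illustration only).
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
for p, q in zip(raw, fdr_adjust(raw)):
    print(f"raw p = {p:.3f}  adjusted p = {q:.5f}")
```

In this toy example the smallest p value is multiplied by the full factor 5, while the largest is left unchanged, which matches the statement that the multiplication factors range between 1 and the number of tested hypotheses.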

Recently, Hong et al.² published an evaluation of the European Food Safety Authority (EFSA) framework for safety assessment of GM crops using a rat 90-day feeding study,³ which is a compulsory part of the safety assessment according to current EU legislation.⁴ The appropriateness of these animal studies and of the EFSA framework on how to conduct them are both under discussion. For example, the EU research project GRACE (http://www.grace-fp7.eu) has performed and evaluated four 90-day studies and one 1-year study contributing to this discussion (see Schmidt et al.⁵ and references therein). Another currently ongoing EU research project is G-TwYST (https://www.g-twyst.eu), which is evaluating two 90-day studies and one combined chronic/carcinogenicity (2-year) study. Hong et al.² also assessed the appropriateness and applicability of the EFSA recommendations using a 90-day study and a battery of statistical approaches including retrospective and prospective power analyses. This short Correspondence is not the place to give a full appraisal of all aspects of this discussion. The discussion here is restricted to just one element of the statistical approach used, which is the treatment of the multiplicity due to many endpoints. Hong et al. evaluated a very large number of endpoints, and adjusted the p values of their tests according to the FDR method. The maximum number of endpoints for each of the sexes was $m = 146$, so FDR-adjusted p values are between 1 and 146 times as large as the raw p values.ᵃ The main result of Hong et al. regarding the comparisons between test and control groups is that ‘no treatment-related differences were observed’. This can be contrasted with the detailed comparisons in Appendix D of the paper, where 32 out of 816 of the 95% confidence intervals for observed differences do not contain the value 0, and therefore indicate significant differences in an unadjusted test. Note that this rate of significant results (3.9%) is close to the 5% rate of false positives expected under a null hypothesis of equality for all endpoints, and therefore in itself is not a reason for concern about safety. However, the reported absence of any statistically significant difference should be seen as the direct consequence of using the FDR adjustment. Clearly, with no ‘discovery’ at all in this set of results, the false discovery rate is zero by definition, and in this respect the methodology can be said to have operated very effectively. In summary, FDR adjustment is not a minor detail, but is a main factor that determines the test results.
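The arithmetic behind these statements can be checked with a short Python sketch. It uses only the numbers quoted in the text (32 of 816 intervals, a 5% level, and at most 146 endpoints per set); the helper `bh_threshold` is a hypothetical name, and treating 146 as the size of a single adjusted set is an assumption, since the exact set sizes are not given.

```python
ALPHA = 0.05            # nominal significance level of the unadjusted tests
M = 146                 # maximum number of endpoints per sex (assumed set size)
N_INTERVALS = 816       # 95% confidence intervals reported in Appendix D
N_EXCLUDING_ZERO = 32   # intervals that do not contain the value 0

# Share of unadjusted comparisons that would be called significant.
observed_rate = N_EXCLUDING_ZERO / N_INTERVALS
# Number of false positives expected if every endpoint were truly equal.
expected_count = ALPHA * N_INTERVALS


def bh_threshold(rank, m=M, alpha=ALPHA):
    """Step-up critical value for the rank-th smallest raw p value."""
    return alpha * rank / m


print(f"observed rate of unadjusted significant results: {observed_rate:.1%}")
print(f"expected false positives under a global null:    {expected_count:.1f}")
print(f"critical value for the smallest p value:         {bh_threshold(1):.6f}")
print(f"critical value for the largest p value:          {bh_threshold(M):.6f}")
```

Under these assumptions, a single endpoint standing out on its own would need a raw p value of roughly 0.0003 to survive the adjustment, which illustrates why the adjusted analysis can report no differences even though 32 unadjusted intervals exclude zero.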

I have two serious concerns about the methodology in the paper by Hong et al.

First, the use of standard FDR correction or any other multiple-testing scheme makes no sense in food safety testing. It controls false discoveries, and is therefore connected to difference testing, where the null hypothesis is equality of means and false positives are considered the error of the first kind: you want to have a small probability of erroneously reporting a difference. This is useful in studies that set out to find differences between groups, perhaps to find new explanations for biological phenomena or effective treatments. However, in the context of safety or equivalence testing, the purpose is to demonstrate safety with a chosen confidence level. Therefore the statistical hypotheses are reversed: the null hypothesis is that some difference exists and we want to show equivalence by rejecting such a null hypothesis (some possible approaches allowing for endpoints with widely different variation have been described⁶⁻⁹). In equivalence testing, the errors of the first kind are the false negatives rather than the false positives, and controlling them guarantees a small probability of erroneously reporting equivalence. Consequently, the commonly used methods for multiplicity correction, including FDR, address the wrong type of error and should not be used in safety assessments.
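The reversal of hypotheses can be made concrete with a minimal Python sketch that contrasts a difference test with an equivalence test based on two one-sided tests. This is a generic textbook construction under a normal approximation, not the specific methods of the cited references; the function names, the equivalence margin, and all numerical inputs are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def difference_test(diff, se, alpha=0.05):
    """Two-sided z test. Null hypothesis: the true difference is zero."""
    p = 2.0 * (1.0 - STD_NORMAL.cdf(abs(diff) / se))
    return p, p < alpha


def equivalence_test(diff, se, margin, alpha=0.05):
    """Two one-sided z tests. Null hypothesis: |true difference| >= margin."""
    # Reject 'difference <= -margin' when the estimate lies well above -margin.
    p_lower = 1.0 - STD_NORMAL.cdf((diff + margin) / se)
    # Reject 'difference >= +margin' when the estimate lies well below +margin.
    p_upper = STD_NORMAL.cdf((diff - margin) / se)
    # Equivalence is claimed only if both one-sided nulls are rejected.
    p = max(p_lower, p_upper)
    return p, p < alpha


# Hypothetical endpoint: observed difference 0.2, assumed margin 1.0.
for se in (0.5, 0.3):
    p_diff, different = difference_test(0.2, se)
    p_equiv, equivalent = equivalence_test(0.2, se, margin=1.0)
    print(f"se={se}: difference p={p_diff:.3f} ({different}), "
          f"equivalence p={p_equiv:.4f} ({equivalent})")
```

With the larger standard error the endpoint is neither significantly different nor demonstrably equivalent, so a non-significant difference test on its own does not establish equivalence. Inflating the difference-test p values by a multiplicity adjustment leaves the equivalence p value untouched.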

ᵃ FDR adjustment was performed for the set of all endpoints that were reported across sexes, and separately for the male- or female-specific comparisons, so $m$ may have been lower in practice, but exact values are not given.

Secondly, contrary to the tests used to report results in Hong et al., the statistical power analyses in the same paper (both prospective and retrospective) do not use FDR adjustments. Therefore, the results of these power analyses cannot be interpreted as statements about the statistical power obtained using the FDR-adjusted tests. Clearly, the statistical power for any endpoint separately is much lower than stated (because the p values are adjusted upward). The potential danger of the paper is the message that its approach would be an appropriate procedure, because (1) a high power of the difference tests for the proposed effect sizes seems to be attained, and at the same time (2) not a single statistically significant difference is obtained. But the statistical approaches followed for (1) and (2) (without and with FDR correction, respectively) are inconsistent. It is misleading to present FDR-adjusted test results together with power analyses which do not incorporate these adjustments.
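The size of the gap between unadjusted and adjusted power can be illustrated with a rough Python sketch. It assumes a two-sided z test and uses the most stringent step-up threshold (the nominal level divided by the number of endpoints) as a bounding case that applies when no other endpoint is rejected; the actual power under FDR adjustment depends on the whole set of p values. The function name and the standardized effect of 2.8 are hypothetical and are not taken from Hong et al.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_power(standardized_effect, alpha):
    """Power of a two-sided z test for a true effect given in standard-error units."""
    critical = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Probability that the test statistic falls in either rejection region.
    upper = 1.0 - STD_NORMAL.cdf(critical - standardized_effect)
    lower = STD_NORMAL.cdf(-critical - standardized_effect)
    return upper + lower


ALPHA = 0.05
M = 146          # maximum number of endpoints per sex, as quoted in the text
EFFECT = 2.8     # hypothetical effect giving about 80% unadjusted power

unadjusted = two_sided_power(EFFECT, ALPHA)
# Bounding case: the endpoint must pass the smallest step-up threshold alone.
adjusted_bound = two_sided_power(EFFECT, ALPHA / M)

print(f"power at alpha = {ALPHA}:            {unadjusted:.2f}")
print(f"power at alpha/m = {ALPHA / M:.6f}:   {adjusted_bound:.2f}")
```

Under these assumptions, a design with about 80% nominal power retains only a fraction of that power for an endpoint that has to clear the strictest adjusted threshold, which is the inconsistency described above.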

As an additional point, Hong et al. also claim that FDR adjustments would be endorsed by EFSA. Whereas EFSA in its guidance³ did acknowledge the multiplicity problem (‘the issue of multiple testing […] should be addressed’), it has not endorsed FDR adjustment. Instead, EFSA³ leaves the matter to the statistical analyst (‘Any methods used to adjust for multiplicity should also be clearly documented and referenced’). EFSA¹⁰ had already concluded on this matter: ‘FDR as usually applied (i.e. in a context of difference testing) is a property of the subset of endpoints for which a significant difference has been found. It does not address the endpoints for which no significance has been found and therefore FDR applied to difference testing does not seem sufficient as a measure in GMO risk assessment. It could be of interest to adapt the FDR concept for equivalence testing, i.e. for a situation where hypotheses are reversed, but the GMO Panel is not aware that this has yet been done.’ By now, alternative methods for multiple or multivariate equivalence testing for safety evaluations have been proposed,¹¹⁻¹⁴ which are currently under debate.

References

1. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Statist. Soc. B 1995, 57, 289–300.
2. Hong, B.; Du, Y.Z.; Mukerji, P.; Roper, J.M.; Appenzeller, L.M. Safety assessment of food and feed from GM crops in Europe: Evaluating EFSA’s alternative framework for the rat 90-day feeding study. J. Agric. Food Chem. 2017, 65, 5545–5560.
3. EFSA. Guidance on conducting repeated-dose 90-day oral toxicity study in rodents on whole food/feed. EFSA Journal 2011, 9, 2438.
4. EC Commission. Implementing Regulation (EU) No 503/2013 of 3 April 2013 on applications for authorisation of GM food and feed in accordance with Regulation (EC) No 1829/2003 of the European Parliament and of the Council and amending Commission Regulations (EC) No 641/2004 and (EC) No 1981/2006. Off. J. Eur. Communities 2013, L157, 1–52.
5. Schmidt, K.; Schmidtke, J.; Schmidt, P.; Kohl, C.; Wilhelm, R.; Schiemann, J.; van der Voet, H.; Steinberg, P. Variability of control data and relevance of observed group differences in five oral toxicity studies with genetically modified maize MON810 in rats. Arch. Toxicol. 2017, 91, 1977–2006. https://dx.doi.org/10.1007/s00204-016-1857-x
6. van der Voet, H.; Perry, J.N.; Amzal, B.; Paoletti, C. A statistical assessment of differences and equivalences between genetically modified and reference plant varieties. BMC Biotechnol. 2011, 11, 15.
7. Meyners, M. Equivalence tests – a review. Food Qual. Prefer. 2012, 26, 231–245.
8. Kang, Q.; Vahl, C.I. Statistical analysis in the safety evaluation of genetically-modified crops: equivalence tests. Crop Sci. 2014, 54, 2183–2200.
9. van der Voet, H.; Goedhart, P.W.; Schmidt, K. Equivalence testing using existing reference data: an example with genetically modified and conventional crops in animal feeding studies. Food and Chemical Toxicology 2017, 109, 472–485. https://doi.org/10.1016/j.fct.2017.09.044
10. EFSA. Statistical considerations for the safety evaluation of GMOs. EFSA Journal 2010, 8, 1250.
11. Qiu, J.; Cui, X.Q. Evaluation of a statistical equivalence test applied to microarray data. J. Biopharm. Statist. 2010, 20, 240–266.
12. van Dijk, J.P.; Souza de Mello, C.; Voorhuijzen, M.M.; Hutten, R.C.B.; Maisonnave Arisi, A.C.; Jansen, J.J.; Buydens, L.M.C.; van der Voet, H.; Kok, E.J. Safety assessment of plant varieties using transcriptomics profiling and a one-class classifier. Regulatory Toxicology and Pharmacology 2014, 70, 297–303. http://dx.doi.org/10.1016/j.yrtph.2014.07.013
13. Pallmann, P.; Jaki, T. Simultaneous confidence regions for multivariate bioequivalence. Statistics in Medicine 2017, 36, 4585–4603.
14. Vahl, C.I.; Kang, Q. Statistical strategies for multiple testing in the safety evaluation of a genetically modified crop. Journal of Agricultural Science 2017, 155, 812–831.