
Negative Consequences of Using α = 0.05 for Environmental Monitoring Decisions: A Case Study from a Decade of Canada's Environmental Effects Monitoring Program

Joseph F. Mudge,*,† Timothy J. Barrett,† Kelly R. Munkittrick,*,†,‡ and Jeff E. Houlahan†

†Department of Biology, University of New Brunswick at Saint John, 100 Tucker Park Road, Saint John, New Brunswick, Canada E2L 4L5
‡Canadian Water Network, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1

ABSTRACT: Using the traditional α = 0.05 significance level for null hypothesis significance tests makes assumptions about the relative costs of Type I vs relevant Type II errors and inflates their combined probabilities. We examined the results of 1254 monitoring tests conducted under the Canadian Environmental Effects Monitoring (EEM) program from 1992 to 2003, focusing on how the choice of α affected the relative probabilities and implied costs of Type I and Type II errors. Using α = 0.05 resulted in implied relative costs of Type I vs Type II errors that were both inconsistent among monitoring end points and inconsistent with the philosophy of the monitoring program. Using α = 0.05 also resulted in combined Type I and Type II error rates that were 15−17% larger than those for "optimal" α levels set to minimize Type I and Type II errors for each study, and 12% of all monitoring tests would have reached opposite conclusions had they used these optimal α levels for decision-making. Thus, if the Canadian EEM program used study-specific optimal α levels, it would reduce the incidence of relevant errors and eliminate inconsistent implied relative costs of these errors. Environmental research and monitoring programs using α = 0.05 as a decision-making threshold should re-evaluate the usefulness of this "one-size-fits-all" approach.



INTRODUCTION
A Better Alternative to α = 0.05. Although null hypothesis testing has been the dominant statistical approach in science for the last several decades, many have criticized the rationale behind comparing a p-value from an observed test statistic to a standard (albeit arbitrary) significance level (usually α = 0.05).1−8 Major problems with a single "one-size-fits-all" significance criterion are that it is often used without considering the quality of the study design and data, it usually ignores critical information about the statistical power of the design, and it makes implicit and often unrealistic assumptions about critical effect sizes. This frequently results in null hypotheses with high probabilities of not being rejected despite the existence of real and relevant effects, and in null hypotheses with high probabilities of rejection even when the real effects are trivial. Few researchers recognize that the choice of a significance level inherently implies a relative cost estimate for Type I and Type II statistical errors, one related to (but not equal to) the relative probabilities of Type I and Type II errors at the critical effect size.8 When a priori decisions about study design are made, the most common design sets α = 0.05 and β = 0.20 (power = 0.80), loosely implying that Type I errors are more costly than Type II errors. An ideal, defensible significance level for a particular null hypothesis test should consider the relative costs as well as the probabilities of both Type I errors and Type II errors at scientifically relevant effect sizes.

The optimal significance level (optimal α) can be defined as the Type I error probability that minimizes the combined probabilities or costs of wrong conclusions for any particular hypothesis test. An optimal α can be calculated for any null hypothesis test by finding the significance level associated with the minimum average of α and β (at a predetermined critical effect size and achieved sample size), with this average weighted by the a priori relative probabilities of the null and alternate hypotheses (if known) and the relative cost of Type I and Type II errors. Thus, setting an optimal α allows researchers to draw conclusions with the highest possible degree of confidence by minimizing the probabilities (or costs) of committing statistical errors.8 Optimal α provides unambiguous estimates of the "best" α as long as there is agreement on the appropriate critical effect size and the relative costs of Type I and II errors. This makes the approach suitable for use in post hoc analyses to re-evaluate the conclusions of hypothesis tests that used α = 0.05. We do this by comparing test outcomes, average error rates, and implied relative costs of Type I and II errors using standard α versus optimal α.
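Stated compactly, and restating the verbal definition above in our own notation (this paraphrases our reading of ref 8 rather than quoting it; the prior-probability weighting is shown for completeness), the optimal α minimizes a weighted average of the two error probabilities:

    \omega_c(\alpha) = \frac{P(H_0)\, C_{I/II}\, \alpha + P(H_A)\, \beta(\alpha)}
                            {P(H_0)\, C_{I/II} + P(H_A)},
    \qquad
    \alpha_{opt} = \arg\min_{0 < \alpha < 1} \omega_c(\alpha)

where β(α) is evaluated at the critical effect size and the achieved sample size. With equal priors and equal error costs (CI/II = 1), this reduces to minimizing ω = (α + β)/2, the form used throughout this paper.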

Optimal α Re-analysis of the Canadian Environmental Effects Monitoring Program. The Canadian Environmental Effects Monitoring (EEM) program is a standardized national monitoring program that provides a large data set of statistical conclusions, one that lends itself well to re-analysis using the optimal α approach. The EEM program was implemented in Canada for the pulp and paper industry in the early 1990s and for the metal mining industry in the late 1990s as a mandatory, nationally regulated program.9,10 EEM requires industry to fund and conduct studies to assess potential impacts in aquatic receiving environments, by monitoring effects on fish, fish habitat (benthic invertebrates), and fish contamination levels, even when effluents are in compliance with their discharge limits. An adult fish population survey is a major component of the EEM program; the survey assesses differences in five effect end points (condition, relative gonad weight, relative liver weight, weight-at-age, and age) in both sexes of two sentinel fish species collected at an effluent-exposed site and at a minimum of one reference site. The pulp and paper EEM program follows a tiered monitoring approach, with fish population surveys conducted every three years and results from previous monitoring used to make decisions for future cycles of monitoring.

An effect in EEM has been defined as a difference between reference and exposure sampling locations in any of the measured effect end points that is statistically significant at α = 0.05. For changes in the level of monitoring, this statistical effect must also exceed a predefined critical effect size and be confirmed (in the same direction) in the next cycle of monitoring. When an effect has been statistically detected, exceeds the critical effect size, and is confirmed, industry enters either "focused monitoring" to understand the extent and magnitude of the effect or, more often, begins "investigation of cause" studies to determine whether the effluent is the cause of the observed difference. If no statistically significant effects greater than the critical effect size are observed and confirmed, then industry can skip a cycle of monitoring, conducting a fish population survey again in six years. As such, there is a real economic cost to decisions that use a standard α: if there are no statistically significant effects larger than the critical effect size for two consecutive cycles, the facility "skips" a monitoring cycle (reducing monitoring costs), but a statistically significant effect larger than the critical effect size results in ongoing "investigation of cause" studies or monitoring and their associated costs. The critical effect sizes have been defined in EEM as a percent difference in the measured end point (10% for condition and 25% for gonad size and liver size) between the exposure and reference site, relative to the reference site.11

The pulp and paper EEM data set is suitable for re-analysis using the optimal α technique for five reasons. First, it is a large data set: cycles 1−3 (1992−1996, 1997−2000, and 2000−2003) of the Canadian pulp and paper EEM program comprise over 1200 hypothesis tests of different end points for fish of different species at over 100 pulp and paper mills across Canada. Second, it has a simple, consistent study design: most facilities follow a fixed upstream−downstream, control−impact design, and in the first three cycles most mills used the same design (although fish species and sample sizes sometimes changed). Third, it uses established critical effect sizes: the program has set critical effect sizes to be used (±25% for most fish end points; ±10% for fish condition). Fourth, it attempts to put equal weight on Type I and Type II errors (for cycles two and three) by requiring that an a priori study design be submitted that calculates the intended number of samples using existing estimates of variability for that site and established critical effect sizes for gonad size differences, with α set equal to β. Fifth, it uses test results to guide future decisions: the iterative program sets the following cycle's study design based on the results of tests in the previous cycle(s), such that there is an actual financial cost to decisions and associated costs of Type I and II errors.

Objectives. The choice of a significance level is central to decision-making in the EEM program, but it is unclear how the reliance on a consistent but arbitrary α = 0.05 significance level has affected environmental management decisions in this monitoring program. An optimal α re-analysis of the EEM data set allows us to compare the results of over 1200 hypothesis tests using the traditional α = 0.05 and the optimal α approach: to evaluate the magnitude of the difference in the a priori average of Type I and Type II error, the frequency with which inferences reached using an optimal α approach differ from those reached using α = 0.05, and whether the implied cost ratios of Type I/Type II errors associated with decisions resulting from null hypothesis tests at α = 0.05 reflect a greater emphasis toward minimizing Type I errors (resulting in unnecessary additional monitoring and management operations) or Type II errors (resulting in unmitigated environmental damage).
MATERIALS AND METHODS
EEM Monitoring Tests Used for Optimal α Re-analysis. We used the end points of fish condition, relative gonad size, and relative liver size, which have been identified as the most important response variables in the pulp and paper EEM program. Data were used from the first three cycles of monitoring, as most mills used a consistent study design during this period. A total of 115, 64, and 62 EEM pulp and paper mills completed an adult fish population survey in cycles one, two, and three, respectively. In practice, the results from the first cycle of monitoring were used primarily to identify study design challenges, to identify ways to improve sampling protocols, and to obtain estimates of variability in the data (to determine the required sample sizes for future monitoring to detect critical effect sizes).12 The results from cycles two and three were used to identify effects, and the results from all three cycles are summarized in ref 13. Statistical analyses were conducted for each mill after separating data by species and sex, assessing unusual observations, and removing nonspawning and immature fish (only adult fish data were used). Each mill had data for up to four mature species−sex combinations, depending on the success of the field surveys. Fish survey data were considered reliable when there were at least 12 adult fish measured at both the reference and exposed sites.13 The effect end points were analyzed using analysis of covariance (ANCOVA). Condition was analyzed using body weight as the response and body length as a covariate; relative gonad weight and relative liver weight were analyzed using organ weight as the response and body weight as a covariate. Length and weight data for all analyses were log10 transformed to improve linearity of the data and to homogenize the variances. Analyses of covariance were performed using the regression approach to ANCOVA. Comparisons with nonhomogeneous regression slopes (33 of 1287 tests) were removed and not considered for analysis. This left a total of 1254 tests that were suitable for re-analysis using the optimal α approach.
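As a concrete illustration of this analysis pathway, the following R sketch runs the condition ANCOVA on simulated data (the data, column names, and effect size are invented for illustration; this is not the EEM analysis code):

    set.seed(1)
    fish <- data.frame(
      length = rlnorm(48, meanlog = 3, sdlog = 0.1),
      site = factor(rep(c("reference", "exposure"), each = 24),
                    levels = c("reference", "exposure"))
    )
    # Body weight scales roughly with length^3; exposure fish ~5% lighter at a given length
    fish$weight <- fish$length^3 * exp(rnorm(48, 0, 0.05)) *
      ifelse(fish$site == "exposure", 0.95, 1)
    fish$logW <- log10(fish$weight)   # log10 transforms improve linearity
    fish$logL <- log10(fish$length)   # and help homogenize variances

    slopes <- lm(logW ~ logL * site, data = fish)   # slope-homogeneity check first
    ancova <- lm(logW ~ logL + site, data = fish)   # regression approach to ANCOVA
    anova(slopes)["logL:site", ]                    # significant interaction -> test excluded
    summary(ancova)$coefficients["siteexposure", ]  # site effect: the condition difference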

Calculation of Optimal α Levels. The optimal α level for each test is defined as the α level associated with the minimum value of ω, the average of α and the associated β for the critical effect size at the achieved sample size: ω = (α + β)/2. Optimal α values for each test were determined through iterative examination of the change in ω over α values from 0 to 1.8 This analysis was completed using R version 2.13.14 Optimal α calculations require an estimate of β, and β was calculated from the noncentral t distribution15 because an ANCOVA test of a main effect with two groups (reference and exposure) is equivalent to a two-sample independent t test on the residuals from the regression (when regression slopes are parallel), less one degree of freedom for the estimation of the common regression slope.16

Calculation of Implied Relative Costs of Type I vs Type II Error. Implied relative costs of Type I vs Type II error can be calculated for each test by assuming that α = 0.05 is the optimal α for that test. Setting α = 0.05 for all tests implies a different cost ratio of Type I/Type II errors (CI/II) for each test if sample sizes and estimated variability differ but the critical effect size is held constant, such that α = 0.05 is the optimal α that minimizes ωc, the average of the relative cost-weighted probabilities of errors for the null hypothesis and the alternate hypothesis at the critical effect size: ωc = (CI/II × α + β)/(CI/II + 1). We calculated the cost ratio of Type I/Type II errors implied by using α = 0.05 for each of the tests through an iterative examination of optimal α over a range of Type I/Type II error cost ratios, to determine the cost ratio at which the optimal α level is equal to 0.05 for that test. This analysis was completed using R version 2.13.14

Comparison of Results Using Optimal α and Results Using α = 0.05. Analyses were summarized separately for each end point because variability differed among the condition, gonad size, and liver size end points, and because the critical effect size used for fish condition in the EEM program is different from that for liver and gonad size. For each end point, we compared the proportion of tests for which a significant result was obtained between α = 0.05 and the optimal α to determine whether one approach was more likely to produce significant results. We also calculated the proportion of hypothesis tests that would have resulted in a different outcome using an optimal α than the outcome reached using α = 0.05, and we determined the number of cases in which the new outcome would have potentially resulted in a different management decision (i.e., if the observed effect size was above the critical effect size for newly significant results). This determines the extent to which environmental management decisions would have changed if EEM hypothesis tests had been conducted to minimize the combined probabilities of Type I and Type II errors. We also calculated the median, first and third quartiles, and minimum and maximum α, β, and a priori average of Type I and Type II error for each end point using α = 0.05 and optimal α, to show how much α and β levels differed between the two approaches. Next, we calculated the percent reduction in the average of Type I and Type II errors at the critical effect size that resulted from switching to an optimal α for each test (the difference between the ω for α = 0.05 and the ω for the optimal α, relative to the ω for α = 0.05, multiplied by 100), summarized for each end point to provide a measure of the magnitude of improvement in strength of inferences that can be achieved by switching to the optimal α approach. We also calculated the relative costs of Type I vs Type II errors that were implied for each test by using α = 0.05, summarized for each end point, to determine how consistent these implied relative costs of Type I vs Type II error were among end points and how much the implied relative costs deviated from equality.
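The two calculations just described can be sketched in a few lines of R. This is a minimal reimplementation under the stated assumptions (two-sided test on ANCOVA residuals with df = n1 + n2 − 3, standardized critical effect size d, equal priors), with function names of our own choosing, not the original EEM analysis code:

    # beta at each candidate alpha, from the noncentral t distribution
    beta_at <- function(alpha, d, n1, n2) {
      df <- n1 + n2 - 3
      ncp <- d * sqrt(n1 * n2 / (n1 + n2))    # noncentrality at the critical effect
      crit <- qt(1 - alpha / 2, df)
      pt(crit, df, ncp) - pt(-crit, df, ncp)  # P(fail to reject | critical effect)
    }
    # Optimal alpha: minimize the cost-weighted error average
    # omega_c = (C * alpha + beta) / (C + 1); C = 1 gives omega = (alpha + beta) / 2
    optimal_alpha <- function(d, n1, n2, C = 1, a = seq(1e-4, 0.9999, 1e-4)) {
      b <- beta_at(a, d, n1, n2)
      i <- which.min((C * a + b) / (C + 1))
      c(alpha = a[i], beta = b[i], omega = (C * a[i] + b[i]) / (C + 1))
    }
    # Implied Type I/Type II cost ratio: the C at which alpha = 0.05 is optimal
    implied_cost_ratio <- function(d, n1, n2, target = 0.05) {
      C <- exp(seq(log(0.01), log(100), length.out = 500))
      opt <- sapply(C, function(ci) optimal_alpha(d, n1, n2, ci)["alpha"])
      C[which.min(abs(opt - target))]
    }
    optimal_alpha(d = 0.8, n1 = 12, n2 = 12)       # study-specific optimal alpha
    implied_cost_ratio(d = 0.8, n1 = 12, n2 = 12)  # cost ratio implied by using 0.05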



RESULTS
Differences in α and β between Standard and Optimal α Approaches. As expected, α levels associated with the optimal α approach deviated from the standard α = 0.05 (Table 1). For tests of fish condition, 68.1% of optimal α levels were smaller than 0.05, but for tests of gonad size and liver size only 41.4% and 41.2% of optimal α levels were smaller than 0.05, respectively. Using α = 0.05, 11.6%, 39.0%, and 31.9% of tests failed to achieve 80% power to detect differences as large as the critical effect size for condition, gonad size, and liver size tests, respectively. The trade-off between α and β levels always resulted in lower ω levels (the average of Type I and Type II error at the critical effect size) for the optimal α level than for the standard α = 0.05. In particular, levels of ω reached much lower minimum values using optimal α for each end point than the minimum value of 0.025 that can be reached using α = 0.05.

Table 1. Distributions of α, β, and ω (Average of Type I and Type II Errors) for Fish Condition, Gonad Size, and Liver Size End Points, Using Either Standard α = 0.05 or Optimal α

                         minimum       1st quartile  median  3rd quartile  maximum
condition (N = 439)
  standard   α           0.05          0.05          0.05    0.05          0.05
             β           1.5 × 10−12   0.002         0.018   0.082         0.772
             ω           0.025         0.026         0.034   0.066         0.411
  optimal    α           6.6 × 10−5    0.013         0.030   0.062         0.262
             β           7.3 × 10−7    0.012         0.030   0.067         0.450
             ω           3.3 × 10−5    0.012         0.030   0.065         0.356
gonad size (N = 385)
  standard   α           0.05          0.05          0.05    0.05          0.05
             β           7.5 × 10−15   0.011         0.111   0.389         0.910
             ω           0.025         0.031         0.081   0.220         0.480
  optimal    α           6.6 × 10−5    0.024         0.073   0.155         0.309
             β           6.5 × 10−8    0.025         0.079   0.201         0.611
             ω           3.3 × 10−5    0.024         0.076   0.177         0.460
liver size (N = 430)
  standard   α           0.05          0.05          0.05    0.05          0.05
             β           1.4 × 10−8    0.013         0.092   0.255         0.843
             ω           0.025         0.032         0.071   0.153         0.447
  optimal    α           2.1 × 10−4    0.027         0.066   0.118         0.287
             β           1.9 × 10−4    0.027         0.073   0.140         0.520
             ω           2.0 × 10−4    0.027         0.069   0.129         0.404


Table 2. Number of Tests Re-analyzed Using the Optimal α Approach for the Condition, Gonad Size, and Liver Size End Points, with the Number of Significant Tests (at Both Standard and Optimal α Levels), the Number of Tests with a Different Outcome between Standard and Optimal α Levels, and the Breakdown of Those Tests into Newly Significant and Newly Nonsignificant Tests (CES = critical effect size)

                                                  condition  gonad size  liver size
total number of tests                             439        385         430
tests significant using α = 0.05                  193        135         189
tests significant using optimal α                 166        157         195
tests with a different outcome using optimal α    53         56          39
newly significant tests using optimal α           13         39          23
newly significant tests > CES                     1          12          4
newly nonsignificant tests using optimal α        40         17          16
newly nonsignificant tests < CES                  40         17          16

Differences in Decisions Made between Standard and Optimal α Approaches. Of the 1254 EEM hypothesis tests examined across the three end points, 517 were significant using the standard α = 0.05 and 518 were significant using the optimal α level (Table 2). Although there were similar numbers of significant tests for standard and optimal α levels, 12%, 15%, and 9% of tests resulted in a different conclusion using optimal α rather than α = 0.05 for condition, gonad size, and liver size tests, respectively. Of the tests with different conclusions, 25% (fish condition), 70% (gonad size), and 59% (liver size) switched from nonsignificant to significant using the optimal α level. Of the tests that became newly significant using the optimal α level, 8% (fish condition), 31% (gonad size), and 17% (liver size) had observed effect sizes above the established critical effect size, meaning that a different management decision would have been made. All of the tests that became newly nonsignificant under the optimal α approach had observed effect sizes below the established critical effect size, regardless of end point.

Reductions in Average of Type I and Type II Error from Using Optimal α. The median percent reduction in the average of Type I and Type II error at the critical effect size resulting from using the optimal α rather than α = 0.05 was 15% for liver size, 16% for fish condition, and 17% for gonad size (Figure 1). While percent reductions in error approaching 100% were sometimes observed for all end points, large reductions in error were more common for fish condition than

for gonad size or liver size. No improvement in the average of Type I and Type II error occurred in the rare cases where the optimal α level was equal to the standard α = 0.05.

Figure 1. Distribution of % reduction in ω (average of Type I and Type II error at the critical effect size) resulting from using optimal α instead of α = 0.05 for fish condition (n = 439), gonad size (n = 385), and liver size (n = 430) end points. Box plots display the median, first and third quartiles, and minimum and maximum values for each end point.

Implied Relative Costs of Type I vs Type II Error Associated with α = 0.05. The median implied cost ratio of Type I/Type II errors at which α = 0.05 is the optimal α that minimizes the cost of errors was 0.40 for fish condition, 1.59 for liver size, and 1.51 for gonad size (Figure 2). Individual tests that resulted in large implied error cost ratios were tests that had high variability and lower sample sizes, while tests that had low variability and high sample sizes resulted in the lowest implied Type I/Type II error cost ratios.

Figure 2. Distribution of the implied cost ratios of Type I errors to Type II errors, assuming α = 0.05 was the optimal α chosen to minimize the average relative cost-weighted probabilities of errors for the null hypothesis and alternate hypothesis at the critical effect size for fish condition (n = 439), gonad size (n = 385), and liver size (n = 430) end points. Box plots display the median, first and third quartiles, and minimum and maximum values for each end point.



DISCUSSION
Use of α = 0.05 Resulted in Unnecessarily High Error Rates and Inconsistent Implied Error Costs. The EEM protocols have been designed to balance Type I and II errors, and even here the use of optimal α resulted in significant reductions in the probability of making an error. This is because (1) optimal α is not the α level where α and β are equal; rather, it is the level where the average of α and β is minimized (note that a difference between α and β under a particular optimal α does not imply that one error is worse than the other; it results because we want to minimize our probability of making either error, and the α that minimizes that probability is rarely where α = β) and (2) even with excellent experimental design

the precise optimal α cannot be calculated before the data are collected, because realized sample size and variability often differ from intended sample size and predicted variability. This demonstrates that even where a balance between Type I and II error has been carefully targeted during experimental design, optimal α can provide additional benefits in decision-making.

One notable finding was that, while the reduction in the average of Type I and Type II error at the critical effect size was very similar across all end points, for liver and gonad size using α = 0.05 increased the probability of Type II error relative to Type I error (i.e., optimal α should generally have been larger than 0.05), while the reverse was true for fish condition. This is a consequence of the difference in variability among end points. Higher variability in gonad and liver size increases the probability of Type II error and results in the average of Type I and Type II errors typically being minimized at a higher optimal α level. Using a standard α level (α = 0.05) that is lower than the optimal α level results in implied cost ratios of Type I/Type II errors greater than 1, as they typically were for liver size (1.59) and gonad size (1.51), meaning Type I errors are implied to be approximately 1.5 times more important than Type II errors for these end points. The low variability in the fish condition end point decreases the probability of Type II error and resulted in the average of Type I and Type II errors typically being minimized at smaller optimal α levels. Using a standard α level (0.05) that is higher than the optimal α level results in implied cost ratios of Type I/Type II errors smaller than 1, as they typically were for fish condition (0.40), meaning Type II errors are implied to be approximately 2.5 times more important than Type I errors for this end point. Thus the current end points used for decision-making in the EEM protocol imply very different assumptions about the relative costs of Type I and II errors, and none of the three implied cost ratios were close to 1, the implied cost ratio that is most consistent with the EEM philosophy that Type I and II errors are equally problematic.

Setting Appropriate Relative Costs of Type I vs Type II Error. As long as critical effect sizes are defined as the magnitude of effect that would result in important impacts on fish populations, it is difficult to formulate a rationale that would support different Type I/II cost ratios simply because the end point changed. We suggest that where multiple end points are used, different (post sample collection) optimal α levels should be set for each end point such that the implied costs of Type I and Type II errors are consistent among end points. This raises the difficult question "What is an appropriate Type I/II cost ratio?", which is research question-specific, will always require expert subject knowledge to properly evaluate, and is beyond the scope of our paper. However, there are two reasonable rules of thumb that can facilitate the estimation of Type I/II error cost ratios: first, when using multiple end points, α should be set so that the Type I/II cost ratios are consistent across end points; and second, if there is no clear rationale (economic, social, environmental, or health) for setting the costs of one type of error higher than the other, the relative costs of Type I and Type II error should be assumed to be equal.
Unequal costs of Type I and Type II errors are implied whenever we set α at a level that does not minimize the average of α and β. Any well-considered decision about the appropriate α level for a particular study must include the relative costs of Type I and Type II errors, as the choice of α determines the associated β. Hinds17 suggested setting α = β in study design for environmental monitoring, such that the risk to industry (Type I error: the hypothesis test concludes there is an effect when there is none in the population) is set equal to the risk to the environment (Type II error: the hypothesis test concludes there is no effect when there is in fact an effect in the population). However, equalizing risks to industry and to the environment does not minimize risks; this approach favors "fairness" between industry and the environment over having the lowest overall risk. Even if α and β are set equal in study design, the observed β from the data will differ from the defined α because achieved sample sizes and observed variability differ from those intended. Minimizing overall risks to both industry and the environment would require setting α at the level that minimizes the average of α and β (i.e., the optimal α approach), and this is rarely where α = β.
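To illustrate the distinction numerically, the following sketch (same t-test assumptions as the Materials and Methods sketch; d = 0.8 with 12 fish per site is an arbitrary design) compares the α = β solution with the α that minimizes the average of the two errors:

    # Hinds-style alpha = beta vs. alpha minimizing (alpha + beta)/2,
    # for a two-sided t test at standardized critical effect d with n per site
    d <- 0.8; n <- 12
    df <- 2 * n - 3; ncp <- d * sqrt(n / 2)
    a <- seq(1e-4, 0.9999, 1e-4)
    crit <- qt(1 - a / 2, df)
    b <- pt(crit, df, ncp) - pt(-crit, df, ncp)
    fair <- which.min(abs(a - b))   # alpha = beta ("fair") design
    opt  <- which.min((a + b) / 2)  # optimal-alpha design
    rbind(fair = c(alpha = a[fair], beta = b[fair], omega = (a[fair] + b[fair]) / 2),
          opt  = c(alpha = a[opt],  beta = b[opt],  omega = (a[opt]  + b[opt])  / 2))
    # The "opt" row attains a smaller omega, and its alpha and beta are unequal.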
Mapstone,3 Faul et al.,18 and Lipsey and Hurley7 have also suggested setting α and β proportional to the costs of Type I and Type II error. The problem with this approach is easily illustrated when the costs of Type I and Type II errors are equal. Equal costs imply that the cost of making either error is interchangeable, and if error costs are interchangeable, then error levels should be set to minimize their combined chances of occurrence (i.e., to minimize the average of α and β, as in the optimal α approach). Because the relationship between α and β is nonlinear, shifting α and β away from equality will frequently result in a smaller combination of α and β, so the minimum average of α and β is rarely where α = β. If relative costs of errors are known, the goal should be to minimize errors given their relative costs (equivalent to minimizing the overall costs of errors), rather than simply requiring error rates to be proportional to their relative costs.

If there is general agreement that the relative costs of Type I and Type II errors are equal, optimal α is simple to implement and there is no barrier to using the method. When the costs of Type I and Type II error are not equal but there is a clear rationale and consensus among subject knowledge experts on what the relative costs are, optimal α is also simple to implement. Difficulties may, however, arise when consensus concerning the relative costs of Type I and Type II error cannot be reached. We recommend that unless there is a clear rationale and consensus for unequal costs, they should be assumed equal, in favor of minimizing the combined probabilities of errors.

We believe that one of the most important contributions of optimal α may be that it forces researchers to explicitly address the issue of the relative costs of Type I and II errors. This topic has been relatively unexplored in all scientific disciplines but is of critical importance in experimental design and analysis. One explanation for the lack of attention to the relative costs of Type I and II errors is that it imposes the subjective concept of valuation onto a process that scientists prefer to think of as objective. However, this avoids the fundamental reality that all decision-making incorporates valuation. In some cases, economic estimates of (i) the costs of environmental degradation when real effects are not detected and (ii) the costs of efforts to mitigate or prevent effects that are not occurring may provide a strong foundation for quantitative estimates of the relative costs of Type I and II errors. Attempts have been made to place dollar estimates on the value of various ecosystem services, and this approach may be useful in estimating relative cost ratios.19 In some contexts, other measures of cost may have to be developed to capture social, environmental, or health costs that are not well represented by monetary estimates. Addressing the issue of relative costs of


Type I and II errors, while enormously difficult, is critical to making appropriate decisions about sustainable resource use.

Setting Appropriate Critical Effect Sizes. Setting an appropriate critical effect size was not a barrier to using the optimal α approach for EEM tests, because the program was forced to define critical effect sizes when the decision was made to use cycle one data to develop appropriate statistical designs for subsequent cycles. While interim effect sizes were put in place for cycle two, a recent review has supported the effect sizes that were developed.11 However, this may not be true of all monitoring programs, and thus identifying a critical effect size may appear to be a barrier to using the optimal α approach. We argue that monitoring programs that have not identified a critical effect size have a significant weakness in their rationale for any decisions that are made using monitoring data, and they should place a high priority on developing critical effect sizes that can be used to trigger decisions. While critical effect sizes are under development, we recommend identifying small, medium, and large effect sizes and providing optimal α levels for each of these effect sizes, so that interpretations can be made based on different potentially relevant effect sizes, along with their associated error rates. Identifying critical effect sizes is a difficult and contentious problem but should not be a barrier to using optimal α.

Benefits of the Optimal α Approach for Null Hypothesis Tests in Environmental Management. Use of the optimal α approach in hypothesis testing has the potential to reduce the frequency of relevant statistical errors in environmental monitoring without increasing the number of samples collected or altering the defined critical effect sizes. In addition, optimal α dramatically reduces the likelihood that different statistical outcomes will be exclusively due to differences in experimental design. For example, using traditional α levels, monitoring designs that provide low power may routinely "miss" effects that would be detected by designs with higher power. Thus, the probability of detecting an effect at any particular site is, at least in part, a function of experimental design, and so it is possible to manipulate the environmental monitoring process by manipulating experimental design. This is not true of optimal α: experimental designs with low power will have a correspondingly large α and those with high power a small α, eliminating the incentive for industry to conduct poor quality monitoring with high sampling error (which leads to low power to detect meaningful differences at α = 0.05) and for environmental groups to demand expensive sampling designs that will detect effects with little biological significance. Lastly, the method proposed here allows α and β to be determined after data collection but before the hypothesis test is conducted, so the actual observed variability and achieved sample sizes are used in the formulation of optimal α, resulting in a more accurate representation of error rates.
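A small sketch of the power dependence just described (same t-test assumptions as the Materials and Methods sketch; d = 0.8 and the sample sizes are arbitrary):

    # The optimal alpha shrinks as the design's power grows
    opt_alpha <- function(n, d = 0.8, a = seq(1e-4, 0.9999, 1e-4)) {
      df <- 2 * n - 3; ncp <- d * sqrt(n / 2)
      crit <- qt(1 - a / 2, df)
      beta <- pt(crit, df, ncp) - pt(-crit, df, ncp)
      a[which.min((a + beta) / 2)]
    }
    sapply(c(6, 12, 24, 48, 96), opt_alpha)  # low-n designs get a large alpha; high-n, a small one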



CONCLUSIONS
Adjustments to α levels have been shown to yield optimal environmental management decisions in situations where the prior probabilities and costs of management actions and errors are known, and under certain circumstances α levels as extreme as 0 or 1 have been shown to be most cost-effective (meaning that either no effect or an effect should sometimes be assumed, rather than sampling and statistically testing for an effect).5 Here we show that optimal environmental management decisions can also be reached by adjusting α without specific knowledge concerning prior probabilities or costs of management actions or errors. The result is that 9−15% of tests carried out in the EEM would have reached the opposite conclusion as to whether to reject or fail to reject the null hypothesis if they had used optimal α. In addition, the average implied relative cost ratio of Type I/Type II errors varied from 0.4 to 1.59 depending on the end point, and this internal inconsistency is a negative consequence of using α = 0.05. This paper highlights the importance of considering and reporting estimates of critical effect sizes, relative costs of Type I and Type II errors, and prior probabilities of null and alternate hypotheses for environmental research. However, we believe that setting equal prior probabilities and equal costs of error is justifiable in many situations, as it avoids biasing results according to information from other sources and/or according to value judgments about the relative seriousness of Type I or Type II error.

AUTHOR INFORMATION

Corresponding Author

*Phone: (506) 663-9888 (J.F.M.); (506) 648-5825 (K.R.M.). E-mail: [email protected] (J.F.M.); [email protected] (K.R.M.).

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript. The authors are supported by NSERC Discovery and Strategic Grants, the Canadian Rivers Institute at the University of New Brunswick - Saint John, the Canadian Department of Defense, Natural Resources Canada - Canadian Forest Services, and Environment Canada. This work arose out of conversations with many attendees of the 2010 Aquatic Toxicity Workshop in Toronto, Ontario, working to address challenges in environmental monitoring and associated decision-making in Canada. Thanks to Leanne Baker for providing the photograph used in the TOC/Abstract graphic. The manuscript benefitted directly from discussions with Christopher Edge, Leanne Baker, Thijs Bosker, and Rémy Rochette at the University of New Brunswick, Saint John campus.



REFERENCES
(1) Cascio, W. F.; Zedeck, S. Opening a new window in rational research planning: adjust alpha to maximize statistical power. Pers. Psychol. 1983, 36 (3), 517−526.
(2) Cohen, J. The earth is round (p < .05). Am. Psychol. 1994, 49 (12), 997−1003.
(3) Mapstone, B. D. Scalable decision rules for environmental impact studies: Effect size, Type I and Type II errors. Ecol. Appl. 1995, 5 (2), 401−410.
(4) Johnson, D. H. The insignificance of statistical significance testing. J. Wildlife Manage. 1999, 63 (3), 763−772.
(5) Field, S. A.; Tyre, A. J.; Jonzen, N.; Rhodes, J. R.; Possingham, H. P. Minimizing the cost of environmental management decisions by optimizing statistical thresholds. Ecol. Lett. 2004, 7, 669−675.
(6) Newman, M. C. "What exactly are you inferring?" A closer look at hypothesis testing. Environ. Toxicol. Chem. 2008, 27 (5), 1013−1019.
(7) Lipsey, M. W.; Hurley, S. M. Design sensitivity: Statistical power for applied experimental research. In The SAGE Handbook of Applied Social Research Methods; Bickman, L., Rog, D. J., Eds.; SAGE Publications Inc.: CA, 2009; pp 44−76.
(8) Mudge, J. F.; Baker, L. F.; Edge, C. B.; Houlahan, J. E. Setting an optimal α that minimizes errors in null hypothesis significance tests. PLoS ONE 2012, 7 (2), e32734; DOI 10.1371/journal.pone.0032734.
(9) Munkittrick, K. R.; McGeachy, S. A.; McMaster, M. E.; Courtenay, S. C. Overview of freshwater fish studies from the pulp and paper Environmental Effects Monitoring program. Water Qual. Res. J. Can. 2002, 37 (1), 49−77.
(10) Ribey, S. C.; Munkittrick, K. R.; McMaster, M. E.; Courtenay, S.; Langlois, C.; Munger, S.; Rosaasen, A.; Whitley, G. Development of a monitoring design for examining effects in wild fish associated with discharges from metal mines. Water Qual. Res. J. Can. 2002, 37 (1), 229−249.
(11) Munkittrick, K. R.; Arens, C. J.; Lowell, R. B.; Kaminski, G. P. A review of potential methods for determining critical effect size for designing environmental monitoring programs. Environ. Toxicol. Chem. 2009, 28 (7), 1361−1371.
(12) Lowell, R. B.; Ribey, S.; Ellis, I. K.; Porter, E. L.; Culp, J. C.; Grapentine, L. C.; McMaster, M. E.; Munkittrick, K. R.; Scroggins, R. P. National Assessment of Pulp and Paper Environmental Effects Monitoring Data; NWRI Contribution No. 03-521; National Water Research Institute: Ontario, Canada, 2003.
(13) Barrett, T. J.; Lowell, R. B.; Tingley, M. A.; Munkittrick, K. R. Effects of pulp and paper mill effluent on fish: a temporal assessment of fish health across sampling cycles. Environ. Toxicol. Chem. 2010, 29 (2), 440−452.
(14) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2011; http://www.R-project.org/.
(15) Harrison, D. A.; Brady, A. R. Sample size and power calculations using the noncentral t-distribution. Stata J. 2004, 4 (2), 142−153.
(16) Barrett, T. J. Computations using analysis of covariance. WIREs Comp. Stat. 2011, 3 (3), 260−268.
(17) Hinds, W. T. Towards monitoring of long-term trends in terrestrial ecosystems. Environ. Conserv. 1984, 11 (1), 11−18.
(18) Faul, F.; Erdfelder, E.; Lang, A.-G.; Buchner, A. G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behav. Res. Methods 2007, 39 (2), 175−191.
(19) Costanza, R.; d'Arge, R.; de Groot, R.; Farber, S.; Grasso, M.; Hannon, B.; Limburg, K.; Naeem, S.; O'Neill, R. V.; Paruelo, J.; Raskin, R. G.; Sutton, P.; van den Belt, M. The value of the world's ecosystem services and natural capital. Nature 1997, 387, 253−260.