Sample Size and Replication in 2D Gel Electrophoresis Studies

Graham W. Horgan*
Biomathematics and Statistics Scotland, Rowett Research Institute, Aberdeen AB21 9SB, UK
Received March 2, 2007

Abstract: Two-dimensional gel electrophoresis can be an expensive technology, and many studies are based on a modest number of replicates. It is important that the statistical power is sufficient to detect protein expression differences of interest. This paper reviews the application of power calculations and considers how other issues affect the choice of sample size. The important distinction between biological and technical replication is made, and the superiority of the former is emphasized. Keywords: sample size • power • gel

Introduction

Two-dimensional gel electrophoresis can be an expensive technology, and for this reason many studies are based on a modest number of replicates. However, it is important that the power of the study is sufficient to detect protein expression differences of interest. With a relatively new technology producing high-dimensional data, it can sometimes be forgotten that the traditional steps of experimental or observational study design still need to be addressed. Power calculation is one such issue: the power is the probability that the experiment or study will detect differences in protein expression of magnitudes that are worth knowing about. The aim of this paper is to discuss this issue. We review the essential statistical concepts and examine the effect of matters that particularly arise in gel electrophoresis or similar studies, such as biological versus technical replication, regularisation of tests, and use of false discovery rates (FDRs). We then comment on some aspects of the analysis of such data, as looking ahead to this analysis is important at the design stage.

Types of Replication

Replication is what enables experimental or observational data to be analyzed statistically. In electrophoresis studies, two types of replication are usually possible: biological and technical. Biological replication refers to the collection of protein samples from different animals, plants, humans, or cell cultures belonging to the same population or treatment group. It is assumed that the individuals have been selected from or randomly assigned to groups in the same way as for any other (non-proteomic) study. Technical replication refers to running two or more gels from each biological sample. If the standard deviation of the expression of a protein between biological samples from the same group is $S_b$ and the technical variation between different gels from the same biological samples is $S_t$, then with n biological samples each replicated m times (so nm gels in total) the standard error of the estimated expression is

$$\mathrm{SE} = \sqrt{\frac{S_b^2 + S_t^2/m}{n}}$$

This follows from the standard formulas for adding two independent sources of variation and for the standard error of the mean (1). Increasing either n or m will reduce the SE, but increasing n will produce a greater reduction than increasing m, however large $S_t$ is relative to $S_b$. This is illustrated in Figure 1. If the total number of gels is constrained, and n × m is fixed, then the smallest SE is achieved by choosing m = 1, i.e., do no technical replication. This leads to the important principle in study design: Biological replication is always better than technical replication. Do no technical replication at all if it is occurring at the expense of biological replication.

On this basis, we assume in the rest of this paper that replication is biological only. The variation between observed expression estimates from different samples will have standard deviation $S = \sqrt{S_b^2 + S_t^2}$, since the technical variation will still be present, and with m = 1, it cannot be distinguished from biological variation. If biological samples are limited, and resources permit technical replication, then this replication will be beneficial and will have the effect of reducing S to $S = \sqrt{S_b^2 + S_t^2/m}$.

* E-mail: [email protected].


Journal of Proteome Research 2007, 6, 2884-2887

Published on Web 06/06/2007

The value of this technical replication can therefore be assessed by its effect in reducing S, where it appears in the results presented in later sections. Of course this will require knowledge of the value of $S_t$, which can only come from pilot studies involving technical replication. It might be expected that $S_t$ will be stable over time in any laboratory and so will not need to be estimated for each new study. It is less clear that the same will be true of $S_b$.
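The trade-off between biological and technical replication can be checked numerically. The following is a minimal sketch, not part of the original paper: it evaluates the standard error formula for the standard deviations quoted in the Figure 1 caption (1 for biological, 1.5 for technical variation) under a hypothetical budget of 12 gels. The function name and the gel budget are assumptions made for illustration.

```python
import math


def standard_error(s_b, s_t, n, m=1):
    """SE of the mean expression for n biological samples, each run on m gels."""
    return math.sqrt((s_b**2 + s_t**2 / m) / n)


S_B, S_T = 1.0, 1.5   # standard deviations taken from the Figure 1 caption
TOTAL_GELS = 12       # hypothetical gel budget, chosen only for illustration

# Spend the same number of gels in different ways: n biological x m technical.
for m in (1, 2, 3, 4, 6):
    n = TOTAL_GELS // m
    print(f"n={n:2d}  m={m}  SE={standard_error(S_B, S_T, n, m):.3f}")
```

With the budget fixed, the SE grows as gels are moved from biological to technical replication, which is the principle stated above.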

Figure 1. Effect on the standard error of mean spot intensity estimation of biological and technical replication when the standard deviations of the biological and technical variation are 1 and 1.5, respectively.

technical notes
10.1021/pr070114a CCC: $37.00 © 2007 American Chemical Society

Number of Replicates

The aim of most gel electrophoresis proteomic studies is to search for proteins expressed differentially between two or more groups, usually control and treated or control and case groups. Each protein spot found on a gel will therefore be classified as differentially expressed (we will term this a “positive” finding) or not (a “negative” finding). The risk of false positives is usually set to 5%. The risk of false negatives is then controlled by experimental design, including setting the sample size. There are deficiencies in this view of testing, particularly in the case of proteomic studies, which are often seen as screening trials to identify proteins whose differential expression will then be confirmed by other means. However, we will start by presenting the issue of sample size selection from this viewpoint, as it is widely accepted, and consistent with the statistical testing approach usually adopted for electrophoresis studies. How the issues are affected by other considerations will be discussed later.

We thus wish to choose a sample size that controls the risk of false negatives. The probability that a true positive, a protein which really is differentially expressed, will be detected as such is termed the statistical power of the experiment. In testing terms, this is sometimes termed the “sensitivity” of the test, with the probability of a true negative being correctly detected termed the “specificity”. The power depends on three things.

• The extent of differential expression. Clearly, the stronger the expression factor for a protein is, the more likely we are to detect that it is differentially expressed. In designing an experiment, we need to specify the amount of differential expression that is important enough (i.e., biologically significant) that we would not wish to miss detecting it. We will denote this as D.
• The random variation between different biological samples on different gels (or averaged over technical replicates) from the same treatment, case, or control group. As above, we denote this as S, and it is usually expressed as a percentage, i.e., as a coefficient of variation, the standard deviation divided by the mean.
• The sample size, per group. We denote this n. Note again that these are biological replicates. The number of technical replicates, m, will determine the value of S and is best set to 1.

For a given statistical power, any two of the above quantities will determine the third. We wish to choose n, given that we can specify D and have some estimate of S. In this case it can be shown (2) that n is given, to a very good approximation, by the formula

$$n = 2(Z + T)^2 \frac{S^2}{D^2}$$

The quantities Z and T depend on the power of the test and the significance level, respectively. The power is usually set to a value such as 80 or 90%, and the significance level to 5%. In this case, T will be close to 2 and Z close to or a little more than 1, and a useful simpler approximation to the formula becomes

$$n = 20 \frac{S^2}{D^2}$$

which provides a quick guide to roughly how many samples will be needed for 80% power. For 90%, use a multiplier of 25. The full formula cannot be used directly, because T depends on n, albeit only weakly unless n is quite small. It must be solved mathematically. Note that if S is expressed as a percentage (coefficient of variation), then so must D.

In many cases where differential expression is large, terminology is based on “fold changes” rather than percentages: thus an increase of 100%, for example, is termed a 2-fold change, or 2-fold up-regulation. In these cases, where the variation in one group will therefore be correspondingly higher than in the other, statistical analysis should proceed on a log scale. Logs can be seen as turning ratios into differences, and if treatment effects are discussed in terms of ratios, logs should certainly be used to turn these into differences, as all standard statistical tests are based on differences, not ratios, between sample values.

Sample size calculation is widely available in statistical packages for computers. We present calculations of n for a range of values of D and S, and statistical powers of 80 and 90%, in Table 1. The calculations in the table, although presented using fold changes and coefficients of variation, are based on log transforming the data before analysis. The effect this has on using the formula is noted in Appendix A.
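Because T depends on n, the full formula has to be solved by search or iteration. The sketch below is one way to do that using only the Python standard library; it is an illustration under stated assumptions, not the method used to produce Table 1. The assumptions are: Z is a standard normal quantile, T is a two-sided Student t quantile with 2(n - 1) degrees of freedom (two equal groups), and the t quantile is approximated by a series expansion around the normal quantile instead of being taken from a statistics library. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_quantile(p, df):
    """Approximate Student t quantile (series expansion around the normal quantile)."""
    z = NormalDist().inv_cdf(p)
    return (z
            + (z**3 + z) / (4 * df)
            + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2))


def sample_size(s, d, power=0.80, alpha=0.05):
    """Smallest per-group n with n >= 2 (Z + T)^2 S^2 / D^2, where T depends on n."""
    z = NormalDist().inv_cdf(power)
    n = 2
    while True:
        t = t_quantile(1 - alpha / 2, df=2 * (n - 1))  # assumed df for two groups
        if n >= 2 * (z + t) ** 2 * s**2 / d**2:
            return n
        n += 1


def rule_of_thumb(s, d, multiplier=20):
    """Quick guide from the text: multiplier 20 for 80% power, 25 for 90%."""
    return math.ceil(multiplier * s**2 / d**2)


# S = 0.1 and a 10% change, with D taken on the log scale as in Appendix A.
print(sample_size(0.1, math.log(1.1)), rule_of_thumb(0.1, math.log(1.1)))
```

For S = 0.1 and a 10% change this sketch returns 19, the same as the corresponding 80% entry in Table 1. Other cells can differ by a sample or so, since the approximations used here need not match the calculation behind the table.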

Table 1. Sample Size Needed to Achieve a Power of 80 or 90%

Numeric column headings are the coefficient of variation of untransformed data.

Power 80%

| % change | fold change | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.8 | 1 | 1.25 | 1.5 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10% | 1.1 | 19 | 71 | 158 | 279 | 434 | 624 | 1108 | 1730 | 2702 | 3890 | 6914 |
| 20% | 1.2 | 7 | 21 | 45 | 78 | 120 | 172 | 304 | 474 | 740 | 1065 | 1891 |
| 30% | 1.3 | 5 | 11 | 23 | 39 | 59 | 84 | 148 | 230 | 359 | 515 | 914 |
| 40% | 1.4 | 3 | 8 | 15 | 24 | 37 | 52 | 91 | 141 | 219 | 314 | 557 |
| 50% | 1.5 | 3 | 6 | 11 | 17 | 26 | 37 | 63 | 98 | 151 | 217 | 384 |
| 60% | 1.6 | 3 | 5 | 9 | 14 | 20 | 28 | 48 | 73 | 113 | 162 | 286 |
| 80% | 1.8 | 2 | 4 | 6 | 9 | 14 | 19 | 31 | 48 | 73 | 104 | 184 |
| 100% | 2 | 2 | 3 | 5 | 7 | 10 | 14 | 23 | 35 | 53 | 76 | 133 |
| 125% | 2.25 | 2 | 3 | 5 | 6 | 8 | 11 | 17 | 26 | 39 | 56 | 98 |
| 150% | 2.5 | 2 | 3 | 4 | 5 | 7 | 9 | 14 | 21 | 31 | 44 | 77 |
| 200% | 3 | 2 | 2 | 3 | 5 | 5 | 7 | 10 | 15 | 23 | 31 | 54 |
| 300% | 4 | 2 | 2 | 3 | 3 | 5 | 5 | 7 | 10 | 15 | 21 | 35 |

Power 90%

| % change | fold change | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.8 | 1 | 1.25 | 1.5 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10% | 1.1 | 25 | 95 | 210 | 372 | 580 | 835 | 1483 | 2315 | 3617 | 5207 | 9255 |
| 20% | 1.2 | 8 | 27 | 59 | 103 | 160 | 230 | 407 | 634 | 990 | 1424 | 2531 |
| 30% | 1.3 | 5 | 14 | 29 | 51 | 78 | 112 | 197 | 307 | 479 | 689 | 1223 |
| 40% | 1.4 | 4 | 9 | 19 | 32 | 48 | 69 | 121 | 188 | 292 | 420 | 744 |
| 50% | 1.5 | 3 | 7 | 13 | 22 | 34 | 48 | 84 | 130 | 202 | 290 | 513 |
| 60% | 1.6 | 3 | 6 | 11 | 17 | 26 | 36 | 63 | 97 | 151 | 216 | 382 |
| 80% | 1.8 | 2 | 4 | 7 | 12 | 17 | 24 | 41 | 63 | 97 | 139 | 245 |
| 100% | 2 | 2 | 4 | 6 | 9 | 13 | 18 | 30 | 46 | 70 | 100 | 177 |
| 125% | 2.25 | 2 | 3 | 5 | 7 | 10 | 13 | 22 | 34 | 52 | 74 | 130 |
| 150% | 2.5 | 2 | 3 | 4 | 6 | 8 | 11 | 18 | 27 | 41 | 58 | 102 |
| 200% | 3 | 2 | 3 | 3 | 5 | 6 | 8 | 13 | 19 | 29 | 41 | 72 |
| 300% | 4 | 2 | 2 | 3 | 4 | 5 | 6 | 9 | 13 | 19 | 27 | 46 |

Other Power Calculation Issues

What is presented above is not specific to studies producing high dimensional data. There are some other matters which might influence the calculations above, although their effect will not be large.

Multiple Comparison Adjustments. The significance level in a traditional statistical test refers to that test only, i.e., there is a 5% (assuming the usual significance level) risk that in a test where in reality no effect or group difference is present, one nevertheless will wrongly be detected. When lots of tests are done, each carries a 5% risk, and the overall risk that at least some false positive test results occur is greater than 5%, often considerably greater. With several hundred spots to test in a gel study, it approaches certainty. There is a plethora of multiple comparison adjustments available, which aim to adjust the p-values to give an overall risk of 5%. Even if they are not used across different measured variables, they may be used for each variable when more than two groups are being compared. There is a large choice of such tests: 12 are offered by SAS, more by SPSS, for example. The reason for our mentioning them here is to warn that their use will seriously reduce the power of the study for any given spot. If they are to be used, the replication will need to be increased accordingly. With so many adjustments available, all with different effects, it is not practical to present any results on how the power will be affected. We can only suggest that some data simulation exercises should be constructed to enable the matter to be resolved.

False Discovery Rates. False discovery rates (FDR) (3) are another approach to the issue of doing many tests. They aim to estimate, for any p-value threshold in the test used, what proportion of the positive results found are false positives. The intention is not so much to adjust for this, but to inform. The researcher may find this information helpful in deciding what p-value to use if a subset of proteins are to be declared different between groups and identified, described, or investigated further. Using a different p-value threshold from 5%, guided by the FDR, will affect the power of the test: increasing or reducing it will increase or reduce the power, respectively. The effect on the sample size formula presented earlier will be to alter the value of T. However, it is not possible to know before the data are available what the FDR function will be like, as the distribution of sizes of true effects is usually unknown. In this case, this issue should not influence the power calculation, and the FDR should be seen as a useful tool, but one to be used in analysis rather than design. However, if suitable pilot data are available, and FDR is central to the data analysis, recently proposed methods (4) offer a possible alternative way to derive sample sizes.

Regularisation of Tests. This refers to the use of variance information from other spots in testing an individual spot. The rationale is that variance estimates are subject to error since they are based on a small sample size, and that if the variance estimate for a spot is combined with an average of the variances of all the other spots, an improved estimate will result. We will not discuss the advantages and disadvantages of doing this here; more details can be found elsewhere (5). We wish only to point out that the effect on power calculations of an intention to regularize tests in this way will be minimal. Regularised tests will increase the power, because the effective degrees of freedom (DF) of variance estimates will be increased. However, unless the DF are very small, fewer than 10 or so, the effect on the power calculation will not be worth considering. We can describe the effect as being one of increasing the DF of the T term in the formula above. At 10 DF, T is about 2.2. However good the effect of regularisation is, the lower limit for T is 1.96.
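The warning about multiple comparisons can be made concrete with a small calculation. The sketch below is illustrative only and rests on two simplifying assumptions that the paper does not make: the tests are treated as independent, and a Bonferroni adjustment (testing each spot at level alpha divided by the number of tests) stands in for the many adjustments available. It uses large-sample normal quantiles for both Z and T. Function names are hypothetical.

```python
from statistics import NormalDist


def familywise_risk(n_tests, alpha=0.05):
    """Chance of at least one false positive over n_tests independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests


def multiplier(n_tests=1, power=0.80, alpha=0.05):
    """Large-sample 2 (Z + T)^2 when each test is run at level alpha / n_tests."""
    norm = NormalDist()
    z = norm.inv_cdf(power)
    t = norm.inv_cdf(1 - alpha / n_tests / 2)  # Bonferroni-adjusted two-sided level
    return 2 * (z + t) ** 2


for k in (1, 10, 100, 500):
    print(f"{k:4d} tests: risk of any false positive = {familywise_risk(k):.3f}, "
          f"sample size multiplier = {multiplier(k):.1f}")
```

Under these assumptions the unadjusted risk is essentially 1 for several hundred spots, while the adjusted multiplier of S²/D² rises well above its single-test value, which is why replication would need to increase if such an adjustment were used. A simulation, as the text suggests, would be needed for any specific adjustment and correlation structure.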

Data Analysis

There is much that could be said about the analysis of gel proteomic data. We do not intend to give a full treatment here. Those interested should consult some of the literature on this topic (6-9). The considerable literature on microarray data analysis also contains ideas useful in proteomics (10, 11). We wish here only to draw attention to some issues which have a bearing on power in experimental design. It could be argued that all issues related to analysis are relevant at the design stage. This is because the analysis planned must be specified to some extent before the sample size calculation can be done. It is also true that the best analysis will be more powerful than one which is sub-optimal (unless it is incorrect, such as treating technical replication as if it were biological).

ANOVA. ANOVA is appropriate and more powerful than several pairwise t-tests where there are more than two groups to be compared. If the groups are structured, then this should be accounted for by using a suitable factorial ANOVA, or by defining and testing contrasts. It will often be desired to do some pairwise comparisons also, and these should be done as post-hoc t-tests, using variance information from all groups. Such tests will be more powerful than simple t-tests, although it should be remembered that each carries a 5% risk of false positives.

Average Technical Replicates. If technical replicates are done, and recorded individually in the dataset, they must not be treated the same as biological replicates in the analysis. The analysis should incorporate a random effect term for the biological samples. Alternatively, the technical replicates should be averaged, and the analysis based on one mean value for each biological sample. Not to do this is to commit the error termed “pseudoreplication”. This issue is also discussed elsewhere (12).

Logs. Analyzing data on a log scale should be considered. Logs can be thought of as converting multiplicative into additive differences. On a log scale, the numbers 1, 10, 100, and 1000 are equally spaced. If gel spot data are skewed rather than reasonably Normally distributed, then analysis on a log scale will ensure that greater prominence is given to differences at the lower end of the range, at the expense of differences between the larger values. Care should be taken with very small values which are essentially background intensities. Adding a small constant before taking logs will ensure that differences between these small values do not unduly influence the analysis.

Significance Level. There is nothing magical about the 5% level so widely used in statistical testing. It does not derive from philosophical, scientific, or mathematical considerations. It is simply a tradition which has been found useful. It may be inappropriate if gel studies are seen as a screening exercise, to select proteins worth further study. If the significance level is increased from 5%, then the power of the study will increase, albeit at the expense of an increased number of false positives. Consideration of the false discovery rate (FDR) can inform the choice of p-value to use.

Maybe Try More Than One Test. In a similar spirit to the previous comment, the usual advice to use just one test might be waived in the case of proteomic data. The author has seen cases where spots have been compared using both standard t-tests and nonparametric Mann-Whitney tests. The resulting lists of significantly different spots have overlapped but not been identical. Examining the spots found significant by one but not both methods has indicated that these spots are worthy of further consideration. This appears often to be due to big differences in variability between the groups compared, something t-tests assume does not happen. As long as any reporting of the study clearly states that more than one test was used, and that the results of both are reported, there can be no reason not to do this.

Missing Spots and Missing Data. A spot which is not detected in a gel is not missing data. It may be that the spot is not found because expression of the corresponding protein is effectively zero. Even if it is not zero, and the spot has been obscured by other spots or simply missed by the image analysis software, it is likely that its expression level is lower than that of spots which have been found. To treat it as missing data would be to lose information. The simple approach is to assign missing spots a zero intensity. A more sophisticated approach where missing spots are treated differently from very low-intensity spots has also been suggested (13).
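Several of the points above (zero intensity for undetected spots, a small constant before taking logs, and one value per biological sample) can be combined in a short data-preparation step. The following is a minimal sketch with made-up intensities. The offset value, the data layout, and the choice to average technical replicates after the log transform are all assumptions made for illustration, not recommendations from the paper.

```python
import math
from statistics import mean

OFFSET = 1.0  # hypothetical small constant; in practice set relative to background

# Made-up intensities for one spot: {biological sample: [technical replicate gels]}.
# None marks a gel on which the spot was not detected.
control = {"c1": [120.0, 135.0], "c2": [80.0, None], "c3": [95.0, 101.0]}


def log_intensity(value, offset=OFFSET):
    """Treat an undetected spot as zero intensity, then take logs with an offset."""
    intensity = 0.0 if value is None else value
    return math.log(intensity + offset)


def per_sample_means(group):
    """Average technical replicates so each biological sample contributes one value."""
    return {sample: mean(log_intensity(v) for v in gels)
            for sample, gels in group.items()}


print(per_sample_means(control))
```

The resulting dictionary has one entry per biological sample, so a subsequent test sees n = 3 and not the 6 gels that were run, which avoids pseudoreplication.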

Discussion

Sample size is just one of the issues relating to the design of gel electrophoresis experiments. Other aspects of the design will need to be determined also. Many of the issues are similar to those which have been well explored in a general context (14). However, sample size is an important issue, and it is sometimes given insufficient attention in planning the use of this rapidly developing technology. The central features of sample size planning are the same as those in any scientific experiment or observational study, and we have presented these here. Issues particular to gel studies, or to high dimensional molecular biology in general, do have some influence on power calculations, and we have discussed some of these.

Many things affect the technical variation in gel-based proteomic studies (15, 16). Developments in the technology of gel electrophoresis may help to reduce the amount of technical variation, and thereby make studies more powerful. However, no amount of improvement will enable study size to be reduced beyond what biological variability dictates. Appropriate design, such as pairing or blocking of observations, can reduce the amount of biological variation which is unexplained (and it is this which specifies $S_b$ in our calculations). Variation between pairs/blocks can then be estimated as a separate term in the analysis, and only the variation within pairs/blocks is unexplained. In addition, some of the variation will be in the technology, and reducing this until it is small relative to the biological variation would be a sure way of improving the power of gel electrophoresis studies.

Appendix A

The sample size calculations presented in Table 1 are based on data analysis being carried out on a log scale. Thus, if the treatment mean is greater than the control mean (or vice versa) by a proportion A, then on a natural log scale the difference in means becomes ln(1+A) - ln(1), i.e., ln(1+A). On a natural log scale, the standard deviation is approximately the coefficient of variation on the original scale. (This follows from the derivative of the natural log being the reciprocal function.) The approximation is better the smaller the coefficient of variation, and it depends on the form of the distribution. For a Normal distribution it underestimates a little, whereas for a lognormal it overestimates. It does not matter what base is used for the logs, as a different base only rescales all calculations relative to natural logs.
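The step from coefficient of variation to log-scale standard deviation can be sketched with a first-order Taylor expansion. The symbols X (spot intensity), μ (its mean) and σ (its standard deviation) are introduced here only for illustration and do not appear in the original text.

```latex
\ln X \approx \ln\mu + \frac{X-\mu}{\mu}
\quad\Longrightarrow\quad
\operatorname{SD}(\ln X) \approx \frac{\sigma}{\mu} = \mathrm{CV},
\qquad
D = \ln(1+A),
\qquad
n \approx 2(Z+T)^2\,\frac{\mathrm{CV}^2}{\left[\ln(1+A)\right]^2}.
```

As a rough check with the quick guide for 80% power, a 2-fold change (A = 1) with a coefficient of variation of 0.5 gives 20 × 0.25 / (ln 2)² ≈ 10.4, close to the Table 1 entry of 10 for that combination.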

Acknowledgment. The ideas presented in this paper have been developed after many discussions with scientists at the Rowett Research Institute. The work has been supported by the Scottish Executive Environment and Rural Affairs Department.

References

(1) Cabrera, J.; Amaratunga, D. In Exploration and Analysis of DNA Microarray and Protein Array Data; Wiley: New York, 2004; Chapter 6.4.
(2) Kahn, H. A.; Sempos, C. T. In Statistical Methods in Epidemiology; Oxford University Press: New York, 1989; Chapter 2.
(3) Storey, J. D.; Tibshirani, R. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 9440-9445.
(4) Tibshirani, R. BMC Bioinformatics 2006, 7, 106.
(5) Baldi, P.; Long, A. D. Bioinformatics 2001, 17, 509-519.
(6) Chang, J.; Van Remmen, H.; Ward, W. F.; Regnier, F. E.; Richardson, A.; Cornell, J. J. Proteome Res. 2004, 3, 1210-1218.
(7) Gilchrist, M. A.; Salter, L. A.; Wagner, A. Bioinformatics 2004, 20, 689-700.
(8) Fodor, I. K.; Nelson, D. O.; Alegria-Hartman, M.; Robbins, K.; Langlois, R. G.; Turteltaub, K. W.; Corzett, T. H.; McCutchen-Maloney, S. L. Bioinformatics 2005, 21, 3733-3740.
(9) Maurer, M. H.; Feldmann, R. E., Jr.; Bromme, J. O.; Kalenka, A. J. Proteome Res. 2005, 4, 96-100.
(10) Tusher, V. G.; Tibshirani, R.; Chu, G. Proc. Natl. Acad. Sci. U.S.A. 2001, 98, 5116-5121.
(11) Bickel, D. R. Bioinformatics 2004, 20, 682-688.
(12) Karp, N. A.; Spencer, M.; Lindsay, H.; O’Dell, K.; Lilley, K. S. J. Proteome Res. 2005, 4, 1867-1871.
(13) Wood, J.; White, I. R.; Cutler, P. Signal Process. 2004, 84, 1777-1788.
(14) Cochran, W.; Cox, G. In Experimental Designs; Wiley: New York, 1992; Chapter 2.2.
(15) Choe, L. H.; Lee, K. H. Electrophoresis 2003, 24, 3500-3507.
(16) Schlags, W.; Walther, M.; Masree, M.; Kratzell, M.; Noe, C. R.; Lachmann, B. Electrophoresis 2005, 26 (12), 2461-2469.
