Data Variance and Statistical Significance in 2D ... - ACS Publications

Data Variance and Statistical Significance in 2D-Gel Electrophoresis and DIGE Experiments: Comparison of the Effects of Normalization Methods. Andrew ...
0 downloads 0 Views 3MB Size
ARTICLE pubs.acs.org/jpr

Data Variance and Statistical Significance in 2D-Gel Electrophoresis and DIGE Experiments: Comparison of the Effects of Normalization Methods Andrew J. Keeping and Richard A. Collins* Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada ABSTRACT: Identifying changes in the relative abundance of proteins between different biological samples is often confounded by technical noise. In this work, we compared eight normalization methods commonly used in two-dimensional gel electrophoresis and difference gel electrophoresis (DIGE) experiments for their ability to reduce noise and for their influence on the list of proteins whose difference in abundance between two samples is determined to be statistically significant. With respect to reducing noise we find that, while all methods improve upon unnormalized data, cyclic linear normalization is the least well suited to gel-based proteomics and the performances of the other methods are similar. We also find in DIGE data that the choice of normalization method has less of an impact on the noise than does the decision to use an internal reference in the experimental design and that both normalization and standardization using the internal reference are required to maximally reduce variance. Despite the similar noise reduction achieved by most normalization methods, the list of proteins whose abundance was determined to differ significantly between biological groups differed depending on the choice of normalization method. This work provides a direct comparison of the impact of normalization methods in the context of common experimental designs. KEYWORDS: normalization, DIGE, internal reference, internal standard, two-dimensional polyacrylamide gel electrophoresis, 2-D PAGE

’ INTRODUCTION One of the goals of gel-electrophoresis-based proteomic experiments is to identify differences in the relative abundances of individual proteins between biological samples, for example, wild-type vs mutant, or normal vs diseased. Obtaining valid quantitative estimates is often confounded by technical noise in the data (variability in quantification due to the experimental system). If uncorrected, or inadequately corrected, this noise can obscure the biologically meaningful results that are in turn required to address the question motivating the study. In two-dimensional gel electrophoresis (2D-GE) experiments, technical noise may arise from unintended differences in culture growth conditions, sample preparation, reagent quality, loading and electrophoresis of individual gels, or quantification of spot intensities. Several normalization methods have been described that reduce noise in the data and improve the accuracy of subsequent analysis. These range in conceptual complexity from Median Normalization, which modifies spot volume data according to the volume of the median spot of each gel, to Variance Stabilizing Normalization (VSN),1-3 which corrects for measurement error using a model that specifically parametrizes dominant sources of error. Typically, when a normalization method is introduced into the literature in application to 2D-GE or DIGE experiments, it is compared to a traditional or simpler previous method and the advantages of the new method are demonstrated. As published methods proliferate, it becomes increasingly important for the comparison of methods to be comprehensive. Here we compare representative implementations of each of eight normalization methods that have not all been compared to each other in a single study previously. Our findings are relevant to experimental r 2010 American Chemical Society

designs involving one protein sample per gel and DIGE experimental designs using two or three dye-labeled samples per gel.

’ EXPERIMENTAL SECTION Experimental Design

Data sets were obtained from experiments comparing mitochondrial proteins from different strains of Neurospora (to be described in detail elsewhere). The three-strain three-dye DIGE experiment was performed using one strain that contains the VS plasmid (N. intermedia Varkud-1c FGSC#1823) and two strains that lack the plasmid (N. intermedia Varkud-1b FGSC#1822, and N. crassa 74-OR23-1A). For the two-strain, two-dye DIGE experiment, we used the unstable heterokaryon transfer procedure4 to construct a pair of strains that differed only by the presence or absence of the Varkud and VS plasmids. Briefly, Cy-dye labeled proteins (see below) were separated by standard two-dimensional isoelectric focusing, using immobilized pH gradient strips (24 cm pH 3-11 nonlinear) in the first dimension and Laemmli SDS-PAGE in the second dimension. Gels were scanned using a Typhoon 9400 Variable Mode Imager (GE HealthSciences), producing a gel image for each dye-labeled sample present in a gel. Spot detection and background correction were performed using DeCyder ver. 5.01.01 software (GE HealthSciences). Spot matching was performed in two stages; Received: October 26, 2010 Published: December 20, 2010 1353

dx.doi.org/10.1021/pr101080e | J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research first by the DeCyder software itself, followed by exhaustive manual verification of all matches and of unmatched spots. The three-dye experiment followed the GE HealthSciences recommended protocol. Four biological replicates of each of three strains were used (for simplicity called strains A, B, and C); two replicates of each strain were labeled with Cy5 and two with Cy3. This labeling strategy, in combination with robust imputation of missing values, eliminates the influence of dye bias on comparisons between biological groups. Cy2 was used to label an internal reference consisting of an equal mixture of aliquots from all 12 biological replicates. One Cy-5-labeled sample, one Cy3-labeled sample and an aliquot of the Cy2-labeled internal reference were run on each of six gels. In the two-dye experiment, six biological replicates of two strains were used (called strains D and E). Each was labeled with Cy5. A Cy2-labeled internal reference was prepared as an equal mixture of aliquots from all 12 biological replicates. One biological replicate and an aliquot of the internal reference were run on each of 12 gels. Data Analyses

In the two-dye experiment, the initial data matrix consisted of rows corresponding to spots present in at least four of six biological replicates of each strain, and columns corresponding to each gel image. In the three-dye experiment, the initial data matrix consisted of rows corresponding to spots present in at least three of four biological replicates of each strain, and columns corresponding to each gel image. Both matrices contained missing values, which were imputed using the K-Nearest Neighbors imputation algorithm described by Troyanskaya et al.5 Prior to any analysis requiring the Gaussian distribution of the data in question, we used the Shapiro-Wilk test as in Karp and Lilley6 to confirm that this was the case. Normalization algorithms were applied in the R software environment.3,7,8 Normalization Algorithms

Eight normalization algorithms are compared in the current work. They are as follows. Total Volume Normalization (TVN): For each gel image, the sum of all spot volumes is calculated. The median of all such sums is also calculated. Each spot volume is divided by the sum of spot volumes on the same gel image, and then multiplied by the median sum in order to rescale all values into the original orders of magnitude. Median: For the assessment of technical variance, median normalization was performed by rescaling values so that the median spot volume on each gel was the same. For the rest of the study, median normalization was performed by rescaling internal reference volumes so that the median standardized log abundance (SLA) ratio for each biological replicate gel image was zero. In the three-dye experiment, this required rescaling each internal reference gel image twice, once for each of the two biological replicate gel images arising from a given gel. DeCyder: The DeCyder algorithm fits a Gaussian distribution to the frequency histogram of SLA ratios for each biological replicate gel image and rescales the corresponding internal reference spot volumes so that the peak of the distribution centers on zero (GE HealthSciences). Cyclic Loess: This approach normalizes data transformed into M vs A space9 by fitting a loess regression curve and then shifting each point by the M value of the loess line at the same value of A.10 The effect of the shift is that a loess curve run through the modified data would show M = 0 at all points A. The data

ARTICLE

transformed into M vs A space come from pairs of gel images. The process is iterated until there are no differences left to normalize, which typically takes 1-2 cycles through all pairs of gel images. Contrast: Like Cyclic Loess, this approach normalizes data using curves fit to scatter plots of data from two gel images. For a detailed description of the methodology, refer to Åstrand (2003).11 Cyclic Linear: This method uses the same approach as Cyclic Loess normalization, except that instead of a loess smoother, a linear fit through the scatter plot is calculated and used as the basis for normalization. Variance Stabilizing Normalization (VSN): The VSN algorithm3 uses both scaling and offset factors to normalize spot volumes, and was implemented as previously described.1,2 Quantile: For each gel image the list of spots is sorted into a ranked list on the basis of spot volume, with the consequence that the values for a given spot are not necessarily on the same row of the data matrix after sorting. At each row, or quantile, the average value across gel images is calculated, and then each value is replaced by the average. In the final step the original row order is restored for each gel image. For further description, see Bolstad et al. (2003).10

’ RESULTS AND DISCUSSION Normalization methods can be evaluated for their ability to reduce bias, to reduce variance and uncouple it from signal intensity. In the current work we evaluated normalization methods for their effects on variance; additionally, we observed intensity-dependent bias in a DIGE experiment and assessed the ability of normalization to reduce it. Finally, we examined the practical consequences of the normalization methods on the list of spots whose relative abundance was declared significantly different in two commonly used DIGE experimental designs. We analyzed data obtained from two experiments. The first, called the two-dye experiment, comprised 12 2D-DIGE gels in which Cy-5 lableled proteins from six biological replicates of each of two strains were electrophoresed on individual gels. Each gel also contained an equal amount of Cy-2 labeled internal reference sample: a mixture of each of the 12 biological samples in equal proportions. In the second experiment, called the three-dye experiment, we included a third Cy Dye (Cy3) to quantify proteins from four biological replicates of three strains arrayed across six gels with an aliquot of internal reference present on each gel. Spots were identified and quantified as described in the Experimental Section. We use the term “spot volume” to mean the fluorescence intensity of a spot after background subtraction; this value is considered to be proportionate to the amount of protein in a given spot on a given gel. Effect of Normalization Methods on Technical Variance

The internal reference spot volumes from all 12 gels in the two-dye experiment comprise a set of technical replicates: any variation must, by definition, arise from technical noise. When we calculated, for each of the 622 spots, the mean and standard deviation in the unnormalized data, we observed that, as a proportion of the mean, the data were substantially variable (Table 1). A perfect normalization method would eliminate this variability because the measured volume of a given internal reference spot should be the same on all gels. We normalized the technical replicate data with each of the following methods: VSN (Figure 1), TVN, Median, Contrast, 1354

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research

ARTICLE

Table 1. Normalization Reduces Variance as a Proportion of Average Spot Volume normalization

mean spot

mean spot volume relative

mean standard deviation

mean standard deviation

standard deviation as a

method

volume

to unnormalized

among replicate spots

relative to unnormalized

percentage of volume

Unnormalized

813 771

VSN

363 821

0.45

94 373

0.20

Contrast

726 789

0.89

187 181

0.39

25.8%

Cyclic Linear

765 981

0.94

269 109

0.56

35.1%

Cyclic Loess

726 615

0.89

184 788

0.39

25.4%

TVN

680 746

0.84

175 082

0.37

25.7%

Median Quantile

683 718 813 771

0.84 1.00

182 379 210 351

0.38 0.44

26.7% 25.8%

477 694

Figure 1. Normalization methods reduce variance in proteomic data partly by reducing spot volumes. For each spot in the two-dye experiment, average spot volume and standard deviation on the 12 replicate gels were calculated in both the unnormalized data and in data normalized by each of the methods described in the text. The results shown for VSN are typical of all normalization methods examined.

Cyclic Loess, Cyclic Linear and Quantile methods and evaluated how much reduction in variance had been achieved (Table 1). Reduced variance can be achieved simply by rescaling spot volumes to lower values, which was observed upon normalization by all methods other than Quantile normalization (evident in Figure 1 as a leftward shift in the data distribution). Nonetheless, each method also reduced variance as a percentage of spot volume, indicative of statistically useful normalization (Table 1). The degree to which variance was reduced as a percentage of spot volume was similar for most methods other than Cyclic Linear, which clearly underperformed. A more effective way to visualize the effect of normalization methods on variability in these data is to directly examine the agreement among technical replicates. For each spot we calculated the log of the ratio of spot volumes for all pairwise combinations of replicates, excluding self-comparisons (622 spots, 132 comparisons per spot). If the unnormalized data were free of technical noise, all log ratios would equal zero. Instead, a distribution is observed, and while the center of the peak is approximately equal to zero, the breadth of the distribution is indicative of noise (Figure 2). Each normalization method improved the agreement between technical replicates, as seen by the distributions in Figure 2 being narrower than in the unnormalized data. Six of the normalization methods performed similarly to each other; the seventh (Cyclic Linear normalization) performed noticeably worse. We observed the same trends when analyzing the six technical replicates in the three-dye experiment (data not shown). These analyses indicate that any normalization method is better than none at all, that some methods are more appropriate

58.7% 25.9%

Figure 2. Normalization methods improve agreement among technical replicates. For each internal reference spot in the two-dye experiment, spot volume ratios were calculated for all pairwise comparisons of the 12 replicate gels and converted to log2 values (log volume ratio). These ratios were plotted as a relative frequency plot in bins of 0.1 X-axis units, against the proportion of all observations (density). This analysis was repeated for data normalized by each of the indicated methods.

than others, and that a substantial amount of technical noise remains even with the best normalization methods. While we demonstrated in Figure 2 that normalization improved agreement between technical replicates, this analysis could not determine if there were signal-intensity-dependent trends in the data before or after normalization. To assess this we constructed M vs A plots9 of all of the pairwise comparisons described above, fit the distributions with loess smoothers and plotted the average absolute distance of the loess smoother to the X-axis at M = 0 (Figure 3). Intensity-dependent trends would appear in this plot as a systematic deviation from a horizontal line. The unnormalized data showed no evidence of a trend in agreement proportional to signal intensity. Similarly, no signal-intensity-dependent trends were introduced into any of the normalized data. The plots in Figure 3 also provide an alternative way of visualizing the effect of each normalization method on improving agreement between technical replicates, with perfect agreement indicated by a horizontal line intersecting the Y-axis at the origin. The curve for the unnormalized data is located well above the X-axis indicating suboptimal agreement among replicates. The curve for Cyclic Linear normalization is lower than for unnormalized data, though consistent with the conclusions drawn from Figure 2, it is the least effective normalization method. The remaining methods perform better, and are similar to each other in improving agreement among technical replicates. 1355

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research

ARTICLE

Figure 3. Evaluation of intensity-dependent trends in the agreement among technical replicates. For unnormalized data and for each normalization method, the volumes of matched spots on all pairs of gels were compared using M vs A plots9 and a loess smoother was fit to each resulting point cloud. The absolute distance between the loess smoother and M = 0 was calculated for all values of A on each M vs A plot, and the average of these is plotted against the value of A. The secondary Y-axis presents the same data on a fold difference scale describing the average agreement among spot volumes in pairwise comparisons of spots, where perfect agreement equals 1.0.

Effect of Normalization Methods on Data that Include Biological Replicates

As expected, biological replicate spot volumes are noisier than the technical replicate spot volumes because the former contain all of the sources of technical noise plus biological variation. This is evident from the value of the median for the biological data being greater than for the technical data in a plot of standard deviation as a proportion of the mean spot volume (Figure 4A). Normalization of the biological replicate spot volumes by each of the methods described above led to some reduction in variance (Figure 4B; VSN is shown as a representative example). The presence of an internal reference sample on all gels allows several sources of technical noise to be mitigated by dividing the spot volume of each biological replicate spot by the spot volume of the corresponding internal reference spot, producing a quantity called a spot volume ratio. While this is effectively a kind of normalization, it is distinct in that it produces a ratio rather than a corrected spot volume, so for clarity we will refer to this process as standardization. Figure 4B shows that standardization produces a greater reduction in noise (decrease in standard deviation÷ mean) than was obtained by normalizing the raw biological replicate spot volumes using VSN (or other methods, not shown). To investigate whether additional noise reduction in the biological replicate data could be achieved using both normalization and standardization, we applied each of the normalization methods to the raw spot volumes, calculated the spot volume ratios (i.e., standardized the normalized volumes) and plotted the distribution of standard deviation divided by the mean. The result for VSN normalization is shown in Figure 4B; the combinations of each of the other seven methods with internal reference

Figure 4. Effect of standardization (dividing by internal reference spot volume) and/or normalization on noise. Data are from the six gels in the two-dye experiment containing six biological replicates of Cy5-labeled strain D and Cy2-labeled internal reference. (A) Frequency plot of Standard Deviation ÷ Mean of the raw spot volumes for internal reference spots (technical, dashed line) or the corresponding spots from strain A (biological, solid line). Bin size is 0.1 X-axis units; arrows indicate the median X-axis value of each distribution. (B) Effects of Standardization (IR), Normalization by VSN or both (VSN þ IR) on noise in the biological replicate data (from panel A). (C) Relative Frequency Plot of SLA values (log2(biological replicate spot volume ÷ internal reference spot volume)), binned into units of 0.1 and plotted against the proportion of all observations (density).

standardization performed similarly (data not shown).The data in Figure 4B show that normalization and standardization are 1356

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research

ARTICLE

Figure 5. Evaluation of intensity-dependent trends in the agreement among biological replicates and the internal reference on each gel in the two-dye experiment. For unnormalized data and for each normalization method in turn, the biological replicate spot volumes and internal reference spot volumes on each gel were compared using M vs A plots9 and a loess smoother was fit to each resulting point cloud. The absolute distance between the loess smoother and M = 0 was calculated for all values of A on each M vs A plot, and the average of these is plotted against the value of A. The secondary Y-axis presents the same data on a fold difference scale describing the average agreement among biological replicate and internal reference spot volumes used to calculate Standardized Log Abundance values, where perfect agreement equals 1.0.

both required to obtain maximal reduction in noise in the biological data. An alternative way to visualize the effect of normalization is to plot the distribution of log2 of the spot volume ratio (called the standardized log abundance, or SLA; Figure 4C). The distribution of SLA values would be expected to center on zero if the Cy-labeling, loading and detection of the biological samples and internal reference sample were identical. The distribution of unnormalized SLA values is not centered on zero (Figure 4C, dashed line), indicating a detectable amount of one or more of these sources of variation in the raw SLA data. Cyclic linear normalization moved the center of the distribution closer to zero from its position in the unnormalized data. All of the other methods did so to a greater extent, demonstrating their abilities to effectively correct these sources of error. Decreasing breadth and increasing peak height in Figure 4C are also informative, as they are indicative of decreasing noise. Breadth and peak height are correlated in a density plot because the areas under each curve are necessarily the same. Unnormalized data yield the broadest distribution (shortest peak), followed by cyclic linear normalization, with the remaining normalization methods producing similarly narrow curves. The trend toward narrower curves after normalization reflects the relative performance of the methods at reducing noise in the data set. Because SLA values arise from the intragel comparison of biological replicate and internal reference proteins that are each labeled with different dyes, they are potentially subject to the phenomenon of dye bias. Dye bias can arise both systematically, from differences in the background fluorescence levels of

Figure 6. Quantile plot showing the distributions of observed p-values, arising from the comparison of the two strains in the two-dye data set, relative to their position in an ordered, ranked list (quantile ranking). Each curve corresponds to a normalization method, all data were standardized using the internal reference prior to statistical testing. The outlined box near the origin in A is enlarged in B. The diagonal Uniform Distribution line corresponds to the null expectation of repeatedly testing identical populations.

each dye,1,2 and on a protein-specific basis.12 In the context of standardization, it manifests in the data as an intensity-dependent decrease in agreement between the biological replicate spot volumes and internal reference spot volumes used in that process. Effective normalization will improve this agreement and eliminate any intensity dependence observed in unnormalized data. We evaluated the effectiveness of the normalization methods in our two-dye experiment by constructing, for each of 1357

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research

Figure 7. Effect of normalization methods on the enrichment of spots with statistically significant differences in abundance. The curves describe, for each normalization method, the relationship between the p- and q-values resulting from comparing the two strains present in the two-dye experiment. Q-value analysis was performed according to the method of Storey.13,14 For a given point on each curve, the q-value describes the proportion of spots with equal or smaller p-values that are expected to be false discoveries.

the 12 gels, an M vs A plot comparing the biological replicate spot volume and internal reference spot volume, fitting loess smoother curves to the point cloud on each plot and calculating the average absolute distance from those curves to the X-axis at M = 0 (curves for individual gels not shown); these distances are plotted in Figure 5. The curve for the unnormalized data is located well above the X-axis indicating suboptimal agreement between biological replicate and internal reference spot volumes at all intensities. In addition, there is a distinct upward trend at low X-axis values indicative of the intensity-dependent error resulting from dye bias. Each normalization method shifts the curve downward (toward lower Y-axis values), reflecting the improvement in agreement among SLA values because of normalization; Cyclic Linear normalization is notably the least effective method. With the exception of TVN, the normalization methods (especially Quantile and Cyclic Loess) also decrease the intensity-dependent error at low spot volumes. Effect of Normalization Method on the List of Spots Whose Abundance is Determined to Be Different between Strains

Using data that was both normalized and standardized, we compared the two strains using Student’s t test to obtain a p-value for each spot. Due to the large number of t tests performed, small p values are expected to be observed by chance at a frequency that can be estimated. In fact, as the number of tests performed becomes very large, on a quantile plot the observed p-values would lie along the diagonal if there is no difference in the two groups being compared. If instead statistically significant differences exist between the two groups, these would be revealed by an enrichment of smaller p-values. Conversely, a depletion of smaller p-values would reflect the presence of a confounding influence such as noise, regardless of whether any statistically

ARTICLE

Figure 8. Effect of normalization methods on the list of spots declared significantly different between strains D and E in the two-dye experiment. Yellow and blue denote spots that were declared different or not different, respectively, between the two strains at the q-value threshold indicated above each panel. Each column corresponds to the indicated normalization method; each row corresponds to a specific spot that was declared significant at q e 0.50 by at least one of the methods; the order of the rows is the same in both panels. In all cases, the data were standardized using the internal reference. There are no columns for unnormalized data or Cyclic Linear normalization because they failed to yield values of q e 0.50 for any spot.

significant differences existed. As shown in Figure 6, standardization alone resulted in a depletion of smaller p-values (evident from this curve falling below the diagonal), whereas the curves for standardized data after normalization by each of the methods examined showed enrichment in smaller p-values. This indicates that once the obscuring influence of technical noise is mitigated by normalization, the existence of an enrichment in spots with small p-values can be observed. Among the set of spots with small p-values, we cannot determine a priori which reflect real biological differences and which occur by chance. We can, however, apply q-value analysis13 to determine the false discovery rate (FDR) at a particular p-value threshold with which we can estimate the proportion of spots below that threshold that are due to real differences. Where there is an enrichment of small p-values, the corresponding q-value will reflect this condition with a value of less than 1. As the enrichment increases at ever-smaller p-value thresholds, the corresponding q-value decreases. If instead there is a depletion of small p-values, all of them are among those expected to be seen by chance and the q-value will reflect this by remaining at 1 for all p-values below which the depletion is observed. The relationship between q-value and p-value in our data is shown in Figure 7. As 1358

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research

ARTICLE

smaller p-values. Of the different normalization methods, the Cyclic Linear normalization curve in Figure 7 implies a lower proportion of significant observations at nearly any p-value threshold than for other normalization methods. The remaining methods perform similarly and show that normalization is required to identify the spots whose abundance differences most likely reflect real differences in protein abundance rather than artifacts of sampling. While we have described the interpretation of the q-value in the context of setting a particular p value threshold, it is also possible to define statistical significance using a q value threshold instead. As a selected threshold of either type is inherently arbitrary, we compared the lists of spots with associated q-values equal to or less than thresholds of 0.5 or 0.4, representing relatively more relaxed or stringent criteria, respectively. In Figure 8, the normalization methods that produced the greatest enrichment of small p-values have the longest lists, independent of threshold value. At q e 0.5, standardization alone or combined with Cyclic Linear normalization does not allow identification of any spots that satisfy the threshold. At q e 0.4 DeCyder or Cyclic Loess normalizations and standardization also fail to identify any spots, and the lists associated with the remaining normalization methods are substantially shorter. At either threshold, spots found on shorter lists are mostly subsets of spots found on longer lists, demonstrating that spots identified as differing significantly are unlikely to be unique to that choice of method. Despite this, it is also true that the length of the list of spots differing significantly is method-dependent; the choice of normalization method can directly impact the interpretation of the experimental results. Normalization and 3-Dye DIGE Experiments

Figure 9. Effect of normalization methods on the list of spots declared significantly different among strains A, B, and C in the three-dye experiment. Each column corresponds to the indicated normalization method; each row corresponds to a spot that was declared different by at least one of the methods; row order is specific to each panel. Yellow and blue denote spots that were declared different or not different, respectively, between the indicated strains at p e 0.05 and fold-change g2.

expected from the data in Figure 6, the curve for the unnormalized data in Figure 7 is a straight line at q = 1, implying that all p-values observed are expected by chance. Similarly, the enrichment of small p-values observed in Figure 6 after normalization by any of the methods is reflected in the sloping curves in Figure 7, which quantify the increasing proportion of significant observations (i.e., decreasing q-value) among subsets of spots with progressively

Q-value analysis cannot readily be applied to three-dye DIGE experimental designs involving two biological samples and an internal reference on the same gel.14 Instead, criteria of p < 0.05 and fold change (FC) >2 are typically used to identify proteins whose difference in abundance is sufficiently large to justify further study. Here we compare the effect of each normalization method on the list of spots in our three-dye experiment that meet these criteria. Data for each of the three pairwise comparisons between strains A, B and C are shown in Figure 9. Some spots were identified as differing significantly in the unnormalized data; however, a larger number were identified using normalized data. As with the two-dye experiment, spots identified by normalization methods that produced shorter lists were commonly found by methods that produced longer lists, further demonstrating the consistency among methods. While there were differences between the two-dye and three-dye experiments as to which normalization methods produced the longest lists, Cyclic Linear and DeCyder typically yielded a short list while Median and VSN typically yielded a longer list. These observations further demonstrate the ability of the normalization methods to overcome the masking of statistically significant differences by technical noise.

’ CONCLUSIONS The current work demonstrates that the commonly used normalization methods VSN, TVN, Median, Contrast, DeCyder, Quantile and Cyclic Loess substantially reduced the variance in 2D gel data; Cyclic Linear normalization was far less effective. We also found that the choice of normalization method can have a large effect on the list of proteins whose abundance is declared to be statistically different between strains. In an experiment in which the unnormalized data contained sufficient noise to preclude identification of any significant differences at a given q-value 1359

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360

Journal of Proteome Research

ARTICLE

threshold, we found that some methods (e.g.Cyclic Linear, and DeCyder) still failed to allow identification any significant differences, whereas VSN, TVN, Median (and some others depending on the threshold chosen) allowed significant differences to be identified. Nonetheless, the lists of significant spots differed among methods. These results suggest that 2D gel data should be normalized in parallel analyses using each of several methods before deciding on a list of proteins to be further investigated.

’ AUTHOR INFORMATION Corresponding Author

*Phone: 416-978-3541. Fax: 416-978-6885. E-mail: rick.collins@ utoronto.ca.

’ ACKNOWLEDGMENT This work was supported by CIHR grant MOP 12837 and the Canada Foundation for Innovation. ’ REFERENCES (1) Karp, N. A.; Kreil, D. P.; Lilley, K. S. Determining a significant change in protein expression with DeCyder during a pair-wise comparison using two-dimensional difference gel electrophoresis. Proteomics 2004, 4 (5), 1421–32. (2) Kreil, D. P.; Karp, N. A.; Lilley, K. S. DNA microarray normalization methods can remove bias from differential protein expression analysis of 2D difference gel electrophoresis results. Bioinformatics 2004, 20 (13), 2026–34. (3) Huber, W.; von Heydebreck, A.; Sultmann, H.; Poustka, A.; Vingron, M. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 2002, 18 (Suppl 1), S96–104. (4) Collins, R. A.; Saville, B. J. Independent Transfer of Mitochondrial Chromosomes and Plasmids during Unstable Vegetative Fusion in Neurospora. Nature 1990, 345 (6271), 177–9. (5) Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R. B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17 (6), 520–5. (6) Karp, N. A.; Lilley, K. S. Maximising sensitivity for detecting changes in protein expression: experimental design using minimal CyDyes. Proteomics 2005, 5 (12), 3105–15. (7) R Development Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2009. (8) Gautier, L.; Cope, L.; Bolstad, B. M.; Irizarry, R. A. affy - analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20 (3), 307–15. (9) Dudoit, S.; Yang, Y. H.; Callow, M. J.; Speed, T. P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat. Sin. 2002, 12 (1), 111–39. (10) Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19 (2), 185–93. (11) Astrand, M. Contrast normalization of oligonucleotide arrays. J. Comput. Biol. 2003, 10 (1), 95–102. (12) Krogh, M.; Fernandez, C.; Teilum, M.; Bengtsson, S.; James, P. A probabilistic treatment of the missing spot problem in 2D gel electrophoresis experiments. J. Proteome Res. 2007, 6 (8), 3335–43. (13) Storey, J. D.; Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 2003, 100 (16), 9440–5. (14) Karp, N. A.; McCormick, P. S.; Russell, M. R.; Lilley, K. S. Experimental and statistical considerations to avoid false conclusions in proteomics studies using differential in-gel electrophoresis. Mol. Cell. Proteomics 2007, 6 (8), 1354–64. 1360

dx.doi.org/10.1021/pr101080e |J. Proteome Res. 2011, 10, 1353–1360