Optimal Replication and the Importance of Experimental Design for Gel-Based Quantitative Proteomics

Sybille M. N. Hunt,* Mervyn R. Thomas,† Lucille T. Sebastian, Susanne K. Pedersen, Rebecca L. Harcourt, Andrew J. Sloane, and Marc R. Wilkins

Proteome Systems Ltd, Locked Bag 2073, North Ryde, NSW 1670, Australia

J. Proteome Res. 2005, 4 (3), 809-819.

Received December 20, 2004

Quantitative proteomic studies, based on two-dimensional gel electrophoresis, are commonly used to find proteins that are differentially expressed between samples or groups of samples. These proteins are of interest as potential diagnostic or prognostic biomarkers, or as proteins associated with a trait. The complexity of proteomic data poses many challenges, so while experiments may reveal proteins that are differentially expressed, these differences are often not significant when subjected to rigorous statistical analysis. However, this can be addressed through appropriate experimental design. A good experimental design considers the impact of different sources of variation, both analytical and biological, on the statistical significance of the results. The design should address the number of samples that must be analyzed and the number of replicate gels per sample, in the context of a particular minimum difference that one is seeking to detect. In this study, we explore ways to improve the quality of protein expression data from 2-DE gels, and describe an approach for defining the number of samples required and the number of gels per sample. It has been developed for the simplest of situations, two groups of samples with variation at two levels: between samples and between gels. This approach will also be useful as a guide for more complex designs involving more than two groups of samples. We describe some Internet-accessible tools that can assist in the design of proteomic studies.

Keywords: quantitative proteomics • 2-DE • statistical power analysis • experimental design • analytical variation

* To whom correspondence should be addressed. Phone: 61 2 9889 1830. Fax: 61 2 9889 1805. E-mail: [email protected].
† Emphron Informatics, 6 Geewan Place, Chapel Hill, Queensland 4069, Australia.
10.1021/pr049758y CCC: $30.25 © 2005 American Chemical Society. Published on Web 05/07/2005.

Introduction

A key aim of proteomics is the expression analysis of large numbers of proteins. Technologies employed for this include two-dimensional polyacrylamide gel electrophoresis combined with visible or fluorescent stains and image analysis, and mass spectrometric approaches that use isotopic labeling techniques such as isotope-coded affinity tag peptide labeling (ref 1) or amino acid coded mass tagging (ref 2). These have been recently reviewed elsewhere (ref 3).

Quantitative two-dimensional electrophoresis (2-DE) is commonly applied to differential display analysis, to find proteins that are differentially expressed between samples or groups of samples. If these proteins are shown to change consistently in a population, they may be associated with or responsible for a phenotype and are referred to as biomarkers. Biomarkers can form the basis of diagnostic and prognostic tests and as such they are of scientific and commercial interest.

In many quantitative proteomic studies based on 2-DE gels, where researchers are seeking to identify differentially expressed proteins, there is often inadequate attention paid to experimental design. There are special challenges with proteomic data that need consideration, similar to the challenges faced in the acquisition and analysis of mRNA expression data. These arise because a very large number of measurements is usually generated for each sample; analytical variation is inherent to the protein separation, staining, image acquisition and processing steps; there is biological variation of environmental origin; and there is biological variation of genetic origin, for example in an out-bred population. Furthermore, the generation of proteomic data remains a relatively involved and multistep process, and as a consequence, experiments are usually modest in the number of samples analyzed. So while many experiments reveal proteins that are differentially expressed, these differences may be found to be not significant when subjected to rigorous statistical analysis. Many of the sources of variation simply cannot be easily controlled in proteomics (e.g., in the study of human samples). However, a good experimental design can evaluate the impact that different sources of variation can have on the statistical significance of the results, and help assess the best course of action.

The simplest experimental design for differential display exhibits a hierarchical structure (see Figure 1). At the top of the hierarchy are the groups of samples for comparison, defined by sample characteristics such as disease state (healthy or disease population) or treatment applied (drug dosage). The middle level of the hierarchy assays variation between the samples within a given group, capturing the major source of biological variation. This variation will be smallest when dealing with simple bacterial cultures and at its greatest in studies dealing with samples from human subjects. The lowest level of the hierarchy involves replicate 2-DE gels run from the same sample, and captures the inherent analytical variation. It is important to recognize that variation is inherent across the experiment; however, a good experimental design can address the number of samples that must be analyzed and the number of replicate gels per sample, in the context of a particular minimum expression difference that one is seeking to discover.

Figure 1. Typical experimental paradigm displaying the hierarchical structure of a 2-group quantitative proteomic experiment. At the top of the hierarchy are the groups of samples for comparison of sample characteristics such as disease state. The middle level of the hierarchy allows for estimation of between-samples (biological) variation within a given group and the lowest level for estimation of between-replicate 2-DE gels (analytical) variation.

In this study, we describe an approach for understanding and controlling variation in gel-based proteomics, and propose a means of designing an experiment to define the number of samples required and the number of gels per sample. This approach has been developed for the simplest of situations (two groups of samples with variation at two levels: between samples and between gels); however, it can be extended to address more complex experimental situations. We also describe some Internet-accessible tools that can assist researchers in experimental design for their quantitative proteomic studies.

Materials and Methods

Materials. Whole blood was collected from patients and immediately stored on ice, then spun gently to sediment red blood cells. The supernatant was further spun at 6000 × g to leave clarified plasma that was then stored at -80 °C until required. Standard laboratory chemicals were obtained from Sigma-Aldrich (St. Louis, MO) unless specified otherwise.

Human Plasma Sample Preparation. Two mL aliquots of plasma (per subject) were quickly thawed at 37 °C and depleted of the three high-abundance proteins fibrinogen, immunoglobulin G and albumin. Fibrinogen was removed using a venom cross-linking method (ref 4), immunoglobulin type G (IgG) with immuno-affinity chromatography using immobilized protein G sepharose beads (Amersham Biosciences, NSW, Australia), and human serum albumin by ethanol fractionation (ref 5). The triple-depleted plasma sample was then pre-fractionated into narrow range pI fractions, pI 3.0-5.5, 5.5-6.5 and 6.5-11.0, using an IsoelectrIQ2 multi compartment electrolyzer (MCE; Proteome Systems, Sydney, Australia). Throughout the fractionation, protein concentrations were quantified using a Coomassie-blue based Bradford protein assay. Only the pI 5.5-6.5 fractions were used in this study. Three hundred µg of MCE-fractionated triple depleted human plasma protein was made up to a final volume of 210 µL in a 7 M urea, 2 M thiourea, 10 mM Tris, 2% CHAPS sample buffer. The sample was then ultrasonicated for 30 s, and then reduced (by adding tributylphosphine to a final concentration of 5 mM and incubating for 1 h at ambient temperature) and alkylated (by adding iodoacetamide to a final concentration of 15 mM for 1 h at ambient temperature and protected from light). Before rehydration of immobilized pH gradient (IPG) strips, samples were ultrasonicated for 2 min and then centrifuged at 21 000 × g for 5 min. The supernatant was collected and 10 µL of Orange G added as an indicator dye.

Bacterial Sample Preparation. Twenty mg of lyophilized Escherichia coli bacterial cells (Sigma product EC-1, strain K12) were resuspended in 10 mL of sample buffer (7 M urea, 2 M thiourea, 40 mM Tris, 1% C7 BzO). The suspension was sonicated with the ultrasonic probe (Branson digital sonicator, model 450) for a total of 1 min (4 × 15 s pulses at 70% amplitude, chilled on ice between each step), and then centrifuged at 14 000 × g for 15 min at 15 °C to pellet cell debris. The supernatant was transferred into a clean tube and reduced and alkylated as described in the above section for the plasma protein preparations.

Two-Dimensional Gel Electrophoresis. Dry 11 cm IPG strips (Amersham Biosciences, NSW, Australia) were rehydrated for 8 h with 210 µL of protein sample. Rehydrated strips were focused to 100 kVh on a Protean IEF Cell (Bio-Rad, Hercules, CA) or IsoelectrIQ2 (Proteome Systems, Sydney, Australia) electrophoresis equipment. Focused IPG strips were equilibrated for 20 min in 6 M urea, 2% SDS, 0.01% bromophenol blue in 50 mM Tris-acetate buffer pH 7.0. Equilibrated IPG strips were placed on top of 6-15% Tris-acetate sodium dodecyl sulfate polyacrylamide precast 10 cm × 15 cm × 1 mm gels (Proteome Systems, Sydney, Australia). Electrophoresis was performed at 50 mA per gel for 1.5 h or until the tracking dye front reached the bottom of the gel. Proteins were stained using SYPRO Ruby (Molecular Probes, Eugene, OR) according to the manufacturer's instructions, then destained for 4 to 7 h in 10% methanol, 7% glacial acetic acid.

Image Capture and Analysis. Images of gels were acquired using the AlphaImager 3300 software (Alpha Innotech Corporation, San Leandro, California). Aperture and exposure times were adjusted so that only the most abundant proteins on the gels reached saturation (at pixel intensity level as determined by the software). For the analysis of E. coli, the above procedure was used for the gels separating 320 µg of protein, and the same settings were used for all other gels with lesser protein loads. The gel images were saved as 16-bit tagged image file format (TIFF) and analyzed using ImagepIQ version 1.0.1, a 2-DE image analysis software (Proteome Systems, Sydney, Australia). The images were imported into the ImagepIQ database under separate experiments according to sample type. Image manipulation was done within ImagepIQ and consisted of inverting the pixels to obtain an adsorption image (dark spots on a light background), flipping the image into the correct orientation and cropping the image where required.

The spot detection parameters were optimized on one gel image representative of each experiment. To do this, spot detection was first applied to the image using the default settings. The optimal threshold values for the spot-intensity and spot-area parameters were then determined by applying real time filters and visually assessing the result, the aim being to minimize the detection of artifacts and maximize the detection of real spots. These optimized settings were saved with the experiment and applied to all gel images within that experiment. The region of interest for each gel image and the settings were ultimately saved with the image.


After spot detection, manual spot editing was done for each image. Editing consisted of deleting spots at the periphery of the gel, and removal of obvious specks that escaped the filtering process. Missed spots were not edited at this stage. The images in each match set were then matched. In the case of the E. coli gels, all the images were selected and a single layer matching process applied. In the case of plasma gels for data sets 1 and 2, multilayer matches were done. The triplicate gels were grouped and matching was done for all the groups simultaneously using the batch process feature. A composite image (replicate composite) was generated for each group of triplicate gel images. Spots that were matched in 2 of the 3 replicate images were visually examined for mismatches and these were edited. Spots that were only found in one of the three images were deemed to be artifacts and were deleted from the match and hence from the composite image. The edited composite images from each sample group were subsequently matched to each other to generate a composite image representative of each group. Post match editing at this level was restricted to making sure that the correct spots were matched together. At the final level of matching, the two group composite images were matched to generate the final experiment composite image. The match IDs (numbers used as identifiers for all spots in a match) and normalized spot volumes columns were exported from the generated match report as a text file for further manipulation using Microsoft Excel.

Statistical Analyses

Data Screening. The analyses used in this study assume that the data come from a normally distributed underlying population. To check the distribution of our data, we plotted frequency histograms for the nontransformed normalized spot volumes, and for the data after log transformation. This was done using the BiostatistIQ tools within the BioinformatIQ software platform (Proteome Systems, Sydney, Australia).

Estimation of Variance Components. The coefficient of variation (CV%, calculated as the standard deviation of the normalized spot volumes divided by the mean, expressed as a percentage) was calculated as a measure of inter- and intra-sample variation. The correlation coefficient, as R², was also calculated. CV% and R² values are commonly used for measuring gel-to-gel variation and therefore allow a direct comparison with other publications. However, there are more powerful means of estimating variance. These variance components calculations are automated when using the tools at www.emphron.com. Here we describe the details of this analysis. Data were analyzed using a mixed effects linear model (ref 6). The model had a fixed effect term representing the group of samples (e.g., healthy vs diseased), and random deviations representing the effects of samples within groups and gels within samples. Between sample variation represents biological variation, and between gel variation represents analytical variation in the experimental process. Variance components for between samples and between replicate gels variation were estimated using the REML algorithm, as implemented in the R (Open Source statistical package) function lme.

Power Calculations. To assist in experimental design, and to understand the number of samples that need to be analyzed in order to confidently discover differentially expressed proteins, a power analysis was undertaken using specially developed tools. We have made these available at www.emphron.com. Note that for power analysis there cannot be any missing values. Here, we describe the steps used in the automated analysis.

Most biologists are familiar with the use of the significance test, and most understand the significance level of a test in terms of the probability of incorrectly rejecting the null hypothesis, that is, the probability of falsely deciding that there is an effect. This probability is often referred to as the type I error rate. By convention, the type I error rate is often fixed at 5%. In some studies, the experimenter is likely to make a different type of error: that of failing to reject the null hypothesis when a real effect exists. This is referred to as a type II error. Obviously, we wish to ensure that our experiments have a low probability of producing each type of error. Conventionally, rather than working with the type II error rate, we usually consider the power of a study. The power is the probability of correctly rejecting the null hypothesis, given that a difference exists. That is, the power is the probability of not making a type II error.

Elementary statistics text books generally focus on the significance level (type I error) rather than on the power (type II error). This is because we can usually control the type I error rate by choosing an appropriate critical value for our test statistic. The power, however, requires us to know rather more about the system we are studying. Power is influenced by four factors. Increasing the effect size (the true difference between means) makes it more likely that we will reject the null hypothesis and therefore increases power. That is, it is easier to find big differences than small differences. Reducing the experimental variability increases the power. That is, differences are easier to detect when there is little variation. Increasing the sample size increases the power: we are more likely to detect effects when we have many observations than when we have few. Finally, the power is influenced by the significance level we require. The smaller the type I error rate we are prepared to tolerate, the smaller our chance of detecting real effects becomes. That is, decreasing the significance level decreases the power.

Power was calculated using the tools on www.emphron.com. These tools use estimates of analytical and biological variability to determine the standard error of differences between two treatment means for a given experimental design. The experimental design is defined by the number of samples per group and the number of gels per sample. These tools then generate estimates of the minimal detectable difference, that is, the difference between means that will give a power of 80% for the given number of samples and gels.
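The two calculations described above can be sketched in a few lines of Python for a single spot. This is an illustrative sketch under stated assumptions, not the implementation behind the www.emphron.com tools: the function names are hypothetical, the variance components are estimated by the method of moments for a balanced nested design (a simpler stand-in for the REML fit obtained with lme), and the minimal detectable difference uses a normal approximation, which will tend to be optimistic when the number of samples is small. For a balanced design with n samples per group and m gels per sample, the standard error of the difference between two group means is sqrt(2 × (between-sample variance + between-gel variance / m) / n).

```python
from math import sqrt
from statistics import NormalDist, mean


def variance_components(groups):
    """Method-of-moments variance components for a balanced nested design.

    groups: list of groups; each group is a list of samples; each sample is
    a list of log-transformed normalized volumes for one spot (one per gel).
    Returns (between_sample_variance, between_gel_variance).
    Assumption: every sample has the same number of replicate gels.
    """
    m = len(groups[0][0])  # gels per sample
    ss_gel = ss_sample = 0.0
    df_gel = df_sample = 0
    for samples in groups:
        sample_means = [mean(gels) for gels in samples]
        group_mean = mean(sample_means)
        for gels, s_mean in zip(samples, sample_means):
            ss_gel += sum((y - s_mean) ** 2 for y in gels)
            df_gel += m - 1
        ss_sample += m * sum((s - group_mean) ** 2 for s in sample_means)
        df_sample += len(samples) - 1
    ms_gel = ss_gel / df_gel            # estimates the between-gel variance
    ms_sample = ss_sample / df_sample   # estimates gel var + m * sample var
    return max(0.0, (ms_sample - ms_gel) / m), ms_gel


def minimal_detectable_difference(var_sample, var_gel, n_samples, n_gels,
                                  alpha=0.05, power=0.80):
    """Difference between two group means detectable with the given power.

    Normal approximation (an assumption of this sketch): a two-sided test at
    level alpha, with n_samples per group and n_gels per sample.
    """
    se_diff = sqrt(2.0 * (var_sample + var_gel / n_gels) / n_samples)
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * se_diff


if __name__ == "__main__":
    # Made-up variance components on the log scale, for illustration only.
    for n, m in [(4, 2), (4, 4), (8, 2)]:
        mdd = minimal_detectable_difference(0.04, 0.02, n, m)
        print(f"{n} samples/group, {m} gels/sample: MDD = {mdd:.3f} (log scale)")
```

Because the data are analyzed on the log scale, exponentiating the minimal detectable difference gives the corresponding fold change. With the made-up values used here, doubling the number of samples reduces the detectable difference more than doubling the number of gels per sample, which is the behaviour expected whenever between-sample variation dominates.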

Results

The determination of an appropriate design for a 2-DE based experiment requires high quality protein expression data from image analysis. However, there is relatively little attention paid to the issues that affect the quality of image analysis data. Accordingly, we wished to carefully evaluate the approaches used in generating these data prior to their use for experimental design. Figure 2 shows the issues that are faced in generating high quality image analysis data and outlines some of the approaches that can be used to minimize their effects on data quality. These issues are explored in detail below.

Steps that need to be robust in the image analysis process are as follows: the ability to detect most, if not all, of the spots arrayed on a 2-DE gel; to correctly determine the boundary of these spots and hence obtain an accurate measure of the spot volumes; to generate normalized spot volumes in a gel or group of gels using a method that corrects for variations in absolute spot volumes (due to differences in protein loads per gel, in protein staining regime, and in image capture settings); to correctly align and match all the images in a group and accurately determine the corresponding spots across all images.

Figure 2. Flowchart showing the steps involved in an experimental design for a typical quantitative proteomic experiment. This includes the hierarchical structure of the pilot experiment, and outlines the potential hazards associated with each step as well as the precautions that can be taken to minimize their effects on the experimental results.

Spot Detection. To check the accuracy of spot detection, 2-DE gel images from http://www.umbc.edu/proteome that have been used in other publications to test image analysis software (ref 7) were analyzed. Spot detection was undertaken as described in the Materials and Methods section; the spot detection parameters were optimized for the image type but no manual spot editing was done. The results of the spot detection were then visually compared to the same image annotated with the expected real spots (also available from the above website). A total of 900 and 1403 spots were detected on image gel-a and gel-b, respectively. Visual comparison with the expected results showed the detection of 93.6% and 95.2% of the expected real spots (true positives), the missing of 6.4% and 4.8% of the expected real spots (false negatives), and detection of 13.1% and 8.8% artifacts. These results compare favorably to the accuracy of spot detection reported in the literature for other image analysis software packages, the best reported value for percentage of spots missed being 6% for Melanie 3.0 software (ref 7).

Spot Quantitation. To evaluate the accuracy of spot quantitation, a set of eleven artificial images from Raman et al. (ref 7), designed to test the accuracy of spot quantitation of image analysis software, was downloaded (http://www.umbc.edu/proteome). The expected volume ratios of the center spot in images (b) through to (k), relative to the center spot in image (a), are 2, 4, 6, 10, 14, 18, 22, 26, 30, and 40, respectively. Spot detection was carried out on the 11 artificial images using default spot detection parameters (see Materials and Methods). The images were then matched to each other to generate a match report with the appropriate relative intensity ratios. The observed spot ratios correlated well with the expected spot ratios, giving a correlation coefficient R² value of 0.99. Although the images analyzed are artificial and do not completely mimic a set of 2-DE gels, they are a useful indicator of the complexity of spot quantitation and a valid test of quantitative image analysis approaches.

Spot Volume Normalization. Slight variations in protein load per gel, protein staining efficiency and image capture can have a considerable impact on the raw spot volumes generated by image analysis. Normalization of raw spot volumes, necessary to minimize gel to gel variation, is imperative for quantitative proteomic studies using 2-DE. The efficiency of normalization methods in correcting for analytical variations in raw spot volumes due to uneven protein loading was tested. Increasing amounts of E. coli protein extracts (20, 40, 80, 160, or 320 µg) were subjected to 2-DE using 11-cm, pH 3-10 IPGs (see Materials and Methods). A representative image, annotated with the 278 spots that matched across all gels following automated matching, is shown in Figure 3. The range of spot volumes for the 278 matched spots relative to that of all the spots on the representative gel also shows that there was no bias in the selection of the spots for statistical analyses (Figure 3). The volumes for the matched spots, before and after normalization, were graphed using a box-and-whisker plot (Figure 4, graphs A and B). With an increase in protein load, there is a steady increase in the raw spot volumes, but there is little change in the normalized spot volume. These results show that the method of normalization employed can globally correct for differences in amount of protein loaded onto the gels.

Figure 3. E. coli 2-DE gel image annotated with the 278 matched spots (out of a total of 500 detected spots) used in the statistical analyses. The spot volumes for all the 500 spots detected, and for the 278 matched spots, were sorted in ascending order and plotted in a line graph. The graphs show that the 278 matched spots covered a similar range of volumes relative to the 500 spots detected, illustrating that we have a representative set of protein spots from the gel.

To investigate the efficiency of normalization in controlling analytical variation in replicate gels from the same sample, and in controlling variation from replicate samples run on different gels, we plotted the spot volumes, before and after normalization, for the 119 spots from data set 2 as described for the E. coli data in the above paragraph (Figure 4, graphs C and D). This plot shows that for raw spot volumes, there is random variation in the median and other quartiles across the 18 gels. After normalization, it can be seen that the random analytical variations in raw spot volumes have been largely corrected for. Normalization should correct for differences in raw spot volumes such that, for any given matched spot across replicate gels of the same sample, one would expect the ratio of spot volumes to be close to 1. Figure 5 shows X-Y plots of the ratio of the log transformed normalized spot volume for two sets of duplicate gels (E. coli_80 and E. coli_40 from the E. coli data set; and replicate gels 2.1.1 and 2.1.3 from data set 2) plotted against the log transformed normalized spot volume for one of the replicate gels (see the section Spot Volume Data Screening for more information on logarithmic transformation). The ratios were close to 1 for all except the lowest spot volumes, where we see a departure from 1 for the faintest spots.

Image Analysis Data for Statistical Analyses. Two plasma protein expression data sets were generated for the purpose of estimating between sample and between gel variance. Data set 1 consisted of 24 protein 2-DE gels (11-cm pH 4-7 IPGs) of triple depleted plasma sample preparations from two groups of subjects (a healthy and a diseased group), 4 samples per group, and triplicate gels per sample. Data set 2 consisted of 18 gel images generated similarly from two other healthy and diseased groups of subjects (3 samples per group, triplicate gels per sample). Image analysis was carried out as described in the Materials and Methods. Any matching conflicts (for example nonmatching spots) were deliberately not resolved, and only spots that matched across all the images were used. This strategy was used because it was imperative, for the purpose of these statistical analyses, that there were no mismatches in the data set. It also ensured that the spots used in the analysis were randomly spread across the images, and that no subjective editing was applied that might affect the automated spot quantitation. Plasma data sets 1 and 2 consisted of 63 and 119 matched spots, respectively; these were randomly spread across the gels, indicating that there was no bias in the selection of the spots for statistical analyses, as shown in Figure 6 for a representative gel for data set 2.
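The behaviour reported under Spot Volume Normalization (raw volumes rising with protein load while normalized volumes stay roughly constant) can be illustrated with a minimal sketch. The normalization algorithm used by ImagepIQ is not specified in the text, so the code below assumes one common choice, total-volume normalization, in which each spot is expressed as a percentage of the summed volume of the matched spots on its gel. The function name and the example volumes are hypothetical.

```python
def normalize_by_total_volume(raw_volumes):
    """Express each raw spot volume as a percentage of the gel's total.

    Assumption: total-volume normalization over the matched spots of one
    gel; the method actually used by the image analysis software may differ.
    """
    total = sum(raw_volumes)
    if total <= 0:
        raise ValueError("total spot volume must be positive")
    return [100.0 * v / total for v in raw_volumes]


if __name__ == "__main__":
    # Made-up raw volumes for the same three spots at two protein loads:
    # the second gel carries twice the load, so every raw volume doubles.
    gel_low_load = [10.0, 30.0, 60.0]
    gel_high_load = [20.0, 60.0, 120.0]
    print(normalize_by_total_volume(gel_low_load))
    print(normalize_by_total_volume(gel_high_load))
```

Under this assumption a uniform change in load cancels out exactly, so the two gels give the same normalized volumes. Real gels will deviate from this ideal, particularly for faint spots, as the replicate ratios in Figure 5 indicate.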

Coefficient of Variation and Correlation Coefficient. Most published reports that study biological and analytical variation have been based on evaluation of the percent coefficient of variation (CV%) and correlation coefficient (R²) (see e.g., ref 8). We believe that these tests are not sufficiently robust, and so we explored alternative approaches based on variance analyses and power calculations, below. However, to allow for a direct comparison of our study with previously published work, we calculated CV% and R² from our plasma data sets 1 and 2. Table 1 shows the average CV% ± SD for normalized spot volumes for the replicate images from plasma data sets 1 and 2. These values may appear high, but when compared to similar analyses of 2-DE image analysis data (see Discussion), our results are very favorable. A better illustration of these data is, however, to graph the cumulative percent of spots that fall below given CV% values for the 8 sets of replicate gels from data set 1, and the 6 sets of triplicate gels from data set 2 (Figure 7). This reveals that there can be notable differences in the CV% from one sample to the next, and can aid the identification of outlying samples in a group.

One simple, but widely used, means of evaluating within and between sample variance is to establish the correlation coefficient of nontransformed data. Accordingly, we used automatic image analysis to generate the correlation coefficient values (R²) of nontransformed normalized spot volumes, for every pairwise combination of gel images from plasma data sets 1 and 2. The means and standard deviations of the R² values for each replicate set were calculated for the healthy and diseased groups of data sets 1 and 2 (Table 2). The average R² (± SD) values for within-samples comparisons, which indicate analytical variation, ranged from 0.87 ± 0.09 to 0.99 ± 0.001, and from 0.94 ± 0.03 to 0.99 ± 0.002 for the healthy and diseased groups of data set 1, respectively. The equivalent values for data set 2 were 0.95 ± 0.01 to 0.98 ± 0.01, and 0.93 ± 0.03 to 0.98 ± 0.01. The values for between samples comparison, which indicate biological variation, ranged from 0.73 ± 0.03 to 0.88 ± 0.04, and from 0.93 ± 0.04 to 0.97 ± 0.01 for the healthy and diseased groups of data set 1, respectively. The equivalent values for data set 2 were 0.70 ± 0.06 to 0.90 ± 0.01, and 0.43 ± 0.02 to 0.66 ± 0.04. While this approach is not as powerful as other methods of analyzing variance (see below), it has revealed that our between sample variation (biological variation) is clearly greater than our within sample variation (analytical variation).
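For readers who wish to reproduce these summary statistics on their own data, the following sketch shows how CV%, the cumulative percentage of spots below a CV% threshold, and R² between two gels might be computed. It is a minimal illustration, not the procedure used by the image analysis software: the function names are hypothetical, the sample standard deviation (n - 1 denominator) is assumed, and R² is taken as the squared Pearson correlation coefficient.

```python
from statistics import mean, stdev


def cv_percent(volumes):
    """CV% of one spot's normalized volumes across replicate gels.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return 100.0 * stdev(volumes) / mean(volumes)


def percent_spots_below(cv_values, threshold):
    """Percentage of spots whose CV% falls below the given threshold."""
    return 100.0 * sum(cv < threshold for cv in cv_values) / len(cv_values)


def r_squared(gel_x, gel_y):
    """Squared Pearson correlation between matched spot volumes on two gels."""
    mx, my = mean(gel_x), mean(gel_y)
    sxy = sum((x - mx) * (y - my) for x, y in zip(gel_x, gel_y))
    sxx = sum((x - mx) ** 2 for x in gel_x)
    syy = sum((y - my) ** 2 for y in gel_y)
    return sxy * sxy / (sxx * syy)


if __name__ == "__main__":
    # Made-up normalized volumes: rows are spots, columns are triplicate gels.
    spots = [[0.90, 1.00, 1.10], [2.00, 2.10, 1.90], [0.20, 0.35, 0.25]]
    cvs = [cv_percent(s) for s in spots]
    print([round(cv, 1) for cv in cvs])
    print(percent_spots_below(cvs, 20.0))
    gel_1 = [s[0] for s in spots]
    gel_2 = [s[1] for s in spots]
    print(round(r_squared(gel_1, gel_2), 3))
```

Note that R² computed on nontransformed volumes is dominated by the most abundant spots, which is one reason it is a less sensitive measure of variation than the variance components approach.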


Figure 4. Efficiency of normalization in correcting for differences in spot volumes due to varying protein loads (A and B) or analytical variations (C and D). The log_e transformed spot volumes (graph A), and log_e transformed normalized spot volumes (graph B), for the 278 analyzed spots across the five E. coli gels with increasing protein load per gel, are plotted in a box-and-whisker format against the five gels. The log_e transformed spot volumes (graph C), and log_e transformed normalized spot volumes (graph D), for the 119 analyzed spots across the 18 gels from data set 2 are plotted in the same format against the 18 gels. The graphs show the effect of normalization on the median and the range of spot volumes across the gels.

Figure 5. Effectiveness of normalization of spot volumes in correcting for gel-to-gel variation between sets of replicate gels from E. coli (A) and human plasma (B, data set 2). The ratios of the log_e transformed normalized spot volumes are plotted, using an X-Y plot format, against the log_e transformed normalized spot volumes.

Figure 6. Representative 2-DE gel image of human plasma from data set 2 annotated with the 119 spots used in the statistical analyses. The spot volumes for all the 367 spots detected, and for the 119 matched spots, were sorted in ascending order and plotted in a line graph. The graphs show that the 119 matched spots covered a similar range of volumes relative to the 367 spots detected, illustrating that we have a representative set of protein spots from the gel.

Spot Volume Data Screening. On the basis of previous work in a separate study (data not shown), and on published work,9

we understood that spot volume data from image analysis of 2-DE gels does not fit a normal distribution, but requires

814

Journal of Proteome Research • Vol. 4, No. 3, 2005

Experimental Design for 2-DE Quantitative Proteomics

research articles

Table 1. Average CV% for Normalized Spot Volumes Across Replicate Gels (a)

data set    replicate set    average CV% ± SD
1           1.1              16.1 ± 8.1
            1.2              19.8 ± 13.0
            1.3              13.0 ± 5.9
            1.4              14.8 ± 16.4
            1.5              26.3 ± 17.3
            1.6              31.6 ± 22.9
            1.7              11.6 ± 9.8
            1.8              7.4 ± 4.5
2           2.1              17.20 ± 12.16
            2.2              21.24 ± 20.13
            2.3              19.84 ± 15.86
            2.4              16.02 ± 12.43
            2.5              22.41 ± 22.28
            2.6              19.54 ± 15.23

(a) For each of the matched spots, CV% of normalized spot volumes was calculated across replicate 2-DE gels from data set 1 (replicate sets 1.1 to 1.8) and data set 2 (replicate sets 2.1 to 2.6). For each replicate set of gels, the average and SD values were then calculated and tabulated.
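The calculation described in the footnote to Table 1 can be sketched as follows. This is a minimal illustration in Python, not the software used in the study. It assumes that the normalized spot volumes for one replicate set are held as a list of per-spot lists (one value per replicate gel). The function names and the example volumes are hypothetical, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form was used.

```python
from statistics import mean, stdev


def spot_cv_percent(volumes):
    """CV% of one spot's normalized volumes across its replicate gels."""
    return 100.0 * stdev(volumes) / mean(volumes)


def replicate_set_summary(spots):
    """Average and SD of the per-spot CV% values for one replicate set."""
    cvs = [spot_cv_percent(volumes) for volumes in spots]
    return mean(cvs), stdev(cvs)


# Hypothetical normalized volumes: three matched spots on triplicate gels.
example_set = [
    [0.90, 1.00, 1.10],
    [2.00, 2.00, 2.60],
    [0.45, 0.50, 0.55],
]
avg_cv, sd_cv = replicate_set_summary(example_set)
print(f"average CV% = {avg_cv:.1f}, SD = {sd_cv:.1f}")
```

Each row of Table 1 corresponds to one call of `replicate_set_summary` on the matched spots of one replicate set.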

Figure 7. Percentage of spots that fall below a given CV% for normalized spot volume across replicate 2-DE gels. The CV% for normalized spot volumes across each of the 8 sets of triplicate gels from plasma data set 1 (replicate sets 1.1 to 1.8) was calculated, and the cumulative percentage of matched spots that fall below each of the given CV% values was plotted against the CV%. A shows the graphs for plasma data set 1, replicate sets 1.1 to 1.8; B shows the graphs that were similarly generated for the 6 sets of triplicate gels from plasma data set 2 (replicate sets 2.1 to 2.6).
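The cumulative curves described in the Figure 7 caption can be sketched in a few lines. This is an illustrative Python fragment under stated assumptions: the per-spot CV% values and the threshold grid are hypothetical, and "below" is taken as a strict inequality.

```python
def percent_below(cv_values, thresholds):
    """For each threshold, the percentage of spots whose CV% is below it."""
    n_spots = len(cv_values)
    return [100.0 * sum(cv < t for cv in cv_values) / n_spots
            for t in thresholds]


# Hypothetical per-spot CV% values for one set of triplicate gels.
cvs = [4.0, 8.5, 12.0, 19.0, 27.5, 41.0]
thresholds = [10, 20, 30, 50]
for t, pct in zip(thresholds, percent_below(cvs, thresholds)):
    print(f"CV% < {t}: {pct:.0f}% of spots")
```

Plotting the returned percentages against the thresholds gives one curve per replicate set.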

logarithmic transformation. To confirm that this was the case with the data generated here, the normalized spot volumes, before and after log transformation, were plotted using a frequency histogram. Figure 8 shows the plots for a representative image from data set 1. These results showed that our log-transformed data are approximately normally distributed, whereas the nontransformed data are not. Accordingly, only transformed data were used for further analyses.

Figure 8. Comparison of frequency distribution of nontransformed and transformed normalized spot volumes. The median normalized spot volumes and log_e-transformed median normalized spot volumes were calculated for each replicate set of gels from data sets 1 and 2. The median values were then plotted in a frequency histogram, with the kernel density estimate of the distribution superimposed. The graphs are shown for a representative replicate set (replicate set 1.1) from data set 1. Note that the nontransformed data do not fit a normal distribution, whereas the log_e transformation yields a data set that is approximately normally distributed. Similar effects were obtained with all other replicate sets of gels (data not shown).

Estimation of Variance Components. The normalized spot volumes from plasma data sets 1 and 2 were analyzed using statistical applications that we have made available under the Tools section at www.emphron.com. A table of between-samples and within-samples variance was generated, which was then viewed in a box-and-whisker plot (Figure 9). The median between-samples variance components for data sets 1 and 2 were calculated to be 0.22 and 0.29, respectively, and the median within-samples (between replicate gels) variance components were 0.05 and 0.04, respectively. When analyzed on a spot-by-spot basis, the within-samples (analytical) variance was smaller than the between-samples (biological) variance in 92% and 87% of the cases for data sets 1 and 2, respectively. The variance values were then used in the power calculations to determine how many gels and samples one should aim for in a given experimental design.

Power Calculations. The variance components for between-samples and within-samples can be used in power calculations, to predict the best experimental design for use in a two-population comparison study. Statistical power is defined as the probability of correctly identifying a difference between the groups. The power is determined by the sample size (number of samples and number of gels), the variability (biological and analytical variation), the significance level of the test, and the effect size (being the size of the difference you are looking to identify). The significance level is the probability of incorrectly rejecting the null hypothesis, that is, the probability of making a type I statistical error. A type I error arises when we decide that there is a group difference, when, in truth, there is not. A type II error is made when we incorrectly accept the null hypothesis.
That is, a type II error arises when we decide that there is no group difference, when, in truth, there is. The power is therefore the probability of not making a type II error. The effect size is defined by percentage increase. That is, an effect size of 100% represents a doubling of spot volume between groups. An effect size of 50% represents a between-group ratio of 1.5. Minimum detectable difference is the size of effect required to give a required power at a specified significance level, given the variance known from a particular number of samples and gels. It can be read from the plots of effect size % against number of samples for a given number of gels per sample.
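Because the analyses are run on log_e-transformed spot volumes, a percentage effect size corresponds to a fixed difference on the log scale. The short Python sketch below makes that conversion explicit. The function names are hypothetical, and the sketch assumes that the effect size is the between-group ratio of spot volumes on the original scale, as the two examples in the text (100% for a doubling, 50% for a ratio of 1.5) suggest.

```python
import math


def effect_percent_to_ratio(effect_pct):
    """Between-group ratio implied by a percentage increase."""
    return 1.0 + effect_pct / 100.0


def effect_percent_to_log_difference(effect_pct):
    """Difference in log_e spot volume implied by a percentage increase."""
    return math.log(effect_percent_to_ratio(effect_pct))


def log_difference_to_effect_percent(delta):
    """Percentage increase implied by a difference on the log_e scale."""
    return 100.0 * (math.exp(delta) - 1.0)


for pct in (50, 100, 200):
    ratio = effect_percent_to_ratio(pct)
    delta = effect_percent_to_log_difference(pct)
    print(f"{pct}% effect: ratio {ratio:.1f}, log_e difference {delta:.3f}")
```

On this convention, a 3-fold difference corresponds to an effect size of 200%.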


Hunt et al.

Figure 9. Between-samples and within-sample variances in normalized spot volumes. The normalized spot volumes from plasma data sets 1 and 2 were analyzed using statistical applications that we have made available under the Tools section at www.emphron.com. The tool "Minimum detectable differences spot data" was selected, the data file was uploaded, and the following values were entered: Minimum number samples per group: 4, Maximum number samples per group: 20, Minimum number gels per sample: 1, and Maximum number gels per sample: 4. The default values of 0.05 and 0.8 were used for Required significance level and Required power. The variance values generated by the software were copied into an Excel spreadsheet and graphed using box-and-whisker plots. Graphs A and B show the variance for between-samples and within-sample in data sets 1 and 2, respectively. Y-axes are of different scale. Note that the within-sample variation is smaller than the between-samples variation in both cases.

To investigate what number of samples and analytical replicates would need to be analyzed in a differential display experiment where different levels of protein expression are sought, we analyzed plasma data sets 1 and 2 using the tool “Minimum Detectable Difference-Spot Data” from the web site www.emphron.com. The analysis generated a table of effect size % for 4 to 20 samples per group and 1 to 4 gels per sample, based on a given power of 80% and significance level of 5% (data not shown). We plotted the effect size % against number of samples for 1 to 4 gels per sample (Figure 10). Figure 10 shows that the extent of the difference that can be significantly established between two sample types is influenced predominantly by the number of samples that are analyzed. For example, approximately 9 samples (in data set 1) and 11 samples (in data set 2) are required from each of the control and experimental groups to significantly detect a 2-fold difference between the populations. However, more substantial differences (e.g., a 3-fold difference) can be detected significantly with smaller numbers of samples (in data set 1, with approximately 5 samples from each of the control and experimental groups). The top graph in Figure 10 also makes clear that there is an asymptote toward ∼50% effect size, illustrating that even with a very large number of samples analyzed, there would be a minimal detectable difference of approximately 1.5-fold. Similar trends were observed in data set 2. A further trend evident in both graphs (Figure 10) is that while an increase in the number of samples has a marked impact on the detectable differences observed, an increase in the number of replicates done per sample does not. This is not surprising, since the between-samples variance is so much larger than the within-samples variance, and serves to illustrate that duplicate or triplicate 2-DE gel runs per sample will be appropriate for the data presented here. An added advantage of doing replicate gels per sample is that it facilitates image analysis; artifacts detected as spots are usually unique to a replicate gel and can therefore be filtered out on that basis. In examining the graphs for plasma data sets 1 and 2, it can also be seen that the minimal detectable differences are dissimilar. Data set 2 has a greater variance for between-sample effects, and a correspondingly larger minimum detectable effect size with an asymptote toward 65%. This shows that the minimum detectable effect size is sensitive to features of the experimental data that vary from one data set to another. Effective use of sample size calculations will depend on the availability of suitable pilot data for the system of interest.
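The dependence of the minimum detectable effect size on the number of samples and the number of gels per sample can be illustrated with a simple calculation. The Python sketch below is not the method used by the tool at www.emphron.com, which is not described in the text. It assumes a two-sided comparison of two equally sized groups, a balanced design, and a normal approximation to the test statistic (a t-based calculation would give somewhat larger values for small numbers of samples). The function name is hypothetical. The variance components are the median values reported in the text for data set 1, so per-spot results will differ.

```python
from math import exp, sqrt
from statistics import NormalDist


def min_detectable_effect_percent(var_between, var_within, n_samples, n_gels,
                                  alpha=0.05, power=0.80):
    """Approximate minimum detectable effect size (%) on log_e-scale data."""
    # Variance of one sample's mean log volume over its replicate gels.
    var_sample_mean = var_between + var_within / n_gels
    # Standard error of the difference between the two group means.
    se_diff = sqrt(2.0 * var_sample_mean / n_samples)
    z = NormalDist()
    delta = (z.inv_cdf(1.0 - alpha / 2.0) + z.inv_cdf(power)) * se_diff
    # Convert the log-scale difference back to a percentage increase.
    return 100.0 * (exp(delta) - 1.0)


# Median variance components reported in the text for data set 1.
VAR_BETWEEN, VAR_WITHIN = 0.22, 0.05
for n_gels in (1, 2, 3, 4):
    row = [min_detectable_effect_percent(VAR_BETWEEN, VAR_WITHIN, n, n_gels)
           for n in (4, 8, 12, 16, 20)]
    print(n_gels, [f"{value:.0f}%" for value in row])
```

Under these assumptions the within-samples variance is divided by both the number of gels and the number of samples, whereas the between-samples variance is divided only by the number of samples. This is why adding samples reduces the detectable difference far more than adding replicate gels when the between-samples variance dominates.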


Experimental Design

In the above sections, we have described approaches and tools required to formulate a good experimental design for quantitative 2-DE based proteomics. The main aim of the design is to determine how many samples and gels should be analyzed in a given study to ensure that any proteins identified as differentially displayed will hold true in a subsequent larger study such as a clinical trial. Figure 2 shows a schematic description of the steps to be followed to ensure good experimental design, and highlights some hazards and precautions to be taken at each step. The overall strategy is to do a pilot study consisting of running triplicate 2-DE gels from each of 3 samples from each of two groups (for example, a healthy and a diseased group); perform image analysis and normalize the data; and then do statistical variance analysis of the data for spots that match across all the gels. Transformation of the normalized spot volumes is done automatically by the analysis tool. The between- and within-sample variance data can then be used to run a power analysis, which allows you to determine how many gels and samples you will need to analyze in the full experiment to confidently detect differences of a certain level between the two groups.
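For readers who wish to see what the variance analysis of a pilot study involves, the following Python sketch estimates between-samples and within-samples variance components for a single spot. It is a hedged illustration rather than the algorithm of the analysis tool, which the text does not specify. It assumes a balanced design (the same number of gels for every sample), uses the standard method-of-moments estimator for a nested design with samples pooled within groups, and truncates negative between-samples estimates at zero. The function name and the example values are hypothetical.

```python
from statistics import mean


def variance_components(groups):
    """Estimate (between-samples, within-samples) variance for one spot.

    groups: list of groups; each group is a list of samples; each sample is
    a list of log_e-transformed normalized volumes, one per replicate gel.
    """
    n_gels = len(groups[0][0])
    ss_between = ss_within = 0.0
    df_between = df_within = 0
    for samples in groups:
        sample_means = [mean(gels) for gels in samples]
        group_mean = mean(sample_means)
        ss_between += n_gels * sum((m - group_mean) ** 2 for m in sample_means)
        df_between += len(samples) - 1
        for gels, m in zip(samples, sample_means):
            ss_within += sum((y - m) ** 2 for y in gels)
            df_within += n_gels - 1
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    var_within = ms_within
    var_between = max(0.0, (ms_between - ms_within) / n_gels)
    return var_between, var_within


# Hypothetical pilot data for one spot: 2 groups x 3 samples x 3 gels.
healthy = [[1.10, 1.25, 1.18], [1.60, 1.52, 1.71], [0.85, 0.95, 0.90]]
diseased = [[2.05, 1.95, 2.10], [1.40, 1.55, 1.48], [1.90, 1.80, 1.88]]
var_b, var_w = variance_components([healthy, diseased])
print(f"between-samples variance = {var_b:.3f}, within-samples = {var_w:.3f}")
```

Running this for every matched spot gives the kind of per-spot variance table that can then be fed into a power calculation.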

Discussion

We have presented an approach for assaying the quality of image analysis and a set of robust statistical approaches to measure variability in gel-based proteomics. We have also described a set of web-based tools that can be used for experimental design based on these measurements. We have shown how data from a pilot study can then be used to evaluate how many samples should be analyzed in a larger study and how many replicate gels need to be run to be confident that any differences detected in subsequent experiments will be significant. We believe these approaches, while straightforward, should be applied in proteomic experiments to ensure that the results reported reflect discoveries relevant to a biological system, and not just analytical variation. For example, a 2-fold up- or down-regulation of protein expression between two groups may be of little importance unless appropriate numbers of samples have been analyzed. Our gel-to-gel variability results (analytical variation) were very favorable when compared with previous reports. If we consider CV%, we have shown that an average of 50%, 83%,


Table 2. Correlation Coefficient for Normalized Spot Volumes between Replicate Gels

Correlation coefficient R² values (average ± SD) were calculated for normalized spot volumes (not transformed) for all combinations of replicate sets of images for the healthy and diseased groups from data set 1 (A) and data set 2 (B).
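The correlation summary described for Table 2 can be sketched as below. This Python fragment is illustrative only. It assumes that "all combinations" means every pairwise combination of gels within a replicate set, and that R² is the square of the Pearson correlation coefficient between the matched spot volumes of two gels. The function names and example volumes are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def r_squared(x, y):
    """Squared Pearson correlation between two gels' matched spot volumes."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def replicate_r_squared_summary(gels):
    """Average and SD of R² over all pairs of gels in one replicate set."""
    values = [r_squared(g1, g2) for g1, g2 in combinations(gels, 2)]
    return mean(values), stdev(values)


# Hypothetical normalized volumes for four matched spots on triplicate gels.
gels = [
    [0.50, 1.20, 2.40, 0.80],
    [0.55, 1.10, 2.60, 0.75],
    [0.45, 1.30, 2.20, 0.90],
]
avg_r2, sd_r2 = replicate_r_squared_summary(gels)
print(f"R² = {avg_r2:.3f} ± {sd_r2:.3f}")
```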

and 95% of spots matched across triplicate gels with CV% values of