A Probabilistic Treatment of the Missing Spot ... - ACS Publications

Mar 13, 2007 - Morten Krogh,*,†,| Ce´line Fernandez,‡ Maria Teilum,§ Sofia Bengtsson,| ... Neuroscience Center, Lund University, Sweden, and Dep...
0 downloads 0 Views 208KB Size
A Probabilistic Treatment of the Missing Spot Problem in 2D Gel Electrophoresis Experiments Morten Krogh,*,†,| Ce´ line Fernandez,‡ Maria Teilum,§ Sofia Bengtsson,| and Peter James| Computational Biology and Biological Physics, Department of Theoretical Physics, Lund University, Sweden, Department of Experimental Medical Science Division of Diabetes, Metabolism and Endocrinology Molecular Endocrinology Group, Lund University, Sweden, Laboratory for Experimental Brain Research, Wallenberg Neuroscience Center, Lund University, Sweden, and Department of Protein Technology, BMC D13, Lund University, Sweden Received March 13, 2007

Abstract: Two-dimensional SDS-PAGE gel electrophoresis using post-run staining is widely used to measure the abundances of thousands of protein spots simultaneously. Usually, the protein abundances of two or more biological groups are compared using biological and technical replicates. After gel separation and staining, the spots are detected, spot volumes are quantified, and spots are matched across gels. There are almost always many missing values in the resulting data set. The missing values arise either because the corresponding proteins have very low abundances (or are absent) or because of experimental errors such as incomplete/over focusing in the first dimension or varying run times in the second dimension as well as faulty spot detection and matching. In this study, we show that the probability for a spot to be missing can be modeled by a logistic regression function of the logarithm of the volume. Furthermore, we present an algorithm that takes a set of gels with technical and biological replicates as input and estimates the average protein abundances in the biological groups from the number of missing spots and measured volumes of the present spots using a maximum likelihood approach. Confidence intervals for abundances and p-values for differential expression between two groups are calculated using bootstrap sampling. The algorithm is compared to two standard approaches, one that discards missing values and one that sets all missing values to zero. We have evaluated this approach in two different gel data sets of different biological origin. An R-program, implementing the algorithm, is freely available at http://bioinfo.thep .lu.se/MissingValues2Dgels.html. Keywords: 2D-PAGE • Missing values • maximum likelihood

* Corresponding author. E-mail: [email protected]. † Department of Theoretical Physics, Lund University. ‡ Department of Experimental Medical Science Division of Diabetes, Metabolism and Endocrinology Molecular Endocrinology Group, Lund University. § Wallenberg Neuroscience Center, Lund University. | Department of Protein Technology, BMC D13, Lund University. 10.1021/pr070137p CCC: $37.00

 2007 American Chemical Society

Introduction Proteomics is the study of the function, structure, and regulation of proteins and includes the ability to quantify and identifylargenumbersofproteinssimultaneouslyandreproducibly.1-4 A typical application of proteomics is to find the proteins that are up- or downregulated or post-translationally modified when comparing two or more biological groups. Usually, each group is represented by one or more biological and technical replicates. A very common way of measuring relative protein abundances is by means of two-dimensional polyacrylamide gel electrophoresis (2D-PAGE)5-7 followed by mass spectrometry for identification of the proteins. An alternative method is multidimensional peptide separation of differentially isotopically labeled samples directly coupled with a mass spectrometer, for example, ICAT,8 cICAT,9 and iTRAQ.10 The focus of this paper is the use of 2D-PAGE for protein abundance quantification. After the separation, the gel is scanned optically, and an image is generated. The spots, which usually represent proteins, are detected either manually or using a spot detection algorithm. The volumes of the spots, which should be more or less proportional to the protein abundances, are measured. In a newer technique called differential in-gel electrophoresis (DIGE),11 several samples, stained with different fluorescent dyes, are run on the same gel, which allows the use of a reference sample. To compare protein abundances across samples, it is necessary to match spots across gels. This is usually done by designating one of the gels as the master gel and matching all others to that one, either manually or automatically,12 though some newer algorithms use an ‘all against all’ matching approach such as the Definiens (Munich, Germany) Proteomeweaver software package. Sometimes, virtual spots are added to the master gel to allow for spots that are not present on the master gel to be matched. In principle, all spots should be matched in a one-to-one fashion resulting in a data set of spot volumes arranged as a rectangular matrix with proteins as rows and gels as columns. The number of rows in the data set is equal to the number of spots on the master gel including possible virtual spots. Each row represents in principle one isoform of a protein. However, in practice, a spot could be a mixture of several proteins or even noise (dye precipitation, streaking, etc.). From hereon, we will use the term protein to represent each row in the data set; that is, a protein is a collection of matched spots across the gels. Journal of Proteome Research 2007, 6, 3335-3343

3335

Published on Web 07/11/2007

Probabilistic Treatment of the Missing Spot Problem in 2DE

technical notes

Unfortunately, a matching spot cannot always be found on all gels, and the corresponding entries in the data matrix must be designated as missing values. There are two distinct explanations for the absence of a spot: either the abundance of the protein corresponding to that spot is below the threshold of detection or an error has occurred. Errors include problems with the gel, incorrect matching, and problems with detection and separation of nearby spots. Here, it is necessary to distinguish between DIGE gels and single sample gels. This study solely concerns post-stained single sample gels. For DIGE gels, the situation is different, since a common reference sample is used on all gels in one dye channel. The DIGE reference sample should contain a detectable amount of all studied proteins, which is usually done by using a pool of all biological sample. The presence of a common reference sample implies that a spot cannot be missing due to low abundance but only due to errors. The protein is known to be present in the reference sample because it is present on other gels. The analysis of missing values in DIGE gels is interesting but will not be pursued in this paper.

values by extrapolating the values of spots from proteins with similar expression behavior in the measured values;13-19 and (4) transformation of all values including the missing values.20 This method is a generalization of the imputation methods in (3).

The fact that there are two reasons for the absence of spots makes it hard to perform a statistical analysis of a 2D-gel data set. Traditionally, two simple approaches are taken. One approach is to replace all missing values with zero. This would be approximately correct if the absence of a spot was solely due to low abundance. However, this approach would give severely misleading results if the spot was absent due to an error. In this study, we call this method ‘Missing value to Zero’ (MZ). The other simple approach is to ignore all the missing values and only perform the statistical analysis on the measured values. In this study, we call this method ‘Discard Missing values’ (DM). Proteins identified with DM as significantly differentially expressed between biological groups are unlikely to be false positives. The problem, however, is that the proteins with the highest fold changes would be excluded from the analysis. The highest fold changes occur when a protein is truly absent in one group. In this case, most or all of the measurements would be missing in the low group, and the DM method would not have enough measurements left to see a significant change between the groups. These proteins are often biologically relevant, for example, a protein that is absent in a tumor with a chromosomal deletion of the corresponding gene. We give examples of proteins that are missed by MZ and DM below. Missing values also occur in microarray data, albeit on a relatively much smaller scale; whereas post-stained 2D gels have around 50% missing values, microarrays typically have, at the very most, 20% missing values. Usually, microarray data are filtered by removing genes with more than 10-20% missing values. The vast majority of genes survive this filtering, unlike 2D gel data where such a filtering would discard most of the proteins. Since, after initial filtering, every gene in microarray data sets has a small proportion of missing values, the fold change of genes between biological groups is quite insensitive to the treatment of the missing values. The reason that missing values need to be taken care of is actually more technical; many multivariate methods, including various versions of unsupervised and supervised classification, cannot handle missing values. In the microarray literature, missing values have so far, to our knowledge, been treated in one out of four ways: (1) discarding missing values and only performing analysis that does not require complete data sets; (2) replacement of missing values with zero or the row average; (3) imputation of missing 3336

Journal of Proteome Research • Vol. 6, No. 8, 2007

The focus of this study is to find population quantities, that is, the average protein abundances in a group along with confidence intervals, fold changes between groups, and pvalues for up- or downregulation between two biological groups. Our aim is not to predict the value of particular missing spots or to devise methods for treating missing values in multivariate analysis like unsupervised and supervised classification. The first two methods mentioned above for microarrays correspond to the DM and MZ methods as defined above. The imputation and transformation methods are unreasonable for population quantities of individual proteins in 2D gel data, because the resulting fold changes and p-values would be too dependent on the imputed values, which, for many proteins, would constitute the majority of the measurements. The treatment of missing values is most important for proteins with many missing values, where imputation methods have the least validity. Imputation has been applied to single sample 2D gels21 and DIGE gels.22,23 In Chang et al.,21 values were only imputed for proteins with at most 20% missing values. An approach must be taken that employs the information of missing spots, but in a more sophisticated way than setting all missing spot volumes to zero. Such an approach was presented by Wood et al.,24 where the probability, for each protein, of observing a set of measured values and missing values is modeled as a product of a binomial distribution for the absence/presence of spots and a conditional probability for the values of the measured spots. We also assign probabilities of absence/presence of spots, and in case of presence, a conditional probability of the measured values. In contrast, we model the probability of spot presence with a global function of protein abundance. This function is estimated from all spots on the gels. So, the effect of a missing spot depends on how large the volumes of the replicate measurements are compared to other proteins on the gels and the number of missing values in those other proteins. This has the advantage that the interpretation of a missing value depends on the replicate measurements; if the replicates are highly expressed relative to most other spots on all the gels, the missing value is more likely to be an error than a sign of very low abundance, and vice versa. We estimate the parameters of the model by maximum likelihood (ML). The model, which is explained in detail in the Experimental Section, can be used in conjunction with various experimental designs. We have implemented the model in the case of a number of biological groups with nested biological and technical replicates. The conditional probability distribution of the measured values is taken to be a linear mixed model,25 and the probability of presence is modeled by a logistic function of logarithmic abundance. By the use of this model in conjunction with bootstrap sampling,26 confidence intervals and statistical significance for proteins to be up- or downregulated, in a comparison between two or more groups, can be assigned. We have analyzed two different 2D gel data sets from our laboratory, and for each of these, we have compared our Maximum Likelihood (ML) algorithm with the ‘Discard Missing values’ (DM) and ‘Missing values to Zero’ (MZ) approaches. The R27 program implementing the maximum likelihood

technical notes

Krogh et al.

algorithm is publicly available at http://bioinfo.thep.lu.se/ MissingValues2Dgels.html.

not necessarily independently distributed over s. The total probability density of the observed data is

Experimental Section

P({Vpg}|{Hp}, {σpG},R, β) )

Gel Experiments. Two data sets of different biological origin were chosen. The mouse liver data set consists of 48 gel images. Two sets of 8 animals were kept under two different conditions, group A and B, and then sacrificed. Group A animals were fed with a fat diet, and group B animals were fed with a normal diet. The protein extracts were run as experimental triplicates. The first dimension was run using 24 cm Immobiline drystrips pH 4-7 (GE Healthcare). The second dimension was run in an Ettan DALT II system on 12.5% SDS-PAGE gels. The gels were stained in ruthenium bathophenanthroline disulfonate (RuBPS) as in ref 28. The gels were scanned using a Typhoon 9400 (GE Healthcare) variable imager. The 2D-PAGE image computer analysis was carried out using ImageMaster 2D platinum software. The rat brain data set consists of 34 gel images. The rats were decapitated, and the brains were kept at 33 °C (group A, 6 animals) or at 37 °C (group B, 6 animals) for 15 min, after which they were snap-frozen to -80 °C. The 5 group C animal brains were snap-frozen immediately after decapitation. Neocortical tissue was homogenized and centrifuged, and the cytosolic fraction, as opposed to the membrane fraction, was used. Each protein extract was run in duplicate. The first dimension was run using 24 cm Immobiline drystrips pH 3-10 NL (GE Healthcare). The second dimension was run on 12.5% SDSPAGE gels. The gels were stained using Gelcode Blue Stain Reagent (Pierce Biotechnology, Rockford, IL) according to the manufacturer’s description and scanned on an ImageScanner scanner (GE Healthcare). Spot detection and matching was done in ImageMaster 2D platinum v. 5.0 (GE Healthcare). These data sets are available upon request from the authors. Maximum Likelihood Model. This model deals with a set of single sample gels. The log2 abundance of protein p in sample s is denoted by Tps. The abundance is measured in units of intensity volume, and we do not address the problem of converting it to actual protein copy number. We assume that the probability for a spot to be present q(Tps) depends only on the abundance and has the form log

(

q(Tps)

)

1 - q(Tps)

) R + βTps

(1)

with global parameters R and β common to all gels in one experiment. We assume that the log2 intensity volume of a measured spot, Vps , follows a normal distribution Vps ∼ N(Tps, σpG)

q(Tps)

x

2πσ2pG

-((Vps-Tps)2/2σpG2)

e

ps

p

and the probability of absence is 1 - q(Tps). For each protein, p, the log2 abundances Tps themselves are assumed to follow a joint distribution with density DHp({Tps}), where Hp denotes protein dependent parameters of the distribution. The Tps are

(

∏ (1 - q(T

ps )))

s

∏ s

q(Tps )

)

e-((Vps-Tps) /2σpG ) 2

x

2

2πσ2pG

(4)

where again q(Tps) is given by eq 1. This framework is general and can be used in connection with various experimental designs including time series. The specific information about the design is encoded in the distribution DHp. The parameters {Hp} are the biologically relevant quantities. Restricting to the designs relevant for the experiments in this study, that is, a number of biological groups with nested biological and technical replicates, we model the log2 abundance, Tpijk , as Tpijk ) µpi + pij + pijk

(5)

where p denotes the protein, i the biological group, j the biological replicate, and k the sample replicate and µpi is the mean log2 protein abundance of protein p in biological group i. The biological and sample variations, pij and pijk are assumed to be normally distributed with mean zero and proteindependent standard deviations σpB and σpS , respectively. This model is called a linear mixed model and is well-known in the literature.25 The corresponding probability distribution D({Tpijk}) is obtained by integration over pij and pijk while imposing eq 5. D({Tpijk}) )

∏∫ p,i,j

e-(pij /2σpB ) 2

2

x

∏∫

dpij

e-(pijk /2σpS ) 2

k

2πσ2pB

2

x

δ(Tpijk -

2πσ2pS

(µpi + pij + pijk))dpijk (6)

where δ(x) is the usual delta function, or more correctly delta distribution, which satisfies

∫ δ(x)dx ) 1

and δ(x) ) 0

for x * 0

(7)

The integrals in eq 6 can be performed analytically. First, the integrals over pijk are performed by removing the delta function and replacing Tpijk with µpi + pij + pijk. Next, the integrals over pij are performed by repeated use of the general formula -((x-µ1)2/2σ12)

∫e

e-((x-µ2) /2σ2 )

x2πσ12

2

x2πσ22

e-((µ1-µ2) /2(σ1 +σ2 )) 2

2

dx )

2

2

x2π(σ12 + σ22)

(8)

The final result is D({Tpijk}) )

∑ ∏ x2π(σ e-((

(3)

absent Hp({Tps })( present

(2)

where σpG is the technical variation in log2 intensity volume of protein p due to the gel experiment. So, the probability density that a measurement Vps is obtained for a spot which comes from a protein with log2 sample abundance Tps is given by 1

∫ ∏ dT ∏ D

kTijk-Nijµi)

p,i,j

2 pS

2/2N

ij(σpS

2+N

2

ijσpB

+ NijσpB )

2))

e-(





l(Tijl -

2 2 kTijk/Nij) /2σpS )

x2πσ

(9) 2 pS

where Nij is the number of technical replicates of the jth biological replicate in group i. Substituting D into eq 4 gives the full model for this experimental design. The integral over Tpijk is analytically intractable because of the Tpijk dependence Journal of Proteome Research • Vol. 6, No. 8, 2007 3337

Probabilistic Treatment of the Missing Spot Problem in 2DE

technical notes

Figure 1. The rate of presence as a function of volume. The rate of presence for the spots is plotted as a function of average loge spot volume in the 24 gels in group A of the mouse data set. The rate of presence is calculated, for each of the 899 proteins, as the fraction of the 24 gels where the protein was present. The volumes are binned in intervals of size 0.3, and the rate of presence is depicted in a box plot. The bold lines represent the mean values of the bins, the boxes span the 25% to the 75% percentiles, and the whiskers extend to the most extreme points which have a distance to the box of no more than 1.5 times the box size. Points outside the whiskers (outliers) are shown with circles. The solid curve is a logistic regression fit to the un-binned data. The regression function is of the form of eq 1 with R ) 3.7 and β ) 1.2. The p-value that β is different from zero is < 2 × 10-16.

of q(Tpijk). We make the approximation of replacing q(Tpijk) with q(µpi), which is justifiable when the variation of q is small on the scale set by the deviation between Tpijk and µpi. This deviation is governed by the biological and sample variation which typically are at most 0.5. As seen in Figure 1, q does not vary much on that scale. The probability of presence, q, varies over the full dynamic range on the gel, but not much over the replicates within one protein. Making this assumption, the integral over Tpijk can be calculated analytically by use of eq 8. The result is P({Vpijk}|{µi}, {σB}, {σS}, {σG}, R, β) )

∏ (1 - q(µ p,i,j

x

2 pS

Mij

q(µpi)Nij



1

x2π(σ

pi))

e- ((

2 2 2 2 kVijk-Nij µi) /2Nij(σpS +σpG +NijσpB ))

+ σpG2 + NijσpB2)

1

e- (

∑ ∑ l(Vijl-

2 2 2 kVijk/Nij) /2(σpS +σpG ))

(10)

2π(σpS2 + σpG2)

where Nij is the number of present values in replicate j of group i, Mij is the number of missing values in replicate j of group i, and q is given by eq 1. The sample variation σpS and gel variation σpG combine into one common technical variation, σpT, as usual σ 2pT ) σ 2pS + σ 2pG

(11)

In the case of no technical replicates, that is, Nij ) 1, the 3338

Journal of Proteome Research • Vol. 6, No. 8, 2007

biological variation combines with the other two as well. Eq 10 is the formula that is used, with maximum likelihood estimation of parameters, to analyze all gel data in this paper. Implementation of Maximum Likelihood. We have implemented an R27 program that estimates the parameters µi, σpB, σpT, R, and β by an approximation to maximum likelihood of eq 10. To increase the speed of the implementation, we set σpT and σpB equal to the sample standard deviations of the measured values within the technical and biological replicates, respectively. When µi is close to the mean value of the present values alone, this estimate of σpT and σpB is very close to the exact value. When µi is much smaller than the mean value of the present values alone, this estimate of σpT and σpB is too small, and will have the conservative effect of bringing the estimate of µi closer to the measured values. This is not unreasonable, since it is very unlikely that the measured values could be distributed with a much smaller variance than the real variance. The parameters µi, R, and β are updated iteratively. For fixed R and β, the likelihood function splits into a product over i of functions of µi, and the µi are updated to maximize these functions. For fixed µi, R and β are updated iteratively; first R, then β, then R again, and so forth. until convergence. Then the µi are updated again, and so forth. The optimization stops when the current values of all parameters are close (0.01) to their values one step earlier. All onedimensional optimizations are done by the R function ‘optimize’. For proteins with only one measured value in every group, σpB and σpT are undefined and the means of σT and σpB over all other proteins are used. For proteins with no measured

technical notes

Krogh et al.

values, σpT and σpB are not needed and µi is set to minus infinity, or equivalently, the group mean abundance is set to zero. The R implementation of the algorithm is available at http:// bioinfo.thep.lu.se/MissingValues2Dgels.html.

method, which, using the individual p-values, controls the falsediscovery rate.

Bootstrap Confidence Intervals and p-Values. The maximum likelihood approach provides estimates for the parameters but does not give confidence intervals for the group abundances. Confidence intervals could be deduced from σpB and σpT. However, there are three problems that then arise. First, for proteins with zero or one present value in a biological group, there is no reliable estimate of σpB and σpT. Second, the estimates of σpB and σpT themselves are uncertain, and it is not appropriate to use them to deduce confidence intervals and p-values. Third, the confidence intervals depend crucially on the model assumptions, for example, normality and the logistic function for the probability of presence. To overcome these problems, we have chosen to employ bootstrap sampling.26 Given N gels, a bootstrap sample is a set of N gels drawn with replacement from the original set, that is, some gels could now occur more than once. We choose to fix the number of gels within each biological group but to allow the bootstrap sample to have another distribution of biological and technical replicates; that is, we bootstrap freely within each biological group. The information about biological samples is kept, but the number of technical replicates can vary. For example, if there are 10 gels in a group split into two biological samples with 5 technical replicates of each, a bootstrap sample could contain 7 and 3 technical replicates of the two biological samples, respectively. Alternatively, one could have kept all replicate numbers fixed. The advantage of allowing the number of technical replicates to change is to gain more bootstrap samples for small experiments. The distribution of these bootstrap values can be used to find confidence intervals: for example, the 2.5% percentile and the 97.5% percentile in the bootstrap distribution define the lower limit and the upper limit, respectively, in a 95% confidence interval. In a comparison between two groups, the bootstrap values are calculated for each of the two groups separately. If group 1 has a higher value than group 2 for a protein, the one-sided p-value that rejects the null hypothesis of equal volumes is taken to be the fraction of pairs for which group 2 is higher than group 1. The two-sided p-value is taken as twice the one-sided p-value.

Missing Values as a Function of Abundance. If missing values are to be corrected for, it must be the case that spots are more likely to be present the higher the protein abundance is. Even though such a connection between spot presence and protein abundance is known in the literature,24,30 this must be confirmed. As a first simple check of this connection, we used the mouse liver group A data set, which consists of 24 gels with 899 detected proteins (or isoforms of proteins). Totally, 205 of the 899 proteins had 0 missing values, and the average of the corresponding 205 × 24 ) 4920 spot volumes was 0.26. In the other extreme, 9 of the proteins had 23 missing values and the 9 × 1 ) 9 corresponding volumes averaged 0.04. This strongly suggests that spots from high-abundance proteins are more likely to be present than spots from low-abundance proteins. Thus, we calculated the rate of presence, for each of the 899 proteins, as the fraction of the 24 gels where the protein was present. Figure 1 is a box plot of the rate of presence as a function of the average logarithmic volume of the present spots. A box plot, with binned volumes, is used to visualize the distribution instead of a dot plot because of the high number of data points. It is clear that the rate of presence increases with volume. Also shown in the figure is a logistic function of the form given in eq 1 fitted to the data. The logistic function is increasing (p-value < 2 ‚10-16). An increase in spot presence with volume was seen for all groups in both data sets. The logistic regression function is only an approximate description of the connection between presence and abundance. However, it is best to avoid overfitting the function so as to achieve a more robust algorithm. Estimated Abundances. For both data sets, the mean protein volumes in all groups were estimated in three ways: (1) using the maximum likelihood (ML) algorithm presented here; (2) taking the exponential of the average of the logarithm of all measured values, discarding missing values (DM); and (3) setting missing values to zero, and then taking the average (MZ). The full details are given in the Experimental Section. Besides the treatment of missing values, an important difference is that ML and DM work with logarithmic volumes, whereas MZ works directly with volumes. MZ cannot use logarithms since missing values are set to zero. Figure 2 shows the values for group A in the mouse data set as an example. The upper part of Figure 2 shows ML/DM for all proteins ordered according to the number of missing values. The lower part of Figure 2 shows the equivalent for ML/MZ. As expected, the three methods more or less agree for proteins with few missing values, whereas the ML method typically gives larger values than the MZ method and smaller values than the DM method for proteins with many missing values. The same behavior is seen for all groups in both data sets. Overlap of Significant Proteins between the Three Methods. Using bootstrap sampling, we calculated the p-values of differential expression between all pairs of groups in the mouse liver and rat brain data set, respectively, for each of the three methods DM, MZ, and ML. In all cases, the DM method found more significant proteins than the MZ method which again found more than the ML method. For example, in the comparison of groups A and C in the rat brain data, which was the pairwise comparison with the most clear separation between the groups, the DM method found 219

Discard Missing Values Method. In the Discard Missing values method (DM), all missing values are ignored. The estimates of population means are obtained with a linear mixed model in the same way as in the maximum likelihood approach discussed above. Logarithmic spot volumes are used in the linear mixed model. Bootstrap sampling is used in the same way to obtain confidence intervals and p-values. Missing Values to Zero Method. In the Missing values to Zero method (MZ), all missing values are replaced by zero. Actual volumes instead of logarithmic volumes are used because of the zeroes. A linear mixed model with normal distributions does not make sense since actual volumes, including zeroes, are not normally distributed. Population means are calculated by averaging values within each biological sample, and averaging the biological replicates to get the average within a group. Bootstrap sampling is used for confidence intervals and p-values. Multiple Hypothesis Testing. All three methods provide p-values for group-wise comparisons. To account for multiple hypothesis testing, we employ the Benjamini-Hochberg29

Results and Discussion

Journal of Proteome Research • Vol. 6, No. 8, 2007 3339

Probabilistic Treatment of the Missing Spot Problem in 2DE

technical notes

Figure 2. Protein abundances measured by the three methods. The mean protein abundances for the 899 proteins in group A of the mouse data set were estimated with the three methods ML, DM, and MZ. The upper part is a plot of ML/DM as a function of the number of missing values. The lower plot is for ML/MZ. Every black dot represents a protein. It is seen that, for proteins with few missing values, all methods agree, whereas for proteins with many missing values, the ML method yields a lower estimate than the DM method but higher than the MZ. Note that MZ and DM do not exactly agree for zero missing values, because MZ averages volumes and DM averages log volumes.

proteins at the 0.01 level (Benjamini-Hochberg FDR ) 0.03), the MZ method found 178 proteins at the 0.01 level (BenjaminiHochberg FDR ) 0.04), and the ML method found 123 proteins at the 0.01 level (Benjamini-Hochberg FDR ) 0.06). There could be two fundamentally different reasons for the fact that the

ML method finds fewer proteins than the two others; either the ML method misses some proteins, or the other two report false positives. These cannot easily be distinguished with the data sets here. Figure 3 shows Venn diagrams of the overlap between the most significant proteins found by the three

Figure 3. Overlap of differentially expressed proteins. The overlap of the most significantly differentially expressed proteins with the three methods are shown in Venn diagrams. For the upper 4 diagrams, the K most significant proteins are used for a fixed K. For the lower 4 diagrams, a fixed p-value cutoff is used. The title of each Venn diagram shows the data set, the groups that are compared, and the fixed number of proteins (K) or p-value cutoff, respectively. The number of fixed proteins (K) and p-value cutoffs are chosen such that all protein lists have a false-discovery rate of at most 0.05. From the upper 4 diagrams, it can be seen that ML and MZ are quite similar compared to DM, and that the ML method captures essentially all proteins that are common to the two other methods. The lower diagrams have a different distribution of numbers because the DM method finds more significant proteins than MZ, which again finds more significant proteins that ML. However, even in the lower diagrams, where the ML method has the lowest number of proteins, it is true that the ML method captures most proteins that are common to the two other methods. 3340

Journal of Proteome Research • Vol. 6, No. 8, 2007

technical notes

Krogh et al.

Figure 4. Disagreement between the three methods. A plot of the expression values of three proteins from the rat brain data set. There are 10 gels in group A, and 12 in each of group B and C. Only present spots are plotted. (Left plot) There are 9 missing values in group A and 2 in group B. For the ratio (group A)/(group B), MZ finds a value of 0.3 (p ) 0.001), ML finds 0.1 (p ) 0), whereas DM finds 1.0. (Middle plot) There are 0 missing values in group A and 7 in group C. For the ratio (group A)/(group C), DM finds a value of 0.5 (p ) 0), ML finds 0.7 (p ) 0.01), whereas MZ finds 1.4. (Right plot) There are 3 missing values in group B and 7 in group C. For the ratio (group B)/(group C), DM finds a value of 0.2 (p ) 0), MZ finds 0.3 (p ) 0.03), whereas ML finds 1.2.

methods. The upper part of Figure 3 shows the Venn diagrams using a fixed number of significant proteins, whereas the lower part is made with a fixed p-value cutoff. The Venn diagrams with a fixed number of significant proteins have the advantage that they show the overlap between the most significant proteins irrespective of the magnitudes of the p-values. At least two conclusions can be drawn from Figure 3: (1) The ML and MZ methods are similar, while the DM method is an outlier. This is probably due to the fact that the MZ and ML methods both utilize the missing values unlike the DM method. (2) Most proteins found by both DM and MZ are found by ML as well. This is reassuring, since such proteins are likely to really be differentially expressed. A similar statement is true for MZ albeit to a slightly lesser extent. To gain more insight into the differences between the methods, we calculated the number of proteins that are significantly differentially expressed by two of the methods in the same direction (up or down, respectively), and where the third method finds a, not necessarily significant, regulation in the opposite direction (down or up, respectively). For all four pairwise comparisons combined and with a p-value cutoff of 0.05, the DM method disagreed on a total of 11 proteins, the MZ method disagreed on a total of 3 proteins, and the ML method disagreed on a total of one protein. None of the disagreeing methods claimed significance. They just report an

insignificant fold change in the opposite direction. It is illustrative to look at the pattern of such proteins. Figure 4 shows the expression values of three proteins where one of the three methods disagreed with the other two. The left part of the figure shows a protein where DM disagrees. All the missing values are in group A, which makes MZ and ML conclude that the protein has lower expression in group A. DM, on the other hand, only sees the single measurement, and finds a fold change around 1. This protein illustrates very well the major flaw with DM; it ignores missing values completely. The middle part of Figure 4 shows a protein where MZ disagrees. DM and ML find the protein to be higher in group C, whereas MZ reports a higher expression in group A. Whereas the result of DM is obvious, it is interesting to understand why MZ and ML disagree here. MZ sets the 7 missing values in group C to zero, which pulls group C down below group A. ML, on the other hand, explains the 7 missing values in group C as a chance event. The probability to be missing, 1 - q , is 0.70 for spot volumes around 0.02, so missing values are expected and should not pull down the estimate of the group mean as severely as MZ does. The right part of Figure 4 is very similar to the middle part. However, here MZ, together with DM of course, assigns higher mean expression to group C, and ML assigns higher, but insignificant, expression in group B. The crucial difference between the middle and right part of the Journal of Proteome Research • Vol. 6, No. 8, 2007 3341

Probabilistic Treatment of the Missing Spot Problem in 2DE

technical notes

figure is that the volumes are 20-40 times higher in the right part of the figure. For volumes around 0.6, the probability to be missing, 1 - q, is 0.02. For the ML method, a mean expression of around 0.6 in group C is irreconcilable with the 7 missing values out of 12 in group C. Instead ML assigns a much lower mean value in group C, and explains the high measured values with a large variance. MZ also pulls down the mean value in group C but not as much as ML, which might seem counterintuitive. The main reason for that is that ML works with logarithms of volumes, whereas MZ works with volumes. A small volume pulls an average down more when logarithmic volumes are used as compared to when raw volumes are used. Thus, one cannot make a definitive statement about the true fold changes for these three proteins with the given information. However, manual interpretation provides some clarification. The protein where DM disagrees has lower expression in group A, and the 9 missing values out of a total of 10 are due to low protein abundance. So MZ and ML are probably correct with regard to this protein. The interpretations of the other two proteins are harder. For both of them, the values in group C have an unusual distribution. There are 5 measured values with relatively high spot volumes and 7 missing values. For both proteins, all 5 measured values are from three animals; 2 animals have both duplicates present and one animal has a present and a missing spot. Furthermore, the three animals with missing values in one of the proteins are the ones with present values in the other protein and vice versa. Thus, one can surmise that group C is heterogeneous, with three animals in one subgroup and the other three in another subgroup, and for some proteins, including these two, the distribution of abundances is, at the minimum, bimodal. The two proteins are not in the same region on the gels. The whole concept of mean abundance, and fold changes, implicitly assumes a distribution with a peak in the center. The manual inspection of the proteins highlights another point, namely, that statistical methods only work within their assumptions, and that automated high-throughput analysis runs the risk of claiming significant up- or downregulation in cases where it makes little sense. One should note that these two proteins represent rare cases as is evident from the numbers above.

Since we do not know the absolute protein abundances in the samples, it is not possible to compare the algorithm with real values. However, since the ML algorithm only uses very plausible assumptions, we believe that the ML method provides reasonable estimates of protein abundance. It is a strength of the ML method that it finds most of the significant proteins common to the other two methods as illustrated in Figure 3.

Conclusion We have developed a maximum likelihood approach that estimates the mean protein abundances in a group of gels by using both the number of missing values and the measured spot volumes. By bootstrap sampling, confidence intervals and p-values are obtained as well. The smallest number of gels where the ML method could be used is two gels per biological group. However, only a simple estimate of the protein abundance would be obtained in that case. It is commonly assumed that 200 bootstrap samples, requiring at least 6 gels, in a biological group are sufficient to obtain reliable confidence intervals and p-values. The main assumption of the algorithm is that the probability of a spot to be missing only depends on the abundance of the corresponding protein. As Figure 1 illustrates, it is a reasonable assumption. However, it is not perfect. Possible refinements could be to introduce a global parameter for each gel representing its quality, to distinguish between different regions on the gels, to use the background intensity on the gels, or more advanced, to have the spot detection and matching program assign quality scores to the matching of each spot. 3342

Journal of Proteome Research • Vol. 6, No. 8, 2007

The R implementation of the ML algorithm, together with DM and MZ, is available at http://bioinfo.thep.lu.se/ MissingValues2Dgels.html.

Acknowledgment. M.K. and P.J. were supported by the Swedish Foundation for Strategic Research and the Knut and Alice Wallenberg Foundation through the Swegene consortium and the Strategic Science Foundation (SSF) CREATE Health centre. C.F. thanks Ann-Helen Thore´n Fischer for animal breeding and Anna-Karin Påhlman for help with 2-D gel electrophoresis. C.F. was supported by the Swedish National Research School in Genomics and Bioinformatics. M.T. thanks The Lundbeck Foundation for financial support and Ulrika Brynnel for expert technical help. S.B. thanks cancerfonden for financial support. We are grateful to Carl Troein for proofreading the manuscript. References (1) Stults, J. T; Arnott, D. Methods Enzymol. 2005, 402, 245-289. (2) Govorun, V. M.; Archakov, A. I. Biochemistry (Moscow) 2002, 67, 1109-1123. (3) Pandey, A.; Mann, M. Nature 2000, 405, 837-846. (4) James, P. Proteome Research: Mass Spectrometry; SpringerVerlag: Berlin, 2001. (5) O’Farrell, P. H. J. Biol. Chem. 1975, 250, 4007-4021. (6) Klose, J. Humangenetik 1975, 26, 231-243. (7) Scheele, G. A. J. Biol. Chem. 1975, 250, 5375-5385. (8) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994-999. (9) Hansen, K. C.; Schmitt-Ulms, G.; Chalkley, R. J.; Hirsch, J.; Baldwin, M. A.; Burlingame, A. L. Mol. Cell. Proteomics 2003, 2, 299-314. (10) Ross, P. L.; Huang, Y. N.; Marchese, J. N.; Williamson, B.; Parker, K.; Hattan, S.; Khainovski, N.; Pillai, S.; Dey, S.; Daniels, S.; Purkayastha, S.; Juhasz, P.; Martin, S.; Bartlet-Jones, M.; He, F.; Jacobson, A.; Pappin, D. J. Mol. Cell. Proteomics 2004, 3 (12), 1154-1169. (11) U ¨ nlu ¨ , M.; Morgan, M. E.; Linden J. S. Electrophoresis 1997, 18, 2071-2077. (12) Anderson, N. L.; Taylor, J.; Scandora, A. E.; Coulter, B. P.; Anderson, N. G. Clin. Chem. 1981, 27 (11), 1807-1820. (13) Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R. B. Bioinformatics 2001, 17 (6), 520-525. (14) Ouyang, M.; Welsh, W. J.; Georgopoulos, P. Bioinformatics 2004, 120 (6), 917-923. (15) Kim, K. Y.; Kim, B. J.; Yi, G. S. BMC Bioinf. 2004, 5, 160. (16) Sehgal, M. S. B.; Gondal, I.; Dooley, L. S. Bioinformatics 2005, 21 (10), 2417-2423. (17) Kim, H.; Golub, G. H.; Park, H. Bioinformatics 2005, 21 (2), 187198. (18) Bø, T. H.; Dysvik, B.; Jonassen, I. Nucleic Acids Res. 2004, 32 (3), e34. (19) Scheel, I.; Aldrin, M.; glad, I. K.; Sorum, R.; Lyng, H.; Frigesi, A. Bioinformatics 2005, 21 (23), 4272-4279. (20) Johansson, P.; Hakkinen, J. BMC Bioinf. 2006, 16 (7), 306. (21) Chang, J.; Van Remmen, H.; Ward, W. F.; Regnier, F. E.; Richardson, A.; cornell, J. J. Proteome Res. 2004, 3, 1210-1218. (22) Jung, K.; Gannoun, A.; Sitek, B.; Meyer H. E.; Stu ¨ hler, K.; Urfer, W. REVSTAT - Statist. J. 2005, 3 (2), 99-111. (23) Jung, K.; Gannoun, A.; Sitek, B.; Apostolov, O.; Schramm, A.; Meyer H. E.; Stu ¨ hler, K.; Urfer, W. REVSTAT - Statist. J. 2006, 4 (1), 67-80. (24) Wood, J.; White, I. R.; Cutler, P. Signal Process. 2004, 84, 17771788.

technical notes (25) McCulloch, C. E.; Shayle R. S. Generalized, Linear, and Mixed Models; John Wiley & Sons: New York, 2001. (26) Efron, B.; Tibshirani, R. J. An Introduction to the Bootstrap; Chapman & Hall/CRC: New York, 1993. (27) Ihaka, R.; Gentleman, R. J. Comput. Graph. Statist. 1996, 5 (3), 299-314.

Krogh et al. (28) Rabilloud, T.; Strub, J. M.; Luche, S.; van Dorsselaer, A.; Lunardi, J. Proteomics 2001, 1 (5), 699-704. (29) Benjamini, Y.; Hochberg, Y. J. R. Stat. Soc. 1995, B 57, 289-300. (30) Meleth, S.; Deshane, J.; Kim, H. BMC Biotechnol. 2005, 11, 5:7.

PR070137P

Journal of Proteome Research • Vol. 6, No. 8, 2007 3343