Analyzing Large-Scale Proteomics Projects with Latent Semantic Indexing

Sebastian Klie,*,†,§ Lennart Martens,*,‡,§ Juan Antonio Vizcaíno,‡ Richard Côté,‡ Phil Jones,‡ Rolf Apweiler,‡ Alexander Hinneburg,† and Henning Hermjakob‡

Martin Luther University Halle-Wittenberg, Halle-Saale, Germany, and EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, U.K.

Received July 23, 2007

Since the advent of public data repositories for proteomics data, readily accessible results from high-throughput experiments have been accumulating steadily. Several large-scale projects in particular have contributed substantially to the amount of identifications available to the community. Despite the considerable body of information amassed, very few successful analyses have been performed and published on this data, leaving the ultimate value of these projects far below its potential. A prominent reason published proteomics data is seldom reanalyzed lies in the heterogeneous nature of the original sample collection and the subsequent data recording and processing. To illustrate that at least part of this heterogeneity can be compensated for, we here apply a latent semantic analysis to the data contributed by the Human Proteome Organization's Plasma Proteome Project (HUPO PPP). Interestingly, despite the broad spectrum of instruments and methodologies applied in the HUPO PPP, our analysis reveals several obvious patterns that can be used to formulate concrete recommendations for optimizing proteomics project planning as well as the choice of technologies used in future experiments. It is clear from these results that the analysis of large bodies of publicly available proteomics data by noise-tolerant algorithms such as the latent semantic analysis holds great promise and is currently underexploited.

Keywords: bioinformatics • data mining • proteomics • latent semantic analysis

Introduction

The field of proteomics has undergone several dramatic changes over the past few years. Advances in instrumentation and separation technologies1,2 have enabled the advent of high-throughput analysis methods that generate large amounts of proteomics identifications per experiment. Many of these data sets were initially only published as supplementary information in PDF format and, while available, were not readily accessible to the community. Obviously, this situation led to large-scale data loss and was perceived as a major problem in the field.3,4 Several public proteomics data repositories, including the Global Proteome Machine (GPM),5 the Proteomics Identifications Database (PRIDE),6,7 and PeptideAtlas8 were constructed to turn the available data into accessible data, thereby reversing the trend of increasing data loss. As a case in point, several large-scale proteomics projects that have recently been undertaken by the Human Proteome Organisation (HUPO), including the Plasma Proteome Project

* Corresponding authors. Sebastian Klie, Martin Luther University Halle-Wittenberg, Halle-Saale, Germany. E-mail: [email protected]. Lennart Martens, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, U.K. E-mail: [email protected]. † Martin Luther University Halle-Wittenberg. ‡ EMBL Outstation, European Bioinformatics Institute. § These authors contributed equally to this work.

Journal of Proteome Research 2008, 7, 182–191. Published on Web 11/30/2007. DOI: 10.1021/pr070461k

(PPP)9 and the Brain Proteome Project (BPP),10 have published all of their assembled data in one or more of these repositories. As a result, their findings are readily accessible to interested researchers. It is therefore remarkable to see that very little additional information has so far been extracted from the available data. One of the rare examples where the analysis of large proteomics data sets resulted in a practical application is the recent effort by Mallick and co-workers in which several properties of a large amount of identified peptides were used to fine-tune an algorithm that can predict proteotypic peptides from sequence databases.11 We here present a novel way to reveal the information that lies hidden in large bodies of proteomics data, by analyzing them for latent semantic patterns. The analysis performed here focused on the HUPO PPP data as available in the PRIDE database. Briefly, the HUPO PPP sent out a variety of plasma and serum samples, collected from different ethnic groups and at different locales worldwide. Each of the five resulting plasma samples was additionally treated with one of three distinct methods of anticoagulation: EDTA, citrate, or heparin. The total number of distinct samples is 20: five serum samples, and three times five plasma samples.9 We used both remapped protein accession numbers as well as the original peptide sequences to evaluate experiment similarity by performing a latent semantic analysis, a technique often employed in natural language processing in combination

© 2008 American Chemical Society


with a vectorial representation of documents. Striking patterns were revealed by subsequently grouping the obtained similarities by their experimental metadata. Examination of these patterns resulted in interesting insights into the applied technologies, leading to concrete recommendations for future proteomics experiments and large-scale project planning. Additionally, a closer inspection of the semantic relationships that the algorithm discovered between peptide sequences yielded findings that may contribute to future solutions for important proteomics data analysis bottlenecks such as peptide-to-protein mapping.12 It is therefore clear from our results that heterogeneous proteomics data sets should not a priori be considered unfit for analysis, and that data repositories should not be considered data graveyards, but rather sources of invaluable information that could not be obtained in any other way.

Experimental Section

Data Set. The HUPO PPP data set was obtained from the PRIDE database (http://www.ebi.ac.uk/pride), under accession numbers 4-98. These data sets are also accessible as PRIDE XML files via FTP at ftp://ftp.ebi.ac.uk/pub/databases/pride.

Protein Sequence Database. All protein identifiers shown here are derived from the IPI human database,13 version 3.21. The relevant IPI database release can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/IPI/old/HUMAN.

Protein Assembly Algorithm. The protein assembly algorithm used in this work was closely modeled on the algorithm described previously in ref 14. A slight change in the input routines was carried out which allowed the software to read protein sequences and descriptions from a relational database built using MySQL (http://www.mysql.com). Briefly, the algorithm attempts to construct a minimal explanatory list of proteins for the presented peptide sequences, while maintaining maximal annotation (a peptide matching both a hypothetical protein and a UniProtKB/Swiss-Prot entry will preferentially be matched to the UniProtKB/Swiss-Prot protein).

Protein Assembly of the HUPO PPP Peptides. Assembling the 25 052 nonredundant peptide sequences obtained from the 95 HUPO PPP PRIDE experiments against the database resulted in 7884 nonredundant protein accession numbers.

Experiment Similarity Calculation and the Use of Information Retrieval Algorithms. To assess experiment similarity in an all-against-all comparison, we used two information retrieval algorithms: the vector space model (VSM) and its extension, the latent semantic analysis (LSA; also referred to as latent semantic indexing, LSI). Both algorithms are well-known15 and have been widely used for various information retrieval tasks. Both were originally designed for natural language processing, where the key objective is to retrieve those documents which are most similar (in ranked order) out of a corpus of indexed documents.
The main task in extending the VSM to an LSA lies in a transformation of the term-based document vectors to a topic-based representation. Interestingly, however, this step also effectively reduces the influence of noise (caused by the occurrence of random terms). Applied to the context of proteomics, experiments represent the concept of documents, while identified proteins (more precisely their IPI accession numbers) or peptide sequences represent the terms within each document (experiment). The algorithm can thus report a similarity score for each pair of experiments, based on the protein identifications (or peptide sequences) they

Table 1. Example Document/Term Matrix^a

term/document    exp 45   exp 38   exp 40   exp 57   exp 53   exp 42
IPI00000001      0.26     0.26     0.26     0.26     0.26     0.26
IPI00000005      0        0.54     0.54     0.54     0.54     0.54
IPI00000006      0        0.54     0.54     0.54     0.54     0.54
IPI00000013      0.51     0        0.51     0.51     0.51     0.51
IPI00000015      0.27     0.27     0.27     0.27     0.27     0.27
IPI00000023      0.51     0.51     0.51     0.51     0.51     0.51
IPI00000024      0.49     0.49     0.49     0.49     0.49     0.49
IPI00000026      0.47     0.47     0.47     0.47     0.47     0.47
IPI00000030      0        0        0        0        0        0
IPI00000041      0        0        0.57     0        0.57     0.57
IPI00000045      0        0.47     0.47     0.47     0.47     0.47
IPI00000051      0        0        0        0        0        0
IPI00000057      0        0        0.6      0        0.6      0
IPI00000070      0.54     0.54     0.54     0.54     0.54     0.54

^a Six experiments, containing 14 protein accession numbers, are shown. Note that the occurrences are weighted in this example (see main text).
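As an illustration of how such a matrix is assembled from per-experiment identification lists, here is a minimal Python sketch; the helper name and the toy accession lists are hypothetical, and the real matrix would additionally carry term weights (see main text):

```python
def build_bit_matrix(experiments):
    """Assemble a document/term bit matrix (terms x documents) from
    per-experiment lists of protein accession numbers."""
    terms = sorted({t for exp in experiments for t in exp})
    index = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(experiments) for _ in terms]
    for j, exp in enumerate(experiments):
        for t in exp:
            matrix[index[t]][j] = 1  # term t was observed in experiment j
    return terms, matrix

# Two toy "experiments" with overlapping identifications:
exps = [["IPI00000001", "IPI00000015"],
        ["IPI00000001", "IPI00000005", "IPI00000015"]]
terms, M = build_bit_matrix(exps)
print(terms)  # ['IPI00000001', 'IPI00000005', 'IPI00000015']
print(M)      # [[1, 1], [0, 1], [1, 1]]
```

Each row corresponds to one protein accession and each column to one experiment, mirroring the layout of Table 1 before weighting.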

contain. Note that the noise reduction performed by the LSA translates into a much reduced effect of potential spurious protein (or peptide) identifications when applied to proteomics experiments.

1. Vector Space Model. The general principle of the VSM is the representation of term-document association data in a vector space. This is achieved by creating a document/term matrix W wherein each value w_{i,j} represents the occurrence of a certain term i (rows of W) within the jth document (columns of W). Therefore, each column-vector w_{*,j} in W can be interpreted as a document-vector d_j which characterizes a document (proteomics experiment) by its terms (protein accession numbers or peptide sequences).16 The actual sequential order in which the terms occur in the indexed document is disregarded, and the general assumption is that this simplified 'bag-of-words' representation will maintain most of the relevant information in such a document.17 Note that since we do not have quantitative information available for the HUPO PPP, the matrix is a bit-matrix. An example of such a matrix is shown in Table 1. The similarity between the document vectors of two documents a and b is determined by the cosine distance measure, sim(a, b) -> [0, 1]:

sim(a, b) = \cos \alpha(a, b) = \frac{a \cdot b}{|a| \cdot |b|} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}    (1)

for which: sim(a, b) = 1 \Leftrightarrow a = \lambda \cdot b, \lambda \in \mathbb{R}

The cosine distance measure returns 0 for orthogonal vectors, and 1 if the angle between the vectors is 0. To compensate for the effect of poorly differentiating, often-occurring terms, the 'inverse document frequency' (IDF) of a term is applied as a weighting factor to that term. The IDF for a term t is given by

IDF_t = \log \left( \frac{N_D}{d_t} \right)    (2)


where N_D denotes the total number of documents, d_t denotes the number of documents containing term t over the complete document corpus, and therefore N_D ≥ d_t. In the case of the HUPO PPP data sets, application of the IDF will result in low weights for proteins such as hemoglobin and serum albumin due to their widespread presence in the experiments,18 which ultimately results in a very low impact in the experiment similarity scores for these ubiquitous proteins. Although the VSM is a simple model, it tends to yield good results. Several drawbacks exist, however. The first problem lies in the fact that the algorithm relies on exact term matching (because of the dot product in eq 1), thus making it impossible for the model to detect synonyms or differentiate between homonyms. An example of a synonym in proteomics is furnished by the isobaric amino acids leucine and isoleucine, or when different accession numbers are used to denote the same protein sequence. A second problem is the sparseness of nonzero values in the obtained document/term matrix.19 The matrix is thus largely composed of zeroes, highlighting the fact that many proteins failed to be identified in more than one experiment, a common situation in (shotgun) proteomics experiments where undersampling occurs.20 Taken together, these two problems result in an underestimation of the true semantic similarity in the HUPO PPP data when using only the VSM for similarity scoring.

2. Latent Semantic Analysis. To reduce the sparseness of the document/term matrix and to detect hidden (latent) term relations, an LSA projects the original document vectors into a lower dimensional semantic space where documents which contain repeatedly co-occurring terms will have a similar vector representation.
This effectively overcomes the fundamental deficiencies of the exact term-matching employed in a VSM.21 As such, an LSA might predict that a given term should be associated with a document, even though no such association was observed in the original matrix.22 The core principle for achieving this is the application of singular value decomposition (SVD), a type of factor analysis which can be applied to any rectangular matrix in the form of:

W = U \Sigma V^T    (3)

where U \in \mathbb{R}^{n \times n} and V \in \mathbb{R}^{m \times m} are orthogonal matrices (i.e., U^T U = I_n and V^T V = I_m), and \Sigma is a diagonal matrix containing the ordered singular values \sigma_{1,1} \ge \sigma_{2,2} \ge ... \ge \sigma_{r,r} \ge 0, where r = min(m, n). LSA makes use of the matrix approximation theorem, which states that

\operatorname{argmin}_{W_k \text{ has rank } k} \| W - W_k \|_2 = U_k \Sigma_k V_k^T    (4)

where U_k and V_k consist of the first k columns of U and V, respectively, and \Sigma_k = diag(\sigma_1, ..., \sigma_k). So, for k \le r, W_k is the best rank-k approximation of W in the least-squares sense. The approximation error is bounded with respect to the spectral norm by \| W - W_k \|_2 \le \sigma_{k+1}. The column vectors of \Sigma_k V_k^T are the new document vectors in the latent space. The mapping of an original document vector d into this latent space is given by U_k^T d. Concretely, a dimensional reduction of \Sigma (of rank r) to \Sigma_k (of rank k) is performed, which in turn requires adaptations to U and V, which are reduced from (n × r) to (n × k) and from (r × m) to (k × m), respectively:

W = U \Sigma V^T \approx W_k = U_k \Sigma_k V_k^T    (5)

where W_k denotes the approximated (i.e., reduced to k ≤ n) latent semantic representation of the document/term matrix W. Figure 1 provides a graphical illustration of this decomposition and reduction step. The SVD thus replaces the original values in the matrix W by new estimates, based on the observed co-occurrences of terms and their 'true semantic meaning' within the whole corpus of documents.23 The latter is achieved because terms with a common meaning are roughly mapped to the same direction in the latent space. Weak patterns and noise are filtered out during this process as the smallest singular values are left out first. The choice of k determines the degree of reduction, and it is therefore important to note that a high k value (corresponding to a small reduction) might not be able to filter out all noise or unimportant fluctuations in the source data, while a very small k value (strong reduction) will retain too little information from the original data structure.15

Algorithm Implementation. The algorithms were implemented in the Java programming language (http://www.java.sun.com), and the sources can be obtained by contacting the authors. The Jama library (http://math.nist.gov/javanumerics/jama) was used for the SVD calculations, and the dbtoolkit package24 was used to access the IPI database and the peptide mapping algorithm. The k-means clustering as well as the construction of the heat maps was performed using R.25

Similarity Score Calculation for the HUPO PPP Data Set. All identified peptides from the 95 HUPO PPP experiments were remapped against the human IPI database, version 3.21, resulting in 7884 nonredundant IPI accession numbers.
On the basis of those identifications, a document/term matrix was created with the restriction that proteins which were found in only one single experiment were omitted; this more than halves the dimensions of the matrix, to 3649 by 95 (the number of nonredundant proteins identified in more than one experiment by the number of experiments). Apart from an advantage in computational speed, this reduction also has a positive influence on the ability of the analysis to discover meaningful patterns. Indeed, a protein present in only a single experiment would decrease the similarity score between any two experiments because the inner product of those vectors is divided by their magnitude (see eq 1). The components of the document/term matrix were subsequently weighted using the IDF global weighting scheme (see eq 2). Finally, the similarity for each pair of the 95 experiments was computed using the cosine measure, which resulted in [n(n + 1)/2 - n =] 4465 similarity scores, disregarding the trivial cases where an experiment is compared to itself.
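The scoring pipeline described in this section (IDF weighting per eq 2, rank-k latent projection per eqs 3-5, and all-against-all cosine scoring per eq 1) can be sketched as follows, assuming numpy; the function name and the toy matrix are illustrative only:

```python
import numpy as np
from itertools import combinations

def lsa_similarity_scores(W_bits, k):
    """Sketch of the scoring pipeline: IDF weighting (eq 2), rank-k
    latent projection (eqs 3-5), and pairwise cosine scores (eq 1)."""
    n_terms, n_docs = W_bits.shape
    d_t = W_bits.sum(axis=1)              # documents containing each term
    idf = np.log(n_docs / d_t)            # eq 2; ubiquitous terms get 0
    W = W_bits * idf[:, None]             # weighted document/term matrix
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    docs = np.diag(s[:k]) @ Vt[:k, :]     # latent document vectors (k x m)
    norms = np.linalg.norm(docs, axis=0)
    return {(i, j): float(docs[:, i] @ docs[:, j] / (norms[i] * norms[j]))
            for i, j in combinations(range(n_docs), 2)}

# Toy bit matrix: 4 "proteins" x 3 "experiments".
W_bits = np.array([[1, 1, 1],   # present everywhere: IDF weight 0
                   [1, 1, 0],
                   [0, 1, 1],
                   [0, 0, 1]])
scores = lsa_similarity_scores(W_bits, k=2)
print(len(scores))  # 3 pairs; with 95 experiments this would be 4465
```

For 95 experiments the dictionary would hold 95 · 96/2 − 95 = 4465 entries, matching the count reported in the text.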

Results

Application of the Information Retrieval Algorithms to the HUPO PPP Data Set. The distribution of the similarity scores obtained for the HUPO PPP experiments with the VSM approach (Figure 2) shows a low overall similarity (more than 90% of the similarity scores are less than or equal to 0.1). This paradoxical finding contradicts the intuitive assumption that proteomics experiments from the same tissue should yield highly similar results. The lack of reproducibility in proteomics experiments that suffer from undersampling20 plays a considerable role in this divergence of the results.26 This is nicely illustrated for the HUPO PPP data by the fact that not even a single protein is seen in every experiment. Moreover, only 40 identifications out of the 7884 proteins are found in at


Figure 1. A graphical depiction of the matrix transformation for SVD with a reduction from r to k singular values. (a) This image shows the decomposition of the original matrix into three matrices by singular value decomposition. Matrix Σ is a square matrix of size r where r is the rank of the original matrix W. (b) Shown here is the reduction of the three matrices to k singular values (grayed out area).

wide array of techniques applied by the HUPO PPP contributors, with the express purpose of enhancing coverage and allowing subsequent method evaluation.9 Furthermore, a shotgun proteomics approach to analyze a complex mixture results in approximately 30% of all proteins identified by only a single peptide.12 As a result, nonzero entries constitute less than 3% of the document/term matrix, which, although better than a typical natural language set, is still quite poor, especially considering that this data set describes the proteome of a single tissue: plasma. Since the VSM approach to calculate experiment similarity further suffers from the previously mentioned limitations of the exact term matching employed, the overall result is a substantial underestimation of experiment similarity which hinders thorough analysis.

Figure 2. Histogram of the similarity score distribution obtained with the VSM. The very low overall similarity between experiments is clear; more than 4000 pairs of experiments have a similarity less than or equal to 0.2.

least half of the experiments, well below 1%. In contrast, more than 70% of the reported proteins are only found in one or two experiments. This effect can be further explained by the

With the application of an LSA to the VSM, however, both these problems can be resolved, and a revised matrix can be obtained that now supports detailed analysis of experiment overlap and complementarity. Finally, we found that one additional step is instrumental in obtaining easily interpretable patterns: grouping the individual experiments together in a graphical plot based on four metadata criteria: depletion step(s) applied to the sample, protein fractionation technique, search engine used for identification, and finally mass spectrometer type (see Supplementary Table 1 in Supporting Information). This information was extracted directly from the metadata available in the PRIDE database. Repeat experiments with


Figure 3. A visualization of the protein-based experiment similarity matrix, obtained from an LSA with k = 75. White is used for a score of 0, black for a score of 1. The experiments have been grouped by depletion technique, search engine, separation method, and mass spectrometer. These parameters are annotated above, below, and to the left of the plot. See main text for further details.

identical values for the above four parameters are internally ordered by protein identification count.

Technology Comparison Based on the Similarity Scores. A gray scale map of the similarity scores between all experiments, obtained after an LSA analysis with k = 75, is shown in Figure 3 (note that the upper bound for the value of k is 95 in this case, as there were 95 individual HUPO PPP experiments). White is used for a score of 0, black for a score of 1. The experiments have been grouped according to the four above-mentioned parameters, and these have been annotated above, below, and to the left-hand side of the map. Obviously, an experiment is identical to itself, which is why the top-left to lower-right diagonal is black. If we now consider the first two dark-colored squares (I and II) in the upper left-hand corner, we can see that they represent two clusters of experiments, each with high internal similarity, but showing relatively little similarity between them. Reading the separation protocol and the depletion technique used from the annotations above the map, we can see that the difference between the two clusters lies in the separation technique. The leftmost cluster was obtained after 2D-PAGE separation, while the right-hand cluster resulted from a peptide-centric shotgun


approach (no protein separation technique annotated). Additionally, reading the mass spectrometer annotations for these same experiments on the left-hand side of the map, we learn that the two clusters also differ here. The leftmost cluster was analyzed using a MALDI machine, while the right-hand cluster was run through an ESI-quadrupole instrument. Finally, the annotations on the bottom reveal that the experiments in both clusters relied on the Mascot search engine for identification. Because two parameters differ between these clusters (separation method and mass spectrometer), it is difficult to extract the exact cause for the differences. When we continue to follow the diagonal, we find two weak clusters that have been analyzed by PepMiner (III and IV), and then a stronger cluster that represents the only ESI FT-ICR experiments (V), all of which were analyzed using the VIPER search engine. The next cluster (VI) is derived from a set of 2D-PAGE experiments, and these can be compared to the next cluster along the diagonal (VII, the largest and most consistent cluster on the map), because both clusters represent experiments performed by the same laboratory (see Supplementary Table 2 in Supporting Information). After consulting the annotations, it is clear that the only difference between the two clusters is the use of the top-6


Figure 4. A visualization of the protein-based experiment similarity matrix, obtained from an LSA with k = 10. This gray scale map used a comparatively high dimensional reduction with a k value of 10, which results in an overall higher similarity between the experiments. Former clusters visible in Figure 3 are seen to have become part of larger areas of high similarity. This is due to a shift in focus towards more generic semantic patterns in the data. Interestingly, some clusters (e.g., two of the 2D-PAGE derived data sets) still stand out, indicating that they capture a distinct part of the plasma proteome. Additionally, band (a), a shotgun proteomics approach without a protein fractionation step, has practically no internal or external semantic connections, indicating a low similarity when compared to each other (light gray color throughout band a), while contributing many unique identifications.

depletion in the rightmost cluster, while no depletion was employed for the left cluster. The last three clusters (VIII, IX, and X) also originate from a single laboratory, and the difference between VIII and IX can be read from the annotations: the use of a MALDI TOF-TOF versus an ESI Ion Trap. To see what separated cluster X from the previous two, we had to consult the HUPO PPP overview manuscript, as the necessary level of metadata detail was not transmitted to the PRIDE database. The difference turned out to be the iodoacetamide derivatization used in cluster X, but not in IX. This was also reflected in the peptide composition of these clusters, since the average number of cysteine-containing peptides was higher in the experiments of cluster X than those of cluster IX (7% versus 3%, respectively). It is interesting to observe the high level of reproducibility within the clusters containing 2D-PAGE experiments (I, VI, and VII). Another extremely interesting observation concerns the use of depletion steps. If we consider only clusters VI and VII, the only difference lies in the application of a top-6 depletion

step in one of them. The very low similarity between these two clusters (nearly white overlap regions) shows that removal of the six most abundant proteins in plasma resulted in the detection of an almost completely different part of the plasma proteome. Finally, all the observed clusters in this analysis derive from differences introduced by the various methodologies and technology platforms employed, rather than from differences between the samples. It is noteworthy in this context that a previously published analysis, performed on quantitative data, was able to group related samples together.27 By reducing the k value in the LSA, the focus is shifted from fine-grained differences to rougher distinctions. Figure 4 shows a gray scale map of an LSA with k = 10. It is immediately obvious that the fine-grained distinctions are lost in the large sections of high similarity (large, darkly checkered areas). Indeed, the choice of a much smaller k reveals the abilities of the LSA: the similarities between the experiments increase overall. The influence of protocol steps becomes less and less significant, since the reduction to smaller k focuses on the


Figure 5. Venn diagrams of the number of nonredundant protein identifications as contributed by different technology platforms. (a) This Venn diagram shows the distribution of the identifications based on protein separation technology. For clarity, the platforms have been divided into three rough categories: 2D-PAGE approaches (2D gel), shotgun proteomics experiments (none), and all other technologies (other). (b) Shown here is the influence of the depletion technology used. It is clear that top-6 depletion contributed the majority of the identifications.

stronger patterns in the data, in this case the common biological origin. Clusters V, VI, and VII still stand out clearly, however, as does the group of shotgun experiments on top-6 depleted samples, located near the right of the map (a). The shotgun experiments thus reveal very little similarity, both within the repeated experiments as well as compared to the rest of the experiments. For the three clusters, their persistent segregation indicates that they contribute truly unique identifications (i.e., they cannot be semantically connected to other identifications), but in contrast to the shotgun experiments, the 2D-PAGE results are strongly internally consistent. For cluster V, the unique instrument and search engine most probably result in a different semantic theme, while the 2D-PAGE approaches in clusters VI and VII are distinguished by applying either rigorous depletion (top-6) or no depletion at all. Additionally, the results of the depleted and nondepleted samples, respectively, remain very dissimilar from each other while retaining high internal consistency. This provides strong evidence for an effect reminiscent of the 'albumin sponge' effect.28 This implies the necessity of changing our reasoning about depletion steps; rather than simply removing unwanted proteins from a mixture, they amount to another separation step.

Nonredundant Protein Identification Contributions per Technology Platform. Figure 5 shows Venn diagram representations of the total number of nonredundant protein identifications contributed by several different technology platforms. From Figure 5a, it is clear that the shotgun proteomics experiments contribute most identifications. This can be explained in part by the popularity of these approaches in the HUPO PPP, but the many unique contributions also stem from the low overlap typically obtained in these experiments. This ties in with the observation in Figure 4, band (a), as noted above.
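The per-platform contributions visualized in such Venn diagrams reduce to simple set operations over the nonredundant identification lists; a toy sketch (the accession names are hypothetical):

```python
# Toy per-platform sets of nonredundant protein accessions.
gel = {"IPI_A", "IPI_B", "IPI_C"}
shotgun = {"IPI_B", "IPI_C", "IPI_D", "IPI_E"}
other = {"IPI_C", "IPI_E", "IPI_F"}

unique_to_gel = gel - (shotgun | other)   # identified by 2D-PAGE only
shared_by_all = gel & shotgun & other     # core overlap of all platforms
print(sorted(unique_to_gel))  # ['IPI_A']
print(sorted(shared_by_all))  # ['IPI_C']
```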
It is interesting to see, however, that despite the large contribution by the shotgun approaches, the 2D-PAGE methods still contribute 198 unique identifications, hinting at an important complementarity between these methods with regard to achieving maximal proteome coverage. This observation is strengthened by the fact that the overlap between 2D-PAGE and the other two groups is substantially less than the overlap between the nongel technologies. The depletion comparison shown in Figure 5b clearly indicates that experiments using top-6 depletion were most productive in terms of protein identifications contributed. It is again interesting, however, to


see that 10% of all identifications are contributed solely by experiments that did not use any protein separation technology.

Using Peptides Instead of Proteins for Validation. To test whether the protein matching algorithm introduced any artifacts, we repeated the LSA for k = 75 using the peptide sequences for each experiment instead of the reassembled protein identifiers. The results were practically identical (see Supplementary Figure 1 in Supporting Information), showing that the protein mapping algorithm did not introduce any observable biases.

Interpretation of the Peptide-Based LSA. Since the literature suggested that a comparison based on exact matching of peptide sequences dramatically underestimates the overlap between experiments,14 it is particularly interesting to see how similar the LSA map of the peptides is to that of the proteins. It is obvious that the specific ability of the LSA to detect latent semantic relations provides a solution for the low exact sequence matching between experiments. It is therefore of considerable interest to examine the way in which the LSA manages to group peptide sequences together in semantic units. This analysis can be carried out by studying the term vectors rather than the document vectors (i.e., looking at the rows of the document/term matrix rather than the columns). Since we are interested in the peptide sequences that result in semantic units, and since the LSA method groups semantically related terms together in n-dimensional space, a simple k-means clustering29,30 allows detection of these related terms. Upon analysis of the resulting groups, three distinct patterns emerge. First, the LSA merges peptide sequences distinguished only by the occurrence of isobaric amino acids (e.g., isoleucine/leucine); since these amino acids are indistinguishable to the mass spectrometer, their substitution does not affect the semantic representation of the containing peptide sequence.
Second, the LSA succeeds in grouping peptides that represent subsequences of longer peptides together with those 'parent' peptides; these subsequences can be the result of missed cleavages (e.g., YLGNATAIFFLPDEGK and YLGNATAIFFLPDEGKLQHLENELT), in-source decay, or in vivo proteolytic degradation (e.g., YLGNATAIFFLPDEGKLQHLENELT and YLGNATAIFFLPDEGKLQHLENELTHD). A third group created by the LSA is composed of sequences that are unrelated and appear to be grouped randomly (e.g., EFNAETFTFHADICTLSEK, HPDYSVVLLLR, HPYFYAPELLFFAK, QNCELFEQLGEYK, RPCFSALEVDETYVPK, TAGWNIPMGLLYNK, VPQVSTPTLVEVSR, and VVSVLTVVHQDWLNGK). When matched to the protein sequence database, however, six of these peptides match several variants of the serum albumin protein in IPI. The LSA method thus appears able to find patterns of co-occurrence in the peptide sequences that allow these sequences to be grouped into logical clusters, at least some of which in turn map closely to the objects studied: the precursor proteins of the peptides. It is important to realize that the LSA derives such relations solely from a large list of peptide sequences, organized into individual experiments that can essentially be considered poorly overlapping replicates.

Sample Analysis. We also tested the ability of the LSA to distinguish between complete sample sets rather than making the more specific distinction between individual experiments. For this, we grouped together all the peptides identified for the five different sample types: sodium citrate plasma, EDTA plasma, heparin plasma, citrate-phosphate-dextrose plasma, and serum.9 The resulting matrix of 5 by 25 052 (sample types by unique peptide sequences) did not show any meaningful similarities when a VSM was applied, but did show easily interpretable results when an LSA was applied with k = 2 (we expect two semantic topics: plasma and serum). The results are plotted in Figure 6. The clear difference between plasma and serum is readily explained by the absence of clotting factors (such as fibrin) in the latter.

Figure 6. Visualization of the peptide-based sample similarity matrices. (a) Sample similarity computed from a VSM; no real patterns can be observed. (b) When analyzed by an LSA with k = 2, the distinction between plasma and serum becomes obvious.
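The VSM-versus-LSA comparison above can be sketched with a small sample-by-peptide count matrix. The matrix, its dimensions, and all variable names below are illustrative assumptions (the real PPP matrix was 5 by 25 052), not actual HUPO PPP data; the only fixed choices taken from the text are cosine similarity and a rank-k truncated SVD with k = 2.

```python
# Toy comparison of sample similarity under a plain vector space model
# (VSM) versus a rank-k LSA. Rows are hypothetical sample types, columns
# are hypothetical peptide observations; values are simple counts.
import numpy as np

X = np.array([
    [1, 1, 1, 1, 0, 0, 0, 1],   # "plasma-like" samples share columns 0-3
    [1, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 0, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1],   # "serum-like" sample concentrated in 4-7
], dtype=float)

def cosine_sim(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    N = M / np.clip(norms, 1e-12, None)
    return N @ N.T

# (a) VSM: cosine similarity on the raw count vectors.
vsm = cosine_sim(X)

# (b) LSA: truncated SVD keeping k = 2 latent topics; documents are then
# compared via their coordinates in the reduced space (rows of U_k * s_k).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_k = U[:, :k] * s[:k]
lsa = cosine_sim(docs_k)

print(np.round(vsm, 2))
print(np.round(lsa, 2))
```

In the reduced space, similarities are pulled toward the dominant latent topics, which is the effect Figure 6 illustrates on the real data.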

Discussion

Comparison of Protein Lists. We have demonstrated the usefulness of the LSA methodology for the comparison of protein lists derived from many widely differing proteomics analyses performed on the same tissue. By applying this approach to the publicly available HUPO PPP data sets, we have been able to extract several interesting yet readily understandable pieces of information that can inform decisions for future (large-scale) projects. We have also encountered several potentially interesting differences between groups of experiments that could not be traced to a single origin, pointing to possible improvements in the planning of future pilot projects. Indeed, a newly planned pilot can now base the distribution of the different platforms and methodologies on the information obtained here and thereby aim to vary only one significant parameter between groups of experiments for optimal results. Where this is not possible in practice, more detailed follow-up studies can still provide the required information, at the cost of a more complex design. From the PPP data, it can already be concluded that when the goal of a project is to achieve maximal proteome coverage for a particular sample, shotgun proteomics experiments, repeated over multiple replicates, yield the greatest gain. 2D-PAGE analysis should not be disregarded as an analytical tool, however, since it can contribute a substantial fraction of unique identifications. Because of the high internal reproducibility of the 2D-PAGE analyses as performed in the HUPO PPP, carrying out many replicates on this platform does not necessarily lead to a proportional increase in novel identifications.

In the specific case of plasma, the influence of various depletion techniques is also of interest. While methods employing top-6 depletion contributed more than 50% of the identifications, about 10% of all proteins were only found when no depletion was used at all. It therefore seems reasonable to analyze the eluate of depletion columns (the 'depletome'); indeed, depletion steps should rather be considered separation steps.

Planning Large-Scale Proteomics Experiments. By analyzing previous experiments and experimental approaches for their specific strengths, the planning of large-scale proteomics collaborations can be informed of the appropriate (mix of) methodologies, technologies, and instruments to employ in order to achieve a desired result. Additionally, a cost-benefit prediction can be made for each platform, taking into account the specific focus of a project. The overall lack of internal reproducibility of shotgun proteomics experiments is a great advantage for a project that aims to achieve high proteome coverage, while the inherent within-laboratory reproducibility of the 2D-PAGE methodology as employed in the HUPO PPP is quite interesting from the perspective of quantitative studies.

Interpretation of the Semantic Associations.
An interesting finding is the ability of the LSA algorithm to tease out semantic relationships between apparently unrelated sets of peptide sequences, based solely on the patterns by which these sequences aggregate within and between experiments. We have found that at least some of these semantic links can be explained by underlying methodological or biological concepts. As such, application of the LSA approach to replicate shotgun analyses can potentially help to alleviate one of the primary caveats of peptide-centric proteomics: the protein inference problem. The LSA is able to reconstruct groups of peptides that were derived from a single protein, and it therefore holds promise as either a preprocessing step or a validation step in peptide-to-protein mapping strategies. An application of the LSA approach directly on the original fragmentation spectra would be an interesting next step, as any reliance on identification processes would thereby be removed. Since the semantic structures underlying peptide lists (at least in part) represent entities of biological interest, and since the same approach works for protein accession numbers, the nature of the semantic relationships that occur at that level is also of considerable interest. Potential methodological explanations include the solubility or abundance of the proteins in relation to the sample handling and protein separation protocol, while biologically meaningful semantic relations could include protein complexes or protein components of the same pathway.

Importance of Metadata and Central Repositories. The analyses presented here have only been possible through (a) the availability of the data and (b) the presence of detailed annotation for the different experiments. The situation was still somewhat suboptimal, however, as no fragmentation spectra were available to support the peptide identifications, and we also occasionally had to refer to the original publications to obtain the necessary experiment metadata. We therefore conclude that it is important that (a) proteomics data be stored in centralized, easily accessible repositories and (b) these repositories attempt to capture as much annotation as possible, prompting submitters to provide context to their data in a well-structured manner. While several proteomics data repositories are now available, the level of metadata provision varies widely, both between experiments and between repositories.
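The term-vector clustering discussed in the Results (studying the rows rather than the columns of the document/term matrix, followed by k-means) can be sketched as follows. The block-structured toy matrix and every name in it are assumptions for illustration only; the real analysis clustered LSA term vectors for 25 052 peptides.

```python
# Sketch of grouping peptides ("terms") by k-means clustering of their
# LSA term vectors. Peptides 0-2 co-occur across experiments (as if from
# one hypothetical parent protein), peptides 3-5 as if from another.
import numpy as np

X = np.array([              # rows = peptides, columns = experiments
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
], dtype=float)

# Term vectors live in the rows of U_k * s_k (terms x latent topics).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
terms_k = U[:, :k] * s[:k]

def kmeans(P, n_clusters, iters=50):
    """Minimal Lloyd's k-means with deterministic farthest-point init."""
    centers = [P[0]]
    while len(centers) < n_clusters:
        dist = np.min([np.linalg.norm(P - c, axis=1) for c in centers], axis=0)
        centers.append(P[int(dist.argmax())])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = P[labels == c].mean(axis=0)
    return labels

labels = kmeans(terms_k, 2)
print(labels)  # co-occurring peptides end up sharing a cluster label
```

On data with such clean co-occurrence blocks, the two clusters recover the two hypothetical parent proteins; on real data, this corresponds to the serum-albumin-like groups described above.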

Conclusion

We presented a method that allows large collections of heterogeneous data sets to be mined relatively easily and reliably, resulting in the extraction of valuable and reasonable information despite the presence of substantial noise. The analyses carried out here effectively open up many paths for further investigation. By extending the analysis to include other tissue data sets (for instance, the HUPO Brain Proteome Project (HUPO BPP),10 and eventually any available proteomics data) and by carefully choosing an appropriate k value, the focus of investigation could be shifted from the fine-grained effects resulting from the application of different technology platforms to the coarse-grained distinctions derived from differences in tissue type, disease state, or developmental stage, in a similar way to the sample analysis performed here. As such, we feel that, far from being the places where data goes to die, proteomics data repositories instead free the data to play an important part in promising meta-analyses. It is important to note that the latent semantic analysis employed here greatly benefits from the large number of varying experimental repetitions on the same sample, leaving room both for large-scale collaborative proteomics projects such as the HUPO initiatives and for individual efforts by laboratories with a particular focus, on the condition that the resulting data are made available to the community in centralized repositories. Equally important is the presence of sufficiently detailed annotation of the data submitted to such repositories. As a general conclusion, we would therefore like to emphasize that submission of well-annotated data to public repositories should quickly become standard practice for the field.

Abbreviations: IR, information retrieval; VSM, vector space model; LSI (or LSA), latent semantic indexing (analysis); SVD, singular value decomposition; PRIDE, proteomics identifications database.

Acknowledgment. This work has been supported by the EU “ProDaC” grant number LSHG-CT-2006-036814 and the BBSRC “ISPIDER” grant. J.A.V. is a Postdoctoral Fellow of the “Especialización en Organismos Internacionales” program of the Spanish Ministry of Education and Science.

Supporting Information Available: Figure showing the visualization of the HUPO PPP similarity matrix, based on peptide sequences rather than protein identifiers and obtained from an LSA with k = 75; table listing the metadata retrieved from the PRIDE database for the HUPO PPP experiments. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198–207.
(2) Domon, B.; Aebersold, R. Mass spectrometry and protein analysis. Science 2006, 312 (5771), 212–217.
(3) Prince, J. T.; Carlson, M. W.; Wang, R.; Lu, P.; Marcotte, E. M. The need for a public proteomics repository. Nat. Biotechnol. 2004, 22 (4), 471–472.
(4) Hermjakob, H.; Apweiler, R. The Proteomics Identifications Database (PRIDE) and the ProteomExchange Consortium: making proteomics data accessible. Expert Rev. Proteomics 2006, 3 (1), 1–3.
(5) Craig, R.; Cortens, J. P.; Beavis, R. C. Open source system for analyzing, validating, and storing protein identification data. J. Proteome Res. 2004, 3 (6), 1234–1242.
(6) Martens, L.; Hermjakob, H.; Jones, P.; Adamski, M.; Taylor, C.; States, D.; Gevaert, K.; Vandekerckhove, J.; Apweiler, R. PRIDE: the proteomics identifications database. Proteomics 2005, 5 (13), 3537–3545.
(7) Jones, P.; Cote, R. G.; Martens, L.; Quinn, A. F.; Taylor, C. F.; Derache, W.; Hermjakob, H.; Apweiler, R. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006, 34 (Database issue), D659–663.
(8) Desiere, F.; Deutsch, E. W.; Nesvizhskii, A. I.; Mallick, P.; King, N. L.; Eng, J. K.; Aderem, A.; Boyle, R.; Brunner, E.; Donohoe, S.; Fausto, N.; Hafen, E.; Hood, L.; Katze, M. G.; Kennedy, K. A.; Kregenow, F.; Lee, H.; Lin, B.; Martin, D.; Ranish, J. A.; Rawlings, D. J.; Samelson, L. E.; Shiio, Y.; Watts, J. D.; Wollscheid, B.; Wright, M. E.; Yan, W.; Yang, L.; Yi, E. C.; Zhang, H.; Aebersold, R. Integration with the human genome of peptide sequences obtained by high-throughput mass spectrometry. Genome Biology 2005, 6 (1), R9.
(9) Omenn, G. S.; States, D. J.; Adamski, M.; Blackwell, T. W.; Menon, R.; Hermjakob, H.; Apweiler, R.; Haab, B. B.; Simpson, R. J.; Eddes, J. S.; Kapp, E. A.; Moritz, R. L.; Chan, D. W.; Rai, A. J.; Admon, A.; Aebersold, R.; Eng, J.; Hancock, W. S.; Hefta, S. A.; Meyer, H.; Paik, Y. K.; Yoo, J. S.; Ping, P.; Pounds, J.; Adkins, J.; Qian, X.; Wang, R.; Wasinger, V.; Wu, C. Y.; Zhao, X.; Zeng, R.; Archakov, A.; Tsugita, A.; Beer, I.; Pandey, A.; Pisano, M.; Andrews, P.; Tammen, H.; Speicher, D. W.; Hanash, S. M. Overview of the HUPO Plasma Proteome Project: results from the pilot phase with 35 collaborating laboratories and multiple analytical groups, generating a core dataset of 3020 proteins and a publicly-available database. Proteomics 2005, 5 (13), 3226–3245.
(10) Hamacher, M.; Apweiler, R.; Arnold, G.; Becker, A.; Bluggel, M.; Carrette, O.; Colvis, C.; Dunn, M. J.; Frohlich, T.; Fountoulakis, M.; van Hall, A.; Herberg, F.; Ji, J.; Kretzschmar, H.; Lewczuk, P.; Lubec, G.; Marcus, K.; Martens, L.; Palacios Bustamante, N.; Park, Y. M.; Pennington, S. R.; Robben, J.; Stuhler, K.; Reidegeld, K. A.; Riederer, P.; Rossier, J.; Sanchez, J. C.; Schrader, M.; Stephan, C.; Tagle, D.; Thiele, H.; Wang, J.; Wiltfang, J.; Yoo, J. S.; Zhang, C.; Klose, J.; Meyer, H. E. HUPO Brain Proteome Project: summary of the pilot phase and introduction of a comprehensive data reprocessing strategy. Proteomics 2006, 6 (18), 4890–4898.
(11) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Ranish, J.; Raught, B.; Schmitt, R.; Werner, T.; Kuster, B.; Aebersold, R. Computational prediction of proteotypic peptides for quantitative proteomics. Nat. Biotechnol. 2007, 25 (1), 125–131.

(12) Nesvizhskii, A. I.; Aebersold, R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 2005, 4 (10), 1419–1440.
(13) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 4 (7), 1985–1988.
(14) Martens, L.; Muller, M.; Stephan, C.; Hamacher, M.; Reidegeld, K. A.; Meyer, H. E.; Bluggel, M.; Vandekerckhove, J.; Gevaert, K.; Apweiler, R. A comparison of the HUPO Brain Proteome Project pilot with other proteomics studies. Proteomics 2006, 6 (18), 5076–5086.
(15) Dumais, S. Improving the retrieval of information from external sources. Behav. Res. Methods, Instrum. Comput. 1991, 23 (2), 229–236.
(16) Salton, G.; Wong, A.; Yang, C. S. A vector space model for automatic indexing. Commun. ACM 1975, 18 (11), 613–620.
(17) Salton, G.; McGill, M. J. Introduction to Modern Information Retrieval; McGraw-Hill: New York, 1983.
(18) Anderson, N. L.; Anderson, N. G. The human plasma proteome: history, character, and diagnostic prospects. Mol. Cell. Proteomics 2002, 1 (11), 845–867.
(19) Katz, S. M. Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Trans. Acoust., Speech Signal Process. 1987, 35 (3), 400–401.
(20) Gevaert, K.; Van Damme, P.; Ghesquière, B.; Impens, F.; Martens, L.; Helsens, K.; Vandekerckhove, J. A la carte proteomics with an emphasis on gel-free techniques. Proteomics 2007, 7 (16), 2698–2718.
(21) Landauer, T. K.; Foltz, P. W.; Laham, D. An introduction to latent semantic analysis. Discourse Processes 1998, 25 (2), 259–284.
(22) Deerwester, S.; Dumais, S. T.; Furnas, G. W.; Landauer, T. K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41 (6), 391–407.

(23) Furnas, G. W.; Deerwester, S.; Dumais, S. T.; Landauer, T. K.; Harshman, R. A.; Streeter, L. A.; Lochbaum, K. E. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’88); ACM: New York, 1988; pp 465–480.
(24) Martens, L.; Vandekerckhove, J.; Gevaert, K. DBToolkit: processing protein databases for peptide-centric proteomics. Bioinformatics 2005, 21 (17), 3584–3585.
(25) R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2005.
(26) Reidegeld, K. A.; Muller, M.; Stephan, C.; Bluggel, M.; Hamacher, M.; Martens, L.; Korting, G.; Chamrad, D. C.; Parkinson, D.; Apweiler, R.; Meyer, H. E.; Marcus, K. The power of cooperative investigation: summary and comparison of the HUPO Brain Proteome Project pilot study results. Proteomics 2006, 6 (18), 4997–5014.
(27) Adkins, J. N.; Monroe, M. E.; Auberry, K. J.; Shen, Y.; Jacobs, J. M.; Camp, D. G., 2nd; Vitzthum, F.; Rodland, K. D.; Zangar, R. C.; Smith, R. D.; Pounds, J. G. A proteomic study of the HUPO Plasma Proteome Project’s pilot samples using an accurate mass and time tag strategy. Proteomics 2005, 5 (13), 3454–3466.
(28) Zolotarjova, N.; Martosella, J.; Nicol, G.; Bailey, J.; Boyes, B. E.; Barrett, W. C. Differences among techniques for high-abundant protein depletion. Proteomics 2005, 5 (13), 3304–3313.
(29) Smellie, A. Accelerated K-means clustering in metric spaces. J. Chem. Inf. Comput. Sci. 2004, 44 (6), 1929–1935.
(30) Steinley, D. K-means clustering: a half-century synthesis. Br. J. Math. Stat. Psychol. 2006, 59, 1–34.

