Comparison and Evaluation of Clustering Algorithms for Tandem Mass Spectra


Article

Comparison and evaluation of clustering algorithms for tandem mass spectra Vera Rieder, Karin Ulrike Schork, Laura Kerschke, Bernhard Blank-Landeshammer, Albert Sickmann, and Jörg Rahnenführer J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00427 • Publication Date (Web): 29 Sep 2017 Downloaded from http://pubs.acs.org on October 2, 2017



Journal of Proteome Research

Comparison and evaluation of clustering algorithms for tandem mass spectra

Vera Rieder,† Karin U. Schork,†,‡ Laura Kerschke,†,¶ Bernhard Blank-Landeshammer,§ Albert Sickmann,§,‡,∥ and Jörg Rahnenführer∗,†

†Department of Statistics, TU Dortmund University, Germany
‡Medizinische Fakultät, Medizinisches Proteom-Center, Ruhr-University Bochum, Germany
¶Institut für Biometrie und Klinische Forschung (IBKF) der Westfälischen Wilhelms-Universität und des Universitätsklinikums Münster, Germany
§Leibniz-Institut für Analytische Wissenschaften - ISAS - e.V., Dortmund, Germany
∥Department of Chemistry, College of Physical Sciences, University of Aberdeen, Aberdeen, Scotland, United Kingdom

E-mail: [email protected]

Abstract

In proteomics, liquid chromatography-tandem mass spectrometry (LC-MS/MS) is established for identifying peptides and proteins. Duplicated spectra, i.e. multiple spectra of the same peptide, occur both in single MS/MS runs and in large spectral libraries. Clustering tandem mass spectra is used to find consensus spectra, with manifold applications. First, it speeds up database searches, as performed for instance by Mascot. Second, it helps to identify novel peptides across species. Third, it is used for quality control to detect wrongly annotated spectra. We compare different clustering algorithms based on the cosine distance between spectra. CAST, MS-Cluster, and PRIDE Cluster are popular algorithms to cluster tandem mass spectra. We add well-known algorithms for large data sets, hierarchical clustering, DBSCAN, and connected components of a graph, as well as the new method N-Cluster. All algorithms are evaluated on real data with varied parameter settings. Cluster results are compared with each other and with peptide annotations, based on validation measures such as purity. Quality control, regarding the detection of wrongly (un)annotated spectra, is discussed for exemplary resulting clusters. N-Cluster proves to be highly competitive. All clustering results benefit from the so-called DISMS2 filter that integrates additional information, e.g. on precursor mass.

Keywords Clustering, Tandem mass spectra

Introduction

Liquid chromatography-tandem mass spectrometry (LC-MS/MS) is an established method for the identification of peptides and proteins. In single MS/MS runs and in large spectral libraries, redundant spectra, i.e. multiple spectra of the same peptide, arise. Usually, database-dependent search algorithms are used to identify peptides and proteins. The annotation implicitly yields the grouping of MS/MS spectra with the same annotations into clusters. An alternative is to omit this annotation step and to directly compare the spectra. The idea behind the clustering of tandem mass spectra is the replacement of duplicates by a single representative, typically called a consensus spectrum. Three main applications emerge from this approach.

First, the number of spectra can be reduced, which accelerates database searches, as performed for instance by Mascot. Processing of MS/MS spectra is limited by computing time and data storage infrastructure. Each MS/MS spectrum is searched against a database, thus clustering as a pre-step saves computing time. 1,2


Second, clustering helps to identify novel peptides across species. A majority of spectra remain unidentified by the established annotation methods. Clustering can then be used in a pipeline to find novel peptides through alignment with multiple related peptides. 3 Reasons for unidentified spectra are, for example, co-isolation of co-eluting precursors or analyzer-dependent parameters, such as mass accuracy and resolution.

Third, clustering is used for quality control to detect wrongly annotated spectra. Background noise leads to wrong peaks, and consequently to misidentification of spectra. 4 This problem will be described and explained in the results section, based on selected exemplary clusters.

Several clustering algorithms, such as Pep-Miner 4 using CAST, 5 MS-Cluster, 2 PRIDE Cluster, 6 and MaRaCluster, 1 were introduced within the proteomics community. The general idea of these approaches is clustering based on distances of tandem mass spectra. For clustering large datasets based on distances, well-known algorithms, including hierarchical clustering (complete-linkage clustering), density-based clustering (DBSCAN), 7 and graph-based clustering (connected components of a graph), are popular and thus promising alternatives for clustering spectra. MaRaCluster uses a new p-value distance measure followed by hierarchical clustering (complete-linkage clustering). MaRaCluster is not included in the analysis since the impact of the new distance measure cannot be controlled. Furthermore, we present the new algorithm Neighbor clustering (N-Cluster). Similar to DBSCAN, it uses neighborhoods of objects to build clusters, but avoids the chaining effect. A chain occurs if more and more points with pairwise small distances are iteratively clustered together. As a result, points at opposite ends of a chain can be very distant from each other.
Apart from comparisons of specific clustering algorithms, 8 no comprehensive comparison of tandem mass spectra clustering algorithms can be found in the literature yet, to the best of our knowledge. We compare seven different clustering algorithms based on the cosine distance between spectra. We evaluate all clustering algorithms on exemplary real data of samples from species with and without database search annotation. Parameter settings are varied for all clustering algorithms, and the cluster results are compared with each other and with peptide annotations. Previously used evaluation measures are included in the analysis. For example, the purity of a cluster is the relative frequency of the most frequent annotation in this cluster. 1,8 A low proportion of spectra remaining (few clusters) combined with a high retainment of identified peptides (many different annotations kept) is desired. 1 Furthermore, the proportion of clustered spectra and the proportion of incorrectly clustered spectra are two connected measures. 8 Optimizing these measures requires a large proportion of clustered spectra and, at the same time, few incorrectly clustered spectra.

Materials and Methods

MS/MS Datasets

We used published data of 27 MS/MS runs. 9 Samples were measured in triplicate from five sequenced organisms, namely (i) human (Homo sapiens, H, HeLa cell line), (ii) mouse (Mus musculus, M, C2C12 cell line), (iii) yeast (Saccharomyces cerevisiae, Y), (iv) roundworm (Caenorhabditis elegans, C), and (v) fruit fly (Drosophila melanogaster, D), and from four organisms without sequenced genome, namely (vi) freshwater snail Radix species: molecular operational taxonomic unit (MOTU) 2 (R2) and MOTU 4 (R4), and (vii) foraminifera species Amphistegina lessonii (Al) and Amphistegina gibbosa (Ag). The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE 10 partner repository with the dataset identifier PXD004824.

Each sample was measured in triplicate (1 µg each) by an Ultimate 3000 nano RSLC system coupled to a Q Exactive HF mass spectrometer (both Thermo Scientific). In DDA mode the MS survey scans were acquired from m/z 300 to 1,500 at a resolution of 60,000, followed by isolation of precursors with a window of 0.4 m/z. The 15 most intense signals were subjected to HCD with an NCE of 27% at a resolution of 15,000, taking into account a dynamic exclusion of 12 s.

Proteome Discoverer 1.4 (Thermo Scientific) and Mascot 2.4 (Matrix Science) were used for data interpretation. Database searches were performed in a target/decoy mode against their respective protein sequence (FASTA) databases. 9 The following settings were used: enzyme trypsin, two missed cleavages allowed, carbamidomethylation of cysteine as fixed modification, oxidation of methionine as dynamic modification, 10 ppm MS tolerance, 0.02 Da MS/MS tolerance, and only PSMs with search engine rank 1 and FDR < 1%.

Clustering Algorithms

Clustering tandem mass spectra has several applications. In particular, groups of similar spectra belonging to one peptide variant improve mass spectrometry based proteomics analyses. In the following, we describe in detail all clustering methods that we compare, namely CAST, DBSCAN, hclust, igraph, MS-Cluster, N-Cluster, and PRIDE Cluster. All these methods are useful for large datasets, do not require a predefined number of clusters and are based on the distances between tandem mass spectra.

Clustering Affinity Search Technique (CAST)

Pep-Miner uses the CAST algorithm applied to spectra. Unfortunately its code is not publicly available. 2 CAST was originally introduced for clustering of gene expression data. 4,5 A threshold affinity value t∗ = 1 − t for the affinity/similarity of spectra is fixed. An object p is added to a cluster if its average similarity to all objects in the cluster exceeds t∗, and it is removed from the cluster if this average similarity falls below t∗. First a random point is chosen to build a cluster. Then, alternate steps, adding objects with high affinity and removing objects with low affinity, are repeated as long as changes occur. The cluster is closed if the cluster members do not change anymore. Afterwards, the algorithm starts again with a point randomly selected from all points that are not yet assigned to a cluster.

We used the cosine similarity, defined as one minus the cosine distance, for clustering spectra and compared different threshold affinity values t∗ (0.8, 0.9, 0.95, 0.975). We implemented the algorithm in R. 11
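The add/remove loop of CAST can be illustrated with a short Python sketch. This is not the authors' R implementation; the function name `cast`, the deterministic seed choice, the one-object-per-step update and the `max_iter` guard are assumptions made for illustration only.

```python
def cast(sim, t_star, max_iter=1000):
    """Cluster objects 0..n-1 from a symmetric similarity matrix (list of lists)."""
    def affinity(p, cluster):
        # Average similarity of p to the cluster members other than p itself.
        others = [q for q in cluster if q != p]
        return sum(sim[p][q] for q in others) / len(others)

    unassigned = set(range(len(sim)))
    clusters = []
    while unassigned:
        # Assumption: the smallest free index replaces the random seed,
        # so that the sketch is reproducible.
        cluster = {min(unassigned)}
        for _ in range(max_iter):  # assumed guard against oscillation
            changed = False
            # Add step: free object with the highest average affinity.
            outside = sorted(unassigned - cluster)
            if outside:
                best = max(outside, key=lambda p: affinity(p, cluster))
                if affinity(best, cluster) >= t_star:
                    cluster.add(best)
                    changed = True
            # Remove step: member with the lowest average affinity.
            if len(cluster) > 1:
                worst = min(sorted(cluster), key=lambda p: affinity(p, cluster))
                if affinity(worst, cluster) < t_star:
                    cluster.remove(worst)
                    changed = True
            if not changed:
                break  # cluster is closed
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters
```

With a similarity matrix that contains two well-separated groups and t∗ = 0.8, the sketch closes one cluster per group before moving on to the remaining points.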

Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

The basic concept of the widely used clustering algorithm DBSCAN 7 is that the density within a cluster is higher than between clusters. Density is given by the number of points within a fixed distance ε. Starting with a random point p, all neighbors of p within ε are visited. If the number of neighbors is greater than a threshold minPts, p builds a new cluster with its neighbors; otherwise it is labeled as noise. Proceeding with the neighbors of p, more points are put into the same cluster if again minPts points are found within ε of a neighbor of p. In the next step, the number of neighbors of newly added points is checked. The cluster is closed if no further unvisited point can be reached. Then, the algorithm starts again with a random unvisited point. In our comparison we varied the fixed distance ε (0.025, 0.05, 0.1, 0.2) as well as the threshold minPts (2, 3, 5, 10) for the minimum number of neighbors. We implemented the algorithm in R. 12
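A minimal Python sketch of DBSCAN on a precomputed distance matrix may help to follow the expansion step. It is not the authors' R code. It assumes the common convention that a point is a core point if at least `min_pts` points, counting the point itself, lie within `eps`, and it visits points in index order instead of randomly.

```python
def dbscan(dist, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(dist)
    labels = [None] * n  # None marks an unvisited point

    def neighbors(p):
        # The neighborhood includes p itself because dist[p][p] == 0.
        return [q for q in range(n) if dist[p][q] <= eps]

    cluster_id = 0
    for p in range(n):
        if labels[p] is not None:
            continue
        nb = neighbors(p)
        if len(nb) < min_pts:
            labels[p] = -1  # noise, may later become a border point
            continue
        labels[p] = cluster_id
        queue = [q for q in nb if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id  # former noise joins as border point
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_nb = neighbors(q)
            if len(q_nb) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_nb)
        cluster_id += 1
    return labels
```

For points on a line at positions 0, 0.1, 0.2, 0.3 and 5 with eps = 0.15 and min_pts = 2, the first four points end up in one cluster although the outer two are 0.3 apart, which is the chaining effect mentioned in the introduction.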

Complete linkage hierarchical clustering (hclust)

A hierarchical representation of merging or splitting groups is obtained by hierarchical clustering. Agglomerative clustering starts with single observations as clusters and iteratively merges the pair of clusters with the smallest distance into one cluster. 13 Complete linkage means that the distance of two clusters is defined as the maximum distance of all corresponding objects. We stop merging clusters as soon as the distance between every pair of clusters exceeds the distance threshold h. The result of a hierarchical clustering can be visualized with a dendrogram, a rooted binary tree. Cutting this dendrogram at the level h is equivalent to this procedure. We removed singletons before using the R function hclust and varied the parameter h (0.025, 0.05, 0.1, 0.2) to allow for different levels.
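The stopping rule can be sketched in Python as a naive agglomerative procedure. This is only an illustration of complete linkage with a distance threshold, not the R function hclust; the function name `complete_linkage` is hypothetical and the cubic running time makes it unsuitable for real runs with tens of thousands of spectra.

```python
def complete_linkage(dist, h):
    """Merge clusters until every pair of clusters is farther apart than h."""
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(a, b):
        # Complete linkage: maximum distance over all cross-cluster pairs.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Closest pair of clusters; ties are broken by cluster position.
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > h:
            break  # corresponds to cutting the dendrogram at level h
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]
```

Because the maximum pairwise distance within a merged cluster never exceeds h, long chains cannot form here, at the price of splitting elongated groups into several clusters.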

Connected components of a graph (R package igraph)

The distance matrix of spectra can be represented as an undirected graph. Nodes represent spectra, and an edge between two nodes is added if their distance does not exceed the distance threshold cdis. Each connected component forms a separate cluster. 14 We used the function clusters in the R package igraph 15 to build clusters and compared different distance threshold values (0.025, 0.05, 0.1, 0.2).
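As an illustration, connected components of the thresholded distance graph can be found with a depth-first search. The Python sketch below stands in for the igraph call and is not the authors' code; the function name `connected_components` is an assumption.

```python
def connected_components(dist, cdis):
    """Clusters are the connected components of the graph with an edge
    between i and j whenever dist[i][j] <= cdis."""
    n = len(dist)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if not seen[v] and dist[u][v] <= cdis:
                    seen[v] = True
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters
```

Isolated nodes are returned as clusters of size one. Like DBSCAN, this approach links points through intermediate neighbors, so it is also prone to chaining.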

Neighbor clustering (N-Cluster)

We present Neighbor clustering (N-Cluster) 12 that circumvents the chaining effect observed when applying DBSCAN. The basic idea is that a center point of a cluster should have many neighbors within a distance c. For every point the number of neighbors with distance smaller than c is computed. Points with only one neighbor, the point itself, are marked as singletons. The point with the most neighbors and its neighbors are chosen as a new cluster. The procedure is repeated for all points not yet assigned to a cluster. Again, the next cluster is built based on the point with the most neighbors. Since the maximum distance of each point to a cluster center is c, the so-called chaining is avoided. We implemented N-Cluster in R and varied the distance threshold c (0.025, 0.05, 0.1, 0.2), resulting in different clusterings. 12
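The description of N-Cluster translates into a short greedy loop. The Python sketch below is one possible reading, not the authors' R implementation: it assumes that neighbor counts are recomputed among the points that are still unassigned and that ties are broken by the smallest index.

```python
def n_cluster(dist, c):
    """Return (clusters, singletons) for a distance matrix and threshold c > 0."""
    unassigned = set(range(len(dist)))
    clusters, singletons = [], []
    while unassigned:
        free = sorted(unassigned)
        # Neighborhood of each free point among the free points (includes itself).
        nbhd = {p: [q for q in free if dist[p][q] < c] for p in free}
        center = max(free, key=lambda p: len(nbhd[p]))
        if len(nbhd[center]) <= 1:
            # No remaining point has a neighbor besides itself.
            singletons.extend(free)
            break
        clusters.append(nbhd[center])
        unassigned -= set(nbhd[center])
    return clusters, singletons
```

For points on a line at positions 0, 0.1, 0.2, 0.3 and 5 with c = 0.15, the sketch builds the cluster {0, 1, 2} around the second point and leaves the fourth point as a singleton, whereas a chaining method would attach it to the same cluster.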

MS-Cluster

MS-Cluster 2 is an approximate hierarchical clustering algorithm with a fixed number of rounds (parameter rounds) and a similarity threshold τ that decreases in each round. The first identified clusters with a similarity exceeding τ are merged. Two heuristics, namely testing for a common peak among the top 5 peaks and a similarity indicator excluding low similarities between clusters, reduce the computing time. We implemented an R wrapper (system3, R package BBmisc 16) to apply the MS-Cluster v2 algorithm 17 (MsCluster v2.0), an executable that can be run from the command line via an R function. We set fragment-tolerance = 0.5, window = 5 and rounds = 5 as in PRIDE Cluster and used the default mixture-prob = 0.05. Based on the value of similarity (e.g. 0.8), decreasing values of τ are selected (e.g. τ = 1.00, 0.95, 0.90, 0.85, 0.80).
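The example schedule τ = 1.00, 0.95, 0.90, 0.85, 0.80 for similarity = 0.8 and rounds = 5 is consistent with equally spaced thresholds. The Python sketch below reproduces this example under that assumption; whether MS-Cluster itself spaces the thresholds linearly is not stated in the text, and the function name `threshold_schedule` is hypothetical.

```python
def threshold_schedule(similarity, rounds=5, start=1.0):
    """Equally spaced thresholds from `start` down to `similarity`.

    Assumption: linear spacing, chosen only because it matches the
    example values given in the text.
    """
    if rounds < 2:
        return [similarity]
    step = (start - similarity) / (rounds - 1)
    # Rounding removes floating-point noise such as 0.8500000000000001.
    return [round(start - i * step, 10) for i in range(rounds)]
```

Calling `threshold_schedule(0.8)` gives five thresholds that start at 1.0 and end at 0.8.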

PRIDE Cluster

PRIDE Cluster 6 is a modified version of MS-Cluster. In contrast to MS-Cluster, the quality of MS/MS spectra is assessed with the signal-to-noise ratio by SpectraST. 18 Additionally, a cluster can be split when adding new spectra to the cluster. A further improvement over MS-Cluster is that the clusters with the highest similarity are merged, instead of the first identified clusters with sufficient similarity. We implemented an R wrapper (system3, R package BBmisc 16) to use spectra-cluster-cli, 8 a stand-alone Java application of the algorithm (spectra-cluster API Version 1.0 by Rui Wang and Johannes Griss). We set rounds = 5 and precursor_tolerance = 5 and used the default value fragment_tolerance = 0.5. The similarity threshold starts with threshold_start = 1 in the first round, with decreasing values for threshold_end (0.8, 0.9, 0.95, 0.975) in the last round.

Measures for assessment of the quality of cluster algorithms

The quality of cluster algorithms can be assessed by different measures. A short description of all measures we used in the analysis is given in Table 1.

Table 1: Overview of measures for assessment of the quality of cluster algorithms

| Measure | Description |
|---|---|
| Adjusted Rand index (ARI) | Transformation of Rand index, the fraction of pairs of clustered objects being classified both same or both different in two partitions |
| Purity | Relative frequency of the most frequent annotation |
| Proportion of spectra remaining | Number of clusters relative to the number of spectra |
| Retainment of identified spectra | Number of different annotations when selecting only the most frequent annotation in each cluster, relative to the total number of different annotations before clustering |
| Proportion clustered spectra | Proportion of spectra clustered with at least one other spectrum |
| Proportion incorrectly clustered spectra | Proportion of spectra not identified as the most common annotation in the cluster |

The adjusted Rand index, 19,20 a similarity measure between two clusterings P1 and P2, compares partitions by counting pairs of clustered objects. Given two partitions P1 and P2, the classification of pairs of objects is counted. For example, a (or d) is the number of pairs being classified to the same (or a different) cluster regarding both partitions (P1 and P2), and b (or c) is the number of pairs being classified to the same cluster in partition P1 (or P2) and to different clusters in partition P2 (or P1):

| | same clustering (P2) | different clusterings (P2) |
|---|---|---|
| same clustering (P1) | a | b |
| different clusterings (P1) | c | d |

The Rand index RI 19,20 is equivalent to the simple matching coefficient, and is defined as the fraction of pairs being classified consistently:

$$\mathrm{RI} = \frac{a + d}{a + b + c + d}$$

Since RI does not attain values in the entire interval [0, 1], a transformation is used resulting in the adjusted Rand index (ARI), equivalent to Cohen’s kappa: 20,21

$$\mathrm{ARI} = \frac{\mathrm{RI} - E(\mathrm{RI})}{\max(\mathrm{RI}) - E(\mathrm{RI})} = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}$$
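The pair counts a, b, c, d and both indices can be computed directly from two label vectors. The following Python sketch follows the formulas for RI and ARI given above; the function names are hypothetical, and the quadratic loop over all pairs is meant for small illustrations, not for runs with tens of thousands of spectra.

```python
from itertools import combinations


def pair_counts(p1, p2):
    """Count pairs of objects by agreement of two partitions (label lists)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p1)), 2):
        same1 = p1[i] == p1[j]
        same2 = p2[i] == p2[j]
        if same1 and same2:
            a += 1  # same cluster in both partitions
        elif same1:
            b += 1  # same cluster in P1 only
        elif same2:
            c += 1  # same cluster in P2 only
        else:
            d += 1  # different clusters in both partitions
    return a, b, c, d


def rand_index(a, b, c, d):
    return (a + d) / (a + b + c + d)


def adjusted_rand_index(a, b, c, d):
    denominator = (a + b) * (b + d) + (a + c) * (c + d)
    if denominator == 0:
        # Only possible when b == c == 0, i.e. the partitions are identical.
        return 1.0
    return 2 * (a * d - b * c) / denominator
```

For the partitions [0, 0, 1, 1] and [0, 1, 0, 1] the counts are a = 0, b = 2, c = 2, d = 2, which gives RI = 1/3 and ARI = -0.5, a value below the expectation under independence.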

If two partitions are identical, ARI has the value 1. A value of zero corresponds to the expected value of independent partitions. Values smaller than zero are possible if two partitions are less similar than expected assuming independence. ARI is widely used and is most appropriate compared to other external criteria. 22

An external evaluation criterion of cluster quality is the average purity. Purity is defined as the largest fraction of spectra sharing their matched peptide. 1,8 Assuming that the most frequent annotation in each cluster is correct, let n_i be the number of correctly annotated spectra in cluster i, i = 1, . . . , k. Then the purity is the sum of the numbers n_i divided by the total number of spectra: 23

$$\mathrm{Purity} = \frac{\sum_{i=1}^{k} n_i}{\#\{\text{spectra}\}}$$
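As a small worked example of the purity formula, the Python sketch below counts the most frequent annotation per cluster. It assumes that every spectrum carries an annotation; how unannotated spectra are treated is not specified here, and the function name `purity` is hypothetical.

```python
from collections import Counter


def purity(cluster_labels, annotations):
    """Sum of the most frequent annotation counts n_i over all clusters,
    divided by the total number of spectra."""
    members = {}
    for cluster, annotation in zip(cluster_labels, annotations):
        members.setdefault(cluster, []).append(annotation)
    # n_i: size of the dominant annotation in cluster i
    n_correct = sum(Counter(m).most_common(1)[0][1] for m in members.values())
    return n_correct / len(annotations)
```

For two clusters with annotations (A, A, B) and (C, C), the dominant annotations cover 2 + 2 of the 5 spectra, so the purity is 0.8.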

On the one hand, the proportion of clustered spectra should be large. On the other hand, the proportion of incorrectly clustered spectra should be small. Optimizing these two criteria can be illustrated as a scatter plot. 8 On the y-axis the proportion of spectra clustered with at least one other spectrum is shown. If the number of singletons is low, then this value is large. On the x-axis the proportion of spectra not identified as the most common annotation in the cluster is shown. The most common annotation in a cluster represents the cluster’s representative annotation. Thus an annotation that differs from the representative annotation results in poorer quality. In the original publication the terms Rel. (incorrectly) clustered spectra were used instead of Proportion (incorrectly) clustered spectra. 8

Two competing goals are the retainment of identified spectra and the proportion of spectra remaining after clustering. 1 In a scatter plot the number of clusters relative to the number of spectra is shown on the x-axis. For example, a small proportion of spectra remaining is desired to speed up database searches. At the same time, the number of annotations should be stable after clustering. The number of annotations after clustering is assessed by the number of different annotations regarding the most common annotation in each cluster. Therefore the retainment of identified spectra is defined as the latter number relative to the total number of annotations before clustering. If two peptides appear with the same frequency, both are added to the set of most common annotations.
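The four proportions described above can be computed in one pass over the clusters. The Python sketch below is an illustration under stated assumptions: every spectrum is annotated, both proportions of (incorrectly) clustered spectra are taken relative to the total number of spectra, and a spectrum counts as incorrect only if its annotation matches none of the tied most common annotations. The function name and dictionary keys are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_measures(cluster_labels, annotations):
    members = defaultdict(list)
    for cluster, annotation in zip(cluster_labels, annotations):
        members[cluster].append(annotation)

    n = len(annotations)
    kept = set()    # most common annotations over all clusters
    clustered = 0   # spectra sharing a cluster with another spectrum
    incorrect = 0   # clustered spectra that differ from the representative
    for anns in members.values():
        counts = Counter(anns)
        top = max(counts.values())
        # Ties: all annotations with the highest frequency are kept.
        representative = {a for a, k in counts.items() if k == top}
        kept |= representative
        if len(anns) > 1:
            clustered += len(anns)
            incorrect += sum(1 for a in anns if a not in representative)

    return {
        "proportion_remaining": len(members) / n,
        "retainment": len(kept) / len(set(annotations)),
        "proportion_clustered": clustered / n,
        "proportion_incorrect": incorrect / n,
    }
```

For six spectra in the clusters (A, A, B), (C, C) and (B), half as many cluster representatives as spectra remain, all three annotations are retained, five of six spectra are clustered and one of six is incorrectly clustered.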

Results and discussion

First, the clustering of tandem mass spectra in single runs is presented in detail, applying the different cluster methods and corresponding parameter settings. Second, the resulting clusterings are evaluated with the measures for assessment of cluster quality. Third, clustering as a helpful tool for quality control is illustrated on two exemplary clusters. Fourth, the analysis is extended to multiple MS/MS runs.

Clustering of tandem mass spectra

We used R 24 to perform clustering separately on each of the 27 MS/MS runs. Each run comprises between 30012 and 40236 tandem mass spectra (Supporting Information, Table S-1). Samples of human, mouse, yeast, roundworm, fruit fly, two different Radix species, and two different foraminifera species were analyzed. 9 The R code is available at https://www.statistik.tu-dortmund.de/genetics-publications-clusspec.html.

We compare seven different clustering algorithms: CAST, DBSCAN, hierarchical clustering (complete linkage clustering), connected components of a graph, MS-Cluster, PRIDE Cluster and the new method Neighbor clustering. Table 2 gives an overview of the parameter settings for the different clustering algorithms. Values of the parameters c, cdis, ε, h and t are chosen equal since they all refer to a cutoff of the cosine distance between spectra. Values are iteratively halved (0.2, 0.1, 0.05, 0.025). Values of the parameters similarity and threshold_end directly refer to the similarity of spectra and hence are chosen according to the equation similarity = 1 − distance (0.8, 0.9, 0.95, 0.975).

Table 2: Parameter settings used for the clustering algorithms

| Algorithm | Parameter | Values |
|---|---|---|
| CAST | t | 0.025, 0.05, 0.1, 0.2 |
| DBSCAN | ε | 0.025, 0.05, 0.1, 0.2 |
| DBSCAN | minPts | 2, 3, 5, 10 |
| hclust | h | 0.025, 0.05, 0.1, 0.2 |
| igraph | cdis | 0.025, 0.05, 0.1, 0.2 |
| N-Cluster | c | 0.025, 0.05, 0.1, 0.2 |
| MS-Cluster | similarity | 0.975, 0.95, 0.9, 0.8 |
| PRIDE Cluster | threshold_end | 0.975, 0.95, 0.9, 0.8 |

For a fair comparison of all algorithms a consistent distance calculation between tandem mass spectra is required. The ProteoWizard tool MSConvertGUI 25 was used to convert Thermo RAW files into the open data format mzXML that can be read with the R package readMzXmlData. 26 We preprocessed original spectra by binning them with a fixed bin size (bin = 0.2). The cosine distance of all pairs of tandem mass spectra was calculated. The resulting distance matrix is used as input of all clustering algorithms except MS-Cluster and PRIDE Cluster. The latter two include preprocessing steps and specialized distance calculations, thus making a direct comparison inappropriate.

To improve the clustering results another version of each distance matrix is constructed using DISMS2 filtering. 9 If certain constraints regarding retention time (half size of retention time window ret = 3000 ranked scan numbers), precursor charge, and precursor mass (accepted precursor mass shift prec = 10 ppm) are not fulfilled, pairwise distances are replaced by the maximum distance 1. Hence these tandem mass spectra cannot be clustered together. The choice of the retention time window size ret depends on the reproducibility of the HPLC and is only applicable for the same HPLC conditions. A necessary condition for peptides with the same amino acid sequence and the same post-translational modifications is an identical precursor mass. Restricting comparisons to spectra with similar precursor masses thus simplifies clustering of tandem mass spectra.
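The distance calculation and the filter can be sketched in a few lines of Python. This is not the DISMS2 implementation: the dictionary keys (`charge`, `scan_rank`, `precursor_mass`, `binned`), the function names, the summing of intensities within a bin and the ppm calculation relative to the larger precursor mass are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_size=0.2):
    """Sum the intensities of (m/z, intensity) peaks that fall into the same bin."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_size)] += intensity
    return dict(binned)


def cosine_distance(a, b):
    """One minus the cosine similarity of two binned spectra (dicts bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty spectrum is maximally distant by convention
    return 1.0 - dot / (norm_a * norm_b)


def filtered_distance(s1, s2, ret=3000, prec_ppm=10.0):
    """Cosine distance, replaced by the maximum distance 1 if a constraint fails."""
    same_charge = s1["charge"] == s2["charge"]
    close_in_time = abs(s1["scan_rank"] - s2["scan_rank"]) <= ret
    # Assumed ppm definition: mass difference relative to the larger mass.
    reference = max(s1["precursor_mass"], s2["precursor_mass"])
    shift_ppm = abs(s1["precursor_mass"] - s2["precursor_mass"]) / reference * 1e6
    if not (same_charge and close_in_time and shift_ppm <= prec_ppm):
        return 1.0
    return cosine_distance(s1["binned"], s2["binned"])
```

Two spectra with identical peaks then have distance close to 0 when all constraints hold and distance 1 as soon as, for example, their precursor charges differ.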


Annotations by a Mascot peptide-to-spectrum match of the five model organisms (human, mouse, yeast, roundworm and fruit fly) are used as a further clustering that hints at the true clustering of spectra. Furthermore, annotations of Radix MOTU 4 were added on the basis of a draft genome for Radix auricularia. 27 In total, for 18 of 27 MS/MS runs a clustering by the same peptide annotation is available for clustering evaluation, e.g. purity.

In total 1962 clusterings of single MS/MS runs are generated. For each MS/MS run 32 clusterings based on the cosine distance of spectra and 32 clusterings with additional DISMS2 filtering are calculated. Along with 8 clusterings of MS-Cluster and PRIDE Cluster with a different distance calculation and, where applicable, one annotation clustering, 72 or 73 different clusterings can then be compared for each MS/MS run.

As a second step, clusterings of sets of multiple MS/MS runs of three annotated replicates per species are generated and analyzed. However, for large datasets running time and memory usage are limiting factors. Thus for each of the 12 algorithms only one parameter setting is considered, namely the one with maximal mean ARI for single runs.
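The stated totals can be checked against the parameter grid of Table 2. The short Python calculation below only restates numbers from the text; the variable names are chosen for illustration.

```python
# Number of parameter settings per algorithm, read from Table 2.
cosine_based = {"CAST": 4, "DBSCAN": 4 * 4, "hclust": 4, "igraph": 4, "N-Cluster": 4}
own_distance = {"MS-Cluster": 4, "PRIDE Cluster": 4}

per_run_cosine = sum(cosine_based.values())   # cosine distance only
per_run_filtered = per_run_cosine             # same settings with DISMS2 filter
per_run_own = sum(own_distance.values())      # distance is part of the algorithm
per_run = per_run_cosine + per_run_filtered + per_run_own

runs, annotated_runs = 27, 18
# One additional annotation clustering for each annotated run.
total = runs * per_run + annotated_runs

print(per_run_cosine, per_run, total)
```

The grid yields 32 cosine-based clusterings and 72 clusterings per run, and 27 × 72 + 18 = 1962 clusterings in total, in agreement with the numbers given above.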

Evaluation of clusterings

The performance of all cluster algorithms (see Table 2) is evaluated based on all measures presented in Table 1 (adjusted Rand index, purity, proportion of spectra remaining, retainment of identified spectra, proportion of clustered spectra, number of spectra not identified as the most common peptide in a cluster) and cluster size.

First, the similarity between different clusterings is assessed by ARI, where larger values indicate better quality and the maximum possible value is 1. The ARI between each clustering and the annotation of 18 MS/MS runs is averaged, see Figure 1. ARI values range between 0.00 and 0.65. Table 3 summarizes the highest mean ARI values of single MS/MS runs for the different clustering approaches (cosine distance, DISMS2 filter, and distance calculation as part of the algorithm). In addition, the last column of Table 3 shows the respective mean ARI values of cluster results for multiple MS/MS runs. These values are discussed in more detail in the last paragraph of the results section. Of the two widely used algorithms for tandem mass spectra, MS-Cluster and PRIDE Cluster, MS-Cluster yields the higher value for single runs. If cosine distances of tandem mass spectra are used as input of the clustering algorithms and the parameters are chosen adequately, all five algorithms lead to similar values higher than 0.53. Hierarchical clustering (h = 0.1) yields the best results, and N-Cluster (c = 0.05) keeps up. The DISMS2 filter generally improves the mean ARI. DBSCAN (ε = 0.2, minPts = 2) and igraph (cdis = 0.2) are best in this case, but N-Cluster (c = 0.2) is almost as good as the best performing methods.

[Figure 1: bar chart. y-axis: mean adjusted Rand index (0.0 to 1.0). x-axis: parameter values (0.2, 0.1, 0.05, 0.025 for distance cutoffs; 0.8, 0.9, 0.95, 0.975 for similarity thresholds), grouped by algorithm: CAST, DBSCAN (minPts = 2, 3, 5, 10), hclust, igraph, N-Cluster, MS-Cluster, PRIDE Cluster. Legend: cosine distance; cosine distance + DISMS2 filter; part of algorithm.]

Figure 1: Mean adjusted Rand index between annotation and clusterings of 18 annotated MS/MS runs. Cosine distance was used without and with DISMS2 filter. MS-Cluster and PRIDE include preprocessing and distance calculation as part of the algorithm.

The adjusted Rand index does not require an annotation, thus all clusterings can be compared pairwise. Heatmaps visualizing groups of similar clusterings are shown in the Supporting Information (Figure S-1, S-2, S-3 and S-4), as well as comparisons of all 27 MS/MS runs. In the following only a subset of the clusterings is considered. For each clustering algorithm, the parameters with the highest mean ARI as listed in Table 3 are selected.


Journal of Proteome Research

Table 3: Mean adjusted Rand index of single and multiple MS/MS runs

Distance of spectra               Clustering                               single   multiple
Cosine distance                   CAST (t = 0.1)                           0.538    0.642
                                  DBSCAN (ε = 0.05, minPts = 2)            0.544    0.667
                                  hclust (h = 0.1)                         0.557    0.608
                                  igraph (cdis = 0.05)                     0.544    0.667
                                  N-Cluster (c = 0.05)                     0.534    0.649
Cosine distance + DISMS2 filter   CAST (t = 0.2)                           0.623    0.721
                                  DBSCAN (ε = 0.2, minPts = 2)             0.648    0.743
                                  hclust (h = 0.2)                         0.590    0.648
                                  igraph (cdis = 0.2)                      0.648    0.743
                                  N-Cluster (c = 0.2)                      0.642    0.730
Part of algorithm                 MS-Cluster (similarity = 0.8)            0.552    0.593
                                  PRIDE Cluster (threshold_end = 0.975)    0.450    0.559

The purity describes the fraction of objects in a cluster that carry the dominant annotation. We compare clusterings with regard to the average purity grouped by cluster size (Figure 2). For most algorithms the average purity is high, even for large clusters. In particular N-Cluster (c = 0.05), which avoids chaining, performs well for larger clusters, with purity values above 0.9. The average purity of PRIDE Cluster (threshold_end = 0.95) and CAST (t = 0.1) clusters decreases with growing cluster size. The DISMS2 filter mostly leads to better results; in particular, CAST (t = 0.2) clusters benefit from its use. Only for very large hierarchical (h = 0.2) clusters with sizes between 32 and 63 is the average purity below 0.7.

[Figure 2 plot. Two panels: cosine distance (left) and cosine distance + DISMS2 filter (right). X-axis: cluster size (1, 2−3, 4−7, 8−15, 16−31, 32−63, 64−127, 128+). Y-axis: average purity (0.0 to 1.0). Legend: CAST, DBSCAN, hclust, igraph, N-Cluster, MS-Cluster, PRIDE Cluster.]

Figure 2: Purity grouped by cluster size for clusterings with best mean ARI per clustering algorithm, based on cosine distance of spectra (left) and additionally with DISMS2 filter (right).

On the one hand, the number of identified peptides should not be smaller than before clustering. On the other hand, the number of spectra representing a cluster (number of clusters) should be as small as possible. We plotted the number of identified peptides against the number of cluster representatives, both relative to the numbers before clustering (Figure 3). Both values are averaged across the 18 annotated MS/MS runs. The average proportion of spectra remaining is close to 50% for all clusterings, except for PRIDE Cluster with 40%. The price for this smaller number of clusters is a lower average retainment of identified peptides: the values are above 0.98 for all clusterings except PRIDE Cluster, with 0.93. Figure 3 (right side) shows the analogous plot when the DISMS2 filter is additionally applied. The results are similar, but the retainment of identified peptides is now above 0.99 for all clusterings except PRIDE Cluster. For example, N-Cluster (c = 0.2) with DISMS2 filter retains 99.7% of the identifications with 47.8% of the total number of spectra before clustering. With the DISMS2 filter, more peptide identifications are retained and fewer spectra remain, since the filter effectively integrates the same additional information as database searches, e.g. the precursor mass.

[Figure 3 plot. Two panels: cosine distance (left) and cosine distance + DISMS2 filter (right). X-axis: average proportion of spectra remaining (0.40 to 0.50). Y-axis: average retainment of identified peptides (0.90 to 1.00). Legend: CAST, DBSCAN, hclust, igraph, N-Cluster, MS-Cluster, PRIDE Cluster.]

Figure 3: Number of identified peptides against number of consensus spectra relative to the numbers before clustering.

The proportion of clustered spectra should be high, whereas the proportion of incorrectly clustered spectra should be low. We plotted both relative values in a scatter plot (Figure 4), so that points at the top left indicate a good result. Since most of the spectra form clusters of size one (see Supporting Information, Figure S-5), the proportion of clustered spectra is generally low, about 25% to 30% in most cases. Only PRIDE Cluster (threshold_end = 0.975) generates a higher value (53%), but this also implies a relatively large proportion of incorrectly clustered spectra (22%). N-Cluster (c = 0.05), for example, leads to only 2% incorrectly clustered spectra, with 25% of the spectra clustered in total. The DISMS2 filter has a positive impact on the proportion of incorrectly clustered spectra; for CAST, for example, the proportion falls from 7% to 2%.
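The cluster-level measures used in Figures 2 to 4 can be sketched in a few lines of Python. The purity definition (share of spectra with the dominant annotation) and the proportion of spectra remaining (number of clusters relative to number of spectra) follow the text. The remaining choices are assumptions made for illustration: a spectrum counts as "clustered" if its cluster has at least two members and as "incorrectly clustered" if it is clustered but does not carry the dominant annotation of its cluster, all spectra are assumed to be annotated, and the names `size_bin` and `evaluate` are hypothetical.

```python
from collections import Counter, defaultdict


def size_bin(size):
    """Map a cluster size to the bins on the Figure 2 axis: 1, 2-3, 4-7, ..., 128+."""
    if size >= 128:
        return "128+"
    low = 1 << (size.bit_length() - 1)  # largest power of two <= size
    return "1" if low == 1 else f"{low}-{2 * low - 1}"


def evaluate(cluster_ids, annotations):
    """Return average purity per size bin and three proportions.

    The proportions are: spectra remaining (clusters / spectra), clustered
    spectra, and incorrectly clustered spectra (the latter two as assumed
    in the accompanying text). The input must not be empty.
    """
    members = defaultdict(list)
    for cid, ann in zip(cluster_ids, annotations):
        members[cid].append(ann)

    purities = defaultdict(list)
    clustered = incorrect = 0
    for anns in members.values():
        dominant_count = Counter(anns).most_common(1)[0][1]
        purities[size_bin(len(anns))].append(dominant_count / len(anns))
        if len(anns) >= 2:
            clustered += len(anns)
            incorrect += len(anns) - dominant_count

    n = len(cluster_ids)
    avg_purity = {b: sum(p) / len(p) for b, p in purities.items()}
    return avg_purity, len(members) / n, clustered / n, incorrect / n


# Hypothetical toy data: eight annotated spectra in four clusters.
cluster_ids = [1, 1, 1, 1, 2, 2, 3, 4]
annotations = ["A", "A", "A", "B", "C", "C", "D", "E"]
avg_purity, remaining, clustered, incorrect = evaluate(cluster_ids, annotations)
print(avg_purity, remaining, clustered, incorrect)
```

In this toy example the cluster of size four contains one spectrum with a deviating annotation, so its purity is 0.75 and one of eight spectra counts as incorrectly clustered.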

[Figure 4 plot. Two panels: cosine distance (left) and cosine distance + DISMS2 filter (right). X-axis: proportion of incorrectly clustered spectra (0.0 to 0.5). Y-axis: proportion of clustered spectra (0.0 to 0.5). Legend: CAST, DBSCAN, hclust, igraph, N-Cluster, MS-Cluster, PRIDE Cluster.]

Figure 4: Proportion of clustered spectra versus proportion of incorrectly clustered spectra of clusterings based on cosine distance of spectra (left) or additionally with DISMS2 filter (right).

Quality control demonstrated on two exemplary clusters

One goal of clustering spectra is quality control. Among the clusterings based on cosine distance without the DISMS2 filter, the mean ARI attains its highest value for hierarchical clustering with parameter h = 0.1 (Table 3). Thus, two exemplary clusters obtained with this algorithm are presented in detail for the human sample H1 (HeLa 1). We present analyses for the large cluster Clus1, which contains a spectrum that erroneously lacks an annotation, and the medium-sized cluster Clus2, which contains falsely annotated spectra.

Clus1 is the fourth-largest cluster for clustering H1 with hclust (h = 0.1). It contains 24 MS/MS spectra, one without annotation and all others with only two different annotations, which differ only in the oxidation of methionine at the fifth or eighth position of the peptide (HQGVmVGMGQK and HQGVMVGmGQK). It is remarkable that the unannotated spectrum is very similar to the annotated spectra (see Figure 5, left). All spectra are close to each other, with pairwise distances smaller than 0.1. Ion scores vary between 51 and 76, except for one spectrum with score 43 that is also somewhat separated from the other spectra. All close neighbors of the unannotated spectrum have the annotation HQGVmVGMGQK, with a smallest distance of only 0.013. This suggests that the annotation HQGVmVGMGQK can also be inferred for the unannotated spectrum. The reason for the missing annotation of this spectrum was a wrong precursor mass correction during database annotation. For clustering H1 with hclust (h = 0.1), Clus1 is one of 83 clusters with at least 5 members, at most three different annotations and at least one missing annotation.

[Figure 5 plot. Left: multidimensional scaling, Coordinate 1 versus Coordinate 2; legend: HQGVmVGMGQK (score range 43−76), HQGVMVGmGQK (score range 51−66), annotation missing. Right: histogram, density versus cosine distance (0.0 to 1.0).]

Figure 5: Graphical representation of the fourth-largest cluster Clus1 (24 MS/MS spectra) for clustering H1 with hclust (h = 0.1), visualized by multidimensional scaling (left) including ion scores (Mascot) and by a histogram (right) of all pairwise cosine distances of the 24 spectra.

Clus2 is a medium-sized cluster with 13 MS/MS spectra, of which 7 are not annotated and the others are annotated with different peptides (see Figure 6). Accordingly, the ion scores (Mascot) are low, ranging from 28 to 48. All sequences start with phenylalanine (F) and are quite short, with a maximum length of 8. The pairwise distances are small, but the precursor masses range between 619 and 1393 Dalton. All spectra have an extremely high peak at 120 m/z that corresponds to a phenylalanine immonium ion (Supporting Information, Figure S-7). Clustering based on the cosine distance is therefore misleading for spectra dominated by the phenylalanine immonium ion peak. The DISMS2 filter protects against this type of wrong cluster because the precursor mass is taken into account. For clustering H1 with hclust (h = 0.1), Clus2 is one of 169 clusters with at least 5 members and at least one annotation. The relative frequency of the most common annotation in relation to the total number of spectra in each cluster is less than or equal to 1/3 in 38 clusters (22%). In particular, the relative frequency for Clus2 is 7%.
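The Clus2 effect can be reproduced with a toy example: two spectra that share one dominant peak have a small cosine distance even if their remaining peaks and their precursor masses differ. The Python sketch below is an illustration under stated assumptions and not the DISMS2 implementation. The spectra are hypothetical dictionaries of integer m/z bins and intensities, the tolerance `max_mass_diff` and the function names are invented for this example, and setting the distance to 1.0 when the precursor masses disagree is only one possible way to model a precursor mass filter. The precursor masses 619 and 1393 are taken from the text, the peak lists are made up.

```python
from math import sqrt


def cosine_distance(spec_a, spec_b):
    """1 - cosine similarity of two spectra given as {m/z bin: intensity} dicts."""
    dot = sum(i * spec_b.get(mz, 0.0) for mz, i in spec_a.items())
    norm_a = sqrt(sum(i * i for i in spec_a.values()))
    norm_b = sqrt(sum(i * i for i in spec_b.values()))
    return 1.0 - dot / (norm_a * norm_b)


def filtered_distance(spec_a, spec_b, mass_a, mass_b, max_mass_diff=1.0):
    """Cosine distance, set to the maximum 1.0 if the precursor masses disagree.

    max_mass_diff (in Dalton) is a hypothetical tolerance chosen for this sketch.
    """
    if abs(mass_a - mass_b) > max_mass_diff:
        return 1.0
    return cosine_distance(spec_a, spec_b)


# Toy spectra: the 120 m/z peak dominates, all other peaks differ.
spec_1 = {120: 100.0, 175: 8.0, 288: 5.0}
spec_2 = {120: 100.0, 147: 6.0, 401: 7.0}

d_plain = cosine_distance(spec_1, spec_2)
d_filtered = filtered_distance(spec_1, spec_2, mass_a=619.0, mass_b=1393.0)
print(f"cosine distance: {d_plain:.4f}, with precursor mass filter: {d_filtered:.1f}")
```

Without the filter the two toy spectra would fall below a cut height of h = 0.1 and could end up in the same cluster; with the precursor mass check they are kept apart.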

[Figure 6 plot. Left: multidimensional scaling, Coordinate 1 versus Coordinate 2; legend (ion score in parentheses): FGDFVLK (35), FGFYEVFK (48), FGDLILK (38), FGTVLK (40), FGFGAK (33), FNNFIK (28), annotation missing. Right: histogram, density versus cosine distance (0.0 to 1.0).]

Figure 6: Graphical representation of the specific cluster Clus2 for clustering H1 with hclust (h = 0.1), visualized by multidimensional scaling (left) including ion scores (Mascot) and by a histogram (right) of all pairwise cosine distances of all 13 spectra.

Extension to multiple MS/MS runs

In the previous paragraphs, single MS/MS runs were evaluated. However, applications of clustering such as the identification of novel peptides across species require clusterings of multiple MS/MS runs. High computational speed and large memory are necessary to analyze large datasets. Single runs require up to 13.5 GB RAM and about 13 minutes (Supporting Information, Table S-3). Computational speed and memory usage differ among the clustering algorithms. MS-Cluster and PRIDE Cluster perform outstandingly well in this respect. For computation we used the LiDOng high performance cluster at TU Dortmund University. Depending on the data, we requested up to 64 GB RAM on nodes with Intel Xeon E7340 (2.4 GHz) CPUs.

The three replicates per species (C, D, H, M, Y, and R4) were clustered in one go, based on up to 85032 annotated spectra (Supporting Information, Table S-2). Since running time and memory usage are limited, the optimized parameter settings of the single MS/MS runs were used. The corresponding values of the assessment measures and additional figures are contained in the Supporting Information (Tables S-6, S-7, S-8, S-9, S-10, Figures S-8, S-9, S-10, S-11). Memory usage and computing time increase up to 60 GB RAM and about 67 minutes, respectively (Supporting Information, Tables S-4, S-5). The memory limit of 64 GB was exceeded only for hierarchical clustering in some cases (replicates C1, C2, C3; H1, H2, H3; M1, M2, M3).

In the direct comparison of single and multiple MS/MS runs, the mean ARI values of multiple runs are up to 0.123 higher, i.e. 12.3 percentage points (Table 3, Supporting Information, Figure S-8, Table S-6). Clustering multiple runs of species replicates leads to larger cluster sizes. The results for average purity grouped by cluster size are similar to those of single runs (Supporting Information, Figure S-9). The average proportion of spectra remaining decreases, and the average retainment of identified peptides falls only marginally (Supporting Information, Figure S-10, Tables S-9, S-10). The proportion of clustered spectra is at least 78%. Despite this drastic increase, the proportion of incorrectly clustered spectra remains small (Supporting Information, Figure S-11, Tables S-7, S-8).
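The gain in mean ARI from clustering multiple runs can be read off Table 3 directly. The short Python sketch below recomputes the difference between the "multiple" and "single" columns for every row; the numbers are copied from Table 3, while the dictionary layout, the key prefixes and the variable names are only illustrative. The largest difference of 0.123 corresponds to the 12.3 percentage points mentioned in the text.

```python
# (single, multiple) mean ARI values copied from Table 3.
table3 = {
    "cosine: CAST (t = 0.1)": (0.538, 0.642),
    "cosine: DBSCAN (eps = 0.05, minPts = 2)": (0.544, 0.667),
    "cosine: hclust (h = 0.1)": (0.557, 0.608),
    "cosine: igraph (cdis = 0.05)": (0.544, 0.667),
    "cosine: N-Cluster (c = 0.05)": (0.534, 0.649),
    "DISMS2: CAST (t = 0.2)": (0.623, 0.721),
    "DISMS2: DBSCAN (eps = 0.2, minPts = 2)": (0.648, 0.743),
    "DISMS2: hclust (h = 0.2)": (0.590, 0.648),
    "DISMS2: igraph (cdis = 0.2)": (0.648, 0.743),
    "DISMS2: N-Cluster (c = 0.2)": (0.642, 0.730),
    "MS-Cluster (similarity = 0.8)": (0.552, 0.593),
    "PRIDE Cluster (threshold_end = 0.975)": (0.450, 0.559),
}

# Absolute gain in mean ARI when three replicates are clustered together.
gains = {name: round(multiple - single, 3)
         for name, (single, multiple) in table3.items()}

best_gain = max(gains.values())
best_rows = sorted(name for name, gain in gains.items() if gain == best_gain)
print(best_gain, best_rows)
```

Every row of Table 3 shows a positive difference, and the maximum is shared by DBSCAN and igraph on the plain cosine distance.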

Conclusions

We compared peptide annotation with clusterings obtained from different clustering algorithms with different parameter settings. The similarity between cluster results and database annotation is of particular interest. The evaluation of seven clustering algorithms has shown that well-known clustering algorithms and our new Neighbor clustering (N-Cluster) are at least as good as the established MS-Cluster. N-Cluster in particular has proved to be highly competitive. A clear improvement is achieved by the DISMS2 filter. Precursor charge, precursor mass, and retention time are included, causing database annotation and clusterings to yield similar results. Since R code is available, our analysis can be extended to other clustering algorithms, distance measures for tandem mass spectra, and evaluation techniques.

Here, we clustered tandem mass spectra of different species and, where applicable, compared the cluster results with the annotations. Due to limited resources, single MS/MS runs were analyzed first, preselecting parameter settings for all algorithms. The extension to multiple MS/MS runs of three technical replicates improved the values of the assessment measures drastically. A future plan is the evaluation on a large benchmark dataset, e.g. synthetic tryptic peptides representing essentially all canonical human gene products (28). For such a study with a large amount of data, much more memory and computing time are required.

Notes

The authors declare no competing financial interest.

Acknowledgement

This work was funded by the Leibniz-Competition Fund (SAW-2014-ISAS-2-D).

Supporting Information Available

The following files are available free of charge.

• Figure S-1: Heatmap of mean adjusted Rand index for 18 annotated MS/MS runs based on cosine distance. Page S-2.
• Figure S-2: Heatmap of mean adjusted Rand index for 18 annotated MS/MS runs based on DISMS2 filter following cosine distance. Page S-2.
• Figure S-3: Heatmap of mean adjusted Rand index for all 27 MS/MS runs based on cosine distance. Page S-3.
• Figure S-4: Heatmap of mean adjusted Rand index for all 27 MS/MS runs based on DISMS2 filter following cosine distance. Page S-3.
• Figure S-5: Percentage of spectra grouped by cluster size of clusterings based on cosine distance of spectra (top) or additionally with DISMS2 filter (bottom). Page S-4.
• Figure S-6: Examples of tandem mass spectra of the cluster in Figure 5. Page S-5.
• Figure S-7: Examples of tandem mass spectra of the cluster in Figure 6. Page S-6.
• Figure S-8: Figure 1 with extension to multiple MS/MS runs. Page S-7.
• Figure S-9: Figure 2 with extension to multiple MS/MS runs. Page S-7.
• Figure S-10: Figure 3 with extension to multiple MS/MS runs. Page S-8.
• Figure S-11: Figure 4 with extension to multiple MS/MS runs. Page S-8.
• Table S-1: Number of (annotated) tandem mass spectra. Page S-9.
• Table S-2: Number of annotated tandem mass spectra of three replicates. Page S-10.
• Table S-3: Computing time (mins) (and memory usage (GB)) of clusterings of single runs (H1, H2 and H3). Page S-10.
• Table S-4: Computing time (mins) (and memory usage (GB)) of clusterings of three replicates per species C, D and H. Page S-10.
• Table S-5: Computing time (mins) (and memory usage (GB)) of clusterings of three replicates per species M, Y and R4. Page S-11.
• Table S-6: Adjusted Rand index between annotation and clusterings of three replicates per species C, D, H, M, Y and R4. Page S-11.
• Table S-7: Proportion of clustered spectra of clusterings of three replicates per species C, D, H, M, Y and R4. Page S-12.
• Table S-8: Proportion of incorrectly clustered spectra of clusterings of three replicates per species C, D, H, M, Y and R4. Page S-12.
• Table S-9: Number of identified peptides relative to the numbers before clustering of clusterings of three replicates per species C, D, H, M, Y and R4. Page S-13.
• Table S-10: Number of consensus spectra relative to the numbers before clustering of clusterings of three replicates per species C, D, H, M, Y and R4. Page S-13.

This material is available free of charge via the Internet at http://pubs.acs.org/.

References

(1) The, M.; Käll, L. MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics. J. Proteome Res. 2016, 15, 713–720.
(2) Frank, A. M.; Bandeira, N.; Shen, Z.; Tanner, S.; Briggs, S. P.; Smith, R. D.; Pevzner, P. A. Clustering millions of tandem mass spectra. J. Proteome Res. 2008, 7, 113–122.
(3) Na, S.; Payne, S. H.; Bandeira, N. Multi-species Identification of Polymorphic Peptide Variants via Propagation in Spectral Networks. Mol. Cell Proteomics 2016, 15, 3501–3512.
(4) Beer, I.; Barnea, E.; Ziv, T.; Admon, A. Improving large-scale proteomics by clustering of mass spectrometry data. Proteomics 2004, 4, 950–960.
(5) Ben-Dor, A.; Shamir, R.; Yakhini, Z. Clustering gene expression patterns. Journal of Computational Biology 1999, 6, 281–297.
(6) Griss, J.; Foster, J. M.; Hermjakob, H.; Vizcaíno, J. A. PRIDE Cluster: building a consensus of proteomics data. Nat. Methods 2013, 10, 95–96.
(7) Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD. 1996; pp 226–231.
(8) Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; Del-Toro, N.; Rurik, M.; Walzer, M. W.; Kohlbacher, O.; Hermjakob, H.; Wang, R.; Vizcaino, J. A. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 2016, 13, 651–656.
(9) Rieder, V.; Blank-Landeshammer, B.; Stuhr, M.; Schell, T.; Biß, K.; Kollipara, L.; Meyer, A.; Pfenninger, M.; Westphal, H.; Sickmann, A.; Rahnenführer, J. DISMS2: A flexible algorithm for direct proteome-wide distance calculation of LC-MS/MS runs. BMC Bioinformatics 2017, 18, 148.
(10) Vizcaino, J. A.; Csordas, A.; del Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; Xu, Q. W.; Wang, R.; Hermjakob, H. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016, 44, D447–456.
(11) Kerschke, L. Clustern von massenspektrometrischen Daten. M.Sc. thesis, TU Dortmund University, 2016.
(12) Schork, K. U. Verbesserte Annotation von Massenspektren mit Algorithmen der Clusteranalyse. M.Sc. thesis, TU Dortmund University, 2016.
(13) Hastie, T.; Tibshirani, R.; Friedman, J. The elements of statistical learning; Springer series in Statistics, Springer, Berlin, 2009; Vol. 2.
(14) Kolaczyk, E. D.; Csárdi, G. Statistical analysis of network data with R; Springer, 2014; Vol. 65.
(15) Csardi, G.; Nepusz, T. The igraph software package for complex network research. InterJournal 2006, Complex Systems, 1695.
(16) Bischl, B.; Lang, M.; Bossek, J.; Horn, D.; Richter, J.; Surmann, D. BBmisc: Miscellaneous Helper Functions for B. Bischl. 2016; R package version 1.10.
(17) Frank, A. M.; Monroe, M. E.; Shah, A. R.; Carver, J. J.; Bandeira, N.; Moore, R. J.; Anderson, G. A.; Smith, R. D.; Pevzner, P. A. Spectral archives: extending spectral libraries to analyze both identified and unidentified spectra. Nat. Methods 2011, 8, 587–591.
(18) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building consensus spectral libraries for peptide identification in proteomics. Nat. Methods 2008, 5, 873–875.
(19) Rand, W. M. Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 1971, 66, 846–850.
(20) Warrens, M. J. On the Equivalence of Cohen’s Kappa and the Hubert-Arabie Adjusted Rand Index. Journal of Classification 2008, 25, 177–183.
(21) Hubert, L.; Arabie, P. Comparing partitions. Journal of Classification 1985, 2, 193–218.
(22) Milligan, G. W.; Cooper, M. C. A Study of the Comparability of External Criteria for Hierarchical Cluster Analysis. Multivariate Behavioral Research 1986, 21, 441–458.
(23) Manning, C. D.; Raghavan, P.; Schütze, H. Introduction to information retrieval; Cambridge University Press, Cambridge, 2008; Vol. 1.
(24) R Core Team, R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing: Vienna, Austria, 2016.
(25) Chambers, M. C. et al. A cross-platform toolkit for mass spectrometry and proteomics. Nature Biotechnology 2012, 30, 918–920.
(26) Gibb, S. readMzXmlData: Reads Mass Spectrometry Data in mzXML Format. 2015; R package version 2.8.1.
(27) Schell, T.; Feldmeyer, B.; Schmidt, H.; Greshake, B.; Tills, O.; Truebano, M.; Rundle, S. D.; Paule, J.; Ebersberger, I.; Pfenninger, M. An Annotated Draft Genome for Radix auricularia (Gastropoda, Mollusca). Genome Biology and Evolution 2017, 9, 585–592.
(28) Zolg, D. P. et al. Building ProteomeTools based on a complete synthetic human proteome. Nat. Methods 2017, 14, 259–262.
