On the Predictability of Protein Database Search Complexity and Its Relevance to Optimization of Distributed Searches

Cosmin Deciu,* Jun Sun,† and Mark A. Wall

Diversa Corporation, San Diego, California 92121

Received February 6, 2007

We discuss several aspects related to load balancing of database search jobs in a distributed computing environment, such as a Linux cluster. Load balancing is a technique for making the most of multiple computational resources, and it is particularly relevant in environments in which the usage of such resources is very high. The particular case of the Sequest program is considered here, but the general methodology should apply to any similar database search program. We show how the runtimes for Sequest searches of tandem mass spectral data can be predicted from profiles of previous representative searches, and how this information can be used for better load balancing of novel data. A well-known heuristic load balancing method is shown to be applicable to this problem, and its performance is analyzed for a variety of search parameters.

Keywords: load-balancing • Sequest • distributed computing • LPT rule • protein database search

1. Introduction One of the common tasks in proteomics is the identification of proteins present in a given sample, a task for which the method of protein sequencing by mass spectrometry has proven to be very successful.1,2 When dealing with complex mixtures of proteins, the separation of individual components is carried out by liquid chromatography (LC), while the identification step is resolved by a tandem of mass spectrometers, operating with an accuracy and scanning speed adequate for the problem at hand (such as to resolve the sample complexity and handle the LC flow). Developments in LC techniques3,4 as well as in mass spectrometry technology have led to an explosion of proteomics data, a situation that presents both a challenge and an opportunity for research labs as well as for companies in the biological high-performance computing arena.5 For example, a very fast Thermo LTQ high-performance ion trap spectrometer,6 coupled with the 3D LC-MS/MS technology developed by Jing Wei and her co-workers,4 produces 0.5-1 GB of raw data per experiment, corresponding to ∼600 000-1 200 000 MS2 spectra. To efficiently analyze such a large volume of data, database search programs such as Sequest7 have a number of options available. They can employ a data preprocessing/data reduction step, such as spectral quality filters,8 to eliminate spectra that are unlikely to produce credible peptide identifications, or parent ion charge state identification.9,10 Another option is to combine algorithm optimization with database search strategies, such as the ones used by the Global Proteome Machine,11,12 which uses incremental refinement steps in its searches. Even after applying such optimization techniques, the data volume is still large enough to require high-performance hardware, be it in the form of multiprocessor/multinode clusters or specialized hardware, such as the Sage-N Research Sorcerer. 
In particular, the usage of Linux clusters for protein identification via database searching is a very efficient solution, and it has been implemented since the late 1990s, at about the time when the ‘Beowulf cluster’ concept was starting to gain popularity.13 Database searches of spectral data are ideally suited for such parallel processing as they represent an ‘embarrassingly parallel’ problem: various nodes in a cluster can search data independently of each other. When a database search program is run in a distributed computing environment, everything that applies to the serial performance of the algorithm is still valid, but additionally, one needs to address the efficiency with which the search jobs are scheduled. In particular, efforts should be made to ensure that all participating nodes are allocated equal workloads: the so-called load balancing problem. Moreover, should such a distributed computing environment be a shared facility (meaning used also for purposes other than database searches of tandem spectral data), the resources needed for a search session need to be accurately estimated for judicious cluster usage. In this context, we will examine the factors that affect the performance of a database search program and how to determine an optimal schedule for such a task. Starting with representative 3D LC-MS/MS data, we will run a number of searches under various conditions to obtain enough information to be able to predict the effective database search time for novel data. We will then apply a load balancing scheme and evaluate its performance on a Linux cluster.

* Corresponding author. E-mail: [email protected]. † Present address: Genomatica, San Diego, CA 92121 USA. 10.1021/pr070066u CCC: $37.00 © 2007 American Chemical Society

Journal of Proteome Research 2007, 6, 3443-3448. Published on Web 07/31/2007.

2. Experimental Setup 2.1. Spectral Data: Sample Preparation and On-Line 3D LC-MS/MS Analysis. Typical 3D LC-MS/MS spectral data were obtained from Desulfovibrio vulgaris cells which were cultured and harvested as previously reported.14 Cells were lysed with 1% RapiGest (Waters Corporation, MA), 50 mM Tris-HCl, pH 8.0, 100 mM NaCl, and 2 mM EDTA buffer. Cell lysates were sonicated for 15 min, boiled for 5 min, and incubated at room temperature for 60 min to extract total proteins. After centrifugation at 20 000g for 15 min, supernatants were saved, and protein concentrations were measured. The proteins were diluted 10-fold with water and then digested with trypsin. After complete digestion, RapiGest was degraded by acid treatment, and the resulting solution was centrifuged at 20 000g for 30 min. The supernatants contained hydrophilic peptides, whereas the pellets contained hydrophobic peptides and were resuspended with 70% isopropanol. Aliquots of the resulting peptides were analyzed separately by LC-MS/MS according to previously published methods.4 Briefly, the peptides were fractionated by a 3D microcapillary column using an Agilent 1100 series HPLC (Agilent, CA), and the eluted peptides were analyzed directly by an on-line LCQ Deca XP mass spectrometer (Thermo Finnigan, CA) equipped with a nanospray source. MS/MS spectra between 400 and 2000 m/z were collected for data analysis. The sample preparation method described here allowed the extraction of total protein from the cell. The wide application range of 3D LC-MS/MS analysis can be found in ref 4.

2.2. Databases. Given the nature of the sample, the ‘standard’ database that was used for the searches was a collection of D. vulgaris proteins (as obtained from NCBI on 09/26/04), some standard proteins (common contaminants), and a set of reversed sequences. This database contains 50 748 sequences comprising a FASTA file size of 18 MB (annotations included). Detailed profiling (measurements of run times as a function of search parameters) was performed using two other databases, one containing only the first half of the ‘standard’ database (‘0.5×’) and one containing twice the set of proteins from the ‘standard’ database (‘2×’). Limited profiling was performed using 191 distinct databases, of sizes ranging from 2 to 12 822 085 sequences and file sizes ranging from 531 bytes to 760 MB. For all the searches, we have investigated two cases of in silico enzyme cleavage: one with trypsin-specific rules (‘tryptic searches’) and one without any cleavage restrictions (‘non-tryptic searches’).

2.3. Database Search Program. The program used for protein database searches implemented the widely used Sequest algorithm. We implemented the same scoring methods and data flow as previously described.7 The program was written in the C programming language and compiled with the Intel C compiler on a Linux x86 platform. The numerical results quoted in this paper have been produced on a homogeneous Linux cluster, each machine being an HP DL150 (Hewlett-Packard, CA) with 2 Pentium IV processors, a clock speed of 3.4 GHz, and 2 GB of RAM. The number of processors available to database searches varied between 50 and 200. The job management was performed with LSF (Platform Computing, Ontario, Canada).

3. Timing Method To bypass the irregularities introduced by I/O or network operations, we placed the ‘timing hooks’ in the source code right after (or right before) such operations are performed. The original implementation of Sequest performed the protein sequence reads ‘on-line’; that is, one protein sequence is read at a time and compared to the spectrum currently being analyzed. For such an implementation, it would have been less accurate to start and stop the timers for each sequence being read from a file. Instead, we used the functions from the Portable Cray Bioinformatics library16 to load the whole database in RAM and then proceed with the typical Sequest analysis, involving the calculations of the preliminary and cross-correlation scores. The execution time was measured with the C library function times, which counts all the timer intervals (of length = 10 ms, under Linux) during which the operating system only performs tasks associated with this program.15 This sort of timing method is known to be too coarse grained for very short calculations (100 ms or less), but it has the advantage of not having a strong dependency on the load of the system. The execution times reported in this paper are all outside the range for which the interval-counting method is inaccurate. What is reported is the total user time; the profiling charts exclude the time spent by the operating system performing tasks unrelated to the protein database search. This timing method may seem synthetic, but it ultimately appeared to the authors to be the most coherent choice in light of the existing variability in the weight of the input data storage and parsing problems. For a ‘real-world’ measurement, one would also have to estimate the impact of remote file storage versus local disk file storage, simple text parsing versus mzXML parsing, plain text formatting versus binary formatting, and compressed versus uncompressed input data. The computational costs associated with these I/O operations are of significant importance only when their magnitude is comparable to the cost of the database search per se. This will be the case for low-complexity searches (small databases, enzyme specific, no differential modifications). For high-complexity searches, however, the time spent performing such I/O operations is very small compared to the time taken by the database search; for these cases, the performance of our Sequest implementation is indistinguishable from that of the original Sequest.
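As an illustration of the user-time measurement described above, the following is a minimal Python sketch, not the authors' C code. It relies on `os.times()`, which exposes the same process CPU-time counters as the C function `times`; the helper name `user_cpu_seconds` and the `busy_work` stand-in for a database search are hypothetical.

```python
import os


def user_cpu_seconds(func, *args, **kwargs):
    """Run func and return (result, user CPU seconds used by this process)."""
    before = os.times()
    result = func(*args, **kwargs)
    after = os.times()
    # .user is the user-mode CPU time of the current process only; system
    # time and time consumed by other processes are not included.
    return result, after.user - before.user


def busy_work(n):
    # Hypothetical stand-in for a search: pure CPU work with no I/O.
    total = 0
    for i in range(n):
        total += i * i
    return total


if __name__ == "__main__":
    _, seconds = user_cpu_seconds(busy_work, 2_000_000)
    print(f"user time: {seconds:.2f} s")
```

Because the counter advances in clock ticks, very short calls may report zero, which mirrors the coarse-granularity caveat noted in the text.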

4. On- and Off-Line Scheduling On-line scheduling is a job allocation method in which new jobs are distributed as they become available or as machines to process them become available. The original parallelized version of Sequest, built with the Parallel Virtual Machine (PVM) library, was based on what is called a ‘crowd computation’ model:9,17 a master process initiates the search session and starts sending spectra to the nodes in a PVM cluster to be searched. As the nodes complete a job (one spectrum), they communicate availability to the master process and, in turn, receive a new spectrum to be analyzed. The initial list of spectra to be searched (files in DTA format) is created by the master process, and the order in which the spectra are to be submitted is simply based on the alphanumeric order of the file names. Off-line scheduling corresponds to a static job allocation, in this case, a ‘batch splitting’: spectra are grouped into batches, each of which is allocated to a specific machine. This sort of ‘batch splitting’ is used by another database search program, Parallel Tandem.12 Usually, these batches consist of a constant number of spectra, taken from an alphanumerically ordered list. As this order is not correlated with the search times, it is essentially an arbitrary order as far as off-line scheduling is concerned. It is obvious that the on-line job allocation for Sequest described above will lead to a more even distribution of the whole workload than a mere random (alphanumerical) batch splitting. However, there exists a method of creating such job batches that will lead to the same even distribution of the workload as on-line scheduling, a method known as Graham’s rule:18 allocate randomly ordered jobs, one by one, to the machine which currently has the smallest load. The equivalence is based on the fact that smaller loads are composed of spectra which will take less time to complete (shorter runtimes). Off-line scheduling requires that the search times for all of the spectra to be analyzed be known in advance. For both on-line and off-line scheduling, an even better load balancing can be achieved if, instead of starting with randomly ordered jobs, the jobs are ordered by decreasing execution time19 and then Graham’s rule is applied. In the language of Scheduling Theory, this rule is known as Longest Processing Time first (LPT), with the objective of minimizing the makespan Cmax, that is, minimizing the largest run time among all the batches. Note that a perfectly balanced schedule would lead to all batches having equal run times. Off-line scheduling with the LPT rule was the scheduling of choice for the results presented in this paper, mostly because it allows for a looser coupling of the cluster nodes. Notice that the goal of finding the exact distribution of jobs into batches for optimal load balancing is not practical, as this problem is known to be NP-hard20 (meaning, informally, that no efficient algorithm for solving this problem is known). The problem of minimizing the makespan on parallel machines (the load balancing problem at hand) is closely related to the Integer Partition problem, one of the classic NP-hard problems.
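The LPT rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function names `lpt_schedule` and `makespan` and the toy job times are assumptions made for the example.

```python
import heapq


def lpt_schedule(times, n_machines):
    """Longest Processing Time first.

    Jobs are sorted by decreasing estimated time and then assigned, one by
    one, to the machine with the smallest current load (Graham's rule).
    Returns one list of job indices per machine.
    """
    # Min-heap of (current load, machine index).
    heap = [(0.0, m) for m in range(n_machines)]
    batches = [[] for _ in range(n_machines)]
    order = sorted(range(len(times)), key=lambda j: times[j], reverse=True)
    for j in order:
        load, m = heapq.heappop(heap)
        batches[m].append(j)
        heapq.heappush(heap, (load + times[j], m))
    return batches


def makespan(times, batches):
    """Largest total run time over all batches."""
    return max(sum(times[j] for j in batch) for batch in batches)


if __name__ == "__main__":
    est = [7, 5, 4, 3, 3, 2]  # hypothetical estimated search times
    batches = lpt_schedule(est, 2)
    print(batches, makespan(est, batches))
```

For the toy input, both machines end up with a load of 12, which equals the ideal makespan (total time 24 divided by 2 machines). Dropping the sort and shuffling the jobs instead gives plain Graham's rule on randomly ordered jobs.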

5. Search Time Variability The effectiveness of any off-line scheduling scheme depends on the accuracy of predicting the search time for a novel spectrum. Even though we are profiling a completely deterministic application, there is a stochastic component associated with the environment in which the program is run. Under both real-world and ‘laboratory’ conditions, there will be some variability in the execution time for many programs that are run in user space. To evaluate the distribution of run times for various search conditions, we have performed repeated runs of the Sequest program. Ideally, running a given program N times on a single machine is equivalent to running the same program once on N identical machines. If both the individual execution time and N are large, the two distributions (repeated runs on a single machine and runs on a homogeneous cluster) will be different because of scheduled system tasks that will make the interval counting-based timing method less accurate. To compare these two distributions, we searched a random spectrum 500 times on a single machine and also searched the same spectrum 10 times on 50 distinct machines (all of identical hardware profile). The resulting data are plotted in Figure 1. An adequate model for both these distributions seems to be the Generalized Extreme Value (GEV) distribution.21 To check the equivalency of these two populations, a more revealing test is the nonparametric Kolmogorov-Smirnov test,22 which also concludes that the hypothesis that these two distributions are the same cannot be rejected (at a 5% significance level). The goodness of this particular fit does not imply that its parameters are also appropriate for the rest of the spectra. All three parameters of the GEV fits (location, scale, and shape) have been found to vary from one spectrum to another, with the location parameter being the one that changes the most. The orders of magnitude of the scale and shape parameters did not vary as much, remaining in the vicinity of $10^{-1}$ and $10^{-2}$, respectively.
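The comparison of the two run-time populations can be illustrated with a small pure-Python sketch of the two-sample Kolmogorov-Smirnov test. It is a sketch under stated assumptions: the GEV parameters in the usage lines are hypothetical placeholders, not the fitted values from this work, and the critical value uses the standard large-sample approximation with coefficient 1.358 for a 5% significance level.

```python
import math
import random


def gev_sample(rng, loc, scale, shape):
    """Draw one GEV variate by inverting the CDF (requires shape != 0)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return loc + scale * ((-math.log(u)) ** (-shape) - 1.0) / shape


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical(n, m, c_alpha=1.358):
    """Large-sample critical value; c_alpha = 1.358 is for alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical parameters: 500 runs on one machine, 10 runs x 50 machines.
    single = [gev_sample(rng, 20.0, 0.1, 0.01) for _ in range(500)]
    cluster = [gev_sample(rng, 20.0, 0.1, 0.01) for _ in range(500)]
    d = ks_statistic(single, cluster)
    print("reject equality" if d > ks_critical(500, 500) else "cannot reject")
```

With 500 observations in each sample, the hypothesis of equal distributions is rejected at the 5% level only if the statistic exceeds roughly 0.086.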


Figure 1. Distribution of search time for a single spectrum, tryptic search against the ‘0.5×’ database. Overlapped histograms are for a single machine (hatched) and for 50 concurrent machines (solid). The smooth curve is a Generalized Extreme Value fit (same parameters in both cases). The number of bins in the histogram was computed using Scott’s rule.29
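For readers who want to reproduce a histogram like the one in Figure 1, the following sketch shows one common form of Scott's rule, in which the bin width is 3.49 times the sample standard deviation times n^(-1/3). The function name `scott_bin_count` is an assumption, and the snippet is not taken from the authors' plotting code.

```python
import math
import statistics


def scott_bin_count(data):
    """Number of histogram bins from Scott's rule: h = 3.49 * s * n**(-1/3)."""
    n = len(data)
    width = 3.49 * statistics.stdev(data) * n ** (-1.0 / 3.0)
    if width == 0.0:  # all values identical: a single bin suffices
        return 1
    return max(1, math.ceil((max(data) - min(data)) / width))
```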

The problem at hand is therefore a stochastic load balancing one. Most of the work in the area of stochastic scheduling has focused on scenarios in which the execution times have a Poisson or exponential distribution.20,23 Kleinberg et al.24 have obtained an O(1)-approximation for an arbitrary distribution of execution times, but no attempt was made to optimize the constant of approximation. Goel and Indyk25 give a 2-approximation for Poisson-distributed runtimes and a polynomial time approximation scheme for exponentially distributed runtimes. Given the similarity between the Poisson distribution and the GEV distribution, we adopted the algorithm from ref 25 for stochastic load balancing, without a formal proof regarding its guaranteed performance for now.

6. Factors Affecting the Search Time Any proteomic database search program has two kinds of input: spectral data, characterized by parent ion mass, charge state, and ion sequence; and database and search conditions. We built a model that incorporates these variables and produces an estimate of the search time that can be used in a load-balancing scheme like Graham’s rule. Figure 2 shows the dependency of search time on parent ion mass, for the databases described in section 2.2. Plot 2A corresponds to a tryptic search and plot 2B to a non-tryptic search. The points in these graphs were obtained by searching, concurrently on 75 identical machines, a set of 250 spectra that were randomly selected from the complete mass spectral data so as to cover the whole m/z range. The median search time was then plotted against its corresponding mass value. For both cases, tryptic and non-tryptic searches, the search time increases with parent ion mass. In the mass range 1000-4000 Da, the curves are piecewise smooth. There are three obvious discontinuities at M ∼ 1000, 2000, and 4000 Da. It is no coincidence that these gaps occur at such mass values, for they are related to the increase in the expense of the cross-correlation calculations, which use Fast Fourier Transforms (FFT). The size of the vectors involved in these calculations, which depends on the magnitude of the parent ion mass, must be a power of 2; as such, there will exist specific values of the parent ion mass at which the lengths of these vectors and, consequently, the complexity of the FFT calculations change. For the smallest of the three databases, the smoothness of the curve of search time versus mass is lost around 4000 Da, and we can see bigger fluctuations in the data points that follow after this value. This is related to the fact that, for a smaller database, peptide candidates of such larger masses are, comparatively, rare events, and consequently, there is a higher chance that calculations for such larger fragments are significantly different from one another.

Figure 2. Search time versus parent ion mass: (A) tryptic search, (B) non-tryptic search, (C) search with differential modifications [M 14, C 57]. Profiles for three databases: ‘0.5×’ (dots, red), ‘standard’ (crosses, green), and ‘2×’ (squares, blue).

The plot for the non-tryptic search is similar to the tryptic one, except that a second kind of discontinuity is present, namely, one due to stronger differentiation between charge states. Our implementation of the Sequest algorithm constructs the theoretical spectra for the candidate peptides identically for +1 and +2 charge states, and only the +3 charge state is different (not to be confused with the problem of +2/+3 charge state indeterminacy). The difference in computational expense between these charge states is not significant for tryptic searches, but it is so for non-tryptic searches. What is clear for both plots is that the data points follow a pattern that is suitable for accurate interpolation (aside from some smaller regions of the parent ion mass). The median search time can be approximated as a piecewise (quasi)linear function of mass, the non-tryptic case being the one that shows a stronger linear dependency. The penalty due to the increase in the FFT calculations (marked on the graphs with ∆FFT) is independent of the size of the searched database, unlike the charge-state penalty (marked on the graphs with ∆Z). Moreover, in the non-tryptic case, the slope of the search time function exhibits greater relative changes at each discontinuity point, compared to the tryptic search case. Increased computational complexity is present in the case of a ‘differential modifications’ search, that is, a protein sequence database search for which certain amino acids are allowed to have different mass values. This type of search is encountered in the case of post-translational modifications.26 The plot in Figure 2C shows the search time profile for a search session that included potential methylation and carboxamidomethylation on cysteine, as well as nonspecific enzyme cleavage.
The dependency is no longer linear, but is rather of the (piecewise) form $t = \exp(a \log^{b}(m))$, where a depends on the types of amino acids susceptible to mass modification and b on the size of the database. The FFT and charge-state penalties are still present, the latter being even more dominant for larger databases than in the case of searches without differential modifications. The above profiles are independent of the LC separation method and the nature of the sample. This was confirmed by the fact that identical profiles were obtained with a random data set, ‘HUPO12_run34’, a HUPO plasma sample obtained from the Peptide Atlas repository.27,28 As is evident from Figure 2, the slope of the graphs increases with the size of the database. To further investigate this dependency, we performed the following experiment: given two representative spectra, one with parent ion mass ∼1500 Da, Z = 2, and one with parent ion mass ∼3000 Da, Z = 3, we searched them against the collection of databases described in section 2.2. In all three cases analyzed (tryptic, non-tryptic, and differential modifications), the dependency of search time on database size follows a linear trend, albeit at greatly different rates from one search type to another. Compared to the predictability of the search time versus parent mass analysis, there is a stronger stochastic component in these types of results. Nevertheless, a model with fair accuracy can still be found; in the cases illustrated, the dependency can reasonably be approximated as linear. Together with the profiling data of search time versus mass, a full profile can be built that evaluates search time as a function of spectral properties (parent ion mass), database size, and search type.
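A piecewise-linear search-time profile of the kind described in this section could be sketched as follows. This is an illustration only: the breakpoints default to the approximate discontinuities reported above (1000, 2000, and 4000 Da), the function names are hypothetical, the charge-state discontinuity is ignored, and the sample data in the usage lines are made up rather than measured.

```python
from bisect import bisect_right


def fit_line(points):
    """Ordinary least-squares fit t = slope * m + intercept."""
    n = len(points)
    mean_m = sum(m for m, _ in points) / n
    mean_t = sum(t for _, t in points) / n
    sxx = sum((m - mean_m) ** 2 for m, _ in points)
    if sxx == 0.0:  # all masses identical: fall back to a constant
        return 0.0, mean_t
    sxy = sum((m - mean_m) * (t - mean_t) for m, t in points)
    slope = sxy / sxx
    return slope, mean_t - slope * mean_m


def build_profile(samples, breakpoints=(1000.0, 2000.0, 4000.0)):
    """Fit one line per mass segment; samples are (mass, median_time) pairs."""
    segments = [[] for _ in range(len(breakpoints) + 1)]
    for m, t in samples:
        segments[bisect_right(breakpoints, m)].append((m, t))
    lines = [fit_line(s) if len(s) >= 2 else None for s in segments]
    return breakpoints, lines


def predict_time(profile, mass):
    """Estimated median search time for a spectrum of the given parent mass."""
    breakpoints, lines = profile
    line = lines[bisect_right(breakpoints, mass)]
    if line is None:
        raise ValueError("no profiling data for this mass segment")
    slope, intercept = line
    return slope * mass + intercept


if __name__ == "__main__":
    # Made-up profiling points (mass in Da, median time in s).
    data = [(1200, 1.2), (1800, 1.8), (2500, 5.0), (3500, 7.0)]
    profile = build_profile(data)
    print(predict_time(profile, 1500), predict_time(profile, 3000))
```

One profile of this kind would be built per database and search type; the linear dependency on database size reported above could then be used to rescale predictions for other databases.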


Figure 3. Search time versus database size for various search types. Experimental data and linear fits for M = 1500 Da (dots, blue) and M = 3000 Da (crosses, red).

7. The Performance of the LPT Load-Balancing Scheme Just how efficient is the load-balancing scheme introduced by the LPT rule? To answer this question, we performed the following experiment: we took a set of 16 000 spectra (this number was chosen as it is equal to the average number of spectra per 3D LC-MS/MS step in our experiments), randomly selected from the data set corresponding to the whole experiment; using the search-time profiles established in the previous section for the ‘standard’ database, we estimated the median search time for each spectrum. We then used the LPT rule to compute the distribution of these 16 000 spectra into various numbers of batches. The gauge for measuring the performance of the LPT rule was to compare it to a trivial load balancing scheme, one in which each spectrum has the same weight. Consequently, in this trivial load balancing, each batch will have the same number of spectra and the packing order is irrelevant (random). For a meaningful comparison, the number of batches for the LPT load balancing scheme is the same as the number of batches for the trivial load balancing. In total, 10 000 random shuffles of the table of search times for this collection of spectra have been performed in order to estimate the average of the ‘heaviest’ batch (the average of the makespan). A measure of goodness of load balancing by the LPT rule is then the difference between this average makespan and the makespan by LPT, normalized to the ideal makespan (total number of spectra divided by number of batches):

$$\gamma = \frac{(\text{avg. makespan from random packing}) - (\text{makespan by LPT})}{\text{ideal makespan}}$$

The efficiency of the LPT rule is affected by the following factors:

• search complexity. Both the type of search and the size of the database will directly impact the distribution of search times. A complex search, such as a differential modifications search against a large database, will result in a flat distribution of search times extending over a large time range. Search times following such a distribution would be harder to load-balance randomly, with constant size batching only. On the other hand, a simpler type of search, such as a tryptic search against a smaller database, will result in a sharper distribution of search times. Equal-sized batches are then expected to lead to a load balancing that will be close to the ideal one.

• number of machines (number of batches). A larger number of machines will make the random balancing less efficient. A closely related parameter is

• size of the batch.

These last two can be combined into a single factor, by defining the load ratio

$$\text{load ratio} = \frac{\text{number of machines}}{\text{number of spectra per batch}}$$

We have computed the value of γ for various values of the load ratio, resulting from various combinations of the number of participating machines in the cluster and the number of spectra per batch. As a function of the load ratio, γ follows a power law in the region analyzed, as illustrated by the linear plot in log-log coordinates in Figure 4. For example, for a differential modifications search, for a number of 1600 batches and 10 spectra per batch, there is a 95% difference between the LPT makespan and the average random makespan (normalized to the ideal makespan), but for 100 batches of 160 spectra per batch, this difference is only 13%. These differences are even smaller in the case of a less complex search, such as a tryptic search: 18% difference in the case of 1600 batches/10 spectra per batch and only 3% difference in the case of 160 batches/100 spectra per batch. Unlike the search time versus mass profiles, these numerical results regarding the performance of the load balancing scheme under LPT do depend on the nature of the sample as well as on the LC separation method, almost by definition: these characteristics dictate the distribution of job sizes that need to be load-balanced. A reduced mass range will lead to a reduced search time range, and hence to jobs that are more similar to each other and trivial to load-balance.


Figure 4. Efficiency of load-balancing: relative difference between the makespan by LPT rule and average random makespan as a function of batch size and number of cluster nodes.
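The efficiency measure γ plotted in Figure 4 can be estimated numerically along the following lines. This is a hedged sketch, not the authors' code: the function names are hypothetical, the default of 200 shuffles is chosen only to keep the example fast (the study used 10 000), and the ideal makespan is taken here as the total estimated time divided by the number of batches, an assumption made so that γ is dimensionless.

```python
import heapq
import random


def lpt_makespan(times, n_batches):
    """Makespan obtained with the Longest Processing Time first rule."""
    loads = [0.0] * n_batches
    heapq.heapify(loads)
    for t in sorted(times, reverse=True):
        # Replace the smallest load by that load plus the new job.
        heapq.heapreplace(loads, loads[0] + t)
    return max(loads)


def random_makespan(times, n_batches, rng):
    """Makespan of equal-count batches cut from a random shuffle."""
    shuffled = list(times)
    rng.shuffle(shuffled)
    size = -(-len(shuffled) // n_batches)  # ceiling division
    return max(sum(shuffled[i:i + size])
               for i in range(0, len(shuffled), size))


def gamma(times, n_batches, n_shuffles=200, seed=0):
    """(average random makespan - LPT makespan) / ideal makespan."""
    rng = random.Random(seed)
    avg_random = sum(random_makespan(times, n_batches, rng)
                     for _ in range(n_shuffles)) / n_shuffles
    ideal = sum(times) / n_batches  # assumption: time-based ideal makespan
    return (avg_random - lpt_makespan(times, n_batches)) / ideal
```

When all jobs have the same estimated time, random packing is already perfectly balanced and γ is zero; the broader the distribution of search times, the larger γ is expected to be.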

8. Conclusion Load balancing of Sequest jobs can lead to significant reductions in total runtime, particularly for computationally complex searches. We have provided an empirical solution in the framework of stochastic load balancing by adopting the ‘Longest Processing Time first’ rule and by using estimates of the runtimes for given spectral data and search conditions. These estimates have been shown to be inferable, in the form of their average values, from a single input type: parent ion mass. Load balancing (by the LPT rule) is not always necessary: simpler database searches have a narrow distribution of search times for most types of input data, and as such, a simple packing with a constant number of spectra per batch is sufficient. For the more complex searches, when such a load balancing scheme is beneficial, the timing profiles presented in this paper are the same as those that would be computed with the original Sequest implementation. As such, no modifications to Sequest are needed in order to reproduce these results; load balancing will be achieved by a mere re-packing of the input data, according to the estimated processing time. As mentioned in section 4, the database search engine Parallel Tandem also resorts to an initial packing of the input data. An implementation of the load-balancing LPT scheme is even more straightforward in the case of this engine. The methodology presented in this paper can be used to obtain the timing profiles, and this is the subject of future work.

Acknowledgment. This work was partially supported by the U.S. Department of Energy’s Genomics:GTL program, under grant DE-AC03-76SF00098. We thank John R. Yates, III, Paul Oeller, and Mick Noordewier for advice and valuable suggestions.

References
(1) Biemann, K. Sequencing of peptides by tandem mass spectrometry and high-energy collision-induced dissociation. Methods Enzymol. 1990, 193, 455-479.
(2) Hunt, D. F.; Yates, J. R., III; Shabanowitz, J.; Winston, S.; Hauer, C. R. Protein sequencing by tandem mass spectrometry. Proc. Natl. Acad. Sci. U.S.A. 1986, 83, 6233-6237.
(3) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001, 19 (3), 242-247.
(4) Wei, J. et al. Global proteome discovery using an online three-dimensional LC-MS/MS. J. Proteome Res. 2005, 4 (3), 801-808.
(5) The Power of Distributed Computing in Cancer Research: Supercomputing 2005 Keynote Demonstration. Microsoft Corporation.
(6) Yates, J. R., III; Cociorva, D.; Liao, L.; Zabrouskov, V. Performance of a linear ion trap-Orbitrap hybrid for peptide analysis. Anal. Chem. 2006, 78, 493-500.
(7) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(8) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R., III. Automatic quality assessment of peptide tandem mass spectra. Bioinformatics 2004, 20 (Suppl. 1), I49-I54.
(9) Sadygov, R. G.; Eng, J. K.; Durr, E.; Saraf, A.; McDonald, H.; MacCoss, M. J.; Yates, J. R., III. Code developments to improve the efficiency of automated MS/MS spectra interpretation. J. Proteome Res. 2002, 1 (3), 211-215.
(10) Klammer, A. A.; Wu, C. C.; MacCoss, M. J.; Noble, W. S. Peptide charge state determination for low-resolution tandem mass spectra. Proc. IEEE Comput. Syst. Bioinform. Conf. 2005, 175-185.
(11) The global proteome machine organization proteomics database and open source software Web page, http://www.thegpm.org.
(12) Duncan, D. T.; Craig, R.; Link, A. J. Parallel Tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X!Tandem. J. Proteome Res. 2005, 4 (5), 1842-1847.
(13) Sterling, T.; Salmon, J.; Becker, D. J.; Savarese, D. F. How to Build a Beowulf: A Guide to the Implementation and Application of PC Clusters; MIT Press: Cambridge, MA, 1999.
(14) Mukhopadhyay, A. et al. Salt stress in Desulfovibrio vulgaris Hildenborough: an integrated genomics approach. J. Bacteriol. 2006, 188 (11), 4068-4078.
(15) Bryant, R. E.; O’Hallaron, D. R. Computer Systems: A Programmer’s Perspective; Prentice Hall: Upper Saddle River, NJ, 2003.
(16) Long, J. The Portable Cray Bioinformatics Library. Arctic Region Supercomputing Center. http://cbl.sourceforge.net.
(17) Geist, A.; Beguelin, A.; Dongarra, J.; Jiang, W.; Manchek, R.; Sunderam, V. PVM: Parallel Virtual Machine. A Users’ Guide and Tutorial for Networked Parallel Computing; MIT Press: Cambridge, MA, 1994.
(18) Graham, R. L. Bounds for certain multiprocessing anomalies. Bell System Tech. J. 1966, 45, 1563-1581.
(19) Graham, R. L. Bounds on multiprocessing timing anomalies. SIAM J. Appl. Math. 1969, 17, 263-269.
(20) Pinedo, M. Scheduling: Theory, Algorithms and Systems; Prentice Hall: Upper Saddle River, NJ, 1995.
(21) Coles, S. An Introduction to Statistical Modeling of Extreme Values; Springer-Verlag: London and New York, 2001.
(22) See, for example, Massey, F. J., Jr. The Kolmogorov-Smirnov test for goodness of fit. J. Am. Statist. Assoc. 1951, 46, 68-78.
(23) Weiss, G. Approximation results in parallel machines stochastic scheduling. Ann. Oper. Res. 1990, 26, 195-242.
(24) Kleinberg, J.; Rabani, Y.; Tardos, E. Allocating bandwidth for bursty connections. Proceedings of the 29th ACM Symposium on Theory of Computing, 1997.
(25) Goel, A.; Indyk, P. Stochastic load balancing and related problems. Proceedings of the 40th Annual Symposium on Foundations of Computer Science, Oct 17-19, 1999, New York, NY, pp 579.
(26) Mann, M.; Jensen, O. N. Proteomic analysis of post-translational modifications. Nat. Biotechnol. 2003, 21, 255-261.
(27) Sample accession PAe000100, sample name HUPO12_run34, obtained from http://www.peptideatlas.org/repository/ in January 2007.
(28) Omenn, G. S. The Human Proteome Organization Plasma Proteome Project pilot phase: reference specimens, technology platform comparisons, and standardized data submissions and analyses. Proteomics 2004, 4 (5), 1235-1240.
(29) Scott, D. W. On optimal and data-based histograms. Biometrika 1979, 66, 605-610.
