Technical Note
Cloud CPFP: A Shotgun Proteomics Data Analysis Pipeline Using Cloud and High Performance Computing

David C. Trudgian and Hamid Mirzaei*

Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Boulevard, Dallas, Texas 75390-8816, United States
ABSTRACT: We have extended the functionality of the Central Proteomics Facilities Pipeline (CPFP) to allow use of remote cloud and high performance computing (HPC) resources for shotgun proteomics data processing. CPFP has been modified to include modular local and remote scheduling for data processing jobs. The pipeline can now be run on a single PC or server, a local cluster, a remote HPC cluster, and/or the Amazon Web Services (AWS) cloud. We provide public images that allow easy deployment of CPFP in its entirety in the AWS cloud. This significantly reduces the effort necessary to use the software, and allows proteomics laboratories to pay for compute time ad hoc, rather than obtaining and maintaining expensive local server clusters. Alternatively, the Amazon cloud can be used to increase the throughput of a local installation of CPFP as necessary. We demonstrate that Cloud CPFP allows users to process data at higher speed than local installations but with similar cost and lower staff requirements. In addition to the computational improvements, the web interface to CPFP is simplified, and other functionalities are enhanced. The software is under active development at two leading institutions and continues to be released under an open-source license at http://cpfp.sourceforge.net.

KEYWORDS: cloud computing, search engine, mass spectrometry, pipeline, TPP
INTRODUCTION

The Central Proteomics Facilities Pipeline (CPFP)1 was released in 2008 as an open-source data analysis platform targeted at the needs of core proteomics facilities. Built upon tools from the Trans-Proteomic Pipeline (TPP),2 it has grown into a comprehensive computational platform for protein identification and quantification that uses and expands on various features of the TPP. Workflows are heavily automated and all results are now stored in a relational database. CPFP offers a more comprehensive, user-friendly web interface than the Petunia interface provided in the TPP. This frees clients from the need to configure and run searches on multiple search engines using command-line tools and/or configuration files. Our web interface uses a Model-View-Controller (MVC) framework and can scale to support a large number of users. The database permits quick access to results, with efficient filtering of peptides and proteins in the simplified results viewers. Compared to the TPP viewers, which operate by parsing XML files, CPFP significantly speeds up access to large data sets, allowing responsive concurrent access for multiple users in a core setting. CPFP has also been expanded to include novel software such as the SINQ spectral index quantitation tool3 and the ModLS post-translational modification localization scoring method.4 It has been the primary analysis platform for busy proteomics facilities at two institutions since 2009 and 2011, respectively. These installations have performed 15 416 database searches for 141
users, yielding over 243 million peptide-spectrum matches (PSMs). All results remain accessible via the web interface, and it is possible to search across multiple submissions for peptides or proteins of interest identified in previous work. The scale of these existing installations highlights the suitability of CPFP for core facilities and suggests it can be extended for cloud and high-performance computing analyses of very large data sets. Here we present Cloud CPFP, a new version of CPFP that allows cloud and remote cluster-based processing of data. As the size of data files generated in proteomics core facilities and research laboratories increases, the computing power required to process these files increases accordingly. While modern instruments produce considerably larger data files than their predecessors, research laboratories and core facilities are limited in their ability to purchase and maintain additional computers. The data processing requirements of large facilities or laboratories may now require multiple powerful workstations, server computers, or, in extreme cases, a compute cluster. These requirements involve significant up-front purchasing costs, with additional expenditure and staff time for ongoing maintenance and upgrades. Cloud computing services and institutional compute clusters offer the possibility to purchase computing power ad hoc, drastically reducing the up-front costs of establishing a large-scale data analysis pipeline, and
avoiding the staff-time needed to maintain a local computational infrastructure.
Cloud and Cluster Computing

Rather than purchasing and locally housing computing, data storage, and other infrastructure, cloud computing allows users to rent these resources from a cloud services provider and pay only for the time used. This model replaces purchasing and maintenance costs with a computing usage cost, which can be much lower when the user does not need sizable computational resources on a full-time basis. A variety of cloud service providers now offer a wide range of services, some of the largest commercial examples being the Amazon Web Services (AWS) cloud (Amazon.com Inc., Seattle, WA), the Windows Azure cloud (Microsoft Corp., Redmond, WA), and the Rackspace cloud (Rackspace, San Antonio, TX). The services of these providers differ in terms of features, performance, and cost, requiring that applications be targeted to a particular platform. AWS is the largest service and has been used for a wide range of bioinformatics computing projects.5−7 Within AWS, users may purchase computational power in the Elastic Compute Cloud (EC2), and data storage in the Simple Storage Service (S3) or Elastic Block Store (EBS). Many other infrastructure services are available, including the Simple Queue Service (SQS) for message queuing and the Relational Database Service (RDS), which offers dedicated relational database instances. EC2 and RDS instances are available in a range of sizes and prices, charged per hour. Other services are charged per unit of usage, e.g., GB of storage, number of requests, GB of data transfer, etc. While the AWS system and its charging structure are complex, they offer flexibility for bioinformatics computing, with different configurations supporting various trade-offs between throughput and cost. An application programming interface (API) allows programmatic control of cloud resources, and a web interface provides simple options for managing an AWS account. In the cloud, on-demand resources are entirely under user control for the period that they are purchased. Compute instances function as independent computers, which may be configured and used as desired. By comparison, cluster and grid facilities typically provide users with time on large shared systems, where jobs may be submitted and are scheduled according to overall demand. Compute clusters can consist of a small number of machines dedicated to a single task. For example, the previous version of CPFP required the GridEngine cluster queue system (Oracle Corp., Santa Clara, CA) to execute data processing jobs. In this respect, an installation of CPFP on a single machine constituted a single-node cluster, and expansion was only possible by adding additional systems to the cluster. This configuration can be considered a local and dedicated cluster, where the software has direct access to and complete control over the job scheduling system. The software and cluster nodes all have access to the same file system and CPFP database. Larger general-purpose remote clusters that are shared between large numbers of users and tasks are also becoming more commonplace. Academic staff usually have access to some kind of cluster shared within or between academic institutions, e.g., the systems of the XSEDE network. Often these systems offer large amounts of compute time and storage at little or no cost to users and are centrally managed. They differ from a local dedicated cluster in that they typically cannot be used to host a web interface or database that must run constantly; they are limited to the execution of data processing jobs of limited duration. All data must be transferred into the cluster's dedicated file system over a network link, and processing tasks must be submitted to a scheduling system. Typically, logging into a head node using a command-line protocol such as Secure Shell (SSH) is needed to perform these tasks. The shared nature of the clusters means that submitted jobs can take many hours or days to begin running in periods of heavy use. However, the huge amount of compute time offered at low or no cost to users is extremely useful when large analyses are needed and there are no time restrictions. Both cloud and cluster computing allow local installations of software to offload extensive data processing to remote systems, thus lessening the need to purchase, install, and maintain additional local resources. Our facility currently utilizes a small number of local servers for data processing that require a limited amount of time for maintenance. We have developed Cloud CPFP in order to restrict our need to expand and maintain large computing resources in the future and to be able to direct our bioinformatics effort toward developmental and experimental work.
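The API mentioned above can be exercised from any language with an AWS client library. The following minimal sketch is not CPFP code (CPFP itself is written in Perl); it uses the Python boto3 library to list running EC2 instances and their hourly-billed types, assuming AWS credentials are already configured in the environment.

# Minimal illustration of programmatic control of cloud resources via the AWS
# API, using Python and boto3. Not part of CPFP; purely illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]

for res in reservations:
    for inst in res["Instances"]:
        # Instance type determines the per-hour charge for on-demand usage.
        print(inst["InstanceId"], inst["InstanceType"], inst["LaunchTime"])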
Existing Cloud and Cluster Tools for Proteomics
Existing cloud computing solutions for proteomics data analysis include the academic ViPDAC platform,8 the commercial IP2 Integrated Proteomics Pipeline (Integrated Proteomics Applications, San Diego, CA, http://www.integratedproteomics.com/), and tools within the TPP itself. These software packages have design patterns similar to the cloud functionality in CPFP, using the AWS cloud architecture. ViPDAC is a more limited platform compared to CPFP, TPP, or IP2 and only supports the submission and execution of searches and retrieval of results; it does not provide comprehensive online web-based result viewers, the ability to combine results from multiple search engines, quantitation tools, or the further advanced features present in the other cloud computing platforms. Despite the limited nature of this platform, the developers (Halligan et al.)8 presented a convincing argument that using cloud computing for proteomics data analysis could help improve analysis times and control cost. Useful data on the trade-off between performance and cost of database searches were provided to establish the financial benefits of cloud-based computing. Since publication in 2009, cloud providers have launched more powerful infrastructure and lowered prices, which may further increase the benefits of cloud computing. Existing code within the TPP currently supports (in version 4.5.2) the use of the pipeline on compute clusters or in the Amazon cloud. The comprehensive hpctools command line scripts, which are part of TPP, allow the execution of searches and postprocessing on compute clusters/grids and across multiple compute instances in the cloud. However, these tools require users to have access to a Linux system and be comfortable with the command line interface. The existing TPP can also be run entirely within the AWS cloud by using the provided web-based launcher or command line scripts to start a compute instance hosting the pipeline. Users can then interact with the pipeline via its web interface or by logging into the command line interface. TPP's web interface (Petunia) currently does not support the use of hpctools functionality or easy parameter selection for searches. We have developed cloud and cluster computing features for CPFP independent of the TPP code, to support CPFP's focus as a data analysis pipeline for core facilities, which often
demand software to be centrally administered and accessed by all core staff and customers. Cluster or cloud computing can be used to increase throughput of an existing local installation. Alternatively, CPFP can be launched entirely in the cloud, similar to the TPP AWS images. Cloud-only configurations of CPFP can make use of multiple compute instances to deliver high-throughput data analysis, without the need for command line tools, and are suited for analyzing very large data sets that require parallel processing for acceptable performance. Single cloud-only installations of CPFP can be administered centrally to allow a core facility to support many users. The web interface for search submissions uses a single, simple page to specify consistent search parameters for multiple search engines. There is no need for users to create parameter files for individual search engines, which requires significant knowledge, as the syntax for specifying cleavage enzymes, post-translational modifications (PTMs), etc. differs greatly between the search engines supported by various platforms. Another cloud proteomics solution, the commercial IP2 Cloud Service (Integrated Proteomics Applications, San Diego, CA), offers a comprehensive web interface with results viewers and postidentification analysis tools, which is built on a suite of open-source tools from the Scripps Research Institute and others. The functionality of these tools is similar to those in TPP and CPFP. We anticipate that existing commercial providers of data analysis tools, such as the Sorcerer pipeline (Sage-N Research, Milpitas, CA) and Scaffold software (Proteome Software, Portland, OR), will eventually develop their own cloud applications.

DESIGN AND FEATURES

A CPFP installation consists of a number of parts. A web interface provides access to the system for users. A MySQL relational database (Oracle Corp., Santa Clara, CA) holds processed input data, final results, and meta-information. The file system stores uploaded data files as well as intermediate and final output data files from data processing. Various scripts and libraries, along with third-party external programs (such as the TPP tools and search engines), provide the data processing functionality. Users interact with the web interface to submit data sets, run searches against the data sets, and launch postprocessing tasks against the results. Improvements have been made to the CPFP code to address several key limitations in previous versions of CPFP. Historically, the CPFP web interface directly created and submitted jobs to a local GridEngine cluster for the data processing tasks, and so the web interface was required to run on a system that could directly submit jobs to the GridEngine software. As a result of this prerequisite, installation of CPFP on a single machine was complicated because it required the configuration of a single-machine GridEngine cluster. In addition, because the original CPFP was designed for local installations only, all cluster nodes required access to all parts of CPFP, including the file system, MySQL database, and scripts, which prevented the use of remote clusters for processing. A database search in CPFP was previously performed as a single job, run on the local cluster, in which each peak list file was searched sequentially, followed by TPP postprocessing. This workflow, while simple, prevented efficient parallel searches of large data sets with many peak list files across multiple nodes in the cluster. Additionally, tasks such as the single-CPU database import of results, which could be effectively parallelized, were not. To address these major limitations we designed and implemented a modular architecture for job processing, where high-level jobs such as the database search of a submission are broken down into smaller units for more efficient execution. Individual portions of processing, such as the database import of results, were also rewritten to allow parallel processing. We provide an overview of Cloud CPFP's architecture in this manuscript. Further detail of the improvements can be found in the Supporting Information.

Modular Job Processing Architecture

In Cloud CPFP the direct submission of cluster jobs by the web interface component has been replaced with a modular system consisting of a job daemon and various shepherds. A high-level job, such as a database search and subsequent processing for one or more data files, is created by the web interface in the form of a database record. The CPFP job daemon periodically checks for pending jobs to run. Significant data processing within each job is split into work packages. The job may create these work packages in parallel or sequentially, depending on the nature of the work. For example, a search job against 16 data files can create 16 work packages at once to run the search engine against each file in parallel. However, work packages for postprocessing of the results are run in series because the TPP tools run against all database search results. Figure 1 illustrates this process for a search of a submission consisting of N input files. Work packages are classified as local or remote. A local work package requires direct access to the MySQL database, file system, scripts, and libraries, and it cannot be trivially packaged for execution on a machine that does not have access to these resources. A remote work package is one that can be easily packaged into a self-contained unit, consisting of all programs and data files necessary to perform the required work. These remote work packages are suitable for execution on a remote system that cannot access all of CPFP's parts. In this version of Cloud CPFP, MS/MS database searches are the only remote work packages. They are the easiest parts of the processing to run remotely because of their limited reliance on other components of CPFP. Database search is also usually the lengthiest portion of an analysis, so parallel execution of multiple searches provides a large increase in speed.
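As a rough illustration of the decomposition described above, the sketch below (Python, with hypothetical class and field names; CPFP's real implementation is in Perl) creates one self-contained search work package per peak list file and a single serial postprocessing package.

# Illustrative sketch, not CPFP code, of splitting a high-level search job into
# work packages: independent per-file searches plus one serial TPP step.
from dataclasses import dataclass

@dataclass
class WorkPackage:
    name: str
    commands: list           # shell commands to run for this unit of work
    remote_ok: bool = False  # True if self-contained (e.g., a database search)

@dataclass
class SearchJob:
    submission_id: int
    peak_list_files: list
    search_engine: str = "tandem"

    def work_packages(self):
        # One search work package per peak list file; these are independent,
        # self-contained, and may run in parallel on remote systems.
        searches = [
            WorkPackage(
                name=f"search-{self.submission_id}-{i}",
                commands=[f"{self.search_engine} {f}"],
                remote_ok=True,
            )
            for i, f in enumerate(self.peak_list_files)
        ]
        # TPP postprocessing runs over the combined results, so it is a single
        # local work package executed in series after all searches finish.
        post = WorkPackage(
            name=f"tpp-post-{self.submission_id}",
            commands=["xinteract ..."],  # placeholder for the TPP tool chain
            remote_ok=False,
        )
        return searches, post

job = SearchJob(submission_id=1, peak_list_files=["f1.mgf", "f2.mgf"])
searches, post = job.work_packages()
print(len(searches), post.name)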
Job Shepherds
Execution of work packages is overseen by the CPFP shepherds. These modules are termed shepherds as they "herd" work packages to the correct location for execution and monitor them for completion or error. Local and remote shepherds are available. A local shepherd can run any work package, local or remote. It will execute these work packages only on systems that have direct local access to the CPFP file system and database. A remote shepherd can execute only remote work packages, which do not require direct access to the CPFP file system and database. The remote shepherd oversees the transfer of data to a remote system, execution of the work package commands on that system, and retrieval of results from the remote system. Two local shepherds are currently offered in Cloud CPFP. The LocalSMP Shepherd runs all work packages on the local machine, taking advantage of multiple processors and cores if configured to do so. It is intended to be used when CPFP is installed on a single machine and eliminates the previous requirement for a GridEngine job scheduler installation.
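Before turning to the second local shepherd and the remote shepherds, the overall division of labor can be summarized in the following illustrative sketch (Python, with hypothetical names; CPFP's shepherds are Perl modules): a local shepherd runs commands in place, while a remote shepherd must ship a self-contained package out, execute it elsewhere, and retrieve the results.

# Illustrative abstraction of the shepherd concept; not CPFP's implementation.
from abc import ABC, abstractmethod
from collections import namedtuple
import subprocess

WorkPackage = namedtuple("WorkPackage", "commands remote_ok")

class Shepherd(ABC):
    @abstractmethod
    def run(self, package: WorkPackage) -> None:
        """Execute a work package and monitor it for completion or error."""

class LocalSMPShepherd(Shepherd):
    """Runs any work package, local or remote, on the local machine."""
    def run(self, package: WorkPackage) -> None:
        for cmd in package.commands:
            subprocess.run(cmd, shell=True, check=True)

class RemoteShepherd(Shepherd):
    """Accepts only self-contained (remote) work packages."""
    def run(self, package: WorkPackage) -> None:
        if not package.remote_ok:
            raise ValueError("remote shepherds accept only remote work packages")
        self.upload(package)    # e.g., SFTP to a cluster, or S3 for EC2 workers
        self.execute(package)   # e.g., qsub on GridEngine, or an SQS message
        self.download(package)  # retrieve result files for local import

    def upload(self, package): ...
    def execute(self, package): ...
    def download(self, package): ...

LocalSMPShepherd().run(WorkPackage(commands=["echo local work package"], remote_ok=False))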
The LocalSGE shepherd runs work packages on a local GridEngine cluster, replicating the functionality of previous versions of CPFP. Two remote shepherds can offload processing to cloud and remote cluster systems. The RemoteSGE shepherd transfers work packages to a remote cluster using the Secure File Transfer Protocol (SFTP) and retrieves results in the same manner. The shepherd maintains a connection to the remote cluster via the Secure Shell protocol (SSH), which is used to submit jobs to the cluster's scheduling system and monitor their progress. Currently only remote clusters using the GridEngine scheduling system are supported. However, the open-source and modular nature of CPFP allows advanced users to create their own shepherds to support other scheduling architectures. The RemoteEC2 shepherd uses EC2 compute nodes in the AWS cloud to process work packages. This shepherd transfers waiting work packages to the S3 file storage service. It then submits a message into an SQS queue indicating that the package is available to be processed. Slave EC2 instances periodically check this SQS queue for new work packages, and when one becomes available the instance retrieves it from S3, runs the commands it contains, and uploads the results back into S3. A message is placed into an SQS queue to inform the shepherd that the results can be retrieved. The shepherd periodically monitors the number of jobs waiting and starts additional EC2 instances as necessary to meet demand. Configuration settings specify the size, and therefore the cost, of instances to be used for processing, the maximum number of instances to be started, and how quickly they are launched in response to rising demand. When no work packages are waiting, an EC2 slave instance will sit idle for a user-configurable time before shutting down. Since EC2 usage is charged per hour, it is cost-efficient to delay shutdown so that additional cost is not incurred if further work packages become available within the hour. General configuration options for the remote shepherds allow the system to be configured in such a manner as to limit the use of remote systems. In order to use any freely available local resources before using the remote ones, remote execution can be restricted to occasions when a certain number of work packages are pending. Execution can also be limited to work packages of a certain complexity, assessed using an approximate metric calculated from factors such as enzyme specificity, database size, number of modifications, etc. This limit allows users to prevent short-running work packages from being sent to remote systems, where the additional time required for data transfer and scheduling would be an excessive proportion of the total execution time.

Figure 1. Illustration of how a high-level job, in this case a search against a data set of N files, is broken by the job daemon into work packages that are executed in parallel or series by shepherds. Database searches of each file are independent and can be run in parallel. TPP processing introduces the narrowest bottleneck, as the TPP tools must be run sequentially on the entirety of the search results. Local work packages must be run on a local system, while remote work packages can be executed on a remote system by the remote shepherd.

Hybrid Local/Remote Installations

Although Cloud CPFP can be run on a single local computer or a local cluster, it is expected that the most beneficial long-term configuration for a core facility using it would be the use of both local and remote processing. The web interface, MySQL database, and data files would be kept on a system local to the core, while some local data processing capability would be provided either by a single server or a small GridEngine cluster. This arrangement allows backups to be provided centrally and according to the local rules. It also allows small processing tasks to be run locally. Figure 2 illustrates this setup using a local GridEngine cluster, with the AWS cloud available for remote processing. To minimize cost, this setup would be configured to use the cloud only when the local cluster queue is full, keeping analysis times down to a manageable level in peak periods of demand without the need to expand the local cluster.

Figure 2. Overview of the architecture of a hybrid installation of Cloud CPFP using a local GridEngine cluster as well as AWS cloud services for data processing. A local master server hosts the web interface, database, job daemon, and shepherds. A local shepherd is used to execute work packages on a local GridEngine cluster. Remote work packages are sent into and retrieved from the AWS cloud via the S3 storage service and SQS message system. These work packages are then executed across multiple EC2 compute instances in the cloud.
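A minimal sketch of the S3/SQS hand-off shown in Figure 2 follows, written with the Python boto3 library rather than CPFP's Perl code. The bucket and queue names are placeholders, and error handling, instance scaling, and the separate results-notification queue are omitted.

# Minimal sketch of shipping a work package through S3 and SQS; illustrative
# only. Bucket name, queue URL, and file layout are assumptions.
import json
import subprocess
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
BUCKET = "cpfp-work-packages"  # placeholder bucket name
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cpfp-pending"  # placeholder

def submit_work_package(package_id: str, archive_path: str) -> None:
    """Shepherd side: upload the self-contained package archive and announce it."""
    key = f"pending/{package_id}.tar.gz"
    s3.upload_file(archive_path, BUCKET, key)
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({"id": package_id, "key": key}))

def worker_loop() -> None:
    """Slave EC2 instance side: poll the queue, fetch, run, and return results."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            archive = f"/tmp/{body['id']}.tar.gz"
            s3.download_file(BUCKET, body["key"], archive)
            subprocess.run(["tar", "xzf", archive, "-C", "/tmp"], check=True)
            # ... run the commands contained in the package, producing a
            # results archive at the path below ...
            results = f"/tmp/{body['id']}-results.tar.gz"
            s3.upload_file(results, BUCKET, f"results/{body['id']}.tar.gz")
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])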
Five files could not be retrieved from the public repository, leaving 193 available for use in this study. Raw files were converted into MGF format peak lists using ProteoWizard msconvert10 (version 3.0.3535) with default parameters. The resulting 26 GB of peak list files were submitted to various installations of CPFP and analyzed as detailed below.
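For readers reproducing this step, the conversion can be scripted as in the hedged sketch below. It assumes msconvert is on the PATH and that the RAW files sit in the working directory, and it relies only on msconvert's --mgf output option; the exact options used in this work were msconvert's defaults.

# Hedged sketch of the RAW-to-MGF conversion step using ProteoWizard msconvert,
# driven from Python. File names and locations are illustrative.
import glob
import subprocess

for raw in glob.glob("*.raw"):
    # --mgf selects Mascot generic format peak lists as the output type.
    subprocess.run(["msconvert", raw, "--mgf"], check=True)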
A significant advantage of this configuration is that in case of an interruption in cloud services or Internet connectivity the data analysis can continue, albeit at reduced speed using limited local resources. However, dedicated computing support is likely to be necessary to maintain the software and hardware necessary to run CPFP and maintain adequate backup arrangements, etc. Results always permanently reside on the local server or computer hosting the CPFP database. The AWS S3 service is used only to hold data temporarily while processing of work packages is in progress.
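The offloading thresholds described earlier (a minimum number of pending work packages and a minimum search complexity) can be pictured as a small policy object. The sketch below uses hypothetical names and example values; only the 17-instance cap corresponds to a figure reported in this work.

# Illustrative policy sketch for a hybrid installation: use free local capacity
# first and offload to the cloud only when the local queue is saturated and the
# work is complex enough to justify transfer and scheduling overheads.
from dataclasses import dataclass

@dataclass
class OffloadPolicy:
    min_pending_packages: int = 8    # hypothetical: only offload when this many queue up
    min_complexity: float = 2.0      # hypothetical approximate search-complexity metric
    max_cloud_instances: int = 17    # cap matching the configuration used in this study
    idle_shutdown_minutes: int = 50  # keep workers until the paid hour is nearly spent

    def should_offload(self, pending: int, complexity: float) -> bool:
        return (pending >= self.min_pending_packages
                and complexity >= self.min_complexity)

policy = OffloadPolicy()
print(policy.should_offload(pending=12, complexity=3.5))  # True
print(policy.should_offload(pending=3, complexity=3.5))   # False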
Local vs Cloud Computing
Our local installation used two R610 servers (Dell, Austin TX), each with dual 6-core E5645 CPUs and 48 GB of RAM. One server provided the web interface, while the other hosted the MySQL database. Both servers processed CPFP work packages using the local GridEngine shepherd only. All data storage was provided by a 27 TB RAID6 volume hosted on an E18 SATA array (Nexsan Corp., Thousand Oaks, CA) with dedicated 1 Gbps iSCSI connections to each server. The AWS cloud configuration of CPFP was run using an EC2 c1.xlarge instance with 500 GB EBS storage to provide the web interface and coordinate job execution. The RemoteEC2 shepherd was configured to launch up to 17 additional c1.xlarge instances for work package processing. The MySQL database was hosted on an RDS m1.xlarge instance with 80 GB of storage. This configuration approached the peak performance that can be obtained from an AWS configuration without using expensive high memory or cluster-compute instances and within the default EC2 20 instance limit of a new account. Each set of data files from a single cell line was uploaded to CPFP as a single submission. Searches were performed for each submission using X!Tandem11 and OMSSA,12 and the results were combined using the iProphet algorithm.13 The UniProtKB Human complete proteome sequence database14 (release 2012_01) in a concatenated target plus reversed-decoy format was used for all searches.15 Carbamidomethylation of cysteine and oxidation of methionine were specified as fixed and variable modifications, respectively. Trypsin was specified as the cleavage enzyme and a single missed-cleavage was permitted. Precursor and fragment mass tolerances were 20 ppm and 0.5 Da. All searches were submitted concurrently. A mean of 248 250 PSMs and 5646 protein groups were identified per cell line, at a 1% false discovery rate, assessed using the target-decoy method. Across the 11 cell lines a total of 2 730 748 spectrum matches were assigned, indicating the scale of the data processing challenge. This number of PSMs exceeds the 2 023 960 reported in another recent reanalysis study of the Geiger et al. data set that used the Mascot search engine only.16 The number of protein groups identified was lower than the comparative number of protein identifications reported in the original study using MaxQuant,17 without match between runs functionality. ProteinProphet is used within CPFP for protein inference and implements a more complex grouping strategy than MaxQuant. This approach has been demonstrated to perform extremely well on smaller data sets but is outperformed by simpler methods on large data sets at low false discovery rates.18 In addition, we note in the Supporting Information for Geiger et al. that 552 protein identifications contain indistinguishable protein accession numbers from both target and decoy databases. In CPFP these would be considered decoy identifications and excluded from final counts. Figure 3 shows the time and cost of analysis for the complete data set on both configurations of CPFP. The reported times were from submission of the first search until all results were
Cloud Only Installations
An alternative to a local or hybrid installation of CPFP is an AWS cloud-only installation. We provide Amazon Machine Images (AMIs) for master EC2 compute nodes that run the web interface, host the database, and coordinate the data processing. These images can be launched easily using the AWS control panel, in a similar manner to the ViPDAC platform, allowing users to create a personal or lab-wide analysis pipeline without installing any software. The master node must be running to allow submission of data and access to results. All final results are stored on the Elastic Block Store (EBS) volume associated with the master instance. However, in periods of inactivity it can be suspended to avoid EC2 costs. Storage charges are still incurred when the master is suspended. When data processing is required, the master instance will launch additional EC2 slave nodes as described above. If required, the AWS Relational Database Service (RDS) may be used to provide a dedicated database service, removing some load from the master instance and improving performance for heavily used installations at additional cost. A cloud-based CPFP instance can be left running permanently for general access, or can be created, temporarily stopped, and permanently terminated as required. Snapshots can be taken easily within AWS to archive data or provide periodic backups of an instance that is left running continuously.
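For users who prefer scripting to the AWS control panel, a master instance could also be launched programmatically, as in the hedged boto3 sketch below. The AMI ID, key pair, and security group are placeholders; the actual Cloud CPFP image IDs are posted on the project Web site.

# Hedged sketch of starting a cloud-only CPFP master node with boto3; the AWS
# console route described above is the documented path. Placeholders throughout.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

resp = ec2.run_instances(
    ImageId="ami-xxxxxxxx",           # placeholder: published Cloud CPFP master AMI
    InstanceType="c1.xlarge",         # one of the master sizes recommended in the text
    KeyName="my-keypair",             # placeholder SSH key pair
    SecurityGroupIds=["sg-xxxxxxxx"], # placeholder group allowing HTTPS access
    MinCount=1,
    MaxCount=1,
)
print(resp["Instances"][0]["InstanceId"])

# Suspending the master between analyses avoids EC2 charges (EBS storage
# charges still apply), as noted above:
# ec2.stop_instances(InstanceIds=[resp["Instances"][0]["InstanceId"]])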
EXAMPLE APPLICATIONS

In order to demonstrate the utility of Cloud CPFP, we have analyzed a large data set using a local-only installation of the pipeline, a cloud-only installation, and a hybrid local/remote installation using a supercomputing cluster. We have limited our test analyses to very large data sets, as these are likely to gain the largest benefit from cloud and cluster computing. We believe that this does not bias our comparison: because powerful and cheap multicore desktop and laptop computers now sit on the desks of most scientists, those with infrequent, modest data processing needs will not need to use more complex systems such as CPFP. Our software targets the requirements of proteomics core facilities and large proteomics laboratories, where much greater volumes of mass spectrometry data from multiple instruments are routinely generated. These large data sets are usually generated during experiments involving extensive fractionation and comparison of multiple conditions, with replicate samples. Halligan et al. studied the time required and cost incurred to perform searches against a single data file with unconstrained enzyme specificity, demonstrating the advantages of using cloud computing in this context.8 We instead consider more conventional tryptic searches against a large, publicly available data set that contains 198 LTQ-Orbitrap Velos (Thermo Scientific, Bremen) 240 min LC−MS/MS runs in 11 groups.9 The samples in each group were derived from a different cell line.
total of 65 MB. In a hybrid local/cloud configuration, data transfer out of AWS will cost significantly more as a local server would retrieve all files produced by the database search engines (which collectively are much larger than the final Excel result file). We achieved an upload speed of 26.8 MB/s between our institution’s local network and EC2, resulting in an upload time of 17 min for the 26 GB of peak list files from the test data set. At this transfer rate the time taken to upload data is small compared to the duration of the analysis. Since Internet bandwidth can vary widely between sites, this may not always be the case. Cloud CPFP will be of interest to users with excellent Internet connectivity, as is the general case for most cloud-based applications. We recognize that for many the cost of hosting local computing equipment will be higher than our figures. We benefit from negligible colocation charges, a large amount of free backup storage, and capable in-house staff to maintain the equipment. Where these benefits are not available, the cost of the analyses will move further in favor of the cloud-only installation. Conversely, in some institutions departmental or institutional computing facilities and support are available without direct cost. The cost benefit or disadvantage of cloud computing will depend heavily on the computing environment available at an individual institution. We highly recommend that all users carefully test and consider all available options. For more complex searches, involving multiple PTMs and/or semispecific enzymatic cleavage rules, the use of remote compute facilities should achieve a greater reduction in processing times. A greater proportion of the total time would be spent running the more complex database searches, which parallelize well and can be run remotely. In such cases relatively less time would be required for data transfer, disk intensive single-threaded processing by TPP tools, and parallel database imports, which ran faster locally because of our higher performance individual CPU cores and better storage throughput. Within the cloud, these limitations could be addressed by employing more costly but more powerful compute instances and faster storage configurations.
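The upload-time figure quoted above follows directly from the measured throughput; a quick check of the arithmetic (assuming 1 GB = 1024 MB) is shown below.

# Quick check of the transfer-time figure: 26 GB of peak lists at the measured
# 26.8 MB/s upload rate to EC2.
size_mb = 26 * 1024          # 26 GB expressed in MB
rate_mb_per_s = 26.8         # measured upload throughput
seconds = size_mb / rate_mb_per_s
print(f"{seconds / 60:.1f} minutes")  # ~16.6 min, consistent with the ~17 min reported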
Figure 3. Comparison of run-time and monetary cost of the analysis of the Geiger cell lines data set using a local installation of CPFP and an AWS cloud-only installation of CPFP using up to 17 EC2 processing instances. Cost for local and cloud-only analyses is similar, while the cloud-based processing is completed in 21% of the time required for local processing.
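The local-installation cost in Figure 3 was obtained by amortizing server purchase price over an assumed service life and utilization, as described in the accompanying text. The sketch below reproduces that arithmetic with placeholder inputs; only the 3-year service life and 10% utilization figures come from this work, and the example values are not the actual server prices or run times.

# Sketch of the amortized-cost arithmetic used for the local installation.
# Purchase price and analysis hours below are hypothetical placeholders.
def local_analysis_cost(purchase_price_usd: float,
                        analysis_hours: float,
                        service_life_years: float = 3.0,
                        utilization: float = 0.10) -> float:
    usable_hours = service_life_years * 365 * 24 * utilization
    cost_per_hour = purchase_price_usd / usable_hours
    return cost_per_hour * analysis_hours

# Hypothetical example only: hardware totalling $12,000 and a 70 h analysis.
print(round(local_analysis_cost(12_000, 70), 2))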
available. The cost of cloud analysis was calculated using AWS usage reports with on-demand pricing, excluding free-tier allowances. The cost for the local CPFP installation was calculated by estimating the effective usage cost per-hour for the two servers multiplied by the number of hours it took to finish the job. The purchase price of the servers was proportioned per hour of usage, assuming a service life of 3 years and average 6 min per hour (10%) usage for each server during these 3 years. Cluster monitoring showed that in our core the actual utilization of servers was closer to 5%, indicating that the capacity of the local installation was overspecified to cope with peak demand rather than average requirements. The results show that the AWS cloud configuration of CPFP completed the data analysis in 21% of the time required by the local installation, or 0.6% of the acquisition time of the data set. This demonstrates that Cloud CPFP allows the analysis of extremely large data sets within a single day, without the difficulties associated with maintaining a large local compute infrastructure. Costs were comparable between configurations, with the cloud analysis costing slightly less ($0.64) than the proportioned cost of the local analysis ($80.15). A full breakdown of the AWS charges for the cloud-based analysis can be found in Supporting Information Table S1. The majority of the cost is attributed to EC2 and RDS instancehours and storage ($78.87). Data transfer charges were negligible (approximately $0.01), since transfer into AWS is free. The only chargeable outgoing data was generated by accessing the web interface and downloading Excel result files, a
AWS Instance Selection
A variety of EC2 instances are available with different performances. The speed of database searches will depend on the worker instance type chosen and the number of instances used. In a cloud-only CPFP installation the master node is also an EC2 instance, and its size will affect the speed of TPP processing, database import, and the web interface. We recommend using m1.large or c1.xlarge master instances
Table 1. Duration, Cost, and Efficiency (Spectra per Second per Dollar) for Cloud CPFP Database Search of the Geiger et al. GAMG Cell Line Data, Using 2 EC2 Worker Instances of Various Types

instance type | USD per hour | cores × compute units | wall clock hours | nominal instance hours | charged instance hours | spectra/s | nominal cost, USD | nominal spectra/s per USD | actual cost, USD | actual spectra/s per USD
m1.small   | 0.080 | 1 × 1    | 6.55 | 13.11 | 14 | 26.51  | 1.05 | 25.29 | 1.12 | 23.67
m1.medium  | 0.160 | 1 × 2    | 4.07 | 8.15  | 9  | 42.65  | 1.30 | 32.72 | 1.44 | 29.61
m1.large   | 0.320 | 2 × 2    | 3.63 | 7.27  | 8  | 47.81  | 2.33 | 20.56 | 2.56 | 18.68
m1.xlarge  | 0.640 | 4 × 2    | 1.46 | 2.92  | 4  | 118.87 | 1.87 | 63.54 | 2.56 | 46.43
c1.medium  | 0.165 | 2 × 2.5  | 3.20 | 6.41  | 8  | 54.24  | 1.06 | 51.33 | 1.32 | 41.09
c1.xlarge  | 0.660 | 8 × 2.5  | 1.82 | 3.65  | 4  | 95.23  | 2.41 | 39.55 | 2.64 | 36.07
m2.xlarge  | 0.450 | 2 × 3.25 | 2.00 | 3.99  | 4  | 87.07  | 1.80 | 48.49 | 1.80 | 48.37
m2.2xlarge | 0.900 | 4 × 3.25 | 1.53 | 3.07  | 4  | 113.31 | 2.76 | 41.06 | 3.60 | 31.47
m2.4xlarge | 1.800 | 8 × 3.25 | 1.22 | 2.44  | 4  | 142.25 | 4.40 | 32.36 | 7.20 | 19.76
estimate that the time required would exceed the 72 h window. X!Tandem, OMSSA and combination analysis of a single randomly chosen data file with the above parameters required 2 h 55 min. From this figure, we estimated >23 days would be required for the entire data set of 193 files on the local installation. Cost concerns prevented us from running this test using the AWS cloud, although Amazon offers EC2 cluster compute nodes of comparable power to nodes in the Lonestar system. Despite lengthy job queues due to an approaching maintenance window, the analysis completed in 81 h versus the >552 h estimated for the local installation only. The first HPC job started on the Lonestar cluster 5 h 17 min after initial submission, and the final job completed at 78 h 49 min. The first of the 11 cell-line data sets was completely analyzed 61 h 44 min after submission, with postprocessing for the final data set completed after 80 h 48 min. A maximum of 25 search work packages ran concurrently during this period. If run during a period of lower load, with the cluster maximum of 50 concurrent jobs, we anticipate that the analysis could be completed within 40 h. A total of 7584 CPU core-hours were used by jobs running on the TACC cluster, which features processors with 30% higher individual performance than our local installation (Passmark Software, http://www.passmark. com/). This indicates that approximately 10 834 core-hours, corresponding to 451 wall-clock hours, would have been required for the database-searches on our local system. Considering the additional time necessary for job preparation, postprocessing, and database import of results, our initial estimate of >552 h for local analysis was reasonable.
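The local-equivalent estimate above can be reconstructed as follows; the 0.70 speed factor is the assumption that reproduces the quoted core-hour and wall-clock figures, given the reported ~30% per-core performance difference and the 24 local cores.

# Reproducing the scaling estimate: 7584 core-hours were consumed on TACC
# Lonestar, whose processors have ~30% higher per-core performance than the
# local servers (2 servers x dual 6-core CPUs = 24 cores).
tacc_core_hours = 7584
local_relative_speed = 0.70   # assumed local core speed relative to a Lonestar core
local_cores = 24

local_core_hours = tacc_core_hours / local_relative_speed
local_wall_clock_hours = local_core_hours / local_cores
print(round(local_core_hours))        # ~10834 core-hours
print(round(local_wall_clock_hours))  # ~451 wall-clock hours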
($0.32 or $0.66 per hour). An m1.large instance provides 2 cores, with 2 EC2 compute units of power each, while a c1.xlarge instance provides 8 cores of 2.5 compute units each. TPP tools use only one core, so there is little difference in performance unless more than 2 processes run in parallel. An m1.medium master instance can be used, but processing and the web interface will be slow. To investigate the relationship between price and performance for database search with Cloud CPFP, we used the GAMG cell-line subset of the Geiger et al. data, consisting of 18 MS runs with 625 351 MS/MS spectra. We searched the data using 9 cloud-only CPFP installations. All used an m1.large master and 2 worker instances, but we varied the worker instance type. The results of this comparison are shown in Table 1. The relationship between performance and actual cost is complex because EC2 usage is charged per whole instance-hour. An instance used for 61 min incurs two instance-hours of charges. Table 1 presents a nominal cost using partial instance-hours, as well as the actual cost from whole instance-hours. When Cloud CPFP is heavily shared, the real cost of searches will lie between these figures, as worker instances remaining active after one search can be reused for the next job. We find that the m1.xlarge instance is most efficient, by nominal spectra processed per second per dollar. We expected speed to be proportional to instance compute units, since database search is a CPU-limited process, but this was not the case. Instances that provide more than four cores did not scale well, indicating that the database-search software does not parallelize perfectly. Although the m1.xlarge is most efficient here, smaller instances may be useful for smaller searches that would not use a complete m1.xlarge instance-hour. Spectra per second figures can guide instance selection but will vary widely depending on search parameters.
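The efficiency figures in Table 1 follow from the per-hour billing model discussed above. The sketch below reproduces the m1.xlarge row under a simplified whole-hour charging model (each started hour billed per worker); small differences from the tabulated values are due to rounding, and the charging model does not hold exactly for every row.

# How the Table 1 efficiency figures follow from per-hour billing, using the
# m1.xlarge row (2 worker instances) as an example.
import math

spectra = 625_351         # MS/MS spectra in the GAMG subset
wall_clock_hours = 1.46   # m1.xlarge row of Table 1
price_per_hour = 0.640
workers = 2

nominal_hours = wall_clock_hours * workers             # 2.92
charged_hours = math.ceil(wall_clock_hours) * workers   # 4 (simplified whole-hour model)
spectra_per_s = spectra / (wall_clock_hours * 3600)     # ~119

print(round(spectra_per_s / (nominal_hours * price_per_hour), 2))  # ~63.7 nominal
print(round(spectra_per_s / (charged_hours * price_per_hour), 2))  # ~46.5 actual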
Limitations and Future Work
A number of limitations remain in Cloud CPFP. Parallelization of the TPP tools is limited, and many functions within CPFP require direct database access, which makes it difficult to move further processing to remote systems efficiently. These issues will be addressed through cloud-centric redesign of these portions of CPFP, and extensive rewrite or replacement of the TPP tools. More fundamental problems exist for many users regarding cloud computing in general. Employers may prohibit the use of cloud services because of information security issues, contract terms, or other concerns. Our Cloud CPFP images for the AWS platform do not incorporate any specific encryption or data security measures, other than providing encrypted HTTPS access to the web interface. Within AWS efforts are made to wipe storage areas before reuse by future customers, but users may wish to consider customizing the Cloud CPFP image to use encrypted storage if necessary. Various reviews discussing security in cloud-computing are available.21 Outdated local network infrastructure or poor Internet connectivity may result in low data transfer rates that can diminish the benefits of sending work to faster remote systems. Use of prebuilt cloud images to setup an installation of CPFP removes the need for knowledge of software installation and configuration, but adds the requirement of some knowledge of cloud architecture and management. We attempt to address this with documentation on the project Web site. Finally, we acknowledge that Cloud CPFP is currently limited to use with a single cluster environment (GridEngine) and cloud architecture (AWS). Because of the modular nature of CPFP, it is possible to add support for other platforms by copying and modifying the existing remote shepherd modules. We encourage advanced users and developers to contact us if they add functionality and
Using a Remote Supercomputing Cluster in a Hybrid Installation
To demonstrate the benefits of the remote cluster computing functionality of Cloud CPFP, we configured our local installation (described above) to submit remote work packages to the Texas Advanced Computing Center (TACC) Lonestar HPC system. Lonestar is a 1888 node, 22 656 core cluster with 302 TFLOPS peak performance. Since the system is heavily used by a large number of researchers, the job queue is frequently long and wait times for execution can be many hours. However, for extremely complex jobs the large capacity of the system allows such fast data processing that it easily justifies the long queue. The test data set was searched as described previously, with the exceptions that semitryptic cleavage specificity was chosen, and phosphorylation of Ser, Thr and Tyr were specified as variable modifications. These changes dramatically increase the complexity of the searches, but are relatively common search parameters. Searches for phosphorylation are frequent, and it has been demonstrated that searching using semitryptic enzyme specificity can significantly increase the number of identifications obtained.19,20 We observed an increase in PSMs and protein groups by an average of 4.5 and 5.0% per cell line. This search was not performed on a local installation of CPFP as we did not have access to a dedicated local installation that could be used for test purposes with no additional workload. We commonly run especially complex searches on our local installation over the weekend, but in this case we 6288
dx.doi.org/10.1021/pr300694b | J. Proteome Res. 2012, 11, 6282−6290
will provide support to integrate additions into the main release of CPFP.
REFERENCES
(1) Trudgian, D. C.; Thomas, B.; McGowan, S. J.; Kessler, B. M.; Salek, M.; Acuto, O. CPFP: a central proteomics facilities pipeline. Bioinformatics 2010, 26 (8), 1131−2. (2) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. A uniform proteomics MS/MS analysis platform utilizing open XML file formats. Mol. Syst. Biol. 2005, 1, 2005 0017. (3) Trudgian, D. C.; Ridlova, G.; Fischer, R.; Mackeen, M. M.; Ternette, N.; Acuto, O.; Kessler, B. M.; Thomas, B. Comparative evaluation of label-free SINQ normalized spectral index quantitation in the central proteomics facilities pipeline. Proteomics 2011, 11 (14), 2790−7. (4) Trudgian, D. C.; Singleton, R.; Cockman, M. E.; Ratcliffe, P. J.; Kessler, B. M. ModLS: Post-translational modification localisation scoring with automatic specificity expansion. BMC Research Notes, submitted for publication. (5) Fusaro, V. A.; Patil, P.; Gafni, E.; Wall, D. P.; Tonellato, P. J. Biomedical cloud computing with Amazon Web Services. PLoS Comput. Biol. 2011, 7 (8), e1002147. (6) Langmead, B.; Schatz, M. C.; Lin, J.; Pop, M.; Salzberg, S. L. Searching for SNPs with cloud computing. Genome Biol. 2009, 10 (11), R134. (7) Kudtarkar, P.; Deluca, T. F.; Fusaro, V. A.; Tonellato, P. J.; Wall, D. P. Cost-effective cloud computing: a case study using the comparative genomics tool, roundup. Evol. Bioinf. Online 2010, 6, 197−203. (8) Halligan, B. D.; Geiger, J. F.; Vallejos, A. K.; Greene, A. S.; Twigger, S. N. Low cost scalable proteomics data analysis using Amazon’s Cloud Computing services and open source search algorithms. J. Proteome Res. 2009, 8 (6), 3148−53. (9) Geiger, T.; Wehner, A.; Schaab, C.; Cox, J.; Mann, M. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell. Proteomics 2012, 11 (3), M111 014050. (10) Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24 (21), 2534−6. (11) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466−7. (12) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958−64. (13) Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I. iProphet: multi-level integrative analysis of shotgun proteomic data improves peptide and protein identification rates and error estimates. Mol. Cell. Proteomics 2011, 10 (12), M111 007690. (14) Magrane, M. Consortium, U., UniProt Knowledgebase: a hub of integrated protein data. Database 2011, 2011, bar009. (15) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207−14. (16) Hahne, H.; Gholami, A. M.; Kuster, B. Discovery of O-GlcNAcmodified proteins in published large-scale proteome data. Mol. Cell. Proteomics 2012, 843−50.
AVAILABILITY

Our cloud and cluster improvements, and other additional features such as simplified results viewers, are released under the open-source CDDL license. The project Web site and a demonstration server with guest access are available via http://cpfp.sourceforge.net. The demonstration server is hosted in the AWS cloud and provides access to example data sets. We maintain AMI images for hybrid and cloud-only installations of CPFP and post details on the project Web site.

ASSOCIATED CONTENT
Supporting Information
Supplementary tables and methods. This material is available free of charge via the Internet at http://pubs.acs.org.
ACKNOWLEDGMENTS
We acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources. We thank Drs. X. Guo for software testing, B. Thomas and B. M. Kessler for advice and discussion, and P. Charles for software fixes. Support was provided by Cancer Prevention and Research Institute of Texas Grants RP120613 and R1121 to HM and an AWS in Education research grant award to DCT.
CONCLUSION

We have demonstrated that Cloud CPFP is a powerful data analysis pipeline, capable of processing very large data sets entirely in the cloud, without up-front infrastructure costs. We also believe that for the many academic proteomics laboratories and facilities that have access to a remote compute cluster of some description, the remote cluster functionality of Cloud CPFP offers the ability to perform especially complex searches against very large data sets without affecting routine day-to-day work. The pipeline offers significant additional functionality versus existing platforms such as ViPDAC and the TPP on which it is based. The possibility of performing large searches in reasonable time, and without impacting day-to-day analysis of customer samples, provides opportunities for data-mining exercises on archived public and private data. Another advantage is the ability to build spectral libraries from a huge range of data sets in a reasonable amount of time, which allows the development of SRM assays without the need for initial shotgun experiments. CPFP currently contains a tool to generate candidate transitions using the identified spectra accumulated within its database. The study of post-translational modifications that are not commonly considered may also benefit from these analyses. The vast number of mass spectra in resources such as ProteomeCommons Tranche22 and the PRIDE repository23 may harbor significant insight into additional roles of less-studied modifications, such as AMPylation,24 in cell signaling processes. HPC and cloud resources make interrogation of these data possible. Cloud and cluster computing are now commonplace in other biomedical and bioinformatics fields, with a variety of cloud computing software offered by academic laboratories, instrument vendors, and third parties. We believe that proteomics will follow suit and intend to continue to develop CPFP with this in mind.
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected]. Tel: 214-648-7001. Fax: 214-645-6298.

Notes
The authors declare no competing financial interest.
(17) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized ppb-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26 (12), 1367−72. (18) Claassen, M.; Reiter, L.; Hengartner, M. O.; Buhmann, J. M.; Aebersold, R. Generic comparison of protein inference engines. Mol. Cell. Proteomics 2012, 11 (4), O110 007088. (19) Wang, H.; Tang, H. Y.; Tan, G. C.; Speicher, D. W. Data analysis strategy for maximizing high-confidence protein identifications in complex proteomes such as human tumor secretomes and human serum. J. Proteome Res. 2011, 10 (11), 4993−5005. (20) Ning, K.; Fermin, D.; Nesvizhskii, A. I. Computational analysis of unassigned high-quality MS/MS spectra in proteomic data sets. Proteomics 2010, 10 (14), 2712−8. (21) Zissis, D.; Lekkas, D. Addressing cloud computing security issues. Future Gener. Comput. Syst. 2012, 28 (3), 583−92. (22) Hill, J. A.; Smith, B. E.; Papoulias, P. G.; Andrews, P. C. ProteomeCommons.org collaborative annotation and project management resource integrated with the Tranche repository. J. Proteome Res. 2010, 9 (6), 2809−11. (23) Jones, P.; Cote, R. G.; Cho, S. Y.; Klie, S.; Martens, L.; Quinn, A. F.; Thorneycroft, D.; Hermjakob, H. PRIDE: new developments and new datasets. Nucleic Acids Res. 2008, 36 (Database issue), D878− 83. (24) Yarbrough, M. L.; Orth, K. AMPylation is a new posttranslational modiFICation. Nat. Chem. Biol. 2009, 5 (6), 378−9.