
Mar 20, 2015 - CAPER 3.0 uses cloud computing technology to facilitate MS/MS-based peptide identification. In particular, it can use both public and p...
Article pubs.acs.org/jpr

CAPER 3.0: A Scalable Cloud-Based System for Data-Intensive Analysis of Chromosome-Centric Human Proteome Project Data Sets

Shuai Yang,†,‡ Xinlei Zhang,§ Lihong Diao,†,‡ Feifei Guo,†,‡,∥ Dan Wang,†,‡ Zhongyang Liu,†,‡ Honglei Li,§ Junjie Zheng,†,‡ Jingshan Pan,⊥ Edouard C. Nice,# Dong Li,*,†,‡ and Fuchu He*,†,‡

† State Key Laboratory of Proteomics, Beijing Proteome Research Center, Beijing Institute of Radiation Medicine, Beijing 100850, China
‡ National Center for Protein Sciences Beijing, Beijing 102206, China
§ Beijing Genestone Technology Ltd., Beijing 100085, China
∥ Institute of Basic Medical Sciences Chinese Academy of Medical Sciences, School of Basic Medicine, Peking Union Medical College, Beijing 100005, China
⊥ Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong 250101, China
# Department of Biochemistry and Molecular Biology, Monash University, Clayton, Victoria 3800, Australia

ABSTRACT: The Chromosome-centric Human Proteome Project (C-HPP) aims to catalog genome-encoded proteins using a chromosome-by-chromosome strategy. As the C-HPP proceeds, the increasing requirement for data-intensive analysis of the MS/MS data poses a challenge to the proteomic community, especially small laboratories lacking computational infrastructure. To address this challenge, we have updated the previous CAPER browser into a higher version, CAPER 3.0, which is a scalable cloud-based system for data-intensive analysis of C-HPP data sets. CAPER 3.0 uses cloud computing technology to facilitate MS/MS-based peptide identification. In particular, it can use both public and private cloud, facilitating the analysis of C-HPP data sets. CAPER 3.0 provides a graphical user interface (GUI) to help users transfer data, configure jobs, track progress, and visualize the results comprehensively. These features enable users without programming expertise to easily conduct data-intensive analysis using CAPER 3.0. Here, we illustrate the usage of CAPER 3.0 with four specific mass spectral data-intensive problems: detecting novel peptides, identifying single amino acid variants (SAVs) derived from known missense mutations, identifying sample-specific SAVs, and identifying exon-skipping events. CAPER 3.0 is available at http://prodigy.bprc.ac.cn/caper3.
KEYWORDS: Proteomic data analysis platform, proteomic data visualization, bioinformatics, cloud computing, big data, Chromosome-centric Human Proteome Project



Special Issue: The Chromosome-Centric Human Proteome Project 2015
Received: December 31, 2014
DOI: 10.1021/pr501335w
J. Proteome Res. XXXX, XXX, XXX−XXX
© XXXX American Chemical Society

INTRODUCTION

As an important component of the Human Proteome Project (HPP) established by the Human Proteome Organization (HUPO), the Chromosome-centric Human Proteome Project (C-HPP) was officially launched in Geneva in 2011.1 The C-HPP aims to catalog genome-encoded proteins using a chromosome-by-chromosome strategy, including alternative spliced variants, single amino acid variants (SAVs), and major post-translational modifications (PTMs).2−5 In particular, identifying novel peptides/proteins that are currently not represented in annotated protein databases is essential to drawing a comprehensive map of the human proteome.6 To achieve these objectives, the C-HPP consortium has three major pillars: mass spectrometry (MS), antibody/affinity capture reagents (Ab), and bioinformatics-driven knowledge base (KB).

An MS-based proteogenomics analysis strategy has enabled the detection of novel peptides, SAVs, and splice-junction peptides, but it requires extensive computing resources.7 As the C-HPP proceeds, huge volumes of MS/MS data will be submitted to existing proteomics repositories via ProteomeXchange.8 The computational infrastructure required for mining these data sets is beyond the reach of small laboratories and is becoming an increasing challenge for large organizations.9,10 For example, proteogenomics analysis of a typical bacterial genome required a total of about 100 CPU hours,7 and it would take considerably longer for Homo sapiens (genome size of 3.2 Gb). To facilitate analysis of C-HPP data sets, a fast data processing system is crucial.

A potential solution to large-scale data analysis is cloud computing.11−13 Cloud computing delivers IT resources over the Internet14 and enables researchers to establish their own computing cluster to accelerate peptide identification. Mohammed et al. reached a more than 30-fold improvement in speed when running X!Tandem on the cloud using 8 instances, each composed of 8 virtual CPUs.9 Clouds can be classified as public or private.15 Public cloud is available in a pay-as-you-go manner to the general public, and Amazon Web Services (AWS) is one of the most commonly used public clouds in bioinformatics. When operational costs are considered, the charge is price-competitive with using in-house facilities.16 Private cloud is expected to maximize the utilization of existing in-house resources while addressing concerns of data security and data transfer.17 For example, Eucalyptus (https://www.eucalyptus.com/) is an open source private cloud software compatible with AWS.

The C-HPP community has developed several bioinformatics analysis tools, including The Proteome Browser (TPB) developed by Australia,18 GenomewidePDB, by Korea,19 and Gene-centric Knowledgebase for Chr 18, by Russia.20 These tools use a gene-centric heatmap to integrate and visualize proteomic data sets and related annotations. Previously, our team developed a chromosome-assembled human proteome browser (CAPER).21,22 CAPER introduced many features, including a configurable workflow system based on a Galaxy framework and a powerful toolbox for finding missing proteins, mapping identified peptides to human chromosomes, bridging the C-HPP and ENCODE data sets, and allowing protein functional annotation. Although CAPER has proven to be a powerful toolkit for C-HPP data analysis, it does not support processing of MS/MS data because of a lack of local computing resources.

To address this limitation, we have updated the previous CAPER into a higher version, CAPER 3.0, a scalable cloud-based system for data-intensive analysis of C-HPP data sets. In addition to the track viewer used in previous releases of CAPER, CAPER 3.0 uses cloud computing technology to facilitate MS/MS-based peptide identification. In particular, it can use both public and private clouds to help analyze C-HPP MS data sets. Currently, CAPER 3.0 has integrated four analysis pipelines specific for C-HPP: detecting novel peptides, identifying SAVs derived from known missense mutations, identifying sample-specific SAVs, and identifying exon-skipping events. CAPER 3.0 also provides a graphical user interface (GUI), making it easier to transfer data, configure jobs, track progress, and visualize the results comprehensively. These features enable users without programming expertise to easily conduct data-intensive analysis using CAPER 3.0.

MATERIALS AND METHODS

Design and Implementation of CAPER 3.0

CAPER 3.0 uses MR-Tandem as the parallel database search engine for peptide identification. MR-Tandem is a slightly modified version of X!Tandem and can run on a Hadoop cluster.23 Apache Hadoop (http://hadoop.apache.org/) is an open-source software framework that enables the distributed processing of large data sets and is well-suited for batch processing using commodity hardware. Hadoop 1.0.3 and Oracle Java JDK 1.7 were used for building a Hadoop cluster. JetS3t (http://www.jets3t.org/), AWS SDK for Java (http://aws.amazon.com/sdkforjava/), and boto (https://github.com/boto/boto) were used as interfaces to Amazon Simple Storage Service (S3) and Amazon Elastic Compute Cloud (EC2). Jsch (http://www.jcraft.com/jsch/) was used to connect to an sshd server, enabling execution of commands on remote machines. A Perl script downloaded from the Mascot Web site (http://www.matrixscience.com/downloads/decoy.pl.gz) was used to generate a decoy database. FDR control was implemented by MzidLibrary.24 A custom instance store-backed Linux Amazon Machine Image (AMI) was created by following instructions in the AWS help document (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-instance-store.html). The CAPER 3.0 AMI used an Ubuntu 12.04.05 LTS server (64-bit) as the base image. The CAPER 3.0 GUI was based on JavaFX 2 and ControlsFX (http://fxexperience.com/controlsfx/). Other scripts were written in Linux bash and Python.

Data Set Collection

Data used for generating the CAPER reference data sets are described below: (1) The primary assembly of the human genome was downloaded from Ensembl FTP (ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna_rm.primary_assembly.fa.gz); (2) protein/coding DNA sequence (CDS), annotation, and missense mutation data were downloaded from Ensembl Biomart 75.25 Protein annotation contains attributes “Ensembl Protein Id”, “Genomic coding start”, “Genomic coding end”, “phase”, and “CDS length”. CDS annotation contains attributes “Ensembl Transcript ID”, “Genomic coding start”, and “Genomic coding end”. Exon annotation contains attributes “Ensembl Exon ID”, “Exon chromosome start”, and “Exon chromosome End”. Missense mutation contains attributes “Ensembl Protein ID”, “Reference ID”, “Protein location”, “Protein Allele”, and filters “Consequence Type missense_variant”.

Database Construction

The “Novel peptides detection” pipeline uses six-frame translation of the entire human genome as the query database. This method has been reported previously in many articles for finding a novel gene model or improving gene annotation.26 EMBOSS Transeq was used to translate DNA sequence in six frames.27 After six-frame translation, we collected the amino acid sequence between stop codons (represented by *) except where the sequence contained ambiguous amino acids (represented by X) or its total length was less than 7. At the same time, we calculated each sequence’s genomic coordinate and orientation. Finally, amino acid sequences for each chromosome with the calculated information were recorded in a FASTA file.

The “Known SAVs identification” pipeline searches against a variant peptide database built from known missense mutations. The known missense mutations were downloaded from Ensembl BioMart. Building steps have been described previously.5,28 Each missense mutation was mapped to a protein sequence, and the variant peptide was generated by placing the mutation at the center of the sequence with a maximum of 40 amino acids flanking either side, so the maximum length of a variant peptide was 81 amino acids. We also calculated each variant peptide’s genomic location using annotation data from Ensembl BioMart. Final peptides contained one missense mutation per sequence and were written into a FASTA file.


Figure 1. Schema of CAPER 3.0. (A) LWP (local work package) is the client of CAPER 3.0, which deals with local operations like transferring data, configuring a job, tracking progress, and visualizing the results. RWP (remote work package) handles tasks that run in the cloud, including database search and peptide-level quality control. RWP is deployed to Amazon’s cloud platform. (B) The CAPER reference data sets archive currently includes the following: (1) ORF peptide sequences (length > 7) from a six-frame translation of the entire human genome; (2) single amino acid peptide sequences (flanking length = 40) from known missense variants, and (3) exon−exon junction peptide sequences from in-phase exon−exon combinations. (C) General steps in running a job by the CAPER 3.0 client are as follows: (1) the user uploads MS/MS data to the Amazon Simple Storage Service (S3), (2) the user configures job and computing parameters and submits the job, (3) the user monitors the job status and controls instances through terminate and reboot actions, (4) when the job is finished, the CAPER 3.0 client downloads the mzIdentML format results file from S3, and (5) the mzIdentML format results file is parsed by the CAPER 3.0 client, and identified peptides are shown in their genome context using the CAPER server. Blue dashed arrows denote the direction of the flow of data in the cloud.
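As an illustration of the ORF collection rule described under Database Construction (sequences between stop codons, no ambiguous residues, segments shorter than 7 residues discarded), the following is a minimal Python sketch. It is not the EMBOSS Transeq workflow used by CAPER 3.0: the function names, the pure-Python codon table, and the reporting of 0-based protein offsets instead of full genomic coordinates are all assumptions made for this example.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def translate(dna):
    """Translate codon by codon; codons with ambiguous bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=7):
    """Yield (strand, frame, protein_offset, peptide) for every kept segment.

    A segment is the stretch between two stop codons (*). Segments that
    contain X or are shorter than min_len are skipped (hypothetical helper).
    """
    dna = dna.upper()
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    for strand, seq in strands.items():
        for frame in range(3):
            offset = 0
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len and "X" not in segment:
                    yield strand, frame, offset, segment
                offset += len(segment) + 1  # also skip the stop codon


if __name__ == "__main__":
    toy = "ATG" + "GCT" * 7 + "TAA"  # toy sequence, not real genomic data
    for hit in six_frame_orfs(toy):
        print(hit)
```

Converting the reported strand, frame, and offset into chromosome coordinates, as CAPER 3.0 does, would additionally require the position of the input sequence on the chromosome.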

The “Sample-specific SAVs identification” pipeline searches against a customized variant peptide database built from a user-specified VCF file. For simplicity’s sake, only base substitution mutations in the VCF file were preserved. After that, the coding DNA sequence (CDS) that contained the mutation was identified, and the reference nucleotide was replaced with the variant one. If the reference nucleotide in a CDS was not equal to the reference nucleotide in the VCF file or if the consequence of the base substitution was not missense, then the mutation was ignored. Finally, the variant CDS was translated into the corresponding peptide, and the variant peptide’s genomic coordinate was calculated. As described above, the variant site was placed at the center with a maximum length of 81 amino acids for a variant peptide. Final peptides contained one missense mutation per sequence and were written into a FASTA file.

The “Exon-skipping events identification” pipeline uses a query database built from all possible compatible exon−exon junction sequences of every gene. The database building method is based on Mo et al.29 The algorithm found all possible exon−exon combinations of each gene in the Ensembl database, kept in-phase exon−exon combinations, and translated the exon−exon junction sequence into a peptide. Each peptide consisted of 25 amino acids from the end of the first exon and 25 amino acids from the beginning of the next exon of a given gene. Known exon−exon combination sequences were excluded. In addition, all human exons with genomic location information were downloaded from Ensembl BioMart, and the junction peptide’s genomic location was calculated. Finally, junction peptides with the calculated information were recorded in a FASTA file.
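The junction database construction described above can be sketched in Python as follows. This is an illustrative simplification, not the algorithm of Mo et al. or the CAPER 3.0 code: the `Exon` record, its phase convention (number of bases of a codon split across the junction), and the function names are hypothetical, and when a codon is split across the junction the sketch keeps that codon in addition to the 25 flanking residues on each side.

```python
from itertools import combinations, product
from typing import NamedTuple

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


class Exon(NamedTuple):
    """Hypothetical exon record; cds holds coding bases 5'->3' on the transcript strand."""
    exon_id: str
    cds: str
    start_phase: int  # bases of a split codon already supplied by the previous exon
    end_phase: int    # trailing bases that begin a codon completed in the next exon


def translate(dna):
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def junction_peptide(up, down, flank=25):
    """Translate the junction of two exons, or return None if they are out of phase."""
    if up.end_phase != down.start_phase:
        return None
    p = up.end_phase
    lead = (3 - p) % 3  # bases the downstream exon adds to finish the split codon
    if len(up.cds) < p or len(down.cds) < lead:
        return None
    # Take at most `flank` whole codons on each side, counted from the junction.
    up_len = min(flank * 3, (len(up.cds) - p) // 3 * 3)
    down_len = min(flank * 3, (len(down.cds) - lead) // 3 * 3)
    up_part = up.cds[len(up.cds) - p - up_len:]
    down_part = down.cds[:lead + down_len]
    return translate(up_part + down_part)


def skipping_junctions(exons, known_pairs, flank=25):
    """exons must be ordered 5'->3'; yield (up_id, down_id, peptide) for new in-phase pairs."""
    for up, down in combinations(exons, 2):
        if (up.exon_id, down.exon_id) in known_pairs:
            continue  # annotated junctions are excluded, as in the text
        peptide = junction_peptide(up, down, flank)
        if peptide:
            yield up.exon_id, down.exon_id, peptide
```

For a toy gene with three phase-0 exons in which only the adjacent junctions are annotated, `skipping_junctions` would return the single candidate that joins the first exon to the third.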



RESULTS AND DISCUSSION

Overview of CAPER 3.0

CAPER 3.0 is composed of two core packages (Figure 1A), a remote work package (RWP) and a local work package (LWP). The RWP deals with tasks that run in the cloud, including database search and peptide-level quality control, and is deployed to Amazon’s cloud platform. LWP is a Java-based client and deals with local operations. It offers a user-friendly interface with job and cloud management. The general steps of running a job by the CAPER 3.0 client are shown in Figure 1C. Users upload MS/MS data and configure a job in the CAPER 3.0 client before processing. When the job is finished, the CAPER 3.0 client will terminate the EC2 instances and download the mzIdentML format results file from S3 to the user’s computer. The results file will be parsed locally, and the identified peptides will be shown in their genome context using the CAPER server.

There are four main components in the CAPER 3.0 client, including data transfer, job configuration, progress tracking, and result visualization (Figure 2). Before using the CAPER 3.0 client, users need to log in to their cloud accounts using an access key and a secret key. The CAPER 3.0 client encrypts these keys and stores them to a file. The data transfer component has an FTP-like graphical interface, making it easy to upload or download data. Users can upload multiple files simultaneously, and the progress of the data transfer will be shown (Figure 2A). Several parameters need to be configured before running a job (Figure 2B). Job list, instance list, and instance console output are provided to track the progress (Figure 2C). When an EC2 instance is double-clicked, a terminal is provided for users to execute a command on that instance. Users can also terminate or reboot selected instances in the CAPER 3.0 client. The result visualization component (Figure 2D) gives a graphical interface of the analysis results, including the identified peptides, together with the corresponding spectra and match scores. In the result visualization component, users can also import and visualize the history results.

Currently, CAPER 3.0 supports MS/MS data only in MGF format. However, raw data can be converted to MGF format by msconvert (http://proteowizard.sourceforge.net/downloads.shtml). As a prerequisite to using CAPER 3.0, users should install Oracle Java SE runtime environment 8 (http://www.oracle.com/technetwork/java/javase/downloads/jre8-downloads-2133155.html) and register an AWS account (http://aws.amazon.com/). The CAPER 3.0 GUI client and CAPER 3.0 AMI will significantly reduce the effort required to use the software and ensure the reproducibility of an analysis.

The use of CAPER 3.0 to solve four specific data-intensive problems is illustrated: detecting novel peptides, identifying single amino acid variants (SAVs) derived from known missense mutations, identifying sample-specific SAVs, and identifying exon-skipping events.

Figure 2. Interfaces for the four main components of LWP of CAPER 3.0. (A) File transfer: (1) Local file system. The functions of the four buttons in the red rectangle are browsing file system, refreshing the current directory, up to parent directory, and deleting file/directory; (2) Cloud storage. The functions of the four buttons in the green rectangle are creating bucket (a bucket is a logical unit of storage in Amazon’s cloud, and its name is globally unique), listing buckets, refreshing bucket list/object list, and deleting object/bucket; (3) This panel contains data transfer tasks that are currently running, and it also shows data transferring speed. (B) Job configuration: (1) Cloud service parameters such as AMI id, key name (EC2 associates the public key with the name that you specify as the key name), security group (security group rules act as the firewall), instance type, and cluster size are specified here; (2) MS/MS file in MGF format is selected here; (3) Job type is selected in the combo box, and FDR value is set in the text field; (4) MR-Tandem parameters such as fragment tolerance and fixed/variant modifications are specified here. (C) Progress tracking: (1) Listing current job with attributes such as name of the processing MS/MS file, job start time, EC2 instances, status, etc; (2) Logging information on a Hadoop streaming job; (3) Listing instances with attributes such as AMI id, instance state (pending, running, or terminated), instance type, public IP, key name, launch time, and security group, and after clicking each instance in (3), its terminal window (4) will appear for further command. (D) Result visualization: (1) Identified peptides are listed, including the chromosome in which the corresponding coding region is located, DNA strand orientation, amino acid sequences, modifications, and other description information (like the ORF’s genomic coordinate in the analysis of “novel peptides detection”); (2) A potential novel peptide is shown in the CAPER 3.0 track viewer. There are three default tracks: “Ensembl Protein Coding Genes” track, representing gene models from Ensembl; “Transcript Evidence” track, showing transcripts supported by neXtProt or HLP transcriptome; and “Protein Evidence” track, exhibiting proteins supported by PeptideAtlas, GPMDB, HPA, or neXtProt; (3) Spectra that match the peptide are shown to determine the confidence of the identification, including their calculated m/z value, experimental m/z value, charge state, X!Tandem expect value, X!Tandem Hyperscore, local FDR, and Q value information; (4) A spectrum plot. The x-axis represents the ratio of the mass of an ion to the number of elementary charges that it carries, and the y-axis represents signal intensity of the ion.

Figure 3. Example of finding novel peptides by searching against the six-frame translation database of the human genome. (A) Configuration of the workflow of “Novel peptides detection”: the database of “Chr 1 (6-frame translation)” is selected, FDR threshold is set to 0.01, and known peptides are filtered. (B) “Identified Peptides” track is the visualized result. The yellow rectangle indicates a potential novel peptide sequence identified by CAPER 3.0. The peptide sequence is GIIWGEDTLMEYLENPK on the positive strand (black arrow) of chromosome 1. As shown by these three tracks, no protein coding gene, transcript, or peptide evidence has been found in this genomic region so far. (C) Speed improvement of parallel processing in the 8-CPU cloud compared to that of 1 CPU for the same data. vCPU is a virtual CPU in Amazon’s cloud, and the vertical bars are the time spent running X!Tandem.

Detecting Novel Peptides: Achieving a Comprehensive Map of the Human Proteome

Identifying novel peptides/proteins that are currently not part of annotated protein databases is essential to achieving a comprehensive map of the human proteome.6 MS-based proteomics measures proteins directly on a large scale and can provide a unique way to verify putative gene products at the protein level. Proteogenomics analysis (e.g., searching against the six-frame translation database of human genome) is a common strategy to detect novel peptides that are not present in standard reference databases, but it requires high-volume computations that are beyond the reach of small laboratories. Because of this requirement, a scalable data analysis system is crucial.


Figure 4. Example of identifying SAVs derived from known missense mutations. (A) Configuration of the workflow of “Known SAVs identification”: “Ensembl missense SNV” is selected, and the FDR threshold is set to 0.01. (B) The yellow rectangle is an identified variant peptide. The peptide sequence is TINGQQTIIACIESHQFQPK on the positive strand of chromosome 1. SNP rs377339360 is detected, with a codon change from GAT to AAT, resulting in the encoded residue changing from D to N at position 3 in the peptide (red arrow). At this genomic region, protein coding gene CAPZA1, transcript ENST00000263168, and peptide ENSP00000263168 evidence are found in the “Ensembl Protein Coding Genes” track, “Transcript Evidence” track, and “Protein Evidence” track, respectively. (C) Speed improvement of parallel processing in the 8-CPU cloud compared to that with 1 CPU for the same data.
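The codon change reported in this example (GAT to AAT, D to N) and the variant peptide layout described under Database Construction (variant site at the center, at most 40 flanking residues on each side, so at most 81 residues in total) can be sketched in Python as follows. The helper names and the 0-based indexing are assumptions made for illustration; this is not the CAPER 3.0 implementation, and stop-gained changes are treated here as non-missense.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def apply_snv(cds, pos, ref, alt):
    """Return (aa_index, ref_aa, alt_aa) for a base substitution at 0-based CDS index pos.

    Returns None when the reference base does not match the CDS or when the
    change is not missense (synonymous, or involving a stop codon).
    """
    if pos >= len(cds) or cds[pos] != ref:
        return None
    codon_start = pos - pos % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        return None  # incomplete trailing codon
    offset = pos % 3
    alt_codon = ref_codon[:offset] + alt + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa or "*" in (ref_aa, alt_aa):
        return None
    return codon_start // 3, ref_aa, alt_aa


def variant_peptide(protein, aa_index, alt_aa, flank=40):
    """Place the variant residue at the center with up to `flank` residues per side."""
    start = max(0, aa_index - flank)
    return protein[start:aa_index] + alt_aa + protein[aa_index + 1:aa_index + 1 + flank]


if __name__ == "__main__":
    cds = "ATGGATAAA"                      # toy CDS encoding M-D-K
    hit = apply_snv(cds, 3, "G", "A")      # GAT -> AAT
    print(hit)                             # (1, 'D', 'N')
    print(variant_peptide("MDK", hit[0], hit[2]))  # MNK
```

Near the protein termini the window is simply truncated, so variant peptides there are shorter than 81 residues.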

The “Novel peptides detection” pipeline was constructed by searching against the six-frame translation database (see Materials and Methods) and filtering known peptides. To configure the pipeline, users are required to specify the chromosome number, FDR threshold, and bucket to store results (Figure 3A). Users also need to specify some general parameters. For our tests, we always used 8 instances, each with 1 virtual CPU (vCPU), 3.75 GB memory, and 410 GB storage. X!Tandem parameters were always set as follows: fragment mass error is 0.05, cleavage enzyme is trypsin, e value is less than 0.01, fixed modification is cysteine carboxymethylation, and variant modification is cysteine oxidation. The test data were downloaded from PeptideAtlas (accession PASS00215; detailed information on the data is described by Sheynkman et al.30). The MS/MS file size is approximately 300 MB in this test. An example of finding novel peptides is shown in Figure 3. The potential novel peptide’s sequence is GIIWGEDTLMEYLENPK on the positive strand of chromosome 1, and there is no track evidence for the existence of this peptide (Figure 3B). The peptide has 2 spectrum matches, and all of the spectra carry 2 charges. Its calculated m/z is 1004.49. Database searching was processed in the 8-CPU cloud and achieved a ∼4.2-fold increase in speed compared to that with 1 CPU (Figure 3C). The speedup is lower than the number of EC2 instances used because the job contains several serial parts, such as generating theoretical spectra from the protein database. Additionally, MR-Tandem uses Hadoop as the distributed computing framework, and Hadoop guarantees that all tasks in one step are successfully completed before proceeding to the next. HDFS disk I/O is also a performance factor.
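One way to make the effect of the serial parts concrete is Amdahl's law, which is not used in the original analysis and is applied here only as a rough, illustrative model. Under the assumption that a fraction p of the single-CPU run time parallelizes perfectly over n instances, the speedup is 1 / ((1 − p) + p / n). The Python sketch below inverts this relation for the reported ∼4.2-fold speedup on 8 instances; the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Ideal speedup when only `parallel_fraction` of the work scales with n_workers."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)


def parallel_fraction(speedup, n_workers):
    """Invert Amdahl's law: the parallel fraction implied by an observed speedup."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_workers)


if __name__ == "__main__":
    p = parallel_fraction(4.2, 8)  # values reported for the novel peptide test
    print(f"implied parallel fraction: {p:.2f}")
    print(f"model speedup on 16 workers: {amdahl_speedup(p, 16):.1f}")
```

Under this simplified model, a 4.2-fold speedup on 8 instances corresponds to roughly 87% of the single-CPU run time being parallelizable. Hadoop scheduling overhead and HDFS I/O are not modeled separately, so the figure should be read as an order-of-magnitude estimate only.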

Identifying SAVs Derived from Known Missense Mutations: Facilitating Single Amino Acid Polymorphism Discovery

Each individual carries numerous single nucleotide variants (SNVs). A missense mutation is a point mutation that causes a codon to encode a different amino acid.31 For C-HPP, it is important to characterize these variant peptides and discover single amino acid polymorphisms (SAP) between individuals.5 However, these variant peptides cannot be identified directly by standard proteomic analysis due to the lack of mutated peptide information in regular reference databases. Two approaches have been reported for building a variant peptide database: one is utilizing variant data in a public repository and the other is using sample-matched RNA-Seq data. In CAPER 3.0, the pipeline “Known SAVs identification” uses a variant peptide database built from missense mutations in Ensembl. To configure this pipeline, users need to select “Known SAVs identification” in the combo box and specify the FDR threshold and output bucket name (Figure 4A). Users also need to specify search engine parameters, such as fragment tolerance and fixed/variant modifications. An example of detecting SAVs derived from known missense mutations is shown in Figure 4. The test data size is ∼100 MB, and other job parameters are as described above. One missense mutation, rs377339360, is identified. The codon changes from GAT to AAT, resulting in the encoded residue changing from D (Asp) to N (Asn). The potential variant peptide sequence is TINGQQTIIACIESHQFQPK on the positive strand of chromosome 1, with carboxyamidation of C at position 11. The variant peptide has 2 spectrum matches, and the spectra carry 3 charges. The calculated m/z is 1108.04. Database searching was processed in the 8-CPU cloud and achieved a ∼3.7-fold increase in speed compared to that with 1 CPU (Figure 4C).

Detecting Sample-Specific SAVs Using Customized Variant Peptide Database: Enabling Identification of Rare or Novel Single Amino Acid Polymorphisms

Using customized variant peptide databases built from sample-matched RNA-Seq data analysis provides another way to detect SAPs and has been found to be more accurate.5 RNA-Seq can be used to identify the nonsynonymous SNVs in a sample, which allows for the creation of a customized SAP database, thereby enabling identification of SAP peptides. Peptides derived from sample-specific RNA-Seq data can better estimate the real protein pool in the sample and thus improve the accuracy of peptide identification. Bioinformatics analysis of RNA-Seq data can generate a variant call format result, or VCF file. For example, SAMtools can generate the VCF file during SNV calling. CAPER 3.0 uses a VCF file to build the variant peptide database. The input of this job is a VCF file and MS/MS data. The VCF file is used to compile the sample-specific variant peptide database. To configure this job, users first need to upload the VCF file to S3 and then select the VCF file and specify the FDR threshold and output bucket name (Figure 5A). Users also need to specify search engine parameters, such as fragment tolerance and fixed/variant modifications. The test VCF file can be downloaded from the 1000 Genomes Project (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/functional_annotation/annotated_vcfs/). An example of detecting sample-specific SAVs using a customized variant peptide database is shown in Figure 5. The test MS/MS data size is ∼100 MB, the VCF file size is ∼178 MB, and other job parameters are the same as described above. A mutation on chromosome 17 at position 33285656 was identified. The reference nucleotide changes from C to A, resulting in a residue change from V to I. The potential variant peptide sequence is VATAQDDITGDGTTSNVLIIGELLK on the negative strand. It has 4 spectrum matches, and the matched spectra carry 2 charges. The calculated m/z is 1272.67. Database searching was processed in the 8-CPU cloud and achieved a ∼6.6-fold increase in speed compared to that with 1 CPU (Figure 5C).

Figure 5. Example of identifying sample-specific SAVs using a customized variant peptide database. (A) Configuration of the workflow of “Sample-specific SAVs identification”: a VCF file is specified, and the FDR threshold is set to 0.01. (B) The yellow rectangle is an identified variant peptide with the sequence VATAQDDITGDGTTSNVLIIGELLK on the negative strand of chromosome 17. A mutation at position 33285656 is detected, and the reference nucleotide changes from C to A, resulting in the encoded residue changing from V to I (red arrow). At this genomic region, protein coding gene CCT6B, transcript ENST00000421975, and peptide ENSP00000327191 evidence are found in the “Ensembl Protein Coding Genes” track, “Transcript Evidence” track, and “Protein Evidence” track, respectively. (C) Speed improvement of parallel processing in the 8-CPU cloud compared to that with 1 CPU for the same data.

Figure 6. Example of identifying exon-skipping events. (A) Configuration of the workflow of “Exon-skipping events identification”: “Exon−Exon junction” is selected, and the FDR threshold is set to 0.01. (B) The combination of the yellow rectangle and black line is a potential junction peptide identified by CAPER 3.0. The peptide sequence is DKNEQAFEEVFQNANFR and is found on chromosome 17. It indicates a potential skipping event between exon ENSE00001132340 and exon ENSE00003610995 in phase zero. At this genomic region, protein coding gene RPA1, transcript ENST00000254719, and peptide ENSP00000254719 evidence are found in the “Ensembl Protein Coding Genes” track, “Transcript Evidence” track, and “Protein Evidence” track, respectively. (C) Speed improvement of parallel processing in the 8-CPU cloud compared to that with 1 CPU for the same data.

2012AA020201), and the National Natural Science Foundation of China (31271407).



(1) Marko-Varga, G.; Omenn, G. S.; Paik, Y.-K.; Hancock, W. S. A first step toward completion of a genome-wide characterization of the human proteome. J. Proteome Res. 2013, 12, 1−5.
(2) Paik, Y.-K.; Omenn, G. S.; Thongboonkerd, V.; Marko-Varga, G.; Hancock, W. S. Genome-wide proteomics, Chromosome-centric Human Proteome Project (C-HPP), Part II. J. Proteome Res. 2014, 13, 1−4.
(3) Paik, Y.-K.; Jeong, S.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Cho, S. Y.; Lee, H.-J.; Na, K.; Choi, E.-Y.; Yan, F.; et al. The Chromosome-centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012, 30, 221−223.
(4) Hühmer, A. F. R.; Paulus, A.; Martin, L. B.; Millis, K.; Agreste, T.; Saba, J.; Lill, J. R.; Fischer, S. M.; Dracup, W.; Lavery, P. The Chromosome-centric Human Proteome Project: a call to action. J. Proteome Res. 2013, 12, 28−32.
(5) Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Scalf, M.; Smith, L. M. Large-scale mass spectrometric detection of variant peptides resulting from nonsynonymous nucleotide differences. J. Proteome Res. 2014, 13, 228−240.
(6) Kim, M.-S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; et al. A draft map of the human proteome. Nature 2014, 509, 575−581.
(7) Tovchigrechko, A.; Venepally, P.; Payne, S. H. PGP: parallel prokaryotic proteogenomics pipeline for MPI clusters, high-throughput batch clusters and multicore workstations. Bioinformatics 2014, 30, 1469−1470.
(8) Vizcaíno, J. A.; Deutsch, E. W.; Wang, R.; Csordas, A.; Reisinger, F.; Ríos, D.; Dianes, J. A.; Sun, Z.; Farrah, T.; Bandeira, N.; et al. ProteomeXchange provides globally coordinated proteomics data submission and dissemination. Nat. Biotechnol. 2014, 32, 223−226.
(9) Mohammed, Y.; Mostovenko, E.; Henneman, A. A.; Marissen, R. J.; Deelder, A. M.; Palmblad, M. Cloud parallel processing of tandem mass spectrometry based proteomics data. J. Proteome Res. 2012, 11, 5101−5108.
(10) Halligan, B. D.; Geiger, J. F.; Vallejos, A. K.; Greene, A. S.; Twigger, S. N. Low cost, scalable proteomics data analysis using Amazon’s cloud computing services and open source search algorithms. J. Proteome Res. 2009, 8, 3148−3153.
(11) Schatz, M. C.; Langmead, B.; Salzberg, S. L. Cloud computing and the DNA data race. Nat. Biotechnol. 2010, 28, 691−693.
(12) Langmead, B.; Schatz, M. C.; Lin, J.; Pop, M.; Salzberg, S. L. Searching for SNPs with cloud computing. Genome Biol. 2009, 10, R134.
(13) Schatz, M. C. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009, 25, 1363−1369.
(14) Armbrust, M.; Fox, A.; Griffith, R.; Joseph, A. D.; Katz, R.; Konwinski, A.; Lee, G.; Patterson, D.; Rabkin, A.; Stoica, I.; et al. A view of cloud computing. Commun. ACM 2010, 53, 50−58.
(15) Mell, P.; Grance, T. The NIST Definition of Cloud Computing, Special Publication 800-145; NIST: Gaithersburg, MD, 2011.
(16) Fox, A. Cloud computing - what’s in it for me as a scientist? Science 2011, 331, 406−407.
(17) Dillon, T.; Wu, C.; Chang, E. Cloud computing: issues and challenges. 24th IEEE Int. Conf. Adv. Inf. Networking Appl. 2010, 27−33.
(18) Goode, R. J. A.; Yu, S.; Kannan, A.; Christiansen, J. H.; Beitz, A.; Hancock, W. S.; Nice, E.; Smith, A. I. The proteome browser web portal. J. Proteome Res. 2013, 12, 172−178.
(19) Jeong, S.-K.; Lee, H.-J.; Na, K.; Cho, J.-Y.; Lee, M. J.; Kwon, J.-Y.; Kim, H.; Park, Y.-M.; Yoo, J. S.; Hancock, W. S.; et al. GenomewidePDB, a proteomic database exploring the comprehensive protein parts list and transcriptome landscape in human chromosomes. J. Proteome Res. 2013, 12, 106−111.

Identifying Exon-Skipping Events: Providing Valuable Information on Alternative Splicing Using Proteomics

Alternative splicing is an important gene regulation mechanism, and C-HPP plans to identify one representative alternatively spliced product for each protein, if present.32 While RNA-Seq is a common way to identify alternatively spliced transcripts, a proteomics approach is promising because it provides additional evidence at the protein level. There are several types of alternative splicing events, including exon skipping, 3′ and 5′ splice site selection, and intron retention.33 Currently, CAPER 3.0 detects exon-skipping events by considering all possible compatible exon−exon junction sequences of each gene. Users can also search for customized exon−exon junctions with this pipeline. To configure the pipeline, users select “Exon-skipping events identification” in the combo box and specify the FDR threshold and output bucket name (Figure 6A). Users also need to specify the search engine parameters, including fragment tolerance and fixed/variable modifications. An example of identifying exon-skipping events is shown in Figure 6. The test data size is ∼300 MB, and the other job parameters are the same as described above. A potential skipping event between exon ENSE00001132340 and exon ENSE00003610995 in phase zero is identified. The junction peptide sequence is DKNEQAFEEVFQNANFR, and it is found on chromosome 17. It has two spectrum matches, and the matched spectra are triply charged (3+). The calculated m/z is 695.99. The database search was run on the 8-CPU cloud and was ∼5.2-fold faster than with 1 CPU (Figure 6C).
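The idea of enumerating compatible exon−exon junctions can be illustrated with a minimal sketch. It assumes that "compatible" means the end phase of the upstream exon equals the start phase of the downstream exon (following the Ensembl phase convention), and that a skipping junction joins two exons with at least one exon between them. The `Exon` class, the function name, and the exon identifiers are hypothetical and are not taken from the CAPER 3.0 implementation.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple


@dataclass(frozen=True)
class Exon:
    exon_id: str      # hypothetical identifier, not a real Ensembl ID
    rank: int         # position of the exon in transcript order (1-based)
    start_phase: int  # 0, 1, 2, or -1 if the exon start is non-coding
    end_phase: int    # 0, 1, 2, or -1 if the exon end is non-coding


def compatible_skipping_junctions(
    exons: List[Exon], min_skipped: int = 1
) -> List[Tuple[str, str, int]]:
    """List (upstream_id, downstream_id, phase) for phase-compatible
    junctions that skip at least `min_skipped` exons.

    Assumption: a junction is compatible when the upstream end phase
    equals the downstream start phase and both are coding (not -1).
    """
    ordered = sorted(exons, key=lambda e: e.rank)
    junctions = []
    for up, down in combinations(ordered, 2):
        skipped = down.rank - up.rank - 1
        if skipped < min_skipped:
            continue  # adjacent exons form a canonical junction
        if up.end_phase == -1 or up.end_phase != down.start_phase:
            continue  # joining would shift the reading frame
        junctions.append((up.exon_id, down.exon_id, up.end_phase))
    return junctions


# Toy gene with four exons; identifiers and phases are made up.
toy_gene = [
    Exon("E1", 1, 0, 0),
    Exon("E2", 2, 0, 1),
    Exon("E3", 3, 1, 0),
    Exon("E4", 4, 0, 0),
]
print(compatible_skipping_junctions(toy_gene))
```

In a full pipeline, each retained junction would be translated in the indicated phase to produce a junction peptide sequence for the search database; that step is omitted here.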

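The calculated m/z reported for the junction peptide can be reproduced from standard monoisotopic residue masses. The sketch below assumes an unmodified peptide and the usual relation m/z = (M + z × m_proton) / z; the function name is hypothetical, and the calculation is an independent check rather than the search engine's own code.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.010565   # H2O added for the peptide termini
PROTON_MASS = 1.007276   # mass of one proton


def peptide_mz(sequence: str, charge: int) -> float:
    """Monoisotopic m/z of an unmodified peptide at the given charge."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS
    return (neutral_mass + charge * PROTON_MASS) / charge


# Junction peptide and charge state taken from the text.
mz = peptide_mz("DKNEQAFEEVFQNANFR", 3)
print(f"{mz:.2f}")  # prints 695.99
```

The result agrees with the calculated m/z of 695.99 given for the triply charged junction peptide. Fixed or variable modifications selected in the search parameters would shift this value and would need to be added to the residue masses.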


CONCLUSIONS

The previous CAPER browser has been updated to CAPER 3.0, a scalable cloud-based system for data-intensive analysis of C-HPP data sets. CAPER 3.0 offers bioinformatics solutions to four data-intensive problems, namely, novel peptide detection, known SAV identification, sample-specific SAV identification, and exon-skipping event identification. CAPER 3.0 is capable of processing large data sets and offers a user-friendly interface. We hope that CAPER 3.0 will facilitate big data processing in proteomics.



REFERENCES

AUTHOR INFORMATION

Corresponding Authors

*(D.L.) Tel: 86-10-80705999; Fax: 86-10-80705225; E-mail: [email protected].
*(F.H.) Tel: 86-10-68177417; Fax: 86-10-68177417; E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank Jun Qin, Weimin Zhu, Bei Zhen, Xiaohong Qian, Yunping Zhu, Ping Xu, and Hongxing Zhang for fruitful discussions. This work was supported by the Chinese Program of International S&T Cooperation (2014DFB30020), Chinese High Technology Research and Development (2015AA020108, H

DOI: 10.1021/pr501335w J. Proteome Res. XXXX, XXX, XXX−XXX


(20) Zgoda, V. G.; Kopylov, A. T.; Tikhonova, O. V.; Moisa, A. A.; Pyndyk, N. V.; Farafonova, T. E.; Novikova, S. E.; Lisitsa, A. V.; Ponomarenko, E. A.; Poverennaya, E. V.; et al. Chromosome 18 transcriptome profiling and targeted proteome mapping in depleted plasma, liver tissue and HepG2 cells. J. Proteome Res. 2013, 12, 123−134.
(21) Guo, F.; Wang, D.; Liu, Z.; Lu, L.; Zhang, W.; Sun, H.; Zhang, H.; Ma, J.; Wu, S.; Li, N.; et al. CAPER: a chromosome-assembled human proteome browsER. J. Proteome Res. 2013, 12, 179−186.
(22) Wang, D.; Liu, Z.; Guo, F.; Diao, L.; Li, Y.; Zhang, X.; Huang, Z.; Li, D.; He, F. CAPER 2.0: an interactive, configurable, and extensible workflow-based platform to analyze data sets from the Chromosome-centric Human Proteome Project. J. Proteome Res. 2014, 13, 99−106.
(23) Pratt, B.; Howbert, J. J.; Tasman, N. I.; Nilsson, E. J. MRTandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services. Bioinformatics 2012, 28, 136−137.
(24) Ghali, F.; Krishna, R.; Lukasse, P.; Martínez-Bartolomé, S.; Reisinger, F.; Hermjakob, H.; Vizcaíno, J. A.; Jones, A. R. Tools (viewer, library and validator) that facilitate use of the peptide and protein identification standard format, termed mzIdentML. Mol. Cell. Proteomics 2013, 12, 3026−3035.
(25) Kinsella, R. J.; Kahari, A.; Haider, S.; Zamora, J.; Proctor, G.; Spudich, G.; Almeida-King, J.; Staines, D.; Derwent, P.; Kerhornou, A.; et al. Ensembl BioMarts: a hub for data retrieval across taxonomic space. Database 2011, 2011, bar030.
(26) Fermin, D.; Allen, B. B.; Blackwell, T. W.; Menon, R.; Adamski, M.; Xu, Y.; Ulintz, P.; Omenn, G. S.; States, D. J. Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome Biol. 2006, 7, R35.
(27) Rice, P.; Longden, I.; Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 2000, 16, 276−277.
(28) Mathivanan, S.; Ji, H.; Tauro, B. J.; Chen, Y.-S.; Simpson, R. J. Identifying mutated proteins secreted by colon cancer cell lines using mass spectrometry. J. Proteomics 2012, 76, 141−149.
(29) Mo, F.; Hong, X.; Gao, F.; Du, L.; Wang, J.; Omenn, G. S.; Lin, B. A compatible exon−exon junction database for the identification of exon skipping events using tandem mass spectrum data. BMC Bioinf. 2008, 9, 537.
(30) Sheynkman, G. M.; Shortreed, M. R.; Frey, B. L.; Smith, L. M. Discovery and mass spectrometric analysis of novel splice-junction peptides using RNA-Seq. Mol. Cell. Proteomics 2013, 12, 2341−2353.
(31) Missense mutation; Wikipedia, the free encyclopedia, 2014; http://en.wikipedia.org/wiki/Missense_mutation.
(32) Paik, Y.-K.; Omenn, G. S.; Uhlen, M.; Hanash, S.; Marko-Varga, G.; Aebersold, R.; Bairoch, A.; Yamamoto, T.; Legrain, P.; Lee, H.-J.; et al. Standard guidelines for the Chromosome-centric Human Proteome Project. J. Proteome Res. 2012, 11, 2005−2013.
(33) Keren, H.; Lev-Maor, G.; Ast, G. Alternative splicing and evolution: diversification, exon definition and function. Nat. Rev. Genet. 2010, 11, 345−355.
