ComPIL 2.0: An Updated Comprehensive Metaproteomics Database. Sung-Kyu Robin Park, Titus Jung, Peter S. Thuy-Boun, Ana Y. Wang, John R. Yates, and Dennis W. Wolan. J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00722 • Publication Date (Web): 10 Dec 2018


Journal of Proteome Research

ComPIL 2.0: An Updated Comprehensive Metaproteomics Database

Sung Kyu (Robin) Park†‡, Titus Jung†‡, Peter S. Thuy-Boun†#, Ana Y. Wang†#, John R. Yates III†‡*, Dennis W. Wolan†#*

†Department of Molecular Medicine, #Department of Integrative Structural and Computational Biology, and ‡Department of Neuroscience, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA

*Email: DWW ([email protected]) and JRY ([email protected])

ABSTRACT

We designed a metaproteomic analysis method (ComPIL) to accommodate the ever-increasing number of sequences against which experimental shotgun proteomics spectra can be accurately and rapidly queried. Our objective was to create large databases for the analysis of complex meta-samples of unknown composition, including those derived from human, animal, and environmental microbiomes. The amount of high-throughput sequencing data has increased substantially since our original database was assembled in 2014. Here, we present a rebuild of the ComPIL libraries composed of updated, publicly disseminated sequence data, as well as a modified version of the search engine, ProLuCID-ComPIL, optimized for querying experimental spectra. ComPIL 2.0 consists of 113 million protein records and roughly 4.8 billion unique tryptic peptide sequences and is 2.3 times the size of our original version. We searched a dataset collected from a healthy human gut microbiome proteomic sample against both versions and found that ComPIL 2.0 yielded a substantial increase in the number of unique identified peptides and proteins compared to the first ComPIL version. The high confidence and accuracy of protein identification demonstrated with ComPIL 2.0 may encourage the method’s application to large-scale proteomic annotation of complex protein systems.

Keywords: ComPIL, ProLuCID, proteomics search engine, microbiome, metaproteomics

INTRODUCTION

Tandem mass spectrometry (MS/MS) has become one of the most sensitive and high-throughput methods of choice for discovery-oriented proteomics experiments in recent years. MS/MS data collected during shotgun proteomics experiments are typically matched to peptides derived from a database of protein sequences by software such as SEQUEST,1 Mascot,2 and ProteinProspector.3 These programs assign MS/MS data to in silico-digested peptide sequences, and the highest-scoring peptide candidates, ranked by peptide mass, charge state, and fragmentation, are chosen as the likely peptide identifications for individual spectra.1,2,4-6 A critical prerequisite is that the correct sequence must be present in the protein database. Spectra corresponding to peptides missing from the database will be left unidentified or, worse, lead to erroneous conclusions following incorrect protein identification.7 This is a minor issue for experimentalists studying homogeneous samples (e.g., yeast, a single bacterium, a human cell line), as the genomes of individual species can be readily accommodated in the corresponding protein database, which can also include single nucleotide polymorphisms (SNPs) and proteins with known posttranslational modifications.8 For heterogeneous biological samples (e.g., complex environmental or microbiome samples), the completeness of the protein database is crucial for spectra from all organisms present to be correctly assigned. Incomplete databases leave many proteins from complex samples unidentified.
In 2016, we published the method “ComPIL” (Comprehensive Protein Identification Library), which uses high-performance peptide and protein databases that are scalable to an essentially unlimited number of protein sequences.9 The original ComPIL database consisted of high-quality genomic sequence information publicly available in 2014 (~500 times the size of the human genome) and was highly integrated with a modified version of SEQUEST called “Blazmass”.1,9 We thoroughly evaluated the sensitivity of the databases against public proteomics data as well as newly collected MS/MS shotgun proteomics data
collected from human microbiome samples. We have since used ComPIL for all of our subsequent metaproteomics studies on colitic mice10,11 and healthy human samples.12 The combination of ComPIL with Blazmass allowed proteomic searches to be performed with database sizes much larger than previously possible. The amount of public high-throughput sequencing data has expanded tremendously since our initial assembly of the original ComPIL databases in 2014. For example, the number of proteins annotated in NCBI RefSeq13 has increased by more than 8x to >113 M proteins (from >81,000 organisms). Such extensive protein sequence databases, derived from enormous panels of organisms with many protein variants, can overwhelm proteomic search methods. Our accessible computational search methods were designed for rapid protein sequence database queries against the ever-increasing number of public protein sequences. We demonstrate the application and utility of ComPIL 2.0 with the ProLuCID-ComPIL search engine against a human-only cell lysate proteomics dataset and a highly complex human distal gut microbiome sample. We show that additional insights are gained from our microbiome data when the new and extensive genomic sequencing information generated over the last 4 years is incorporated into the ComPIL databases.

MATERIALS AND METHODS

Protein sequence repositories used for ComPIL 2.0

Proteins were downloaded from NCBI RefSeq, UniProt, the NIH Human Microbiome Project reference genomes, and the HMGI metagenomic sequence repositories (gastrointestinal tract and stool subsections downloaded from http://hmpdacc.org/HMGI/ on 11/22/17). For the NCBI RefSeq repository, the following subsections of the release were downloaded in protein FASTA format: Viral, Fungi, Mitochondrion, Plasmid, Protozoa, Archaea, and Bacteria. For the UniProt database, the human proteome and the complete proteomes for the subsections Archaea, Bacteria, and Viruses were downloaded and incorporated (Table S1). All protein sequences were incorporated without processing or alteration, which resulted in a total of
112,739,995 protein sequences. The number of proteins in ComPIL 2.0 is ~1.4x that of the original database (82,817,736 protein sequences). ComPIL users can customize the peptide library to incorporate a representative database of any size of their choosing.

Hardware and software details

The ComPIL 2.0 database was created on a CentOS 7 desktop computer with an Intel Core i7-6700 CPU, 16 GB of RAM, and a 1 TB hard drive. Additional space was added by mounting a 3 TB Seagate passport. Software used included Python 2.7.5, Python 3.4.3, Java 1.8 (Oracle JDK), GNU Parallel 2014_0622, GNU Sort 8.22, and MongoDB 3.0.6. The ComPIL database was uploaded to a MongoDB cluster running MongoDB 3.0.6. The cluster was an 8-node Aeon Computing Eclipse Microcloud Server System, with each Microcloud node outfitted with an Intel Xeon E5-1650 v2 6-core CPU, 64 GB of RAM, and a 512 GB Samsung 840 Pro solid-state drive. Proteomic searches were performed on a ~5000-core Linux cluster located in the TSRI High Performance Computing (HPC) core facility. Our microbiome LC-MS/MS data searches employed ~100 cores and were completed within 24 h. Fewer cores can be used; however, a corresponding increase in search time should be anticipated.

Generation of ComPIL 2.0 databases

The ComPIL 2.0 database was generated by downloading the sequence libraries and combining them into a 40 GB FASTA file. For false discovery rate (FDR) calculations, the protein database entries were reversed and appended to the FASTA file to create decoy proteins, which resulted in a combined FASTA file approximately 80 GB in size. Tryptic peptides (i.e., cleavage after lysine and arginine residues) were generated in silico for each protein sequence using the Blazmass index file generation feature, with the constraints that all peptides be fully tryptic (i.e., have two tryptic ends) and contain a maximum of three internal missed cleavage sites. Approximately 9.5 billion unique forward and reverse peptide sequences were organized into two files, MassDB and SeqDB, and proteins were organized in ProtDB. In MassDB, peptides were indexed by mass, with each index containing the sequences of that mass. In SeqDB, peptides were indexed by sequence,
with each entry containing protein references and residue information. ProtDB is indexed by locus and holds the protein sequence and description in each entry. Both the SeqDB and MassDB databases were generated as flat JSON files to allow easy transfer and data population into the MongoDB cluster. The JSON files were uploaded sequentially using custom Python scripts and parallel threads. When uploaded to the cluster, ProtDB, MassDB, and SeqDB required 202, 270, and 1051 GB of space, respectively. ComPIL 2.0 is freely available, and specific download and configuration instructions for local operation are provided at https://github.com/robinparky/prolucidComPIL.
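
The digestion, decoy, and sequence-indexing steps described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Blazmass index generator: the function names (`cleave`, `tryptic_peptides`, `build_seq_index`), the `Reverse_` decoy prefix, and the in-memory dictionary standing in for SeqDB are all hypothetical, and protein termini are treated as valid tryptic ends.

```python
def cleave(protein):
    """Split a protein after every K or R (the cleavage rule stated in the text)."""
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):  # C-terminal piece that does not end in K/R
        pieces.append(protein[start:])
    return pieces


def tryptic_peptides(protein, max_missed=3):
    """Return the set of fully tryptic peptides with up to max_missed missed cleavages."""
    pieces = cleave(protein)
    peptides = set()
    for i in range(len(pieces)):
        # Joining k + 1 consecutive pieces leaves k internal missed cleavage sites.
        for j in range(i, min(i + max_missed + 1, len(pieces))):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


def build_seq_index(proteins, decoy_prefix="Reverse_"):
    """Map each peptide to the sorted loci (forward and reversed decoy) that contain it."""
    index = {}
    for locus, sequence in proteins.items():
        # Decoys are the reversed protein sequences, as described for FDR estimation.
        entries = ((locus, sequence), (decoy_prefix + locus, sequence[::-1]))
        for name, seq in entries:
            for peptide in tryptic_peptides(seq):
                index.setdefault(peptide, set()).add(name)
    return {peptide: sorted(loci) for peptide, loci in index.items()}
```

Calling `build_seq_index({"P1": "MAKGLR", "P2": "GLRMAK"})` yields one entry per unique peptide, so a peptide shared by both proteins is stored once with two protein references, which mirrors the deduplication described for SeqDB.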

Generation of ProLuCID-ComPIL search algorithm

Similar to our original Blazmass-ComPIL program, we developed a ProLuCID-based ComPIL search engine, which also uses MongoDB, a NoSQL JSON-based database. Our original version of ProLuCID generates peptides in silico for each spectrum search and is designed to generate and hold peptide sequences in memory or to query large SQLite databases. In contrast, ProLuCID-ComPIL queries the MongoDB server for pre-processed peptide candidates housed in MassDB within the mass range of the spectrum it is searching. Therefore, ProLuCID-ComPIL uses much less memory and search time than the original ProLuCID. Furthermore, multiple MongoDB nodes can be used to allow for faster response times and simultaneous ProLuCID-ComPIL searches, which is advantageous when searches are performed in a cluster or cloud setting.
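
The idea of retrieving pre-computed candidates by mass range can be illustrated with a small in-memory stand-in for MassDB. The sketch below is not the ProLuCID-ComPIL query code and makes no MongoDB calls; the names `ppm_window`, `build_mass_index`, and `candidates_in_window` are hypothetical, the peptide labels are placeholders, and the 5 ppm default simply mirrors the precursor tolerance quoted in the search settings.

```python
import bisect


def ppm_window(mass, ppm):
    """Return the (low, high) mass bounds for a symmetric ppm tolerance."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def build_mass_index(peptides_by_mass):
    """Sort the mass keys once so that range lookups can use binary search."""
    return sorted(peptides_by_mass)


def candidates_in_window(sorted_masses, peptides_by_mass, precursor_mass, ppm=5.0):
    """Collect all peptides whose indexed mass lies within the ppm window."""
    low, high = ppm_window(precursor_mass, ppm)
    first = bisect.bisect_left(sorted_masses, low)
    last = bisect.bisect_right(sorted_masses, high)
    found = []
    for mass in sorted_masses[first:last]:
        found.extend(peptides_by_mass[mass])
    return found
```

Because candidates are looked up by a bounded mass range instead of being regenerated per spectrum, the client only ever holds the handful of peptides relevant to the spectrum at hand, which is the memory saving the paragraph above describes.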

Fecal microbiome sample preparation for LC-MS/MS analysis

Fecal samples from healthy adult volunteers were suspended in PBS, pH 7.4, frozen, and lyophilized. Samples were prepared by triflic acid (TA) treatment, as previously described.12 Briefly, 300 µL of 1:9 toluene:TA was added to each sample at -80 °C (dry ice/acetone bath). The sample was subsequently warmed to 4 °C and gently agitated for 30-60 min until completely solubilized. The sample was frozen at -80 °C and quenched with 900 µL of a cold solution of 1:1:3 water:methanol:pyridine, and the frozen pellet was then allowed to dissolve with frequent venting. The resulting solution was diluted to 10 mL with 50%
methanol/water and buffer-exchanged three times with 100 mM Tris HCl, pH 8.0 in a 3K MWCO centrifugal concentrator. The remaining 1 mL protein solution was lyophilized, and the resulting dried sample was resuspended in 5 M urea in 100 mM Tris HCl, pH 8.0. The concentration of fully dissolved protein was quantified using the BCA assay.

Preparation of samples for LC-MS/MS analysis

Samples (100 μg of protein each) suspended in 120 μL of 100 mM Bicine, 5 M urea, pH 8.0 were treated with tris(2-carboxyethyl)phosphine (TCEP) (Thermo Fisher, Pierce, 20490) at a final concentration of 5 mM and incubated for 15 min, followed by chloroacetamide (Sigma-Aldrich, 201-174-2) treatment at a final concentration of 25 mM for 15 min in the dark. The mixture was diluted to a final volume of 500 μL with trypsin buffer (100 mM Tris HCl, pH 8.0, 1 mM CaCl2), and 2 μg of trypsin (Promega, V5111) was incubated with each sample overnight at 37 °C. Samples were then treated with 25 μL of formic acid and centrifuged at 12,000 x g for 5 min. The uppermost 475 μL of liquid was decanted and stored at -20 °C until further use.

LC-MS/MS data collection

The microbiome sample peptide solutions (40 µL) were dried using a SpeedVac concentrator and desalted using 10 µL ZipTips C18 (Millipore). The dry peptides were reconstituted in 40 µL of 0.1% formic acid in water, and 10 µL of the peptide solution was used for LC-MS/MS analysis with a Thermo Fisher Orbitrap Fusion™ Tribrid™ mass spectrometer coupled to an nLC 1000 system in the Proteomics Core at TSRI Florida. Peptides were eluted onto an analytical reverse phase column (0.075 x 150 mm Acclaim PepMap RLSC nano Viper, Thermo Fisher) at 300 nL/min with the following gradient, using buffer A (0.1% formic acid in H2O) as the diluent: 5-25% buffer B (80% acetonitrile, 0.1% formic acid, 20% H2O) in 160 min, 25-44% buffer B in 80 min, 44-80% buffer B in 10 sec, 80% buffer B for 5 min, 80-5% buffer B in 10 sec, and 5% buffer B for 20 min. The MS was operated with the following settings: MS1 scan range of 380-1400 m/z with a mass tolerance of 10 ppm and 120K resolution using Orbitrap detection, data-dependent MS/MS mode at
the maximum speed with precursor priority of the most intense ions, HCD fragmentation with a normalized collision energy of 30%, 1.0E4 AGC Target intensity threshold for MS2, charge states 2-8 included in the screening parameters, a mass resolution of 30K for MS2, and dynamic exclusion set to exclude a precursor after two occurrences within 30 sec, with an exclusion duration of 20 sec. All LC-MS/MS data have been deposited to the UCSD Center for Computational Mass Spectrometry (MassIVE) at https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=322c4efbb5c24c97a2f68a53fbed165a (username: MSV000082943_reviewer, password: test321).

LC-MS/MS data analysis using Blazmass with the original ComPIL database

Precursor and fragmentation ion data were extracted from the Xcalibur RAW files via RawConverter 1.0.0.0 (http://fields.scripps.edu/rawconv/) in the MS1 and MS2 file formats. The conversion was performed with the monoisotopic m/z in data-dependent acquisition option enabled. The human microbiome sample MS2 spectra were scored using Blazmass 0.9993 against the peptides of the original ComPIL database.9 Blazmass and ComPIL source code are open source (https://github.com/sandipchatterjee/blazmass_compil). Settings for peptide scoring included a static modification for alkylated cysteine residues (+57.02146 Da), a precursor mass tolerance of 5 ppm, and a fragmentation ion tolerance of 10 ppm. DTASelect 2.1.3 (http://fields.scripps.edu/yates/wp/?page_id=17) was used for filtering with the requirements of two peptides per protein and a protein FDR of 1%. MS2 files were each split into 25-50 chunks in order to parallelize database searches on the Linux cluster located in the TSRI HPC core. Chunks were recombined after searches were complete.
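
Splitting an MS2 file into chunks for parallel searching could look roughly like the sketch below. It is not the script used in this work: the function name `split_ms2` and the round-robin assignment of spectra to chunks are assumptions, and the code relies only on the MS2 convention that header lines begin with `H` and each spectrum begins with an `S` line.

```python
def split_ms2(lines, n_chunks):
    """Distribute MS2 spectra round-robin over at most n_chunks chunks."""
    header = [line for line in lines if line.startswith("H")]
    spectra = []
    for line in lines:
        if line.startswith("H"):
            continue
        if line.startswith("S"):
            spectra.append([line])      # an S line opens a new spectrum
        elif spectra:
            spectra[-1].append(line)    # charge lines and peaks belong to it
    chunks = [[] for _ in range(n_chunks)]
    for k, spectrum in enumerate(spectra):
        chunks[k % n_chunks].extend(spectrum)
    # Repeat the header in every non-empty chunk so each one is self-describing.
    return [header + chunk for chunk in chunks if chunk]
```

Each returned chunk can be written to its own file and searched independently; because spectra are scored one at a time, the per-chunk results can simply be concatenated afterwards.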

LC-MS/MS data analysis using ProLuCID-ComPIL with ComPIL 2.0 database

Tandem mass spectra were matched to sequences using the ProLuCID-ComPIL program against ComPIL 2.0 with the same settings as for the Blazmass-ComPIL search, including a static modification for alkylated cysteine residues, a precursor mass tolerance of 5 ppm, and a fragmentation ion tolerance of 10 ppm. Searches were performed
on an Intel Xeon cluster running under the Linux operating system. The validity of peptide/spectrum matches (PSMs) was also assessed in DTASelect 2.1.3. Peptide match probabilities were calculated based on a nonparametric fit of the direct and decoy score distributions, as full separation of the direct and decoy PSM subsets is not generally possible. The FDR was calculated as the percentage of reverse decoy PSMs against target PSMs that passed the confidence threshold. Each protein identified was required to have a minimum of two peptides. After this last filtering step, we estimate that protein FDRs were below 1% for each sample analysis.
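
The decoy-based FDR described above, the percentage of reverse decoy PSMs relative to target PSMs passing a threshold, can be written out as a short function. This is a simplified sketch, not DTASelect's implementation: the name `fdr_percent`, the `(score, is_decoy)` tuple layout, and the use of a single score cutoff in place of DTASelect's probability model are all assumptions.

```python
def fdr_percent(psms, threshold):
    """Decoy PSMs as a percentage of target PSMs among those at or above threshold.

    psms is an iterable of (score, is_decoy) pairs.
    """
    passed = [(score, is_decoy) for score, is_decoy in psms if score >= threshold]
    decoys = sum(1 for _, is_decoy in passed if is_decoy)
    targets = len(passed) - decoys
    if targets == 0:
        raise ValueError("no target PSMs pass the threshold; FDR is undefined")
    return 100.0 * decoys / targets
```

Raising the threshold until this value falls below the desired level (1% in this work, applied at the protein level) is the usual way such a filter is tuned.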

Taxonomy analysis and functional annotation of proteins and protein clusters

All forward filtered peptide sequences generated from the ComPIL 2.0, original ComPIL, and UniProt human database searches were submitted to Unipept 4.0.0, using the UniProt 2018.06 database, for taxonomic analyses and generation of Enzyme Commission (E.C.) numbers and molecular function (MF), cellular component (CC), and biological process (BP) Gene Ontology (GO) terms.14,15 Importantly, these analyses are independent of the ProLuCID searches, and users can employ any functional annotation program that accepts FASTA files or peptide sequences (e.g., COG, eggNOG).

RESULTS AND DISCUSSION

ComPIL 2.0 database organization

ComPIL was organized into three distributed NoSQL databases implemented in MongoDB: protein sequences were stored in a database termed “ProtDB”, and in silico trypsin-digested peptide sequences were binned by identical peptide mass or sequence and stored in the databases MassDB or SeqDB, respectively.9 Our new ProtDB database contains 225,479,990 protein sequences (including reverse proteins for FDR), which yield over 9.5 billion unique in silico trypsin-digested peptides in SeqDB. These databases are 1.4x and 2.3x the size of the original ComPIL ProtDB (165,635,471 forward and reverse sequences) and SeqDB (~4 billion peptides) libraries, respectively (Figure 1A). Very few unique tryptic peptides corresponding to the newly sequenced proteins overlap with those found in the original SeqDB, as evidenced by the amount
of new tryptic peptides with respect to the number of new proteins. This observation demonstrates that newly deposited sequences continue to expand the number of unique peptides for inclusion into ComPIL 2.0 (Figure 1B). Our established ProLuCID search engine has options to preprocess FASTA files and store them in a binary format, such as in SQLite embedded databases. However, these solutions work best for small and medium-sized databases only. The entire ComPIL 2.0 set of databases requires over 1.5 TB of disk storage, which makes in-memory or embedded databases impractical. Since our first publication describing Blazmass-ComPIL, we have modified the ProLuCID search engine for compatibility with the ComPIL database to permit expansion of the user space and accommodate the increasing number of publicly available genomic sequences. As stated, ProLuCID-ComPIL uses a NoSQL JSON-based MongoDB database and a MongoDB driver to communicate with the MongoDB server. This design provides horizontal scalability, as additional storage can be easily provided by adding new computers to the network. As such, a MongoDB server could grow indefinitely in memory capacity with the addition of new nodes.

ComPIL 2.0 database composition adds unique tryptic peptide sequences

The scalability of our protein and peptide sequence databases can readily accommodate more protein sequence information as the data become publicly available (see Table S1 for details on the databases in ComPIL 2.0). We distilled the SeqDB database down to unique peptide sequences only to optimize the efficiency and speed of searches. This is particularly important as replicate sequence information within species and strain-to-strain conservation become commonplace. Of all peptides, 75.3% correspond to only one of the ~225M forward and reverse proteins, and only 9.1% map to four or more proteins. As the majority of peptide sequences (90.9%) map to three or fewer parental proteins, each peptide added to SeqDB provides a significant amount of new information with regard to proteins.
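
The peptide-to-protein multiplicity statistics quoted above can be computed from any sequence-keyed index. The following sketch assumes a plain dictionary mapping each peptide to the list of protein loci that contain it; the function name `mapping_distribution` and this data layout are illustrative and do not reflect the actual SeqDB schema.

```python
from collections import Counter


def mapping_distribution(seq_index):
    """Return (% of peptides mapping to one protein, % mapping to four or more)."""
    # Count how many distinct proteins each peptide maps to.
    counts = Counter(len(set(loci)) for loci in seq_index.values())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the index contains no peptides")
    single = 100.0 * counts[1] / total
    four_plus = 100.0 * sum(n for k, n in counts.items() if k >= 4) / total
    return single, four_plus
```

Applied to the full database, such a tally would correspond to the 75.3% and 9.1% figures reported in the text; the remaining peptides map to two or three proteins.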

Search of mass spectrometry data collected from human cell line HEK293

The chance of incorrect matches against each mass spectrum increases as the number of database entries expands. We previously demonstrated with MudPIT16 data collected on a human-only HEK293 cell lysate that searches against the original ComPIL library correctly identified 99.7% of the same peptides as searches against a human-only database, despite having >500x as many peptide candidates.9 We searched the same dataset (PRIDE partner repository identifier PXD003896)17 against ComPIL 2.0 and the UniProt human database consisting of 40,478 forward and reverse protein entries. While the human entries account for only ~0.02% of the ComPIL 2.0 ProtDB, scatterplots of DeltaCN vs. XCorr values for PSMs from the ComPIL 2.0 search separated spectra matching human peptides from the false-positive reverse peptide matches similarly to the target-decoy search against the human proteome database (Figure S1). Both searches had similarly low median XCorr (1.8 with ComPIL 2.0, 1.2 with the human database) and DeltaCN (0.04 with ComPIL 2.0, 0.1 with the human database) values for erroneous reverse PSMs. The decrease in DeltaCN values for ComPIL 2.0 is anticipated as our database size grows. This decrease in the score difference between the best and second-best peptide matches is correlated with increased difficulty in distinguishing true from false-positive peptides. Notwithstanding, the strong correlation in PSM data clustering between the human and ComPIL 2.0 databases demonstrates that the ProLuCID-ComPIL search engine accurately differentiates human from non-human PSMs despite the overwhelming number of “non-human” and FDR reverse decoy peptides. Peptide spectrum matches were filtered in DTASelect based on having at least 2 peptide identifications per protein and a 1% FDR at the protein level. The number of filtered peptide matches was reasonably high in the ComPIL 2.0 search despite it having >5,500x as many protein candidates as the human proteome search.
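
To make the DeltaCN argument concrete, the sketch below computes the score gap between the best and second-best candidates for one spectrum. The text does not spell out the formula, so this assumes the conventional SEQUEST-style normalized difference, (best XCorr - second-best XCorr) / best XCorr; the function name `delta_cn` is hypothetical.

```python
def delta_cn(xcorrs):
    """Normalized gap between the two highest XCorr scores for one spectrum."""
    ranked = sorted(xcorrs, reverse=True)
    if len(ranked) < 2:
        raise ValueError("at least two candidate scores are required")
    best, second = ranked[0], ranked[1]
    if best <= 0:
        raise ValueError("the best XCorr must be positive")
    return (best - second) / best
```

With more candidates per spectrum, the runner-up score tends to move closer to the best score, so this ratio shrinks, which is consistent with the lower median DeltaCN reported for the larger database.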
We found 15,764 unique peptides appearing in both filtered search result datasets, with 4,521 and 5,671 peptides appearing only with the human or ComPIL 2.0 databases, respectively (Figure S2A,B). As shown with the original ComPIL expanded search, we identified peptides corresponding to human adenovirus 5 proteins E1A and E1B (see Supporting Information), which helps to demonstrate how unexpected proteomic information can be acquired when the search database is expanded to include a broader spectrum of proteins. As predicted for larger databases,18-20 the increased search space afforded by
ComPIL 2.0 compared to the original version resulted in an increase in the number of additional peptides matched only to the ComPIL library (Figure S2B). The proportion of additional peptides unique to the ComPIL 2.0 search is ~20% and does not account for charge-state duplicates. We used Unipept 4.0.0 to identify the potential source of peptides unique to ComPIL 2.0, and almost all matched to eukaryotic organisms (including humans) that have conserved peptides, incorrectly annotated proteins (i.e., 18 PSMs matched better to bacterial constituents and 28 PSMs matched better to fungal peptides), and/or contaminants such as HAdV5 (see Supporting Information). ComPIL 2.0 and human-only database peptides were matched to the corresponding protein E.C. numbers and MF, CC, and BP GO terms with Unipept 4.0.0 (Fig. S2C-F). The strong overlap across all protein annotations between the large and targeted databases (i.e., ComPIL 2.0 and human-only, respectively) demonstrates the accuracy of ProLuCID matches with the large ComPIL 2.0 database.

Human metaproteomics data searched with ComPIL 2.0 yields new protein information

We prepared microbiome proteomes with a recently described treatment method12 from fecal samples provided by a healthy adult volunteer. LC-MS/MS data collected on an Orbitrap Fusion™ mass spectrometer were searched against ComPIL 2.0 and our original ComPIL databases with identical parameters, as described in Materials and Methods. Comparison of the DeltaCN vs. XCorr scatterplots shows that the distribution of true vs. false-positive decoy peptides is well conserved between PSMs obtained from both databases, with 21,315 common PSMs (Fig. 2, 3A). After DTASelect filtering, ComPIL 2.0 and the first version yielded 9,111 and 7,449 peptides, respectively, with 5,846 peptides matched in both searches (Fig. 3B). ComPIL 2.0 generated 3,265 unique peptides, whereas 1,603 peptides were identified only in the original ComPIL database search and not in ComPIL 2.0 (Fig. 3B). Importantly, with the newly appended database entries in ComPIL 2.0, the peptide search space is dramatically increased in comparison to the first ComPIL database. Therefore, peptides originally matched to spectra in ComPIL may be displaced as the top-ranking hit in ComPIL 2.0 and replaced with a more
definitive peptide candidate. As DTASelect takes the top peptide in the rank per spectrum, previously identified PSMs are discarded. We employed Unipept 4.0.0 to match peptides to parental proteins and species. For ComPIL 2.0 results, Unipept was able to match 7,215 of the 9,111 filtered peptides to an organism, 3,228 to an E.C. number, and 5,536 to MF, CC, and/or BP GO terms from the UniProt 2018.06 database (Fig. 3C-G). Similarly, 6,060 out of 7,449 peptides identified from the original ComPIL database were matched with Unipept (2,724 to an E.C. number, and 4,559 to MF, CC, and/or BP GO terms). The taxonomic analyses at the genus level showed that a far greater number of organisms are potentially identified in the ComPIL 2.0 (226 genera) compared to the original database (103 genera) (Fig. 3C). Expansion to lowest common ancestors (LCA) revealed that 206 and 425 identifiable could be ascertained by peptides identified from the original ComPIL search (Fig. 4A) and ComPIL 2.0 (Fig. 4B), respectively. Comparison of the two searches showed that both ComPIL 2.0 (271) and the original ComPIL (51) had unique LCA not annotated in the opposing results (Fig. 4, see Supporting Information). Interestingly, those taxonomy with the highest number of associated peptides in ComPIL 2.0 belonged primarily to dietary organisms, including (but not limited to), Theobroma cacao (cocoa plant), Sesamum indicum (sesame seeds), Solanum tuberosum (potato), Oryza sativa (rice), and Gallus gallus (chicken). Proteomic matches were also more extensive for the Actinobacteria sp. and some Fungi in the new database (Fig. 4B). We anticipate that additional microbiome organisms, including viruses and phages will become more prevalent in metaproteomics datasets as these species are introduced into the public sequencing repositories and appended to proteomics databases. With the assistance of Unipept,14,15 we putatively identified 704 (72 unique) and 838 (206 unique) MF (Fig. 
3E), 268 (15 unique) and 312 (59 unique) CC (Fig. 3F), and 835 (83 unique) and 1,017 (265 unique) BP (Fig. 3F) GO terms from the peptides identified by the original ComPIL and ComPIL 2.0 searches. The top molecular function GO terms unique to the original ComPIL data were primarily from bacteria and included: 1) phospholipase A2 activity (GO:0004623, 4 peptide matches); 2) calcium-dependent phospholipase A2 activity (GO:0047498, 4 peptide matches); 3) bile acid binding (GO:0032052, 3 peptide matches); and 4) enzyme activator activity (GO:0008047, 3 peptide matches) (see Supporting Information). ACS Paragon Plus Environment


ComPIL 2.0 had 206 unique MF GO terms associated with the PSMs, including: 1) nutrient reservoir activity (GO:0045735, 73 peptide matches); 2) cysteine-type endopeptidase inhibitor activity (GO:0004869, 11 peptide matches); 3) aspartic-type endopeptidase inhibitor activity (GO:0019828, 8 peptide matches); and 4) oxygen binding (GO:0019825, 7 peptide matches) (see Supporting Information). Far more peptides were assigned to the GO terms uniquely associated with the ComPIL 2.0 data than to those unique to the original ComPIL matches. Interestingly, nutrient reservoir activity (GO:0045735) is associated specifically with dietary proteins, including those from Theobroma cacao (cocoa plant) and Sesamum indicum (sesame seeds). Similarly, cysteine-type endopeptidase inhibitor activity (GO:0004869) was primarily assigned to peptides identified from Solanum tuberosum (potato) and other Solanum sp. relatives, including tomatoes, tomatillos, eggplant, and some forms of peppers. The proteins newly added to ComPIL 2.0 have clearly begun to improve our overall appreciation for the types of proteins and peptides we can detect in a vastly complex human microbiome sample.
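
The unique and shared GO-term counts discussed above amount to set differences between two annotation lists. The following is a minimal Python sketch of that comparison, assuming each search result has already been reduced to a set of GO identifiers. The function names (`compare_go_terms`, `implied_shared`) and the toy identifier sets are hypothetical and are not part of the ComPIL or Unipept tooling. The sketch also checks that the totals and unique counts reported in the text imply the same number of shared terms for both searches.

```python
def compare_go_terms(original, updated):
    """Return GO terms unique to each search and those shared by both."""
    original, updated = set(original), set(updated)
    return {
        "unique_original": original - updated,
        "unique_updated": updated - original,
        "shared": original & updated,
    }


def implied_shared(total, unique):
    """Number of shared terms implied by a reported total and unique count."""
    return total - unique


# (total, unique) counts reported in the text for the original ComPIL
# search and the ComPIL 2.0 search, per GO category.
REPORTED = {
    "MF": ((704, 72), (838, 206)),
    "CC": ((268, 15), (312, 59)),
    "BP": ((835, 83), (1017, 265)),
}

# If the counts are self-consistent, both searches imply the same shared count.
SHARED = {
    category: (implied_shared(*orig), implied_shared(*new))
    for category, (orig, new) in REPORTED.items()
}

if __name__ == "__main__":
    # Toy example: hypothetical annotation sets, not data from the paper.
    toy = compare_go_terms(
        {"GO:0004623", "GO:0032052", "GO:0005515"},
        {"GO:0045735", "GO:0004869", "GO:0005515"},
    )
    print(sorted(toy["shared"]))
    for category, pair in SHARED.items():
        print(category, pair)
```

With the counts reported here, the two searches imply matching shared totals in every category (632 MF, 253 CC, and 752 BP terms).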

CONCLUSIONS

The additional information appended in the rebuild of ComPIL 2.0 allows for proteomic searches against an expansive collection of protein sequences. These large searches, employable against any highly complex proteomic sample, can be rapidly achieved with the new ProLuCID-ComPIL search engine. We anticipate that samples with unknown composition can be readily assigned and that unexpected proteins, anomalies, or contaminants (such as adenovirus constituents in HEK293 lysates) may be identified. While our open-source protein database, proteomics search engine, and proteomic microbiome data files are available for use and analysis, we posit that additional orthogonal validation methods should be employed to support all ComPIL 2.0 results. Such validation techniques may include real-time PCR and biophysical and biochemical characterization of proteins of interest. Our method is applicable to any biological system where significant protein sequence variation may be present, such as in the areas of personalized medicine and proteomic profiling across individuals, and is easily scalable to accommodate new genomic sequencing data.


ASSOCIATED CONTENT

“Park et al 2018 ComPIL2 analyses.xlsx” contains the analyses of the HEK293 and metaproteomics datasets with the ComPIL 2.0, original ComPIL, and UniProt human databases. Supplementary tables and figures can be found in “Park et al 2018 SI.pdf”:

Table S1. Protein databases downloaded for ComPIL 2.0.
Figure S1. DeltaCN vs. XCorr scatterplots for human-only and ComPIL 2.0 databases against HEK293 data.
Figure S2. Human HEK293 PRIDE data search with UniProt human database vs. ComPIL 2.0.

AUTHOR INFORMATION

Corresponding Authors
*Email: DWW ([email protected]) and JRY ([email protected])

Author Contributions
D.W.W. and S.K.P. conceived of the project. P.S.T.-B. and A.Y.W. performed wet-lab experiments and collected LC-MS/MS data. T.J., P.S.T.-B., and S.K.P. analyzed tandem LC-MS/MS data. T.J., P.S.T.-B., and A.Y.W. contributed to peptide mapping and functional analysis. The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding Sources
We gratefully acknowledge financial support from The Scripps Research Institute and Boehringer Ingelheim (to D.W.W.); US Environmental Protection Agency STAR Pre-doctoral Fellowship FP91729601-0 (to P.S.T.-B.); and National Institutes of Health grants 1R56 AG057459, 5R33 CA212973, 5R01 HL131697, 5P41 GM103533, 5R01 MH067880, and 5R01 AI113867 (to J.R.Y.).


Notes
The authors declare that they have no competing interests.

ACKNOWLEDGEMENTS

We thank C. Scharager-Tapia and G. Tsaprailis for technical assistance with sample preparation and mass spectrometry instrumentation; S. Chatterjee, G. Stupp, and A. Su for assistance and knowledge relating to ComPIL; J.-C. Ducom for assistance with and management of the TSRI high-performance computing; and B. Dill at Merck & Co. for critical discussions related to ComPIL and data evaluations.

REFERENCES

(1) Xu, T.; Park, S. K.; Venable, J. D.; Wohlschlegel, J. A.; Diedrich, J. K.; Cociorva, D.; Lu, B.; Liao, L.; Hewel, J.; Han, X.; et al. ProLuCID: An improved SEQUEST-like algorithm with enhanced sensitivity and specificity. J. Proteomics 2015, 129, 16–24.

(2) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.

(3) Chalkley, R. J.; Baker, P. R.; Huang, L.; Hansen, K. C.; Allen, N. P.; Rexach, M.; Burlingame, A. L. Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-of-flight mass spectrometer: II. New developments in Protein Prospector allow for reliable and comprehensive automatic analysis of large datasets. Mol. Cell. Proteomics 2005, 4 (8), 1194–1204.

(4) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–214.

(5) Tabb, D. L.; McDonald, W. H.; Yates, J. R. DTASelect and Contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 2002, 1 (1), 21–26.


(6) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spec. 1994, 5 (11), 976–989.

(7) Knudsen, G. M.; Chalkley, R. J. The effect of using an inappropriate protein database for proteomic data analysis. PLoS ONE 2011, 6 (6), e20873.

(8) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotech. 2015, 33 (7), 743–749.

(9) Chatterjee, S.; Stupp, G. S.; Park, S. K. R.; Ducom, J.-C.; Yates, J. R.; Su, A. I.; Wolan, D. W. A comprehensive and scalable database search system for metaproteomics. BMC Genomics 2016, 17 (1), 642.

(10) Mayers, M. D.; Moon, C.; Stupp, G. S.; Su, A. I.; Wolan, D. W. Quantitative metaproteomics and activity-based probe enrichment reveals significant alterations in protein expression from a mouse model of inflammatory bowel disease. J. Proteome Res. 2017, 16 (2), 1014–1026.

(11) Moon, C.; Stupp, G. S.; Su, A. I.; Wolan, D. W. Metaproteomics of colonic microbiota unveils discrete protein functions among colitic mice and control groups. Proteomics 2018, 18 (3-4).

(12) Wang, A. Y.; Thuy-Boun, P. S.; Stupp, G. S.; Su, A. I.; Wolan, D. W. Triflic acid treatment enables LC-MS/MS analysis of insoluble bacterial biomass. J. Proteome Res. 2018, 17 (9), 2978–2986.

(13) O'Leary, N. A.; Wright, M. W.; Brister, J. R.; Ciufo, S.; Haddad, D.; McVeigh, R.; Rajput, B.; Robbertse, B.; Smith-White, B.; Ako-Adjei, D.; et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016, 44 (D1), D733–D745.

(14) Mesuere, B.; Van der Jeugt, F.; Willems, T.; Naessens, T.; Devreese, B.; Martens, L.; Dawyndt, P. High-throughput metaproteomics data analysis with Unipept: A tutorial. J. Proteomics 2018, 171, 11–22.


(15) Gurdeep Singh, R.; Tanca, A.; Palomba, A.; Van der Jeugt, F.; Verschaffelt, P.; Uzzau, S.; Martens, L.; Dawyndt, P.; Mesuere, B. Unipept 4.0: functional analysis of metaproteome data. J. Proteome Res. 2018, current issue.

(16) Wolters, D. A.; Washburn, M. P.; Yates, J. R. An automated multidimensional protein identification technology for shotgun proteomics. Anal. Chem. 2001, 73 (23), 5683–5690.

(17) Vizcaíno, J. A.; Csordas, A.; Del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; et al. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016, 44 (22), 11033–11033.

(18) Reiter, L.; Claassen, M.; Schrimpf, S. P.; Jovanovic, M.; Schmidt, A.; Buhmann, J. M.; Hengartner, M. O.; Aebersold, R. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol. Cell. Proteomics 2009, 8 (11), 2405–2417.

(19) Xiong, W.; Giannone, R. J.; Morowitz, M. J.; Banfield, J. F.; Hettich, R. L. Development of an enhanced metaproteomic approach for deepening the microbiome characterization of the human infant gut. J. Proteome Res. 2015, 14 (1), 133–141.

(20) Nesvizhskii, A. I. Proteogenomics: concepts, applications and computational strategies. Nat. Methods 2014, 11 (11), 1114–1125.


Figure 1. Design, components, and generation of ComPIL 2.0 databases. ComPIL utilizes 3 databases that are generated from an input protein FASTA file: 1) ProtDB contains all the forward and decoy reverse proteins and information; 2) SeqDB consists of all unique in silico trypsin-digested peptide forward and reverse sequences along with their parent proteins (mapped to ProtDB); and 3) MassDB contains all unique peptide sequences organized into distinct masses (not shown). Peptides with identical sequences or masses were grouped into JSON objects, which were imported into MongoDB as SeqDB or MassDB, respectively. ComPIL 2.0 is much larger than our original ComPIL database in both size (A) and number of entries (B).
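
The three-database layout described in this caption can be illustrated with a small Python sketch. This is not the ComPIL build code: it uses plain dictionaries in place of MongoDB collections, and it assumes a simple trypsin rule (cleave after K or R unless followed by P, no missed cleavages), unmodified peptides containing only the 20 standard residues, a hypothetical `Reverse_` prefix for decoy entries, a hypothetical minimum peptide length of 6, and masses rounded to three decimals as grouping keys. All function names are illustrative.

```python
import re
from collections import defaultdict

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(protein, min_len=6):
    """Cleave after K or R unless followed by P; no missed cleavages.

    Splitting on a zero-width pattern requires Python 3.7 or later.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]


def peptide_mass(peptide):
    """Monoisotopic neutral mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def build_databases(proteins, min_len=6):
    """Build dictionary stand-ins for ProtDB, SeqDB, and MassDB."""
    prot_db = {}
    for name, seq in proteins.items():
        prot_db[name] = seq                      # forward protein
        prot_db["Reverse_" + name] = seq[::-1]   # decoy: reversed sequence
    seq_db = defaultdict(set)
    for name, seq in prot_db.items():
        for pep in tryptic_peptides(seq, min_len):
            seq_db[pep].add(name)                # peptide -> parent proteins
    mass_db = defaultdict(set)
    for pep in seq_db:
        mass_db[round(peptide_mass(pep), 3)].add(pep)  # mass -> peptides
    return prot_db, dict(seq_db), dict(mass_db)


if __name__ == "__main__":
    # Toy input sequences, not entries from any real database.
    toy = {"P1": "MKTAYIAKQRPEPTIDEK", "P2": "AKTAYIAKGGSLLR"}
    prot_db, seq_db, mass_db = build_databases(toy)
    print(sorted(seq_db["TAYIAK"]))
    for mass, peptides in sorted(mass_db.items()):
        print(mass, sorted(peptides))
```

In the toy input, the peptide TAYIAK maps to both parent proteins in the SeqDB stand-in, and its reversed decoy AIYATK falls under the same mass key in the MassDB stand-in because the two peptides have the same residue composition.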


Figure 2. XCorr and DeltaCN distribution for PSMs matched to a human microbiome metaproteomics dataset. Target (black) and decoy (red) PSMs are plotted based on XCorr and DeltaCN. DTASelect 2.1.3 uses a quadratic discriminant analysis to separate decoy and target PSMs to maintain a user-defined false discovery rate (FDR).
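
The target-decoy idea behind this figure can be sketched in a few lines of Python. The sketch does not reproduce DTASelect's quadratic discriminant analysis over XCorr and DeltaCN. It swaps in a much simpler one-dimensional illustration that estimates the FDR at a single XCorr cutoff as the ratio of decoy to target PSMs passing that cutoff, which is one common estimator and is assumed here. The function names and the toy scores are hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as decoys / targets among PSMs scoring >= threshold.

    psms: iterable of (xcorr, is_decoy) pairs.
    """
    targets = decoys = 0
    for score, is_decoy in psms:
        if score >= threshold:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        # No targets pass: the ratio is undefined, so report 0 or infinity.
        return float("inf") if decoys else 0.0
    return decoys / targets


def lowest_threshold_for_fdr(psms, max_fdr=0.01):
    """Return the most permissive cutoff whose estimated FDR <= max_fdr."""
    psms = list(psms)
    best = None
    for cutoff in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, cutoff) <= max_fdr:
            best = cutoff
    return best


if __name__ == "__main__":
    # Toy PSMs as (XCorr, is_decoy); these values are illustrative only.
    toy = [(4.2, False), (3.9, False), (3.5, False), (3.1, True),
           (2.8, False), (2.0, True), (1.9, False)]
    print(estimate_fdr(toy, 3.0))
    print(lowest_threshold_for_fdr(toy, max_fdr=0.0))
```

A two-dimensional discriminant such as the one used by DTASelect can separate targets from decoys better than a single score cutoff, which is why this sketch should be read only as an illustration of the FDR bookkeeping.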


Figure 3. A human microbiome dataset searched with both the original ComPIL and ComPIL 2.0 databases. DTASelect-filtered PSMs (A) from the ComPIL 2.0 and original databases yield largely overlapping peptides, with additional peptides matched by ComPIL 2.0 (B). (C) Unipept analysis and comparison between the search results for the lowest common genus of matched peptides, as well as E.C. numbers of associated proteins (D) and GO molecular functions (E), cellular components (F), and biological processes (G).


Figure 4. Taxonomic matches to metaproteomic data. Original ComPIL (A) and ComPIL 2.0 (B) PSMs matched to the UniProt database with Unipept 4.0.0.
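
The lowest common ancestor (LCA) assignment underlying these taxonomic matches can be illustrated with a short Python sketch. It assumes each organism containing a given peptide is represented by a root-to-leaf lineage and returns the deepest taxon shared by all of them. Unipept's own implementation may differ in detail, and the function name and the simplified lineages below are hypothetical.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all root-to-leaf lineages.

    Returns None if no lineages are given or they share no common root.
    """
    lineages = list(lineages)
    if not lineages:
        return None
    lca = None
    # Walk down the ranks in parallel until the lineages diverge.
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        lca = ranks[0]
    return lca


if __name__ == "__main__":
    # Simplified, illustrative lineages (not full NCBI taxonomy paths).
    potato = ["Eukaryota", "Viridiplantae", "Solanaceae", "Solanum",
              "Solanum tuberosum"]
    tomato = ["Eukaryota", "Viridiplantae", "Solanaceae", "Solanum",
              "Solanum lycopersicum"]
    chicken = ["Eukaryota", "Metazoa", "Phasianidae", "Gallus",
               "Gallus gallus"]
    print(lowest_common_ancestor([potato, tomato]))
    print(lowest_common_ancestor([potato, tomato, chicken]))
```

A peptide found only in one species resolves to that species, whereas a peptide shared across distant organisms resolves to a higher rank, which is why conserved peptides contribute little to genus-level counts.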
