Article pubs.acs.org/jpr

g2pDB: A Database Mapping Protein Post-Translational Modifications to Genomic Coordinates

Sarah Keegan,† John P. Cortens,‡ Ronald C. Beavis,*,‡ and David Fenyö†

† Center for Health Informatics and Bioinformatics, New York University Medical School, 227 East 30 Street, New York, New York 10016, United States
‡ Department of Biochemistry and Medical Genetics, University of Manitoba, Faculty of Health Sciences, 744 Bannatyne Avenue, Winnipeg, MB R3E 0W3, Canada

ABSTRACT: Large-scale proteomics has made it possible to broadly screen samples for the presence of many types of post-translational modifications, such as phosphorylation, acetylation, and ubiquitination. This type of data has allowed the localization of these modifications to either a specific site on a proteolytically generated peptide or to within a small domain on the peptide. The resulting modification acceptor sites can then be mapped onto the appropriate protein sequences and the information archived. This paper describes the use of a very large archive of experimental observations of human post-translational modifications to create a map of the most reproducible modification observations onto the complete set of human protein sequences. This set of modification acceptor sites was then directly translated into the genomic coordinates for the codons for the residues at those sites. We constructed the database g2pDB using this protein-to-codon site mapping information. The information in g2pDB has been made available through a RESTful-style API, allowing researchers to determine which specific protein modifications would be perturbed by a set of observed nucleotide variants determined by high-throughput DNA or RNA sequencing.

KEYWORDS: post-translational modification, single nucleotide variant, phosphorylation, acetylation, ubiquitination, REST API, genome coordinate, protein coordinate



INTRODUCTION

The last 15 years of proteomics research have generated a very large amount of information regarding the specific sites of protein residue post-translational modifications (PTMs). One of the potential avenues of research made available because of this information has been to understand how these PTMs and associated sequence domains affect genome population selection rules as well as specific biochemical processes associated with those modifications.1,2 The evidence provided by these studies has been interesting, but the studies themselves have been made more difficult because of how proteomics information resources deal with modification annotation.

Proteomics data analysis depends on the availability of high quality protein sequences generated from DNA or RNA nucleic acid sequences. Biomedical proteomics, particularly involving human or mouse proteomics, now uses protein sequences predicted from the appropriate genome and the associated gene and gene variant reference information.3 These protein sequences are made available by a number of popular resources in FASTA files, e.g., UniProt,4 RefSeq,5 and ENSEMBL.6 The FASTA file from a particular resource (frequently referred to as a “database” by the proteomics community) will use its own set of accession numbers for the individual protein sequences. The sequences may also differ somewhat from the genomic sequence because of historical sequences present in long-standing collections that may represent splice or sequence variants as the reference protein sequence rather than the current reference genome sequence.

The large-scale determination of protein PTM sites has used the protein sequence collections available over the course of the last 10 years as the basis for reporting observed PTM annotations in publications. A combination of protein accession number versioning, genome sequence updates, changes caused by improvements in gene modeling algorithms, and the discontinuation of widely used sequence repositories (e.g., the International Protein Index7) has made it increasingly difficult to accurately curate PTM sites using the annotations available in the literature. Protein-centric sequence repositories often do not include genome-associated information with their data releases, further increasing the difficulty of associating the information linked to a particular protein accession number with the current version of the human genome reference sequence. These issues with the details of protein sequence provenance create difficulties in associating PTM acceptor sites (expressed in protein sequence coordinates) with the corresponding sequence in the genome (in genome coordinates).

Received: November 2, 2015
Published: February 4, 2016
© 2016 American Chemical Society

DOI: 10.1021/acs.jproteome.5b01018 J. Proteome Res. 2016, 15, 983−990

GPMDB8 is a repository of analyzed peptide MS/MS spectra constructed from data made publicly available by proteomics research groups since 2004. Most of the human experimental data has been analyzed using ENSEMBL protein sequences and accession numbers (currently 1.5 billion assigned spectrum-to-sequence matches). The original data comes mainly from published studies, many of which involve the investigation of post-translational modification acceptor site assignments, e.g., phosphorylation, ubiquitination, SUMOylation, acetylation, and N-linked glycosylation sites. Given this repository, it should be possible to create a curated list of PTM acceptor sites for any given ENSEMBL accession number. In practice, it would be very difficult for most users to construct such a list for more than a few proteins at a time because the necessary database queries can take long enough to cause time-outs using the existing web and RESTful-style interface.9 The logical steps and script implementations necessary to generate curated results would also be difficult for a user to design and implement without significant bioinformatics training and knowledge. In this paper, we report on our efforts to solve this practical issue by creating a new database (g2pDB) that provides access to mapping information reliably associating a large amount of experimentally validated PTM acceptor site information with the appropriate protein and genome sequence coordinates, allowing users to retrieve both the acceptor residue and the codon that encodes it. The primary purpose of this mapping was to highlight missense single nucleotide variants (SNVs) that lead to single amino acid variants (SAVs) at specific PTM acceptor sites, which have the effect of abolishing the potential for a PTM at that site.
This relationship between SNVs, SAVs, and PTMs should extend our understanding of the biological and health-related consequences of SNVs found in next-generation sequencing studies of populations and clinical investigations.
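The SNV-to-SAV relationship described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and a reference codon already written in coding-strand orientation (a codon on the minus strand would first need to be reverse-complemented); the function names are hypothetical and are not part of g2pDB.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snv(codon, offset, alt_base):
    """Return (reference residue, variant residue) for one base change.

    codon    -- three-letter reference codon in coding-strand orientation
    offset   -- 0-based position of the changed base within the codon
    alt_base -- the variant base
    """
    variant = codon[:offset] + alt_base + codon[offset + 1:]
    return CODON_TABLE[codon], CODON_TABLE[variant]


def changes_acceptor_residue(codon, offset, alt_base):
    """True if the SNV replaces the acceptor residue (missense or stop)."""
    ref_aa, var_aa = apply_snv(codon, offset, alt_base)
    return ref_aa != var_aa


# A lysine codon (AAG) with its middle base changed to G becomes AGG.
print(apply_snv("AAG", 1, "G"))                 # ('K', 'R')
print(changes_acceptor_residue("AAG", 2, "A"))  # False: AAA still encodes K
```

A lysine-to-arginine change of this kind removes the acceptor for acetylation or ubiquitination, whereas the synonymous change leaves the site intact.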



MATERIALS AND METHODS

PTM Proteome Annotation for g2pDB

GPMDB stores the results of the analysis of publicly available proteomics data obtained either directly from users of the public GPM analysis system or from analyses performed off-line using data downloaded from repositories.8−10 Data sources and inclusion policies were made available on the project Wiki site (http://wiki.thegpm.org/wiki/GPMDB_Data_Sources). The parameters used by the proteomics analysis search engine (in most cases, a version of X! Tandem11) were set based on the metadata available regarding the experimental data (either from the associated repository or publication) and then checked for consistency in preliminary studies. When the metadata indicated that particular post-translational modifications were to be expected, that information was also used in the search parameter programming. All of the identifications for human samples were made using protein sequence FASTA files obtained from ENSEMBL.6 The protein sequences provided by ENSEMBL were chosen because the ENSEMBL system made it straightforward to determine the position of any codon in chromosome or RNA coordinates given an amino acid residue position in protein coordinates. Although it may be possible to determine this mapping for other sets of protein sequences, the underlying structure of the ENSEMBL database system requires this mapping to be correct. ENSEMBL also lists all predicted splice variants for a particular protein-coding gene, making it simple to test for these variants without additional software or algorithms specifically designed for this task.

As GPMDB became populated with information regarding the modification state of individual proteins (identified by ENSEMBL protein accession numbers), the information regarding the modifications was automatically curated and stored in XML files. These files were then used to inform the search engine in subsequent searches about which biologically relevant modifications were to be expected on a protein-by-protein basis (e.g., phosphorylation, sulfonylation, and proline hydroxylation). Additional information regarding single amino acid variants (based on known missense single nucleotide variants from dbSNP) and a small number of rare post-translational modifications (e.g., hypusine in EIF5A) were also used to inform the searches using ENSEMBL protein accession numbers and coordinates to link them to individual protein sequences. Using this method of informing searches with archival information autocurated from GPMDB and dbSNP as well as metadata from publications and repositories, GPMDB now contains a significant number of post-translational modification site annotations for several species, particularly Homo sapiens and Mus musculus, obtained from a wide range of experimental samples and PTM enrichment protocols.

For the purposes of this study, the three PTMs most commonly investigated using tandem mass spectrometry methods were selected: phosphorylation (at serine, threonine, and tyrosine residues),12,13 ubiquitination (at lysine residues),14 and acetylation (at lysine residues).15 GPMDB was queried using the ENSEMBL v. 70 Homo sapiens protein sequences (Genome Reference Consortium Human genome build GRCh37) and accession numbers for information available regarding these three PTM types. This information was autocurated into individual files based on the chromosome associated with the protein sequence and the PTM type. The “.ptm” files generated by this process were made available on the GPM project FTP server (ftp://ftp.thegpm.org/projects/g2pDB).

The curation process required that, to be included, a PTM acceptor site must have been observed at least five times with peptide-to-spectrum matches (PSMs) with an E-value < 0.01. The PSMs associated with particular acceptor sites were tested to ensure that they were unique, i.e., not the result of multiple entries of the same spectrum. The E-value distribution was also examined, and the site was rejected if the E-values were grouped together near the high end of the distribution. The process also checked the protein coordinates of the determined PTM site to be sure that there was no change in the acceptor residue caused by changes in protein sequence resulting from protein sequence versioning. In any case where there was no experimental evidence capable of distinguishing between the assignment of adjacent acceptor sites, equivalent sites were recorded. In any case where a peptide could be assigned to more than one protein sequence/splice variant or to more than one location in a particular protein sequence, all possible assignments were recorded.

An example of a PTM annotation file entry for protein acetylation acceptor sites is shown in Table 1. Each protein accession number is represented by a text block that contains a standard set of information about the protein (lines 1−10), including the protein sequence in a single line (sequence). Line 11 indicates the total number of experimental spectra assigned


In the Reference_ptm attribute, if more than one ENSEMBL protein accession shares the codon (because of alternate splice variant protein sequences), multiple semicolon-separated entries are listed. In the case of residues that map to discontinuous codons, two rows are used to annotate the complete codon, with the Reference_ptm entries repeated in each row. The Reference_ptm values comply with the GPM PTM naming rules as specified in the GPM Wiki §2.1 (http://wiki.thegpm.org/wiki/PTM_naming). Taking the example in Table 2, the PTM specification “ENSP00000311766:pm.K146+Acetyl” means that, in the protein sequence corresponding to the ENSEMBL accession “ENSP00000311766”, lysine residue 146 is an observed acceptor site for acetylation. It should be noted that a ubiquitin modification is indicated in the GFF3 files by the suffix “+GlyGly”. This notation is used because the modification detected experimentally was the remnant sequence “GlyGly”, which remains attached to the lysine side chain’s ε-amino group following the trypsin cleavage used during sample preparation.13
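As a minimal sketch, a Reference_ptm value could be split into its parts as follows. The regular expression is an assumption inferred only from the single example given here; it accepts either “:” or “.” on each side of “pm” because the example in the text and the value in Table 2 differ in the order of those two separators. The second accession in the usage line is hypothetical, and the authoritative syntax is the GPM PTM naming page cited above.

```python
import re

# Pattern inferred from the one example in the text (an assumption, not the
# official grammar): accession, "pm", residue letter, position, "+", name.
PTM_SPEC = re.compile(
    r"^(?P<acc>ENSP\d+)[:.]pm[:.](?P<res>[A-Z])(?P<pos>\d+)\+(?P<mod>\w+)$"
)


def parse_reference_ptm(value):
    """Split a semicolon-separated Reference_ptm value into site tuples.

    Each tuple is (accession, residue, protein position, modification).
    """
    sites = []
    for spec in value.split(";"):
        spec = spec.strip()
        if not spec:
            continue
        match = PTM_SPEC.match(spec)
        if match is None:
            raise ValueError(f"unrecognized PTM specification: {spec!r}")
        sites.append(
            (match["acc"], match["res"], int(match["pos"]), match["mod"])
        )
    return sites


print(parse_reference_ptm("ENSP00000311766:pm.K146+Acetyl"))
# Two splice variants sharing one codon; the second accession is made up.
print(parse_reference_ptm(
    "ENSP00000311766:pm.K146+Acetyl;ENSP00000000001:pm.K98+Acetyl"
))
```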

Table 1. Example of a Single Text Block Entry in a PTM Protein Acetylation Acceptor Site Annotation File

line    value
1−10    protein:ENSP00000357189 pep:known gene:ENSG00000143321 transcript:ENST00000368206 gene_biotype:protein_coding transcript_biotype:protein_coding chromosome:1 start:156713137 end:156722240 strand:-1 sequence:MHPEGGQFVPQLLGHLLATKLKRFLLSKGGRRAQ··· desc:HDGF, hepatoma-derived growth factor Source:HGNC Symbol Acc:4856
11      total_spectra_assigned:29696
12      modified_peptide_obs:19
13      pos:213 res:K obs:7 gc:156713569,156713570,156713571
14      pos:242 res:K obs:12 gc:156713482,156713483,156713484

to the protein (total_spectra_assigned) and line 12 is the number of experimental spectra assigned with the PTM (modified_peptide_obs). All subsequent lines in the text block indicate the position of the acceptor residue in protein coordinates (pos), the residue (res), and the number of spectra assigned to that site (obs) with one site per line. The text block for a protein ends with a blank line. The first text block in each file contains metadata regarding the parameters used to create the file.
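The text block layout described above lends itself to a very small parser. The following Python sketch reads only the two spectrum counters and the per-site lines, and it assumes (as stated in the text) one site per line and a blank line closing the block; the function name and the returned dictionary layout are hypothetical, and the optional gc field is taken from the example in Table 1.

```python
def parse_ptm_block(lines):
    """Collect spectrum counters and acceptor sites from one text block."""
    record = {"sites": []}
    for raw in lines:
        line = raw.strip()
        if not line:
            break  # a blank line ends the block for this protein
        if line.startswith("pos:"):
            # Site lines are whitespace-separated key:value tokens.
            fields = dict(token.split(":", 1) for token in line.split())
            site = {
                "pos": int(fields["pos"]),
                "res": fields["res"],
                "obs": int(fields["obs"]),
            }
            if "gc" in fields:  # genome coordinates of the codon bases
                site["gc"] = [int(x) for x in fields["gc"].split(",")]
            record["sites"].append(site)
        elif line.startswith(("total_spectra_assigned:",
                              "modified_peptide_obs:")):
            key, value = line.split(":", 1)
            record[key] = int(value)
        # Other header lines (protein, sequence, desc, ...) are skipped here.
    return record


example = [
    "total_spectra_assigned:29696",
    "modified_peptide_obs:19",
    "pos:213 res:K obs:7 gc:156713569,156713570,156713571",
    "pos:242 res:K obs:12 gc:156713482,156713483,156713484",
    "",
]
record = parse_ptm_block(example)
print(record["sites"][0]["pos"], record["sites"][0]["gc"])
```

In the Table 1 example the per-site observation counts (7 and 12) add up to the modified_peptide_obs value of 19, which gives a simple consistency check on a parsed block.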

g2pDB Database and API

The information recorded in the GFF3 files was parsed into a MongoDB database,17 g2pDB, using a Python script. The scripts, PTM, and GFF3 files used to create g2pDB are available at the project FTP site mentioned above. MongoDB, a “NoSQL” database, was chosen due to ease of use and its suitability for this type of system. The results of a query in MongoDB are returned in BSON format, a binary-encoded form of JSON. This allowed for minimal processing of the data before it was sent to the requestor via the REST API, resulting in a very lightweight back-end. In addition, MongoDB is a “document-oriented” database: instead of storing the data in tables as in a relational model, the data is stored in a set of documents, the format of which can be defined based upon the expected use-cases of the system. Because both protein-centric and DNA-centric queries are expected, the database contains two document collections, one customized for queries based on protein ENSEMBL IDs and one customized for queries based on chromosome position. This method allows for similar retrieval times for both types of queries. It also creates redundancy of data, but because database updates are infrequent (until an entirely new build is created), this type of model is well-suited for our application.

To allow other researchers direct access to the information in g2pDB, a simple REpresentational State Transfer Application Programming Interface (RESTful-style API) was created, using a Python script to process the requests, query the database, and return a text result formatted in JavaScript Object Notation (JSON). The current interface contains the methods described in Table 3, designed to be accessed by HTTP GET. The base URL for these methods is http://openslice.fenyolab.org/g2pdb/grch37/ensembl_70. As new genome assemblies become available, the base URL will change to gain access to these assemblies, but the previous assemblies will remain permanently at their existing URL. A GET to the base URL retrieves a short description of the RESTful-style API. Up-to-date documentation for the methods and links to the code and build data will be maintained at http://wiki.thegpm.org/wiki/G2PDB_REST. Examples of the JSON objects returned for selected queries are illustrated in Figure 1. We anticipate that g2pDB will be updated twice per year on an ongoing basis.
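A client for this interface can be sketched with the Python standard library alone. The URL layout below follows the method patterns in Table 3 literally, which is an assumption about how the path segments are joined; the helper names, the use of the variant base as the SNP value, and the modification token are likewise illustrative guesses, so the documentation page cited above remains the reference. The fetch helper assumes the service is reachable and is not called here.

```python
import json
from urllib.request import urlopen

# Base URL as given in the text for GRCh37 / ENSEMBL v. 70.
BASE_URL = "http://openslice.fenyolab.org/g2pdb/grch37/ensembl_70"


def dna_url(chrom, pos, snp=None, mod=None):
    """Build a /dna/CHR/POS/ request URL, optionally filtered by SNP, MOD."""
    url = f"{BASE_URL}/dna/{chrom}/{pos}/"
    params = []
    if snp:
        params.append(f"snp={snp}")
    if mod:
        params.append(f"mod={mod}")
    return url + "&".join(params)


def protein_url(acc, mod=None):
    """Build a /protein/ACC request URL, optionally filtered by MOD."""
    url = f"{BASE_URL}/protein/{acc}"
    return f"{url}/mod={mod}" if mod else url


def fetch_json(url, timeout=30):
    """HTTP GET a URL and decode the JSON text it returns."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# URLs only; no network request is made in this example.
print(dna_url("15", 43784585, snp="C"))
print(protein_url("ENSP00000357189", mod="Acetyl"))
```

Keeping URL construction separate from the network call makes the request patterns easy to check before any queries are sent.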

PTM Genome Annotation for g2pDB

The PTM annotation information files for each chromosome generated in the PTM proteome annotation step were updated with the known mapping of ENSEMBL protein coordinates to GRCh37 chromosome coordinates using the appropriate ENSEMBL RESTful-style method (GET map/translation/:id/:region).16 This information was then used to construct genome annotation GFF3 files for each chromosome. The GFF3 format was designed to follow the general recommendations for GFF3 files (http://www.sequenceontology.org/gff3.shtml). Each modification type was parsed into separate files; thus, three files were created for each chromosome. An example of the information in the nine standard GFF columns is shown in Table 2. The first eight columns of each row contain the chromosome coordinates of the codon corresponding to the modified protein residue and the DNA strand of the codon. The ninth column contains a unique ID for the row (ID), the DNA reference codon sequence (Reference_seq), the reference amino acid residue (Reference_aa), and the PTM annotation for that residue (Reference_ptm).

Table 2. Example of a Single GFF3 File Row

GFF3 column    value
1              chr1
2              gpmdb
3              SO:0001089
4              1414480
5              1414482
6              .
7              +
8              .
9              ID=1;Reference_seq=AAG;Reference_aa=K;Reference_ptm=ENSP00000311766.pm:K146+Acetyl
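A row of this kind can be read with a few lines of Python. The sketch below assumes tab-separated columns, as in standard GFF3, and tolerates spaces around “=” in the ninth column. Because multiple Reference_ptm entries are reported as semicolon-separated, which is also the GFF3 attribute separator, any piece without an “=” is treated as a continuation of the previous attribute; that handling and the function name are assumptions, not taken from the g2pDB build scripts.

```python
def parse_gff3_row(row):
    """Parse one nine-column GFF3 row into a dictionary."""
    cols = row.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected nine tab-separated GFF3 columns")
    seqid, source, feature, start, end, score, strand, phase, attrs = cols

    attributes = {}
    last_key = None
    for piece in attrs.split(";"):
        piece = piece.strip()
        if "=" in piece:
            key, value = piece.split("=", 1)
            last_key = key.strip()
            attributes[last_key] = value.strip()
        elif piece and last_key is not None:
            # No "=": assume another value of the previous attribute.
            attributes[last_key] += ";" + piece

    return {
        "seqid": seqid,
        "source": source,
        "type": feature,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


row = ("chr1\tgpmdb\tSO:0001089\t1414480\t1414482\t.\t+\t.\t"
       "ID=1;Reference_seq=AAG;Reference_aa=K;"
       "Reference_ptm=ENSP00000311766.pm:K146+Acetyl")
parsed = parse_gff3_row(row)
# GFF3 coordinates are 1-based and inclusive, so a codon spans three bases.
print(parsed["end"] - parsed["start"] + 1, parsed["attributes"]["Reference_aa"])
```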




Table 3. RESTful-Style API for g2pDB

method                          description
/dna/CHR/POS/                   all PTM information involving base POS on chromosome CHR
/dna/CHR/POS/mod=MOD            information about PTM MOD involving base POS on chromosome CHR
/dna/CHR/POS/snp=SNP            all PTM changes associated with base variant SNP at base POS on chromosome CHR
/dna/CHR/POS/snp=SNP&mod=MOD    changes in the PTM MOD with base variant SNP at base POS on chromosome CHR
/protein/ACC                    listing of all PTM-to-chromosome mappings for ENSEMBL accession ACC
/protein/ACC/mod=MOD            listing of PTM-to-chromosome mappings with modification MOD for ENSEMBL accession ACC

Figure 1. Examples of JSON responses to selected queries.

RESULTS

The major sources of proteomics raw mass spectrometry data (and metadata) used to create GPMDB (which were then curated into g2pDB) are summarized in Table 4. In cases where the same raw data is held by different resources, the resource where the data was originally deposited is reported to avoid duplication. Tranche and MASSIVE are included as a single entry, as MASSIVE is the successor repository to a fraction of the recoverable Tranche raw data sets. PRIDE/ProteomeXchange is the largest single raw data source in terms of protein identifications.

Table 4. Major Repository Sources of Raw Data Used for the Creation of g2pDB through Their Reanalyzed Entries in GPMDB

repository               data files    protein identifications
PRIDE/ProteomeXchange    80,106        44,253,618
PeptideAtlas             30,810        6,435,223
MASSIVE/Tranche          59,647        19,835,684
CPTAC data portal        5,560         10,735,527
ProteomicsDB             1,449         1,013,927
All other sources        178,181       12,307,169

The number of genome bases annotated in g2pDB for the different types of PTM acceptor sites and modification types is summarized in Figure 2. For the initial release of the database, it was decided to use a high stringency for curating the acceptor sites. This choice was made to avoid any overinterpretation of the information retrieved from the database, as many researchers involved in genomic and transcriptomic analysis may not be familiar with assessing the quality of proteomics data. Because reproducibility of a site assignment from experiment to experiment was one of the curation parameters, the total number of sites in the database was lower than what might be expected from a simple union of the number of sites found in individual studies where reproducibility was not considered.

Figure 2. Summary of the number of bases mapped in g2pDB by the PTM associated with those bases.

Figure 3. Histograms of the number of bases annotated on each human chromosome for phosphoryl, ubiquitinyl, or acetyl acceptor sites.

The chromosome-specific distribution of the three types of mapped PTM acceptor sites in terms of DNA bases is shown in Figure 3. Serine, threonine, and tyrosine phosphorylations were the most abundant modifications with a distribution roughly correlated with the number of protein-coding genes on each chromosome. The only chromosome with no phosphorylation sites was the mitochondrial chromosome, whereas the Y chromosome had only 8 acceptor sites that passed the curation tests. This particular method of displaying the results removed any bias toward multiple assignments of the same acceptor site codon commonly presented in the proteomics literature associated with the highly variable number of splice variant forms of proteins.

Ubiquitination of lysine ε-amino groups was the next most abundant PTM with approximately 50% as many bases per chromosome involved compared to that of phosphorylation. Ubiquitination is a PTM that is readily detected using tandem mass spectrometry-based proteomics methods even though proteins bearing this PTM are quickly degraded by a cell’s proteasomes, keeping the normal concentration of proteins with this modification very low. To compensate for this effect, most of the ubiquitination studies used to populate GPMDB involved treating cells with a proteasome inhibitor, allowing the concentration of modified proteins to build up, followed by an affinity purification step to further enrich the ubiquitin-modified proteins. All chromosomes had curated acceptor sites with the smallest numbers being three codons on the mitochondrial chromosome and thirty-six codons on the Y chromosome.

Acetylation of lysine ε-amino groups was the least abundant PTM with approximately 12.8% as many bases per chromosome involved compared to that of phosphorylation. Acetylation proved to be the most difficult PTM to curate because of the widespread use of urea as a protein and peptide solubilization reagent in protein acetylation studies. Unless suitable steps are taken, the presence of urea during the protein isolation and peptide digestion steps of an experimental protocol can result in the carbamylation of lysine ε-amino groups: a modification that can be difficult to distinguish from acetylation in large-scale studies unless particular care is taken in the analysis of the resulting tandem mass spectrometry data. Data from lysine acetylation studies in GPMDB were tested for urea artifacts, and if the number of PSMs assigned to carbamylation exceeded 5% of those assigned to acetylation, that data was not used in the construction of g2pDB. The chromosome-specific trends for acetylation shown in Figure 3 were similar to the other two types of PTMs with the exception of chromosome 6, which showed an anomalously high number of codons associated with this modification. The anomaly was caused by the presence of the histone 1 gene family on this chromosome.18 This unusual cluster of genes consists of 51 paralogues, all of which have multiple acetylation acceptor sites, leading to the over-representation of acetylation sites on proteins originating from this chromosome. The mitochondrial chromosome had no codons mapped to


Table 5. Comparison of SIFT, PolyPhen, and g2pDB Information for dbSNP SNVs That Result in SAVs at Reference Residues S, T, and Y in TP53BP1 (Genome Assembly GRCh37)

dbSNP          SNV                  ref    SAV    SIFT    PolyPhen
rs140524878    15:g.43784585G>C     S      C      0       1
rs147151977    15:g.43771640C>T     S      N      0.38    0.005
rs61751060     15:g.43769851A>G     S      P      0.08    0.625
rs144313241    15:g.43767874G>C     S      C      0.12    0.015
rs116252997    15:g.43767763G>A     T      M      0.05    0.176
rs151069796    15:g.43762153G>C     S      C      0.05    0.003
rs201233064    15:g.43749340G>A     S      F      0.04    0.828
rs149162670    15:g.43749236T>C     T      A      0.95    0.003
rs115849551    15:g.43749149T>C     T      A      0.18    0.209
rs150748973    15:g.43749094C>T     S      N      0.84    0.002
rs145227124    15:g.43749051A>C     S      R      0.02    0.361
rs141222111    15:g.43748962G>A     T      M      0.03    0.092
rs200924195    15:g.43748479G>A     S      F      0.02    0.622
rs61758068     15:g.43748162A>G     S      P      0.26    0.007
rs201774341    15:g.43738722G>T     T      N      0.01    0.462
rs200277921    15:g.43738614C>G     S      T      0.01    0.874
rs142024493    15:g.43733749T>C     T      A      0.48    0.003
rs115590044    15:g.43724550A>C     S      A      0.05    0.877
rs144064168    15:g.43720218G>A     T      I      0.07    0.646
rs140119148    15:g.43714320G>A     T      I      0       0.915
rs200691218    15:g.43713277G>A     T      M      0       0.999
rs199716902    15:g.43708587G>A     S      F      0       0.912
rs201180017    15:g.43707954T>C     T      A      0.16    0.137
rs150764275    15:g.43701282A>G     Y      H      0.01    0.999
rs199724544    15:g.43700191G>C     T      S      0.05    0.674
rs137879443    15:g.43699735G>A     T      M      0       0.996

g2pDB column: “-phospho” is listed for ten of the 26 SNVs; the row alignment of these entries is not preserved in this copy.

acetylation sites, and the Y chromosome had only two codons listed.

The NoSQL database platform MongoDB performed well for this particular application. For example, the queries necessary to construct responses to the API requests used to create Figure 1 ran in less than 1 ms on the platform currently being used to host the database and RESTful-style API service. This speed should be sufficient to respond to any anticipated volume of use for the API: additional compute cores can be added to the MongoDB instance if sustained community use exceeds our current capacity to provide responses. The database and API were designed to favor fast responses to individual SNV and protein queries. Any researcher interested in calculating overall statistics or investigating patterns within the g2pDB annotation is encouraged to download the “.ptm” or “.gff” files associated with the particular build they wish to examine and use those files in their calculations. For example, Figure 3 was calculated from the “.ptm” files for each individual chromosome rather than a large number of API calls.

DISCUSSION

There is no shortage of bioinformatics resources for making an assessment of the potential consequences of missense SNVs derived from next-generation sequencing. These resources have been reviewed extensively.19 One of the most popular algorithms is SIFT (Sorting Intolerant from Tolerant),20,21 which uses multiple sequence alignment methods to determine whether a particular SAV affects residues that have been determined to be highly conserved in homologous protein sequences. PolyPhen22 uses a set of physical characteristics (either known or calculated) about the protein sequence environment of a specific residue and scores SAVs based on their potential to perturb that environment in a deleterious manner. PolyPhen-223 extends this analysis to use additional information, including updated three-dimensional protein structural information, in a Bayes classifier. SNAP24 uses a neural-network approach to incorporate a wide variety of structural, annotation, and homology information to predict deleterious SAVs. Condel25 attempts to combine the results from other algorithms into a consensus deleterious score for missense SNVs to simplify the task of comparing the output of multiple algorithms applied to the same sets of experimentally determined variants.

The purpose of the work described here was not to replace these existing valuable methods but to add an additional resource that would easily allow an investigator additional biochemical insight into the consequences of a particular SNV without the necessity of the investigator generating their own mappings to particular sets of protein coordinates for the affected protein splice variants and researching the potential PTM consequences themselves. Additional value to our approach comes from the use of PTM information derived directly from a very large set of experimental data that spans many types of experimental protocols, tissues, and cell types. Conventional literature-based approaches to creating PTM listings have had difficulty dealing with this type of information because of a lack of standardized quality control measures, the use of different algorithms for PTM assignments, and differences in the actual protein sequences used for the original data reduction that require additional sequence similarity mapping to the basis set of sequences used for the list. Our approach removes most of these difficulties by using a uniform set of analytical algorithms




ACKNOWLEDGMENTS We thank the data contributors, curators, and development teams that have made the proteomics data repositories Massive, PeptideAtlas, Peptidome, ProteomeXchange, ProteomicsDB, and Tranche so useful. We also thank all of the research groups that have donated their data directly to GPM.

and protein sequences that were designed to be easily converted to genome coordinates. It also allows an interested researcher to examine the detailed provenance of the assignment of a particular PTM acceptor site using the existing GPMDB web-based infrastructure. This approach is illustrated in Table 5 for a set of SNVs drawn from dbSNP for the human gene TP53BP1 (tumor protein 53 binding protein 1). These SNVs were chosen because they all generate an SAV that affects an S, T, or Y residue. The SIFT and PolyPhen calculations show varying levels of severity for the SAVs generated, and the g2pDB lookup indicates that ten of the SNVs also result in the loss of a phosphorylation acceptor site. The RESTful-style interface also supplies the currently known splice variants containing the associated SAV and the SAV protein coordinates. With this additional knowledge (on top of the conventional scores) an investigator has the capability of directly formulating a testable biochemical hypothesis with respect to the consequences of those ten SNVs without further effort.

The examples given in this work are a very straightforward instance of the direct use of information derived from large-scale proteomics (protein biochemistry using generalized hypotheses) in the interpretation of missense SNV results in terms of the SAV generated. A simple extension of this study would be to consider the effects associated with changes to amino acid residues adjacent to known modification sites. Many of the enzymes responsible for PTM reactions are quite sensitive to the sequence motifs in the neighborhood of the acceptor sites, so SAVs in the domains on either side of the site can inhibit the formation of a PTM as effectively as SAVs that affect the site itself. For example, many serine and threonine phosphorylation acceptor sites occur adjacent to a proline residue (-SP- or -TP-). Any SAV that changes the identity of those proline residues would result in the effective removal of the acceptor site. Similarly, many tyrosine phosphorylation acceptor sites have an adjacent aspartic acid residue (-DY-), so any SAV changing that residue may adversely affect the utility of that tyrosine for signaling purposes.
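The proposed extension to adjacent residues can be sketched in a few lines of Python. This is an illustration only: it checks just the two motif patterns named above (-SP-/-TP- and -DY-), uses 1-based protein coordinates, and is demonstrated on a made-up peptide; the function names are hypothetical and nothing of this kind is part of the current g2pDB API.

```python
def motif_partner(sequence, site_pos):
    """Return the 1-based position of the motif residue next to an acceptor.

    S or T followed by P (-SP-, -TP-) returns the proline position;
    Y preceded by D (-DY-) returns the aspartic acid position;
    otherwise None.
    """
    i = site_pos - 1  # convert to a 0-based index
    residue = sequence[i]
    if residue in "ST" and i + 1 < len(sequence) and sequence[i + 1] == "P":
        return site_pos + 1
    if residue == "Y" and i > 0 and sequence[i - 1] == "D":
        return site_pos - 1
    return None


def sav_disrupts_motif(sequence, site_pos, sav_pos, new_residue):
    """True if an SAV replaces the motif residue adjacent to the acceptor."""
    partner = motif_partner(sequence, site_pos)
    return partner == sav_pos and sequence[sav_pos - 1] != new_residue


peptide = "MKSPLDYA"  # toy sequence: S3 is followed by P4, Y7 is preceded by D6
print(motif_partner(peptide, 3), motif_partner(peptide, 7))
print(sav_disrupts_motif(peptide, 3, 4, "L"))  # P4 -> L breaks the -SP- motif
```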






CONCLUSIONS

This work demonstrated that it was possible to create a very large set of post-translational modification acceptor sites directly curated from raw experimental data and map those sites onto a genome assembly. With that mapping in place, it was then possible to design a simple RESTful-style interface to query the effect of SNVs on those mapped sites using a NoSQL-type platform. We hope that the success of this approach will encourage other protein-centric information projects and repositories to provide this type of genome-to-proteome mapping in a transparent manner, and so increase the use of protein biochemistry information in interpreting the biological effects of genome variation.
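As a rough illustration of the kind of lookup such a mapping enables, the sketch below checks whether an SNV position falls inside the codon of a mapped modification site. It does not reproduce the g2pDB schema or API: the record layout, the `ModifiedCodon` and `sites_hit_by_snv` names, and all coordinates and accessions are hypothetical. It also assumes each codon occupies one contiguous genomic interval, which does not hold for codons split across an exon boundary.

```python
from typing import List, NamedTuple


class ModifiedCodon(NamedTuple):
    """One mapped modification site (hypothetical record layout)."""
    chromosome: str
    start: int         # first base of the codon, 1-based inclusive
    end: int           # last base of the codon, 1-based inclusive
    protein: str       # placeholder protein accession
    residue: int       # residue position within the protein
    modification: str


# Illustrative records only; these values are invented for the example.
SITES = [
    ModifiedCodon("1", 1000, 1002, "PROT_A", 15, "phosphorylation"),
    ModifiedCodon("1", 2050, 2052, "PROT_A", 120, "acetylation"),
    ModifiedCodon("X", 500, 502, "PROT_B", 7, "ubiquitination"),
]


def sites_hit_by_snv(chromosome: str, position: int,
                     sites: List[ModifiedCodon] = SITES) -> List[ModifiedCodon]:
    """Return the mapped sites whose codon interval contains the SNV position."""
    return [s for s in sites
            if s.chromosome == chromosome and s.start <= position <= s.end]


if __name__ == "__main__":
    for hit in sites_hit_by_snv("1", 1001):
        print(hit.protein, hit.residue, hit.modification)
```

A linear scan is enough for a handful of records; a genome-scale set of sites would more plausibly rely on an indexed range query in the underlying data store.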




AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]; phone: 1-204-290-9105.

Notes

The authors declare no competing financial interest.

DOI: 10.1021/acs.jproteome.5b01018 J. Proteome Res. 2016, 15, 983−990


