
Article pubs.acs.org/jpr

XMAn: A Homo sapiens Mutated-Peptide Database for the MS Analysis of Cancerous Cell States

Xu Yang and Iulia M. Lazar*

Department of Biological Sciences, Virginia Polytechnic Institute and State University, Integrated Life Sciences Building, Kraft Drive 1981, Blacksburg, Virginia 24061, United States

Supporting Information

ABSTRACT: To enable the identification of mutated peptide sequences in complex biological samples, two novel cancer- and disease-related protein databases were developed in this work, with mutation information collected from several public resources such as COSMIC, IARC P53, OMIM, and UniProtKB. In-house developed Perl scripts were used to search and process the data and to translate each gene-level mutation into a mutated peptide sequence. The cancer and disease mutation databases comprise a total of 872 125 and 27 148 peptide entries from 25 642 and 2913 proteins, respectively. A description line for each entry provides the parent protein ID and name, the cDNA- and protein-level mutation site and type, the originating database, and the disease or cancer tissue type and corresponding hits. The two databases are FASTA-formatted to enable data retrieval by commonly used tandem MS search engines. While the largest number of mutations were encountered for the amino acids A/D/E/G/L/P/R/S, the global mutation profiles closely replicate the outcome of the 1000 Genomes Project aimed at cataloguing natural mutations in the human population. The affected proteins were primarily involved in transcription regulation, splicing, protein synthesis/folding/binding, redox/energy production, adhesion/motility, and to some extent in DNA damage repair and signaling. The applicability of the database to identifying the presence of mutated peptides was investigated with MCF-7 breast cancer cell extracts.

KEYWORDS: amino acid mutations, cancer, peptide, database



INTRODUCTION

Cancer is a disease involving genomic changes, arising through multiple somatic or germline alterations.1 The analysis of mutations in cancer is important for advancing the understanding of the molecular mechanisms that lead to cancer development and for uncovering novel biomarkers and drug targets that facilitate disease detection and treatment. As shown in Table 1, cancer can arise as a result of various types of transformations, ranging from point mutations (missense, nonsense, silent) to small-scale (insertions, deletions) and large-scale alterations in the chromosomal structure, all affecting the protein sequence in different ways. Although a variety of high-throughput nucleic-acid-based approaches have been developed for the identification of gene-level mutations (e.g., cytogenetics methods, FISH, PCR, DNA sequencing, microarray), the complex processes that are involved in gene transcription/translation often lead to a lack of correlation between mRNA and protein expression levels. Because proteins play a central role in the regulation of the cell cycle, the response of a particular phenotype to environmental stimuli, and the development of a disease, mutation information generated directly from proteomic data will be invaluable for deriving functional inferences and developing new strategies for drug discovery and diagnostics.

© 2014 American Chemical Society

Mass spectrometry (MS) has evolved into a powerful tool for the analysis of complex protein extracts. The dominant workflow, shotgun proteomics, generates large collections of tandem mass spectra (MS/MS) that uniquely identify the amino acid sequence of a peptide. The experimental sequences are then matched to the theoretical sequence of the parent protein(s) from an organism-specific database.2 This commonly used strategy has been successfully implemented for large-scale peptide identifications but is only applicable to sequences that are present in the database. Protein polymorphisms, random amino acid mutations, splice-variants, and post-translational modifications may result in a large number of high-quality, but uninterpreted tandem mass spectra.3,4 To account for mutations, various algorithms that match mutated peptide sequences to mass spectra have been developed. Gatlin and colleagues developed a modified version of SEQUEST that enabled the identification of all possible SNPs in a DNA sequence database to generate singly mutant forms of hemoglobin.5 Roth and colleagues reported a top-down approach that used accurate mass and selected ion targets to detect amino acid variations in HeLa-S3 nuclear proteins by

Received: May 5, 2014
Published: September 11, 2014

dx.doi.org/10.1021/pr5004467 | J. Proteome Res. 2014, 13, 5486−5495

Journal of Proteome Research


Table 1. Description of DNA Mutation Types and Their Impact on the Amino Acid Sequence in a Protein

types of DNA mutations | subtype | effect on protein sequence
point mutation | missense | substitution of a different amino acid
point mutation | nonsense | premature stop codon usually resulting in nonfunctional truncated protein
point mutation | silent | no effect (no change in amino acid sequence)
small-scale | deletion, insertion | frameshift, usually resulting in a completely different translation
chromosome rearrangement | translocation, inversion, interstitial deletions | protein fusion usually with distinct function or nonfunction
other large-scale | amplification, deletion, loss of heterozygosity | no effect

Table 2. Online Mutation Data Resources^a

database | data type | no. extracted mutations | description
COSMIC | missense, nonsense, deletion/insertion, fusion | 1 025 579 | best-known publicly available database with cancer-related somatic mutations in cancer samples (http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/; ftp://ftp.sanger.ac.uk/pub/CGP/)
UniProt | missense | 22 354 | curated protein sequence database for all species; provides lists of single amino acid variants in human proteins, including disease and tumor specific (http://www.uniprot.org/; http://www.uniprot.org/docs/humsavar)
IARC P53 | missense, nonsense, deletion/insertion | 3917 | largest locus-specific database with all P53 somatic and germline mutations in human populations, tumor samples and cell lines (http://www.iarc.fr/; http://p53.iarc.fr/DownloadDataset.aspx)
OMIM | missense, nonsense, deletion/insertion | 14 620 | database containing information about known Mendelian disorders (germline mutations); it focuses on the relationship between phenotype and genotype (http://www.omim.org/; http://omim.org/downloads)
HGMD | somatic/germline disease mutations and polymorphisms | - | commercial database with known gene lesions responsible for human inherited diseases including cancer (http://www.hgmd.org/; http://www.hgmd.cf.ac.uk/ac/index.php)
TCGA | broad range of cancer-related data | - | comprehensive database containing cancer-related clinical information, genomic characterization data, high-level sequence analysis of tumor genomes, large-scale genome sequencing, germline mutations (http://cancergenome.nih.gov/; https://tcga-data.nci.nih.gov/tcga/)
ICGC | broad range of genomic abnormalities | - | comprehensive catalogues of genomic abnormalities (somatic mutations, abnormal expression of genes, epigenetic modifications) in tumors from 50 different cancer types across the globe (https://www.icgc.org/; http://dcc.icgc.org/)

^a The number of mutations extracted from each database that was used in this study is provided in the table. TCGA and ICGC mutations are included in COSMIC. HGMD is a commercial database, and full access requires the purchase of a license.

including known SNPs, splice data, and PTMs in the protein database.6 Bunger and colleagues introduced nonsynonymous coding SNPs from dbSNP in a peptide database that was used for identifying mutated sequences in the DU4475 cell line.7 Edwards demonstrated the identification of novel peptides from ESTs, rather than whole proteome FASTA databases.8 Zhang and colleagues developed a human Cancer Proteome Variation Database (CanProVar) that included 10 254 peptide entries.9 Finally, recently developed tandem MS data search engines enable the identification of mutated sequences by allowing amino acid substitutions in the search or by identifying mass shifts through direct comparison of tandem mass spectra to a spectral library and calculation of the precursor delta mass. These approaches demonstrate that database search algorithms can be adapted to the identification of mutated peptides.

Currently, the NCBI short genetic variations (SNV) database, also known as the dbSNP database, compiles the largest number of SNPs, short indels, tandem repeats, and microsatellites.10 Disease- and cancer-related mutations are scattered throughout locus-specific databases (LSDBs) and several large comprehensive databases.11 A summary of the most relevant resources is provided in Table 2. The IARC (International Agency for Research on Cancer) TP53 database is the largest LSDB with all TP53 gene variations identified in human populations and tumor samples.12 The database compiles information from the peer-reviewed literature and generalist databases with somatic mutations in sporadic cancers, germline mutations in familial cancers, TP53 polymorphisms, mutations of TP53 in human cell lines, experimentally induced mutations, and mouse models with engineered TP53. It also provides detailed information about each mutation such as tumor origin, sample source, mutation rate and potential function, and offers tools for searching and mining the data by multiple criteria.

The COSMIC (Catalogue of Somatic Mutations in Cancer) database is the best-known publicly available resource for cancer-related somatic mutation information.13 The data are extracted from the primary literature and the Cancer Genome Project and curated manually to ensure high quality. COSMIC provides detailed information about each mutation, including frequency, tumor/tissue histology, and patient-related data. In collaboration with IARC, COSMIC has brought the somatic P53 mutations into the database. The data can be easily searched by tissue or gene and extracted in a graph, table, or various other formats.

UniProtKB/SwissProt is a curated protein sequence database for all species, which provides a high level of annotation incorporating a description of the protein function, domain structures, and post-translational modifications, among others.14 It also provides a list of single amino acid variants in human proteins, including mutations related to cancer or other diseases, polymorphisms, and some unclassified variants.

The OMIM (Online Mendelian Inheritance in Man) database is an online catalog of human genes and genetic disorders, which focuses on the relationship between phenotype and genotype. It provides a list of human genes and information related to clinical features, inheritance, population genetics, mapping, molecular genetics, and Mendelian disease-associated mutations.15 The HGMD (Human Gene Mutation Database) is a commercial mutation database incorporating human somatic and germline mutations associated with inherited diseases as well as disease-associated polymorphisms compiled from scientific publications and LSDBs.16 Recently, the International


Cancer Genome Consortium (ICGC) has launched an effort to coordinate several research projects across the globe, with the goal of generating a comprehensive catalogue of genomic, transcriptomic, and epigenetic abnormalities in tumors from 50 different cancer types.17 TCGA (Cancer Genome Atlas) is one of the projects, launched by the United States National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI).18 Available information from ICGC and TCGA is imported into COSMIC. While the data incorporated in these databases vary, taken together, the compiled information provides a rich source for disease-related research studies.

To further enable access to these valuable data through MS-driven identification of gene mutations at the protein level, in this work, we created two FASTA databases that comprise cancer- and disease-related mutated peptides. The newly developed databases include as many as 1 030 903 unique cancer-related and 27 172 disease-related mutations. The two databases include not only single amino acid mutations but also small-scale deletions/insertions. It is anticipated that the results of this study will support research aimed at the elucidation of molecular mechanisms that drive cellular malfunction, the identification of mutated protein sequences that alter protein−protein interaction networks and biochemical signaling cascades that lead to diseased cell states, the identification of mutated protein markers and drug targets that characterize particular tumor types, and the development of molecular diagnostic assays.

METHODS

The two mutated FASTA databases were developed by following a four-stage procedure.

In the first stage, mutation information was acquired from online resources according to the following steps. (a) The sites and types of gene- or protein-level mutations were downloaded from the COSMIC, IARC P53, OMIM, and the UniProt databases (Table 2). Except for UniProt, which contained only protein-level information, the other three databases contained both gene- and protein-level mutations. In-house developed Perl scripts were used to extract and organize the data. Missense and nonsense mutations were extracted at the protein level, while deletions/insertions were extracted at the gene level (i.e., cDNA). The great majority of somatic mutations originated from COSMIC (v.69) and of germline mutations from OMIM. Chromosome rearrangements and large-scale insertions/deletions, which can often result in proteins with distinct function or nonfunction, were ignored because accurate information regarding these changes could not be extracted from the public databases. Additional P53 mutations were extracted from IARC P53 (R17), which contains not just somatic but also germline and cell-line mutations. (b) The complete protein and cDNA sequences that encompassed these mutations were downloaded from COSMIC. Protein sequences associated with the UniProt mutations were downloaded from UniProt. (c) Some protein IDs in OMIM were aliases and were changed to standard gene names by searching the GeneCards database. (d) For the UniProt and OMIM data that contained not just cancer but also other disease-related mutations, the alterations were considered to be cancer-related if they could be identified by the following keywords: carcinoma, cancer, leukemia, tumor, sarcoma, blastoma, lymphoma, epithelioma, glioma, melanoma, oligodendroglioma, ependymoma, astrocytoma, mesonephroma, prolactinoma, papilloma, hemangioma, neoplasia, neoplasm, insulinoma, adenoma, chordoma, cystadenoma, meningioma, neurocytoma, neurilemoma, leiomyoma, xanthoastrocytoma, cystadenoma, pheochromocytoma, keratoacanthoma, germinoma, histiocytoma, seminoma, myeloma, hemangiopericytoma, mesothelioma, or keloid. Mutations with other disease names were considered as disease-related.

In the second stage, mutated peptide entries for the FASTA database were generated by pursuing the following strategy. In the case of missense mutations, the position of the mutation and the type of the original amino acid were verified in the full protein sequence downloaded from COSMIC or UniProt. If correspondence was found, the original amino acid was substituted with the mutated amino acid to generate the mutated sequence. In the case of nonsense mutations, a similar procedure was used, except that the stop signal denoted by “*” at the mutation site was used to delete all following amino acids in the protein sequence. For frameshift insertion/deletion/complex cases (the complex having both insertion and deletion in the same mutation), the mutated amino acid information in the original database usually included only the start site of the frameshift and the site of the stop signal after the frameshift. In this case, after verifying the position of the frameshift start site in both protein and cDNA sequences, the insertion/deletion/complex change was performed at the cDNA level, and the new cDNA sequence was translated into a protein sequence. The position of the stop signal in the new protein sequence was then verified and used to delete all following amino acids. If the frameshift did not encounter a cDNA stop codon, then the sequence was terminated at the end of the cDNA sequence. For COSMIC inframe mutations, the new sequence was generated by performing the corresponding deletion/insertion/complex change directly at the protein level. For inframe insertion/deletion/complex mutations in the IARC or OMIM databases, a similar method was used as for frameshifts.

Once a mutated protein sequence was created by this procedure, a peptide that carried the mutation site was generated by performing an in silico digestion of the protein, allowing for both N- and C-termini to be tryptic and for two additional “K” or “R” missed cleavages at the left and right of the mutation site. For frameshift deletion/insertion/complex mutations, the mutated peptide sequence included all amino acids after the frameshift site until the stop signal. The use of trypsin specificity for defining the peptide boundaries was preferred because the large majority of proteomic experiments involve the use of trypsin for proteolytic digestion. However, ∼85−90% of the sequences contain K, D, or E, enabling database searches for experiments in which protein digestion was conducted with other common proteases such as LysC, Asp-N, or Glu-C. Proteolytic digestion with one or more of these proteases could be used to confirm the presence of a mutation.

In the third stage, the proteins originating from various databases with different identifiers were assigned a UniProt ID. To match the COSMIC IDs to the UniProt IDs, the protein sequences provided by COSMIC were compared with each entry in the UniProt human protein database (July 4, 2014 release, comprising UniProt/SwissProt reference/reviewed canonical and isoform sequences and UniProt/TrEMBL nonreviewed sequences). If a UniProt entry matched exactly the COSMIC sequence, then the COSMIC ID was replaced by the UniProt/SwissProt or TrEMBL ID. If no match was found, then the COSMIC ID was used as an identifier for its corresponding protein sequence in the mutated FASTA database. Mutated entries originating from UniProt or IARC P53 already had a SwissProt ID, so no change was made.


OMIM gene IDs were matched to gene names in the SwissProt database and changed to the corresponding protein ID.

In the fourth stage, redundant mutations were removed. Replicate hits from within the same database were deleted first. Entries with the same protein ID, same mutated cDNA (or same mutated amino acid if cDNA information was not available), and the same mutated peptide sequence were grouped into one single entry. The number of hits for each entry was summed and provided in a descriptive head-line. For IARC, the three mutated peptide databases (somatic, germline, and cell-line) were initially combined for replicate deletion. Finally, all four databases were merged, and replicate mutated peptide entries were grouped into one to generate a nonredundant database. For each unique entry, the original database and tissue source were listed in the description line. If different types of mutations resulted in the same mutated peptide sequence, each type of mutation was described in the head-line. Ultimately, entries containing letters other than the ones corresponding to the 20 amino acids (e.g., “X” or “*”) were deleted. Entries with a sequence length of less than five amino acids were removed as well, as such short sequences can cause a dramatic increase in the false discovery rate during database searching of raw MS files.
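The second-stage rules described in the Methods can be illustrated with a minimal Python sketch. The in-house Perl scripts are not reproduced in the article, so the function names, the 1-based coordinate convention, and the toy sequences used to exercise the functions are assumptions made purely for illustration. Only the standard genetic code is taken as given.

```python
# Standard genetic code, built from the conventional TCAG ordering (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cdna):
    """Translate a cDNA until the first stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(cdna) - 2, 3):
        aa = CODON_TABLE[cdna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_missense(protein, pos, ref, alt):
    """Substitute ref -> alt at 1-based pos; return None if ref does not match."""
    if pos < 1 or pos > len(protein) or protein[pos - 1] != ref:
        return None
    return protein[:pos - 1] + alt + protein[pos:]


def apply_nonsense(protein, pos, ref):
    """Delete the residue at 1-based pos and everything after it."""
    if pos < 1 or pos > len(protein) or protein[pos - 1] != ref:
        return None
    return protein[:pos - 1]


def apply_cdna_deletion(cdna, start, end):
    """Delete cDNA positions start..end (1-based, inclusive) and retranslate."""
    return translate(cdna[:start - 1] + cdna[end:])


def apply_cdna_insertion(cdna, after, bases):
    """Insert bases after 1-based cDNA position `after` and retranslate."""
    return translate(cdna[:after] + bases + cdna[after:])
```

Returning `None` on a mismatch mirrors the verification step in the text, where a substitution is made only if the annotated residue is actually found at the stated position. Because `translate` stops at the first stop codon or at the end of the cDNA, a frameshift that never meets a stop codon simply terminates with the sequence.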

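The in silico digestion step can be sketched in the same spirit. The snippet below is one possible reading of the rule "tryptic termini plus two additional K/R missed cleavages on each side of the mutation site", and it appears consistent with the example peptides in Table 4. The function name, the 0-based `site` index, the choice to count every K or R (no proline exception), and the toy protein used to exercise it are assumptions, not details given in the article.

```python
def mutated_tryptic_window(protein, site, missed=2, to_end=False):
    """Return the peptide around 0-based index `site` of a mutated protein.

    The window starts right after the (missed + 1)-th K/R found to the left of
    the site and ends at the (missed + 1)-th K/R found to the right, so that up
    to `missed` K/R residues stay inside the peptide on each side. The residue
    at `site` itself is not counted. With to_end=True (intended for
    frameshifts) the peptide runs to the end of the protein.
    """
    start, seen = 0, 0
    for i in range(site - 1, -1, -1):
        if protein[i] in "KR":
            seen += 1
            if seen == missed + 1:
                start = i + 1
                break

    end = len(protein)
    if not to_end:
        seen = 0
        for i in range(site + 1, len(protein)):
            if protein[i] in "KR":
                seen += 1
                if seen == missed + 1:
                    end = i + 1
                    break
    return protein[start:end]
```

When fewer than three K/R residues exist on one side, the window falls back to the protein terminus on that side, which matches entries in Table 4 that begin with the initiator methionine.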

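The within-database part of the fourth stage (grouping replicates, summing hits, and discarding unusable peptides) might look roughly as follows. The record layout, the `"c.?"` placeholder for an unknown cDNA change, and the function name are illustrative assumptions. The cross-database merge that produces the "OR" description lines is not covered by this sketch.

```python
from collections import defaultdict

STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def merge_and_filter(records, min_length=5):
    """Group replicate entries, sum their hits, and drop unusable peptides.

    Each record is (protein_id, cdna_mutation, aa_mutation, peptide, hits).
    """
    merged = defaultdict(int)
    for protein_id, cdna, aa, peptide, hits in records:
        # Fall back to the amino acid mutation when the cDNA change is unknown.
        mutation = aa if cdna in ("", "c.?") else cdna
        merged[(protein_id, mutation, peptide)] += hits

    # Keep peptides of at least min_length residues made only of the 20 standard letters.
    return {
        key: hits
        for key, hits in merged.items()
        if len(key[2]) >= min_length and set(key[2]) <= STANDARD_AA
    }
```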

RESULTS AND DISCUSSION

Database Content

The mutated FASTA database that was created in this work was named XMAn, reflecting the need to assemble a database that enables unknown (X) peptide-level Mutation Analysis. It comprises unique peptide entries representing two subdata sets (Supplemental Tables 1 and 2 in the Supporting Information). (a) The cancer database (Supplemental Table 1 in the Supporting Information) comprises a total of 872 125 mutated entries representing 1 030 903 unique mutations from 25 642 proteins, compiled from COSMIC, IARC P53, OMIM, and UniProtKB. We note that different original sequences may lead to the same mutated sequence and that one original sequence may mutate into various different sequences; therefore, the number of total mutations that are represented in the database is larger than the number of unique peptide entries. We also note that while these mutations were detected in cancer tissues they do not necessarily represent cancer-specific mutations. To place things in perspective, while the rate of mutations varies between species and cell types (prokaryotic/eukaryotic, uni/multicellular organisms, germline/somatic), the average rate of germline and transient somatic mutations in humans is estimated to be ∼0.06 × 10^−9 and ∼1 × 10^−9 per nucleotide site per cell division, respectively.19 For a diploid genome size of 6 × 10^9, an average of ∼10^13 cells per body, and 100−600 cell divisions per generation, a middle-aged human may accumulate as many as >10^16 mutations, of which ∼1% are coding. Additional factors such as transcriptional/translational fidelity and posttranslational surveillance will further contribute to determining the mutation counts in a protein. The Venn diagram provided in Figure 1 illustrates the contribution of each of the four databases to the count of unique peptide entries in the cancer database. The COSMIC database was the most inclusive resource, contributing 870 307 mutations out of the total of 872 125. UniProt contributed 4127, IARC P53 2677, and the OMIM database only 684 mutations. None of the databases was comprehensive. COSMIC comprises only cancer-related somatic mutations, OMIM includes mainly Mendelian genetic disorders (i.e., germline mutations), UniProt provides only missense point mutations, and IARC P53 provides only mutations in protein P53. Only six mutations were found to be common to all four databases.

Figure 1. Venn diagram illustrating the contribution of the four databases to the count of unique peptide entries in the cancer database.

(b) The disease mutation database (Supplemental Table 2 in the Supporting Information) comprises a total of 27 148 entries representing 27 172 unique mutations from 2913 proteins, compiled from missense point mutations in the OMIM and the UniProt databases.

For each mutated peptide entry, a descriptive head-line was created. The description line starts with the symbol “>” and provides the following information, from left to right, and delimited by “|”: (1) a tag that includes the description “CANCER” or “DISEASE” to mark the entry as being cancer- or disease-related, the abbreviations “sp”/“tr” or “co” to indicate the database source that supplied the protein ID (SwissProt, TrEMBL, or COSMIC), and the protein ID and description; (2) the site and type of mutated cDNA; (3) the site and type of amino acid mutation; (4) a short sequence that includes the mutated amino acid encompassed by three amino acids at the left and right of the mutation site (the mutated amino acid is missing in case of deletions; for frameshifts, all amino acids at the right of the mutated site were included until a stop codon was found); (5) the type of mutation (missense, nonsense, deletion−frameshift, insertion−frameshift, complex−frameshift, deletion−inframe, insertion−inframe, or complex−inframe); (6) the originating database(s); and (7) the tissue source and number of hits that generated the mutated entry (provided in parentheses). For the case of different mutations that generated the same mutated peptide sequence (e.g., mutations occurring in the same area of two proteins from the same family, or mutations in protein isoforms/variants/fragments), additional description lines, separated by “OR,” were added. The head-line was then followed by the mutated peptide sequence. A few relevant examples that illustrate the processing of the mutation information are provided in Tables 3 and 4.

To facilitate manual verification of the data, a mutation list for each subdatabase, categorized by the parent proteins, is provided in Excel format (Supplemental Tables 3 and 4 in the Supporting Information). The lists provide detail related to the site of the mutated cDNA and corresponding amino acid, the short sequence, the mutation type, the originating database, and the tissue source and hits. Separate entries account for cDNA mutations that lead to the same amino acid mutation or for redundant mutations pertaining to protein isoforms. Yearly updates to the database will be provided through follow-up publications or will be made available for download from the


Table 3. Examples of Relevant Mutations from the Cancer FASTA Database Grouped into Missense, Nonsense, Deletion-Frameshift, Insertion-Frameshift, Complex-Frameshift, Deletion-Inframe, Insertion-Inframe or Complex-Inframe (duplications were grouped into insertions)^a

type | cDNA mutation | AA mutation | explanation
Missense | c.1522C>T | p.R508C | SNP in the R codon; missense mutation changes R at position 508 into C
Nonsense | c.577C>T | p.R193* | R codon mutated to a termination codon; the peptide sequence stops with the amino acid in position 192 (prior to R193)
Del-FS | c.1312_1313delCA | p.H438 fs*>16 | deletion of CA at positions 1312−1313; the resulting frameshift replaces H at position 438 and the subsequent sequence with 16 new amino acids, until a new stop codon is encountered
Del-FS | c.726del1 | p.? | deletion of one nucleotide at position 726; the amino acid information is inferred from the mutated cDNA sequence
Ins-FS | c.1128_1129insC | p.W377 fs*>35 | insertion of C between nucleotides 1128 and 1129; the resulting frameshift replaces W at position 377 and the subsequent sequence with 35 new amino acids, until a new stop codon is encountered
Com-FS | c.495_501GAAAAAA>GAAAAAAA | p.Q165 fs | deletion/insertion of nucleotides; the resulting complex frameshift replaces Q at position 165 and the following amino acids with a sequence of new amino acids
Del-inframe | c.840_842delTGA | p.D282delD | deletion of TGA at positions 840−842; the resulting inframe shift deletes D at position 282; the rest of the sequence is not affected
Del-inframe | c.? | p.N566_D572del | cDNA level information not known; the amino acid sequence N566-to-D572 deleted; the rest of the sequence is not affected
Ins-inframe | c.1047_1048insGGC | p.G354_S355insG | insertion of GGC between nucleotides 1047 and 1048; the resulting inframe shift inserts G between G at position 354 and S at position 355; the rest of the sequence is not affected
Com-inframe | c.1895_1918>TGCGGC | p.E632_A640>VRP | deletion/insertion of nucleotides; the resulting complex inframe shift results in the replacement of the amino acid sequence E632-to-A640 with VRP

^a Mutated DNA and amino acid-level annotations follow the COSMIC style.
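To make the annotation style of Table 3 concrete, the following Python sketch classifies the amino acid-level strings that appear in the table. The regular expressions cover only the patterns shown there and are not a complete parser for COSMIC annotations; the category labels and function names are illustrative assumptions. The small `codon_number` helper relies on the usual convention that cDNA position 1 is the first base of the start codon, and it is meant for point substitutions only.

```python
import re

# Patterns for the amino acid annotations listed in Table 3 (illustrative only).
PATTERNS = [
    ("nonsense", re.compile(r"^p\.([A-Z])(\d+)\*$")),
    ("missense", re.compile(r"^p\.([A-Z])(\d+)([A-Z])$")),
    ("frameshift", re.compile(r"^p\.([A-Z])(\d+)\s*fs")),
    ("insertion-inframe", re.compile(r"^p\.([A-Z])(\d+)_([A-Z])(\d+)ins([A-Z]+)$")),
    ("deletion-inframe", re.compile(r"^p\.([A-Z])(\d+)(?:_([A-Z])(\d+))?del")),
    ("complex-inframe", re.compile(r"^p\.([A-Z])(\d+)_([A-Z])(\d+)>([A-Z]+)$")),
]


def classify_aa_mutation(annotation):
    """Return (category, captured fields) for an annotation such as 'p.R508C'."""
    for name, pattern in PATTERNS:
        match = pattern.match(annotation)
        if match:
            return name, match.groups()
    return "unknown", ()


def codon_number(cdna_position):
    """Map a 1-based coding-sequence position to its 1-based codon number."""
    return (cdna_position - 1) // 3 + 1
```

For the first two rows of the table, `codon_number(1522)` and `codon_number(577)` give 508 and 193, matching p.R508C and p.R193*.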

From a biological perspective, the most substantial impact on protein structure, and, therefore, function, will be governed by interchanges between: (a) acidic (D, E) and basic (H, K, R) amino acids, (b) polar uncharged (C, M, N, Q, S, T)/charged (D, E, H, K, R) and nonpolar (A, G, I, L, P, V)/aromatic (F, Y, W), and (c) gains/losses in C, which has an important role in stabilizing the protein structure through the formation of disulfide bonds. In addition, gains/losses in amino acids with an essential role in signal transduction (e.g., S/T/Y with roles in phosphosignaling processes) will substantially impact the cell fate. An earlier analysis of the “1000 Genomes Project” data,20−22 which included only nonsynonymous SNPs (missense mutations) filtered to contain 106 311 natural amino acid variants on 19 058 human proteins that occurred in a single population, revealed that amino acids with a higher frequency of occurrence in the human proteome, that is, with a larger number of codons, mutated in larger numbers.22 R and L were outliers, with R displaying a much larger number of mutations and L a somewhat lower number than expected. While several factors may affect the mutability of an amino acid, this R/L anomaly was explained through the presence/absence of the CpG dinucleotide sequence in their codons, which is known to mutate at much higher rates than other dinucleotides (CpG being present in four-out-of-the-six R codons and absent in all six L codons).23 When the number of mutations in the present cancer database was normalized to the amino acid frequency in the human proteome, the same trend was observed: R mutated by far the most, L the least, and the rate of mutation of other amino acids was roughly the same, except for A/D/E and G, which displayed slightly higher rates than the average (Figure 3A). The mutation gain profile revealed that the gains were not affected by the amino acid frequency (Figures 2B and 3B).
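The codon statement behind the R/L anomaly (CpG present in four of the six R codons and in none of the six L codons) can be checked directly against the standard genetic code. The short Python sketch below does only that; it looks for the dinucleotide within single codons, as the text does, and ignores CpG sites that span two neighboring codons. The helper name is an illustrative choice.

```python
# Standard genetic code in the conventional TCAG ordering (DNA alphabet).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Group the 64 codons by the amino acid (or stop signal) they encode.
codons_by_aa = {}
for i, a in enumerate(BASES):
    for j, b in enumerate(BASES):
        for k, c in enumerate(BASES):
            codons_by_aa.setdefault(AMINO[16 * i + 4 * j + k], []).append(a + b + c)


def cpg_codons(amino_acid):
    """Return the codons of an amino acid that contain a CG dinucleotide."""
    return [codon for codon in codons_by_aa[amino_acid] if "CG" in codon]
```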
A direct comparison of the “from/to” mutation ratios between the cancer database and an amino acid exchange matrix generated from the analysis of the 1000 Genomes Project21 (Figures 5 and 6) confirmed that the major trends in amino acid changes were very similar,

author’s home page via the Biological Sciences Department Web site at Virginia Tech.

Database Content Analysis

The cancer database, with the largest number of entries, was subjected to further analysis. An evaluation of the different types of mutations revealed that missense mutations represented the largest proportion in the data set (88.3%), while nonsense and insertions/deletions represented only 6.8 and 4.9%, respectively. The proportion of insertions/deletions in the original databases was, however, larger, but some of these mutations could not be included in the FASTA databases due to the lack of complete information (e.g., there was no amino acid or nucleotide sequence available for insertions in the IARC database). Silent point mutations with no impact on the protein sequence, protein fusions from chromosome rearrangements which usually generate the same sequence as the original proteins, large-scale amplification/deletions which typically change only the copy number of the gene products and have no effect on the amino acid sequence, and intronic mutations which create alternative-spliced proteins with often hard-to-predict sequences, were ignored. The counts of missense mutations “from” and “to” each amino acid residue, raw and normalized to the frequency of amino acids in the human proteome, were captured in stacked column charts (Figures 2 and 3). To avoid distorting biological relevance, all mutations, including those that generated peptide sequences with fewer than five amino acids, were included. Supplemental Table 5 in the Supporting Information provides the numerical values for these changes. The amino acids that experienced the largest number of mutation losses and gains included A/D/E/G/L/P/R/S (Figure 2A) and C/H/I/K/L/N/Q/S/T/V (Figure 2B), respectively. A 3D-chart illustrates the counts for each unique mutation (Figure 4), some of the most notable alterations being observed for E → K, A → T, A → V, R → C, R → H, and R → Q. The largest number of unique missense mutations was observed for E → K (33333), and the amino acid that underwent most mutations was R.
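The "from"/"to" tallies and their normalization to amino acid frequency can be sketched as below. The article does not describe how the counts were computed, so the functions, the representation of each missense mutation as a (from, to) pair, and the frequency table passed in by the caller are assumptions for illustration only. No values from Supplemental Table 5 are reproduced.

```python
from collections import Counter


def tally(mutations):
    """Count losses ("from") and gains ("to") over (from_aa, to_aa) pairs."""
    losses = Counter(src for src, _ in mutations)
    gains = Counter(dst for _, dst in mutations)
    return losses, gains


def normalize(counts, frequency):
    """Divide each count by the proteome frequency of that amino acid."""
    return {aa: counts[aa] / frequency[aa] for aa in counts}
```

With real data, `frequency` would hold the fractional occurrence of each residue in the human proteome, so that an abundant residue with many raw losses is not mistaken for an unusually mutable one.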

dx.doi.org/10.1021/pr5004467 | J. Proteome Res. 2014, 13, 5486−5495


preserving the same amino acid losers (“from/to” ratio >1) and winners (“from/to” ratio

Table 4. Relevant Examples from the FASTA Mutated Cancer Database

C|p.D186H|SDSHGLA|Missense|COSMIC&IARC(Somatic)|Breast(2);Ovary(2)&Breast(2);Ovary(2) OR sp|P04637,P53_HUMAN Cellular tumor antigen p53|c.?|p.D186H|SDSHGLA|Missense|Swissprot|Sporadic cancers(1)
RCPHHERCSDSHGLAPPQHLIRVEGNLRVEYLDDR

>CANCER_sp_P01112,RASH_HUMAN GTPase HRas|c.30_31insGGC|p.G10_A11insG|VVGGAGG|Insertion - in frame|COSMIC|Upper aerodigestive tract(1) OR sp|P01116-2,RASK_HUMAN Isoform 2B of GTPase KRas|c.30_31insGGA|p.G10_A11insG|VVGGAGG|Insertion - in frame|COSMIC|Haematopoietic and lymphoid tissue(2);Large intestine(3) OR sp|P01116-2,RASK_HUMAN Isoform 2B of GTPase KRas|c.31_32insGAG|p.G10_A11insG|VVGGAGG|Insertion - in frame|COSMIC|Haematopoietic and lymphoid tissue(1) OR sp|P01116,RASK_HUMAN GTPase KRas|c.31_32insGAG|p.G10_A11insG|VVGGAGG|Insertion - in frame|COSMIC|Haematopoietic and lymphoid tissue(1)
MTEYKLVVVGGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRK

>CANCER_sp_P04637,P53_HUMAN Cellular tumor antigen p53|c.989delT|p.Q331 fs*14|FTLRSVGVSASRCSES|Deletion - frameshift|COSMIC&IARC(Somatic)|Ovary(1)&Ovary(1) OR sp|P04637,P53_HUMAN Cellular tumor antigen p53|c.990del1|p.?|YFTLRSVGVSASRCSES|Deletion - frameshift|IARC(Somatic)&IARC(Cellline)|Lung(1)&Lung(1) OR sp|P04637,P53_HUMAN Cellular tumor antigen p53|c.991del1|p.?|FTLRSVGVSASRCSES|Deletion - frameshift|IARC(Somatic)|Other female gen. org.(1)
KKPLDGEYFTLRSVGVSASRCSES

>CANCER_sp_P10721,KIT_HUMAN Mast/stem cell growth factor receptor Kit|c.?|p.N566_D572del|INGPTQ|Deletion - in frame|COSMIC|Skin(1)
YLQKPMYEVQWKVVEEINGPTQLPYDHKWEFPRNR

>CANCER_sp_P59646,FXYD4_HUMAN FXYD domain-containing ion transport regulator 4|c.206_207AG>CT|p.Q69>?|SQKPHSP|Missense|COSMIC|Large intestine(1)
CKSSQKPHSPVPEKAIPLITPGSATTC
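The description lines in Table 4 follow a pipe-delimited layout, with alternative annotations of the same peptide joined by “OR”. The Python sketch below shows one way such a line could be split into fields. It is an illustration only: the field names, the `MutationRecord` type, and the assumption that every alternative has exactly seven pipe-separated fields after the `CANCER_sp_` or `sp|` prefix are inferred from the examples in Table 4 and may not hold for every entry in the databases.

```python
import re
from typing import NamedTuple


class MutationRecord(NamedTuple):
    accession: str        # e.g. "P10721"
    protein: str          # entry name and protein name
    cdna_change: str      # e.g. "c.?"
    protein_change: str   # e.g. "p.N566_D572del"
    mutated_segment: str  # short sequence spanning the mutation site
    mutation_type: str    # e.g. "Deletion - in frame"
    source: str           # originating database(s)
    tissues: str          # tissue or disease with hit counts


# Prefixes seen in Table 4: the first alternative starts with "CANCER_sp_",
# later alternatives start with "sp|".
_PREFIX = re.compile(r"^(?:CANCER_sp_|sp\|)")


def parse_description(line):
    """Split one FASTA description line into a list of MutationRecord tuples."""
    records = []
    for alternative in line.lstrip(">").split(" OR "):
        body = _PREFIX.sub("", alternative.strip())
        fields = [field.strip() for field in body.split("|")]
        if len(fields) != 7:
            raise ValueError(f"unexpected field count in: {alternative!r}")
        accession, _, protein = fields[0].partition(",")
        records.append(MutationRecord(accession, protein, *fields[1:]))
    return records


if __name__ == "__main__":
    header = (">CANCER_sp_P10721,KIT_HUMAN Mast/stem cell growth factor receptor Kit"
              "|c.?|p.N566_D572del|INGPTQ|Deletion - in frame|COSMIC|Skin(1)")
    for record in parse_description(header):
        print(record.accession, record.protein_change, record.mutated_segment)
```

Raising an error on an unexpected field count, rather than guessing, makes entries that deviate from the assumed layout visible early.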

Journal of Proteome Research

Query for Mutated Peptides in MCF-7 Cells

The FASTA-formatted mutation databases can be searched with any tandem MS search engine, such as SEQUEST or MASCOT. To test their utility, a set of 118 raw LC-MS files generated from various MCF-7 (ER+) breast cancer cell cultures with an LTQ−MS instrument was searched against a UniProt human protein database appended with the two mutation databases.25−26 The search result files were filtered with various SEQUEST parameters to reduce the peptide-level FDRs below 1%. Because the database sequences include two missed K/R cleavage sites on each side of the mutated amino acid site, the output list contained a large number of matching peptides that did not, in fact, encompass any mutation site. To eliminate such entries, a Perl script was developed to select the peptide sequences from the search engine output list that incorporated the short mutated sequence provided in the description line of each match. Such Perl scripts can be developed for mining
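The selection step described in this section can be sketched in a few lines of Python. This is a hypothetical stand-in for the Perl script, not a reproduction of it: the function names are invented, the digestion rule (cleavage after K or R, except before P) is the commonly used trypsin approximation, and only fully cleaved peptides are generated for the demonstration.

```python
import re


def tryptic_peptides(sequence):
    """Fully cleaved tryptic peptides: cut after K or R unless followed by P."""
    return [piece for piece in re.split(r"(?<=[KR])(?!P)", sequence) if piece]


def covers_mutation(peptide, mutated_segment):
    """True if the identified peptide contains the short mutated sequence."""
    return mutated_segment in peptide


def filter_hits(hits):
    """Keep (peptide, mutated_segment) pairs whose peptide spans the mutation."""
    return [(peptide, segment) for peptide, segment in hits
            if covers_mutation(peptide, segment)]


if __name__ == "__main__":
    # KIT entry from Table 4: database sequence and its short mutated segment.
    entry_sequence = "YLQKPMYEVQWKVVEEINGPTQLPYDHKWEFPRNR"
    segment = "INGPTQ"
    hits = [(peptide, segment) for peptide in tryptic_peptides(entry_sequence)]
    print(filter_hits(hits))
```

For the KIT entry, only one of the four fully cleaved peptides spans the deletion site, which illustrates why matches to the flanking peptides have to be discarded before a mutation is reported.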


Figure 2. Relative frequency of cancer missense mutation counts: (A) From each AA residue. (B) To each AA residue.

Figure 3. Relative frequency of cancer missense mutation counts normalized to the frequency of specific amino acids in the human proteome: (A) From each AA residue. (B) To each AA residue.

Figure 4. Relative frequency of cancer missense mutations (3D-column chart). Amino acid mutations with the largest frequency are indicated in the figure.


Figure 5. Total missense mutation counts “from” and “to” each amino acid in the cancer database. The upper, positive counts indicate the total mutation counts from each amino acid into another (the losses). The lower, negative counts indicate the total mutation counts into a particular amino acid (the gains).

Figure 7. Protein−protein STRING interaction diagram of the top 250 proteins carrying the largest number of mutations in the cancer database. The proteins cluster into the following major categories: [1] cell development, differentiation, proliferation, adhesion, response to growth factors, protein autophosphorylation and signaling; [2] DNA damage sensing and repair; [3] locomotion, cell movement, muscle contraction; [4] locomotion, cell motility; and [5] cell adhesion.

proteins present in the fetal bovine serum (FBS) used in the cell culture medium, which had very similar or identical stretches of amino acid sequences with their mutated human counterparts, made it impossible to distinguish whether some of the mutated peptides were of human or bovine origin (this was revealed by a separate search against a combined human/bovine database appended with the cancer-mutated peptides). The use of high-resolution/mass-accuracy instruments for data acquisition is, therefore, highly recommended to enable unambiguous interpretation of such results. Moreover, because many cell cultures are performed in the presence of FBS, the level of contamination with bovine serum proteins should be verified. Despite such challenges, the mutated database enabled the confident identification of some insertions and deletions. An example is provided in Figure 8, highlighting the insertion of the AAA sequence between A159 and K160 in a peptide originating from RL14_HUMAN 60S ribosomal protein (P50914): the original sequence GT149AAAA153AAAAAA159K160 converted into GTAAAAAAAAAA159AAAK160. The corresponding bovine sequence KGAAAVAAAAAAKV (Q3T0U2) was manually verified, the lack of T at position 149 and the change from A to V at position 153 confirming that the mutated sequence was of human origin.
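The manual cross-species check described above could be partly automated. The Python sketch below flags an identified peptide as ambiguous when it also occurs in a sequence from another species. It is an illustration under stated assumptions: the function names are hypothetical, the bovine input is only the short fragment quoted in the text rather than the full Q3T0U2 sequence, and treating I and L as equivalent is a design choice reflecting that the two residues have the same mass.

```python
def normalize_isobaric(sequence):
    """Upper-case the sequence and map I to L, since the two are isobaric."""
    return sequence.upper().replace("I", "L")


def ambiguous_with(peptide, other_species_sequences):
    """Return the IDs of other-species sequences that contain the peptide."""
    probe = normalize_isobaric(peptide)
    return [name for name, sequence in other_species_sequences.items()
            if probe in normalize_isobaric(sequence)]


if __name__ == "__main__":
    # Bovine fragment quoted in the text; a real check would use full sequences.
    bovine = {"Q3T0U2": "KGAAAVAAAAAAKV"}
    # Mutated human RL14 peptide: GT, ten A residues, the inserted AAA, then K.
    mutated_human = "GT" + "A" * 10 + "AAA" + "K"
    print(ambiguous_with(mutated_human, bovine))
    print(ambiguous_with("AAAAAAK", bovine))
```

An empty list for the mutated RL14 peptide is consistent with the manual verification reported above, whereas a short peptide lacking the distinguishing T and V positions would be flagged.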

Figure 6. Comparison of mutation ratios “from each amino acid/to each amino acid” in the cancer database and the “1000 Genomes Project.” Ratios that are >1 represent amino acid gains, while ratios that are