
Article pubs.acs.org/jpr

The Hybrid Search: A Mass Spectral Library Search Method for Discovery of Modifications in Proteomics

Meghan C. Burke,*,† Yuri A. Mirokhin,† Dmitrii V. Tchekhovskoi,† Sanford P. Markey,† Jenny Heidbrink Thompson,‡ Christopher Larkin,‡ and Stephen E. Stein†

†Mass Spectrometry Data Center, National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, United States
‡Analytical Sciences, MedImmune LLC, One MedImmune Way, Gaithersburg, Maryland 20878, United States

ABSTRACT: We present a mass spectral library-based method to identify tandem mass spectra of peptides that contain unanticipated modifications and amino acid variants. We describe this as a “hybrid” method because it combines matching both ion m/z and mass losses. The mass loss is the difference between the mass of an ion peak and the mass of its precursor. This difference, termed DeltaMass, is used to shift the product ions in the library spectrum that contain the modification, thereby allowing library product ions that contain the unexpected modification to match the query spectrum. Clustered unidentified spectra from the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and Chinese hamster ovary cells were used to evaluate this method. The results demonstrate the ability of the hybrid method to identify unanticipated modifications, insertions, and deletions, which may include those due to an incomplete protein sequence database or to search settings that exclude the correct identification, in high-resolution tandem mass spectra without regard to their precursor mass. This has been made possible by indexing of the m/z value of each fragment ion and its difference in mass from its precursor ion.

KEYWORDS: peptide mass spectral library, post-translational modifications, spectral library searching



sequencing;11−15 the use of mass tags,16,17 in which both de novo sequencing and a sequence database routine are utilized; Mascot error-tolerant search;18 and an open or “blind” modification search, in which a wide precursor tolerance is allowed to enable the identification of unanticipated modifications in a sequence database search.19,20 These methods are associated with a decrease in search speed and are more error-prone because of the increase in search space.8 The high-mass-accuracy instruments that are currently available have leveraged spectral libraries as a widely applicable tool for the analysis of tandem mass spectrometry data. Two key advantages of spectral libraries are the reduced search space and peak-intensity-based scoring. Because spectral libraries are composed of only experimentally observed reference spectra, the search space is reduced compared with that of a sequence database search, which can result in an improved search speed.21,22 Second, the similarity-based scoring considers all observed product ions, assigned and unassigned, as well as peak intensities, meaning that no assumptions are made about how a peptide will fragment.6 When taken together, these advantages allow spectral libraries to use what has previously been

INTRODUCTION

Proteomics is most commonly performed using a bottom-up approach, in which proteins are enzymatically digested, typically using trypsin, and the resulting peptides are analyzed via liquid chromatography−tandem mass spectrometry (LC−MS/MS). The fragmentation pattern acquired in the MS/MS spectrum of a given peptide ion is used to derive peptide sequence information.1,2 Several methods are available to assign peptide identifications to MS/MS spectra in a high-throughput manner using, for example, a sequence database approach,3−5 in which an in silico digestion is performed on the protein sequences provided in the database, or a spectral library approach,6 which matches spectra in a collection of confidently identified MS/MS spectra that have been previously observed. Recent attention has been focused on spectra that are not confidently assigned to a peptide sequence and therefore remain unassigned, with a recent report demonstrating that in some experiments as many as 75% of tandem mass spectra acquired in a proteomics experiment can remain unidentified.7 High-quality spectra may remain unassigned for many reasons, such as an incorrect charge state assignment, an incomplete protein database, or database search parameters that exclude the correct peptide identification (e.g., variable modifications and tryptic termini).8−10 Current methods to address the latter two causes of unidentified high-quality spectra include de novo

Received: November 15, 2016 Published: April 3, 2017

DOI: 10.1021/acs.jproteome.6b00988 J. Proteome Res. XXXX, XXX, XXX−XXX

Article

Journal of Proteome Research

Figure 1. Illustration of the difference in match factor obtained (A) from a partial alignment, in which peaks are not shifted by the normalized DeltaMass, and (B) when product ions that match after shifting by the normalized DeltaMass (pink) are included in the match factor for the peptide identification M(iTRAQ)VVPVAALFTPLK(iTRAQ) (charge = 2, DeltaMass = −101.0974, match factor difference = 372), where DeltaMass corresponds to carbamylation (M) and loss of iTRAQ 4-plex.

and a library spectrum for those product ions within the specified product ion tolerance.31 Spectra for which the number of matching product ions is determined to be statistically significant are then subjected to matching that allows for a modification mass equal to the difference in precursor mass.31 The partial alignment performed in both the QuickMod and SpectraST tierwise scoring search algorithms may miss identifications for which the DeltaMass is located at either terminus of the peptide sequence, causing a shift in the entire series of b or y ions, such as the example shown in Figure 1. Unlike the previously described methods, pMatch generates a spectral library of “optimized” consensus spectra that incorporates theoretical peaks into the library spectrum at equal intensity.27 pMatch allows the difference in precursor mass between the query and library precursor to be treated as a modification in the open search. The modification is then localized by comparing the query spectrum to a series of theoretical spectra containing the modification at each possible position in the peptide sequence.27 The hybrid mass spectral library search method expands on currently available methods that aim to use spectral library searching for the identification of unanticipated modifications in at least three significant ways. First, the hybrid mass spectral library search method does not require a predefined mass tolerance to determine which library spectra are compared with the query spectrum because of indexing of peaks and their losses.23 In all three spectral library search methods described above, the user defines the maximum difference in precursor mass between the query and library precursor, which is usually ±200−300 Da. 
Second, allowing product ions present in the query spectrum to match either the unshifted or shifted library product ion rather than a theoretical spectrum preserves the advantages of spectral library searching compared with sequence database searching. Third, this is the first demonstration of the identification of unanticipated modifications in high-resolution tandem mass spectra using a mass spectral

observed to determine what can be learned from an unidentified spectrum. A challenge associated with the spectral library method has been that in order to identify the correct peptide sequence, the corresponding spectrum must be in the library; however, a newly enhanced “hybrid” spectral library method, described here, can circumvent this challenge if a similar reference spectrum is already in the library. In the hybrid method, a combination of ion m/z and neutral mass loss searching,23 formerly implemented only for low-resolution electron ionization spectra, has been added to the publicly available NIST MS Search24 (single-spectrum searching and viewing) and MSPepSearch25 (batch file searching; text output) programs. It relies on the simple concept that if a fixed modification is present on any precursor ion, it will shift both the precursor mass and the masses of all product ions to which it is attached by the mass of that modification. Furthermore, if the fragmentation pathways are not greatly affected by the modification, the major peaks in the modified and unmodified (library) spectra will match by allowing for this mass shifting, which has been previously reported for product ion spectra analyzed in the ion trap.26,27 This idea, in fact, has also recently been shown to enable interconversion of isobaric tags for relative and absolute quantitation (iTRAQ)28 and tandem mass tag (TMT)29 labeled spectra.30 Moreover, the product ions that match after shifting can provide position information on the modification. 
Current methods available for the identification of unanticipated modifications using spectral libraries include QuickMod,26 SpectraST open modification or tierwise scoring search,31 and pMatch.27 QuickMod searches a query spectrum against all library spectra whose difference in precursor mass relative to the query spectrum is within the user-defined tolerance and subsequently determines the best position for the modification.26 SpectraST open modification or tierwise scoring search first considers the similarity between a query spectrum


each sequence ion in the library must be annotated, which is the case for the NIST peptide libraries. Also, peaks and their losses are indexed for rapid searching, as done in the earlier hybrid search method.23 Such product ion indexing is essential for rapid searching since library spectra must be found by their product ion peak m/z values and neutral losses. The query spectrum product ion charges and library spectrum precursor mass are not known at the time of using the index. When both an original library product ion peak and a library product ion peak shifted by DeltaMass/z are matched to two query peaks, the original library peak intensity is divided among the original and shifted peaks according to the relative intensities of the two query peaks. The match factor reported by the search algorithm reflects the degree of spectral similarity and is based on using a modified cosine of the angle between the query and library spectra.35 For example, an identical match would result in a match factor of 999.35 The algorithms for the hybrid and direct identity spectral matches are identical except that when precursor masses differ in the library and search spectra, mass shifting by that difference in mass is also allowed to constitute a peak match. Consequently, when DeltaMass is zero, scores are identical for the hybrid and direct search identifications. To the degree that shifting peaks increases the score, the hybrid scores will be greater than the direct score, such as the identification shown in Figure 1. In this example, the match factor increased from 494 (without including shifted product ions) to 866 with the hybrid mass spectral library search method. In this work, recurrent unidentified consensus spectra from CPTAC and CHO tryptic digests were searched using this new hybrid search function. Precursor peaks were ignored with a window of m/z 1.6 in order to prevent precursor ions from contributing to the similarity score. 
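The peak-matching and scoring rules described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the NIST implementation: the function names are hypothetical, the most intense query peak within tolerance is taken as the match, and a plain cosine scaled to 999 stands in for the modified cosine used by the actual search programs, whose weighting is not given in the text.

```python
import math

def within_ppm(mz_a, mz_b, tol_ppm):
    """True when mz_a lies within tol_ppm parts per million of mz_b."""
    return abs(mz_a - mz_b) <= mz_b * tol_ppm * 1e-6

def best_hit(query, target_mz, tol_ppm, used):
    """Index of the most intense unused query peak near target_mz, or None."""
    hits = [i for i, (mz, _) in enumerate(query)
            if i not in used and within_ppm(mz, target_mz, tol_ppm)]
    return max(hits, key=lambda i: query[i][1]) if hits else None

def hybrid_pairs(query, library, delta_mass, tol_ppm=40.0):
    """Pair library peaks with query peaks, directly or shifted by DeltaMass/z.

    query:   list of (mz, intensity)
    library: list of (mz, intensity, charge); charge comes from the annotation
    Returns (library_intensity_share, query_intensity) pairs.
    """
    used, pairs = set(), []
    for lib_mz, lib_int, z in library:
        direct = best_hit(query, lib_mz, tol_ppm, used)
        if direct is not None:
            used.add(direct)
        shifted = None
        if delta_mass != 0.0:  # with DeltaMass = 0 the search is a direct search
            shifted = best_hit(query, lib_mz + delta_mass / z, tol_ppm, used)
            if shifted is not None:
                used.add(shifted)
        hits = [query[i][1] for i in (direct, shifted) if i is not None]
        if not hits:
            pairs.append((lib_int, 0.0))  # library peak with no counterpart
            continue
        total = sum(hits)
        for q_int in hits:
            # Divide the library intensity in proportion to the query peaks.
            pairs.append((lib_int * q_int / total, q_int))
    # Query peaks that matched nothing still count against the score.
    pairs += [(0.0, q) for i, (_, q) in enumerate(query) if i not in used]
    return pairs

def match_factor(pairs):
    """Plain cosine similarity scaled to 0..999 (an assumed simplification)."""
    dot = sum(a * b for a, b in pairs)
    norm = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
    return round(999 * dot / norm) if norm else 0
```

Calling `hybrid_pairs` with `delta_mass=0.0` reproduces a direct search, which mirrors the statement that hybrid and direct scores are identical when DeltaMass is zero.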
A product ion tolerance of 40 ppm was allowed with output of the top hit only using MSPepSearch.25 A match factor of at least 600 was required. The DeltaMass values (mass differences between query and library precursors), which are included in the output, were then used to propose possible modifications based on unique monoisotopic masses available in Unimod (http://www.unimod.org)36 with a mass tolerance of 30 ppm. A modification whose monoisotopic mass was isobaric with multiple modifications was assigned a chemical formula followed by the possible modifications. For example, a DeltaMass matching methylation, which is isobaric with several amino acid substitutions, would be assigned the following possible modification: “H(2)C; Methyl/Asp → Glu/Gly → Ala/Ser → Thr/Val → Xle/Asn → Gln”. Next, spectra containing at least 10 product ions that were assigned to proposed modifications of interest were manually evaluated to determine whether the observed DeltaMass could be unambiguously assigned to a single amino acid. Since there is general concern that automated methods risk misidentifying SNPs and other biological modifications as accidental matches to chemical modifications made in the digestion process, manual inspection was done to check the reliability of our methods. This involved the assignment of all major product ions present in the query spectrum in addition to the presence of flanking product ions of significant intensity on either side of the modified residue in order to localize the modification to a single position. For identifications where matching shifted product ions indicated conflicting localization of the DeltaMass, the product ion annotations were first inspected to determine whether the conflict could be resolved by overlapping product

library search method. The product ion tolerance for the hybrid mass spectral library method can be expressed in units of parts per million or m/z, while the current methods available use product ion tolerances (or bin size) in units of either m/z or thomsons. Collectively, these advantages enable the hybrid mass spectral library search to identify unexpected modifications without regard to the magnitude of the mass shift including single-nucleotide polymorphisms (SNPs) and insertions or deletions of amino acids. Here, recurrent unidentified spectra from the Clinical Proteomic Tumor Analysis Consortium (CPTAC)32 and Chinese hamster ovary (CHO) cells have been used to evaluate the ability of the hybrid mass spectral library search method to identify unexpected modifications in high-quality unidentified spectra.



METHODS

Recurrent Unidentified Spectra for Hybrid Method Evaluation

High-resolution MS/MS spectra for the evaluation of the hybrid mass spectral library search method were obtained from tumor-derived tryptic peptides from the CPTAC32 analyses of both samples analyzed by The Cancer Genome Atlas Program and the human-in-mouse xenograft reference standard (CompRef). The raw data for tryptic peptides derived from 105 breast tumor samples and separately 72 and 75 ovarian cancer tumor samples, of which 32 were shared, were analyzed at the Broad Institute of MIT and Harvard, Johns Hopkins School of Medicine, and Pacific Northwest National Laboratory, respectively, in addition to the CompRef standard. Tryptic peptides were covalently labeled with iTRAQ for quantitation and analyzed using 2D LC−MS/MS, in which both precursor and product ion scans were acquired using an Orbitrap mass analyzer. The raw spectra from which recurrent unidentified spectra were clustered into consensus spectra are publicly available through the CPTAC Data Portal (https://cptac-data-portal.georgetown.edu/cptac/public).33 Tandem mass spectra corresponding to the same precursor that were observed more than a single time and were unidentified were clustered into “consensus” spectra.34 Collectively, a total of 1 428 188 clustered spectra were generated from more than 56 million total collected spectra.

Hybrid Search

The former hybrid spectral library search method23 combined ion m/z and neutral mass loss peak matching for spectra of low mass accuracy. In this work, a new high-mass-accuracy hybrid search capability has been added to NIST MS Search24 and MSPepSearch.25 Because it is intended to deal with multiply charged precursor ions instead of a neutral loss, it incorporates an ion loss function. The basic principle of the search is that when two precursor ions differ only in a single modification that does not greatly affect the fragmentation mechanism, each product ion peak in one spectrum of one precursor corresponds to a peak created by exactly the same fragmentation in the other precursor spectrum. These pairs will have precisely the same masses and m/z values if neither contains the modification. However, their masses will differ by the mass of the modification (DeltaMass) and their m/z values will differ by DeltaMass/z for pairs of peaks that do contain the modification. Specifically, the method attempts to match each peak in the query spectrum either directly to a peak in the library spectrum or to a library peak whose m/z has been shifted by DeltaMass/z. For this to work, the charge state of


Figure 2. (a) DeltaMass distribution of the hybrid spectral library identifications from the recurrent unidentified spectra in the CPTAC library, which includes data acquired from the analyses of TCGA and CompRef samples at the Broad Institute of MIT and Harvard, Johns Hopkins School of Medicine, and Pacific Northwest National Laboratory, shown with mass bins of 0.5 Da with a cutoff of 20 000 identifications to highlight the possible modifications present. (b) The DeltaMass distribution around 144.10 Da (±0.04 Da), with mass bins of 0.001 Da, has a mean DeltaMass of 144.1021 Da, which is well within the 5 ppm mass accuracy of the monoisotopic mass of iTRAQ 4-plex.
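A DeltaMass histogram of the kind shown in Figure 2a can be sketched as follows. The helper names are hypothetical, and two assumptions are made: the query and library precursors share a charge state (as required in this work), so the charge-carrier masses cancel in the difference, and bin edges fall on multiples of the bin width, which the caption does not specify.

```python
from collections import Counter
import math

def delta_mass(query_mz, library_mz, charge):
    """Neutral mass difference between precursors of the same charge state."""
    return (query_mz - library_mz) * charge

def bin_delta_masses(values, width=0.5):
    """Count DeltaMass values per bin; each key is the lower edge of a bin."""
    return Counter(math.floor(v / width) * width for v in values)
```

For example, `bin_delta_masses([144.1021, 144.3, 14.0157])` places the first two values in the bin starting at 144.0 Da and the third in the bin starting at 14.0 Da.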

tolerance of 20 ppm and product ion tolerance of 0.02 Da were allowed. Variable modifications included carbamidomethyl (C), iTRAQ 4-plex (K), iTRAQ 4-plex (N-term), and oxidation (M). In addition, MS-GF+ searches5 were performed with multiple fasta files. For each search, a precursor tolerance of 20 ppm and a product ion tolerance appropriate for a Q Exactive mass spectrometer with higher-energy collisional dissociation fragmentation were allowed. Static modifications included carbamidomethyl (C), iTRAQ 4-plex (N-term), and iTRAQ 4-plex (K). Variable modifications included oxidation (M) and deamidation (N, Q). The protein databases that were used in each search included RefSeq human (December 2011) and mouse (December 2011).

ions in the library spectrum. If this did not resolve the discrepancy, the DeltaMass was annotated as ambiguous because the modification could not be localized to a single residue. The requirement that the modification localization be unambiguous is necessary in order to ensure that the DeltaMass is not due to the presence of multiple modifications on multiple residues. Following the unambiguous localization of the DeltaMass, for cases where the chemical formula of the modification may correspond to multiple different chemical and/or biological modifications, the modification corresponding to the amino acid to which the modification was localized in the peptide sequence was assigned. For example, a DeltaMass equivalent to the addition of CH2, which may correspond to methyl, Asp → Glu, Gly → Ala, Ser → Thr, Val → Xle, or Asn → Gln, that has been unambiguously localized to Val would be assigned a Val → Leu/Ile substitution. Further analysis is underway to automatically localize and identify the origin of the modification and develop means of estimating the false discovery rate, which presents a challenge because multiple library peptides can be considered a correct identification of a single query spectrum.
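The two-step assignment described above, first proposing candidates by mass and then choosing among isobaric candidates by the localized residue, can be illustrated with a short sketch. The table is a small hypothetical subset built from masses quoted in this article rather than a full Unimod export, the function names are illustrative, and the 30 ppm tolerance is assumed here to be taken relative to the modification mass, which the text does not state explicitly.

```python
# Illustrative subset only: (monoisotopic mass in Da, label, source residue).
# A source residue of None means no residue restriction in this sketch.
MODS = [
    (14.01565, "Methyl", None),
    (14.01565, "Asp->Glu", "D"),
    (14.01565, "Gly->Ala", "G"),
    (14.01565, "Ser->Thr", "S"),
    (14.01565, "Val->Xle", "V"),
    (14.01565, "Asn->Gln", "N"),
    (-29.9742, "Glu->Val", "E"),
    (29.9742, "Val->Glu", "V"),
    (79.9663, "Phospho", None),
]

def candidates(delta_mass, tol_ppm=30.0):
    """All table entries whose mass agrees with delta_mass within tol_ppm."""
    return [(name, res) for mass, name, res in MODS
            if abs(delta_mass - mass) <= abs(mass) * tol_ppm * 1e-6]

def resolve(delta_mass, residue, tol_ppm=30.0):
    """Prefer a substitution whose source residue is the localized residue."""
    hits = candidates(delta_mass, tol_ppm)
    specific = [name for name, res in hits if res == residue]
    generic = [name for name, res in hits if res is None]
    return specific or generic
```

With this table, `resolve(14.0157, "V")` returns the Val → Xle substitution, matching the example in the text, whereas the same DeltaMass localized to a residue with no listed substitution falls back to methylation.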



RESULTS

The Hybrid Mass Spectral Library Search Method Identifies Previously Unidentified Spectra

The search method has been evaluated using high-resolution recurrent unidentified spectra obtained from CPTAC32 analyses of both samples analyzed by The Cancer Genome Atlas (TCGA) Program and the human-in-mouse xenograft reference standard (CompRef). The recurrent unidentified spectra were clustered into consensus spectra using in-house tools and searched against the publicly available human and mouse iTRAQ libraries37 using the hybrid spectral library method with a match factor cutoff of 600 and the requirement that a spectral match be of the same charge state as the query precursor. The calculated mass difference between the query spectrum and the library spectrum precursor, termed DeltaMass, for each

Sequence Database Analysis of CPTAC Recurrent Unidentified Spectra

As part of the evaluation of hybrid mass spectral library search identifications, identifications were compared to sequence database search results obtained from Mascot error-tolerant searching18 and MS-GF+.5 When possible, scores obtained from sequence database searches are provided in addition to the match factor obtained from the hybrid mass spectral library search. A Mascot error-tolerant search18 was performed on the CPTAC recurrent unidentified consensus spectra for tryptic peptides with up to three missed cleavages. A precursor mass


Figure 3. (A) Effect of loss of Pro on the match factor (i) for an exact MS/MS ion search without the loss of Pro for the peptide V(iTRAQ)ALLEDVNR(iTRAQ) (charge = 2, DeltaMass = 0.0028 Da) and (ii) a hybrid identification that includes a loss of Pro for the peptide V(iTRAQ)AIPIEDVNR(iTRAQ) (charge = 2, DeltaMass = −97.0500 Da). (B) Effect of the amino acid substitution Leu → Val on the match factor (i) for an exact MS/MS ion search without the substitution for the peptide F(iTRAQ)C(CAM)FEVVSPTK(iTRAQ) (charge = 3, DeltaMass = 0.0003 Da) and (ii) a hybrid identification that includes the Leu → Val substitution for the peptide F(iTRAQ)C(CAM)FEVLSPTK(iTRAQ) (charge = 3, DeltaMass = −14.0154 Da).

Information). The DeltaMass distribution of hybrid identifications with a match factor cutoff of 600 is shown in Figure 2a. A majority of the hybrid identifications (56%) were

hybrid identification was assigned to possible modifications based on unique monoisotopic masses available from Unimod with a mass tolerance of 30 ppm (Table S1 in the Supporting


Figure 4. Hybrid spectral library identification of the query spectrum (top) to the library spectrum (bottom) corresponding to the peptide sequence V(iTRAQ)AIEHLDK(iTRAQ) (charge = 3, match factor = 827) with a DeltaMass of −29.9736 Da. The head-to-tail plot shows matching unshifted product ions in blue, unmatching unshifted product ions in gray, and matching shifted product ions in pink. Here the matching shifted b4 and b5 ions indicate that the modification is located on Glu at position 3 and corresponds to a Glu → Val SNP.
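The localization logic used in the manual inspection can be expressed as a set intersection: a product ion is shifted exactly when it contains the modified residue. The sketch below is illustrative only. The function name is hypothetical, positions are counted from zero (which appears consistent with the positions quoted in the figure captions), and the unshifted b3 ion in the usage example is an assumed observation, since the caption lists only the shifted ions.

```python
def localize(length, shifted_b=(), unshifted_b=(), shifted_y=(), unshifted_y=()):
    """Return the 0-based residue positions consistent with the ion evidence.

    b_n spans residues 0..n-1 and y_m spans residues length-m..length-1.
    Shifted ions must contain the modified residue; unshifted ions must not.
    """
    positions = set(range(length))
    for n in shifted_b:
        positions &= set(range(0, n))
    for n in unshifted_b:
        positions &= set(range(n, length))
    for m in shifted_y:
        positions &= set(range(length - m, length))
    for m in unshifted_y:
        positions &= set(range(0, length - m))
    return sorted(positions)

# Figure 4 peptide: shifted b4 and b5, plus an assumed unshifted b3.
peptide = "VAIEHLDK"
print(localize(len(peptide), shifted_b=[4, 5], unshifted_b=[3]))
```

A single surviving position corresponds to an unambiguous localization; when several positions remain, as happens if only the shifted ions are supplied, the DeltaMass would be annotated as ambiguous.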

identified with a DeltaMass within ±0.005 Da (Figure S1 in the Supporting Information), which is consistent with previous reports.7,20 In addition to several common modifications, such as methylation, oxidation, and dehydration, the histogram highlights several amino acid insertions and deletions. Additionally, Figure 2b, with a mass bin of 0.001 Da, demonstrates that the DeltaMass that appears to correspond to the addition of an iTRAQ 4-plex label does in fact have a mean of 144.1021 Da, which is within 5 ppm of the monoisotopic mass. As discussed earlier, one advantage of spectral-library-based searching is the similarity-based scoring. Because of the requirement that the query precursor and library precursor be of the same charge state, modifications most likely to influence peptide fragmentation are expected to include those that change the number of basic residues in the peptide sequence and the addition or loss of Pro.38 In order to evaluate the effect of modifications on the product ion intensity and therefore the spectral similarity, the effects of two modifications, loss of Pro and Leu → Val substitution, on the match factor are presented in Figure 3. The loss of Pro (localization shown in bold), which can produce a dramatic effect on the product ion intensity,38 resulted in a match factor decrease of 392 between the exact match (V(iTRAQ)ALLEDVNR(iTRAQ), charge = 2, DeltaMass = 0.0028 Da, match factor = 932) and the hybrid identification (V(iTRAQ)AIPIEDVNR(iTRAQ), charge = 2, DeltaMass = −97.0500 Da, match factor = 540). The amino acid substitution Leu → Val (localization shown in bold), which corresponds to the loss of CH2, resulted in a match factor decrease of 11 between the exact match (FCFEVVSPTK, charge = 3, DeltaMass = 0.0003, match factor = 929) and the hybrid identification (FCFEVLSPTK, charge = 3, DeltaMass = −14.0154, match factor = 918). 
Together, these results show that this method would be less reliable in identifying SNPs involving Pro and that more typical SNPs such as the substitution Leu → Val have relatively small effects on hybrid search scores.
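The mass accuracy and match factor figures quoted in this section can be checked with simple arithmetic. In the sketch below, the function name is hypothetical and the iTRAQ 4-plex monoisotopic mass of 144.102063 Da is the Unimod value, which is not stated in the text; the remaining numbers are taken from the paragraph above.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million relative to the theoretical mass."""
    return (observed - theoretical) / theoretical * 1e6

ITRAQ4_MONO = 144.102063  # Da; Unimod value, an assumption not given in the text

itraq_error = ppm_error(144.1021, ITRAQ4_MONO)  # mean DeltaMass from Figure 2b
pro_loss_drop = 932 - 540                       # exact match vs hybrid, loss of Pro
leu_val_drop = 929 - 918                        # exact match vs hybrid, Leu -> Val
```

Under these assumptions the iTRAQ 4-plex DeltaMass deviates by well under 1 ppm, and the two score differences reproduce the decreases of 392 and 11 reported for loss of Pro and for Leu → Val.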

are unique to the Broad Institute of MIT and Harvard (Broad), Johns Hopkins School of Medicine (JHU), and Pacific Northwest National Laboratory (PNNL). For example, the DeltaMass corresponding to desulfurization of cysteine residues caused by protein reduction in the presence of tris(2-carboxyethyl)phosphine hydrochloride (TCEP) at elevated temperatures39 is apparent in Figure S2b. Additionally, a peak at the DeltaMass corresponding to phosphorylation (79.9663 Da) is uniquely present, as shown in Figure S2a. Figure S2c also contains a greater number of spectra identified with a DeltaMass of 14.0157 Da, corresponding to methylation of Asp and Glu residues.

The Hybrid Mass Spectral Library Search Method Can Identify SNPs

Because of the nature of the data used in the analysis and the ability of the method to identify unexpected modifications, identifications corresponding to an SNP with a unique monoisotopic mass, which cannot correspond to a known isobaric modification, were chosen to evaluate the method. In this case, the forward and reverse direction of Glu → Val was selected for evaluation. From the hybrid identifications, a total of 13 unique peptides containing a Glu → Val (−29.9742 Da) modification and eight unique peptides containing a Val → Glu (+29.9742 Da) modification were identified following visual examination of the spectra with the requirement that the localization of the DeltaMass be unambiguous. The spectral match shown in Figure 4 demonstrates how the observed DeltaMass corresponding to Glu → Val can be visually localized, as the originally unmatching product ions in the library spectrum (or ghost peaks) are shown in gray and matching product ions that have been shifted by the observed DeltaMass are shown in pink. In this example, both the b4 and b5 ions match after the product ions are shifted by the observed DeltaMass, which unambiguously indicates that the DeltaMass is located on Glu at position 3. Following the successful demonstration of SNP identification using the hybrid mass spectral library search method, the same method was applied to all possible amino acid substitutions. The distribution of amino acid substitutions corresponding to nonsynonymous SNPs, following normalization for the

The Hybrid Mass Spectral Library Search Method Can Identify Modifications Caused by Experimental Artifacts

A comparison of the DeltaMass distributions across laboratories, shown in Figure S2, can highlight modifications that


Figure 5. Distribution of unique identifications per SNP following normalization based on the known frequency of the substituted amino acid.48 *The SNP Asp → Glu may also correspond to methylation of Asp, as the two modifications are isobaric (+14.01565 Da).

Figure 6. Hybrid spectral library identification of the query spectrum (top) to the library spectrum (bottom) corresponding to the peptide sequence E(iTRAQ)SK(iTRAQ)PLTAQQTTK(iTRAQ) (charge = 2, match factor = 679) with a DeltaMass of −158.154 Da. The head-to-tail plot shows matching unshifted product ions in blue, unmatching unshifted product ions in gray, and matching shifted product ions in pink. Here the matching shifted b2 and b4 through b11 product ions as well as the matching y14 product ion indicate that the DeltaMass is localized at the missed cleavage, Lys at position 2, and corresponds to Lys + iTRAQ 4-plex → Asn.

expected frequency of the variant amino acid, is shown in Figure 5. A total of 74 SNPs were unambiguously identified using the hybrid mass spectral library search method. Of the 67 possible SNPs that were not identified, 36 contain Arg and 12 contain Lys. An amino acid substitution of a C-terminal Lys or Arg would generate different tryptic peptides not identifiable by the hybrid search. In the case of Arg or Lys that is not located at the C-terminus, for example due to a missed cleavage, the intensities of the resulting product ions may change, which may also cause the substitution to remain unidentifiable.40 The hybrid identification shown in Figure 6 demonstrates that a peptide E(iTRAQ)SK(iTRAQ)PLTAQQTTK(iTRAQ) (charge = 2, match factor = 679, DeltaMass = −158.154 Da) containing a missed cleavage at Lys-Pro has been identified with a substitution of Lys + iTRAQ 4-plex → Asn (localization shown in bold), which may be due to an SNP; however, confident identifications containing an amino acid substitution of Lys or Arg with unambiguous localization of the DeltaMass

are relatively rare. This may be due to changes in the intensities of the product ions as a result of the substitution. Moreover, a total of 138 unique peptides corresponding to 32 unique amino acid substitutions were identified using the hybrid mass spectral library search method that cannot be due to SNPs. A hybrid spectral library match corresponding to a Glu → Xle substitution is shown in Figure 7, in which the shifted b6, b7, and y8 product ions unambiguously indicate that the DeltaMass is localized on Glu at position 5. Additional evidence for this identification has been obtained from a database search using MS-GF+5 with a human Ensembl protein fasta file (E value = 5.4 × 10^−14), indicating that the spectrum was not originally assigned to the correct peptide sequence because it was absent from the protein database used. From the 378 962 total hybrid identifications, 5803 spectra were found to have DeltaMass values that might correspond to the replacement of one amino acid by another. These were manually examined and found to correspond to 4872 unique


Figure 7. Hybrid spectral library identification of the query spectrum (top) to the library spectrum (bottom) corresponding to the peptide sequence A(iTRAQ)YLEGEC(Carbamidomethyl)VEWLR (charge = 3, match factor = 898) with a DeltaMass of −15.9588 Da. The head-to-tail plot shows matching unshifted product ions in blue, unmatching unshifted product ions in gray, and matching shifted product ions in pink. Here the matching shifted b6, b7, and y8 product ions indicate that the DeltaMass is located on Glu at position 5 and corresponds to a Glu → Xle substitution. Although this modification is not due to an SNP, this spectrum was also assigned to the same peptide sequence by MS-GF+ using a human Ensembl protein database with an E value of 5.4 × 10^−14.

Figure 8. Numbers of manually confirmed hybrid identifications above a match factor of 600. The total numbers of peptides are shown in gray (5803), and those whose precise localizations were possible are shown in orange (4018). At these scores no false positive identifications were evident by manual examination.

spectral library search method. While these sequence “modifications” are visually abundant in Figure 2, since they are not connected with a single residue, database search methods may miss them, as they are not present in Unimod and not otherwise identifiable by present sequence identification methods. The hybrid spectral library identification shown in Figure 9 illustrates a peptide that was identified with a loss of Ala. This case illustrates another example of the dependence of automated search strategies on the completeness of protein databases. Amino acids for which DeltaMass values corresponding to an insertion or deletion were observed were all neutral and hydrophobic (Ala, Thr, Leu, Ile, Gly, and Val), which may be logical as a modification because these amino acids will not change the charge of the peptide and

peptide identifications, from which 3457 (71%) of the hybrid identifications contained a DeltaMass that could be localized to a single position in the peptide sequence (Table S1 and Figure 8), yielding 139 unique peptide modifications. The localizations of the remaining 29% of the unique hybrid method identifications were flagged as ambiguous following manual inspection. While less than half (41%) of all possible amino acid substitutions theoretically correspond to a possible SNP, 87% of hybrid identifications corresponded to a potential SNP (Table S2).

The Hybrid Mass Spectral Library Search Method Can Identify Amino Acid Insertions and Deletions

Amino acid insertions and deletions, while not considered “modifications”, can also be identified using the hybrid mass


Figure 9. Hybrid spectral library identification of the query spectrum (top) to the library spectrum (bottom) corresponding to the peptide sequence E(iTRAQ)VIYDMLNALAAYHAPEEADK(iTRAQ) (charge = 4, match factor = 815) with a DeltaMass of −71.0352 Da. The head-to-tail plot shows matching unshifted product ions in blue, unmatching unshifted product ions in gray, and matching shifted product ions in pink. Here the matching shifted product ions from y3 through y14 indicate that the DeltaMass is located on Ala at position 18 and corresponds to a loss of alanine.
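The DeltaMass in Figure 9 can be checked against single-residue insertions and deletions with a few lines of arithmetic. The sketch below is illustrative only: `indel_candidates`, the 0.005 Da tolerance, and the table of standard monoisotopic residue masses are assumptions made here and are not part of the published method.

```python
# Standard monoisotopic residue masses in Da (unmodified residues; an
# assumption for illustration, not values taken from the paper).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def indel_candidates(delta_mass, tol=0.005):
    """Return sorted ("deletion" | "insertion", residue) pairs for which a
    single-residue loss or gain explains delta_mass within tol (Da)."""
    hits = []
    for residue, mass in RESIDUE_MASS.items():
        if abs(delta_mass + mass) <= tol:    # a residue was lost
            hits.append(("deletion", residue))
        elif abs(delta_mass - mass) <= tol:  # a residue was gained
            hits.append(("insertion", residue))
    return sorted(hits)


# DeltaMass reported in the Figure 9 caption.
print(indel_candidates(-71.0352))
# Hypothetical DeltaMass for a Met insertion.
print(indel_candidates(131.0405))
```

The reported −71.0352 Da lies within about 0.002 Da of the loss of one Ala residue, and no other single residue falls inside the assumed tolerance. A DeltaMass of +113.0841 Da would return both Leu and Ile, since the two are isobaric.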

incomplete protein database and database search parameters that do not include the correct peptide or modification.8−10 The results also show that a large fraction of unidentified spectra have sequence similarity to peptides that are identified. In addition to highlighting the experimental and bioinformatic sources of unidentified spectra, illustrated in Figure 2, the evaluation has demonstrated additional advantages of the hybrid mass spectral library search method over open or “blind” modification searches of high-resolution tandem mass spectra, including (1) scoring through the use of peaks containing and not containing the modification, (2) a simple and computationally inexpensive approach, and (3) the identification of amino acid insertions and deletions. Additionally, the hybrid mass spectral library search method can be useful for cross-species spectral library searches. The first advantage of the hybrid mass spectral library search method over similar methods available for high-resolution product ion spectra, such as the open or “blind” modification search, where a wide precursor tolerance is used to allow for unexpected modifications, is that the product ions containing the unexpected modification are considered in the final match factor of the spectral match. Figure 1 illustrates the effect of including product ions that match after shifting by the normalized DeltaMass in the final match factor for a peptide that contains an N-terminal modification. In this example, the effect of including the shifted product ions in the final score for the spectral match to M(iTRAQ)VVPVAALFTPLK(iTRAQ) (charge = 2, DeltaMass = −101.0974), which contains an N-terminal carbamylation of Met and loss of iTRAQ 4-plex, resulted in a match factor increase of 372. The dramatic increase in match factor observed in this example corresponds to the gain of the shifted b-ion series (b1 through b3 and b5 through b7).
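The idea of counting both unshifted and DeltaMass-shifted product ions toward a match can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the NIST MS Search or MSPepSearch implementation: the function names, the fixed 0.01 m/z tolerance, the peak lists, and the use of a matched-intensity fraction in place of the actual match factor are all hypothetical. "Normalized DeltaMass" is read here as DeltaMass divided by the product ion charge.

```python
from bisect import bisect_left


def has_peak(sorted_mz, target, tol):
    """True if sorted_mz contains a value within tol of target."""
    i = bisect_left(sorted_mz, target - tol)
    return i < len(sorted_mz) and sorted_mz[i] <= target + tol


def hybrid_match(query_mz, library_peaks, delta_mass, tol=0.01):
    """Split library peaks into direct, shifted, and unmatched groups.

    library_peaks: iterable of (mz, intensity, charge) tuples.
    A library peak matches directly if the query has a peak at the same m/z,
    and matches as shifted if the query has a peak at mz + delta_mass/charge.
    """
    q = sorted(query_mz)
    direct, shifted, unmatched = [], [], []
    for mz, intensity, charge in library_peaks:
        if has_peak(q, mz, tol):
            direct.append((mz, intensity))
        elif has_peak(q, mz + delta_mass / charge, tol):
            shifted.append((mz, intensity))
        else:
            unmatched.append((mz, intensity))
    return direct, shifted, unmatched


def matched_fraction(groups, library_peaks):
    """Fraction of total library intensity contained in the given groups."""
    total = sum(intensity for _, intensity, _ in library_peaks)
    matched = sum(intensity for group in groups for _, intensity in group)
    return matched / total


# Synthetic example (not data from the paper): three library ions carry the
# modification site, one does not.
library = [(200.0, 50.0, 1), (300.0, 100.0, 1), (400.0, 80.0, 1), (250.5, 40.0, 2)]
query = [200.0, 284.0, 384.0, 242.5]
direct, shifted, unmatched = hybrid_match(query, library, delta_mass=-16.0)
print(matched_fraction([direct], library))           # direct matches only
print(matched_fraction([direct, shifted], library))  # direct plus shifted
```

In this toy case only one library peak matches without shifting, whereas all four are accounted for once the shifted peaks are included, which mirrors the kind of score gain described for Figure 1.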
The head-to-tail plots of the hybrid spectral library identifications, shown in Figures 1, 3, 4, 6, 7, and 9, highlight how an experienced user can use the product ion spectrum to localize the DeltaMass. Because the originally unmatched product ions are shown as ghost peaks (gray) and the shifted

consequently will produce mass-shifted patterns with similar intensities.

The Hybrid Spectral Library Method Can Be Useful for Cross-Species Searching

The hybrid method can be useful for identification of otherwise unidentifiable peptides for a species having a limited spectral library by use of a more complete library from another species. To demonstrate this application, we examined spectra unidentified by an MS-GF+ analysis of unlabeled CHO tryptic peptides using a UniProt Cricetulus griseus (CHO) protein database41 (see Supplemental Methods) with a spectral library of tryptic peptides derived from human samples. This illustrates how the hybrid method can serve to identify peptides using comprehensive libraries from other species with some degree of sequence homology. In this sense, the hybrid mass spectral library search has some similarity to the widely used BLAST42 search for genomics sequence homology. The spectra shown in Figure S3 illustrate how the hybrid method was again able to identify peptides with amino acid insertions and substitutions in a cross-species search. The spectrum shown in Figure S3a corresponds to a hybrid identification of an unknown precursor from a CHO tryptic digest as a match for a human peptide, DAVDLMK, with a Met insertion, which was supported by identifications from both Mascot error-tolerant search (score = 36.53) and MS-GF+ (E value = 5.4 × 10⁻¹⁰). Additionally, Figure S3c highlights how the hybrid method was able to identify a spectral match for a CHO peptide as the human peptide IVLYAK containing a Tyr → Val substitution.



DISCUSSION

The application of the in-source search available in both NIST MS Search24 and MSPepSearch,25 here called the hybrid mass spectral library search, has successfully been used in recovering peptide identifications from high-quality recurrent unidentified spectra obtained from CPTAC32 and CHO tryptic digests. The evaluation of this method has shown how the hybrid mass spectral library search method can address two of the main reasons that high-quality spectra are not identified: an


large human spectral library are shown in Figure S3. The ability to take advantage of the large high-resolution human spectral library37 that is publicly available and identify unexpected modifications, in a manner that is similar to a BLAST42 search, makes the hybrid mass spectral library search method widely applicable. Here, identifications of unknown, unlabeled CHO peptides from the human spectral library included modifications such as Glu and Met insertions and Tyr → Val and Ala → Thr substitutions (Figure S3). With as many as 75% of MS/MS spectra remaining unidentified in a given proteomic experiment, it is logical that unidentified spectra have become the focus of many recent reports.7,8,20,31 Here we have shown how the hybrid mass spectral library search method is uniquely able to take advantage of the high mass accuracy information available in high-resolution MS/MS spectra through using the DeltaMass information available to shift the unmatching “ghost” peaks and by considering the similarity gained from the shifted peaks in the final spectral match score. Manual inspection of 5803 hybrid identifications did not reveal any false positives with a match factor cutoff of 600. While it is not possible to estimate the false discovery rate of manual examination, the requirement that the DeltaMass can be localized to an unambiguous position in the peptide sequence is expected to produce an acceptably small error rate. Future versions of the hybrid spectral library search method will be able to provide localization information from an analysis of shifted and unshifted y and b ions; however, the ability to compare library and query spectra will enable the analyst to validate any result reported by the search. In fact, because this library identification always involves comparison of two actual spectra, with shifted peaks clearly labeled, this is in some ways equivalent to comparison to a synthesized peptide.
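Localization from shifted and unshifted b and y ions, as described above, reduces to intersecting index ranges. The following is a minimal sketch of that reasoning and not the planned implementation: the function names are hypothetical, and residue positions are counted from 0, an assumption made here because it reproduces the positions quoted in the Figure 7 and Figure 9 captions.

```python
def localize(n, shifted_b=(), unshifted_b=(), shifted_y=(), unshifted_y=()):
    """Return the 0-based residue indices that could carry the DeltaMass.

    n is the library peptide length. b_k covers residues 0..k-1 and
    y_k covers residues n-k..n-1. A shifted ion must contain the modified
    residue, and an unshifted ion must not contain it.
    """
    lo, hi = 0, n - 1
    for k in shifted_b:
        hi = min(hi, k - 1)
    for k in unshifted_b:
        lo = max(lo, k)
    for k in shifted_y:
        lo = max(lo, n - k)
    for k in unshifted_y:
        hi = min(hi, n - k - 1)
    return list(range(lo, hi + 1))


def restrict_to_residue(sequence, positions, residue):
    """Keep only the candidate positions occupied by the given residue."""
    return [i for i in positions if sequence[i] == residue]


# Figure 7: shifted b6, b7, and y8 on AYLEGECVEWLR (labels omitted).
seq7 = "AYLEGECVEWLR"
window7 = localize(len(seq7), shifted_b=[6, 7], shifted_y=[8])
print(window7, restrict_to_residue(seq7, window7, "E"))

# Figure 9: shifted y3 through y14 on EVIYDMLNALAAYHAPEEADK (labels omitted).
seq9 = "EVIYDMLNALAAYHAPEEADK"
window9 = localize(len(seq9), shifted_y=range(3, 15))
print(window9, restrict_to_residue(seq9, window9, "A"))
```

Using only the shifted ions named in the captions, the candidate window still spans more than one residue in both cases. Combining it with the residue implied by the DeltaMass (Glu for the substitution, Ala for the deletion) narrows each window to a single position, 5 and 18 respectively under the 0-based assumption.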
This method may be used as a tool to evaluate current sample preparation and bioinformatic methods as well as provide identification information following visual examination. In summary, the hybrid search described here represents a major new way of identifying peptides based on matching peptide spectra already reported in a library without limiting the mass or position of any chemical entity attached or inserted in the sequence. As libraries grow in quality and coverage, one can expect such libraries to be essential tools in elucidating the “dark matter” of proteomics.

product ions are shown in pink, the pink product ions are expected to correspond to the modification responsible for the precursor mass shift. While it is possible that an observed DeltaMass may correspond to a combination of modifications, all hybrid identifications reported were limited to one modification per DeltaMass. It is expected that as the spectral library grows in size and new modifications are added, DeltaMasses corresponding to multiple modifications on a single peptide may be identified. On the basis of the number of hybrid identifications with DeltaMasses ≥ 200 Da and ≤ −200 Da, approximately 6% of hybrid identifications are expected to correspond to multiple amino acid insertions/deletions (Table S2). Second, the hybrid mass spectral library search method provides a relatively simple and computationally inexpensive general approach for the identification of less common peptide variants present in high-quality unidentified spectra. Because of the scope of the search space of open or “blind” modification and de novo methods, these approaches are typically utilized after a “closed” database search is performed;8,20 however, the hybrid mass spectral library search method is able to identify exact matches (DeltaMass within 10 ppm) as well as DeltaMass values corresponding to unexpected modifications, as shown in Figure 2. This significantly reduces the computational time required to perform the search. A hybrid spectral library search of 83 900 MS/MS spectra against both human (1 201 632 spectra) and mouse (91 068 spectra) iTRAQ libraries using preindexed peaks completed in 5 h, 18 min on a personal computer. A third advantage of the hybrid mass spectral library search method (because it does not rely on a predefined list of possible modifications) is its ability to identify amino acid insertions and deletions, although deletions may also include in-source fragmentation. 
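The distinction drawn above between exact matches (DeltaMass within 10 ppm) and matches with a nonzero DeltaMass can be illustrated with a short calculation. This is a sketch under stated assumptions: the function names are hypothetical, the query and library precursors are taken to have the same charge, and the m/z values are invented for the example.

```python
def precursor_delta_mass(query_mz, library_mz, charge):
    """Neutral mass difference in Da between query and library precursors,
    assuming both have the same charge state."""
    return (query_mz - library_mz) * charge


def ppm_error(query_mz, library_mz):
    """Absolute precursor m/z difference in parts per million."""
    return abs(query_mz - library_mz) / library_mz * 1e6


def classify(query_mz, library_mz, charge, ppm_tol=10.0):
    """Label a candidate as "exact" if the precursors agree within ppm_tol,
    otherwise as "hybrid", and report the DeltaMass in Da."""
    delta = precursor_delta_mass(query_mz, library_mz, charge)
    label = "exact" if ppm_error(query_mz, library_mz) <= ppm_tol else "hybrid"
    return label, delta


# Invented doubly charged precursors: 8 ppm apart, then about 16 Da apart.
print(classify(500.004, 500.0, charge=2))
print(classify(492.0206, 500.0, charge=2))
```

Because both cases fall out of the same precursor comparison, no separate "closed" search pass is needed before looking for unexpected mass differences, which is the point made in the text about computational cost.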
The hybrid spectral library identification shown in Figure 9 highlights the ability to identify an internal amino acid loss, which in this case was not previously identified because the sequence was absent from the protein database used. The loss of Ala at position 18 in this spectrum causes a dramatic shift in the y-ion series from the y3 product ion through y14. Figure 9 also demonstrates how the user interface for MS Search can aid in the localization of modifications, particularly in the identification of amino acid insertions and deletions. In the case of the amino acid deletion shown in Figure 9, the user can see how the shifted y3 product ion now shares the same m/z value as the original y2 product ion. Similarly, for amino acid insertions the user interface can be used to see that an insertion, e.g., between b2 and b3, would result in the presence of a matching unshifted b2 product ion and a matching shifted b2 product ion followed by a shifted b3 product ion. Lastly, we demonstrate how the hybrid mass spectral library method is useful for cross-species searching, which has previously been used in searching unidentified spectra from an organism whose genome is not yet sequenced against the protein sequence database of an organism whose genome has been sequenced.43−47 In this application, the ability of the hybrid mass spectral library search method to identify amino acid substitutions, insertions, and deletions can be useful for the identification of previously unidentified spectra belonging to a species with a limited reference spectral library by searching against a species with high sequence homology and a more comprehensive library. Identifications obtained from a cross-species search of unlabeled CHO tryptic peptides against the



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.6b00988.

Supplemental methods; DeltaMass histogram for all hybrid identifications from the recurrent unidentified spectra in the CPTAC library (Figure S1); distribution of DeltaMass values as determined by the hybrid search for the analysis of TCGA samples at the Broad, JHU, and PNNL (Figure S2); hybrid spectral library search identifications of unlabeled recurrent spectra from CHO tryptic peptides (Figure S3) (PDF)

Summary of the modifications identified per unique peptide in recurrent unidentified spectra from analyses of TCGA and CompRef at Broad, JHU, and PNNL (Table S1); summary of hybrid identifications with a DeltaMass outside of ±0.005 Da and a match factor greater than 600 (Table S2) (XLSX)




(14) Fischer, B.; Roth, V.; Roos, F.; Grossmann, J.; Baginsky, S.; Widmayer, P.; Gruissem, W.; Buhmann, J. M. NovoHMM: A Hidden Markov Model for de Novo Peptide Sequencing. Anal. Chem. 2005, 77 (22), 7265−7273. (15) Taylor, J. A.; Johnson, R. S. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 1997, 11 (9), 1067−1075. (16) Mann, M.; Wilm, M. Error-Tolerant Identification of Peptides in Sequence Databases by Peptide Sequence Tags. Anal. Chem. 1994, 66 (24), 4390−4399. (17) Tabb, D. L.; Ma, Z.-Q.; Martin, D. B.; Ham, A.-J. L.; Chambers, M. C. DirecTag: Accurate Sequence Tags from Peptide MS/MS through Statistical Scoring. J. Proteome Res. 2008, 7 (9), 3838−3846. (18) Creasy, D. M.; Cottrell, J. S. Error tolerant searching of uninterpreted tandem mass spectrometry data. Proteomics 2002, 2 (10), 1426−1434. (19) Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Identification of post-translational modifications via blind search of mass-spectra. In Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference (CSB’05); IEEE: Piscataway, NJ, 2005; pp 157−166. (20) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 2015, 33 (7), 743−749. (21) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Building consensus spectral libraries for peptide identification in proteomics. Nat. Methods 2008, 5 (10), 873−875. (22) Zhang, X.; Li, Y.; Shao, W.; Lam, H. Understanding the improved sensitivity of spectral library searching over sequence database searching in proteomics data analysis. Proteomics 2011, 11 (6), 1075−1085. (23) Stein, S. E. Chemical substructure identification by mass spectral library searching. J. Am. Soc. Mass Spectrom. 1995, 6 (8), 644−655. 
(24) The NIST Mass Spectral Search Program for the NIST/EPA/NIH Mass Spectral Library (version 2.2, build May 18, 2016). http://chemdata.nist.gov/dokuwiki/doku.php?id=peptidew:nistmssearch. (25) The NIST MSPepSearch Mass Spectral Library Search Program (Version 0.95 Build May 17, 2016). http://chemdata.nist.gov/dokuwiki/doku.php?id=peptidew:mspepsearch. (26) Ahrné, E.; Nikitin, F.; Lisacek, F.; Müller, M. QuickMod: A Tool for Open Modification Spectrum Library Searches. J. Proteome Res. 2011, 10 (7), 2913−2921. (27) Ye, D.; Fu, Y.; Sun, R.-X.; Wang, H.-P.; Yuan, Z.-F.; Chi, H.; He, S.-M. Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics 2010, 26 (12), i399−i406. (28) Ross, P. L.; Huang, Y. N.; Marchese, J. N.; Williamson, B.; Parker, K.; Hattan, S.; Khainovski, N.; Pillai, S.; Dey, S.; Daniels, S.; Purkayastha, S.; Juhasz, P.; Martin, S.; Bartlet-Jones, M.; He, F.; Jacobson, A.; Pappin, D. J. Multiplexed Protein Quantitation in Saccharomyces cerevisiae Using Amine-reactive Isobaric Tagging Reagents. Mol. Cell. Proteomics 2004, 3 (12), 1154−1169. (29) Thompson, A.; Schäfer, J.; Kuhn, K.; Kienle, S.; Schwarz, J.; Schmidt, G.; Neumann, T.; Hamon, C. Tandem Mass Tags: A Novel Quantification Strategy for Comparative Analysis of Complex Protein Mixtures by MS/MS. Anal. Chem. 2003, 75 (8), 1895−1904. (30) Zhang, Z.; Yang, X.; Mirokhin, Y. A.; Tchekhovskoi, D. V.; Ji, W.; Markey, S. P.; Roth, J.; Neta, P.; Hizal, D. B.; Bowen, M. A.; Stein, S. E. Interconversion of Peptide Mass Spectral Libraries Derivatized with iTRAQ or TMT Labels. J. Proteome Res. 2016, 15 (9), 3180−3187. (31) Ma, C. W. M.; Lam, H. Hunting for Unexpected Post-Translational Modifications by Spectral Library Searching with Tier-Wise Scoring. J. Proteome Res. 2014, 13 (5), 2262−2271. (32) Rudnick, P. A.; Markey, S. P.; Roth, J.; Mirokhin, Y.; Yan, X.; Tchekhovskoi, D. V.; Edwards, N. J.; Thangudu, R. 
R.; Ketchum, K. A.; Kinsinger, C. R.; Mesri, M.; Rodriguez, H.; Stein, S. E. A Description of the Clinical Proteomic Tumor Analysis Consortium

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

ORCID

Meghan C. Burke: 0000-0001-7231-0655

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We acknowledge support from the NIH/NCI CPTAC Program. Certain commercial equipment or materials are identified in this paper to specify the experimental procedure. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the materials or equipment identified are necessarily the best available for the purpose.



REFERENCES

(1) Aebersold, R.; Goodlett, D. R. Mass Spectrometry in Proteomics. Chem. Rev. 2001, 101 (2), 269−296. (2) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198−207. (3) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976−989. (4) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551−3567. (5) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277. (6) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. Proteomics 2007, 7 (5), 655−667. (7) Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; del-Toro, N.; Rurik, M.; Walzer, M.; Kohlbacher, O.; Hermjakob, H.; Wang, R.; Vizcaino, J. A. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 2016, 13 (8), 651−656. (8) Ning, K.; Fermin, D.; Nesvizhskii, A. I. Computational analysis of unassigned high-quality MS/MS spectra in proteomic data sets. Proteomics 2010, 10 (14), 2712−2718. (9) Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data toward more efficient identification of posttranslational modifications, sequence polymorphisms, and novel peptides. Mol. Cell. Proteomics 2006, 5 (4), 652−670. (10) Nielsen, M. L.; Savitski, M. M.; Zubarev, R. A. 
Extent of modifications in human proteome samples and their effect on dynamic range of analysis in shotgun proteomics. Mol. Cell. Proteomics 2006, 5 (12), 2384−2391. (11) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17 (20), 2337−2342. (12) Ma, B. Novor: Real-Time Peptide de Novo Sequencing Software. J. Am. Soc. Mass Spectrom. 2015, 26 (11), 1885−1894. (13) Frank, A.; Pevzner, P. PepNovo: De Novo Peptide Sequencing via Probabilistic Network Modeling. Anal. Chem. 2005, 77 (4), 964−973.


(CPTAC) Common Data Analysis Pipeline. J. Proteome Res. 2016, 15 (3), 1023−1032. (33) Edwards, N. J.; Oberti, M.; Thangudu, R. R.; Cai, S.; McGarvey, P. B.; Jacob, S.; Madhavan, S.; Ketchum, K. A. The CPTAC Data Portal: A Resource for Cancer Proteomics Research. J. Proteome Res. 2015, 14 (6), 2707−2713. (34) Yang, X.; Neta, P.; Stein, S. E. Quality Control for Building Libraries from Electrospray Ionization Tandem Mass Spectra. Anal. Chem. 2014, 86 (13), 6393−6400. (35) Stein, S. E. Estimating probabilities of correct identification from results of mass spectral library searches. J. Am. Soc. Mass Spectrom. 1994, 5 (4), 316−323. (36) Creasy, D. M.; Cottrell, J. S. Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4 (6), 1534−1536. (37) The NIST Libraries of Peptide Tandem Mass Spectra. http://chemdata.nist.gov/dokuwiki/doku.php?id=peptidew:cdownload. (38) Huang, Y.; Triscari, J. M.; Tseng, G. C.; Pasa-Tolic, L.; Lipton, M. S.; Smith, R. D.; Wysocki, V. H. Statistical Characterization of the Charge State and Residue Dependence of Low-Energy CID Peptide Dissociation Patterns. Anal. Chem. 2005, 77 (18), 5800−5813. (39) Wang, Z.; Rejtar, T.; Zhou, Z. S.; Karger, B. L. Desulfurization of cysteine-containing peptides resulting from sample preparation for protein characterization by mass spectrometry. Rapid Commun. Mass Spectrom. 2010, 24 (3), 267−275. (40) Tabb, D. L.; Huang, Y.; Wysocki, V. H.; Yates, J. R. Influence of Basic Residue Content on Fragment Ion Peak Intensities in Low-Energy Collision-Induced Dissociation Spectra of Peptides. Anal. Chem. 2004, 76 (5), 1243−1248. (41) UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015, 43, D204−D212. (42) Altschul, S. F.; Gish, W.; Miller, W.; Myers, E. W.; Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 1990, 215 (3), 403−410. (43) Huang, L.; Jacob, R. J.; Pegg, S. C.-H.; Baldwin, M. A.; Wang, C. C.; Burlingame, A. 
L.; Babbitt, P. C. Functional Assignment of the 20 S Proteasome from Trypanosoma brucei Using Mass Spectrometry and New Bioinformatics Approaches. J. Biol. Chem. 2001, 276 (30), 28327−28339. (44) Liska, A. J.; Shevchenko, A. Expanding the organismal scope of proteomics: Cross-species protein identification by mass spectrometry and its implications. Proteomics 2003, 3 (1), 19−28. (45) Mackey, A. J.; Haystead, T. A. J.; Pearson, W. R. Getting More from Less: Algorithms for Rapid Protein Identification with Multiple Short Peptide Sequences. Mol. Cell. Proteomics 2002, 1 (2), 139−147. (46) Taylor, J. A.; Johnson, R. S. Implementation and Uses of Automated de Novo Peptide Sequencing by Tandem Mass Spectrometry. Anal. Chem. 2001, 73 (11), 2594−2604. (47) Waridel, P.; Frank, A.; Thomas, H.; Surendranath, V.; Sunyaev, S.; Pevzner, P.; Shevchenko, A. Sequence similarity-driven proteomics in organisms with unknown genomes by LC-MS/MS and automated de novo sequencing. Proteomics 2007, 7 (14), 2318−2329. (48) Shen, S.; Kai, B.; Ruan, J.; Torin Huzil, J.; Carpenter, E.; Tuszynski, J. A. Probabilistic analysis of the frequencies of amino acid pairs within characterized protein sequences. Phys. A 2006, 370 (2), 651−662.
