VEMS 3.0: Algorithms and Computational Tools for Tandem Mass

VEMS 3.0: Algorithms and Computational Tools for Tandem Mass Spectrometry Based Identification of Post-translational Modifications in Proteins

Rune Matthiesen,* Morten Beck Trelle, Peter Højrup, Jakob Bunkenborg, and Ole N. Jensen

Department of Biochemistry & Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark

Received August 11, 2005

Protein and peptide mass analysis and amino acid sequencing by mass spectrometry are widely used for identification and annotation of post-translational modifications (PTMs) in proteins. Modification-specific mass increments, neutral losses or diagnostic fragment ions in peptide mass spectra provide direct evidence for the presence of post-translational modifications, such as phosphorylation, acetylation, methylation or glycosylation. However, the commonly used database search engines are not always practical for exhaustive searches for multiple modifications and concomitant missed proteolytic cleavage sites in large-scale proteomic datasets, since the search space is dramatically expanded. We present a formal definition of the problem of searching databases with tandem mass spectra of peptides that are partially (sub-stoichiometrically) modified. In addition, an improved search algorithm and a peptide scoring scheme that includes modification-specific ion information from MS/MS spectra were implemented and tested using the Virtual Expert Mass Spectrometrist (VEMS) software. A set of 2825 peptide MS/MS spectra was searched with 16 variable modifications and 6 missed cleavages. The scoring scheme returned a large set of post-translationally modified peptides including precise information on modification type and position. The scoring scheme was able to extract and distinguish the near-isobaric modifications of trimethylation and acetylation of lysine residues based on the presence and absence of diagnostic neutral losses and immonium ions. In addition, the VEMS software contains a range of new features for analysis of mass spectrometry data obtained in large-scale proteomic experiments. Windows binaries are available at http://www.yass.sdu.dk/.

Keywords: mass spectrometry • diagnostic ions • neutral losses • variable modifications • database searching • distributed computing

Introduction

Mass spectrometry has revolutionized protein chemistry and proteomics by allowing mass measurement and amino acid sequencing of nanogram levels of proteins and peptides. A particular advantage of tandem mass spectrometry is that it allows detection and identification of post-translational modifications by accurate measurements of modification-specific mass increments of intact peptides.1 Tandem mass spectrometry analysis of peptides generates MS/MS spectra that contain information about the molecular mass of the full length peptide (precursor ion mass) as well as the mass values of peptide fragment ions from which the partial or complete amino acid sequence of the peptide can be derived. This is usually achieved by manual interpretation or by computational matching of individual MS/MS fragment ion spectra to predicted fragmentation patterns for sequence entries in biological databases.2

* To whom correspondence should be addressed. Protein Research Group, Department of Biochemistry & Molecular Biology, University of Southern Denmark, Campusvej 55, DK-5230 Odense M, Denmark. Tel: +45 65 50 24 12. Fax: +45 65 93 26 61. E-mail [email protected].


Journal of Proteome Research 2005, 4, 2338-2347

Published on Web 11/02/2005

10.1021/pr050264q CCC: $30.25 © 2005 American Chemical Society

In addition to amino acid sequence patterns, tandem mass spectra contain signals that are due to amino acid specific neutral losses or molecular fragments such as immonium ions. Furthermore, post-translationally modified peptides usually generate modification-specific neutral loss signals and also modification-specific ion signals that can be used as diagnostic patterns for improved detection and identification of PTM’ed peptides. For example, phosphoserine usually exhibits a neutral loss of 98 Da (H3PO4) due to elimination of phosphoric acid during MS/MS of phosphopeptides,1 whereas acetylated lysine exhibits a diagnostic ion at m/z 126.0913 during MS/MS of peptides.3 Identification of post-translationally modified peptides in large-scale proteomics by searching protein sequence databases has, however, turned out to be a major challenge due to the large number of potential amino acid modifications.4 For computational purposes modifications are usually divided into ‘fixed modifications’ that always occur on a specified type of amino acid residue and ‘variable modifications’ that may sometimes occur on a specified amino acid.5 S-alkylation of proteins with iodoacetamide is a commonly employed sample preparation method that stoichiometrically converts all cysteine residues to S-carbamidomethylcysteine and this is then defined as a fixed modification of Cys in the database search algorithms. In contrast, in vivo phosphorylation is a site-specific and sub-stoichiometric modification of proteins and it is therefore defined as a variable (or partial) modification of Ser, Thr, and Tyr residues during database searches with mass spectrometry data. The advantage of fixed modifications is that no additional computational overhead is required for the search algorithm as compared to searching with no modifications, since a fixed modification corresponds to changing the apparent molecular mass of individual amino acid residues by a delta-mass value, i.e., +57 Da for conversion of all Cys residues to S-carbamidomethylcysteine. Several database search engines contain algorithms for testing all possible arrangements of variable modifications.5-7 Algorithms for searching modifications are very powerful for finding partial modifications such as phosphorylation,8 glycosylation,9 ICAT biotin tags for quantification,10 and residue substitutions.5 However, these algorithms are not very flexible and may not perform well when many variable modifications and missed proteolytic cleavage sites (sub-optimal enzyme performance) are defined. The latter is a common occurrence in PTM’ed peptides, because the chemical modifications mask the trypsin substrate site (Arg or Lys residues), thereby reducing the cleavage efficiency near modified residues. In addition, the previously reported search algorithms are not fully documented and a precise definition of the computational problem of searching for post-translationally modified peptides has not been published. Algorithms such as error tolerant search11 and the extended interpretation algorithm of VEMS12 (Virtual Expert Mass Spectrometrist) use a peptide sequence tag approach for locating modifications.
These algorithms are computationally efficient but will fail for peptides for which a tag cannot be obtained and they are often limited to definition of only one or a few modifications per peptide. The variable modification algorithms search all combinations of modifications in a peptide against all tandem mass spectra. Since this type of algorithm is computationally demanding it is an advantage to limit the full combinatorial search to only those proteins found by database-dependent search algorithms, such as VEMS,12 ProteinProspector,13 Mascot,5 Sequest,14 X! Tandem,7 and ProbID.6 A good approach is therefore to perform the initial database-dependent search (‘first pass search’) with a few variable modifications. The retrieved protein sequences are then passed to a ‘second pass search’ algorithm that makes all combinations of the tryptic peptides given a larger set of variable modifications and a larger number of missed cleavages.15,16
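The difference in cost between fixed and variable modifications described above can be illustrated with a minimal Python sketch. The residue masses are standard monoisotopic values and are not taken from this paper; the function names and the example peptide are hypothetical. A fixed modification only changes one entry of the residue mass table, whereas a variable modification adds one candidate parent mass per possible number of modified sites.

```python
# Monoisotopic residue masses in Da (standard values, not taken from the paper).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "C": 103.00919, "K": 128.09496, "R": 156.10111,
}
WATER = 18.01056
CARBAMIDOMETHYL = 57.02146   # the "+57 Da" fixed modification of Cys
PHOSPHO = 79.96634           # variable modification of Ser (Table 5)


def peptide_mass(sequence, fixed=None):
    """Neutral monoisotopic peptide mass; `fixed` maps residue -> delta mass."""
    table = dict(RESIDUE_MASS)
    for residue, delta in (fixed or {}).items():
        table[residue] += delta  # fixed modification: one table update
    return sum(table[aa] for aa in sequence) + WATER


def variable_parent_masses(sequence, residue, delta, fixed=None):
    """Parent masses for 0..k copies of one variable modification."""
    base = peptide_mass(sequence, fixed)
    sites = sequence.count(residue)
    return [base + k * delta for k in range(sites + 1)]
```

For the hypothetical peptide ACSK, `variable_parent_masses("ACSK", "S", PHOSPHO, {"C": CARBAMIDOMETHYL})` returns two candidate masses (unmodified and singly phosphorylated), while the fixed Cys modification adds no candidates at all.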

The expression for calculation of the number of possible modifications m of a peptide with length n is given by

$$m = \prod_{i=1}^{n} (V_i + 1) \qquad (1)$$

where n is the number of residues in the peptide and V_i is the number of possible variable modifications at residue i.16 For long peptides with many variable modifications expression (1) gives astronomical numbers and thereby a large number of iterations. It was suggested to stop the iteration when the calculated masses become larger than the largest measured mass.16 However, the algorithm will iterate over many combinations that have the same mass of the full length peptide but only differ in the position of the modifications. If a parent ion mass is already defined as absent then there is no purpose in iterating over all possibilities of modifications that generate the same parent ion mass. We have now developed a more efficient approach by dividing expression (1) into two terms: one that iterates over all possible masses and one that iterates over all possible positions of modifications.

Much of this work was prompted by analysis of a set of LC-MS/MS spectra of a protein mixture which mainly contains histone proteins that were found to be heavily post-translationally modified (Salcedo et al., manuscript in preparation). More than 16 variable amino acid modifications have been reported to occur in histone proteins,17 and the modifications on Arg and Lys residues give rise to numerous missed tryptic cleavage sites. The improved algorithm was implemented in the Virtual Expert Mass Spectrometrist (VEMS) program,12 and the performance was further improved by using distributed computing techniques. VEMS 3.0 contains a range of versatile tools for data validation and mining. An improved scoring function is based on analysis of the intensity of the different types of ions as in the earlier VEMS version,12 but here extended to consider also modification-specific diagnostic ions and all combinations of specified neutral losses. This improved scoring function was able to distinguish between the near-isobaric modifications lysine acetylation and lysine tri-methylation, which was not possible with any other available software. Windows binaries for VEMS 3.0 are available at http://www.yass.sdu.dk/.

Methods

Search Algorithm. The number of possible non-modified tryptic peptides t from a given protein containing C cleavable lysine and arginine residues is given by

$$t = (u + 1)(C + 1) - \frac{u^2 + u}{2} \qquad (2)$$

if the protein does not have an arginine or lysine residue at the carboxyl terminus. The variable u is the maximum number of missed cleavages being considered, with the boundary condition u ≤ C because the number of missed cleavage sites necessarily is less than or equal to the total number of cleavage sites. If the protein has arginine or lysine at the C-terminus then t is given by

$$t = C(u + 1) - \frac{u^2 + u}{2} \qquad (3)$$

The total number N of potential modified peptides from a protein, given a set of variable modifications and a maximum number of missed cleavages u, is given by eqs 1, 2, and 3

$$N = \sum_{j=1}^{t} \prod_{i=1}^{n} (V_{j,i} + 1) \qquad (4)$$

where j is iterated over all tryptic peptides. The problem can now be defined as to search all MS/MS spectra against an indexed database of the N peptides of all proteins. This approach, however, requires that all the possible peptides are stored in a database in the memory, causing the computer to run out of memory for large N. Implementation of an iterative algorithm that runs through all possible modifications calculated by eq 4 is not optimal. For instance, if a theoretical parent ion mass is found to be absent in all the observed MS/MS

Table 1. MS/MS Diagnostic Ion Weight Matrix for Amino Acids(a)

aa  P(y1)  mass (Da)  P      mass (Da)  P      mass (Da)  P      mass (Da)  P      mass (Da)  P
G   0/0    30.0344    na/0
A   0/0    44.0500    na/0
S   0/0    60.0449    45/1
P   0/0    70.0657    59/3
V   0/0    72.0813    92/2   69.0704    62/1   55.0548    40/1   41.0391    na/0
T   0/0    74.0606    50/2
L   0/0    86.0964    77/40
I   0/0    86.0964    70/48
N   0/0    87.0558    70/12  70.0293    59/2
D   0/0    88.0399    62/3   70.0293    41/1
Q   0/0    101.0715   72/36  129.1028   39/64  84.0449    36/11  56.0500    64/2
K   90/89  129.1140   74/77  101.1079   55/17  84.0813    70/14  56.0500    73/2
E   0/0    102.0555   90/24  84.0449    65/12
M   0/0    104.0534   89/47
H   0/0    110.0718   78/67
F   0/0    120.0813   82/74  91.0548    36/1
R   84/92  158.0924   80/40  157.1084   52/28  129.1140   33/40  115.0984   69/22  112.0870   84/27
R          100.0875   27/1   87.092     49/8   70.0660    35/1   60.0560    36/1
Y   0/0    136.0762   64/84  91.0548    36/2   107.0497   77/4
W   0/0    159.0922   39/75  132.0813   44/32  130.0657   22/76  117.0578   4/2    77.0391    8/1
C   0/0    133.0436   45/59

(a) The probabilities are based on observed fragments in a reference set of 902 Q-TOF MS/MS spectra. The column P(y1) gives the probability P(aa-C|y1) (in percentage) of the amino acid being C-terminal given that the corresponding y1 ion is observed vs the probability P(y1|aa-C) of the y1 ion being observed given that the sequence has the specific amino acid at the C-terminus. The column P gives the probability P(aa|diagnostic ion) of a sequence to contain a particular amino acid given that the mass corresponding to the immonium ion of the particular amino acid or a fragment of the immonium ion is observed vs the probability P(diagnostic ion|aa) of observing the diagnostic ion given the amino acid in the sequence. C is carbamidomethylated cysteine. na: not applicable.

spectra then there is no need to iterate through all combinations of peptides of this mass. It is therefore advantageous to split eq 4 into two parts, one that parses all possible parent ion masses and a second part which iterates through all peptide combinations with a certain parent ion mass. N_m, the number of possible combinations of variable modifications independent of position in the sequence, is given by

$$N_m = \prod_{i=1}^{N_V} \frac{\prod_{j=1}^{N_{V,aa}-1} (N_{aa,i} + j)}{(N_{V,aa}-1)!} \qquad (5)$$

where N_V is the number of variable modifications specified in the search, N_{V,aa} is the number of potential modifications for a particular amino acid, and N_{aa,i} is the number of amino acids for which the variable modification i is possible. The number of combinations due to different positions of the variable modifications for a certain composition in eq 5 is given by

$$N_C = \prod_{i=1}^{N_V} \binom{N_{aa,i}}{NC_{aa,i}} \qquad (6)$$

where NC_{aa,i} is the number of occurrences of variable modification i in a certain composition. The total number of peptides for a protein is now given by combining eqs 2, 3, 5, and 6

$$N = \sum_{j=1}^{t} \sum_{k=1}^{N_m} \prod_{i=1}^{N_V} \binom{N_{aa,i,k,j}}{NC_{aa,i,k,j}} \qquad (7)$$
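The counting formulas of this section can be checked numerically. The following Python sketch is an illustration only, not the VEMS implementation: it evaluates eqs 2 and 3 in closed form, enumerates tryptic peptides explicitly for comparison, and evaluates eqs 1 and 4 for a hypothetical sequence. It assumes simple cleavage after every Lys or Arg; the function names and the test sequence are hypothetical.

```python
from math import comb, prod


def tryptic_count(C, u, c_terminal_site=False):
    """Eq 2 (no Lys/Arg at the C-terminus) or eq 3 (Lys/Arg at the C-terminus)."""
    if u > C:
        raise ValueError("u must not exceed the number of cleavage sites C")
    fully_cleaved = C if c_terminal_site else C + 1
    return fully_cleaved * (u + 1) - (u * u + u) // 2


def tryptic_peptides(sequence, u):
    """Enumerate peptides with up to u missed cleavages (cleave after K or R)."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])
    return ["".join(pieces[i:i + k + 1])
            for k in range(u + 1)
            for i in range(len(pieces) - k)]


def modified_forms(peptide, mods_per_residue):
    """Eq 1: product over residues of (V_i + 1)."""
    return prod(mods_per_residue.get(aa, 0) + 1 for aa in peptide)


def total_forms(sequence, u, mods_per_residue):
    """Eq 4: sum of eq 1 over all tryptic peptides."""
    return sum(modified_forms(p, mods_per_residue)
               for p in tryptic_peptides(sequence, u))


def position_arrangements(sites, modified):
    """One factor of eq 6: ways to place `modified` groups on `sites` residues."""
    return comb(sites, modified)
```

In the special case of a single variable modification per residue type, summing `position_arrangements` over all compositions recovers eq 1, e.g. for three lysines 1 + 3 + 3 + 1 = 2^3. This is the sense in which the search can first test a few parent masses (compositions) and enumerate positions only for masses that are actually observed.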

Scoring Function. Searching with a large number of variable modifications increases the probability of producing random hits (false positives), especially for nonoptimized scoring functions. It is therefore important to optimize the scoring function and carefully validate the results from the MS/MS database searches. The scoring function used here is based on weights that depend on the probability of observing a sequence feature given an observed ion and the probability of observing different types of ions given the peptide sequence. This is a slightly different approach than previously reported, where the weights depend only on the probability of observing different types of ions given the peptide sequence.18 The rationale behind this is that an ion that is observed in 100% of all cases of a given sequence is not very interesting if it is also observed for many other sequences. Additional information on the peptide composition can sometimes be gained by considering low mass diagnostic ions such as immonium ions of amino acid residues. The probabilities P(aa|diagnostic ion) for a sequence to contain the standard amino acids given different observed diagnostic ions were estimated from a reference set of 902 Q-TOF MS/MS spectra assigned to peptide sequences with no modifications (Table 1). The reference peptides were obtained from purified proteins from different projects in the group (see http://yass.sdu.dk for details). The probabilities P(aa|diagnostic ion) can now be used as weights for positive scoring of matching parent ions and the probabilities P(diagnostic ion|aa) can be used as weights for negative penalty scores. The diagnostic ion masses presented in Table 1 have previously been observed in high energy collision induced dissociation;19 however, we find that some of these diagnostic ions can also be observed in low energy collision induced dissociation. The probabilities presented are dependent on the type of mass spectrometer and settings. Using the reference dataset, the probability P(aa-C|y1) of the amino acid being at the C-terminus given that the corresponding y1 ion is observed in the spectrum and the probability P(y1|aa-C) of the y1 ion being observed in the spectrum given that the sequence contains the specific amino acid in the C-terminal position were determined.
These probabilities were estimated for tryptic peptides, which explains the high probability for having Lys or Arg at the C terminal given that their corresponding y1-ion was observed. The obtained probabilities can be used as positive and negative weights in a scoring function. For example Val and Glu have the most specific diagnostic ions, however, they are not so likely

Table 2. MS/MS Diagnostic Ion Weight Matrix for Post-translationally Modified Amino Acids(a)

aa     mass (Da)  P     mass (Da)  P     mass (Da)  P     mass (Da)  P    mass (Da)  P
meR    143.1291   21.2  115.0866   12.4  112.0869   12.6  74.0713    0.2  70.0651    0.2
me2aR  157.1448   20.6  115.0866   12.4  112.0869   12.4  88.0869    1    71.0604    0
me2sR  157.1448   20.6  115.0866   12.4  112.0869   12.4  88.0869    1    71.0604    0
me3R   171.1604   24.3
meK    115.123    11.8  98.0964    0.8   84.0808    8.9
me2K   129.1386   55.5  84.0808    8.9
me3K   143.1543   20.6  84.0808    8.9
acK    143.1179   21.2  126.0913   2     84.0808    8.9
phY    216.042    14.2
oxM    120.0478   32.7

(a) Monoisotopic masses of diagnostic ions for modified residues and the probability P (in percentage) to observe them in unmodified peptides. oxM is methionine sulfoxide. Other abbreviations are listed in Table 1.

Table 3. Neutral Losses from Modified Residues Which Were Considered by the Program

modification  neutral loss molecular composition  ∆m (Da)
oxM           CH4OS                               63.9983
phS           H3O4P                               97.9769
phS           HO3P                                79.9663
phT           H3O4P                               97.9769
phT           HO3P                                79.9663
phY           HO3P                                79.9663
me3K          C3H9N                               59.0735
meR           C2H7N3                              73.064
meR           C2H4N2                              56.0374
meR           CH5N                                31.0422
me2aR         C3H9N3                              87.0796
me2aR         C3H6N2                              70.0531
me2aR         C2H7N                               45.0579
me2sR         C3H9N3                              87.0796
me2sR         C3H6N2                              70.0531
me2sR         CH5N                                31.0422

to be observed. The diagnostic ion at 129.1140 Da of Arg is not very specific since it is actually more frequently observed for peptides with Lys. The diagnostic ion at 158.1 Da, on the other hand, is almost as specific as the y1 ion for Arg but it is less frequently observed. It is sometimes difficult to assign the exact modification site in peptides due to incomplete information in the MS/MS spectra. Isobaric (same mass) modifications may also lead to false assignments. To optimize the VEMS program to achieve correct assignments of post-translational modifications we introduced a scoring function that considers diagnostic ions (Table 2) and neutral losses (Table 3) that are specific for various types of modifications. By comparing Tables 1 and 2 it is evident that a number of diagnostic ion masses generated by unmodified amino acid residues are identical to diagnostic ion masses produced from certain modified amino acids. As an example, the diagnostic ion at m/z 129.1 is characteristic for arginine, lysine, and di-methylation of lysine. To minimize random hits the weights for diagnostic ions and modification-specific neutral loss ions were chosen to be as small as possible without eliminating the ability to distinguish between alternative modifications and their positions. These weights have no theoretical meaning and were found empirically. The following empirical scoring function was used

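$$S = \sum_{j=1}^{J} \sum_{i=1}^{n_{aa}} \left( w_j \, \frac{I_{ji}}{I_l} \, \frac{e^{-x^2/2\sigma^2}}{\sigma\sqrt{2\pi}} \right) - \sum_{j=1}^{J} \sum_{i=1}^{n_{aa}} (w_p) - \sum_{k=1}^{N} w_{up} \left( \frac{I_k}{I_l} \right) \qquad (8)$$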

where S is the score. The first term gives a positive score for each of the fragment ions matching; j is iterated over marker ions (Tables 1 and 2), neutral losses (Table 3), a, b, and y ions. The variable i is iterated over the amino acid residues in a peptide sequence. Peaks matching neutral losses specific for a modification were only counted if they were not already assigned as a, b, y-NH3, y-H2O and y-ion signals. The variables wj are positive weights for a specific ion type, e.g., for the ion at 158.0924 Da characteristic for arginine the positive weight 0.80 is used if the ion is observed and the sequence contains arginine. Iji is the intensity of an observed ion in the MS/MS spectrum, and Il is the highest intensity in the last 50 Da before the peak currently being evaluated. This local normalization compensates for the background peaks in the spectrum. The second term gives a negative score for each of the diagnostic ions from the amino acids in the candidate peptide sequence that are missing, where the wp’s are penalty weights for a specific ion type, e.g., the penalty weight 0.40 is used if the sequence contains a C-terminal arginine but the ion y1-NH3 at 158.0924 Da is not observed. The third term gives a negative score for all the peaks in the spectrum that could not be explained by the candidate peptide sequence; k in the penalty term is iterated over all unexplained peaks. The variable Ik is the intensity of an unexplained peak above a specified background. The variable wup is a weight optimized by the method described below. VEMS performs linear recalibration of the fragment ions in an MS/MS spectrum to theoretical fragment masses for each possible peptide candidate for the spectrum. The information from the linear recalibration is directly used in the scoring function. The mass error x after recalibration is used to weight the score from a matching fragment by a normal distribution in eq 8. σ is the standard deviation from the calibration line. The empirical scoring function considers the type of observed ion, the intensity of the observed ion, and the noise in the spectra, and due to the penalty term it has an enhancing effect in cases where multiple ions are observed. The weights for wup, a, a2, b, b2, y-H2O, y-NH3, y-ions, and the minimum number of the 10 most intense peaks that should be explained were optimized by using a genetic algorithm (see technical report on http://yass.sdu.dk) to maximize the area under the ROC curve.
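The three terms of the empirical scoring function (eq 8) and the linear recalibration that supplies x and σ can be sketched in Python as follows. This is an illustration under stated assumptions and not the VEMS code: the Gaussian factor is assumed to be the normal density with standard deviation σ, σ is computed as the root-mean-square residual of the calibration line, and all function names and input layouts are hypothetical.

```python
import math


def recalibrate(observed, theoretical):
    """Least-squares line theoretical ~ a * observed + b; returns (a, b, sigma)."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(theoretical) / n
    sxx = sum((x - mean_x) ** 2 for x in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, theoretical))
    a = sxy / sxx
    b = mean_y - a * mean_x
    residuals = [y - (a * x + b) for x, y in zip(observed, theoretical)]
    sigma = math.sqrt(sum(r * r for r in residuals) / n)  # RMS residual (assumption)
    return a, b, sigma


def gaussian_weight(x, sigma):
    """Normal density of the mass error x after recalibration."""
    return math.exp(-x * x / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))


def score(matches, missing_penalties, unexplained, w_up):
    """Eq 8 with three terms.

    matches: (w_j, I_ji, I_l, x, sigma) per matched ion
    missing_penalties: w_p per expected diagnostic ion that is absent
    unexplained: (I_k, I_l) per unexplained peak above background
    """
    positive = sum(w * (i / i_local) * gaussian_weight(x, s)
                   for w, i, i_local, x, s in matches)
    penalty = sum(missing_penalties)
    noise = sum(w_up * i_k / i_local for i_k, i_local in unexplained)
    return positive - penalty - noise
```

With one matched arginine ion at weight 0.80, one missing diagnostic ion at penalty 0.40, and one unexplained peak, `score([(0.8, 50.0, 100.0, 0.0, 1.0)], [0.4], [(10.0, 100.0)], 0.5)` combines the three contributions; the intensities, σ, and w_up in this call are invented for illustration.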
The ROC curve is obtained by searching spectra with known solutions and has previously been used to test the performance of scoring algorithms.20

Validation of Peptide Hits. The search results can be evaluated by visual inspection and by a number of visualization tools and validation functions in the VEMS 3.0 program. Validation filters can be applied directly on the scoring function or they can be used afterward for grouping of database search results based on the validation filter rather than the scoring function. Different types of scoring function constraints can be used, such as: how many of the 10 most intense peaks can be assigned to sequence-specific fragment ions, whether the y1-ion signal matches the C-terminal amino acid residue of tryptic peptides (m/z 147 for Lys, m/z 175 for Arg), a minimum correlation between the observed and theoretical spectrum, or the minimum percentage of possible y- or b-ions that should


Figure 1. A: Scores obtained by searching randomly generated spectra vs the logarithm of the probability of the scores. The points are from the probability distributions of peptides with length 6 (X), 10 (+) and 15 (*). B: Peptide length vs the slopes of the linear equation of the type shown in A.

be observed.21,22 If the b-ions fulfill the last constraint, then there is an additional requirement that at least one of the b-ions should be a fragment containing a Lys, Arg, or His. In addition to these features, VEMS provides an error model that validates the search results by iteratively performing a linear calibration on the a-, b-, and y-ion fragment masses, removing one outlier per iteration. The outlier is determined by recalculating the linear regression assuming that each of the fragment ions in turn is the outlier. The outlier removed is the one that gives the best linear fit after removal. This error model is also directly used in the scoring function (eq 8). The iteration continues until the standard deviation from the fitted line reaches a user-defined threshold. The user-defined threshold will depend on the general mass accuracy of the instrument used. The obtained standard deviation can be used to compare the mass accuracy of alternative peptide solutions to the same spectrum (see Results and Discussion).

Calculation of Expectation Values. A model for simulating random matches was made by randomly picking 1000 parent ion masses and 100 fragment ion masses for each parent ion from a set of 35 000 experimental MS/MS spectra. This is necessary since experimental mass values are not spread uniformly across the whole mass scale.23,24 The randomly generated spectra were searched against all entries in the UniprotKB/Swiss-Prot database containing 162 780 protein sequences (http://au.expasy.org/sprot/download.html, 1/10/2004). Peptide scores were calculated by using the scoring function described in a previous section (eq 8). It was assumed that the probability distribution of scores from searching random spectra would depend on the peptide sequence length or mass. The peptide scores were therefore grouped according to the length of the peptide sequences. It was found that these distributions could be approximated by an exponential distribution (Figure 1A).
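The exponential approximation of random-match scores can be illustrated with a short Python sketch. It is not the VEMS implementation: it assumes that the probability plotted in Figure 1A is the fraction of random scores at or above a given score, fits a straight line to the logarithm of that fraction, and uses the fitted line to estimate the tail probability of a new score. The function names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_exponential_tail(random_scores):
    """Fit ln P(score >= s) = slope * s + intercept to random-match scores."""
    ordered = sorted(random_scores)
    n = len(ordered)
    xs, ys = [], []
    for rank, s in enumerate(ordered):
        survival = (n - rank) / n  # fraction of random scores >= s
        xs.append(s)
        ys.append(math.log(survival))
    return fit_line(xs, ys)


def tail_probability(score, slope, intercept):
    """Estimated probability that a random match scores at least `score`."""
    return min(1.0, math.exp(slope * score + intercept))
```

The same `fit_line` helper can be applied to (peptide length, slope) pairs to extrapolate the exponential parameters to longer peptides, in the spirit of the linear relationship shown in Figure 1B.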
The parameters for the exponential distribution are estimated from the slope and intercept of a linear equation fitted to the scores and log P (probability). As expected, the probability of obtaining higher scores by searching random spectra is lower for longer peptides. We have previously shown that the probability distribution for low protein scores (assumed to be random hits) obtained during searches can also be approximated by the exponential distribution.12 The above-described model was only used for testing the statistical model. For real MS/MS data, the distribution described above will underestimate random matching probabilities due to, for example, sequence homology. VEMS therefore uses low scores, assumed to be due to random matches obtained during a database search, for determining the probability distribution.25 The distributions made during the search will reflect sequence database size, redundancy, and sequence homology in a database, the number of variable modifications and missed cleavages specified, and the amount of submitted MS/MS data. An approximation of the probability distribution of low peptide scores for long peptides during a search is problematic. The frequency of low scores for long peptides is low when small datasets and databases are searched. This leads to poor estimation of the score distribution for larger peptides. However, it turned out that there is a linear relationship between the parameters for the exponential distributions and the sequence length (Figure 1B). The exponential distribution for longer peptides can therefore be estimated from the parameters of the exponential distributions of shorter peptides. This strategy was found to work well for searches against databases containing more than 2000 proteins. For fewer than 2000 protein sequences, the program will estimate the exponential parameters based on the searches of random spectra against UniprotKB/Swiss-Prot. Protein probability values are calculated from the peptide probabilities as previously reported.26

Implementation. The algorithm was constructed so the N peptides for each protein were generated one by one and searched against an indexed parent ion mass database of all MS/MS spectra. We found indexing to reduce search time for large datasets (>1000 MS/MS spectra), whereas indexing does not have a large effect on search time when ∼100 spectra are searched.
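A parent ion mass index of the kind described here can be sketched with a sorted list and binary search. The Python class below is a hypothetical illustration, not the VEMS data structure; the class name, method name, and spectrum identifiers are assumptions. An empty result for a theoretical parent mass means that none of the positional arrangements of that composition need to be generated.

```python
from bisect import bisect_left, bisect_right


class ParentMassIndex:
    """Sorted index of observed parent ion masses for tolerance-window lookup."""

    def __init__(self, spectra):
        # spectra: iterable of (spectrum_id, parent_mass) pairs
        self._entries = sorted(spectra, key=lambda entry: entry[1])
        self._masses = [mass for _, mass in self._entries]

    def candidates(self, theoretical_mass, tolerance):
        """Return ids of spectra whose parent mass lies within +/- tolerance."""
        lo = bisect_left(self._masses, theoretical_mass - tolerance)
        hi = bisect_right(self._masses, theoretical_mass + tolerance)
        return [spectrum_id for spectrum_id, _ in self._entries[lo:hi]]
```

Each lookup costs O(log n) plus the number of hits, which is consistent with the observation that indexing pays off mainly for large sets of spectra.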
To increase database search speed we developed two modules for distributing the computational MS/MS data searches for variable modifications to several CPUs in the laboratory (desktop PCs). The maximal increase in speed that can be obtained is the ratio of the sum of the CPU speed of all computers to the CPU speed of a single computer. To obtain the maximal possible performance, the algorithm should be able to divide the job into a number of smaller problems and minimize the data transfer time between computers.27 A tree-like system was built to minimize transfer time between computers. It is advantageous for the master computer not to send jobs to all slave computers (n > 3) since the data content within each job is relatively large. The master computer, therefore, sends the data to 3 slave computers which can then distribute their part of the job to 3 new slave computers, if necessary, etc. If the number of available slave computers is less than 5, then the master computer also works as a slave; otherwise the master computer only coordinates the jobs. The slave module can be efficiently run at the same time as other programs, it can be stopped at any point to obtain maximum performance of the host computer, and it uses a low amount of memory since the peptide combinations are generated iteratively. When finished, the slave computers send their solutions to the master module, which collects and parses them into a final result.

Test-Application. The search algorithm and validation scoring function were tested using data obtained from a study of histone proteins (Salcedo et al., in preparation). Histone preparations were subjected to SDS-PAGE, alkylated by iodoacetamide, in-gel digested with trypsin and analyzed by LC-MS/MS. The LC-MS/MS experiments were conducted on a nanoflow HPLC system (LC-Packings, The Netherlands) with reversed phase columns coupled online to an electrospray ionization Q-TOF instrument (Micromass, UK). The LC-MS/MS dataset obtained during this analysis contained 2825 MS/MS spectra. Many tryptic peptides were expected to be post-translationally modified as histones are heavily acetylated and methylated.

Table 4. Number of Combinations Iterated through during the Search of 2825 MS/MS Spectra against Histone H3 from Plasmodium falciparum(a)

missed cleavages   0                    1                    ..   6
var. mod.          a      b             a           b        ..   a           b
non                5      23            25          54       ..   65          169
phS                8      30            37          76       ..   111         317
:                  :      :             :           :        ..   :           :
acK                61     3488          1543        234595   ..   184142      2.55 × 10^9
all                61     3.6 × 10^3    1.6 × 10^3  2.4 × 10^5 ..  1.9 × 10^5  2.9 × 10^9

(a) Columns a: using eqs 2-7. Columns b: using eq 1. Values are given for 0, 1, or 6 specified missed cleavages. The last row corresponds to searching all variable modifications shown in Table 1.
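The tree-like job distribution described under Implementation can be sketched as follows. This Python fragment is an illustration only: the breadth-first assignment order, the node names, and the function names are assumptions, while the fan-out of 3, the rule that the master also computes when fewer than 5 slaves are available, and the speed-up bound are taken from the text.

```python
from collections import deque


def build_tree(slaves, fan_out=3):
    """Map each node to the nodes it forwards jobs to (breadth-first fill)."""
    children = {"master": []}
    senders = deque(["master"])
    pending = deque(slaves)
    while pending:
        parent = senders.popleft()
        for _ in range(fan_out):
            if not pending:
                break
            child = pending.popleft()
            children[parent].append(child)
            children[child] = []
            senders.append(child)
    return children


def master_also_computes(n_slaves):
    """With fewer than 5 slaves the master also works as a slave."""
    return n_slaves < 5


def max_speedup(cpu_speeds, single_cpu_speed):
    """Upper bound: summed CPU speed of all computers over that of one computer."""
    return sum(cpu_speeds) / single_cpu_speed
```

With seven slaves, the master sends data to three of them, the first of these forwards to three more, and the second forwards to the remaining one, so no computer ever transmits a job to more than three others.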

Results and Discussion

Combinatorial Explosion. The number of combinations according to eqs 2-7 will depend on the amino acid composition of the protein sequences, the specified types of variable modification, the number of missed cleavages, the number of MS/MS spectra, the precursor ion masses, and the mass accuracy of the precursor ion used in the search. The number of iterations generated by using eq 1 or eqs 2-7 for searching the 2825 MS/MS spectra against Histone H3 from Plasmodium falciparum using a parent ion mass accuracy of ±0.2 Da is summarized in Table 4. The number of combinations generated by eqs 2-7 is considerably lower than for an algorithm that uses eq 1. This difference becomes more pronounced when searching with increasing numbers of variable modifications.

Search Results. The VEMS program is able to merge all the LC-MS/MS runs and still keep track of which LC run the

Table 5. Variable Modifications Used for the Demonstration Data(a)

∆m (Da) parent  ∆m (Da) fragment  aa     comments
79.96634        -18.01057         phS    Phosphoserine
79.96634        -18.01057         phT    Phosphothreonine
79.96634        79.96634          phY    Phosphotyrosine
15.99491        15.99491          oxM    Methionine oxidation
14.01565        14.01565          me1K   Methylation of lysine
28.03130        28.03130          me2K   Di-methylation of lysine
42.04695        42.04695          me3K   Tri-methylation of lysine
14.01565        14.01565          me1R   Methylation of arginine
28.03130        28.03130          me2aR  Asymmetric di-methylation of arginine
28.03130        28.03130          me2sR  Symmetric di-methylation of arginine
42.04695        42.04695          me3R   Tri-methylation of arginine
42.01056        42.01056          acK    Acetylation of lysine
42.01056        42.01056          Ac-(   N-terminal acetylation
114.04293       114.04293         ub1K   Ubiquitination of lysine
383.22809       383.22809         ub2K   Ubiquitination of lysine
0.98402         0.98402           deimR  Deimination of arginine

a “(“ indicates the N-terminus of the proteins not to be confused with the N-terminal of the tryptic peptides for which the modules use the symbol “