Monoisotopic Mass Determination Algorithm for Selenocysteine

Jul 11, 2012 - Division of Computer Science and Engineering, Hanyang University, Seoul, .... mination algorithm for selenocysteine-containing polypept...
0 downloads 0 Views 3MB Size
Article pubs.acs.org/jpr

Monoisotopic Mass Determination Algorithm for SelenocysteineContaining Polypeptides from Mass Spectrometric Data Based on Theoretical Modeling of Isotopic Peak Intensity Ratios Jin Wook Kim,† Sunho Lee,‡ Kunsoo Park,*,‡ Seungjin Na,§ Eunok Paek,*,§ Hyung Seo Park,§ Heejin Park,§ Kong-Joo Lee,∥ Jaeho Jeong,∥ and Hwa-Young Kim⊥ †

Medical Information Center, Seoul National University Hospital, Seoul, 110-744, Korea School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea § Division of Computer Science and Engineering, Hanyang University, Seoul, 133-791, Korea ∥ The Center for Cell Signaling & Drug Discovery Research, College of Pharmacy and Division of Life & Pharmaceutical Sciences, Department of Bioinspired Science, Ewha Womans University, Seoul, 120-750, Korea ⊥ Department of Biochemistry and Molecular Biology, Yeungnam University College of Medicine, Daegu, 705-717, Korea ‡

S Supporting Information *

ABSTRACT: Selenoproteins, containing selenocysteine (Sec, U) as the 21st amino acid in the genetic code, are well conserved from bacteria to human, except yeast and higher plants that miss the Sec insertion machinery. Determination of Sec association is important to find substrates and to understand redox action of selenoproteins. While mass spectrometry (MS) has become a common and powerful tool to determine an amino acid sequence of a protein, identification of a protein sequence containing Sec was not easy using MS because of the limited stability of Sec in selenoproteins. Se has six naturally occurring isotopes, 74Se, 76Se, 77Se, 78Se, 80Se, and 82Se, and 80Se is the most abundant isotope. These characteristics provide a good indicator for selenopeptides but make it difficult to detect selenopeptides using software analysis tools developed for common peptides. Thus, previous reports verified MS scans of selenopeptides by manual inspection. None of the fully automated algorithms have taken into account the isotopes of Se, leading to the wrong interpretation for selenopeptides. In this paper, we present an algorithm to determine monoisotopic masses of selenocysteine-containing polypeptides. Our algorithm is based on a theoretical model for an isotopic distribution of a selenopeptide, which regards peak intensities in an isotopic distribution as the natural abundances of C, H, N, O, S, and Se. Our algorithm uses two kinds of isotopic peak intensity ratios: one for two adjacent peaks and another for two distant peaks. It is shown that our algorithm for selenopeptides performs accurately, which was demonstrated with two LC−MS/MS data sets. Using this algorithm, we have successfully identified the Sec−Cys and Sec−Sec cross-linking of glutaredoxin 1 (GRX1) from mass spectra obtained by UPLCESI-q-TOF instrument. KEYWORDS: mass spectrometry, selenocysteine, selenopeptide, selenium, isotopic distribution, intensity ratio



INTRODUCTION Selenoproteins, containing selenocysteine (Sec, U) as the 21st amino acid in the genetic code, are well conserved from bacteria to human, except yeast and higher plants that miss the Sec insertion machinery. Twenty five selenoproteins have been identified in human genome1 and are known to play key roles in antioxidant defense and cellular redox homeostasis. Among the characterized selenoproteins, most are oxidoreductases catalyzing the following three ways. First, glutathione peroxidases (GPxs) can get rid of reactive oxygen species (ROS) by catalyzing the reduction of hydrogen peroxide and lipid peroxide.2,3 Second, thioredoxin reductases (TRs)4 and methionine sulfoxide reductases (Msrs)5 reduce the oxidized cysteine and methionine residues, respectively, © 2012 American Chemical Society

in proteins oxidized by ROS. Third, deiodinases catalyze deiodination of thyroid hormones.6 Most of unknown selenoproteins are expected as redox enzymes having Sec in a thioredoxin fold active site (CxxC or CxxS/T).7 Since Sec is a highly reactive residue having lower pKa than Cys (5.2 vs 8.3), it thus becomes a better nucleophile in physiological pH 7.4.8 Further studies should be performed to understand the function and mode of action of unknown selenoproteins. Determination of Sec association is important to find substrates and to understand redox action of selenoproteins. Received: March 9, 2012 Published: July 11, 2012 4488

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research



While mass spectrometry (MS) has become a common and powerful tool to determine an amino acid sequence of a protein,9 identification of a protein sequence containing Sec was not easy using MS because of the limited stability of Sec in selenoproteins. Sec in selenoprotein is readily degraded into dehydroalanine via a β-elimination. It is required to modify the sample preparation procedure to identify the Sec-containing peptide with MS/MS by reducing Sec with DTT and alkylating with iodoacetamide (IAM). Even using this methodology, it is not good enough for accurate analysis.10 On the other hand, similar to disulfide bonds between two cysteines (Cys), Sec can form a selenenylsulfide (Se−S) bond with another Cys or a diselenide (Se−Se) bond with another Sec. Diselenide bonds are more stable than disulfide bonds and have a significant advantage over disulfide bonds for protein folding.11 Some MS-based analyses have reported identification of the peptides that contain selenenylsulfide or diselenide bonds,12,13 where the peaks of selenopeptide on an MS spectrum could be clearly distinguished from those of common peptides (that do not contain Se) because they have extraordinary isotopic distributions due to isotopes of Se. Se has six naturally occurring isotopes, 74Se, 76Se, 77Se, 78Se, 80Se, and 82 Se, and 80Se is the most abundant isotope as shown in Table 1. These characteristics provide a good indicator for

C (%)

H (%)

N (%)

O (%)

S (%)

Se (%)

0 +1 +2 +3 +4 +5 +6 +7 +8

98.93 1.07

99.989 0.0115

99.642 0.368

99.757 0.038 0.205

94.93 0.76 4.29

0.87

0.02

EXPERIMENTAL SECTION

Peptide Model

A peptide, “DVIVYTSNTUPHCFTVK”, used in this study was synthesized considering peptide length and amino acid composition by Thermo Scientific Inc. (Rockford, IL, USA). Its intraselenenylsulfide-linked form was obtained by incubation at 37 °C for 0.5 h. GRX 1 Preparation and LC−MS/MS

Mutated selenoprotein GRX 1 (C16S) from Clostridium was prepared as described previously.16 The sample was separated on SDS-PAGE, and the gel bands were destained and digested with trypsin for peptide extraction. The extracted solutions were evaporated to dryness in Speedvac for MS analysis. Formic acid was added to the final solution so that the final concentration of acid in solvent was 0.1% to facilitate electrospray. Peptides were analyzed by nanoAcquity UPLC/ESI/MS (SYNAPT HDMS, Waters Co., U.K.). Peptides were separated by using a C18 reversed-phase 75 μm i.d. × 200 mm analytical column (1.7 μm particle size, BEH130 C18, Waters) with an integrated electrospray ionization PicoTip (±10 μm, New Objective, USA). Five microliters of peptide mixtures were dissolved in buffer A (Water/formic acid; 100:0.1, v/v), injected on a column, and eluted by a linear gradient of 5−80% buffer B (ACN/formic acid; 100:0.1, v/v) over 120 min. Samples were desalted on line prior to separation using trap column (i.d. 180 μm × 20 mm, Symmetry C18, Waters) cartridge. Initially, the flow rate was set to 300 nL/min and the capillary voltage (2.8 keV) was applied to the UPLC mobile phase before spray. Chromatography was performed on line to SYNAPT HDMS. The mass spectrometer was programmed to record scan cycles composed of one MS scan followed by MSMS scans of the 4 most abundant ions in each MS scan. MS parameters for efficient data-dependent acquisition were intensity (>10) and number of components (3−4) to be switched from MS to MS/MS analysis. In the first run analysis, the 4 most abundant precursors were selected for MS/MS analysis.17

Table 1. Natural Abundances of Atoms in A′ = {C,H,N,O,S,Se} isotope

Article

9.36 7.63 23.78 49.61 8.73

selenopeptides but make it difficult to detect selenopeptides using software analysis tools developed for common peptides. Thus, previous reports verified MS scans of selenopeptides by manual inspection. Fully automated algorithms have been introduced to detect peptide features on an MS spectrum14,15 and analyzed isotopic patterns of peptides only based on isotopes of C, H, N, O, and S elements. None have taken into account the isotopes of Se, leading to the wrong interpretation for selenopeptides. In this paper, we present an algorithm to determine monoisotopic masses of selenocysteine-containing polypeptides. Our algorithm is based on a theoretical model for an isotopic distribution of a selenopeptide, which regards peak intensities in an isotopic distribution as the natural abundances of C, H, N, O, S, and Se. In spite of the singularity (natural abundances and count) of Se, our theoretical distribution model fits very closely to the real data. Our algorithm uses two kinds of isotopic peak intensity ratios: one for two adjacent peaks and another for two distant peaks. It is shown that our algorithm for selenopeptides performs accurately, which was demonstrated with two LC− MS/MS data sets of a selenoprotein. Using this algorithm, we have successfully identified the Sec−Cys and Sec−Sec crosslinking of glutaredoxin 1 (GRX1) from mass spectra obtained by UPLC-ESI-q-TOF instrument.



METHODS

We first present a theoretical model for an isotopic distribution of a polypeptide that contains one or two selenocysteines. Then, using this model, we describe our isotopic peak intensity ratio functions. Finally, we explain a monoisotopic mass determination algorithm for selenocysteine-containing polypeptides. In order to develop our model for isotopic distributions of selenocysteine-containing cross-linked peptides, we generated a virtual database of selenocysteine-containing cross-linked peptides. From UniProt database, all the Cys-containing tryptic peptides were first selected. Then, we randomly chose two polypeptides and formed virtual disulfide linked peptides. By substituting, in silico, one or two sulfur atoms forming the disulfide bridge for selenium atom, we generated a virtual database of Sec−Cys and Sec−Sec cross-linked peptides and used emass software to render the theoretical isotopic distribution of these selenocysteine-containing cross-linked peptides. Model for Isotopic Distribution

We first give some notations and then we present a monoisotopic peak intensity function, followed by a theoretical 4489

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Figure 1. Isotopic distributions of selenocysteine-containing polypeptides with mass 2000. (a) Distribution for a peptide with one selenocysteine, I1k, and (b) distribution for a peptide with two selenocysteines, I2k.

Figure 2. Results of isotopic intensity ratio model Ik+1/Ik (red line). Though I1k+1/I1k and I2k+1/I2k are different from each other, our analysis based on simulated ratio (gray dots) between two neighboring isotopic peaks showed that any such ratio can be considered as one of the four cases shown. (a) I11/I10, (b) I13/I12, (c) I16/I15, and (d) I18/I17.

A′ − {Se}. Let Xa be the +a isotope of an atom X ∈ A′ where a is an integer larger than or equal to 0 and let PXa be the natural abundance of Xa. Table 1 shows the abundances of the atoms in A′. Let CnCHnHNnNOnOSnSSenSe denote a polypeptide where the

model of the kth peak intensity function for selenocysteine-containing polypeptides. Let A′ = {C,H,N,O,S,Se} be a set of atoms that a selenocysteinecontaining polypeptide consists of, and let A = {C,H,N,O,S} denote 4490

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

denote by I1k a peak intensity for polypeptides where nSe = 1 and I2k a peak intensity for polypeptides where nSe = 2. Supporting Information Section S2 contains more details. Now we change the model of Ik as a formula about polypeptide masses. To do this, we assume the linearity between the number of atoms and mass m, i.e., nX ≈ aXm where aX is a constant for each X ∈ A.14 Note that we do not apply the linearity assumption to the number of selenocysteines, nSe. Then, we can approximate Ik as in Lemma 2. A detailed derivation of Lemma 2 is given in Supporting Information Section S3. Lemma 2. The intensity Ik approximates to a polynomial of mass m with degree k, i.e., Ik = dkmk + dk−1mk−1 + ··· + d1m + d0 where di for 0 ≤ i ≤ k are constants. To compute di in Lemma 2, we first find aX for each X ∈ A. Since an average polypeptide with one or two selenocysteines is C 285.59 H 439.09 N 71.66 O 87.85 S 3.27 Se 1 and C 285.59 H 439.09 N 71.66 O87.85S2.27Se2, respectively, we can deduce that aC = 0.0442, aH = 0.0680, aN = 0.0111, aO = 0.0136, aS = 0.000506 and aC = 0.0439, aH = 0.0676, aN = 0.0110, aO = 0.0135, aS = 0.000349, respectively. It is clear that I1k and I2k become polynomials of mass m with degree k where nSe = 1 for I1k and nSe = 2 for I2k , respectively. For example, I10 = 0.87, I11 = 4.70 × 10−4m, I14 = 3.08 × 10−15m4 + 6.21 × 10−12m3 + 1.37 × 10−6m2 + 4.58 × 10−3m + 23.8 and I20 = 0.757, I21 = 4.06 × 10−4m, I24 = 2.61 × 10−15m4 + 4.74 × 10−12m3 + 2.34 × 10−6m2 + 7.83 × 10−3m + 129. Figure 1 shows two isotopic distributions I1k and I2k of selenocysteine-containing polypeptides with m = 2000.

number of each atom X in the polypeptide is nX. For example, the average polypeptides of our two data sets with one and two selenocysteines are C285.59H439.09N71.66O87.85S3.27Se1 and C285.59H439.09N71.66O87.85S2.27Se2, respectively. We denote by I0 the monoisotopic peak intensity for a selenocysteine-containing polypeptide CnCHnHNnNOnOSnSSenSe. Since I0 is the monoisotopic peak intensity, there is no isotope in CnCHnHNnNOnOSnSSenSe, i.e., every atom X is X0, and thus I0 = P CnC0 P HnH0P NnN0P OnO0 P Sn0SPSenSe0

This formula means that the intensity is the existential probability of such a polypeptide. Now we present a theoretical model of the kth peak intensity for selenocysteine-containing polypeptides, Ik. The meaning of the kth peak is that the mass of the kth peak is bigger than the mass of the monoisotopic peak by k and it is caused by the isotopes of the atoms in the polypeptide. Lemma 1 shows a theoretical model of Ik. A detailed derivation of Lemma 1 and an example are given in Supporting Information Section S1. Lemma 1. The intensity Ik approximates to



Ik = I0

s + 2t 2 + 3t3+ 4t4 + 6t6 + 8t8= k

⎛ ⎞ T1k1T2k 2T4k4 ⎟ ⎜ ∑ ⎜ ⎟ ⎝ k1+ 2k 2 + 4k4 = s k1! k 2! k4! ⎠

nSe ⎞⎛ P 2 ⎞t2 ⎛ P 3 ⎞t3⎛ P 4 ⎞t4 ⎛ ⎟⎜ Se ⎟ ⎜ Se ⎟ ⎜ Se ⎟ ×⎜ t t t t t ⎝ 2 3 4 6 8 ⎠⎝ P Se0 ⎠ ⎝ P Se0 ⎠ ⎝ P Se0 ⎠

Intensity Ratio Functions

⎛ P 6 ⎞t6 ⎛ P 8 ⎞t8 × ⎜ Se ⎟ ⎜ Se ⎟ ⎝ P Se0 ⎠ ⎝ P Se0 ⎠

Using our model for isotopic distribution, we describe two kinds of isotopic peak intensity ratios: one for two adjacent peaks, Ik+1/Ik, and the other for two distant peaks. We first present some features of Ik+1/Ik and then show a method of computing the lower bound and the upper bound of Ik+1/Ik. At last we explain the other ratio. By Lemma 2, Ik+1/Ik is a formula of polynomials of degree k+1 over degree k. Figure 2 shows four cases of Ik+1/Ik. If mass m is sufficiently large such that m ≥ 5000, Ik+1/Ik approximates to a linear function. However, there exist selenocysteine-containing polypeptides with mass m < 5000, we cannot approximate to a linear function except I1/I0. For example,

where T1 =

∑ X∈A

nX P X1 , T2 = P X0

∑ X∈A

nX P X2 , and T4 = P X0

∑ X∈A

nX P X 4 P X0

and s, t2, t3, t4, t6, t8, k1, k2, and k4 are integers larger than or equal to 0. Since the polypeptides that we consider have only one or two selenocysteines, we give two kinds of models: one for polypeptides that contain only one selenocysteine and another for polypeptides that contain only two selenocysteines. We I61 I51

=

2.99 × 10−23m6 + 1.51 × 10−19m5 + 3.33 × 10−14m 4 + 2.67 × 10−10m3 + 3.68 × 10−6m2 + 1.17 × 10−3m + 49.6 3.32 × 10−19m5 + 1.12 × 10−15m 4 + 2.46 × 10−10m3 + 1.36 × 10−6m2 + 1.32 × 10−2m

is shown as the red line in Figure 2c. We can see that our theoretical model of isotopic peak intensity ratios fits very well with the simulated data. We compute the lower bound, Rlower(k,m), and the upper bound, Rupper(k,m), of Ik+1/Ik as follows. First, we divide the sample data set into partitions with a mass range 10. Second, we select the minimum ratio and the maximum ratio from each partition. Using these data, we can approximate the minimum ratio function, Rmin(k,m), and the maximum ratio function, Rmax(k,m), by least-squares fitting in gnuplot program. Note that Rmin(k,m) and Rmax(k,m) are formulas of degree p + 1 over degree p where p ≤ k and the average, Ravg(k,m), is also approximated using the entire data set. Finally, Rlower(k,m) and Rupper(k,m) are computed as follows.

R lower(k , m) = min{R avg(k , m) − c(R avg(k , m) − R min(k , m)), R avg(k , m) − d}

R upper(k , m) = max{R avg(k , m) + c(R max(k , m) − R avg(k , m)), R avg(k , m) + d}

where c and d are constant factors. For Rlower(k,m), Ravg(k,m) − c(Ravg(k,m) − Rmin(k,m)) means that the lower bound is c times farther than Rmin(k,m) from Ravg(k,m) (c is 5 in this paper). Ravg(k,m) − d means that the distance from Ravg(k,m) to the lower bound is always d (0.2 in this paper). Note that factor d is used when Rmin(k,m) is very close to or meets Ravg(k,m). A similar explanation holds for Rupper(k,m). In Figure 3, red lines represent Ravg(k,m) − c(Ravg(k,m) − Rmin(k,m)) and 4491

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Figure 3. Examples of bound functions for some isotopic intensity ratios (Ik+1/Ik). Red lines show Ravg(k,m) − c(Ravg(k,m) − Rmin(k,m)) and Ravg(k,m) + c(Rmax(k,m) − Ravg(k,m)), and blue lines Ravg(k,m) − d and Ravg(k,m) + d. (a) I11/I10, (b) I13/I12, (c) I16/I15, and (d) I18/I17.

Ravg(k,m) + c(Rmax(k,m) − Ravg(k,m)), and blue lines are for Ravg(k,m) − d and Ravg(k,m) + d. The other ratio we use is a ratio of two distant peak intensities, Ih+k+α/Ih−k−β where h is the most abundant peak number and k > 0. We set values of α and β (either 0 or 1) such that if Ih+1 is larger than Ih−1, α = 1 and β = 0, and α = 0 and β = 1 otherwise. Thus, this ratio is computed using two peak intensities Ih+k+α and Ih−k−β that are equidistant from the most abundant peak Ih and the higher intensity peak of Ih+1 and Ih−1. The average, SRavg(h,k,m), is approximated using the entire data set. The lower bound, Rlower(k,m), and the upper bound, Rupper(k,m), are computed as follows.

trometric data. Our algorithm consists of four steps: (1) peak picking, (2) pseudocluster identification and monoisotopic mass determination, (3) isotopic cluster identification, and (4) duplicate cluster removal. Note that a pseudocluster with charge state C is a cluster of peaks such that the m/z difference of every two adjacent peaks in the cluster is 1/C. Since our algorithm is based on RAPID,14 we explain only step 1, step 2, and step 3. Peak Picking

Peak picking consists of four steps: partitioning, threshold computation, smoothing, and peak selection. We first divide the raw spectrum into multiple partitions such that each partition has the same number of peaks. For each partition, we compute an intensity threshold as follows: First, we compute the average intensity of all peaks in a partition and multiply it by 5. Then, we compute the average of peaks whose intensities are less than the multiplied value. Finally, we multiply the average by some user defined value (3 in this paper) and use it as the intensity threshold. Computing intensity threshold in this manner is similar to a method used by an open source deconvolution tool, Decon2LS. 15 Peaks with intensities lower than the threshold are abandoned in the peak selection step. We use different thresholds for the partitions because substantial baseline intensity differences were

SR lower(h , k , m) = min{(1 − e)SR avg(h , k , m), SR avg(h , k , m) − f }

SR upper(h , k , m) = max{(1 + e)SR avg(h , k , m), SR avg(h , k , m) + f }

where e and f are constant factors (we use 0.15 in this paper). This ratio helps to solve −1 Da error. Algorithm Overview

We present a monoisotopic mass determination algorithm for selenocysteine-containing polypeptides from mass spec4492

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Table 2. Most Abundant Peak Numbers in Isotopic Clusters for Polypeptides That Contain Only One Selenocysteinea

a

Table 4. Threshold Values of the Minimum Number of Peaks for Isotopic Clusters of Selenocysteine-Containing Polypeptidesa

range of mass

peak number

range of mass

threshold

0−2004 2004−2828 2828−3133 3133−4556 4556−5781 5781−6848 6848−6898 6898−8413 8413−8872 8872−10745 10745−11045

6 6, 7 7 7, 8 8, 9 8, 9, 10 9, 10 9, 10, 11 9, 10, 11, 12 10, 11, 12 11, 12

0−2000 2000−3000 3000−4000 4000−5000 5000−6000 6000−7000 7000−8000 8000 ∼

7 8 8 9 10 11 12 12

a

It is based on the number of peaks for isotopic clusters without selenocysteine.

Note that the monoisotopic peak number is 0.

Table 5. Number of Distinct Monoisotopic Masses Determined by Our Method from Two Data Sets, ehS_110803_09 and ehS_100726_01a

Table 3. Most Abundant Peak Numbers in Isotopic Clusters for Polypeptides That Contain Only Two Selenocysteinesa range of mass 0−2909 2909−4114 4114−4124 4124−5662 5662−6261 6261−6936 6936−8319 8319−8451 8451−10528 10528−11091 a

ehS_110803_09 ehS_100726_01

peak number 12 12, 13 13, 13, 14, 14, 15, 15, 16,

our method

−1 Da

11 (98) 35 (105)

0 (0) 3 (4)

a

Note that the number in parentheses is the total number of monoisotopic masses.

13 14 14, 15 15 15, 16 16 16, 17 17

where p and q are the leftmost and the rightmost peak numbers, respectively, among isotopic peaks in a pseudocluster, and h is the most abundant peak number. N_ScoreR(p,q,m) and N_ScoreSR(p,q,h,m) are computed as follows.

Note that the monoisotopic peak number is 0.

NScoreR(p , q , m) =

observed between low m/z range and high m/z range. Then, we use a moving average method to smooth the raw spectrum. This smoothing step reduces many noisy clusters at the end of our algorithm. Finally, we select peaks from each partition with intensities higher than the threshold of the partition.

NScoreSR(p , q , h , m) = D min{q − h − α , h − p − β}

1 q−p

q−1

∑ ScoreR(k , m), k=p

min{q − h − α , h − p − β}



ScoreSR(h , k , m),

k=1

where α (≥0), β (≥0), and D are constants (D is 1 in this paper). ScoreR(k,m) is a ratio score of Ik and Ik+1, and ScoreSR(h,k,m) is a (split) ratio score. ScoreR(k,m) and ScoreSR(h,k,m) are computed as follows:

Pseudocluster Identification and Monoisotopic Mass Determination

We find triple peaks for each pseudocluster and then extend them as follows: First, we find the most abundant peak Ih in a pseudocluster. Since we know charge state C and m/z of Ih, we can easily compute the mass of Ih. Then, using the mass of Ih and Table 2 or Table 3, we can find Ih’s position. For example, when the mass of I1h is 4000, it is between 3133 and 4556 in Table 2, and thus I1h is either seventh peak or eighth peak in the isotopic cluster. In this case, we try to extend two pseudoclusters and then select a better one. Note that the monoisotopic peak is the zeroth peak. Second, we determine the monoisotopic mass m. Since we know the position of Ih and its mass, we can compute m as follows.

ScoreR(k , m) = ⎧ Ik + 1/Ik − R avg(k , m) ⎪1 − if Ik + 1/Ik > R avg(k , m) R ⎪ upper(k , m) − R avg(k , m) ⎪ ⎨ ⎪ Ik + 1/Ik − R avg(k , m) otherwise ⎪1 − ⎪ R lower(k , m) − R avg(k , m) ⎩ ScoreSR(h , k , m) = ⎧ I /I − SR avg(h , k , m) ⎪1 − h + k + α h − k − β if Ih + k + α /Ih − k − β SR upper(h , k , m) − SR avg(h , k , m) ⎪ ⎪ > SR avg(h , k , m) ⎨ ⎪ I /I − SR avg(h , k , m) ⎪1 − h + k + α h − k − β otherwise ⎪ SR lower(h , k , m) − SR avg(h , k , m) ⎩

m = mass of Ih − 1.00235 × position of Ih

Third, we can compute a score of a pseudocluster as follows.

Note that ScoreR(k,m) and ScoreSR(h,k,m) ranges from −∞ to 1 and that if Ik+1/Ik and Ih+k+α/Ih−k−β are within the boundary (between Rlower(k,m) and Rupper(k,m) and between SRlower(h,k,m) and SRupper(h,k,m)), ScoreR(k,m) > 0 and ScoreSR(h,k,m) > 0,

Score(p , q , h , m) = N_ScoreR(p , q , m) + N_ScoreSR(p , q , h , m) 4493

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Figure 4. Isotopic distributions of a selenopeptide of relatively small mass. (a) A theoretical model when a monoisotopic mass is 1965.8363 Da, (b) raw data from ehS_110803_09 with scan number 486, (c) MS scan after peak picking, and (d) an isotopic cluster determined by our method.

find some pseudoclusters that are indeed selenopeptides but contain experimental errors. The second rule, i.e., the number of peaks in a pseudocluster, helps the first rule. If a peptide’s mass is large, shapes of isotopic clusters of peptides with or without selenocysteines become similar to each other. However, for the same mass, the number of peaks in an isotopic cluster for selenocysteine-containing polypeptides is larger than that for polypeptides without selenocysteine. Thus to reduce the false positive, we identify isotopic clusters by selecting pseudoclusters that have enough peaks. Table 4 shows the threshold of the minimum number of peaks for the isotopic cluster for selenocysteine-containing polypeptides.

respectively. This score will be used to identify isotopic clusters in step 3. Fourth, we extend a pseudocluster to the left side and to the right side. Since we use Ih as a base peak, we add peaks to the left side of the pseudocluster while there exist peaks until the monoisotopic peak is added. To the right side, we try to extend the pseudocluster until there exist peaks and the intensity of the added peak is similar to that of the leftmost peak of the pseudocluster. Note that we compute the score of the pseudocluster N_ScoreR(p,q,m) whenever one peak is added. Finally, we determine a pseudocluster from isotopic peaks p to q with the best N_ScoreR(p,q,m).



Isotopic Cluster Identification

From the pseudoclusters, we find all isotopic clusters whose intensity patterns are similar to those of isotopic distributions. We use two rules to find isotopic clusters: the score in step 2 and the number of peaks in the pseudocluster. The first rule, i.e., the score in step 2, Score(p,q,h,m), evaluates shapes of intensity patterns. Since Score(p,q,h,m) is a normalized sum of ScoreR(k,m) and ScoreSR(h,k,m), one can use a score threshold around 0 (It is 0 in this paper). If we use a larger threshold, the shape of a pseudocluster would be closer to the shape of an ideal isotopic distribution, but we may fail to

RESULTS

Overview

To evaluate our method for selenocysteine-containing polypeptides, we used two experimental data sets, ehS_110803_09 and ehS_100726_01. The data set ehS_110803_09 consists of 3775 scans, and ehS_100726_01 consists of 4074 scans. Our software was executed on a 3.40 GHz Intel i7-2600 processor machine with 512MB RAM, running Windows XP OS. The overall results of our experiments are given in Table 5. From ehS_100803_09, we identified 98 isotopic clusters, which consist of 11 distinct monoisotopic masses. From ehS_100726_01, 4494

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Figure 5. Isotopic Distributions of a selenopeptide of relatively large mass. (a) A theoretical model when a monoisotopic mass is 4128.8973 Da, (b) raw data from ehS_100726_01 with scan number 529, (c) MS scan after peak picking, and (d) an isotopic cluster determined by our method.

From this, our method could analyze the peptide feature shown in Figure 4b and determine that this peptide includes one selenium and its monoisotopic mass is 1965.74 Da. We made an inference from 1965.74 Da that the sequence is “DVIVYTSNTU*PHC*FTVK”. Figure 5a shows a theoretical distribution for a peptide with 4128.8973 Da as the monoisotopic mass, and Figure 5b shows the corresponding MS scan from ehS_100726_01. Figure 5c shows a distribution of the selected peaks from the scan shown in Figure 5b after peak picking. Figure 5d shows a distribution of the peaks selected by our algorithm. From the peptide feature shown in Figure 5b, we can determine that this peptide includes 2 selenium atoms and its monoisotopic mass is 4128.85 Da. We made an inference from 4128.85 Da that the sequence is “EVIVYTSNTU*PHSFTVK-AKEVIVYTSNTU*PHSFTVK”.

we found 105 isotopic clusters, consisting of 35 distinct monoisotopic masses (Table 5). Among the 105 clusters, there are four problematic cases of −1 Da monoisotopic mass. We deal with these cases at the end of this chapter. For each selenopeptide, we first show a distribution of a theoretical model based on the natural abundances of isotopes of C, H, N, O, S, and Se. Then, we give a distribution from an experimental MS scan corresponding to that of a theoretical model. After showing peaks from peak picking results, we present a distribution of the peaks selected by our algorithm. We use two typical examples: one is a selenopeptide of relatively small mass (4000 Da). Figures 4 and 5 show their distributions. A distribution of selenopeptide of relatively small mass is distinctively different from regular peptides that do not contain Se, because of the unusual natural abundances of Se isotopes. A distribution of selenopeptide of relatively large mass is similar to a normal distribution. Figure 4a shows a theoretical distribution for a peptide with 1965.8363 Da as the monoisotopic mass, and Figure 4b shows the corresponding MS scan from ehS_110803_9. Note that Figure 4a and b are very similar to each other. Figure 4c shows a distribution of the selected peaks from the scan shown in Figure 4b after peak picking. Figure 4d shows a distribution of the peaks selected by our algorithm.

MS/MS Spectrum of Selenopeptide

We analyzed MS/MS spectra of precursor ions that our method predicted as selenopeptides, and the analysis was conducted by DBond software.18 Figure 6 shows two identifications and their annotated MS/MS spectra. First, our method predicted a 4+ precursor ion including two Sec at 1033.24 m/z, and its resultant identification was an interdiselenide-linked form, “EVIVYTSNTU*PHSFTVK-AKEVIVYTSNTU*PHSFTVK”, as shown in Figure 6a. We scrutinized 4495

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Figure 6. Annotated MS/MS spectra of selenopeptides, (a) interdiselenide-linked peptide, “EVIVYTSNTU*PHSFTVK-AKEVIVYTSNTU*PHSFTVK” and (b) intraselenenylsulfide-linked peptide, “DVIVYTSNTU*PHC*FTVK”. Matched peaks are shown in color, and especially, ions that contain Sec are shown in blue. PAy17-ion including two Sec (a) and y13-ion including a single Sec (b) given in inset boxes show excellent agreement between isotopic distribution patterns of fragment ions from observation and calculation. Ion annotations used for the interpeptide form: “PA”, one strand (top) of a dipeptide; “pb”, the other strand (bottom) of a dipeptide; capital letters, fragment ions from peptide PA; small letters, fragment ions from peptide pb.

Unexpected Examples

the isotopic distributions of the fragment ions including two Sec in the MS/MS spectrum and confirmed that the patterns agreed with the calculated ones. As an example, the small box compares the observed and calculated isotopic distributions for PAy173+-ion. Next, the MS/MS analysis was conducted for a triply charged precursor ion predicted to include one Sec at 658.3 m/z, resulting in an identification of an intraselenenylsulfide-linked peptide, “DVIVYTSNTU*PHC*FTVK”, as shown in Figure 6b. The isotopic distributions of fragment ions containing the linkage were also found to agree well with those calculated by their corresponding sequences, confirming that its precursor ion contains one Sec. These demonstrate the utility of our method for selenopeptide detection.

In the data set ehS_100726_01, except inaccurate data due to low signal intensity, there are four problematic cases of −1 Da error shown in Figure 7. Upon comparing Figure 7a,b with 5b, we can see that the peak intensities on the left-hand side of the most abundant peak at m/z = 1036.47 of Figure 7a,b are slightly larger than those of Figure 5b, and the peak intensities on the right-hand side are slightly smaller than those of Figure 5b. Overall, the isotopic distribution in Figure 7a,b are a lot smoother than what is shown in Figure 5b, and this led to an incorrect result. Figure 7c,d shows another case where an isotopic distribution overlapped with other isotopic distributions causes an incorrect result. 4496

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research

Article

Figure 7. Isotopic distributions of a selenopeptide showing −1 Da error. (a) Raw data from ehS_100726_01 with scan number 455, (b) raw data from ehS_100726_01 with scan number 456, (c) raw data from ehS_100726_01 with scan number 596, and (d) raw data from ehS_100726_01 with scan number 384.

When masses of selenopeptides are larger than 3000 Da, shapes of isotopic distributions of polypeptides that contain one or two selenocysteines are very similar to each other and cannot be confidently determined solely based on MS scan. Thus our program reports both results for large masses.

Table 6. Manual Verification of Our Algorithm for False Negativesa

Manual Verification

ehS_110803_09 ehS_100726_01

For some MS scans, we have manually verified whether there are any false negatives. We selected scans with large intensity and whose most abundant peak intensity is larger than 100 and manually verified those MS scans by two human experts. EhS_110803_09 consists of 3775 scans, and the average intensity of MS scans between scan number 467 and 584 (11−15 min) is large (intensity of base peak >100). We verified 13 scans after rejecting those scans not selected by InsilicosViewer software. EhS_100726_01 consists of 4074 scans, and the average intensity of MS scans between scan number 335 and 1983 (10−60 min) is large (intensity of base peak >200). We verified 134 scans after discarding those scans not selected by InsilicosViewer software. Table 6 shows the verification results. There is no false negative for ehS_110803_09, and there are 7 false negatives for ehS_100726_01. These cases are caused by overlapping isotopic clusters.

# selenopeptides by human eyes (most abundant peak intensity >100)

# selenopeptides by our algorithm among those by human eyes

33 60

33 53

a

MS scans with large intensity and the most abundant peak intensity larger than 100 were selected and manually verified by two human experts.



DISCUSSION We presented an algorithm to determine monoisotopic masses of selenocysteine-containing polypeptides. Our algorithm is based on a theoretical model for an isotopic distribution of a selenopeptide. Our algorithm successfully identified the Sec− Cys and Sec−Sec cross-linking of glutaredoxin 1 (GRX1) from mass spectra obtained by UPLC-ESI-q-TOF instrument. Since it is the first result for selenocysteine-containing polypeptides, it is interesting to extend other algorithms15,19 for selenopeptides and compare with this result. In addition, to reduce false negatives, we will focus on models for overlapping isotopic clusters. 4497

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498

Journal of Proteome Research



Article

(14) Park, K.; Yoon, J. Y.; Lee, S.; Paek, E.; Park, H.; Jung, H. J.; Lee, S. W. Isotopic peak intensity ratio based algorithm for determination of isotopic clusters and monoisotopic masses of polypeptides from high-resolution mass spectrometric data. Anal. Chem. 2008, 80 (19), 7294−7303. (15) Jaitly, N.; Mayampurath, A.; Littlefield, K.; Adkins, J.; Anderson, G.; Smith., R. Decon2LS: An open-source software package for automated processing and visualization of high resolution mass spectrometry data. BMC Bioinf. 2009, 10 (1), 87. (16) Kim, M. J.; Lee, B. C.; Jeong, J.; Lee, K. J.; Hwang, K. Y.; Gladyshev, V. N.; Kim, H. Y. Tandem use of selenocysteine: adaptation of a selenoprotein glutaredoxin for reduction of selenoprotein methionine sulfoxide reductase. Mol. Microbiol. 2011, 79 (5), 1194−1203. (17) Seo, J.; Jeong, J.; Kim, Y. M.; Hwang, N.; Paek, E.; Lee, K. J. Strategy for comprehensive identification of post-translational modifications in cellular proteins, including low abundant modifications: application to glyceraldehyde-3-phosphate dehydrogenase. J. Proteome Res. 2008, 7 (2), 587−602. (18) Choi, S.; Jeong, J.; Na, S.; Lee, H. S.; Kim, H.-Y.; Lee, K.-J.; Paek, E. New algorithm for the identification of intact disulfide linkages based on fragmentation characteristics in tandem mass spectra. J. Proteome Res. 2010, 9 (1), 626−635. (19) Zhang, J.; Gao, W.; Cai, J.; He, S.; Zeng, R.; Chen, R. Predicting molecular formulas of fragment ions with isotope patterns in tandem mass spectra. IEEE/ACM Trans. Comput. Biol. Bioinf. 2005, 2 (3), 217−230.

ASSOCIATED CONTENT

S Supporting Information *

Derivation of Lemma 1 (S1), I1k and I2k (S2), and Derivation of Lemma 2 (S3). This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*(K.P.) Tel.: +82-2-880-8381. Fax: +82-2-885-3141. E-mail: [email protected]. (E.P.) Tel.: 82-2-2220-2377. Fax: 82-22220-1723. E-mail: [email protected]. Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS This work was supported by 21C Frontier Functional Proteomics Project from Korean Ministry of Education, Science & Technology (FPR08-A1-020 and FPR08-A1-021) and by the National Research Foundation of Korea (NRF) grant funded by the Korea Government (MEST) (No. 2011-0006244 and No. 2011-0026496) and the Center for Cell Signaling & Drug Discovery Research at Ewha Womans University. S. Na was supported by Brain Korea 21 (BK21) Project.



REFERENCES

(1) Kryukov, G. V.; Castellano, S.; Novoselov, S. V.; Lobanov, A. V.; Zehtab, O.; Guigó, R.; Gladyshev, V. N. Characterization of mammalian selenoproteomes. Science 2003, 300 (5624), 1439−1443. (2) Takebe, G.; Yarimizu, J.; Saito, Y.; Hayashi, T.; Nakamura, H.; Yodoi, J.; Nagasawa, S.; Takahashi, K. A comparative study on the hydroperoxide and thiol specificity of the glutathione peroxidase family and selenoprotein P. J. Biol. Chem. 2002, 277 (43), 41254−41258. (3) Toppo, S.; Flohé, L.; Ursini, F.; Vanin, S.; Maiorino, M. Catalytic mechanisms and specificities of glutathione peroxidases: variations of a basic scheme. Biochim. Biophys. Acta 2009, 1990 (11), 1486−1500. (4) Arnér, E. S. Focus on mammalian thioredoxin reductases important selenoproteins with versatile functions. Biochim. Biophys. Acta 2009, 1790 (6), 495−526. (5) Kim, H. Y.; Gladyshev, V. N. Different catalytic mechanisms in mammalian selenocysteine- and cysteine-containing methionine-Rsulfoxide reductases. PLoS Biol. 2005, 3 (12), e375. (6) Kuiper, G. G.; Kester, M. H.; Peeters, R. P.; Visser, T. J. Biochemical mechanisms of thyroid hormone deiodination. Thyroid 2005, 15 (8), 787−798. (7) Martin, J. L. Thioredoxina fold for all reasons. Structure 1995, 3 (3), 245−250. (8) Shchedrina, V. A.; Zhang, Y.; Labunskyy, V. M.; Hatfield, D. L.; Gladyshev, V. N. Structure-function relations, physiological roles, and evolution of mammalian ER-resident selenoproteins. Antioxid. Redox Signaling 2010, 12 (7), 839−849. (9) Steen, H.; Mann, M. The ABC’s (and XYZ’s) of peptide sequencing. Nat. Rev. Mol. Cell Biol. 2004, 5 (9), 699−711. (10) Lopez Heras, I.; Palomo, M.; Madrid, Y. Selenoproteins: the key factor in selenium essentiality. State of the art analytical techniques for selenoprotein studies. Anal. Bioanal. Chem. 2011, 400 (6), 1717−1727. (11) Beld, J.; Woycechowsky., K. J.; Hilvert, D. Selenoglutathione: efficient oxidative protein folding by a diselenide. Biochemistry 2007, 46 (18), 5382−5390. (12) Ma, S.; Hill, K. E.; Burk, R. F.; Caprioli, R. M. Mass spectrometric determination of selenenylsulfide linkages in rat selenoprotein P. J. Mass Spectrom. 2005, 40 (3), 400−404. (13) Shchedrina, V. A.; Novoselov, S. V.; Malinouski, M. Y.; Gladyshev, V. N. Identification and characterization of a selenoprotein family containing a diselenide bond in a redox motif. Proc. Natl. Acad. Sci. U. S. A. 2007, 104 (35), 13919−13924. 4498

dx.doi.org/10.1021/pr300232y | J. Proteome Res. 2012, 11, 4488−4498