Optimal de novo Design of MRM Experiments for Rapid Assay Development in Targeted Proteomics

Andreas Bertsch,*,† Stephan Jung,‡ Alexandra Zerck,§ Nico Pfeifer,†,| Sven Nahnsen,†,‡ Carsten Henneges,† Alfred Nordheim,⊥,‡ and Oliver Kohlbacher†

Center for Bioinformatics, Eberhard-Karls-Universität Tübingen, Germany, Proteome Center Tübingen, Eberhard-Karls-Universität Tübingen, Germany, Max Planck Institute for Molecular Genetics, Berlin, Germany, and Interfaculty Institute for Cell Biology, Eberhard-Karls-Universität Tübingen, Germany

Received February 28, 2010
Targeted proteomic approaches such as multiple reaction monitoring (MRM) overcome problems associated with classical shotgun mass spectrometry experiments. Developing MRM quantitation assays can be time-consuming, because relevant peptide representatives of the proteins must be found and their retention times and product ions must be determined. Once the transitions are known, hundreds to thousands of them can be scheduled into a single experimental run. However, it is difficult to select which of the transitions should be included in a measurement. We present a novel algorithm that allows the construction of MRM assays from the sequences of the targeted proteins alone. This enables the rapid development of targeted MRM experiments without large libraries of transitions or peptide spectra. The approach relies on combinatorial optimization in combination with machine learning techniques to predict proteotypicity, retention time, and fragmentation of peptides. The resulting potential transitions are scheduled optimally by solving an integer linear program. We demonstrate that fully automated construction of MRM experiments from protein sequences alone is possible and that over 80% coverage of the targeted proteins can be achieved without further optimization of the assay.

Keywords: SRM • MRM • ILP • OpenMS • prediction
* To whom correspondence should be addressed. E-mail: bertsch@informatik.uni-tuebingen.de.
† Center for Bioinformatics, Eberhard-Karls-Universität Tübingen.
‡ Proteome Center Tübingen, Eberhard-Karls-Universität Tübingen.
§ Max Planck Institute for Molecular Genetics.
| eScience Research Group, Microsoft Research, Los Angeles, California.
⊥ Interfaculty Institute for Cell Biology, Eberhard-Karls-Universität Tübingen.

Journal of Proteome Research 2010, 9, 2696–2704. Published on Web 03/04/2010. 10.1021/pr1001803 © 2010 American Chemical Society

Introduction

Shotgun mass spectrometry has become a state-of-the-art method in proteomic research. HPLC-MS and subsequent data-driven acquisition of tandem mass spectra, fragmenting the top n peptide signals in each MS spectrum, is the standard workflow in many studies.1,2 This workflow produces large numbers of quantifiable peptides, although the sensitivity is limited. Biologically interesting medium- or low-abundance proteins are often not detected with this approach. Targeted proteomics has become a valuable alternative, which can increase the sensitivity and the reproducibility of protein detection and identification. Multiple reaction monitoring (MRM; also SRM, selected reaction monitoring) has become the method of choice to quantify peptides specific for a given set of proteins.3,4

Several difficulties are associated with MRM-based methods. The target peptides, their precursor ion m/z values, and the product ion m/z values that can be observed must be known in advance. Typically, they are determined experimentally. New peptides cannot be identified using MRM, because only the transition of the precursor ion to a product ion is monitored using narrow m/z windows. Therefore, interferences with other ions occurring at the same time are difficult to detect.5 Given a protein, it is also not clear which potential peptides should be chosen for inclusion in an MRM experiment. This can be solved by using proteotypic peptides,6–9 which uniquely represent a specific protein. For this study, we define a proteotypic peptide to be unique for a protein with respect to a given proteome and detectable by the mass spectrometer. In principle, these peptides can be systematically determined for all proteins of an organism;10 however, this approach is rather expensive. It is thus desirable to construct MRM assays de novo, that is, from the protein sequences alone.

In MRM, a mass filter selects the precursor m/z value, and after collision-induced dissociation (CID) a specific product ion m/z value is monitored. The pair of m/z values is called a transition. The number of transitions that are monitored in a single MRM run can be increased substantially by restricting each transition to a time window. Therefore, the retention times of the peptides need to be known. Once the precursor m/z, the product m/z values, and the retention time are known, transitions for a peptide can be generated. Given a large number of transitions of different peptides/proteins, the transitions need to be arranged in an experiment such that the measurement time is used efficiently. If proteotypic peptides, retention times, and suitable transition m/z values for all targeted proteins are known, the next step is the design of the MRM experiment. The number of MRM transitions that can be monitored in parallel also limits the number of peptides, and thus proteins, that can be observed. One aims at complete coverage of the targeted proteins, while also maximizing the number of peptides that are observed to increase accuracy and reproducibility.

In this work, we present a novel method for the optimal de novo design of targeted MRM experiments based on the protein sequences alone. Apart from a simple calibration run (e.g., a protein mix) to determine the properties of the chromatographic system, no further experimental data are required. Our approach is based on machine learning methods and combinatorial optimization. Machine learning methods predict peptide proteotypicity,7,11,12 peptide retention times,13 and suitable product ions14 for MRM transitions. From the total set of suitable transitions, we then formulate the MRM scheduling problem, which optimizes the measurement schedule with respect to protein and peptide coverage while ensuring that each peptide is covered by a minimum number of MRM transitions. At the same time, an optimal design also makes optimal use of instrument measurement time by scheduling as many transitions as possible.

We describe training and evaluation of the prediction methods used in this work. The scheduling problem is formally described as an integer linear program (ILP).15 The transition selection problem has been shown to be NP-complete,16 although that work used simpler assumptions for the selection of the transitions. We found that most real-world instances of the problem can nevertheless be solved in an acceptable time. We also present a simplified ILP formulation and a greedy approach that can both yield similar, albeit suboptimal, results in a shorter time. We validate that the scheduling is able to generate scheduled MRM (sMRM) experiments using two different experimental studies.
First, we use a mixture of known proteins to generate a list of transitions for the 50 protein sequences, from which the ILP selects the transitions that form one sMRM experiment. The occupation of the experiment with transitions, as well as the “value” of the generated transition lists, are used as criteria to compare the transition list with those of other scheduling approaches. Two sMRM experiments are then acquired using the transition lists of the ILP and a greedy scheduling algorithm to evaluate the performance of the transition schedules by counting the number of transitions, peptides, and proteins that show acceptable signals in the experimental data. The performance evaluation shows that, as expected, about half of the transitions work out of the box.

The software described in this paper is open-source and will be made publicly available in the next release of our software package OpenMS/TOPP.17,18
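The effect of time window restrictions on the number of transitions per run, as discussed in this section, can be illustrated with a small sketch. It is not part of the published software: the retention times, the window half-width `delta`, the half-open window convention, and the function name `peak_concurrency` are assumptions made purely for illustration. The sketch counts how many transitions are active at the same time when each one is monitored only in a window around its predicted retention time, and compares this with the load of an unscheduled run in which every transition is monitored throughout.

```python
from typing import List, Tuple


def peak_concurrency(retention_times: List[float], delta: float) -> int:
    """Return the largest number of windows [rt - delta, rt + delta) that overlap."""
    events: List[Tuple[float, int]] = []
    for rt in retention_times:
        events.append((rt - delta, +1))  # window opens
        events.append((rt + delta, -1))  # window closes (half-open interval)
    # Sort by time; at equal times closings (-1) come before openings (+1),
    # so windows that merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    active = 0
    peak = 0
    for _, change in events:
        active += change
        peak = max(peak, active)
    return peak


# Hypothetical predicted retention times (minutes), one entry per transition.
example_rts = [10.0, 10.5, 11.0, 25.0, 25.2, 40.0, 55.0, 55.1, 55.3, 70.0]
example_delta = 1.0  # assumed half-width of the monitoring window (minutes)

unscheduled_load = len(example_rts)  # every transition monitored all the time
scheduled_load = peak_concurrency(example_rts, example_delta)
print(unscheduled_load, scheduled_load)
```

With these made-up values, at most three transitions are active at any time, although the run contains ten transitions in total. The instrument capacity therefore constrains only the peak load, not the total number of transitions.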
Methods

MRM Scheduling Problem. The problem formulation assumes that we are given a fixed set of protein sequences. We further assume that we have prediction models for (a) the proteotypicity of each tryptic peptide of each of the protein sequences, (b) each peptide’s retention time, and (c) the product ion intensities for a given peptide sequence. Each of these models can, of course, be replaced by experimental data (e.g., known proteotypic peptides, measured retention times, measured MS/MS spectra). The provenance of the data (model or experiment) does not affect the optimization problem based thereupon. We intend to make the method presented here as universally applicable as possible. Consequently, we integrated machine learning-based methods for proteotypicity, retention time, and fragmentation. We are aware that each of these methods has only limited accuracy. We will show, however, that the accuracy of these models is more than sufficient to enable ab initio construction of targeted MRM assays. The models employed are described in more detail below.

More formally, we assume that we are given k protein sequences $S = \{s_1, \ldots, s_k\}$. Each of the proteins contains one or more tryptic peptides. The union of all peptides is given by $P = \{p_1, \ldots, p_m\}$. For each peptide $p_i$ we can predict
• the retention time $RT(p_i)$,
• the probability of being observed (proteotypicity) $PT(p_i)$, and
• a list of product ion intensity values $FI(p_i)$.
Each of these properties is derived from a machine learning model. The product ion list $FI(p_i)$ contains possible product ion transition masses for a given peptide $p_i$ along with their intensities. In order to observe these transitions, the peptide has to be scheduled for measurement in a time slot covering the elution time of the peptide. The set of all possible transitions is denoted as $T = \{t_1, \ldots, t_l\}$, where each transition t is assigned a peptide p(t) and a product ion mass m(t). For each peptide, a set of transitions that represent that peptide is defined:

$$\forall p \in P:\ T_p \subseteq T, \quad \text{where } \forall t \in T_p:\ t \text{ is generated from peptide } p \tag{1}$$

Similarly, we define a set of peptides for each protein:

$$\forall s \in S:\ P_s \subseteq P, \quad \text{where } \forall p \in P_s:\ p \text{ is a peptide from protein } s \tag{2}$$

On the basis of the tolerance of the retention time model and to capture the whole elution profile of the transition, we reserve time slots of width $2\delta$ per transition ($\delta$ represents the tolerance of the retention time). These time slots are in the interval $[RT(p_i) - \delta,\ RT(p_i) + \delta)$ centered around the retention time of the peptide $RT(p_i)$. An overview of the basic setup of the scheduling problem is given in Figure 1. Due to instrument limitations, only a certain number of transitions can be monitored in parallel.
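As a minimal sketch of the notation introduced above, the sets of transitions per peptide (eq 1), the sets of peptides per protein (eq 2), and the reserved time window of a transition could be represented as follows. All class names, function names, peptide sequences, and numeric values are hypothetical placeholders and are not taken from the paper or from OpenMS; in practice RT, PT, and FI would come from the prediction models or from experimental data.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Peptide:
    sequence: str
    protein: str           # identifier of the protein s containing the peptide
    rt: float              # predicted retention time RT(p), in minutes
    proteotypicity: float  # predicted probability of observation PT(p)


@dataclass(frozen=True)
class Transition:
    peptide: Peptide       # p(t)
    product_mz: float      # m(t)
    intensity: float       # predicted product ion intensity taken from FI(p)


def transitions_by_peptide(transitions: List[Transition]) -> Dict[str, List[Transition]]:
    """Group transitions by the peptide they were generated from (the sets T_p)."""
    groups: Dict[str, List[Transition]] = defaultdict(list)
    for t in transitions:
        groups[t.peptide.sequence].append(t)
    return dict(groups)


def peptides_by_protein(peptides: List[Peptide]) -> Dict[str, List[Peptide]]:
    """Group peptides by the protein they belong to (the sets P_s)."""
    groups: Dict[str, List[Peptide]] = defaultdict(list)
    for p in peptides:
        groups[p.protein].append(p)
    return dict(groups)


def time_window(t: Transition, delta: float) -> Tuple[float, float]:
    """Half-open window [RT - delta, RT + delta) reserved for a transition."""
    return (t.peptide.rt - delta, t.peptide.rt + delta)


def covers(t: Transition, delta: float, time_point: float) -> bool:
    """True if the reserved window of transition t contains time_point."""
    start, end = time_window(t, delta)
    return start <= time_point < end


# Made-up example data (placeholder sequences, m/z values, and predictions).
pep_a = Peptide("PEPTIDEAK", "protA", rt=30.0, proteotypicity=0.9)
pep_b = Peptide("PEPTIDEBR", "protA", rt=42.5, proteotypicity=0.7)
pep_c = Peptide("PEPTIDECK", "protB", rt=31.0, proteotypicity=0.8)
transitions = [
    Transition(pep_a, 595.3, 1.0),
    Transition(pep_a, 708.4, 0.6),
    Transition(pep_b, 651.3, 0.8),
    Transition(pep_c, 575.3, 0.9),
]

T_p = transitions_by_peptide(transitions)
P_s = peptides_by_protein([pep_a, pep_b, pep_c])
window_a = time_window(transitions[0], delta=2.0)
print(len(T_p["PEPTIDEAK"]), [p.sequence for p in P_s["protA"]], window_a)
```

Keeping the window half-open means that a transition whose window ends at a given time point and another whose window starts there are not counted as overlapping.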
The transition capacity C, that is, the maximum number of transitions that can be monitored in parallel, is defined by the experimental setup. It may not be exceeded at any time. Hence, the number of transitions covering any time slot must be less than or equal to C. To this end, we define sets $TS_i$ containing all transitions covering time slot i:

$$\forall\, 1 \le i \le N:\ TS_i = \{\, t \in T \mid RT(p(t)) - \delta \le i < RT(p(t)) + \delta \,\} \tag{3}$$
where N is the number of time slots of the experiment. The MRM scheduling problem maximizes the number of transitions observed over all time slots subject to several constraints:
• each peptide must be covered by at least $\tau$ transitions, and
• at most C transitions can be observed in parallel in each time slot.
Simultaneously, we need to optimize coverage of the proteins with peptides. A varying number of proteotypic peptides per protein complicates the formulation of constraints; each protein should be covered by at least F peptides.

From the proteotypicity prediction we obtain a likelihood of the peptide (and its transitions) to be observed. The fragmentation model similarly yields an intensity for each product ion mass. We combine both estimates into a joint detectability $d_t$ for a transition t. By maximizing the overall detectability we thus prefer peptides with high proteotypicity over those that are less likely to be observed. Similarly, we prefer transitions based on high-intensity product ions over those with lower intensities. To obtain an optimal coverage of proteins, we use a hybrid objective function penalizing proteins with a small number of peptides. In this way, we can balance the relative importance of protein coverage, peptide coverage, and transition coverage.

An optimal transition schedule maximizes an objective function accounting for the overall detectability of the transitions while including penalty terms for insufficient coverage of peptides and proteins. The output is then a scheduled transition list. Each entry is composed of three values: retention time, peptide m/z, and product m/z.

ILP Formulation. The MRM scheduling problem is a combinatorial problem that we can rewrite as an ILP. We introduce binary decision variables $x_t$ representing each of the transitions $t \in T$. $x_t$ is set to one if the corresponding transition t is contained in the schedule and to zero otherwise. We also introduce variables $y_p$ for each peptide $p \in P$. $y_p$ is set to one if peptide p is not covered by at least $\tau$ transitions and to zero otherwise. Similarly, for each protein sequence $s \in S$, F binary variables $z_s^j$ are introduced. A variable $z_s^j$ is set to one if protein sequence s is not even represented by j peptides and to zero if it is covered by at least j peptides. $w_p$ and $w_s$ are constants, set to, for example, 1 and 10, respectively. The MRM scheduling problem can be formulated as an ILP as follows:
$$\text{maximize} \quad \sum_{t \in T} x_t d_t \;-\; w_p \sum_{p \in P} y_p \;-\; w_s \sum_{s \in S} \; \sum_{0 \le j \,\ldots} z_s^j \,(F - j)$$