
Article pubs.acs.org/jpr

Improvements in Mass Spectrometry Assay Library Generation for Targeted Proteomics

Johan Teleman,†,‡,# Simon Hauri,*,†,# and Johan Malmström*,†

† Department of Clinical Sciences, Lund University, BMC D13, 221 84 Lund, Sweden
‡ Department of Immunotechnology, Lund University, Medicon Village (Building 406), 223 81 Lund, Sweden

Supporting Information

ABSTRACT: In data-independent acquisition mass spectrometry (DIA-MS), targeted extraction of peptide signals in silico using mass spectrometry assay libraries is a successful method for the identification and quantification of proteins. However, it remains unclear if high quality assay libraries with more accurate peptide ion coordinates can improve peptide target identification rates in DIA analysis. In this study, we systematically improved and evaluated the common algorithmic steps for assay library generation and demonstrate that increased assay quality results in substantially higher identification rates of peptide targets from mouse organ protein lysates measured by DIA-MS. The introduced changes are (1) a new spectrum interpretation algorithm, (2) reapplication of segmented retention time normalization, (3) a ppm fragment mass error matching threshold, (4) usage of internal peptide fragments, and (5) a multilevel false discovery rate calculation. Taken together, these changes yielded 14−36% more identified peptide targets at 1% assay false discovery rate and are implemented in three new open source tools, Fraggle, Tramler, and Franklin, available at https://github.com/fickludd/eviltools. The improved algorithms provide ways to better utilize discovery MS data, translating to substantially increased DIA performance and ultimately better foundations for drawing biological conclusions in DIA-based experiments.

KEYWORDS: proteomics, mass spectrometry, spectral library generation, TraML, assay quality, data-independent acquisition, DIA, SWATH, algorithm, software

Received: October 25, 2016. Published: May 18, 2017. © 2017 American Chemical Society. DOI: 10.1021/acs.jproteome.6b00928. J. Proteome Res. 2017, 16, 2384−2392.

■

INTRODUCTION

In the life sciences, mass spectrometry is leveraged to acquire broad-range maps of biomolecules in biological and medical samples. Over recent decades, the identification and quantification of proteins has become common through discovery mass spectrometry (shotgun MS), where proteins are digested into peptides, which are separated by online reversed-phase liquid chromatography and identified using tandem mass spectrometry.1 Shotgun MS excels at large-scale identification of the proteins in a sample, but because of its sampling strategy, variable peptide identifications are commonly observed in each injection.2 This stochasticity results in frequent missing values, especially for low-signal peptides, which is a growing problem as many studies are focusing on statistical power for determining quantitative differences between biological conditions rather than cataloguing protein contents.

Several alternative MS workflows exist that alleviate this missing value problem, most prominently targeted MS methods such as selected or parallel reaction monitoring (SRM,3−5 PRM6) and, more recently, data-independent acquisition (DIA).7−12 In DIA mass spectrometry, the precursor ion range is divided into predetermined subranges, each of which is fragmented and scanned as a whole (reviewed in Bilbao et al.13). For example, in a popular formulation named SWATH, the MS range of 400−1200 m/z is divided into 32 subranges of 25 m/z each.14 Peptides within the same subrange are cofragmented and measured simultaneously, generating highly chimeric DIA fragment spectra. Therefore, DIA data are incompatible with traditional MS search strategies unless deconvoluted into pseudoshotgun MS spectra by some sophisticated algorithm.15

Another recently proposed successful approach is to perform targeted data analysis.14,16 In this approach, peptides are identified and quantified by prior knowledge of their chemical properties. Each peptide ion can be encoded by MS coordinates describing its precursor mass, charge state, retention time, and specific fragmentation pattern. An efficient way of acquiring peptide coordinates is through the collection of shotgun MS results, which are converted into a spectral library where all peptide ions are represented by a reference spectrum. The spectral library can be further simplified by defining peptide ion-specific mass spectrometry assays, retaining only the most intense matching fragment ions obtained from each peptide reference spectrum. Individual assays are combined into an assay library, an enriched subset of the original spectral library outlining peptide coordinates and used to extract and score peptide ions (targets) in silico from DIA data. Similar to transition lists of targeted MS methods such as SRM, assay libraries can be stored in the standardized traML format.17

The hypothesis of this study is that assays with more accurate peptide ion coordinates are of higher quality and perform better in DIA data analysis. By this definition, assay quality is independent of library size, and it is worth noting that assay quality likely differs between instrument types due to inherent differences in fragmentation techniques and instrument geometries.18,19 Several studies have addressed improving both DIA acquisition methods and data analysis strategies,15,16,20−23 but little focus has been put on assay library generation for targeted analysis. A seminal paper on the subject by Schubert et al. described a detailed protocol for how shotgun MS data should be acquired, searched, and processed into assays.24 However, their study does not include performance evaluation of the generated assay library, and it remains unknown whether this workflow yields the highest assay quality possible. Additionally, the protocol has only been demonstrated on a Q-TOF-type instrument, although Q-Orbitrap instruments are equally capable of performing DIA-MS in a SWATH-like manner.11,25

In this study, we explore assay generation from a Q-Orbitrap perspective and introduce five primary changes to the workflow to improve the assay quality. These five changes are captured in three open source software tools: Fraggle, Tramler, and Franklin. Fraggle performs primary interpretation of fragment spectra based on shotgun MS search results and combines these interpretations into spectral libraries. Tramler generates assays by trimming the spectral library entries to only contain the most abundant fragments and stores the assays in a traML formatted assay library. Tramler also creates decoy assay libraries, supports further filtering and trimming of traML files, and provides a transition list text export option. Franklin performs novel multilevel FDR calculations to allow more permissive assay inclusion while maintaining protein and peptide FDR thresholds. To evaluate the effects of the introduced changes, we compared the number of successfully quantified targets in DIA data analysis. The cumulative effect of all five changes was up to 36% more quantified targets in protein lysates from four mouse organs.
We also demonstrate a similar performance increase on Q-TOF instrument data and show that the workflow is compatible with alternative shotgun MS search strategies.

mapping from injection retention time to iRT using one of several implemented mapping strategies, for example, regular linear regression, an iterative robust linear regression, or segmented linear interpolation from the median empirical retention time.30 Using the iRT mapping, all observation retention times are converted to iRT; observations are stored, and the interpretation is finished.

An equally straightforward algorithm was employed to combine observations into consensus assays. All observations are grouped by their assays, which are uniquely defined by UniMod38-encoded peptide sequence (see Supplemental Methods), peptide charge, fragmentation method, and collision energy. For each such assay and observed fragment ion, the normalized intensities from the observations are averaged. In the same spirit, the peptide ion iRT is averaged from the observations, as well as metadata such as measured precursor m/z and intensity. The maximal observed value is kept for PSM scores and q-values, and the fragment base intensities are summed. This concludes the “combine” step, and the combined assays are written to disk.

To allow fast and space-efficient manipulation of primary and combined observations, we implemented a custom binary file format using Google Protocol Buffers39 that we call fragments.bin. This format is open source (https://github.com/fickludd/proteomicore), and using the Protocol Buffers compiler, parsers and writers can be autogenerated for several programming languages. We have also implemented export functionality in Fraggle to enable conversion from fragments.bin to regular tsv files.

TraML Manipulation by Tramler

Another tool, Tramler, was implemented to convert the combined assays into trimmed assays in the traML format, suitable for direct application to DIA data. Tramler is a fully featured and modular program that supports combining any number of operations to modify traML data into the required format. In this study, we used the “ms1-isotopes” operation to add traML targets for the top three precursor isotopes of each assay. The “trim” operation was used to limit fragments to 350−2000 m/z, to require a minimum of three and a maximum of six fragments, to only use, for example, b- and y-type fragments, and to exclude fragments with an m/z within the precursor window. The “decoy” operation was used to generate decoy assays by shuffling peptide sequences (apart from the C-terminal and modified N-termini) and changing fragment m/z values according to the new decoy peptide. For a full list of the different Tramler operations, see the Supplemental Methods.
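The trimming rules above can be illustrated with a short Python sketch. This is not Tramler's code (Tramler is written in Scala and operates on traML files); the `Fragment` class, the `trim_assay` function, and the ion label scheme are hypothetical names introduced only for illustration, and the defaults simply mirror the settings quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    ion: str          # hypothetical label, e.g. "y7" or "b3"
    mz: float
    intensity: float  # relative to the most intense matched fragment


def trim_assay(fragments, precursor_window, min_mz=350.0, max_mz=2000.0,
               min_fragments=3, max_fragments=6, ion_types=("b", "y")):
    """Keep the most intense allowed fragments, or return None if too few remain."""
    lo, hi = precursor_window
    kept = [f for f in fragments
            if f.ion[0] in ion_types          # allowed fragment types only
            and min_mz <= f.mz <= max_mz      # inside the fragment m/z range
            and not (lo <= f.mz <= hi)]       # outside the precursor window
    kept.sort(key=lambda f: f.intensity, reverse=True)
    kept = kept[:max_fragments]
    return kept if len(kept) >= min_fragments else None


fragments = [
    Fragment("y7", 800.4, 1.0), Fragment("y6", 701.3, 0.8),
    Fragment("b3", 300.1, 0.9), Fragment("y5", 612.3, 0.5),
    Fragment("b4", 413.2, 0.4), Fragment("y4", 510.0, 0.3),
    Fragment("y3", 420.2, 0.2), Fragment("b5", 560.3, 0.1),
    Fragment("y8", 900.5, 0.05),
]
trimmed = trim_assay(fragments, precursor_window=(500.0, 526.0))
```

In this toy input, b3 is dropped for falling below 350 m/z and y4 for lying inside the precursor window; the six most intense remaining fragments are kept.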

■

EXPERIMENTAL METHODS

We developed a new algorithm to extract assays from traditional shotgun MS identifications. Two simple steps are performed to achieve this. First, each MS injection is considered separately and all PSMs meeting some quality criterion are interpreted to collect the empirical fragment ion intensities. Retention times are normalized to the iRT scale30 using synthetic peptides (JPT Peptide Technologies) that were added to every injection sample at the described concentrations (Table S2) to provide anchor points across the retention time scale. Second, the empirical fragment ion intensities from all injections are combined so that only one assay exists per peptide charge state.
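As an illustration of retention time normalization against anchor peptides, the following Python sketch maps injection retention times to iRT by segmented linear interpolation between the median observed retention times of the anchors. The function names and toy values are hypothetical, and the sketch assumes that the anchor medians are distinct and increase with iRT; it is not taken from the Fraggle source.

```python
from bisect import bisect_right
from statistics import median


def build_irt_map(anchor_observations):
    """anchor_observations: {iRT value: [observed retention times]}.

    Returns anchor points (median observed RT, iRT) sorted by retention time.
    """
    points = sorted((median(rts), irt)
                    for irt, rts in anchor_observations.items() if rts)
    if len(points) < 2:
        raise ValueError("need at least two observed anchor peptides")
    return points


def rt_to_irt(rt, points):
    """Interpolate linearly within the segment that contains rt.

    Retention times outside the anchor range are extrapolated from the
    first or last segment.
    """
    rts = [p[0] for p in points]
    i = min(max(bisect_right(rts, rt), 1), len(points) - 1)
    (rt0, irt0), (rt1, irt1) = points[i - 1], points[i]
    return irt0 + (rt - rt0) * (irt1 - irt0) / (rt1 - rt0)


# Toy anchors: three peptides with known iRT and observed RTs in minutes.
points = build_irt_map({0.0: [9.0, 10.0, 11.0], 50.0: [20.0], 100.0: [39.0, 41.0]})
```

Because each segment has its own slope, a gradient that is steeper early and shallower late is followed more closely than by a single global regression line.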

Computation of Multilevel q-Values by Franklin

Computation of multilevel q-values was implemented in a Python program called Franklin backed by the fast q-value40 and posterior error probability (PEP) calculation software Qvality.41 In essence, Franklin reads a tsv file containing protein, peptide, and assay information, the corresponding assay scores, and an optional assay identifier (id). This input file includes decoy proteins. Following computations, an output file will be generated containing for each input assay a row with the corresponding q-value and PEP at each level. Computation of the multilevel q-values is achieved by constructing a tree from the input table, where the root has one child per protein in the input, each protein node one child per peptide, and each peptide node one child per assay. The assay nodes contain the assay score and the optional id attributes. Scores for nodes other than the assay nodes are then computed

Fragment Spectrum Interpretation by Fraggle

A straightforward algorithm was used to interpret spectra based on PSMs. This algorithm is given a set of fragment ion types to consider, for example, y- and b-type ions of charge 1 or 2 and longer than one amino acid. The m/z of each fragment ion fulfilling these criteria is computed for each PSM, after which centroided spectral peaks and fragment ions are merged by a sorted intersection. Matched intensities are normalized by the most intense matched fragment ion. We refer to the empirical information derived from a PSM as an observation. For the interpretation step to be complete, we further require the presence of some retention time normalization peptides among the collected observations. These peptides and their iRT values are given as algorithm input and are used to create a



grade modified trypsin (Promega) was added to each of the gel and in-solution samples and incubated at 37 °C for 20 h. Peptides were subsequently reduced with 500 mM tris(2-carboxyethyl)phosphine (Sigma-Aldrich) for 30 min at 37 °C and alkylated with 500 mM 2-iodoacetamide (AppliChem) for 30 min at room temperature in the dark.

as the maximal score of all the node's children. The tree is split into decoy and target proteins, and for each tree depth, the node names are written into a decoy or target file along with their score. For each tree depth, Qvality is run using the appropriate decoy and target files. Results from Qvality are read and merged back into the target nodes by sorting them by score. Following this, the tree is tabularized and written to the result file.

Several strategies are supported for handling nonproteotypic peptides (annotated by a semicolon-separated protein group entry in the protein column). By default, protein groups containing only target or only decoy proteins will be left as is, and in mixed protein groups, decoys will be removed. For comparison with Mayu, we have also implemented the “simplifyProteinGroups” mode, which performs the same trimming as the default but only keeps the first protein entry after this. This mode of operation is enforced by Mayu but should not be used for projects aiming to study individual proteins, as this procedure assigns ambiguous peptides to one of the possible proteins at random, potentially skewing protein FDRs. Last, the “proteotypic” mode will remove any entry containing more than one protein after trimming. Note that, for the results in this paper, “simplifyProteinGroups” has been used exclusively so as not to introduce any changes in protein grouping compared to baseline assay generation, limiting the scope of this study. The “default” and “proteotypic” modes of Franklin are presented as a side note for use in future experiments by the authors or others. One should also note that the removal of decoy proteins from mixed groups could potentially skew the FDR calculations if such mixed groups are not representative of the general decoy population. For future studies, Franklin will be modified to split mixed protein groups into two assay results instead of deleting the decoys to avoid this issue.
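The multilevel idea can be sketched in Python as follows. This is a simplified illustration and not Franklin's implementation: Franklin delegates q-value and PEP estimation to Qvality, whereas the sketch below substitutes a plain target-decoy count, ignores score ties, and skips protein group handling. All function names are hypothetical.

```python
def propagate(rows):
    """rows: (protein, peptide, assay, score, is_decoy) tuples.

    Each peptide and protein node receives the maximal score of its children.
    """
    best = {"assay": {}, "peptide": {}, "protein": {}}
    decoy = {}
    for protein, peptide, assay, score, is_decoy in rows:
        for level, key in (("assay", (protein, peptide, assay)),
                           ("peptide", (protein, peptide)),
                           ("protein", (protein,))):
            best[level][key] = max(score, best[level].get(key, float("-inf")))
            decoy[key] = is_decoy
    return best, decoy


def q_values(scores, decoy):
    """Simple target-decoy q-values (a stand-in for Qvality)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    targets = decoys = 0
    fdrs = []
    for key in ranked:
        if decoy[key]:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    q, running = {}, float("inf")
    for key, fdr in zip(reversed(ranked), reversed(fdrs)):
        running = min(running, fdr)   # enforce monotone q-values
        q[key] = running
    return q


def retained_assays(rows, cutoff=0.01):
    """Keep target assays passing the cutoff at assay, peptide, and protein level."""
    best, decoy = propagate(rows)
    q = {level: q_values(scores, decoy) for level, scores in best.items()}
    return [a for a in best["assay"]
            if not decoy[a]
            and q["assay"][a] <= cutoff
            and q["peptide"][a[:2]] <= cutoff
            and q["protein"][a[:1]] <= cutoff]


rows = [
    ("P1", "pepA", "a1", 5.0, False), ("P1", "pepA", "a2", 3.0, False),
    ("P2", "pepB", "b1", 4.0, False), ("DECOY_P3", "pepC", "c1", 2.0, True),
    ("P4", "pepD", "d1", 1.0, False),
]
kept = retained_assays(rows)
```

In this toy table the low-scoring target assay d1 ranks below a decoy at every level and is rejected, while a2 is retained alongside a1 because its peptide and protein nodes inherit the higher score.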

Sample Preparation: Yeast Standard

Each yeast experiment used 1 μg of peptides from the MS Compatible Yeast Protein Extract Digest (V7461; Promega).

Liquid Chromatography Mass Spectrometry

All peptide analyses were performed on a Q-Exactive Plus mass spectrometer (Thermo Scientific) coupled with an EASY-nLC 1000 ultrahigh performance liquid chromatography system (Thermo Scientific). Shotgun MS data of the fractionated mouse tissues were downloaded from the PRIDE repository PXD002896,35 and the yeast standard samples were measured using the shotgun MS method according to Malmström et al.35 In summary, 1 μg of peptides was separated by reverse-phase chromatography on a 120 min linear gradient using 5−35% acetonitrile in aqueous 0.1% formic acid at a flow rate of 300 nL/min. The 15 most intense signals with charge states of 2−5 were isolated from full MS scans and fragmented using higher energy collision-induced dissociation (HCD) with a normalized collision energy set to 30.

For data-independent acquisition (DIA), 1 μg of peptides was separated using an EASY-Spray column (Thermo Scientific; 75 μm ID, 25 cm length, 45 °C). Column equilibration and sample loading were performed at 600 bar. A 120 min linear gradient from 5 to 35% acetonitrile in aqueous 0.1% formic acid was run at a flow rate of 300 nL/min. One full MS scan (resolution of 70,000 at 200 m/z; mass range from 400 to 1200 m/z) was followed by 32 MS/MS full fragmentation scans (resolution of 35,000 at 200 m/z) using an isolation window of 26 m/z (0.5 m/z overlap between consecutive windows). Precursor ions were fragmented using HCD at a normalized collision energy of 30. The automatic gain control (AGC) was set to 1 × 10^6 for both MS and MS/MS with ion accumulation times of 100 ms (MS) and 120 ms (MS/MS). All samples injected contained a peptide standard for retention time calibration as previously described in Malmström et al.35 The obtained raw files were converted to gzipped and Numpressed43 mzML using the tool msconvert from the ProteoWizard v3.0.5930 suite.44

Data Depositions and Software Implementation

The raw data, search results, and assay libraries are deposited at PeptideAtlas42 (data set identifier: PASS00905). Fraggle and Tramler are written in Scala 2.10.0 and are available open source under the Apache 2 license at https://github.com/fickludd/eviltools and utilize the proteomicore libraries available at https://github.com/fickludd/proteomicore. All programs and libraries are built and dependency-managed using Maven. Franklin is implemented in Python and available at https://github.com/fickludd/eviltools. For details on program execution, we refer to the Supplemental Methods or the GitHub README files.

Sample Preparation: Mouse Tissues

Mouse tissues were prepared according to Malmström et al.31 The kidney, liver, heart, and spleen from one animal were homogenized in 1 mL of PBS using a Polytron PT-2100 Homogenizer (Kinematica AG). Then, 250 μL of homogenate of each sample was added to 100 mg of 0.1 mm glass silica beads (Biospec Products) and processed with a FastPrep-96 system (MP Biomedicals). The beads were removed by centrifugation at 2000g for 10 min. Protein concentrations were quantified using the Pierce BCA protein assay kit (Thermo Scientific). Then, 100 μg of protein per organ sample destined for shotgun MS analysis and assay library generation was fractionated by SDS-PAGE and in-gel digested. Ten micrograms of protein per organ sample destined for DIA-MS analysis was digested in solution. Then, 2.5 μg of sequencing-

Shotgun MS and DIA Analysis

Shotgun MS data were searched using either a TPP-based45 or DeMix36 pipeline. TPP v4.7 POLAR VORTEX rev 0, build 201405161127 (linux), was used with X!Tandem Jackhammer TPP (2013.06.15.1 - LabKey, Insilicos, ISB) and Comet46 version 2013.02 rev. 2. For the DeMix pipeline, Dinosaur 1.1.047 was used for feature detection; a custom script was used for DeMixing MS/MS spectra, and a custom MS-GF+48 version (JT branch, v9949-2; 18/09/2015) was used for searching. The UniProt mouse reference proteome (UP000000589, Oct-2015) was used as the database for mouse data and the UniProt yeast reference proteome (UP000002311, Mar-2014) for yeast data. All search engines set carbamidomethylation of cysteines as fixed and oxidation of methionines as variable modifications. X!Tandem additionally allowed variable acetylation of N-termini as well as S-carbamoylmethyl-cysteine




Figure 1. (a) Conceptual improvements to a mass spectrometry assay generation workflow. Changes are marked in red, and in total, five conceptual changes are proposed. (b) The spectrum search tool SpectraST is replaced with the new tool Fraggle specifically made for assay generation. (c) Reapplication of the original iRT scale and mapping. (d) Usage of ppm thresholds for Orbitrap mass analyzers to adjust for the m/z-dependent resolution. (e) Internal fragment ions are allowed as transitions. (f) Novel multilevel FDR calculations.

Figure 2. Mass spectrometry assay library sizes and results. (a) The 10 underlying gel-fractionated shotgun MS injections show input sizes between 70,000 and 150,000 PSMs. (b) The suggested improvements barely change the generated assay library sizes. (c) Relative increase in the number of quantified peptide targets in DIA data analysis using the stepwise improved assay libraries for the four mouse organ samples (H: heart; K: kidney; L: liver; S: spleen). (d) Absolute increase in the number of quantified peptide targets in DIA data analysis of the mouse organ samples.

software for targeted analysis of DIA data similar to OpenSWATH but using different backing algorithms. PyProphet DIANA-edition was run with the LDA classifier in five cross-evaluation sets, train_fraction = 1.0, nonparametric null model, and Storey FDR calculations. All the configuration files used are available on https://github.com/fickludd/eviltools. Computations were performed on a 12-core computation server running Ubuntu Server 14.04.2 LTS.

cyclization of N-terminal cysteines and pyro-glutamic acid formation from glutamic acid and glutamine. X!Tandem and Comet used a 20 ppm precursor threshold; X!Tandem used a 50 ppm fragment threshold, whereas Comet used the default binning of fragment_bin_tol = 1.005 and fragment_bin_offset = 0.4. The initial MS-GF+ search used a 10 ppm precursor threshold, which was automatically lowered to approximately 4 ppm in the demixed search after software m/z calibration. For DIA targeted analysis, DIANA21 v2.0.0 was used with a 20 ppm uniform extraction window. DIANA is an open source




Figure 3. Details of the effects of assay generation improvements. (a) Successfully analyzed peptide targets in liver DIA data show smaller iRT deviations to their corresponding assay library entries when iRT calibration was adjusted for nonlinear chromatographic gradients. (b) Uniformly distributed erroneous fragment assignments are avoided by an interpretation cutoff at 10 ppm during assay library generation. (c) Number of transitions by fragment type in the liver assay library allowing internal (m) fragments. (d) Density of ion intensities in the liver assay library for y-type, b-type, and m fragments relative to the most intense fragment.

■

RESULTS AND DISCUSSION

assay level using the best underlying score for the higher levels. Assays are only retained if they pass the FDR cutoff on the assay level and map to both a peptide and a protein that pass the FDR cutoff at their respective levels (Figure 1f).

To evaluate the effects of our proposed improvements, the assay generation workflow proposed by Schubert et al., here referred to as SpectraST, was changed stepwise in the order presented above. Importantly, the assay quality was benchmarked by targeted analysis of purpose-acquired DIA data evaluated by the software tool DIANA.21 This is necessary as most of the introduced changes do not have any directly quantifiable effect on the size of the assay library. As input, we used Q-Orbitrap shotgun MS data previously acquired from fractionated mouse heart, spleen, kidney, and liver tissues.35 The selected organs have different molecular complexities resulting in between 70,000 and 150,000 peptide spectrum matches (PSMs) from the total of 10 fractions per organ at a PSM level FDR of 1% (Figure 2a). For the generated assay libraries to be evaluated, unfractionated samples from the same tissue homogenates were prepared and subjected to DIA analysis.

As predicted, the introduced changes to the workflow have a marginal effect on the size of the assay library, where the only notable increase occurs upon employing the multilevel FDR thresholds (Figure 2b). However, at the DIA result level, we observe distinct improvements from the introduced changes: the final improved workflow yields between 14 and 36% more quantified assays in the four organs compared to that of the standard SpectraST workflow (Figure 2c and d). To justify our sequential addition of improvement steps, we compared all feasible assay generation workflow combinations to test for redundancy (see Supplemental Figure S1). Overall, only the combination of all five changes yields the maximal increase in

Implementation of Five Conceptual Changes to Mass Spectrometry Assay Library Generation

Improving algorithms to better utilize available shotgun MS data for the construction of the best assay library is of importance for increasing DIA performance. To ensure the maximal usability of shotgun MS data, we propose five major changes to the assay library generation pipeline as outlined in Figure 1a. First, we introduce Fraggle, a new tool to generate assays from shotgun MS spectra. In contrast to preceding tools, Fraggle omits extensive modulation of the raw data prior to spectrum interpretation and combination and hence provides a more direct approach for assay generation26,27 (Figure 1b). Second, the introduction of linear mapping and a custom iRT definition when normalizing retention times defeats the main benefit of a globally uniform standardized retention time scale. Linear mapping is not accurate enough to cover the complex retention time shifts introduced by, for example, different column loads.28,29 We therefore suggest applying the original iRT definition for nonlinear gradients30 (Figure 1c). Third, because of the increased noise in the low m/z range and the performance characteristics of Orbitraps,31,32 it is preferable to use a ppm-based cutoff for fragment−peak matching for these instruments (Figure 1d). Fourth, internal fragment ions seem to occur frequently among the most intense signals from fragment spectra. We suggest that these internal fragments should be included in assay generation (Figure 1e). Fifth and last, we recognize the need for controlled protein false discovery rate (FDR)33 but believe FDR filtering could be refined over the elegant but somewhat harsh Mayu approach.34 We propose FDR calculations on the protein, peptide, and




choose from results in improved assay quality, which seems to have an effect on the DIA analysis as a whole, giving the semisupervised machine learning of DIANA better possibilities to differentiate between correct and incorrect peaks, and thus, slightly more peaks can be permitted for the fixed FDR threshold. For the final step, using alternative FDR calculations, we did not expect major differences. The few assays that were missed by the harsher Mayu-derived FDR cutoff lead to an increased library size but only sometimes perform better in DIA (Figure 2c and d). In conclusion, the increased accuracy in hydrophobicity modeling, instrument mass accuracy modeling, and peptide ion fragmentation all contribute to increased assay quality.
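To make the ppm-based fragment matching concrete, the following Python sketch pairs theoretical fragment m/z values with centroided peaks in a single sorted pass and normalizes matched intensities to the most intense match. It is an illustration only and not Fraggle's code (Fraggle is written in Scala); the function name, the default of 10 ppm as a parameter, and the choice to keep the most intense peak when several fall inside the tolerance are assumptions.

```python
def match_fragments(theoretical, peaks, tol_ppm=10.0):
    """Match theoretical fragments to centroided peaks within a ppm tolerance.

    theoretical: list of (label, m/z); peaks: list of (m/z, intensity).
    Returns {label: intensity relative to the most intense matched fragment}.
    """
    theo = sorted(theoretical, key=lambda t: t[1])
    obs = sorted(peaks)
    matched = {}
    j = 0
    for label, mz in theo:
        tol = mz * tol_ppm * 1e-6          # absolute tolerance grows with m/z
        while j < len(obs) and obs[j][0] < mz - tol:
            j += 1                         # skip peaks below the window
        k, best = j, None
        while k < len(obs) and obs[k][0] <= mz + tol:
            if best is None or obs[k][1] > best:
                best = obs[k][1]           # assumption: keep the most intense peak
            k += 1
        if best is not None:
            matched[label] = best
    if not matched:
        return {}
    top = max(matched.values())
    return {label: inten / top for label, inten in matched.items()}


theoretical = [("y1", 175.1190), ("b2", 300.0), ("y2", 1000.0)]
peaks = [(175.1195, 50.0), (300.02, 200.0), (1000.008, 100.0), (1000.05, 500.0)]
observation = match_fragments(theoretical, peaks)
```

At 10 ppm the tolerance is about 0.003 m/z at 300 m/z and 0.01 m/z at 1000 m/z, so the peak at 300.02 is rejected for b2 while the peak at 1000.008 is accepted for y2. A fixed absolute tolerance wide enough to accept the latter would also let in more spurious low-m/z matches.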

recovered targets. However, there are steps that have a higher impact on the results than others. For example, the ppm-based mass-error threshold is mandatory to generate the top performing libraries, whereas the retention time mapping has a less pronounced effect (Figure S1a). Our interpretation of these results is that each of the five proposed changes yields a stepwise improvement over the previous version of the workflow, motivating the incorporation of each change in a final workflow.

Dissection of the Introduced Changes in Mass Spectrometry Assay Generation

To determine the reason for the increased number of quantified peptide targets in DIA results outlined above, we further examined the changes in the assay generation workflow. First, introduction of Fraggle improved assay quality compared to SpectraST, yielding an increase in quantified targets. SpectraST performs several preprocessing and filtering steps prior to consensus library creation.27 In contrast, Fraggle interprets spectra individually and combines them after successful assignment. A possible explanation for the observed improvements is that SpectraST was written at a time when low-resolution fragment spectra were the norm and spectral quality was too low to yield usable assays without modulation of the raw data. We argue that Fraggle is simpler and profits more directly from high-resolution fragment spectra. Second, upon introduction of the original iRT retention time mapping strategy for nonlinear gradients, the assay distribution shifts closer to an absolute iRT deviation of 0 (Figure 3a). This is because of the more accurate modeling of the hydrophobicity profile of the MS measurements using the segmented linear map instead of a global linear regression. The peptides used in this study were synthesized to match the original iRT sequences, but Fraggle supports any kind of custom RT peptide definition. Third, the matched fragment m/z deviations between measured and theoretical m/z values accurately represent a 10 ppm error margin, characteristic for Orbitrap instruments, while effectively removing a substantial amount of uniformly distributed erroneous fragment matches (Figure 3b). Fourth, the contribution of internal fragment ions was determined by investigating the presence of assays containing internal fragment ions in the assay library. On the Q-Orbitrap used, internal ions occur almost as often as b-type fragments (Figure 3c), although they have lower relative intensities and rarely constitute the most intense ion in an assay (Figure 3d). Even so, they fall among the six most intense fragments in 39.9% of the generated assays. Extracted elution profiles of internal fragment peak shapes are indistinguishable from the conventional fragments (Figure S2), which is expected if these fragments are correctly matched and originate from the same target peptide. Assays containing proline (47.3%) generated internal fragments slightly more frequently, as 49.8% of the generated internal fragments were from proline assays, and internal fragments from proline-containing assays also showed a slightly higher relative intensity of 23.8 ± 19.4% against 18.2 ± 17.8% at a p-value of
