Article
Dinosaur: a refined open source peptide MS feature detector
Johan Teleman, Aakash Chawade, Marianne Sandin, Fredrik Levander, and Johan Malmström
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00016 • Publication Date (Web): 25 May 2016
Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.
Journal of Proteome Research
Dinosaur: a refined open source peptide MS feature detector

Johan Teleman 1,2,*, Aakash Chawade 1, Marianne Sandin 1, Fredrik Levander 1,3,*,¤, Johan Malmström 2,¤

1) Dept. of Immunotechnology, Lund University, Sweden
2) Dept. of Clinical Sciences Lund, Lund University, Sweden
3) Bioinformatics Infrastructure for Life Sciences (BILS), Lund University, Sweden
¤) Equal contribution
*) Corresponding Authors: Johan Teleman, [email protected], +46 46 222 96 64; Fredrik Levander, [email protected], +46 46 222 38 35

Keywords: Proteomics, Mass spectrometry, Electro-spray ionization, Feature detection, Chimeric spectra, Algorithm, Software
ACS Paragon Plus Environment
ABSTRACT

In bottom-up mass spectrometry (MS) based proteomics, peptide isotopic and chromatographic traces (features) are frequently used for label-free quantification in data-dependent acquisition MS, but can also be used for improved identification of chimeric spectra or sample complexity characterization. Feature detection is difficult because of the high complexity of MS proteomics data from biological samples, which frequently causes features to intermingle. In addition, existing feature detection algorithms commonly suffer from compatibility issues, long computation times or poor performance on high-resolution data. Because of these limitations we developed a new tool, Dinosaur, with increased speed and versatility. Dinosaur has functionality to sample algorithm computations through quality control plots, which we call a plot trail. From evaluation of this plot trail we introduce several algorithmic improvements to further improve robustness and performance of Dinosaur, with detection of features for 98% of MS/MS identifications in a benchmark dataset, while no other algorithm tested in this study passed 96% feature detection. We finally used Dinosaur to re-implement a published workflow for peptide identification in chimeric spectra, increasing chimeric identification from 26% to 32% over the standard workflow. Dinosaur is operating system independent and is freely available as open source on https://github.com/fickludd/dinosaur.
INTRODUCTION

Mass spectrometry-based proteomics, fueled by seemingly ever-increasing instrument performance, has gained considerable traction as a high-throughput method for biomarker discovery and systems biology applications. New and flexible workflows have extended the range of mass spectrometry (MS) applications in life science research from biomarker discovery to complex interaction mapping1,2, whole-proteome assay libraries3,4, mRNA and protein dynamics5, or full structural characterization of a protein complex6. In bottom-up MS proteomics, mass spectrometry analysis of peptides, derived from intact proteins by proteolytic processing, allows measurement of thousands of isotope envelopes of individual peptide ions. These isotope envelopes appear as ion intensity patterns in the retention time and m/z dimensions and are often referred to as features. Accurate feature detection is an important step in many mass spectrometry-based proteomics workflows7–9, where the integrated or apex feature intensity is used to quantify identified peptides in shotgun mass spectrometry. This holds in particular for label-free10 and SILAC11 workflows. Accurately determined features are also used to increase the accuracy of estimated precursor masses, to determine chromatography performance and optimize chromatographic gradients12, and to determine the total number of detectable peptides by analyzing feature charge states, masses and retention times. Features can also be used to perform untargeted analysis of data-independent acquisition (DIA) experiments13. Lastly, recent work has shown that feature information can increase the identification rates of chimeric MS/MS spectra, providing an inexpensive way to increase the number of identified spectra and the quality of the data analysis14.
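To illustrate how a feature's isotope envelope relates to charge state and precursor mass, the following minimal sketch computes the expected m/z positions of the first isotope peaks and the neutral monoisotopic mass from an observed monoisotopic m/z. The function names are hypothetical, and the constants are standard physical values rather than anything taken from Dinosaur; using the 13C-12C mass difference as the isotope spacing is a common approximation for peptides.

```python
from typing import List

PROTON_MASS = 1.007276466    # Da, mass of a proton
ISOTOPE_SPACING = 1.0033548  # Da, 13C - 12C mass difference (approximation)


def isotope_mz(mono_mz: float, charge: int, n_isotopes: int = 4) -> List[float]:
    """Expected m/z of the first n_isotopes peaks of an isotope envelope.

    Consecutive isotope peaks of an ion with charge z are spaced
    ISOTOPE_SPACING / z apart on the m/z axis.
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return [mono_mz + k * ISOTOPE_SPACING / charge for k in range(n_isotopes)]


def neutral_mass(mono_mz: float, charge: int) -> float:
    """Neutral monoisotopic mass of a positively charged (protonated) ion."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (mono_mz - PROTON_MASS) * charge
```

For a doubly charged ion the isotope peaks lie roughly 0.5 m/z units apart, which is why the peak spacing within an envelope can be used to infer the charge state of a feature.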
Feature detection in proteomics and metabolomics mass spectrometry has historically received considerable attention. While Listgarten and Emili thoroughly summarize early work on the subject7, label-free quantification of shotgun proteomics experiments has resulted in several additional published feature detection tools, for example msInspect15, SuperHirn16, OpenMS17, centWave18, Hardklör19, and MZmine20. In 2008, MaxQuant introduced a graph model to improve feature detection in dense proteomics data derived from complex biological samples21. Later work has been driven by changing data quality; modern Fourier transform high-resolution instruments have reduced the need for spectral noise reduction and are not suited for m/z range binning, which were both prominent parts of the early algorithms8. Kalman filters have also been employed for feature detection22,23. The large number of available feature detection algorithms calls for strategies to evaluate the outcome of feature detection. A straightforward method for such evaluation is to ensure that the feature list returned by the feature detection algorithm contains the peaks that were confidently identified by data-dependent MS/MS spectra24.

A feature detection tool should ideally have several properties. First, the tool should detect and distinguish all features for the detectable peptide ions in an LC-MS experiment. Second, it should provide accurate measurements of feature retention times and monoisotopic masses in addition to providing accurate quantitative values. Third, it should preferably compute as fast as possible, be robust against irregularities in input data, and be flexible in terms of input data formats, output and computational environment. Finally, the tool should provide parameters for optimization and means for the users to evaluate the effects of any modified parameter. Although MS1 feature detection is a well-studied problem7–9,25, we believe further progress can be made towards these ideal properties. Problems with existing algorithms include operating system limitations, lack of support for standard file formats, slow algorithm execution, limited control output and closed or outdated source code. We have also found it difficult to understand how modification of algorithm parameters influences performance, which hinders adapting a given tool to new instruments or acquisition conditions.

In the work presented here we describe a new MS1 feature detection tool called Dinosaur, based on concepts adapted from the feature detection algorithm in the MaxQuant21 software, which is widely used for analysis of shotgun mass spectrometry data. Importantly, with Dinosaur we introduce a new strategy of creating a set of quality control plots, referred to as a plot trail, to support monitoring of computation performance. Based on extensive evaluation of the plot trail we introduce several modifications to the original feature detection algorithm to improve performance and robustness. The Dinosaur implementation is heavily optimized and parallelized, and handles data files from modern mass spectrometers in minutes on any major operating system. We evaluated the performance of Dinosaur by benchmarking it against four previously published feature detection algorithms on a dilution series of synthetic peptides spanning five orders of magnitude in a complex biological background26. Dinosaur achieved a level of linearity equal to that of the other tools, while matching features to a larger percentage of identified peptides. We also demonstrate the utility of Dinosaur by implementing it in a chimeric spectrum identification workflow14, increasing the number of detected unique peptides by 32% at 1% false discovery rate (FDR).
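The evaluation idea used here, checking what fraction of confidently identified MS/MS spectra can be matched to a detected feature, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `Feature` and `Identification` types and both function names are hypothetical. The default tolerances of 0.005 Da and 0.2 min are the thresholds stated in the Experimental Methods; applying the first to the m/z difference, requiring equal charge states, and extending the feature's retention time range by the second are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Feature:
    mz: float        # monoisotopic m/z of the feature
    charge: int
    rt_start: float  # start of the elution profile, in minutes
    rt_end: float    # end of the elution profile, in minutes


@dataclass(frozen=True)
class Identification:
    precursor_mz: float
    charge: int
    rt: float        # retention time of the MS/MS scan, in minutes


def matches(ident: Identification, feat: Feature,
            mz_tol: float = 0.005, rt_tol: float = 0.2) -> bool:
    """True if the identification falls inside the feature's m/z and RT window.

    Assumption: charge states must agree, and the RT window is the
    feature's elution range extended by rt_tol on both sides.
    """
    return (ident.charge == feat.charge
            and abs(ident.precursor_mz - feat.mz) <= mz_tol
            and feat.rt_start - rt_tol <= ident.rt <= feat.rt_end + rt_tol)


def coverage(idents: Iterable[Identification],
             feats: Iterable[Feature]) -> float:
    """Fraction of identifications matched by at least one feature."""
    idents, feats = list(idents), list(feats)
    if not idents:
        return 0.0
    hits = sum(any(matches(i, f) for f in feats) for i in idents)
    return hits / len(idents)
```

The brute-force comparison is quadratic in the number of identifications and features; for full LC-MS runs one would likely sort the features by m/z and use a binary search, which is omitted here for brevity.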
EXPERIMENTAL METHODS

Data acquisition

The synthetic peptide dilution series (PXD001091) and HeLa data sets (PXD000999) were downloaded from ProteomeXchange. Eight samples, each representative of a different sample type, were prepared in-lab using standard protocols and have been deposited to the ProteomeXchange Consortium27 via the PRIDE partner repository with the dataset identifier PXD003405. Acquisition was performed on a Q Exactive Plus mass spectrometer (Thermo Scientific) in top-15 mode, coupled to an EASY-nLC 1000 ultra-high pressure liquid chromatography system (Thermo Scientific). Linear gradients from 5% to 35% acetonitrile in aqueous 0.1% formic acid were run for 30, 60, 90 or 120 min at a flow rate of 300 nl/min. The 'purified protein' sample is a His-tagged Streptococcus pyogenes M-protein expressed in E. coli; 'synthetic peptides' refers to 20 synthetic peptides of 80% purity (Proteogenix); 'AP-MS' is an affinity purification experiment using His-tagged S. pyogenes M-protein as bait in mouse plasma; 'plasma' and 'depl. plasma' are non-depleted and depleted mouse plasma, respectively; 'bacteria' refers to S. pyogenes grown on blood agar; 'yeast' is MS-compatible yeast protein extract digest (Promega, V7461); and 'tissue lysate' is a complete mouse liver tissue lysate. Mouse-derived data are from Malmström et al.28. All raw data files were converted to gzipped and MS-Numpressed mzML29 using the command line version of the Proteowizard (3.0.5930) tool msconvert30.

MS data analysis

When analyzing the synthetic peptide dilution series, MaxQuant 1.3.0.5 and msInspect (build 599) settings were as previously published26. MaxQuant 1.5 has become available since we set up our analysis pipeline, but feature results from
v1.3 and v1.5 are very similar, and thus we have kept our results from v1.3. MaxQuant 1.1.0.25 was used with default settings to represent the original feature detection algorithm. Identification of MS/MS spectra was performed and combined within the Proteios Software Environment31, using X!Tandem TORNADO 2008.12.01.1 (www.thegpm.org/tandem) and Mascot 2.3.01 (www.matrixscience.com) with 7 ppm precursor tolerance and 0.5 Da fragment tolerance, using fixed carbamidomethylation of cysteines and variable oxidation of methionines, as described previously. Features were matched to MS/MS identifications passing a 1% peptide-spectrum-match level FDR using thresholds of 0.005 Da and 0.2 min. When mapping multiple identifications to a feature, we did not control for identification consistency, an effect with an estimated magnitude of