A Case Study and Methodology for OpenSWATH Parameter Optimization Using the ProCan90 Data Set and 45,810 Computational Analysis Runs

J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.8b00709 • Publication Date (Web): 17 Jan 2019


Journal of Proteome Research

Sean Peters, Peter G Hains, Natasha Lucas, Phillip J Robinson, and Brett Tully∗

1 ProCan, Children’s Medical Research Institute, Faculty of Medicine and Health, University of Sydney, Westmead, NSW 2145, Australia

2 Cell Signalling Unit, Children’s Medical Research Institute, The University of Sydney, Westmead, NSW 2145, Australia

E-mail: [email protected]

Abstract

In the current study, we show how ProCan90, a curated data set of HEK293 technical replicates, can be used to optimize the configuration options for algorithms in the OpenSWATH pipeline. Furthermore, we use this case study as a proof of concept for horizontal scaling of such a pipeline to allow 45,810 computational analysis runs of OpenSWATH to be completed within four and a half days on a budget of US$10,000. Through the use of Amazon Web Services (AWS), we have successfully processed each of the ProCan90 files with 506 combinations of input parameters. In total, the project consumed more than 340,000 core hours of compute and generated in excess of 26 TB of data.


ACS Paragon Plus Environment


Using the resulting data and a set of quantitative metrics, we show an analysis pathway that allows the calculation of two optimal parameter sets: one for a compute rich environment (where run time is not a constraint), and another for a compute poor environment (where run time is optimized). For the same input files and the compute rich parameter set, we show a 29.8% improvement in the number of quality protein (≥2 peptides) identifications found compared to the current OpenSWATH defaults, with negligible adverse effects on quantification reproducibility or drop in identification confidence, and a median run time of 75 min (103% increase). For the compute poor parameter set, we find a 55% improvement in the run time from the default parameter set, at the expense of a 3.4% decrease in the number of quality protein identifications, and an intensity CV decrease from 14.0% to 13.7%.

Keywords

proteomics, sensitivity analysis, big data, amazon web services, mass spectrometry, openswath, parameter optimization, procan, scalability, HEK293

Introduction

Contemporary mass spectrometers capture spectra with unprecedented sensitivity. However, the deliberately biased sampling and stochastic nature of the data-dependent acquisition (DDA) methodology leads to a different set of peptides being captured in each run of the feed sample. 1 This results in a challenge of reproducibility of peptide identification and has proved to be a barrier for pursuits that seek an unlabelled, quantitative approach applied across thousands of samples. However, significant improvements in MS technology such as cycle time, mass accuracy, and dynamic range, together with the development of the technique known as data-independent acquisition (DIA), allow an unbiased sampling of complex peptide mixtures. 2,3 In the case of Sciex instruments, this technique is referred to as SWATH-MS. 4 In contrast to DDA, which filters precursor ions based on intensity, SWATH-MS filters precursor ions based on the mass-to-charge ratio and aims to cover the full range of expected ions. SWATH-MS is envisaged as a technology of enduring importance, due to its ability to generate a wide and deep digital record of the physical sample’s fragment spectra that can be returned to over time and re-analyzed as software and algorithms improve. 5–7 However, because the data captured by SWATH-MS are comprehensive and multiplexed, algorithms previously designed for DDA are unsuitable, and specific DIA software pipelines are needed to handle data at scale. 3,8

The ACRF International Centre for the Proteome of Human Cancer (ProCan) program is a large scale proteomics program that will examine all major cancer types. To do this, a custom built facility housing six Sciex TripleTOF 6600 mass spectrometers has been established with the capacity to generate 2,000 SWATH-MS proteomes per month. The volume of data brings about an interesting set of challenges, such as the computational effort required to analyze, and re-analyze, tens of thousands of raw data files with a consistent, robust, and reproducible pipeline.

The goal of the current work is three-fold: a) curate a collection of technical replicates that can be used by the community to test new (and existing) algorithms; b) build a reproducible pipeline capable of analyzing tens of thousands of SWATH-MS files in a reasonable time frame and budget; and c) use the data generated to optimize the command-line parameters of OpenSWATH to match the ProCan program throughput.

Methods

The ProCan90 Data Set

For this case study, we curated a collection of HEK293 SWATH-MS raw data files generated as part of the routine operation of the ProCan experimental facility. These files represent technical replicates, each being an aliquot from the same pooled digest, with fifteen runs collected from each of the six Sciex TripleTOF 6600 mass spectrometers (henceforth referred to as M1-M6), giving a total of 90 raw data files. Due to inherent variability, each replicate returns slightly different proteome maps. These sources of variability may arise due to:

• Instrument variation, such as column condition, MS cleaning schedule and final tuning parameters, with all instruments being tuned to a minimum performance level.

• Algorithmic variation introduced primarily in the peak picking aspect of the software, where even very slight variations in the input data can sometimes lead to new or lost identifications.

• Statistical variation introduced due to the scoring and statistical filtering algorithms in the software, e.g. SRL sizes affecting the MScore (an identification confidence measure in OpenSWATH) cut off threshold by the FDR filtering.

During the selection of the ProCan90 files, several runs from M3 and M6 with lower total ion chromatograms (TIC) were included, to ensure that the data set contained variability and that parameter optimization procedures remained robust against this (the distribution of the scan file sizes can be seen in the Supplementary Material data-description notebook, and it correlates with TIC).

IDA Acquisition

An Eksigent nanoLC 425 HPLC (Sciex, Toronto, Canada) operating in microflow mode, coupled online to a 6600 TripleTOF (Sciex), was used for the analyses. The peptide digests (2 µg) were spiked with retention time standards, injected onto a C18 trap column (SGE TRAPCOL C18 G 300 µm x 10 mm) and desalted for 5 min at 10 µL/min with solvent A (0.1% [v/v] formic acid). The trap column was switched in-line with a reversed-phase capillary column (SGE C18 G 250 mm x 300 µm ID 3 µm 200 Å), maintained at a temperature of 40 °C. The flow rate was 5 µL/min. The gradient started at 2% solvent B (99.9% [v/v] acetonitrile, 0.1% [v/v] formic acid) and increased to 10% over 5 min. This was followed by an increase of solvent B to 25% over 60 min, then a further increase to 40% for 5 min. The column was washed with a 4 min linear gradient to 95% solvent B held for 5 min, followed by a 9 min column equilibration step with 98% solvent A.

The LC eluent was analyzed using the TripleTOF 6600 system equipped with a DuoSpray source and 50 µm internal diameter electrode and controlled by Analyst 1.7.1 software. The following parameters were used: 5500 V ion spray voltage; 25 nitrogen curtain gas; 100 °C TEM; 20 source gas 1; 20 source gas 2. The 90 min information dependent acquisition (IDA) consisted of a survey scan of 200 ms (TOF-MS) in the range 350-1250 m/z to collect the MS1 spectra. The top 40 precursor ions with charge states from +2 to +5 were selected for subsequent fragmentation with an accumulation time of 50 ms per MS/MS experiment, for a total cycle time of 2.3 s, and MS/MS spectra were acquired in the range 100-2000 m/z.

Spectral Reference Library Generation

SWATH spectral analysis requires an independently generated spectral reference library (SRL) produced in IDA mode. HEK293 cell lines were pooled and fractionated using high pH fractionation (Waters X-Bridge, C18 2.1 mm x 150 mm, 3.5 µm). A total of 15 fractions were analyzed using IDA MS and searched using ProteinPilot version 5.0 (Sciex) with the Paragon algorithm and the following parameters: Sample Type: Identification; Cys Alkylation: Iodoacetamide; Digestion: Trypsin; Instrument: TripleTOF 6600; Database: Uniprot Human (178,750 entries). Databases contained sequences for internal retention time calibration standards. Thorough ID and False Discovery Rate (FDR) Analysis were selected, and the FDR was set at 1%. The resulting data were imported into PeakView version 2.2 (Sciex) with MS/MS(ALL) SWATH MicroApp v 2.0.0.2003, a retention time protein was assigned, and a final SRL was exported for searching against the SWATH data.

Finally, this library was converted such that it could be used with the OpenSWATHWorkflow available in the mass spectrometry analysis software suite OpenMS. 9 This was done by adjusting column names, selecting at least three and at most six fragments for each peptide spectrum match (PSM), normalizing relative intensities such that the most intense fragment had a recorded intensity of 10,000 [Hannes Röst, private communication], and converting modifications from the three-letter format of ProteinPilot to the UniMod notation used in OpenSWATH.
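The fragment selection and intensity normalization steps of the library conversion can be sketched as follows. This is a minimal illustration, not the authors' conversion script: the function name and the input layout are hypothetical, and ranking fragments by intensity is an assumption, since the text states only that between three and six fragments were kept per PSM.

```python
def prepare_fragments(fragments, min_n=3, max_n=6, top_intensity=10_000.0):
    """Select and rescale the fragments of one PSM.

    fragments: list of (fragment_mz, intensity) tuples (hypothetical layout).
    Returns at most max_n fragments, rescaled so that the most intense one
    has intensity top_intensity, or None if fewer than min_n are usable.
    """
    usable = [(mz, inten) for mz, inten in fragments if inten > 0]
    if len(usable) < min_n:
        return None  # too few fragments: this PSM is left out of the library
    # Assumption: keep the most intense fragments (not stated in the text).
    kept = sorted(usable, key=lambda f: f[1], reverse=True)[:max_n]
    scale = top_intensity / kept[0][1]
    return [(mz, inten * scale) for mz, inten in kept]


if __name__ == "__main__":
    psm = [(175.1, 50.0), (262.2, 500.0), (375.2, 125.0), (488.3, 250.0)]
    for mz, inten in prepare_fragments(psm):
        print(f"{mz:8.1f} {inten:10.1f}")
```

A PSM with only two usable fragments returns None here, which mirrors the "at least three" requirement.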

SWATH Acquisition

For the SWATH acquisition, peptide spectra were acquired with the same LC method described for IDA acquisition, but using an altered MS method with 100 variable windows, as per Sciex technical notes. 10 The parameters were set as follows: MS1 lower m/z limit 350; upper m/z limit 1250; window overlap (Da) 1.0; CES was set at 5 for the smaller windows, then 8 for larger windows, and 10 for the largest windows, with an acquisition time of 150 ms. MS2 spectra were collected in the range of m/z 100 to 2000 for 30 ms in high resolution mode and the resulting total cycle time was 3.2 s.
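As a quick consistency check on the reported cycle times, the accumulation times can simply be summed. The sketch below assumes that the 150 ms acquisition time refers to the MS1 survey scan of the SWATH method; the function name is illustrative.

```python
def accumulation_ms(survey_ms, n_msms, msms_ms):
    """Summed accumulation time (ms) of one survey scan plus n MS/MS scans."""
    return survey_ms + n_msms * msms_ms


# IDA: 200 ms survey scan + top 40 precursors at 50 ms each
ida_ms = accumulation_ms(200, 40, 50)
# SWATH: 150 ms survey scan (assumed MS1) + 100 windows at 30 ms each
swath_ms = accumulation_ms(150, 100, 30)

print(f"IDA accumulation:   {ida_ms / 1000:.2f} s (reported cycle time 2.3 s)")
print(f"SWATH accumulation: {swath_ms / 1000:.2f} s (reported cycle time 3.2 s)")
```

The sums come to 2.20 s and 3.15 s. The small differences from the reported cycle times are presumably instrument overhead between scans, which the text does not itemize.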

SWATH Data Processing

SWATH data were processed using the OpenSWATHWorkflow, 8,11,12 where command line parameters were varied as outlined in subsequent sections. PyProphet 13,14 (Version 0.24.1) was used to combine the multitude of OpenSWATH scores into a single score using a semi-supervised Linear Discriminant Analysis technique. Using a target-decoy approach, PyProphet then calculates a q-value estimate for each peak group of each peptide query; within a peptide query we consider the final quantified result to be represented by the peak group with the lowest q-value (assigned peak group rank = 1) if, and only if, the q-value is less than 10⁻².

In the current analysis, we explicitly chose to run PyProphet in a sample-independent manner (rather than experiment-wide) as we believe this is the more likely operating mode when running a computational pipeline on tens of thousands of samples. Current tools are not yet suitable for the algorithmic, engineering, and hardware challenge of loading full cohorts at the volumes for which this proof of concept pipeline is aimed, which indicates an active area of research in the community. We accept that this is likely to compromise the number of identifications made, and may reduce inter-file consistency. 15

Furthermore, false discovery rates (FDR) were controlled at the peptide spectrum match (PSM) level only. While literature suggests that cohorts of this size may suffer from a large π0 (the fraction of targets that do not exist in the raw DIA data) and offers protein level FDR control as a partial solution, this study mitigates the challenge somewhat by the use of a sample-specific SRL and by noting that all runs are technical replicates of the same HEK293 digest. Furthermore, we considered TRIC 16 (msproteomicstools, 14,17 Version v.0.8.0) to be outside the scope of the current work as its primary purpose is to align runs in a cohort-based fashion. Finally, we note that the latest stable version of PyProphet (v2.0.1, November 2018) includes protein FDR and a new SQLite-based workflow that we anticipate introducing to our future pipeline; however, this release was not available at the time the proof of concept was completed and is thus out of scope for the current work.
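The acceptance rule described above (keep the rank 1 peak group of a peptide query only if its q-value is below 10⁻²) can be sketched as a simple filter. The dictionary keys used here (`transition_group_id`, `peak_group_rank`, `m_score`) are assumptions modeled on PyProphet-style tabular output, and the rows are invented for illustration.

```python
def select_quantified(peak_groups, q_cutoff=1e-2):
    """Keep only top-ranked peak groups whose q-value is below the cutoff.

    peak_groups: iterable of dicts with (assumed) keys
    'transition_group_id', 'peak_group_rank' and 'm_score' (the q-value).
    """
    return [
        pg for pg in peak_groups
        if pg["peak_group_rank"] == 1 and pg["m_score"] < q_cutoff
    ]


if __name__ == "__main__":
    rows = [
        {"transition_group_id": "PEPTIDEA_2", "peak_group_rank": 1, "m_score": 0.0004},
        {"transition_group_id": "PEPTIDEA_2", "peak_group_rank": 2, "m_score": 0.0030},
        {"transition_group_id": "PEPTIDEB_3", "peak_group_rank": 1, "m_score": 0.0400},
    ]
    for pg in select_quantified(rows):
        print(pg["transition_group_id"], pg["m_score"])
```

In this toy input only the first row survives: the second is not rank 1, and the third fails the q-value cutoff.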

Analysis with Default Parameters

In order to build the baseline for parameter optimization, we first ran OpenSWATH on the ProCan90 data with the default parameters. Looking at the identification of quality proteins (those with two or more peptides), we can get a high-level view of the consistency between the samples. In Figure 1, we show three approaches to comparing identifications: quality proteins detected in all injections for each machine (lower left), quality proteins detected in all injections for all machines (diagonal), and quality proteins detected in any injection for each machine (upper right). As we can see from the diagonal of this figure, M3 and M6 identify fewer quality proteins and thus set the lower bound on identification, with each being an almost complete subset of any of the other machines. In contrast, M1 and M4 performed the best in terms of protein identification across the time period of data collection.


Figure 1: This figure gives an overview of how protein identifications differ from machine to machine and can be dissected into three separate pieces.

• Lower left triangle (dark grey centres) shows the intersection of quality proteins comparing machine-by-machine, where a protein is only included if it exists in all runs for that machine.

• Diagonal (white centres) shows the intersection of quality proteins comparing each machine (where a protein is only included if it exists in all runs for that machine) with the set of proteins that exist in all runs on all machines (2188 proteins in total).

• Upper right triangle (light grey centres) shows the intersection of quality proteins comparing machine-by-machine, where a protein is included if it exists in at least 10% of any of the runs for that machine.
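The set logic behind Figure 1 can be sketched in a few lines. This is an illustrative reconstruction rather than the notebook code: the function names, the input layout and the toy protein identifiers are hypothetical.

```python
from collections import Counter


def quality_proteins(peptide_hits, min_peptides=2):
    """Proteins supported by at least min_peptides distinct peptides in one run.

    peptide_hits: iterable of (protein, peptide) pairs for a single run.
    """
    peptides = {}
    for protein, peptide in peptide_hits:
        peptides.setdefault(protein, set()).add(peptide)
    return {p for p, peps in peptides.items() if len(peps) >= min_peptides}


def in_all_runs(runs):
    """Proteins present in every run (lower left triangle and diagonal)."""
    return set.intersection(*runs) if runs else set()


def in_fraction_of_runs(runs, min_fraction):
    """Proteins present in at least min_fraction of the runs (upper right)."""
    counts = Counter(p for run in runs for p in run)
    return {p for p, c in counts.items() if c >= min_fraction * len(runs)}


if __name__ == "__main__":
    # Toy data: each machine has a list of per-run quality protein sets.
    m1 = [{"P1", "P2", "P3"}, {"P1", "P2", "P4"}]
    m3 = [{"P1", "P2"}, {"P1"}]
    print(sorted(in_all_runs(m1) & in_all_runs(m3)))           # ['P1']
    print(sorted(in_fraction_of_runs(m1, 0.1) & in_all_runs(m3)))  # ['P1']
```

Each cell of the figure is then the size of the intersection between two such per-machine sets.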


Parameter Selection

There are more than fifty parameters that can be altered to control the OpenSWATHWorkflow; in collaboration with the original authors (private communication), we identified seven parameters as being important for optimization (see Table 1). However, such a large parameter space is prohibitive to a brute force, or exhaustive, search. We overcame this through a quasi-Monte Carlo approach called Sobol Choice Generation. 18 Through a process of hierarchical refinement, this method produces a pseudo-random and low-discrepancy sampling of the parameter space, avoiding the potential clustering found in standard Monte Carlo sampling. Working on an assumption of an average run time of one hour on an 8-core CPU, the expected AWS spot price (see below), and a budget of US$10,000, we concluded an upper limit of 510 parameter choices.

The parameter sets were selected using the SALib Python library. 20 In general terms, Sobol Choice Generation selects continuous values, and the output of SALib was converted to integers for the categorical parameters; in doing so, some parameter sets were duplicated, and the duplicates were omitted from further analysis. Code for generating the selected parameters is included in the Supplementary Material Jupyter notebook analysis report.
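To illustrate low-discrepancy sampling with categorical conversion and de-duplication, the sketch below uses a Halton sequence as a dependency-free stand-in. It is not the authors' method: the study used Sobol sampling via SALib, and the code actually used is in the Supplementary Material. The ranges follow Table 1; the function and variable names are hypothetical.

```python
# Halton sequence used here as a simple stand-in for Sobol sampling
# (assumption: the study itself used SALib's Sobol generator, not this code).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19]

# Lists are categorical options; tuples are continuous (low, high) ranges.
SPACE = [
    ("OptMS1", [False, True]),
    ("OptEMG", [False, True]),
    ("OptRTWin", (300.0, 900.0)),   # seconds
    ("OptMZWin", (20.0, 60.0)),     # ppm
    ("OptMZFunc", ["none", "quadratic_regression_delta_ppm"]),
    ("OptRTNorm", ["linear", "lowess"]),
    ("OptBack", ["none", "original", "exact"]),
    ("OptQual", (-2.0, 0.0)),
]


def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer, in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def sample_parameter_sets(n):
    """Draw up to n distinct parameter sets; duplicates are dropped."""
    names = [name for name, _ in SPACE]
    seen, parameter_sets = set(), []
    for i in range(1, n + 1):
        values = []
        for (_, spec), base in zip(SPACE, PRIMES):
            u = radical_inverse(i, base)
            if isinstance(spec, list):
                # Map the continuous draw onto a categorical option.
                values.append(spec[min(int(u * len(spec)), len(spec) - 1)])
            else:
                low, high = spec
                values.append(round(low + u * (high - low), 2))
        key = tuple(values)
        if key not in seen:
            seen.add(key)
            parameter_sets.append(dict(zip(names, values)))
    return parameter_sets


if __name__ == "__main__":
    sets = sample_parameter_sets(510)
    print(len(sets), "distinct parameter sets")
    print(sets[0])
```

Mapping a continuous draw onto a small number of categories is the step that can create duplicate parameter sets, which is why the de-duplication pass is included.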

Amazon Web Services (AWS) Pipeline

Most major cloud providers (AWS, Google Cloud Platform, Microsoft Azure) enable users to construct a workflow as a directed graph of steps, each involving a Docker 21 container and a collection of inputs and outputs. Specifically, in this case study, we elected to use the AWS Batch service, running a cluster of 400 on-demand m4.2xlarge (8 virtual CPU, 32 GB RAM) spot instances, each backed by a 250 GB EBS volume. The Australian region spot price ranged between US$0.10-US$0.11 per hour over the course of this study. The pipeline was composed of two array-job descriptions, with the second being dependent on the first:


Table 1: Parameter choices for OpenSWATH optimization

use_ms1_traces (OptMS1)
Type: boolean. Options: True, False. Default: False.
Description: Use MS1 trace, in addition to MS2, when calculating scores.

use_elution_model_score (OptEMG)
Type: boolean. Options: True, False. Default: False.
Description: Enables a score based on the similarity of the detected peak to an Exponentially Modified Gaussian.

rt_extraction_window (OptRTWin)
Type: float. Options: (300, 900). Default: 600.
Description: Size (in seconds) of the RT window. Larger means more data, more peak groups, more noise, but also more robust against a failing RT normalization.

mz_extraction_window (OptMZWin)
Type: float. Options: (20, 60) ppm. Default: 0.05 Da.
Description: Size (in Dalton, or ppm if the ppm flag is passed) of the m/z window.

mz_correction_function (OptMZFunc)
Type: categorical. Options: none, quadratic_regression_delta_ppm. Default: none.
Description: Use the iRT peptides to correct the expected PSM mass locations.

RTNormalization:alignmentMethod (OptRTNorm)
Type: categorical. Options: linear, lowess. Default: linear.
Description: Select the method for aligning retention times.

Scoring:TransitionGroupPicker:background_subtraction (OptBack)
Type: categorical. Options: none, original, exact. Default: none.
Description: Applies background subtraction to the peak when quantifying area under the curve. This was found to be important when quantifying dilution series. 19

Scoring:TransitionGroupPicker:minimal_quality (OptQual)
Type: float. Options: (0, -2). Default: -1.5.
Description: Peak groups are given quality scores centered around 0; values above zero are generally good and those less than -1 or -2 are generally bad. Having a high threshold will filter many potential peaks.


• PipelineSwathScoresArray, which completes the OpenSWATHWorkflow using the cmriprocan/openms:1.0.4 Docker image, 12 where the array size is 90 (one entry for each ProCan90 file). Each job produces a 500-700 MB compressed tsv result file per input file, per set of input parameters and, depending on the parameters used, consumed 30-300 min using the full allocation of an m4.2xlarge instance (8 cores, 31 GB RAM).

• PipelineIndividualFDRArray, which has an N_TO_N AWS Batch dependency on PipelineSwathScoresArray. This runs PyProphet using the cmriprocan/pyprophet-py2:1.0.3 Docker image. 14 It produces a 5-10 MB HDF5 file using 10-20 min on a quarter of the allocation of an m4.2xlarge instance (2 cores, 7.1 GB RAM). The final HDF5 file results from filtering the PyProphet TSV file based on a q-value of 10⁻² and a peak group rank of 1 (other intermediate files are not kept) and is in a format appropriate for direct loading by pandas.

All Python scripts used for marshalling the AWS Batch pipeline are included as Supplementary Material in the aws scripts directory.
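The two-stage array-job structure can be sketched as a pair of request dictionaries in the shape accepted by the AWS Batch `submit_job` call (for example through `boto3.client("batch").submit_job(**request)`). This is not the authors' marshalling code: the queue name and job definition names are hypothetical, and nothing is submitted here.

```python
def scores_job_request(queue, n_files=90):
    """Array job with one child per ProCan90 file (OpenSWATHWorkflow stage)."""
    return {
        "jobName": "PipelineSwathScoresArray",
        "jobQueue": queue,
        "jobDefinition": "openswath-scores",  # hypothetical job definition
        "arrayProperties": {"size": n_files},
    }


def fdr_job_request(queue, scores_job_id, n_files=90):
    """Array job whose child i waits only for child i of the scores job."""
    return {
        "jobName": "PipelineIndividualFDRArray",
        "jobQueue": queue,
        "jobDefinition": "pyprophet-fdr",  # hypothetical job definition
        "arrayProperties": {"size": n_files},
        "dependsOn": [{"jobId": scores_job_id, "type": "N_TO_N"}],
    }


if __name__ == "__main__":
    first = scores_job_request("procan-spot-queue")  # hypothetical queue name
    # In a real run the job id would come from the submit_job response.
    second = fdr_job_request("procan-spot-queue", scores_job_id="example-job-id")
    print(first)
    print(second)
```

With an N_TO_N dependency, the PyProphet step for a given file can start as soon as the OpenSWATH step for that same file finishes, rather than waiting for all 90 files.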

Optimization Metrics and Algorithm

Sensitivity analysis can be defined as the study of how outputs are influenced by their inputs. However, there is a subtle but important difference between parameter sensitivity analysis and parameter optimization. In the former, we are interested in how small changes in parameters affect the outcome of a model, whereas the latter is concerned with selecting a set of parameters that optimizes a cost function. 22 Optimization of configuration parameters in the context of computational proteomics requires a balance of quantitative metrics and qualitative importance. For instance, it is possible to optimize the number of peptides identified, but this may come at the cost of increased variation in the quantification or a substantial increase in the computational effort required. To achieve this balance, we take the approach of a simple weighted rank order.


To quantify the success of a given parameter set, we calculated fourteen scoring metrics (see Table 2) that are then assigned a rank order across all parameter sets. Depending on the optimization scenario, the scoring metrics are given a weighting, and a weighted sum across all scoring metrics is used to calculate that parameter set's final rank. In doing so, we can change the contribution of each score, and thus the cost function, depending on the goals of the experiment. Further details of how each scoring metric is calculated can be seen in the Supplementary Material Jupyter notebook analysis report. To demonstrate the utility of this approach, we define two metric weightings: a Compute Rich workload, where computational effort is de-prioritized compared to the number and accuracy of identifications, and a Compute Poor workload, where efficient run time is important. For score metrics (like run time) where a low ranking is preferred, the weighting is defined as negative. Further details on the specific weightings chosen can be seen in the Supplementary Material Jupyter notebook analysis report.
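A weighted rank order of this kind can be sketched as follows. The metric values and weights below are toy numbers chosen for illustration only (the study's weightings are in the Supplementary Material), the function names are hypothetical, and the handling of ties (dense ranking) is an assumption.

```python
def rank_ascending(values):
    """Dense rank: 1 for the smallest value; equal values share a rank."""
    lookup = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [lookup[v] for v in values]


def weighted_scores(metrics, weights):
    """Weighted sum of per-metric ranks for each parameter set.

    metrics: {metric_name: [value for each parameter set]}
    weights: {metric_name: weight}; negative where low values are preferred.
    """
    n_sets = len(next(iter(metrics.values())))
    totals = [0.0] * n_sets
    for name, weight in weights.items():
        for i, rank in enumerate(rank_ascending(metrics[name])):
            totals[i] += weight * rank
    return totals


def best_parameter_set(metrics, weights):
    """Index of the parameter set with the highest weighted score."""
    totals = weighted_scores(metrics, weights)
    return max(range(len(totals)), key=totals.__getitem__)


if __name__ == "__main__":
    # Toy values for three parameter sets (not taken from the study).
    metrics = {
        "quality_protein_id": [4000, 5200, 3900],
        "time": [0.6, 1.25, 0.3],  # hours
    }
    compute_rich = {"quality_protein_id": 1.0, "time": -0.1}
    compute_poor = {"quality_protein_id": 1.0, "time": -2.0}
    print(best_parameter_set(metrics, compute_rich))  # 1
    print(best_parameter_set(metrics, compute_poor))  # 2
```

Changing only the weight on run time moves the winner from the slow, high-identification set to the fast one, which is the behaviour the two scenarios rely on.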

Table 2: Weighting the parameter choices for OpenSWATH optimization

Identification Metrics

psm id all files: The number of PSM identified in all runs for the given parameter set.

protein id all files: The number of proteins with at least one PSM identified in all runs for the given parameter set; however, it will not necessarily be the same peptide in all runs.

quality protein id all files: The number of proteins, filtered by having at least 2 peptides identified in a given run, with at least one PSM identified in all runs for the given parameter set; however, it will not necessarily be the same two peptides in all runs.

psm id: Median across all runs for the given parameter set, of the number of PSM identified in each individual run.

protein id: Median across all runs for the given parameter set, of the number of proteins identified in each individual run.

quality protein id: Median across all runs for the given parameter set, of the number of proteins, filtered by having at least 2 peptides identified in a given run, identified in each individual run.

infrequent psm id: Median across all runs for the given parameter set, of a count of the number of peptides identified in 5% or less of runs.

infrequent protein id: Median across all runs for the given parameter set, of a count of the number of proteins identified in 5% or less of runs.

Quantification Metrics

log10 mlr norm intensity: Median across all runs for the given parameter set, of PSM log10 intensities, after MLR normalisation.

m score median: Median across all runs for the given parameter set, of the median m score of each individual run.

Reproducibility Metrics

cv mlr norm intensity: Median across all runs for the given parameter set, of the CV of PSM intensities, after MLR normalisation.

log10 mlr norm intensity dist 0: Median across all runs for the given parameter set, of the Euclidean distance between the PSM MLR normalised intensity of each run.

Misc Metrics

time: Runtime (hours).
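As an example of how one of these metrics might be computed, the sketch below takes the coefficient of variation (CV) of each PSM's intensity across runs and reports the median. It is an assumption-laden illustration: the exact definition used in the study is in the Supplementary Material notebook, the use of the sample standard deviation is a choice made here, and the names and values are hypothetical.

```python
from statistics import mean, median, stdev


def cv_percent(values):
    """Coefficient of variation in percent (sample standard deviation / mean)."""
    return 100.0 * stdev(values) / mean(values)


def median_cv(intensities_by_psm, min_runs=2):
    """Median CV over all PSMs observed in at least min_runs runs.

    intensities_by_psm: {psm_id: [normalised intensity in each run]}
    """
    cvs = [
        cv_percent(values)
        for values in intensities_by_psm.values()
        if len(values) >= min_runs
    ]
    return median(cvs)


if __name__ == "__main__":
    # Toy intensities (not from the study).
    toy = {
        "PEPTIDEA_2": [90.0, 100.0, 110.0],
        "PEPTIDEB_3": [200.0, 200.0, 200.0],
        "PEPTIDEC_2": [80.0, 100.0, 120.0],
        "PEPTIDED_2": [50.0],  # seen in one run only, so skipped
    }
    print(f"median CV: {median_cv(toy):.1f}%")
```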

Results

New Parameter Selection

Using the previously defined scoring metrics, we apply the Compute Rich and Compute Poor weightings such that each of the 506 parameter sets can be ranked for each scenario. Figures 2 and 3 show the distribution of total scores for each of the eight configuration parameters, described in Table 1, for Compute Rich and Compute Poor, respectively. In these figures, the continuous parameters are collected into 7 bins, the red markers show those parameter sets that score in the 95th percentile, and the count of red markers in each category is printed along the top of each sub-figure. For certain parameters, such as OptBack, the parameter choice can be easily made; however, in the event of no clear winner, such as OptMZWin, the original OpenSWATH defaults are retained.

Our final parameter selections are displayed in Table 3, while Figure 4 compares all scoring metrics across the explored parameter sets with the values obtained by running these final configurations through OpenSWATH. As we can see, the Compute Rich (red dashed line) parameters achieve the highest number of PSMs identified, proteins identified and quality proteins identified while not compromising the reproducibility or identification confidence: the CV between runs is maintained, the m score is improved, and the number of low frequency proteins is reduced. However, Compute Rich parameters perform worse with respect to the median of the Euclidean distance between each run (another measure of reproducibility), and as expected, the run time increases markedly (approximately a factor of 3 over the old defaults). Unsurprisingly, the Compute Poor parameters are outperformed in most categories by both the Old Defaults and the Compute Rich parameter sets; however, we achieve the desired improvement in run time, and importantly, the reproducibility metrics of aggregated CVs and Euclidean distances are improved.

Table 3: Final Parameter choices for OpenSWATH optimization

Parameter    Old Defaults    Compute Rich                      Compute Poor
OptMZWin     0.05 Da         50 ppm                            50 ppm
OptRTWin     600             400                               375
OptMZFunc    none            quadratic_regression_delta_ppm    none
OptRTNorm    lowess          lowess                            linear
OptBack      none            none                              none
OptQual      -1.5            -1.5                              -0.45
OptMS1       False           True                              False
OptEMG       False           False                             False
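For illustration, a parameter set such as Compute Rich could be turned into command-line arguments along the following lines. This is a sketch under assumptions: the option names follow Table 1, passing boolean options as bare flags is an assumption that may depend on the OpenMS version, and the input, library and output arguments are deliberately omitted.

```python
# Compute Rich values from Table 3, keyed by the option names in Table 1.
COMPUTE_RICH = {
    "mz_extraction_window": 50,
    "ppm": True,  # interpret the m/z window in ppm rather than Dalton
    "rt_extraction_window": 400,
    "mz_correction_function": "quadratic_regression_delta_ppm",
    "RTNormalization:alignmentMethod": "lowess",
    "Scoring:TransitionGroupPicker:background_subtraction": "none",
    "Scoring:TransitionGroupPicker:minimal_quality": -1.5,
    "use_ms1_traces": True,
    "use_elution_model_score": False,
}


def to_argv(params, executable="OpenSWATHWorkflow"):
    """Build an argument list; booleans become bare flags (assumption)."""
    argv = [executable]
    for name, value in params.items():
        if isinstance(value, bool):
            if value:
                argv.append(f"-{name}")
        else:
            argv.extend([f"-{name}", str(value)])
    return argv


if __name__ == "__main__":
    print(" ".join(to_argv(COMPUTE_RICH)))
```

Keeping each parameter set as a plain dictionary makes it straightforward to log exactly which configuration produced each of the result files.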

Figure 2: Parameter set ranks calculated with the Compute Rich weightings, plotted against the choice of each parameter. The red points and corresponding number values above each bin represent the count of parameter sets in this bin that achieved a rank above the 95th percentile.

Figure 3: Parameter set ranks calculated with the Compute Poor weightings, plotted against the choice of each parameter. The red points and corresponding number values above each bin represent the count of parameter sets in this bin that achieved a rank above the 95th percentile.

Figure 4: This figure shows how each of the parameter sets performed over each of the optimization metrics. Each plot measures the value of that particular metric along the y axis, with the over 500 explored parameter sets plotted in ascending order for that metric. Additional markers show how parameter sets of interest (Old Defaults, Compute Rich and Compute Poor) performed for each metric.

Validating Parameter Choices

Figure 5 gives a high level view of the three parameter sets (Old Defaults, Compute Rich, and Compute Poor) and their impact on identifications in all samples. We can see that the Compute Rich parameter set identifies many proteins and peptides that neither the Old Defaults nor the Compute Poor set discovers. It is interesting, however, how disjoint Compute Poor and Old Defaults are, particularly at the peptide and PSM level, with over 10% of the PSMs and peptides identified by Compute Poor not being picked up by the Old Defaults. We speculate that the smaller number of peaks allowed through by the OptQual threshold in the Compute Poor set interacts non-linearly with the statistical FDR controls.

To confirm these findings, we randomly selected a collection of raw chromatograms from the analysis of the first run of M3 over the ProCan90 data set, and displayed the peak boundaries and m score for each parameter set (see Supplementary Material Jupyter notebook manualchrom-validation). From this manual validation, we conclude that the identifications found by each parameter set are equally valid for the same FDR threshold; stated differently, the additional discoveries of Compute Rich are of similar quality to those discovered by the Old Defaults. This may indicate just how important a good parameter set is, not only for identification count and reproducibility, but also to detect specific types of peaks in the raw SWATH-MS data. Figure 6 displays the same results as Figure 5, but this time filtered by m score