Identification of Analytical Factors Affecting Complex Proteomics

Mar 9, 2016 - Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA ...... Smilde , A. K.; Jansen , J. J.; Hoefsloot , H. C.; Lamers , R. J...
0 downloads 0 Views 875KB Size
Subscriber access provided by Chicago State University

Article

Identification of analytical factors affecting complex proteomics profiles acquired in a factorial design study with ANOVA – simultaneous component analysis. Vikram Mitra, Natalia Govorukhina, Gooitzen Zwanenburg, Huub Hoefsloot, Inge Westra, Age Klaas Smilde, Theo H. Reijmers, Ate G. J. van der Zee, Frank Suits, Rainer Bischoff, and Péter Horvatovich Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.5b03483 • Publication Date (Web): 09 Mar 2016 Downloaded from http://pubs.acs.org on March 12, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Analytical Chemistry is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 31

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

1

Identification of analytical factors affecting complex proteomics profiles acquired

2

in a factorial design study with ANOVA – simultaneous component analysis.

3 Vikram Mitra1,6,7, Natalia Govorukhina1,7, Gooitzen Zwanenburg2, Huub Hoefsloot2, Inge

4 5

Westra1, Age Smilde2,6, Theo Reijmers3, Ate G. J. van der Zee4, Frank Suits5, Rainer Bischoff1,6,7,

6

Péter Horvatovich1,6,7*

7 8

1

Analytical Biochemistry, Department of Pharmacy, University of Groningen, A. Deusinglaan 1, 9713

9 10

AV Groningen, the Netherlands 2

Swammerdam Institute for Life Science, University of Amsterdam, the Netherlands, Science Park

11

904, 1098 XH Amsterdam, the Netherlands 3

12

Analytical BioSciences, Leiden University, Einsteinweg 55, 2333 CC Leiden, the Netherlands 4

13

Department of Gynecology, University Medical Centre Groningen, Hanzeplein 1, 9713 GZ

14 15

Groningen, The Netherlands 5

IBM T.J. Watson Research Centre, 1101 Kitchawan Road, Yorktown Heights, 10598 New York,

16 17 18 19

USA 6

Netherlands Bioinformatics Centre, Geert Grooteplein 28, 6525 GA Nijmegen, the Netherlands 7

Netherlands Proteomics Centre, Padualaan 8, 3584 CH Utrecht, the Netherlands

*Corresponding Author: [email protected]; Tel: +31-50-363-3341; fax: +31-50-363-7582.

20 21

Affiliations: Analytical Biochemistry, Department of Pharmacy, University of Groningen, A.

22

Deusinglaan 1, 9713 AV Groningen, The Netherlands

23

ACS Paragon Plus Environment

Analytical Chemistry

Page 2 of 31

Mitra / ASCA analysis for proteomics samples / 2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

24

ABSTRACT

25

Complex shotgun proteomics peptide profiles obtained in quantitative differential protein

26

expression studies, such as in biomarker discovery, may be affected by multiple experimental factors.

27

These pre-analytical factors may affect the measured protein abundances which in turn influence the

28

outcome of the associated statistical analysis and validation. It is therefore important to determine

29

which factors influence the abundance of peptides in a complex proteomics experiment, and to identify

30

those peptides that are most influenced by these factors. In the current study we analysed depleted

31

human serum samples to evaluate experimental factors that may influence the resulting peptide profile

32

such as the residence time in the autosampler at 4˚C, stopping or not stopping the trypsin digestion with

33

acid, the type of blood collection tube, different haemolysis levels, differences in clotting times, the

34

number of freeze-thaw cycles and different trypsin/protein ratios. To this end we used a two-level

35

fractional factorial design of resolution IV ( 2 7IV−3 ). The design required analysis of 16 samples in which

36

the main effects were not confounded by two-factor interactions. Data pre-processing using the

37

Threshold Avoiding Proteomics Pipeline1 produced a data-matrix containing quantitative information

38

on 2,559 peaks. The intensity of the peaks was log-transformed, and peaks having intensities of a low t-

39

test significance (p-value > 0.05) and a low absolute fold ratio (< 2) between the two levels of each

40

factor were removed. The remaining peaks were subjected to ANOVA – simultaneous component

41

analysis (ASCA)2. Permutation tests were used to identify which of the pre-analytical factors

42

influenced the abundance of the measured peptides most significantly. The most important pre-

43

analytical factors affecting peptide intensity were (1) the haemolysis level, (2) stopping trypsin

44

digestion with acid and (3) the trypsin/protein ratio. This provides guidelines for the experimentalist to

45

keep the ratio of trypsin/protein constant and to control the trypsin reaction by stopping it with acid at

46

an accurately set pH. The haemolysis level cannot be controlled tightly as it depends on the status of a

47

patient’s blood (e.g. red blood cells are more fragile in patients undergoing chemotherapy) and the care

48

with which blood was sampled (e.g. by avoiding shear stress). However, its level can be determined ACS Paragon Plus Environment

Page 3 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

49

with a simple UV spectrophotometric measurement and samples with extreme levels or the peaks

50

affected by haemolysis can be discarded from further analysis.

51

The loadings of the ASCA model led to peptide peaks that were most affected by a given factor,

52

for example to haemoglobin-derived peptides in the case of the haemolysis level. Peak intensity

53

differences for these peptides were assessed by means of extracted ion chromatograms confirming the

54

results of the ASCA model.

ACS Paragon Plus Environment

Analytical Chemistry

Page 4 of 31

Mitra / ASCA analysis for proteomics samples / 4 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

55

INTRODUCTION

56

Differential protein expression analysis using high-throughput LC-MS(/MS) profiling platforms,

57

such as in biomarker discovery and validation, is a major task in proteomics. Human blood is an easily

58

accessible body fluid that transports biomolecules related to the immune defence, signalling and cell

59

and tissue leakage and is therefore an important biological matrix for biomarker discovery and

60

validation. Serum obtained from blood after clotting and centrifugation or filtration is a highly complex

61

fluid in terms of protein composition. 1924 proteins were identified in an analysis combining multiple

62

datasets in PeptideAtlas3 and it is estimated that more than 10,000 proteins may be present in serum

63

with a dynamic concentration range spanning 11 orders of magnitude4,5. Analytical chemists

64

responsible for sample preparation and analysis, e.g. by LC-MS(/MS) of serum, must strive to

65

minimize variation introduced by pre-analytical steps. Variation introduced by differences in analytical

66

parameters may influence the quality of the data and thereby create a biased statistical outcome. The

67

complexity of sample preparation methods in proteomics, the large number of detected analytes and the

68

varying experimental parameters increase the risk of identifying false positive biomarker candidates or,

69

alternatively, missing potentially relevant ones. It is therefore important to determine the effect of pre-

70

analytical, experimental factors on the measured peptide profiles. Identifying the sources of

71

experimental variation allows better control of sample quality and minimizes false discoveries by

72

ensuring minimal technical variation within a proteomics study6. Also, samples that do not meet the

73

quality criteria for vital non-controllable experimental factors can be excluded from further analysis.

74

Experimental design is a well-established method that is used in many research fields and

75

industrial processes to reveal the effects of experimental factors on measured response variables.

76

Currently there is no study design to assess the effect of pre-analytical factors on complex proteomics

77

profiles in a comparable and objective manner, while this is commonplace in single-molecule,

78

bioanalytical studies7 and even part of guidelines for regulated bioanalysis8. We address this issue

79

using a multivariate statistical approach to evaluate the effect of individual experimental factors and the ACS Paragon Plus Environment

Page 5 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 5 1 80 2 3 4 81 5 6 82 7 8 83 9 10 11 84 12 13 85 14 15 16 86 17 18 87 19 20 88 21 22 23 89 24 25 90 26 27 91 28 29 30 92 31 32 93 33 34 35 94 36 37 95 38 39 96 40 41 42 97 43 44 98 45 46 99 47 48 49 100 50 51 101 52 53 102 54 55 56 103 57 58 104 59 60

interactions between two or multiple factors on complex peptide profiles. Since proteomics analyses are time-consuming, we opted for a fractional factorial design to reduce the number of analyses per study, at the expense of confounding potential interactions between factors thus focusing on the main effects only. LC-MS(/MS) data from digested, complex proteomics samples contain quantitative information on thousands of peptides requiring multivariate statistical approaches. Additionally, to interpret the data matrix that arises from an experimental design study, a statistical method that incorporates the structure of the dataset is required2. Multivariate extensions of ANOVA (MANOVA) break down for large numbers of variables2. ANOVA-simultaneous component analysis (ASCA) is a technique that can handle large numbers of variables and includes all aspects of an experimental design into the statistical analysis of the data. Collecting multivariate data from an experimental design is not common in proteomics studies but is more frequently used in metabolomics and genetics studies. For example Scholtens et al.9 describe a statistical linear model of microarray data measured with an experimental design. Johnson et al.10 used PCA followed by multivariate analysis of variance of the resulting principal components for a multivariate evaluation of complex metabolomics datasets. In their study the direct injection mass spectrometry and Fourier transform infra-red data from metabolite profiling experiments of wild type and an ethylene signalling mutant of Arabidopsis was assessed with the PCA-MANOVA approach. This approach identified multiple significant main effects and interactions. Canonical variate analysis was used to investigate further the biological meaning of the findings. In a proteomics study by Szalowska et al.11 a fractional factorial design including 32 experiments was applied to reveal the effect of different pre-analytical factors on SELDI spectrum quality of the secretome of human visceral adipose tissue. The spectrum quality was assessed using univariate ANOVA by four scores, the total number of detected peaks, the average signal-to-noise ratio, peak shape and peak resolution. Andrews et al.12 used full and fractional factorial designs to optimize parameters of LC-MS/MS analyses with ACS Paragon Plus Environment

Analytical Chemistry

Page 6 of 31

Mitra / ASCA analysis for proteomics samples / 6 1 105 2 3 4 106 5 6 107 7 8 108 9 10 11 109 12 13 110 14 15 16 111 17 18 112 19 20 113 21 22 23 114 24 25 115 26 27 116 28 29 30 117 31 32 118 33 34 35 119 36 37 120 38 39 121 40 41 42 122 43 44 123 45 46 124 47 48 49 125 50 51 126 52 53 127 54 55 56 128 57 58 129 59 60

the goal to improve sequence coverage of Saccharomyces cerevisiae. This study evaluated the number of identified protein groups, unique peptides and spectral counts applying univariate ANOVA on Half Normal Quantile Probability Plots to identify factors that have a significant influence on these quality measures. Harrington et al. in two studies13,14 applied ANOVA Principal Component Analysis to correct for the variance of confounding factors on the factors of interest. In their MALDI-MS discovery study, protein biomarkers from amniotic fluid were identified that can be used to detect intra-amniotic infections, which is one of the main causes of premature delivery. The main difference between ANOVA-PCA and ASCA, as used in this study, is that ANOVA-PCA includes the residual variance in the PCA step whereas ASCA looks solely at the variance due to factors and interaction terms15. However until now no study reported a multivariate analysis of complex proteomics data obtained in an experimental design with the goal to identify the most influential factors and affected peaks on entire peptide profiles. The multivariate approach is essential since many compound-related signals correlate, such as different charge states of the same peptide, different uniquely mapping peptides from the same protein and proteins whose expression is co-regulated. This study presents statistical methodology that has the following 3 aims: 1) assess the performance of ASCA analysis with respect of number of compounds versus sample size using peaks obtained after filtering with Volcano plot parameters (t-test significance and fold ratio change), 2) to determine factors that affect the measured peptide profile in an experimental design study for complex proteomics samples, and 3) to identify the peaks that are most affected by these factors with their effect size. First a simulated dataset is used to assess the performance of ASCA analysis where significant and non-significant factors as well the peaks affected and non-affected by factors are known. This setup allows assessment of the effect of peak selection based on Volcano plot parameters on ASCA performance. Finally ASCA analysis of an experimental design dataset is shown for a dataset that was obtained with a human serum sample and included 7 factors at two levels with a 2 7IV−3 design at resolution IV. ACS Paragon Plus Environment

Page 7 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 7 1 130 2 3 4 5 131 6 7 132 8 9 10 133 11 12 134 13 14 135 15 16 17 136 18 19 137 20 21 138 22 23 24 139 25 26 140 27 28 141 29 30 31 142 32 33 143 34 35 36 144 37 38 145 39 40 41 146 42 43 147 44 45 148 46 47 48 149 49 50 150 51 52 151 53 54 55 56 57 58 59 60

1. METHODS 1.1. Experimental design In this study of human serum samples that were depleted of the 6 most abundant proteins we identified seven factors that could have an important effect on the peptide profiles: (1) type of blood collection tube, (2) clotting time, (3) haemolysis level, (4) trypsin/protein ratio, (5) stopping or not stopping trypsin digestion with acid, (6) number of freeze-thaw cycles, and (7) residence time of the digest in the autosampler at 4˚C. Each of these factors was analysed at two levels covering the expected range of conditions. A full factorial design with seven factors varied at two levels would require at least 27 = 128 analyses and would give information on the main effect of each factor and on all interactions between the different factors from the second till the seventh order. In order to limit sample analysis time and cost, a fractional factorial design with resolution IV ( 2 7IV−3 − design) was performed16,17. This design requires 16 LC-MS analyses, and gives sufficient resolution to screen for the main effects, since these effects are not confounded by any other main effects or by any two-factor interactions. The design follows the general concept of the experimental design strategy16 and we have used the software MODDE (version 7.0.0.1) to determine the level distribution for all 7 factors for the minimal set of 16 samples. Table 1 lists the distribution of the levels for each of the 7 factors for the 16 experiments used in this study. The exact definitions of the two levels, low and high are indicated for each factor. Figure 1 shows the workflow that illustrates the main steps in the analysis of the experimental design single stage-LC-MS data. The work presented in this article includes wet lab and bioinformatics data analysis steps. Page 3 of the supporting information contains a detailed description of the three main steps that need bioinformatics intervention.

ACS Paragon Plus Environment

Analytical Chemistry

Page 8 of 31

Mitra / ASCA analysis for proteomics samples / 8 1 152 2 3 4 153 5 6 154 7 8 155 9 10 11 156 12 13 157 14 15 16 158 17 18 159 19 20 160 21 22 23 24 161 25 26 27 162 28 29 30 163 31 32 164 33 34 165 35 36 37 166 38 39 167 40 41 42 168 43 44 45 169 46 47 48 170 49 50 171 51 52 53 172 54 55 56 173 57 58 174 59 60

1.2. Blood sample collection Blood was collected from a healthy volunteer (male) and serum obtained from the University Medical Center Groningen (UMCG, the Netherlands). Serum was stored at -80ºC in aliquots until analysis. At the UMCG, all blood donors are routinely asked to give written informed consent for collection and storage of serum samples in a biobank for future research. Relevant data are retrieved and transferred into an anonymous, password-protected database. The personal identity is protected by a study specific, unique code, and the true identity is only known to two dedicated data managers. According to Dutch regulations, these precautions mean that no further institutional review board approval is needed (http://www.federa.org). 1.3. Sample preparation and description of experimental factors 1.3.1. Blood collection tubes (factor 1 “blood collection tube”) Two kinds of tubes, that were both in use at the UMCG to establish the serum biobank, were evaluated for blood collection: BD368430 (low level, a “red stopper clotting tube”, which is a glass tube with a siliconized inner wall to avoid retention of red blood cells on the walls of the tube) and BD367784 (high level, a “gel tube”, which is a glass tube with a separation gel and micronized silica to accelerate clotting). During centrifugation, the polymer gel moves up the inner wall of this tube forming a barrier between the supernatant (serum) and sediment (blood clot and cells)). 1.3.2. Clotting time (factor 2 “clotting time”) Blood samples were allowed to clot for 2 (low level) or 6 (high level) hours at room temperature prior to centrifugation to obtain serum. 1.3.3. Level of haemolysis (factor 3 “haemolysis”) To simulate an elevated (high) level of haemolysis, a lysate of red blood cells was added to the serum prior to depletion. Haemoglobin is not removed by the Multiple Affinity Removal column ACS Paragon Plus Environment

Page 9 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 9 1 175 2 3 4 176 5 6 177 7 8 178 9 10 11 179 12 13 180 14 15 16 181 17 18 182 19 20 183 21 22 23 184 24 25 185 26 27 186 28 29 30 187 31 32 188 33 34 35 189 36 37 190 38 39 191 40 41 42 192 43 44 45 193 46 47 48 194 49 50 195 51 52 196 53 54 55 197 56 57 198 58 59 60

(Agilent) used to deplete the 6 most abundant proteins. Red blood cells were collected according to the following protocol: 0.5 mL lysis buffer (NH4Cl 155 mmol/L, EDTA 0.1 mmol/L) was added to 0.5 mL fresh blood and centrifuged for 20 minutes at 2,000 rpm. 4 mL of lysis buffer was added to the pellet and incubated overnight at 4ºC. The next day the lysate was filtered through spin filters (0.22 µm; #5185-5990, Agilent) at 13,000 rpm. Aliquots of the filtrate were stored at -80ºC. The amount of lysed red blood cells that should be added to serum to mimic an increased (high) level of haemolysis was determined by the addition of different volumes (1, 3, 5, 7 and 10 µL) of red blood cell lysate to 20 µL serum that was immediately diluted with ice-cold water to a total volume of 60 µL followed by centrifugation at 13,000 rpm for 30 minutes at 4ºC. Another 15 µL of ice-cold water was added to the supernatant and the absorbance was measured at 340, 380, 415 and 450 nm (Biowave S2100 UV/Vis Diode Array Spectrophotometer (Biochrom Ltd, Cambridge, UK)). A calibration line with respect to haemoglobin (Hb, Haemoglobin human, Sigma, #9008-02-0) was obtained using the following formula: Hb [g/L] = (167.2 × A415 – 83.6 × A340/380 – 83.6 × A450)/100018. The absorbance of the used serum sample from the biobank was measured and the amount (in µL) of red blood cells lysate necessary to be added to reach a high level of 89 g/L Hb was calculated from the calibration curve (high level). This level of haemolysis is observed in cancer patients receiving chemotherapy, and was obtained with addition of 4 µL red blood cell lysate (containing 6.68 µg Hb) to 75 µL serum. The original serum sample was used as low level of haemolysis. 1.3.4. Depletion of high-abundance serum proteins 80 µL (80% of the total amount (20 µL of crude serum mixed with 80 µL of buffer A (Agilent)) of diluted crude serum was injected on a Multiple Affinity Removal column (Agilent, 4.6 × 50 mm, #5185-5984) after filtration through 0.22 µm spin filters (# 5185-5990) at 13 000 g and 4˚C for 10 min to remove particulates. The removal of the 6 most abundant proteins was performed on a LaChrom HPLC System (Merck Hitachi) with detection at 280 nm using the following timetable: 0-9 min, 100%

ACS Paragon Plus Environment

Analytical Chemistry

Page 10 of 31

Mitra / ASCA analysis for proteomics samples / 10 1 199 2 3 4 200 5 6 201 7 8 9 202 10 11 203 12 13 204 14 15 16 205 17 18 19 206 20 21 22 207 23 24 208 25 26 209 27 28 29 30 210 31 32 33 211 34 35 212 36 37 213 38 39 40 41 214 42 43 215 44 45 46 216 47 48 217 49 50 51 218 52 53 54 219 55 56 220 57 58 59 60

buffer A (0.25 mL/min); 9.0-9.1 min, linear gradient 0-100 B % (1 mL/min), 9.1-12.5 min, 100% buffer B (1 mL/min); 12.5-12.6 min, linear gradient 100-0% buffer B (1 mL/min); 12.6-20 min, 100% buffer A (1 mL/min). The flow-through fraction (depleted serum collected between 2-6 min, total volume of ∼1 mL) was collected19. Protein concentrations were determined with the Micro BCAtm Protein assay reagent kit (Pierce) and calculated for an average protein molecular weight of 50 kDa. Bovine serum albumin was used as the calibration standard. 1.3.5. Digestion of serum samples (factor 4 “trypsin digestion”) Trypsin (sequencing grade modified trypsin, Promega, #V5111, USA) in ratios 1:20 (high level) or 1:100 (low level) wt/wt (enzyme to total protein in depleted serum) was used for digestion at 37ºC at 450 rpm overnight (Eppendorf thermomixer). 1.3.6. Stopping trypsin digestion (factor 5 “stopping trypsin”) To stop the reaction with trypsin for the high factor level, formic acid was added after overnight digestion to reach a final concentration of 0.5% (v/v). For samples having the low level of this factor, this step was left out. 1.3.7. Freeze-thaw cycles (factor 6 “freeze-thaw cycle”) The two levels of this factor consisted of one (low level) or three (high level) freeze-thaw cycles (80°C/room temperature), respectively, in which one freeze-thaw cycle was indispensable during sample collection, storage and the analysis procedure. Aliquots until the analysis were stored at -80°C. 1.3.8. Sample stability in the autosampler (factor 7 “sample stability”) Stability of the trypsin-digested serum samples was evaluated by keeping them for 30 days at 4ºC in the autosampler (high level) or by injecting them directly (low level) after thawing.

ACS Paragon Plus Environment

Page 11 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 11 1 221 2 3 4 5 222 6 7 223 8 9 10 224 11 12 225 13 14 15 226 16 17 227 18 19 228 20 21 22 229 23 24 230 25 26 231 27 28 29 232 30 31 233 32 33 234 34 35 36 235 37 38 236 39 40 41 237 42 43 238 44 45 239 46 47 48 240 49 50 241 51 52 242 53 54 55 243 56 57 244 58 59 60

1.4. LC-MS analysis 1.4.1. LC-MS iontrap All LC-MS analyses were performed on an Agilent 1100 capillary HPLC system coupled on-line to an SL iontrap mass spectrometer (Santa Clara, USA) equipped with an Atlantis™ dC 18 column (1.0 × 150 mm, 3 µm, Agilent Technology) that was protected by an Atlantis™ dC 18 in-line trap column (3 µm, 2.1 mm × 20 mm guard column, Agilent Technology). 40 µL of the pre-treated (depleted and digested) fractions corresponding to ~8 µg. The autosampler (cat. #G1367A) was equipped with a 100 µL injection loop and a temperature-controlled cooler (cat. #G1330A) maintaining the samples at 4°C. The HPLC system had the following additional components: capillary pump (cat. #G1376A), solvent degasser (cat. # G1379A), UV detector (cat. #G1314A) and column holder (cat. #G1316A). The sample was injected and washed in the back-flush mode for 30 min (0.1% aq. formic acid (FA, 98100%, pro analysis, Merck, Darmstadt, Germany) and 3% acetonitrile (ACN, HPLC-S gradient grade, Biosolve, Valkenswaard, the Netherlands) in ultra-pure water (resistance 18.2 MΩ × cm obtained with a Sartorius Stedim purification system, Nieuwegein, the Netherlands) at a flow rate of 50 µL/min). Peptides were eluted in a linear gradient from 0 to 70% (0.5%/min) ACN containing 0.1% FA at a flow-rate of 20 µL/min. After each injection, the in-line trap and the analytical column were equilibrated with eluent A (H2O/ACN/FA; % of 49:50:1) for 20 min prior to the next injection. The injection order of the samples listed in Table 1. was randomized. The following settings were used for mass spectrometry during LC-MS. Nebulizer gas: 16.0 psi N2, drying gas: 6.0 L/min N2, skimmer: 40.0 V, ionisation voltage: 3500 V, cap. exit: 158.5 V, Oct. 1: 12.0V, Oct. 2: 2.48 V, Oct. RF: 150 Vpp (Voltage, Peak Power Point), Lens 1: -5.0 V, Lens 2: -60.0 V, Trap drive: 53.3, T: 325°C, Scan resolution: enhanced (5500 m/z per second scan speed) single stage MS acquisition mode except for two samples, which were analysed in data dependent LC-MS/MS mode. In data dependent mode the following parameters were used: nebulizer gas: 2.03 psi N2, drying

ACS Paragon Plus Environment

Analytical Chemistry

Page 12 of 31

Mitra / ASCA analysis for proteomics samples / 12 1 245 2 3 4 246 5 6 247 7 8 248 9 10 11 249 12 13 250 14 15 16 251 17 18 252 19 20 253 21 22 23 254 24 25 255 26 27 256 28 29 30 257 31 32 258 33 34 35 36 259 37 38 260 39 40 261 41 42 43 262 44 45 263 46 47 264 48 49 50 265 51 52 266 53 54 267 55 56 57 268 58 59 60

gas: 1.94 L/min N2, skimmer: 40.0 V, ionisation voltage: 3,500 V, cap. exit: 136.0 V, Oct. 1: 12.0 V, Oct. 2: 1.74 V, Oct. RF: 150 Vpp (Voltage, Peak Power Point), Lens 1: -5.0 V, Lens 2: -60.0 V, trap drive: 52.9, T: 274°C, multiplier voltage: 1,938 V, ionization mode: positive, scan resolution: enhanced (5,500 m/z per second scan speed) and scanning mode was single stage MS acquisition only except for two samples, which were analysed in data dependent LC-MS/MS mod. Target mass: 600. Scan range: 100-1500 m/z. Average number of spectra: 2. Spectra were saved in centroid mode. LC-MS chromatographic data were acquired with Bruker Data Analysis software, version 2.1 (Build 37)19. In MS/MS mode the following parameters were used: Fragmentation time 40,000 µs. Fragmentation width: 10.00 m/z. MS/MS acquisition parameters were Isol Coars High. Mass < 2,090: 150, Isol Fine High. Mass: 70, Isol Fine Low. Mass < 2,090: 200, Isol Coarse High, Ext mass > 2,090: 200, Isol Fine High. Ext mass < 2,090: 200, Isol Fine Low. Ext mass < 2,090: 200, Isolation delay: 0 µs. Description of data pre-processing and LC-MS/MS analysis using Agilent QTOF instrument of human serum sample used for peak annotation and details of annotation transfer can be found at pages 1-3 in the supporting information. 1.5. ASCA analysis ASCA is a multivariate technique to analyse a data matrix, X, from an experimental design2. Each row in the data matrix represents a measurement of multiple variables that form the columns of the data matrix. When using ASCA, the data matrix is decomposed using ANOVA style in a matrix, M, containing the column averages of X, main effect matrices αk, for each experimental factor k, two factor interaction matrices that describe Akj and describe the interaction between factors k and j, higher interaction matrices if appropriate, and a residual matrix, E, that contains the variation not explained by the main effects and interactions. In this paper we only consider main effects, hence, the data matrix is written as the sum of the overall means, the main effect matrices Ak, and the matrix E with residuals: X = M+ ∑ k A k + E

ACS Paragon Plus Environment

Page 13 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 13 1 269 2 3 4 270 5 6 271 7 8 9 272 10 11 273 12 13 14 274 15 16 275 17 18 276 19 20 21 277 22 23 278 24 25 26 279 27 28 280 29 30 31 32 281 33 34 282 35 36 37 283 38 39 284 40 41 285 42 43 44 286 45 46 287 47 48 288 49 50 51 289 52 53 290 54 55 291 56 57 58 292 59 60

With ASCA, a principal component analysis (PCA) is performed on each of the effect matrices Ak. When a factor m has two factor levels, PCA on the effect matrix Am will give a single principal component indicating the difference between the two factor levels. The size of the principal component of factor k, measured by the sum of squares (SSQ) of the elements of the effect matrix A k gives the variance that is explained by factor k. By comparing the SSQ of the different factors we can order the experimental factors according to the variance they explain. The loadings of the principal component give the contribution of each of the variables to the factor effect; the larger the loading of a variable, the larger the difference of its measured values between the two factor levels. To assess the significance of the difference in factor level of factor k, we use a permutation test in which the SSQ of A k is used as a test statistic. For each factor, k, the class labels of the levels of the remaining factors were randomly permuted 10 000 times. The p-value for factor k was determined as the fraction of the permutations with a SSQ larger than the SSQ of the effect matrix Ak. 1.5.1. Structure of the dataset The peptide data matrix, X, as described in section 1.6 of the supporting information has n = 16 rows (the number of measured samples, see Table 1) and V = 2 559 columns (the single stage matched peaks). The matched peaks show a large variation in intensity, therefore the peak intensities were natural log-scaled before the ASCA analysis. Missing values in X contain 0 values, which were replaced by noise sampled from a normal distribution of N(µ=6.042, σ=0.5391) [see section 1.2 “Data pre-processing and quantification” in supporting information for origin of the distribution]. To avoid bias due to one particular noise realization, all ASCA analysis was repeated for 100 different noise realizations. In this study we assessed the performance of ASCA using dataset X obtained with statistical simulation where all significant factors and significant factors affected peaks is known followed by analysis of dataset obtained from an experimental fractional factorial design described above (see detailed description in Material and Method section in the supporting information).

ACS Paragon Plus Environment

Analytical Chemistry

Page 14 of 31

Mitra / ASCA analysis for proteomics samples / 14 1 293 2 3 4 294 5 6 7 8 295 9 10 11 296 12 13 14 297 15 16 17 298 18 19 299 20 21 22 300 23 24 301 25 26 302 27 28 29 303 30 31 304 32 33 305 34 35 36 306 37 38 307 39 40 41 308 42 43 309 44 45 310 46 47 48 311 49 50 312 51 52 313 53 54 55 314 56 57 315 58 59 60

The source code of the ASCA analysis and of the simulation including raw LC-MS files and TAPP1 pre-processed data is available at https://github.com/vikrammitra/ASCA.

2. RESULTS & DISCUSSION 2.1. Method validation using statistical simulation 2.1.1. ASCA analysis of simulated data Proteomics molecular profiling data contains quantitative information on several thousands of compounds measured in low number of samples. ASCA is a multivariate statistical approach that provides the possibility to evaluate if factor is significant taking all compounds into account. However we expect that the high ratio of number of measured compounds versus sample size hamper the performance of ASCA analysis. In order to study the relation of ASCA performance with respect of number of variables versus sample size, we performed simulation study where all significant factors and affected peaks are known using the same setup that was obtained for the experimental design i.e. using the same number of variables and factors, the same noise level and the number of significant factors and affected peaks that we estimate to be reasonable in the analysed dataset. We have included in the simulation our assumption that not all factors are significant and factors may influence the peptide profile differently with respect of effect size and number of affected peptides. Therefore for the simulation study we used seven pre-analytical factors from which three (factors 1, 3 and 5) were constructed to have a significant effect on 5%, 3.5% and 5% of randomly chosen peaks with a mean difference in peak intensity between the two factor levels of 3, 4 and 6, respectively. The matrix was filled with random variables from the noise distribution found in the experimental fractional factorial design dataset of N(µ=6.042, σ=0.539) for peaks not affected by factors. These parameters are much similar to the real experimental design dataset. We did permutation tests on 100 different realizations of the simulated data matrix and found that on average at the α = 0.05 significance level factors 3 and 5

ACS Paragon Plus Environment

Page 15 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 15 1 316 2 3 4 317 5 6 318 7 8 319 9 10 11 320 12 13 321 14 15 16 322 17 18 323 19 20 324 21 22 23 325 24 25 326 26 27 327 28 29 30 328 31 32 329 33 34 35 330 36 37 331 38 39 40 332 41 42 333 43 44 334 45 46 47 335 48 49 336 50 51 337 52 53 54 338 55 56 339 57 58 59 60

were identified as significant, however, factor 1 was not found to have a significant effect at the α = 0.05 level (Figure 2d) but had a p-value close to the significance threshold. The failure to identify factor 1 as a significant factor may be due to the fact that the mean difference in peak intensity between the two levels of factor 1 is too small and that the number of peaks that are significantly affected by the factor is too low. To further study the effect size (i.e. mean differences between factor levels) and the number of the significant factor affected peaks and the ratio of number of variable versus sample size on the ability of ASCA to identify significant factors we filtered the peaks based on their position in the volcano plot considering all factors (Figure S1a). The selection of differentially expressed features using fold change values and their significance in discriminating groups has been used in genomics and proteomics experiments20,21. For each factor, we selected the peaks in the data matrix X based on their p-value a fold change. The fold change values were calculated for each factor as the log2 ratio of the average of the peak intensities for the two levels. The p-values were obtained from a two samples t-test where the samples comprised the peak intensities of the eight peaks for each of the two levels of the factor under consideration (see Table 1). The final peak list was the union of peaks selected after Volcano filtering of all factors. Figure S1a shows an example of a volcano plot with selected peaks for a simulated and the experimental data set described above. We studied the variance explained by the ASCA model (SSQ) and the significance obtained from permutation tests. These two parameters can be used to assess how peak selection affect ASCA performance. We expect a factor more frequently found to be significant when more of the peaks are affected by this factor and when the effect of the factor on the peaks is large. The same holds for the significance of the individual peaks: when factor is affecting a peak more significantly, it is easier to identify as significant by ASCA. Finally, if ASCA performance is not affected by the number of peaks/sample size ratio, then it is expected that ASCA performance stays the same when a list of peaks

ACS Paragon Plus Environment

Analytical Chemistry

Page 16 of 31

Mitra / ASCA analysis for proteomics samples / 16 1 340 2 3 4 341 5 6 342 7 8 343 9 10 11 344 12 13 345 14 15 16 346 17 18 347 19 20 348 21 22 23 349 24 25 350 26 27 351 28 29 30 352 31 32 353 33 34 35 354 36 37 355 38 39 356 40 41 42 357 43 44 358 45 46 359 47 48 49 360 50 51 361 52 53 362 54 55 56 363 57 58 364 59 60

significantly affected by factors is complemented with different number of peaks that are not affected by any of the factors. The explained variance and the corresponding significance from permutation tests as a function of fold changes and t-test significance are shown in Figure 2a-f, and with more threshold values in Figure S2 and S3 respectively. Figure S2 indicates that the explained variance by factors (SSQ) is depending from the number of variables. Since the same data matrix is used for all the seven factors the rank of the SSQ represent the rank of the effect of factors on the simulated molecular profile. This means that the SSQ of factors obtained with one Volcano filtering threshold set are comparable. The SSQ significance (Figure 2d-f and S3) of the significant factors are similar for the unfiltered data matrix and for the data matrix filtered with low to moderate thresholds. In our simulation we can assess the quality of the Volcano based filtered peak lists to incorporate factor affected peaks using g and fscores (Figure S4a-d and see Table S1 and S2 in supporting information for definition of these terms, where positive peaks refer to a peak affected by a factor). If ASCA performance is affected by the different ratio of peaks to sample size we expect to have the best ASCA performance at peak list with highest g- and f-scores since this peak list is enriched in peaks significantly affected by factors without not losing too much of them. Figure S3 shows that –log10 of the variance significance of the factors has the highest value at the same thresholds that were determined based on metrics assessing peak selection performance (Figure S4a-d). Similarly plots in Figure 2a-f shows that the highest significance for each factor determined by ASCA is obtained at t-test significance and fold ratio thresholds of 0.05 (1.301 log10 scale) and 2 (1 in log2 scale). Figure S1a shows the Volcano plot with the selected peaks obtained with these parameters. In general significance is the highest by low threshold values of Volcano filtering in the tested dataset and shows only a slight decrease when all peaks are used and an important decrease for extremely high thresholds. When the simulation is performed with a higher ratio of number of variables versus sample size, smaller effect size and/or lower number of peaks affected by significant factors or has higher noise ACS Paragon Plus Environment

Page 17 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 17 1 365 2 3 4 366 5 6 367 7 8 368 9 10 11 369 12 13 370 14 15 16 371 17 18 372 19 20 21 373 22 23 24 374 25 26 375 27 28 29 376 30 31 377 32 33 378 34 35 36 379 37 38 380 39 40 41 381 42 43 382 44 45 383 46 47 48 384 49 50 385 51 52 386 53 54 55 387 56 57 388 58 59 60

level, the ASCA performance difference between using all peaks and Volcano based filtered peaks is larger (data not shown). To test if overfitting occurs, we have repeated the simulation 15 times using random matrix X with the same size than the experimental design dataset and matrices obtained after Volcano filtering using the same thresholds that was used during the assessment of ASCA performance with the simulation study. This experiment contains 105 independent test for non-significant factor and is similar to testing significance of large number of two sample t-test on data sampled from the same normal distribution. The outcome (figures are in ParameterOptFactdesRandom.zip file submitted as supporting information) shows no overfitting of the data. 2.1.2. The use of ASCA loadings to select factor-affected peaks. Finally we studied the use of the factor loadings as measure to select those peaks that are most affected by a given factor. The difference of the mean ranks of affected and un-affected peaks in repeated simulations is a suitable metric to assess the use of loadings to select peaks that are most significantly affected by certain factors. For this purpose volcano filtered peaks with a threshold of p < 0.05 and a 2 fold ratio in 100 repeated simulations were used. In 100 simulations the average number of affected peaks (±standard deviation) for factors 1, 3 and 5 were 117.05±3.25, 88.46±1.14 and 127.23±0.87, respectively, while the average number of nonaffected peaks was 345.94±12.68, 374.53±12.71, 335.76±12.61, respectively. The number of selected peaks was on average 462.99±12.68. For factors 1, 3 and 5 the affected peaks showed a mean rank (±standard deviation) of the absolute loadings of 60.28±1.69, 44.80±0.57 and 64.12±0.44 compared to the mean rank of the non-affected peaks corresponding to 290.10±6.92, 276.21±6.37 and 295.61±6.40, respectively. The mean ranks of the non-affected peaks of the non-significant factors 2, 4, 6 and 7 were 232.00±6.34. This shows that the significant factor-affected peaks had lower loading ranks than nonaffected peaks and that those loadings can be used to identify peaks that are affected by significant factors.

ACS Paragon Plus Environment

Analytical Chemistry

Page 18 of 31

Mitra / ASCA analysis for proteomics samples / 18 1 389 2 3 4 390 5 6 391 7 8 392 9 10 11 393 12 13 394 14 15 16 395 17 18 396 19 20 397 21 22 23 398 24 25 399 26 27 400 28 29 30 401 31 32 402 33 34 35 403 36 37 404 38 39 405 40 41 42 406 43 44 407 45 46 408 47 48 49 409 50 51 410 52 53 411 54 55 56 412 57 58 413 59 60

2.2. Experimental Design data analysis In experimental design dataset it is not known a priori, which factors are significant and which peaks are affected by significant factors. Therefore it is possible to assess factor significance using permutation test of a given dataset (with or without Volcano filtering), but it is not possible to select Volcano based filtering threshold based on the experimental data because there is no possibility to validate the peak selection performance using g and f-scores. In this situation optimisation of the ASCA performance using Volcano peak filtering parameter may lead to overfitting. In this situation the best that can be done is to have an assumption on the number of significant factor and number of affected peaks with fold change and try to assess this scenario using simulation as shown in the previous section. The last step is to perform ASCA analysis with the threshold that is reasonable based on the outcome of the simulation study. We have therefor performed the ASCA analysis of the experimental design dataset with the threshold 0.05 for t-test significance and 2 for the fold ratio, which resulted in 918.4±8.4 selected peaks upon 100 repeated ASCA analyses. Plots in Figure 2g show the SSQ, while the plots in Figure 2h show the SSQ's significance using peaks selected with t-test significance and fold ratio thresholds of 0.05 and 2. A Volcano plot of peaks extracted with the applied threshold is presented in Figure S1b. ASCA was then performed on the filtered list of peaks to identify the main effects that affect the peptide profile of depleted human serum significantly. Figure 2h shows that two factors, haemolysis and stopping trypsin, are below the threshold of 0.05 for SSQ’s significance, of which haemolysis has the strongest influence. Trypsin digestion (trypsin/protein ratio) almost reached the threshold of 0.05. The other factors such as blood collection tube, clotting time, freeze-thaw cycle and sample stability in the autosampler show an SSQ's significance close to 1. The “high level” of the most significant haemolysis factor was obtained by spiking a lysate of red blood cells into the serum samples corresponding real haemolysis levels observed in serum samples ACS Paragon Plus Environment

Page 19 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 19 1 414 2 3 4 415 5 6 416 7 8 417 9 10 11 418 12 13 419 14 15 16 420 17 18 421 19 20 422 21 22 23 423 24 25 424 26 27 425 28 29 30 426 31 32 427 33 34 35 428 36 37 429 38 39 430 40 41 42 431 43 44 432 45 46 433 47 48 49 434 50 51 435 52 53 54 436 55 56 437 57 58 438 59 60

from cancer patients that were treated with chemotherapy (see description of factors in section 1.6. in the Material and Methods). In this study the haemolysis factor can be considered as positive control for the identification of significant factors, because due to the spiking of a red blood cell lysate we expect to find peptides that show clear differences in concentration between the factor levels. The 'stopping trypsin' factor has an effect on the trypsin digestion step for different amounts of trypsin/protein ratios, while the trypsin digestion factor with levels of different pH may influence the stability of certain peptides in the complex mixture. The lack of a significant effect of the clotting time is in agreement with the results of our previous study showing that there is no effect of clotting time on the digested serum peptide profile, except for fibrinopeptides when studied as single factor between 1-8 hours22. Figure 3 shows the loadings of the 10 most affected peaks for the significant factors of haemolysis, stopping trypsin and trypsin digestion and presents the quantification data obtained after pre-processing and raw data by means of extracted ion chromatograms (EIC) after retention time alignment and by using box plots of the pre-processed LC-MS data for two representative peaks for each of these three factors. Table S3 contains the peptide sequences and protein names of the annotated peaks. The EICs for the raw data and the box-plots for the pre-processed data show that the selected peaks are highly discriminative between the two factor levels that were identified to influence the depleted human serum profile significantly. Annotation accuracy of the matched single stage peaks should be taken with care due to the resolution differences between the QTOF (∼10 000) and the ion trap data (∼1 000) and the fact that slight orthogonality between the separation in the two LC-MS platforms was observed. Annotation transfer of peaks revealed that the haemolysis factor mainly affected peptides of two haemoglobin isoforms (5 peptides), and peptide from alpha-2-macroglobulin (2 peptides), complement factor B precursor (1 peptide) and plasminogen (1 peptide), from which haemoglobin related peptides are originating from red blood cells. This is not surprising since addition of a red blood cell lysate to the sample was used to mimic low and high haemolysis levels. The other two proteins (alpha-2ACS Paragon Plus Environment

Analytical Chemistry

Page 20 of 31

Mitra / ASCA analysis for proteomics samples / 20 1 439 2 3 4 440 5 6 441 7 8 9 10 442 11 12 443 13 14 15 444 16 17 445 18 19 446 20 21 22 447 23 24 448 25 26 449 27 28 29 450 30 31 451 32 33 34 452 35 36 453 37 38 454 39 40 41 455 42 43 456 44 45 457 46 47 48 458 49 50 459 51 52 460 53 54 55 461 56 57 462 58 59 60

macroglobulin, complement factor B precursor and plasminogen) were also reported to be present in the red blood cell proteome23 or the identified peptides may be resulting from activity of proteases released from lysed red blood cells24.

3. CONCLUSION Complex proteomics samples are routinely analysed for biomarker discovery and validation. These samples are often acquired, stored, and processed under different conditions before they are analysed by LC-MS(/MS) or other methodologies. Experimental design is the most efficient way to identify which pre-analytical factors have an effect on a particular peptide in the peptide profile of complex biological samples25. We applied ASCA2 to identify the most significant factors in an experimental design dataset prepared for depleted human serum using 7 factors that we considered to have an influence on the peptide profile. We simulated a dataset using the same number of peaks and factors as for the experimental data, defining 3 factors to have an effect on a small fraction of the peaks. This simulated dataset revealed that ASCA is able to identify the factors with a significant effect and that the loadings of ASCA can be applied to obtain a ranked list of affected peaks. We subsequently applied ASCA to a matched LC-MS peak data matrix X to identify pre-analytical factors that have an influence on the peptide profile of depleted human serum after filtering out peaks based on fold change and t-test significance with thresholds that were obtained at maximal variance significance. The analysis revealed that haemolysis and stopping the trypsin reaction with acid have a significant influence, while the trypsin/protein ratio as a factor almost reached the significance level. EICs and quantification of peaks using the TAPP1 pipeline confirmed these differences for selected peptides in the raw LC-MS data. We demonstrated that simulation using design similar to the real data is an useful tool to assess the performance of ASCA with respect of peak quality and the simulation can guide to obtain a reasonable Volcano filtering threshold to analyse real data. The small difference of ASCA performance between the optimally selected peak set and using all observed peaks in this study, could be more pronounced in

ACS Paragon Plus Environment

Page 21 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 21 1 463 2 3 4 464 5 6 465 7 8 466 9 10 11 467 12 13 468 14 15 16 469 17 18 470 19 20 471 21 22 23 472 24 25 473 26 27 474 28 29 30 475 31 32 476 33 34 35 477 36 37 478 38 39 479 40 41 42 480 43 44 481 45 46 482 47 48 49 483 50 51 484 52 53 485 54 55 56 486 57 58 487 59 60

data obtained from cell lysates or tissue samples with modern high-resolution LC-MS instrumentation, where the number of quantified peptides is several tens or even hundreds of thousands leading to more extreme ratios between the number of detected peaks and sample size. Importance of the peak selection effect on ASCA performance can be more pronounced in datasets where e.g. the noise level is larger than in the studied dataset. Therefore the effect of peak selection on ASCA performance should be evaluated using simulation. From the two significant factors (haemolysis, stopping trypsin) and one factor almost reaching significance (trypsin digestion) that have been identified in this study to affect the tryptic peptide profile of depleted human serum, haemolysis has the most pronounced effect, since haemolysis is related to addition of the lysed red blood cell content to the analysed sample, which alters the proteome composition of human serum. In real samples it is not possible to keep the haemolysis level constant but by measuring the light adsorption at a given set of wavelengths, it is possible to measure the level of haemolysis and discard samples, which are exceeding a certain threshold. This threshold may have to be decided on a case-by-case basis with respect to the study design. The two other factors having an influence on the peptide profile of human serum are the trypsin digestion step and stopping trypsin reaction through the addition of formic acid. The level of these two factors is set by the experimentalist and can be controlled during sample preparation. The other factors appear to be less critical but again this depends on the study design and the depth to which the serum proteome will be measured. Our manuscript describes the application of ASCA to analyse multivariate data obtained in a fractional factorial design study with 7 factors each varied at two levels. Under the applied experimental design 16 samples were analysed which produces a resolution of IV and allows to study main effects only confounded by three-way interactions. Our generic statistical method is applicable to other situations where 'omics' experiments generate multi-variate, highly complex data sets from which it is hard to assess the influence of pre-analytical factors on sample quality and which allows to identify the analytes that are affected by a given factor. ACS Paragon Plus Environment

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 22 1 488 2 3 4 489 5 6 490 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

4. ACKNOWLEDGEMENT We thank Lorenza Franciosi for her support in sample preparation.

ACS Paragon Plus Environment

Page 22 of 31

Page 23 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 23 1 491 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 492 31 32 493 33 34 494 35 36 495 37 38 39 496 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

TABLE Sample

Blood

Clotting

Freeze-thaw

Trypsin

Stopping

Sample

Name

collection tube

Haemolysis

time

cycles

digestion

trypsin

stability

16090534

BD368430

Low

2 hours

1 cycle

6%

Yes

0 days

16090526

BD367784

Low

2 hours

1 cycle

11%

Yes

30 days

16090532

BD368430

High

2 hours

1 cycle

11%

No

0 days

16090540

BD367784

High

2 hours

1 cycle

6%

No

30 days

16090541

BD368430

Low

6 hours

1 cycle

11%

No

30 days

16090525

BD367784

Low

6 hours

1 cycle

6%

No

0 days

16090533

BD368430

High

6 hours

1 cycle

6%

Yes

30 days

16090539

BD367784

High

6 hours

1 cycle

11%

Yes

0 days

16090531

BD368430

Low

2 hours

3 cycles

6%

No

30 days

16090538

BD367784

Low

2 hours

3 cycles

11%

No

0 days

16090536

BD368430

High

2 hours

3 cycles

11%

Yes

30 days

16090530

BD367784

High

2 hours

3 cycles

6%

Yes

0 days

16090535

BD368430

Low

6 hours

3 cycles

11%

Yes

0 days

16090529

BD367784

Low

6 hours

3 cycles

6%

Yes

30 days

16090528

BD368430

High

6 hours

3 cycles

6%

No

0 days

16090542

BD367784

High

6 hours

3 cycles

11%

No

30 days

Table 1. Main parameters of the levels used in the fractional factorial design. Each of the seven factors (shown in columns 2-8) was varied at two levels. The table gives the names of the 16 samples (first column) and the level name for each of the factors at which the sample was measured.

ACS Paragon Plus Environment

Analytical Chemistry

Page 24 of 31

Mitra / ASCA analysis for proteomics samples / 24 1 497 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 498 41 42 499 43 44 45 500 46 47 501 48 49 502 50 51 52 503 53 54 504 55 56 57 505 58 59 506 60

FIGURES

Figure 1. General scheme of the experimental analysis and statistical evaluation including sample preparation, LC-MS analysis, data pre-processing and ASCA analysis. Proteins in biological samples were digested to peptides, which are in this example influenced by seven factors each at two levels. This affects the peptide profile obtained with LC-MS as shown by the EIC of one sample as example (upper right panel). LC-MS raw data were pre-processed with the TAPP pipeline1. The natural logarithm intensity values from the data matrix were then filtered using a fold change of 2 and a t-test significance of 0.05 as shown in a Volcano plot. These thresholds were determined based on simulation analysis. The resulting data matrix was then submitted to ASCA analysis, which identified the ACS Paragon Plus Environment

Page 25 of 31

Analytical Chemistry

Mitra / ASCA analysis for proteomics samples / 25 1 507 2 3 4 508 5 6 509 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

significant factors and provided a ranked list of discriminatory peaks as shown by the extracted ion chromatogram and box plots (green traces correspond to high, blue traces to low level of a factor). Peptide identity was annotated based on the analysis on a QTOF LC-MS/MS platform.

ACS Paragon Plus Environment

Analytical Chemistry

d)

g)

c)

e)

f)

Factor 7

Factor 6

Factor 5

Factor 4

Factor 3

Factor 2

Factor 1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

b)

SSQ

a)

Page 26 of 31

h)

Figure 2. Bar plots illustrating the SSQ (blue plots a, b, c and g) and the significance of SSQ's obtained with 10,000 permutations (red plots d, e, f and h) of main effects using ASCA analysis in simulated datasets (plots a-f) and in the dataset obtained with the experimental design ACS Paragon Plus Environment

Page 27 of 31

Analytical Chemistry

(plots g and h). Bar plots a and d were obtained with ASCA analysis using all peaks, bar plots b, e, g and h were obtained after union of 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

selected peaks having a fold change of 2 and a t-test significance of 0.05 for each of the factors, while bar plots c and f were obtained with union of selected peaks with fold change of 4 a t-test significance of 0.01. The blue dashed line shows the threshold of 0.05 for SSQ's significance (red plots d-f and h). The plots obtained with simulation show that peak selection with moderate threshold (plots b and e) improve slightly the performance of ASCA slightly compared to no peak selection (left panels, plots a and d), while the peak list obtained with more extreme thresholds has a decreased performance (plots c and f).

ACS Paragon Plus Environment

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

Page 28 of 31

Ioncounts (cts)

Analytical Chemistry

Figure 3. The upper bar plots show the 10 peaks with the highest average loadings labelled with protein annotation for the factor haemolysis (left column), trypsin digestion (middle column) and stopping trypsin (right column). The primary peptide sequence and the corresponding protein annotations are listed in Table S3. The middle and bottom plots show examples of overlaid extracted ion chromatograms (EICs) ACS Paragon Plus Environment

Page 29 of 31

Analytical Chemistry

chosen from peaks with the 10 highest loadings having green traces for high and blue traces for low levels of the respective factor. Overlaid 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48

EICs reflect the high discrimination between low and high levels of the respective factors in the raw LC-MS data, while the box plot presented in each EIC plot shows the data obtained after data pre-processing using the TAPP pipeline, the values of which were used for ASCA analysis. The retention time and m/z in EICs indicate the position of the respective peaks (red arrow). EICs of all the 30 peaks shown in bar plots is available in windows metafile format in the supporting information.

ACS Paragon Plus Environment

Analytical Chemistry

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 31

REFRENCES (1) Suits, F.; Hoekman, B.; Rosenling, T.; Bischoff, R.; Horvatovich, P. Anal Chem 2011, 83, 77867794. (2) Smilde, A. K.; Jansen, J. J.; Hoefsloot, H. C.; Lamers, R. J.; van der Greef, J.; Timmerman, M. E. Bioinformatics 2005, 21, 3043-3048. (3) Farrah, T.; Deutsch, E. W.; Omenn, G. S.; Campbell, D. S.; Sun, Z.; Bletz, J. A.; Mallick, P.; Katz, J. E.; Malmstrom, J.; Ossola, R.; Watts, J. D.; Lin, B.; Zhang, H.; Moritz, R. L.; Aebersold, R. Molecular & cellular proteomics : MCP 2011, 10, M110 006353. (4) Wrotnowski, C. Genetic Engineering News 1998, 18, 14. (5) Adkins, J. N.; Varnum, S. M.; Auberry, K. J.; Moore, R. J.; Angell, N. H.; Smith, R. D.; Springer, D. L.; Pounds, J. G. Molecular & cellular proteomics : MCP 2002, 1, 947-955. (6) Clough, T.; Braun, S.; Fokin, V.; Ott, I.; Ragg, S.; Schadow, G.; Vitek, O. Methods Mol Biol 2011, 728, 293-319. (7) Riter, L. S.; Vitek, O.; Gooding, K. M.; Hodge, B. D.; Julian, R. K., Jr. Journal of mass spectrometry : JMS 2005, 40, 565-579. (8) Tiwari, G.; Tiwari, R. Pharmaceutical methods 2010, 1, 25-38. (9) Scholtens, D.; Miron, A.; Merchant, F. M.; Miller, A.; Miron, P. L.; Iglehart, J. D.; Gentleman, R. Journal of Multivariate Analysis 2004, 90, 19-43. (10) Johnson, H. E.; Lloyd, A. J.; Mur, L. A. J.; Smith, A. R.; Causton, D. R. Metabolomics 2007, 3, 517-530. (11) Szalowska, E.; van Hijum, S. A.; Roelofsen, H.; Hoek, A.; Vonk, R. J.; te Meerman, G. J. Biotechnology progress 2007, 23, 217-224. (12) Andrews, G. L.; Dean, R. A.; Hawkridge, A. M.; Muddiman, D. C. Journal of the American Society for Mass Spectrometry 2011, 22, 773-783. (13) Harrington, P. D.; Vieira, N. E.; Chen, P.; Espinoza, J.; Nien, J. K.; Romero, R.; Yergey, A. L. Chemometrics and Intelligent Laboratory Systems 2006, 82, 283-293. (14) Harrington, P. D.; Vieira, N. E.; Espinoza, J.; Nien, J. K.; Romero, R.; Yergey, A. L. Analytica Chimica Acta 2005, 544, 118-127. (15) Zwanenburg, G.; Hoefsloot, H. C. J.; Westerhuis, J. A.; Jansen, J. J.; Smilde, A. K. Journal of Chemometrics 2011, 25, 561-567. (16) Box, G. E. P.; Hunter, J. S.; Hunter, G. W. Statistics for Experimenters: Design, Innovation, and Discovery,, 2nd ed., 2005. (17) Montgomery, D. C. Design and Analysis of Experiments; John Wiley \& Sons, 2012, p 752. (18) Cookson, P.; Sutherland, J.; Cardigan, R. Vox sanguinis 2004, 87, 264-271. (19) Govorukhina, N. I.; Reijmers, T. H.; Nyangoma, S. O.; van der Zee, A. G.; Jansen, R. C.; Bischoff, R. Journal of chromatography. A 2006, 1120, 142-150. (20) Cui, X.; Churchill, G. A. Genome biology 2003, 4, 210. (21) Li, W. J Bioinform Comput Biol 2012, 10, 1231003. (22) Govorukhina, N. I.; de Vries, M.; Reijmers, T. H.; Horvatovich, P.; van der Zee, A. G.; Bischoff, R. Journal of chromatography. B, Analytical technologies in the biomedical and life sciences 2009, 877, 1281-1291. (23) D'Alessandro, A.; Righetti, P. G.; Zolla, L. Journal of proteome research 2010, 9, 144-163. (24) Kakhniashvili, D. G.; Bulla, L. A., Jr.; Goodman, S. R. Molecular & cellular proteomics : MCP 2004, 3, 501-509. (25) Dunn, W. B.; Broadhurst, D.; Begley, P.; Zelena, E.; Francis-McIntyre, S.; Anderson, N.; Brown, M.; Knowles, J. D.; Halsall, A.; Haselden, J. N.; Nicholls, A. W.; Wilson, I. D.; Kell, D. B.; Goodacre, R. Nat Protoc 2011, 6, 1060-1083.

ACS Paragon Plus Environment

Page 31 of 31

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Analytical Chemistry

5. SYNOPSYS

ACS Paragon Plus Environment