

Article

Identification of Poly Ethylene Glycol (PEG) and PEG-based Detergents Using Peptide Search Engines
Shiva Ahmadi and Dominic Winter
Anal. Chem., Just Accepted Manuscript • DOI: 10.1021/acs.analchem.8b00365 • Publication Date (Web): 04 May 2018
Downloaded from http://pubs.acs.org on May 5, 2018




Analytical Chemistry

Identification of Poly Ethylene Glycol (PEG) and PEG-based Detergents Using Peptide Search Engines

Shiva Ahmadi† and Dominic Winter*,†

Institute for Biochemistry and Molecular Biology, University of Bonn, 53115 Bonn, Germany

*E-mail: [email protected]

___________________________________________________________________________
Abstract

Polyethylene glycol (PEG) is one of the most common polymer contaminants in MS samples. At present, the detection of PEG and other polymers relies largely on manual inspection of raw data, which is laborious and frequently difficult due to sample complexity and the retention characteristics of polymer species in reversed phase chromatography. We developed a new strategy for the automated identification of PEG molecules from MSMS data using protein identification algorithms in combination with a database containing “PEG-proteins”. Through definition of variable modifications, we extend the approach to the identification of commonly used PEG-based detergents. We exemplify the identification of different types of polymers by static nanoESI-MSMS analysis of pure detergent solutions and data analysis using Mascot. Analysis of LC-MSMS runs of a PEG contaminated sample by Mascot identified 806 PEG-spectra originating from four PEG species using a defined set of modifications covering PEG and common PEG-based detergents. Further characterization of the sample for unidentified PEG species using error tolerant and mass tolerant searches resulted in identification of 3409 and 3187 PEG related MSMS spectra, respectively. We further demonstrate the applicability of the strategy for Protein Pilot and MaxQuant.


___________________________________________________________________________
Introduction

In recent years, the analysis of data generated in mass spectrometry based proteomics has become highly automated. A plethora of algorithms is available for the identification and quantification of proteins and their post-translational modifications from large data sets.1 In the commonly used bottom up approach, identification of proteolytic peptides from fragment ion spectra is achieved mainly through correlation with protein sequence databases. Based on the sequence information contained in these databases, in silico digests with the protease used for the experiment are performed and theoretical spectra are compared to the experimental data. Furthermore, peptides are also routinely identified by de novo sequencing or matching against spectral libraries.2,3 For database searching, a wide variety of protein database search algorithms is available; Mascot (www.matrixscience.com) and MaxQuant4 are among the most widely used ones. Depending on the experiment, identification rates vary strongly and on average 75% of spectra remain unassigned.5 These spectra may originate from several sources including low signal-to-noise events,5 post-translationally modified peptides,6 or peptides which do not match the expected cleavage pattern, e.g. due to intracellular proteases or in source fragmentation.7 Furthermore, contaminants such as polymers8 are also a source of MSMS spectra which cannot be matched to any peptide sequences.
Probably the most common polymer contaminants in MS samples are polyethylene glycol (PEG) related species.8 PEG itself is easily introduced into samples as it is used in a multitude of items of daily life like plastics, pharmaceuticals, and personal care products.9 PEG-based detergents are frequently used in molecular biology experiments for cell lysis and protein solubilization (for example NP-40,10 Triton X-100,11 or Triton X-11412) or in western blotting (Tween20 and Tween8013). Even for the preparation of samples intended for analysis by mass spectrometry, PEG has been used in low concentrations for preventing protein or peptide loss during preparation14 or storage15 of low abundant samples. Polymers share certain properties with peptides and can therefore be problematic when analyzing samples of biological origin. They can, for example, interfere with sample preparation or LC-MSMS analysis due to competitive binding to C18 reversed phase chromatographic material.16 In ESI and MALDI ionization, similar to peptides, polymers can result in multiply charged ions which can trigger fragmentation in the data dependent acquisition (DDA) mode. Consequently, if present in high abundance, polymers may suppress peptide ionization or interfere with peptide detection, if the mass spectrometer selects the polymer peak instead of the peptide signal for fragmentation. Once introduced into a sample, it is problematic to remove polymers again. While polymers can be separated from proteins by e.g. spin filters,17 protein precipitation,18 affinity purification,19 or SDS-PAGE,20 it is difficult to separate them from peptides. Preventive measures like e.g. desalting of samples by C18 reversed phase cartridges, which is done routinely for contaminating small molecules (e.g. buffer components and salts18), are not effective since polymers bind to these resins. 
It is therefore desirable to avoid polymer contaminations as efficiently as possible and to be able to identify the source of such contaminations. For the latter, knowledge of the type of contaminating polymer(s) is of high value in order to locate the source of contamination in the workflow. The characteristic peak patterns of polymers are, however, often not readily identifiable in MS spectra as higher abundant peptide ions frequently dominate the spectrum/chromatogram. This is especially true for complex proteomics samples. In LC-MSMS files, too, it is frequently not straightforward to identify peak patterns of regular mass differences, as polymer molecules of differing chain length do not co-elute. Therefore, the manual identification of such contaminants is difficult and time-consuming, as it involves several steps which have to be performed manually for each individual raw file. Given the current rate of data acquisition in proteomics laboratories, it is therefore practically impossible to perform such quality control for each and every sample measured. Hence, many polymer contaminations are possibly not identified, which may
influence numbers of identified peptides in these analyses. The lack of suitable approaches for the automated identification of polymers is not only true for the proteomics field, where they are only considered as contamination, but also for the field of polymer research, where polymers are characterized routinely by mass spectrometry.21 For the analysis of complex polymer mixtures, the determination of highly accurate intact molecule masses by FTICR, often in combination with Kendrick mass defect analyses, is common practice.22,23 The few available algorithms for the analysis of polymers are tailored for single spectra and to our knowledge currently no search engines are available for the analysis of large datasets.24

In this study, we present a strategy to identify PEG and PEG-based detergents using peptide search engines. We employ Mascot, Protein Pilot (www.sciex.com), and MaxQuant and adapt them for the automated identification of PEG by definition of an artificial amino acid in combination with a custom-made database. We show for pure solutions of PEG and six PEG-based detergents that Mascot is able to identify them with ion scores of up to 200. We apply our strategy to the analysis of a contaminated sample and successfully assign different PEG related molecules to the spectra attributable to contamination. We further show that it is possible to identify unknown PEG-based polymers using error tolerant and mass tolerant searches.

Experimental Section

Construction and Implementation of a PEG Database and Definition of Variable Modifications in Mascot

The letter J was defined as C2H4O, representing one ethylene glycol monomer, using the Mascot configuration editor. A PEG database was designed consisting of 100 “Jn protein” entries (n = 1–100), where each entry corresponds to one PEG molecule. J-protein headers were constructed according to the nomenclature of the UniProtKB/TrEMBL database (www.ebi.ac.uk/uniprot).
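The database construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header layout below is a hypothetical TrEMBL-like pattern chosen for the example, and the actual headers are those in File S1.

```python
def build_peg_fasta(max_len=100, residue="J"):
    """Return FASTA text with one homopolymer entry per chain length 1..max_len."""
    records = []
    for n in range(1, max_len + 1):
        # Hypothetical TrEMBL-like header; accession and entry name are invented
        # for illustration and do not reproduce the headers of File S1.
        header = f">tr|PEG{n:03d}|PEG{n:03d}_SYNTH J{n} protein, PEG with {n} monomer units"
        records.append(f"{header}\n{residue * n}")
    return "\n".join(records) + "\n"


fasta = build_peg_fasta()
```

Each entry is a run of n copies of the letter J, so the search engine treats every chain length as a separate "protein" and no enzyme cleavage within the chain is needed.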
The resulting database (File S1) was either merged directly with a protein database fasta file or, as a separate database, selected together with a protein sequence database in the Mascot MSMS search interface. Variable modifications at “J” were introduced in Mascot using the configuration editor for the following detergents: NP-40 (Tergitol®), Tween®20, Tween®80, Triton™ X-100, Triton™ X-114, Brij®35, and Brij®58 (for details see Table S1).

Static nanoESI-MSMS analysis of PEG and PEG-based detergents

PEG, Brij35, Triton X-100, Triton X-114, NP-40, Tween20, and Tween80 were diluted in 75% acetonitrile (ACN)/1% formic acid (FA) to a final concentration of 5 pmol/µl. Static nanoESI analyses were performed using a TriVersa NanoMate robotic nanoflow device (Advion Inc., Ithaca, USA) with chips of 4.1 µm nozzle diameter in combination with an Orbitrap Velos mass spectrometer (Thermo Fisher Scientific, Bremen, Germany). Measurements were performed in the positive ion mode from m/z 400 to 1200 at a target mass resolution of 100,000 (FWHM). From each survey scan, the four most abundant multiply charged precursor ions were selected, fragmented by CID at 35% collision energy, and excluded from further fragmentation.

Data Analysis

Xcalibur *.raw files were converted to *.mgf files using Proteome Discoverer 1.4 (Thermo Scientific, Bremen, Germany), retaining the 6 most abundant peaks in a window of 100 Da, and searched using Mascot 2.5.1 (www.matrixscience.com) with a precursor ion tolerance of 10 ppm and a fragment ion tolerance of 0.6 Da. Variable modifications were defined according to the detergent analyzed (see Table S1). For static nanospray data, the PEG database was used. For biological samples, either SwissProt (555426 entries, release date:
8/2016) or a combination of SwissProt and the PEG database was used (no taxonomy specified). Trypsin was specified as enzyme, one missed cleavage site was allowed, and Brij35, Brij58, Triton X-100, Triton X-114, NP-40, Tween20, and Tween80 were selected as variable modifications at J. Error tolerant25 and mass tolerant6 searches were performed for the contaminated secretome samples using only the PEG database. For mass tolerant searches a mass error window of ± 250 Da for the precursor ion and 0.6 Da for fragment ions was defined by modification of the *.mgf file header.6

For Protein Pilot (version 4.1, ABSciex, Foster City, USA), the data dictionary and parameter translation files were modified by defining the chemical formula of C2H4O as J, representing one PEG monomer. Peptide identification and protein grouping were performed with the Paragon algorithm (version 4.0.0.0.459) using the standard workflow, searching either against the PEG database or simultaneously against SwissProt Rat (7928 entries, release date: 9/2015) and the PEG database.

In MaxQuant, “Alanine to PEG substitution” (A to J, “-CHN”) was defined in Andromeda as a novel modification and used as fixed modification for database searching. Raw data were processed using MaxQuant 1.5.2.8.4 with the following setup: an “A database” (File S2), consisting of 100 A proteins representing the different PEG polymers, was used as database. Searches were performed with an initial precursor mass tolerance of 20 ppm and a fragment mass tolerance of 0.5 Da. Enzymatic cleavage was defined as trypsin/P with a maximum of 2 missed cleavages. The false discovery rate (FDR) was calculated from searches against a reversed database, and set to 0.01. Search results from all algorithms were further analyzed using MS Excel as well as Perl- and R-scripts.
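The mass tolerant setting described in the Mascot paragraph above (± 250 Da precursor window, 0.6 Da fragment tolerance, set in the *.mgf header) might be applied along the following lines. This is a sketch, not the authors' procedure: it assumes the tolerances are passed as the embedded search parameters TOL, TOLU, ITOL and ITOLU at the top of the peak list, and the helper name is hypothetical. Which header lines were actually edited is not stated in the text.

```python
def add_mass_tolerant_header(mgf_text, precursor_tol_da=250, fragment_tol_da=0.6):
    """Prepend tolerance parameters to an MGF peak list (assumed header layout)."""
    header = (
        f"TOL={precursor_tol_da}\n"   # precursor tolerance value
        "TOLU=Da\n"                   # precursor tolerance unit
        f"ITOL={fragment_tol_da}\n"   # fragment tolerance value
        "ITOLU=Da\n"                  # fragment tolerance unit
    )
    return header + mgf_text


example = add_mass_tolerant_header("BEGIN IONS\nPEPMASS=500.0\n100.0 5\nEND IONS\n")
```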
Results and Discussion

When analyzing data from tryptic in-gel digests we observed MSMS spectra of excellent quality, which were not matched to any peptide sequence by Mascot with a reasonable score, and exhibited highly abundant fragment ion series sharing a similar peak pattern. Manual analysis revealed that these spectra originated from the fragmentation of multiply charged protonated poly ethylene glycol (PEG) molecules and that the majority of fragment ions appeared as patterns with mass differences of 44 Da (Figure 1a). Manual inspection of other PEG MSMS spectra revealed patterns somewhat similar to those resulting from peptides. We therefore wondered if it would be possible to perform automated identification of PEG MSMS spectra with peptide search engines, in particular Mascot.

Adaptation of Mascot parameters for the identification of PEG and PEG-based detergents

To allow for identification of PEG by Mascot, we created a “PEG-amino acid” by defining the letter J, which is not used routinely in protein databases, as C2H4O, representing one monomer unit of PEG. To generate a complete “PEG-protein” from such monomers, one terminus of the molecule has to receive an additional H and the other an OH group. This corresponds to the assembly of peptides from single amino acid masses in standard proteomics database searches: Mascot calculates peptide molecular weights by addition of the masses of the single amino acids, followed by a hydroxyl group at the “C-terminus” and a hydrogen atom at the “N-terminus”. Following this approach, for the definition of b and y ions as well as the position of modifications, we refer in the following to the terminus of the PEG molecule containing OH as the “C-terminus” and to the one containing H as the “N-terminus” relative to the monomer building blocks (Figure 1b).
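The mass bookkeeping described above can be made concrete with a short Python sketch. It assumes standard monoisotopic element masses and the usual peptide-style conventions (b ions carry the residues plus the charging protons, y ions additionally carry H and OH). The function names are illustrative and not taken from Mascot.

```python
# Monoisotopic masses in Da (standard values)
MASS = {"C": 12.0, "H": 1.00782503, "O": 15.99491462}
PROTON = 1.00727647

MONOMER = 2 * MASS["C"] + 4 * MASS["H"] + MASS["O"]  # J = C2H4O
WATER = 2 * MASS["H"] + MASS["O"]                    # H at one terminus, OH at the other


def peg_mass(n):
    """Neutral monoisotopic mass of a PEG chain with n monomers (a 'Jn protein')."""
    return n * MONOMER + WATER


def precursor_mz(n, charge):
    """m/z of the protonated n-mer at the given charge state."""
    return (peg_mass(n) + charge * PROTON) / charge


def b_ion(k, charge=1):
    """Peptide-style b ion: k residues from the 'N-terminus', no terminal water."""
    return (k * MONOMER + charge * PROTON) / charge


def y_ion(k, charge=1):
    """Peptide-style y ion: k residues from the 'C-terminus' plus H and OH."""
    return (k * MONOMER + WATER + charge * PROTON) / charge
```

Under these conventions consecutive singly charged ions of one series differ by the monomer mass of about 44.026 Da, which is the spacing observed in the spectra.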
This nomenclature is rather artificial: PEG molecules are symmetrical if they do not possess any head groups (as the PEG-based detergents do, Table S1), and both ends can give rise to b and y ions as defined in Mascot, dependent on the position of fragmentation relative to the ether bond connecting the PEG monomers. However, defining these termini is necessary for Mascot to calculate the mass of the individual “PEG proteins” and their corresponding fragment ions. Compared to the nomenclature suggested for the annotation of polymer fragment ions,26 our annotation in analogy to peptide fragmentation
is similar. We do, however, consider fewer fragment ions, which is due to the definition of each PEG monomer as an individual amino acid. We then generated a “PEG protein database” (File S1) containing 100 “J proteins” with chain lengths from 1 to 100 (each J representing one ethylene glycol monomer). It was necessary to define each J protein as a single entry in this database. The alternative, defining one J protein in combination with a “J-protease”, would have required allowing for tens of missed cleavage sites in order to account for all possible PEG molecules, as cleavage after every residue would have to be considered, and was therefore not pursued. Initial tests with this setup confirmed that Mascot is able to identify PEG from MSMS spectra (Figure 1c). While PEG itself is often a contamination which is introduced unintentionally, several PEG-based detergents (e.g. Triton X-100 or NP40/Tergitol) are used frequently in sample preparation for biological samples. The common feature of these detergents is a PEG backbone, while they differ in their head groups. The exception is Tween, which consists of a sorbitan central group in combination with several PEG chains (w, x, y, z). By introducing variable modifications in Mascot, we extended our approach to the identification of the most common PEG-based detergents (Table S1). For this purpose, we used the sum formula of the respective detergent's head group and subtracted OH or H, depending on whether the free terminus of the molecule contained a hydrogen atom or a hydroxyl group, respectively. The modifications for PEG-based detergents have been added to the Unimod database (www.unimod.org) and are therefore readily available for database searching if the most recent version is present on the Mascot server.
Static nanoESI-MSMS analysis of PEG and PEG-based detergents

Next, we tested our approach using data generated from solutions containing pure PEG and different PEG-based detergents by static nanoESI-MSMS. In these measurements, PEG and all PEG-based detergents resulted in MS spectra characterized by series of multiply charged ions triggering MSMS fragmentation in the DDA mode. Highest charge states were observed for PEG (4 and 5 times charged ions showed highest intensities), followed by NP40, which resulted mainly in triply and doubly charged ions. For Brij, Tween, and Triton, singly and doubly charged ions were most prominent, while the spectra of the latter were clearly dominated by singly charged species (Figures S1a-S7a). Fragmentation of all substances except Tween20/80 resulted in complex MSMS spectra (Figures S1b-S7b). For PEG and Brij35, Mascot annotated singly and doubly charged b and y ions (Figures 2/S1b, S2b). For Triton X-100, mainly singly charged b and y ions were annotated (Figure S3b), while Mascot assigned almost exclusively singly charged y ions in Triton X-114 and NP40 MSMS spectra (Figures S4b, S5b). In contrast, the Tween20 spectrum was dominated by b ions as well as an abundant ion from the head group, with virtually no y ions at all (Figure S6b). For Tween80 only several b ions were annotated, and MSMS spectra in general contained fewer signals compared to the other detergents (Figure S7b). In comparison to the LC-MSMS derived spectra, which we analyzed as initial proof of concept for our strategy, the nanoESI derived MSMS spectra were of higher complexity and noise, resulting in lower Mascot scores. This was most likely due to co-fragmentation of several precursor ions in the static nanoESI analyses. It was, however, possible to unambiguously identify PEG as well as all PEG-based detergents in the Mascot searches of the nanoESI analyses (Table S2).
Automated identification of PEG contamination in LC-MSMS files

After we confirmed that Mascot is in principle able to identify PEG and PEG-based detergents from MSMS spectra, we analyzed raw files of presumably PEG-contaminated and PEG-free peptide samples using Mascot. To allow for identification of polymers as well as peptides, we applied a combination of the “PEG database” and SwissProt. As contaminated
sample, we used data from in-gel digested fractions of a size exclusion chromatography fractionation experiment from B104 cell secretome samples (for details see Supporting Information). Previous manual analysis suggested that these samples may be contaminated with PEG-related polymers due to the occurrence of peak pairs with a mass difference of 44 Da. As control sample, we used tryptic in gel digests of an OLN-93 whole cell lysate (for details see Supporting Information) which, based on manual evaluation, was presumably PEG free. Mascot searches were performed and out of 23681 spectra total search input for the contaminated sample, 806 PEG derived spectra were identified (Figure 3a, Table S3). From these identifications, 433 were annotated as Tween80 with Mascot scores of up to 153, while one spectrum each was annotated to be derived from Triton X-114 and Triton X-100, respectively. The 371 annotated unmodified PEG molecules received Mascot scores of up to 201. In the contamination-free samples, out of 62804 spectra total search input, 13423 peptides and 84 PEG-spectra (Mascot scores of up to 133.8, Figure 3a) were identified. Commonly, the identification of peptides by search algorithms includes a search against a decoy version of the database. This allows for identification of false positive peptide assignments by determination of a score cutoff to reach a certain False Discovery Rate (FDR) of typically 0.01.27 We tried to apply this strategy for the combined SwissProt and PEG database. However, since the forward and the reversed form of the PEG-database are identical, it was not possible to reach the desired FDR of 1% (Figure S8). We therefore performed an independent database search using SwissProt only, and determined the value for the significance threshold to reach 1% FDR (0.006101 for the contaminated and 0.004601 for the non-contaminated sample). 
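Why the decoy approach breaks down for the PEG entries can be shown with a minimal Python sketch. It assumes a decoy database built by plain sequence reversal; the example peptide is arbitrary.

```python
def reversed_decoy(sequence):
    """Decoy entry generated by reversing the target sequence."""
    return sequence[::-1]


# The 100 "J proteins" are homopolymers, so reversal returns the same sequence.
peg_entries = {f"J{n}": "J" * n for n in range(1, 101)}
identical_decoys = sum(1 for seq in peg_entries.values() if reversed_decoy(seq) == seq)

# An ordinary peptide sequence, by contrast, yields a distinct decoy.
peptide = "PEPTIDEK"
peptide_decoy = reversed_decoy(peptide)
```

Because every PEG decoy is indistinguishable from its target, a PEG spectrum scores identically against both, and decoy hits cannot serve as an estimate of false positives for these entries.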
Subsequently we applied this threshold as cutoff for data export in the search with both databases (Table S3). We followed this strategy, instead of performing a single search against the PEG database, to allow the algorithm to assign a given MSMS spectrum to both peptides and polymers. The simultaneous searching against SwissProt and the PEG database did not alter results for peptide identification. Applying the cutoff determined from the SwissProt-only search to the search with the combined databases resulted in the same number of identified peptides. This indicates that most likely no peptide derived MSMS spectra were wrongly annotated as polymers.

Identification efficiency of PEG derived MSMS spectra by Mascot

We next addressed the performance of PEG identification by Mascot. To determine the total number of MSMS spectra originating from PEG in our datasets, we reasoned that a PEG-derived MSMS spectrum should contain a high number of peak pairs with a mass difference of 44 Da. In order to determine the number of these peak pairs in a given MSMS spectrum, we implemented a Perl script for counting peak pairs with a mass difference of 44 Da (in a mass tolerance window of 0.6 Da). Using the *.mgf files, we compared the distribution of identified peak pairs in the contaminated sample to the control sample (Figures 3b/c). Based on these data we determined that at least 12 peak pairs with a mass difference of 44 Da should be present in one MSMS spectrum for it to be likely to be derived from PEG or a PEG-based detergent. Based on this cutoff, the contaminated sample contained 9369 MSMS spectra which were potentially due to fragmentation of a PEG species (out of 23681 MSMS spectra in total, Figure 3d) and the control sample 668 MSMS spectra (out of 62804 spectra in total, Figure 3d).
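The peak pair counting can be sketched as follows. This is a Python stand-in for the Perl script, which is not shown in the text. It assumes that the tolerance is applied as ± 0.6 around the monomer mass, that differences are taken between the m/z values as listed in the *.mgf file, and that fragment lines are the ones starting with a digit. All function names are illustrative.

```python
from bisect import bisect_left, bisect_right

PEG_DELTA = 44.0262  # monoisotopic mass of C2H4O in Da


def count_peak_pairs(mzs, delta=PEG_DELTA, tol=0.6):
    """Count pairs of peaks whose m/z difference equals delta within +/- tol."""
    peaks = sorted(mzs)
    pairs = 0
    for mz in peaks:
        lo = bisect_left(peaks, mz + delta - tol)
        hi = bisect_right(peaks, mz + delta + tol)
        pairs += hi - lo  # partners above this peak inside the window
    return pairs


def read_mgf(lines):
    """Yield the fragment m/z values of each BEGIN IONS / END IONS block."""
    mzs = None
    for line in lines:
        line = line.strip()
        if line == "BEGIN IONS":
            mzs = []
        elif line == "END IONS":
            if mzs is not None:
                yield mzs
            mzs = None
        elif mzs is not None and line and line[0].isdigit():
            mzs.append(float(line.split()[0]))


def likely_peg(mzs, min_pairs=12):
    """Apply the cutoff of at least 12 peak pairs described in the text."""
    return count_peak_pairs(mzs) >= min_pairs
```

Counting `sum(likely_peg(s) for s in read_mgf(open(path)))` over a file would then give the number of potentially PEG derived spectra for that run, with `path` being the *.mgf file of interest.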
Next, we compared the number of MSMS spectra which were annotated by Mascot to originate from PEG or PEG-based detergents with the number of potentially PEG derived MSMS spectra, resulting in an identification efficiency of 8.6% for the contaminated sample and 12.5% for the control sample (Figure 3e). As more than 90% of the potentially PEG derived spectra in our contaminated sample were not identified using the
current search strategy, we further investigated the data using approaches allowing for the identification of unexpected modifications.

Error and mass tolerant searches for the identification of modified PEG species

Currently, two strategies are applied to search for unexpected modifications with Mascot: 1. error tolerant searches,25 in which a large set of predefined modifications (documented in www.unimod.org) is considered for database searching and 2. mass tolerant searches,6 where spectra are searched with an extremely large precursor ion tolerance window. Accordingly, the result files of error tolerant searches include a set of possible modifications, while mass tolerant search reports only contain the mass shifts of the possible modifications. We applied both strategies for the investigation of our contaminated sample, applying the standard settings for error tolerant searches and a precursor ion mass window of 250 Da for mass tolerant searches. As we were dealing with the identification of polymers, and not peptides, we did not further investigate the type of modification identified in error tolerant searches, but only their mass values. This is due to the nature of the modification database used for error tolerant searches (Unimod), which contains modifications specific for peptides. In order to tailor error tolerant searches for polymers, a similar database could be generated including known polymer modifications. Due to the similar sequence of the J-protein entries, most MSMS spectra received several annotations by Mascot in combination with differing chain lengths for both error and mass tolerant searches.
For removal of such redundant identification, we filtered the datasets using R-scripts using the following set of rules: For error tolerant searches, i) if a modified and an unmodified J protein was annotated for the same spectrum we chose the unmodified; ii) we excluded the annotation “a-ion” (∆m of - 28 Da) in favor of annotations with less J monomers and a positive mass error; and iii) if two J proteins of differing length were annotated we chose the longer J protein annotated with the lower molecular weight modification. For mass tolerant searches, we discarded all redundant identifications in favor of the one containing the lowest possible mass error with a positive value. Furthermore, we only considered identifications with Mascot scores of at least 30 for both approaches. In total, we identified 3409 and 3187 spectra to be PEG-derived for error and mass tolerant searches, respectively (Figure 3e, Table S4). The 10 most frequently annotated mass errors are displayed in Figure 4a. Error tolerant searches resulted in the assignment of 212 unmodified PEG species and 3197 modified molecules for which, out of 38 different modifications in total, 24 modifications accounted for 99% of annotated spectra. In this group, 8 modifications could be explained by addition of J monomers or by probably wrongly assigned 13C isotope peaks (775 identifications in total) adding to the number of identified unmodified molecules. Furthermore, we identified several additions of metal ions (75 Li adducts and 34 Fe adducts, Table S4). The addition of 16, 17 and 18 Da at the “C-terminus” of the PEG molecule was assigned as Methyl:2H(3) with mass errors of -1/0/+1 Da and the most frequent modification in the dataset (1241 times in total). For the mass tolerant searches, we investigated the frequency of identification for each mass considering only those occurring more than 10 times. 
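The filtering rule for mass tolerant searches can be illustrated in Python (the authors used R-scripts, which are not shown). This is a sketch under assumptions: the record layout with the keys `query`, `chain_length`, `mass_error` and `score` is hypothetical, and a mass error of exactly zero is treated as acceptable.

```python
def filter_mass_tolerant(hits, min_score=30.0):
    """Keep, per spectrum, the hit with the smallest non-negative mass error.

    hits: iterable of dicts with the hypothetical keys
          'query', 'chain_length', 'mass_error' (Da) and 'score'.
    """
    best = {}
    for hit in hits:
        # Discard low scoring hits and hits with a negative mass error.
        if hit["score"] < min_score or hit["mass_error"] < 0:
            continue
        current = best.get(hit["query"])
        if current is None or hit["mass_error"] < current["mass_error"]:
            best[hit["query"]] = hit
    return best


example_hits = [
    {"query": 1, "chain_length": 10, "mass_error": 60.02, "score": 45.0},
    {"query": 1, "chain_length": 11, "mass_error": 15.99, "score": 44.0},
    {"query": 1, "chain_length": 12, "mass_error": -28.03, "score": 46.0},
    {"query": 2, "chain_length": 8, "mass_error": 17.0, "score": 20.0},
]
kept = filter_mass_tolerant(example_hits)
```

In the made-up example, the three annotations of query 1 describe the same precursor with different chain lengths; the rule keeps the 11-mer with the remaining error of about 16 Da, and query 2 is dropped for its low score.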
For mass errors observed more than 100 times, we calculated the median accurate mass across all identified PEG species to determine the most accurate value for the observed error (Table S4, Figure 4a). Among these highly abundant mass errors, 356 unmodified PEG molecules were annotated (a mass error of 0 or 44 Da), while 1239 PEG molecules were identified with an error of 16, 17, 18 or 32, 33, 34 Da; the latter three can be explained by combinations of multiple assignments of the first three values (also by matching the mass defects, see Table S4). Around 50% of the spectra (2250) were assigned by both approaches, while the error tolerant and the mass tolerant searches identified 1159 and 937 unique queries, respectively (Figure 4b). This level of redundancy can also be seen for the individual modifications (Figure 4a).

Due to the predefined masses in the error tolerant searches, similar modifications are not easily identifiable as, for example, no modification with 26 Da is defined in Unimod. In mass tolerant searches, however, PEG annotations with this delta mass were highly abundant. Manual analysis revealed that the correlating modification in Unimod is ICPL with a delta mass of 105 Da (26 Da + 2 x PEG + 1 Da mass error). We therefore calculated for the highly abundant modifications in error tolerant searches possible alternative mass values by addition or subtraction of the mass of PEG leading to the true modification masses which match those observed in the mass tolerant searches (Table S4, Figure 4a). This approach, however, was not possible for all masses due to the lack of suitable entries in Unimod. For the majority of mass errors observed, the possible modification masses could be explained by additions of methyl groups, OH groups, replacement of hydrogen atoms by OH groups, formation of double bonds, and fragmentation of the head groups as well as combinations of these modifications if the mass defects are not taken into account. However, when we investigated the mass defects of our suggested explanations for the observed errors, it became unlikely that they are correct. The PEG species identified with no modification as well as the ones annotated with an additional PEG monomer exhibit a very low median mass error compared to the theoretical values (on average 0.0006 Da and 0.0007 Da, respectively). At the same time, however, the explanations for the observed mass errors using the abovementioned possibilities result in mass errors of 0.0165 to 0.1231 Da. This indicates either the presence of larger modification structures at smaller PEG molecules (as only the summed mass deficiencies of multiple hetero-atoms can account for such large mass errors), or the presence of metal ion adducts (e.g. Na, Ca, K, Li, etc.), as already identified in the error tolerant searches. 
Within our possibilities, we were not able to explain these modifications satisfactorily. The correct annotation of these errors is not straightforward and would have to be investigated further by intensive manual analysis or complementary methods if a detailed characterization is required.

Identification of PEG using Protein Pilot™ and MaxQuant

We finally tested whether our strategy can be extended to other protein identification search engines. We evaluated Protein Pilot (www.sciex.com) and MaxQuant.4 Protein Pilot searches were performed using the PEG database alone or in combination with Uniprot Rat. In the contaminated sample, Protein Pilot annotated 385 PEG spectra in PEG-only searches and 365 in searches against the combined databases, numbers similar to those obtained with Mascot for unmodified PEG molecules (confidence ≥ 99%, Figure S9, Table S5). The spectra were annotated to 56 and 57 J-proteins for the two search strategies, respectively. Assignment of spectra to the forward and the reversed version of the J-protein database resulted in the same score. As for the Mascot searches, this complicated FDR determination. In MaxQuant, we were not able to introduce PEG as a custom amino acid. Therefore, instead of introducing PEG as J, we defined the modification “Alanine to PEG substitution” (−CHN), which results in the same sum formula as one PEG monomer, and defined an A database following the same strategy as for the PEG database (File S2). We defined the substitution as a fixed modification, since using it as a variable modification was not functional (data not shown). The MaxQuant search for the contaminated sample resulted in the identification of 419 PEG-derived spectra with Andromeda scores of up to 257, exceeding the number of assigned spectra for both Mascot and Protein Pilot by ~10% (Figure S10, Table S6). Due to the static modification at alanine residues, no alanine-containing peptides were identified.
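The sum formula behind the “Alanine to PEG substitution” can be checked with a short Python sketch. It only verifies the elemental bookkeeping and does not reproduce the MaxQuant configuration: the helper `mass` and all variable names are hypothetical, and the compositions are the standard ones for an alanine residue (C3H5NO) and a PEG monomer (C2H4O).

```python
from collections import Counter

# Monoisotopic atomic masses in Da (standard values).
ATOM_MASS = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.99491462}

ALANINE_RESIDUE = Counter(C=3, H=5, N=1, O=1)  # alanine within a peptide chain
PEG_MONOMER = Counter(C=2, H=4, O=1)           # one ethylene oxide unit
CHN = Counter(C=1, H=1, N=1)                   # atoms removed by the substitution


def mass(composition):
    """Monoisotopic mass of an elemental composition given as a Counter."""
    return sum(ATOM_MASS[element] * count for element, count in composition.items())


# Counter subtraction drops elements whose count reaches zero (here: N),
# so the result can be compared directly with the PEG monomer composition.
substituted = ALANINE_RESIDUE - CHN
print(substituted == PEG_MONOMER)        # composition check
print(round(mass(ALANINE_RESIDUE), 4))   # alanine residue mass
print(round(mass(CHN), 4))               # mass removed by the modification
print(round(mass(substituted), 4))       # resulting monomer mass
```

Because the modified alanine residue has the same composition as one PEG monomer, a database consisting of alanine repeats behaves like the J-based PEG database once the modification is set as fixed.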
For routine use, therefore, independent searches for PEG and peptides would be required in MaxQuant. Similar to both other algorithms, MaxQuant assigned the PEG molecules with the same scores to the forward and reverse versions of the database, interfering with FDR analysis. For routinely identifying PEG with Protein Pilot or MaxQuant, further adjustments to the search strategy would be necessary, as it is not possible to manually set a score cutoff for the export of search results and perform an independent FDR analysis as in Mascot.

Conclusion

In this study, we show for the first time that it is possible to identify PEG and PEG-based detergents using peptide search engines. This allows mass spectrometry based proteomics groups to perform an automated quality control of their samples and to identify even low abundant polymer contaminations. By defining other monomer units as amino acids, this strategy can be extended to virtually any polymer as long as it can be ionized and fragmented in a mass spectrometer. Through the definition of variable modifications, it furthermore allows determination of the exact type of contamination rather than only the polymer building blocks of its backbone. This is of high value for identifying the source of the contamination, as detergents used in sample processing can be defined in the search strategy. Aside from this application, the approach should also be of value for polymer chemists, as it allows for the first time the automated characterization of large polymer MSMS data sets and makes a whole range of software developed for the field of proteomics accessible to MS-based polymer research.

Acknowledgements

We would like to thank Ramesh Sharma for his help in sample preparation and cell culture, Nahal Ahmadinezhad for data analysis and John Cottrell for help with Mascot and Unimod.

Supporting Information

The Supporting Information is available free of charge via the internet at http://pubs.acs.org. Raw data are available via ProteomeXchange with identifier PXD009376.

References

1. Shteynberg, D.; Nesvizhskii, A. I.; Moritz, R. L.; Deutsch, E. W. Mol. Cell. Proteomics 2013, 12, 2383–2393.
2. Mørtz, E.; O'Connor, P. B.; Roepstorff, P.; Kelleher, N. L.; Wood, T. D.; McLafferty, F. W.; Mann, M. Proc. Natl. Acad. Sci. U. S. A. 1996, 93, 8264–8267.
3. Cottrell, J. S. J. Proteomics 2011, 74, 1842–1851.
4. Cox, J.; Mann, M. Nat. Biotechnol. 2008, 26, 1367–1372.
5. Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; Del-Toro, N.; Rurik, M.; Walzer, M. W.; Kohlbacher, O.; Hermjakob, H., et al. Nat. Methods 2016, 13, 651–656.
6. Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. Nat. Biotechnol. 2015, 33, 743–749.
7. Pilch, B.; Mann, M. Genome Biol. 2006, 7, R40.
8. Keller, B. O.; Sui, J.; Young, A. B.; Whittal, R. M. Anal. Chim. Acta 2008, 627, 71–81.
9. Dingels, C.; Schoemer, M.; Frey, H. Chem. Unserer Zeit 2011, 45, 338–349.
10. Winter, D.; Steen, H. Proteomics 2011, 11, 4726–4730.
11. Kragh-Hansen, U.; Le Maire, M.; Møller, J. V. Biophys. J. 1998, 75, 2932–2946.
12. Everberg, H.; Leiding, T.; Schiöth, A.; Tjerneld, F.; Gustavsson, N. J. Chromatogr. A 2006, 1122, 35–46.
13. Kurien, B. T.; Scofield, R. H. Methods 2006, 38, 283–293.
14. Wiśniewski, J. R.; Ostasiewicz, P.; Mann, M. J. Proteome Res. 2011, 10, 3040–3049.
15. Stejskal, K.; Potěšil, D.; Zdráhal, Z. J. Proteome Res. 2013, 12, 3057–3062.
16. Ciborowski, P.; Silberring, J. Proteomic Profiling Anal. Chem. (2nd Ed.); Elsevier: Amsterdam, 2016.
17. Antharavally, B. S. Curr. Protoc. Protein Sci. 2012, 69, 6.12.1–6.12.7.
18. Feist, P.; Hummon, A. B. Int. J. Mol. Sci. 2015, 16, 3537–3563.
19. Fang, X.; Zhang, W. W. J. Proteomics 2008, 71, 284–303.
20. Yeung, Y. G.; Nieves, E.; Angeletti, R. H.; Stanley, E. R. Anal. Biochem. 2008, 382, 135–137.
21. Hanton, S. D. Chem. Rev. 2001, 101, 527–569.
22. Hughey, C. A.; Hendrickson, C. L.; Rodgers, R. P.; Marshall, A. G.; Qian, K. Anal. Chem. 2001, 73, 4676–4681.
23. Wei, J.; Bristow, A.; McBride, E.; Kilgour, D.; O’Connor, P. B. Anal. Chem. 2014, 86, 1567–1574.
24. Altuntaş, E.; Schubert, U. S. Anal. Chim. Acta 2014, 808, 56–69.
25. Creasy, D. M.; Cottrell, J. S. Proteomics 2002, 2, 1426–1434.
26. Wesdemiotis, C.; Solak, N.; Polce, M. J.; Dabney, D. E.; Chaicharoen, K.; Katzenmeyer, B. C. Mass Spectrom. Rev. 2011, 30, 523–529.
27. Jeong, K.; Kim, S.; Bandeira, N. BMC Bioinformatics 2012, 13(Suppl 16), S2.

Table and Figure Legends

Figure 1. a) PEG MSMS spectrum; repeating units of 44 Da (marked with a red diamond) correspond to one monomer unit of PEG. b) Adaptation of the b/y ion nomenclature for the annotation of MSMS spectra of PEG and PEG-based detergents by Mascot. c) Mascot annotation of an MSMS spectrum of the PEG 34-mer [M+2H]²⁺ ion at m/z 758.46 originating from the LC-MSMS measurement of a fraction of the contaminated sample. PEG: Poly Ethylene Glycol

Figure 2. MSMS spectrum of the [M+3H]³⁺ ion at m/z 532.68 annotated by Mascot as a 32-mer Brij35 molecule with singly and doubly charged b and y ions.

Figure 3. Determination of the identification efficiency of PEG-related spectra. a) Identification of PEG and PEG-based detergents in the contaminated and control sample, respectively. b and c) Number of peak pairs separated by 44 Da (± 0.6 Da) found in the contaminated and the control sample, respectively. Based on these data, the minimum number of 44 Da peak pairs to consider a spectrum to be PEG-based was determined to be 12. d) Applying the 12 peak pair cutoff, 9369 spectra of the contaminated sample (out of 23681 in total) and 668 spectra of the control sample (out of 62804 in total) are likely to be PEG-derived. e) Comparison of the number of annotated PEG spectra at a significance threshold of 1% FDR at the peptide level using Mascot searches with normal parameters, potentially PEG-based MSMS spectra based on the 12 peak pair cutoff (44 Da mass difference), and annotated PEG-related spectra with error tolerant and mass tolerant searches, respectively. PEG: Poly Ethylene Glycol

Figure 4. a) Comparison of the abundance of PEG modification masses identified in error and mass tolerant searches. Shown are the values for the 10 most abundant annotations for both search strategies. The mass values are derived from the median mass error across all annotated spectra. b) Overlap of identified spectra between error and mass tolerant searches. PEG: Poly Ethylene Glycol
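The peak pair criterion given in the legend of Figure 3 (44 Da ± 0.6 Da, at least 12 pairs) could be implemented along the following lines. This is a minimal sketch under stated assumptions, not the script used in this study: the function names are hypothetical, every pair of peaks whose m/z difference falls into the window is counted, and peak intensities and charge states are ignored.

```python
from bisect import bisect_left, bisect_right


def count_peg_peak_pairs(mz_values, spacing=44.0, tol=0.6):
    """Count peak pairs whose m/z difference lies within spacing ± tol."""
    mz = sorted(mz_values)
    pairs = 0
    for m in mz:
        # Index range of peaks located between m + spacing - tol and m + spacing + tol.
        low = bisect_left(mz, m + spacing - tol)
        high = bisect_right(mz, m + spacing + tol)
        pairs += high - low
    return pairs


def is_likely_peg(mz_values, min_pairs=12):
    """Apply the 12 peak pair cutoff to a single MSMS spectrum."""
    return count_peg_peak_pairs(mz_values) >= min_pairs


# A synthetic ladder of 14 peaks spaced by one PEG monomer yields 13 pairs.
ladder = [89.06 + 44.026 * i for i in range(14)]
print(count_peg_peak_pairs(ladder), is_likely_peg(ladder))
```

Counting via binary search on the sorted peak list avoids comparing all peaks with each other, which matters when tens of thousands of spectra are screened.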

Figure 1

Figure 2

Figure 3

Figure 4
